
Using Secure Fabric for network isolation in KVM environments on Exadata

Exadata storage software version 20.1 introduces a new feature called “Secure Fabric” for KVM-based multi-cluster deployments (Exadata X8M). It enables network isolation between multiple tenants (i.e. RAC clusters built on KVM VMs). This feature is the equivalent of InfiniBand Partitioning on OVM-based systems. In such multi-tenant scenarios, customers often want the VMs of one RAC cluster to be unable to see the traffic of the other clusters’ VMs, and this feature achieves exactly that. Similar to Pkeys on InfiniBand switches, it uses a double VLAN tagging system where the first tag identifies the network partition and the second tag denotes the membership level of the VM. The Exadata documentation has more details.

The minimum Exadata software version needed to enable this feature is 20.1. This release comes with RoCE switch firmware version 7.0(3)I7(8).

Starting June 2020, OEDA supports this configuration and the feature can be enabled in OEDA itself. To enable it, click the Advanced button under Cluster Networks and you will see the Enable Secure Fabric option.

Once this option is enabled, you will see VLANs enabled for the private network. During the deployment, OneCommand takes care of the configuration needed.

As per the documentation, at present there is no way to enable it on existing systems other than doing a re-deployment.

Exadata Virtualized DB node restore

There are two common scenarios when we may need this:

  • An existing DB node has crashed and is unrecoverable (due to some failure and the non-availability of backups; some of these steps would be needed even if backups were available).
  • We have an existing Exadata rack that is virtualized. A new DB node has been added and the existing clusters need to be extended to include the VMs on this new node.

I recently faced the first scenario, where a virtualized DB node crashed and wasn’t recoverable. A bare metal DB node restore is a relatively simple procedure: we just have to reimage the node, create the needed directories, users etc. and add it back to the RAC cluster. In the case of virtualization, the creation of VMs is an additional step, which makes it slightly more complex.

So the scenario is that we have an Exadata quarter rack where DB node 1 has issues and needs to be reimaged and reconfigured. There are multiple VMs (and hence multiple RAC clusters) created on the rack. As one of the DB nodes has gone down, each RAC cluster is running with one instance less. The failed node will need to be cleaned up from the RAC configuration before being added back. Here are the steps we need to follow to restore it:

  1. Reimage the node using an ISO and make it ready for creation of User Domains (aka VMs)
  2. Create the required VMs
  3. Create the required users, set up ssh with the other nodes
  4. Clear the failed node configuration from existing RAC clusters
  5. Add the newly created VMs back to the respective RAC clusters

Now let’s discuss these steps in detail.

  1. Reimage : The simplest way to reimage an Exadata node is to attach the ISO (we can download the ISO for the version we need from MOS note 888828.1) using ILOM, set the next boot device to CD-ROM, reboot/reset the node and let it boot from the CD-ROM. Most of the installation is automated and doesn’t ask any questions. Once the installation is done, ipconf starts in interactive mode and asks for all the information like name servers, NTP servers, IP addresses and hostnames for the various network interfaces etc. Once done, the node boots into the Linux partition. Since we need to virtualize the node, we have to switch it to OVS by running the script /opt/oracle.SupportTools/switch_to_ovm.sh, which reboots the node into the OVS partition. The next step is to run /opt/oracle.SupportTools/reclaimdisks.sh -free -reclaim to reclaim the space used by the bare metal partition. At this point we are done with the reimaging part. To use ILOM in a browser and access the console, we need a Java-enabled Windows/Linux system, and if there is a firewall between that system and the server, this link lists the ports that need to be allowed through the firewall.
  2. VMs creation : The next step is the creation of VMs, and we will use OneCommand for this. In this case we had the original XML file used for deployment. We need to edit that configuration and remove the surviving node’s details from it, so that only the node being restored is left. We can import the XML into OEDA, make the required changes and save the configuration files. This needs to be done carefully, as a simple mistake like a duplicate IP may cause issues for the ASM/DBs running on the other node. Once this is done, we can download the OneCommand patch (MOS note 888828.1) and run the “create VMs” step of OneCommand (see the sketch after this list). As we have only this one node in the XML file, it is not going to touch the existing configuration.
  3. Create users : Now we need to create the users on the newly created VMs. OneCommand’s “create users” step can be used here; it will create the users on all the VMs. There are a few things we need to do manually. The first is to remove the binaries from the Grid & DB homes: since we are going to use addnode.sh to add the new nodes to the existing RAC clusters, the binaries will be copied over from an existing node. Then we need to change the ownership of the Grid & DB home directory trees to oracle:oinstall. Also, for each VM we need to set up passwordless ssh with the corresponding VM (and vice versa) that is going to be part of the same cluster.
  4. Clear failed node config : Next we need to clear the failed node’s configuration from each of the RAC clusters. That is pretty much the standard node-removal stuff we do in RAC (see the sketch after this list).
  5. Add the new nodes : This again is just the standard addnode procedure we follow in RAC (also covered in the sketch below).
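
To make steps 2, 4 and 5 a bit more concrete, here is a minimal sketch of the commands involved. Treat it as an illustration rather than a runbook: the XML file name, database name (orcl), VM hostnames (exadbvm01/exadbvm02) and the OneCommand step number are assumptions that will differ on your system, so always confirm the step list with install.sh -l and follow the standard documentation for node removal/addition.

## Step 2: run only the "create VMs" step of OneCommand against the edited XML
[root@exadb02 linux-x64]# ./install.sh -cf restore-node1.xml -l
[root@exadb02 linux-x64]# ./install.sh -cf restore-node1.xml -s 8
## Step 4: clean up the failed node from a surviving cluster node
[oracle@exadbvm02 ~]$ srvctl remove instance -db orcl -instance orcl1
[root@exadbvm02 ~]# $GRID_HOME/bin/crsctl delete node -n exadbvm01
[oracle@exadbvm02 ~]$ olsnodes -s -t
## Step 5: extend the Grid and DB homes to the new VM and add the instance back
[oracle@exadbvm02 ~]$ $GRID_HOME/addnode.sh -silent "CLUSTER_NEW_NODES={exadbvm01}" "CLUSTER_NEW_VIRTUAL_HOSTNAMES={exadbvm01-vip}"
[oracle@exadbvm02 ~]$ $ORACLE_HOME/addnode.sh -silent "CLUSTER_NEW_NODES={exadbvm01}"
[oracle@exadbvm02 ~]$ srvctl add instance -db orcl -instance orcl1 -node exadbvm01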

I have used the terms VM and node interchangeably here, but the context should make it clear whether I am referring to the physical node or a VM. There is another method to do all this using OEDACLI, which automates a lot of these steps; it is documented in the Exadata documentation. Check this link for the details.

dbnodeupdate.sh appears to be stuck

I was patching an Exadata DB node from 18.1.5.0.0.180506 to 19.3.2.0.0.191119. It had been more than an hour and dbnodeupdate.sh appeared to be stuck. Trying to ssh to the node gave “connection refused” and the console had this output (some output removed for brevity):

[  458.006444] upgrade[8876]: [642/676] (72%) installing exadata-sun-computenode-19.3.2.0.0.191119-1...
<>
[  459.991449] upgrade[8876]: Created symlink /etc/systemd/system/multi-user.target.wants/exadata-iscsi-reconcile.service, pointing to /etc/systemd/system/exadata-iscsi-reconcile.service.
[  460.011466] upgrade[8876]: Looking for unit files in (higher priority first):
[  460.021436] upgrade[8876]: /etc/systemd/system
[  460.028479] upgrade[8876]: /run/systemd/system
[  460.035431] upgrade[8876]: /usr/local/lib/systemd/system
[  460.042429] upgrade[8876]: /usr/lib/systemd/system
[  460.049457] upgrade[8876]: Looking for SysV init scripts in:
[  460.057474] upgrade[8876]: /etc/rc.d/init.d
[  460.064430] upgrade[8876]: Looking for SysV rcN.d links in:
[  460.071445] upgrade[8876]: /etc/rc.d
[  460.076454] upgrade[8876]: Looking for unit files in (higher priority first):
[  460.086461] upgrade[8876]: /etc/systemd/system
[  460.093435] upgrade[8876]: /run/systemd/system
[  460.100433] upgrade[8876]: /usr/local/lib/systemd/system
[  460.107474] upgrade[8876]: /usr/lib/systemd/system
[  460.114432] upgrade[8876]: Looking for SysV init scripts in:
[  460.122455] upgrade[8876]: /etc/rc.d/init.d
[  460.129458] upgrade[8876]: Looking for SysV rcN.d links in:
[  460.136468] upgrade[8876]: /etc/rc.d
[  460.141451] upgrade[8876]: Created symlink /etc/systemd/system/multi-user.target.wants/exadata-multipathmon.service, pointing to /etc/systemd/system/exadata-multipathmon.service.

There was not much I could do, so I just waited. I also created an SR with Oracle Support and they too suggested waiting. It started moving after some time and completed successfully. When the node finally came up, I found an NFS mount entry in /etc/rc.local, and that was what had caused the problem. For the second node we commented it out and everything went smoothly. It is important to comment out all NFS entries during patching to avoid such issues; I had commented out the ones in /etc/fstab, but the one in rc.local was unexpected.
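
A quick pre-check before starting dbnodeupdate.sh can catch such entries. This is only a minimal sketch; the file list is an assumption and your environment may reference NFS mounts from other startup scripts as well:

## Look for NFS references that should be commented out before patching
[root@exadb01 ~]# grep -i nfs /etc/fstab /etc/rc.local /etc/rc.d/rc.local
## Confirm nothing is still mounted over NFS
[root@exadb01 ~]# mount -t nfs,nfs4
[root@exadb01 ~]# df -h -t nfs -t nfs4

Anything reported here should be commented out (and the filesystem unmounted) before the patch, and restored once the node is back on the new image.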

Understanding grid disks in Exadata

The use of Exadata storage cells seems to be a poorly understood concept. A lot of people are confused about how exactly ASM makes use of the disks from the storage cells. Many folks assume there is some sort of RAID configured at the storage layer, whereas there is nothing like that. I will try to explain some of the concepts in this post.

Let’s take the example of an Exadata quarter rack, which has 2 DB and 3 storage nodes (node means a server here). A few things to note:

  • The space for binaries installation on the DB nodes comes from the local disks installed in the DB nodes (600 GB * 4, expandable to 8, configured in RAID 5). In case you are using OVM, the same disks are used for keeping configuration files, virtual disks for the VMs etc.
  • All of the ASM space comes from storage cells. The minimum configuration is 3 storage cells.

So let’s try to understand what makes up a storage cell. There are 12 disks in each storage cell (the latest X7 cells come with 10 TB disks). As mentioned above, there are 3 storage cells in the minimum configuration, so we have a total of 36 disks. There is no RAID configured at the storage layer; all the redundancy is handled at the ASM level. So to create a disk group:

  • First of all, cell disks are created on each storage cell. One physical disk makes one cell disk, so a quarter rack has 36 cell disks.
  • To divide the space among the various disk groups (by default only two disk groups are created, DATA & RECO, and you can choose how much space to give to each of them), grid disks are created. A grid disk is a partition on a cell disk, a slice of a disk in other words. A slice from each cell disk must be part of both disk groups; we can’t have something like DATA getting 18 of the 36 disks and RECO getting the other 18. That is not supported. Let’s say you decide to allocate 5 TB to the DATA grid disks and 4 TB to the RECO grid disks (out of the 10 TB on each disk, approximately 9 TB is what you get as usable). So you would divide each cell disk into two parts of 5 TB and 4 TB, and you would have 36 slices of 5 TB each and 36 slices of 4 TB each.
  • The DATA disk group will be created using the 36 slices of 5 TB each, where the grid disks from each storage cell constitute one failgroup.
  • Similarly, the RECO disk group will be created using the 36 slices of 4 TB each (see the sketch below for the commands involved).
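
To make this concrete, here is a rough sketch of the commands involved, assuming 12 x 10 TB HC disks per cell and the 5 TB / 4 TB split described above. The prefixes, sizes and attribute values are illustrative only; on a real deployment OEDA/OneCommand generates and runs these for you.

## On each storage cell: create cell disks, then carve grid disks out of them
CellCLI> CREATE CELLDISK ALL HARDDISK
CellCLI> CREATE GRIDDISK ALL HARDDISK PREFIX=DATA, SIZE=5T
CellCLI> CREATE GRIDDISK ALL HARDDISK PREFIX=RECO
CellCLI> LIST GRIDDISK ATTRIBUTES name, size, asmDiskGroupName
## On a DB node, from the ASM instance: the grid disks of each cell form one failgroup
SQL> CREATE DISKGROUP DATA NORMAL REDUNDANCY
       DISK 'o/*/DATA*'
       ATTRIBUTE 'compatible.asm'='18.0.0.0', 'compatible.rdbms'='12.2.0.1',
                 'cell.smart_scan_capable'='TRUE', 'au_size'='4M';
SQL> CREATE DISKGROUP RECO NORMAL REDUNDANCY
       DISK 'o/*/RECO*'
       ATTRIBUTE 'compatible.asm'='18.0.0.0', 'compatible.rdbms'='12.2.0.1',
                 'cell.smart_scan_capable'='TRUE', 'au_size'='4M';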

What we have discussed above is a quarter rack scenario with High Capacity (HC) disks. There can be somewhat different configurations too:

  • Instead of HC disks, you can have the Extreme Flash (EF) configuration, which uses flash cards in place of disks. Everything remains the same except the numbers: instead of 12 HC disks there are 8 flash cards per cell.
  • With X3, I think, Oracle introduced the eighth rack configuration. In an eighth rack the DB nodes come with half the cores (of the quarter rack DB nodes) and the storage cells come with 6 disks each, so here you would have only 18 disks in total. Everything else works in the same way.

Hope this clarifies some of the doubts about grid disks.


dbca doesn’t list diskgroups

This is an Exadata machine running GI version 18.3.0.0.180717 and DB version 12.1.0.2.180717. On one of the DB nodes, dbca doesn’t list the disk groups; it works fine on the other node.

I checked the dbca trace and found that the kfod command was failing. I tried to run it manually and got the same error:

[oracle@exadb01 ~]$ /u01/app/18.0.0.0/grid/bin/kfod op=groups verbose=true
KFOD-00300: OCI error [-1] [OCI error] [Could not fetch details] [-105777048]

KFOD-00105: Could not open pfile 'init@.ora'
[oracle@exadb01 ~]$

I ran it with strace then:

[oracle@exadb01 ~]$ strace /u01/app/18.0.0.0/grid/bin/kfod op=groups verbose=true
execve("/u01/app/18.0.0.0/grid/bin/kfod", ["/u01/app/18.0.0.0/grid/bin/kfod", "op=groups", "verbose=true"], [/* 18 vars */]) = 0
brk(0) = 0x2641000
.
.
.
.
.
open("/u01/app/18.0.0.0/grid/dbs/ab_+ASM1.dat", O_RDONLY) = -1 EACCES (Permission denied)
geteuid() = 1003
open("/u01/app/18.0.0.0/grid/rdbms/mesg/kfodus.msb", O_RDONLY) = 13
fcntl(13, F_SETFD, FD_CLOEXEC) = 0
lseek(13, 0, SEEK_SET) = 0
read(13, "\25\23\"\1\23\3\t\t\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"…, 280) = 280
lseek(13, 512, SEEK_SET) = 512
read(13, "\352\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"…, 512) = 512
lseek(13, 1024, SEEK_SET) = 1024
read(13, ".\1=\1E\1M\1X\1\352\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"…, 512) = 512
lseek(13, 1536, SEEK_SET) = 1536
read(13, "\n\0d\0\0\0D\0e\0\1\0e\0f\0\1\0\230\0g\0\1\0\306\0h\0\2\0\325\0"…, 512) = 512
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 3), …}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f43f85f2000
write(1, "KFOD-00300: OCI error [-1] [OCI "…, 78KFOD-00300: OCI error [-1] [OCI error] [Could not fetch details] [-132605848]
) = 78

The failed open() call (permission denied) just before the kfod error caught my attention. When I checked, the oracle user indeed wasn’t able to read that file. The permissions looked like this:

[root@exadb01 dbs]# ls -ltr
total 20
-rw-r--r-- 1 oragrid oinstall 3079 May 14 2015 init.ora
-rw-r--r-- 1 oragrid oinstall 587 Dec 12 15:33 initbackuppfile.ora
-rw-rw---- 1 oragrid asmadmin 1656 Dec 20 14:26 ab_+ASM1.dat
-rw-rw---- 1 oragrid oinstall 1544 Dec 20 14:26 hc_+APX1.dat
-rw-rw---- 1 oragrid oinstall 1544 Dec 21 16:57 hc_+ASM1.dat
[root@exadb01 dbs]#

Whereas on node 2 they looked like this:

[oracle@exadb02 dbs]$ ls -ltr 
total 16
-rwxrwxrwx 1 oragrid oinstall 3079 Dec 12 14:52 init.ora
-rwxrwxrwx 1 oragrid oinstall 1544 Dec 21 16:57 hc_+ASM2.dat
-rw-rw---- 1 oragrid oinstall 1720 Dec 21 16:57 ab_+ASM2.dat
-rwxrwxrwx 1 oragrid oinstall 1544 Dec 21 16:57 hc_+APX2.dat
[oracle@exadb02 dbs]$

Since the oracle user isn’t a member of the asmadmin group, it is not able to read the ab_+ASM1.dat file. Changing the ownership to oragrid:oinstall (matching node 2) fixed the issue.
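
A minimal sketch of the check and the fix, run as root on the affected node (the path is the one shown in the strace output above):

## Confirm that the oracle user is not a member of asmadmin
[root@exadb01 dbs]# id oracle
## Match the group ownership used on node 2 for the ab_+ASM1.dat file
[root@exadb01 dbs]# chown oragrid:oinstall /u01/app/18.0.0.0/grid/dbs/ab_+ASM1.dat
[root@exadb01 dbs]# ls -l /u01/app/18.0.0.0/grid/dbs/ab_+ASM1.dat

After this, kfod op=groups runs cleanly and dbca lists the disk groups again.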