Failed to create voting files on disk group RECOC1

Long story short, I faced this issue while running OneCommand for an Exadata system. The root.sh step (Initialize Cluster Software) was failing with the following error on the screen:

Checking file root_dm01dbadm02.somedomain.com_2017-04-27_18-13-27.log on node dm01dbadm02.somedomain.com
Error: Error running root scripts, please investigate…
Collecting diagnostics…
Errors occurred. Send /u01/onecommand/linux-x64/WorkDir/Diag-170427_181710.zip to Oracle to receive assistance.

Doesn't make much sense, so let us check the log file of this step:

2017-04-27 18:17:10,463 [INFO][  OCMDThread][        ClusterUtils:413] Checking file root_dm01dbadm02.somedomain.com_2017-04-27_18-13-27.log on node dm01dbadm02.somedomain.com
2017-04-27 18:17:10,464 [INFO][  OCMDThread][        OcmdException:62] Error: Error running root scripts, please investigate…
2017-04-27 18:17:10,464 [FINE][  OCMDThread][        OcmdException:63] Throwing OcmdException… message:Error running root scripts, please investigate…

So we need to go to the root.sh log file now (the root_dm01dbadm02.somedomain.com_2017-04-27_18-13-27.log file named in the output above). That shows:

Failed to create voting files on disk group RECOC1.
Change to configuration failed, but was successfully rolled back.
CRS-4000: Command Replace failed, or completed with errors.
Voting file add failed
2017/04/27 18:16:37 CLSRSC-261: Failed to add voting disks

Died at /u01/app/12.1.0.2/grid/crs/install/crsinstall.pm line 2068.
The command '/u01/app/12.1.0.2/grid/perl/bin/perl -I/u01/app/12.1.0.2/grid/perl/lib -I/u01/app/12.1.0.2/grid/crs/install /u01/app/12.1.0.2/grid/crs/install/rootcrs.pl' execution failed

Makes some sense, but we still can't tell what happened while creating the voting files on RECOC1. Let us check the ASM alert log also:

NOTE: Creating voting files in diskgroup RECOC1
Thu Apr 27 18:16:36 2017
NOTE: Voting File refresh pending for group 1/0x39368071 (RECOC1)
Thu Apr 27 18:16:36 2017
NOTE: Attempting voting file creation in diskgroup RECOC1
NOTE: voting file allocation (replicated) on grp 1 disk RECOC1_CD_00_DM01CELADM01
NOTE: voting file allocation on grp 1 disk RECOC1_CD_00_DM01CELADM01
NOTE: voting file allocation (replicated) on grp 1 disk RECOC1_CD_00_DM01CELADM02
NOTE: voting file allocation on grp 1 disk RECOC1_CD_00_DM01CELADM02
NOTE: voting file allocation (replicated) on grp 1 disk RECOC1_CD_00_DM01CELADM03
NOTE: voting file allocation on grp 1 disk RECOC1_CD_00_DM01CELADM03
ERROR: Voting file allocation failed for group RECOC1
Thu Apr 27 18:16:36 2017
Errors in file /u01/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_ora_228588.trc:
ORA-15274: Not enough failgroups (5) to create voting files

So we can see the issue here. The trace file mentioned above can be checked for more detail.
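To confirm the failure group layout, we can count the disks per failure group from the ASM instance. A minimal sanity-check query (the disk group name is the one from this system):

SELECT failgroup, COUNT(*) AS disk_count
  FROM v$asm_disk
 WHERE group_number = (SELECT group_number
                         FROM v$asm_diskgroup
                        WHERE name = 'RECOC1')
 GROUP BY failgroup;

On this system it returns just three failure groups, one per cell, which is exactly the problem.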

Now, why did this happen?

RECOC1 is a HIGH redundancy disk group, which means that if we want to place voting files there, it must have at least 5 failure groups. This configuration has only 3 cells, which doesn't meet the minimum failure group requirement (1 cell = 1 failgroup in Exadata).
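The CRS-4000 "Command Replace failed" message above suggests that root.sh was effectively doing a voting file replace into RECOC1. The same ORA-15274 can be reproduced by hand with crsctl (purely as an illustration; don't try this on a healthy cluster):

# as the clusterware owner, with the 12.1.0.2 grid environment set
crsctl replace votedisk +RECOC1

With only 3 failure groups in a HIGH redundancy disk group, this fails the same way the root script did.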

Now, how did it happen?

This was an Exadata X3 half rack that we planned to deploy (for testing purposes) as two quarter racks: the 1st cluster with db1, db2 + cell1, cell2, cell3, and the 2nd cluster with db3, db4 + cell4, cell5, cell6, cell7. All the disk groups were to be in HIGH redundancy.

Before a certain 12.x Exadata software version it was not even possible to have all disk groups in HIGH redundancy in a quarter rack, because placing the voting disks in a HIGH redundancy disk group needs a minimum of 5 failure groups (as mentioned above), and a quarter rack has only 3. That 12.x Exadata software version introduced a new feature called quorum disks, which made this configuration possible (the Exadata documentation covers quorum disks in more detail). Basically, we take a slice of disk from each DB node and add it to the disk group where we want the voting files. 3 cells + 2 quorum disks from the DB nodes makes 5 failure groups, so all is good.
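On a system that was deployed with quorum disks, you can see the extra failure groups from ASM. Assuming a version that exposes the FAILGROUP_TYPE column in v$asm_disk (12.1 does), a quick sketch:

SELECT DISTINCT failgroup, failgroup_type
  FROM v$asm_disk
 WHERE group_number = (SELECT group_number
                         FROM v$asm_diskgroup
                        WHERE name = 'RECOC1');

With quorum disks in place you would expect five rows: three REGULAR failure groups (the cells) and two QUORUM ones (the DB node slices).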

While starting with the deployment we noticed that db node 1 had some hardware issues. As we needed the machine for testing, we decided to build the first cluster with one db node only, so the final configuration of the 1st cluster was 1 db node + 3 cells. We imported the XML back into OEDA, modified the cluster 1 configuration down to one db node, and generated the configuration files. That is where the problem started: RECOC1 was still HIGH redundancy, but with only one db node the configuration was no longer even a candidate for quorum disks (1 db node + 3 cells gives at most 4 failure groups, still short of 5). Hence the above error. Changing DBFS_DG to NORMAL redundancy fixed the issue, because when DBFS_DG is NORMAL redundancy, OneCommand places the voting files there.
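Once the root.sh step goes through, it is easy to verify where the voting files actually landed:

crsctl query css votedisk

After the fix above, all the voting files should be listed on DBFS_DG grid disks instead of RECOC1.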

Ideally this shouldn't have happened, as OEDA shouldn't allow a configuration that isn't doable. The catch here is that the original configuration had 2 db nodes + 3 cells, so HIGH redundancy for all disk groups was allowed in OEDA. When one db node was removed from the cluster during the modification, OEDA apparently didn't re-run the redundancy check on the disk groups and let us go past that screen. If you try to create a new configuration with 1 db node + 3 cells, it will not allow you to choose HIGH redundancy for all disk groups: DBFS_DG will remain in NORMAL redundancy, and you can't change that.

OneCommand Step 1 error

Hit this silly issue while doing an Exadata deployment for a customer. Step 1 was giving the following error:

ERROR: 192.168.99.102 configured on dm01celadm01.example.com as dm01dbadm02 does not match expected value dm01dbadm02.example.com

I wasn't able to make sense of it for quite some time, until a colleague pointed out that the reverse lookup entries should be done for the FQDN only. As is clear in the above message, the reverse lookup of the IP 192.168.99.102 returns dm01dbadm02 instead of dm01dbadm02.example.com. Fixing this in DNS resolved the issue.
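This is easy to check from any host that uses the same DNS. The reverse lookup should return only the FQDN; for example, with dig (nslookup works equally well):

dig +short -x 192.168.99.102

The one and only answer should be dm01dbadm02.example.com. If the short hostname shows up too, the reverse zone has an extra PTR record.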

Actually, the customer had created reverse lookup entries for both the short hostname and the FQDN. As DNS can return the results in any order, the error message was a bit random: whenever the short hostname was returned first, Step 1 gave an error for that IP; whenever the FQDN was returned first, Step 1 passed.
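In BIND zone file terms (assuming the customer's DNS was BIND; the zone name is simply derived from the example IP above), the reverse zone should carry exactly one PTR per address:

; reverse zone 99.168.192.in-addr.arpa
; correct: a single PTR pointing to the FQDN
102    IN    PTR    dm01dbadm02.example.com.
; wrong: a second PTR for the short hostname, which is what made Step 1 fail intermittently
; 102    IN    PTR    dm01dbadm02.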