ORA-12154 in Data Guard environment

Hit this silly issue in one of the Data Guard environments today. The primary is a 2 node RAC running 11.2.0.4 and the standby is also a 2 node RAC. Archive logs from node2 aren’t shipping, and the error being reported is:

ORA-12154: TNS:could not resolve the connect identifier specified

We tried the usual things: going to $TNS_ADMIN, checking the entry in tnsnames.ora, and also connecting manually with sqlplus sys@target as sysdba. Everything seemed to be good, but logs were not shipping and the same error kept getting reported. Since everything on node1 was working fine, it looked even more weird.

From the error it is clear that the issue is with the tnsnames entry. I finally found the issue after some 30 minutes. It was an Oracle EBS environment, so TNS_ADMIN was set to the standard $ORACLE_HOME/network/admin/*hostname* path (on both nodes). On node1 there was no tnsnames.ora file in $ORACLE_HOME/network/admin, so it was connecting to the standby using the Apps tnsnames.ora, which had the correct entry for the standby. On node2 there was a tnsnames.ora file in $ORACLE_HOME/network/admin, but it had no entry for the standby. Node2 was trying to connect using that file (the default TNS path) and failing with ORA-12154. Once we removed that file, node2 started using the Apps tnsnames.ora as well and logs started shipping.
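When the usual checks come up clean, it helps to watch which tnsnames.ora file is actually opened during name resolution. A minimal sketch, assuming Linux and a connect identifier named standby (a placeholder, not the real alias); tracing tnsping is only an approximation of what the archiver process does, but it resolves the alias through the same sqlnet configuration:

$ echo $TNS_ADMIN
$ strace -e trace=file tnsping standby 2>&1 | grep -i tnsnames.ora

The grep output shows every tnsnames.ora path the resolver tries, in order, which would have exposed the stray file on node2 immediately.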

Failed to create voting files on disk group RECOC1

Long story short: I faced this issue while running OneCommand for an Exadata system. The root.sh step (Initialize Cluster Software) was failing with the following error on the screen:

Checking file root_dm01dbadm02.somedomain.com_2017-04-27_18-13-27.log on node dm01dbadm02.somedomain.com
Error: Error running root scripts, please investigate…
Collecting diagnostics…
Errors occurred. Send /u01/onecommand/linux-x64/WorkDir/Diag-170427_181710.zip to Oracle to receive assistance.

That doesn’t tell us much, so let us check the log file of this step:

2017-04-27 18:17:10,463 [INFO][  OCMDThread][        ClusterUtils:413] Checking file root_dm01dbadm02.somedomain.com_2017-04-27_18-13-27.log on node dm01dbadm02.somedomain.com
2017-04-27 18:17:10,464 [INFO][  OCMDThread][        OcmdException:62] Error: Error running root scripts, please investigate…
2017-04-27 18:17:10,464 [FINE][  OCMDThread][        OcmdException:63] Throwing OcmdException… message:Error running root scripts, please investigate…

So we need to look at the root.sh log file now. That shows:

Failed to create voting files on disk group RECOC1.
Change to configuration failed, but was successfully rolled back.
CRS-4000: Command Replace failed, or completed with errors.
Voting file add failed
2017/04/27 18:16:37 CLSRSC-261: Failed to add voting disks

Died at /u01/app/12.1.0.2/grid/crs/install/crsinstall.pm line 2068.
The command ‘/u01/app/12.1.0.2/grid/perl/bin/perl -I/u01/app/12.1.0.2/grid/perl/lib -I/u01/app/12.1.0.2/grid/crs/install /u01/app/12.1.0.2/grid/crs/install/rootcrs.pl’ execution failed

That makes some sense, but it still doesn’t tell us what happened while creating the voting files on RECOC1. Let us check the ASM alert log as well:

NOTE: Creating voting files in diskgroup RECOC1
Thu Apr 27 18:16:36 2017
NOTE: Voting File refresh pending for group 1/0x39368071 (RECOC1)
Thu Apr 27 18:16:36 2017
NOTE: Attempting voting file creation in diskgroup RECOC1
NOTE: voting file allocation (replicated) on grp 1 disk RECOC1_CD_00_DM01CELADM01
NOTE: voting file allocation on grp 1 disk RECOC1_CD_00_DM01CELADM01
NOTE: voting file allocation (replicated) on grp 1 disk RECOC1_CD_00_DM01CELADM02
NOTE: voting file allocation on grp 1 disk RECOC1_CD_00_DM01CELADM02
NOTE: voting file allocation (replicated) on grp 1 disk RECOC1_CD_00_DM01CELADM03
NOTE: voting file allocation on grp 1 disk RECOC1_CD_00_DM01CELADM03
ERROR: Voting file allocation failed for group RECOC1
Thu Apr 27 18:16:36 2017
Errors in file /u01/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_ora_228588.trc:
ORA-15274: Not enough failgroups (5) to create voting files

So we can see the issue here: ORA-15274. The trace file mentioned above has more detail.

Now, why did this happen?

RECOC1 is a HIGH redundancy disk group, which means that to place voting files there it must have at least 5 failure groups. This configuration has only 3 cells, which doesn’t meet the minimum failure group requirement (1 cell = 1 failgroup on Exadata).
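To see why ORA-15274 fires, count the failure groups per disk group. A minimal sketch against the standard v$asm views (voting files need 1, 3 or 5 failure groups for EXTERNAL, NORMAL and HIGH redundancy respectively):

SQL> select g.name, g.type, count(distinct d.failgroup) as failgroups
       from v$asm_diskgroup g join v$asm_disk d
         on d.group_number = g.group_number
      group by g.name, g.type;

Here it would show 3 failure groups for RECOC1, two short of what HIGH redundancy voting files require.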

Now, how did it happen?

This was an Exadata X3 half rack that we planned to deploy (for testing purposes) as 2 quarter racks: the 1st cluster with db1, db2 + cell1, cell2, cell3 and the 2nd cluster with db3, db4 + cell4, cell5, cell6, cell7. All disk groups were to be in High redundancy.

Before a certain 12.x Exadata software version it was not even possible to have all disk groups in High redundancy in a quarter rack, because placing the voting disks in a High redundancy disk group needs a minimum of 5 failure groups (as mentioned above) and a quarter rack has only 3. That 12.x Exadata software version introduced a new feature, quorum disks, which made this configuration possible. Read this link for more details. Basically, we take a slice of disk from each DB node and add it to the disk group that is to hold the voting files. 3 cells + 2 quorum disks from the DB nodes makes 5 failure groups, so all is good.

While starting the deployment we noticed that db node1 had some hardware issues. As we needed the machine for testing, we decided to build the first cluster with 1 db node only, so the final configuration of the 1st cluster was 1 db node + 3 cells. We imported the XML back into OEDA, modified the cluster 1 configuration down to 1 db node and generated the configuration files. That is where the problem started: the RECO disk group was still High redundancy, but with only 1 db node there could be at most 4 failure groups (3 cells + 1 quorum disk), so the configuration was not even a candidate for quorum disks. Hence the above error. Changing DBFS_DG to Normal redundancy fixed the issue, because when DBFS_DG is Normal redundancy, OneCommand places the voting files there.
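For reference, once enough failure groups exist (for example after quorum disks are added), the voting files can be moved into the High redundancy disk group and verified with crsctl. A sketch, run as root from the Grid Infrastructure home:

# crsctl replace votedisk +RECOC1
# crsctl query css votedisk

The query should then list 5 voting files in RECOC1.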

Ideally this shouldn’t have happened, as OEDA shouldn’t allow a configuration that isn’t doable. The case here is that the configuration originally had 2 db nodes + 3 cells, so High redundancy for all disk groups was allowed in OEDA. When one db node was removed from the cluster during the modification, OEDA apparently didn’t re-run the redundancy check on the disk groups and let us go past that screen. If you try to create a new configuration with 1 db node + 3 cells, it will not let you choose High redundancy for all disk groups: DBFS_DG stays in Normal redundancy and you can’t change that.

OneCommand Step 1 error

Hit this silly issue while doing an Exadata deployment for a customer. Step 1 was giving the following error:

ERROR: 192.168.99.102 configured on dm01celadm01.example.com as dm01dbadm02 does not match expected value dm01dbadm02.example.com

I wasn’t able to make sense of it for quite some time until a colleague pointed out that reverse lookup entries should be created for the FQDN only. As is clear from the message above, a reverse lookup of the IP 192.168.99.102 returns dm01dbadm02 instead of dm01dbadm02.example.com. Fixing this in DNS resolved the issue.

Actually, the customer had created reverse lookup entries for both the short hostname and the FQDN. Since DNS can return the results in any order, the error message was a bit random: whenever the short hostname was returned first, Step 1 gave an error, but when the FQDN came back first, there was no error in Step 1 for that IP.
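A quick way to see what the reverse zone actually returns, using the IP from the error above (nslookup works equally well):

$ dig +short -x 192.168.99.102
dm01dbadm02.example.com.

Only the FQDN PTR record should exist; as long as a short-name PTR is also present, the answer order, and with it the Step 1 outcome, stays unpredictable.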

Oracle RAC 12.1 – lsnodes exited with code 9

I was trying to do a 2 node RAC setup on Solaris 11.3 where Oracle Solaris Cluster 4.3 was already configured. The installer was running, but the Cluster Node Information screen appeared like this:

[Screenshot: Cluster Node Information screen with no cluster nodes listed]

The install log shows this:

INFO: Checking cluster configuration details
INFO: Found Vendor Clusterware. Fetching Cluster Configuration
INFO: Executing [/tmp/OraInstall2017-03-28_12-50-48PM/ext/bin/lsnodes]
with environment variables {TERM=xterm, LC_COLLATE=, SHLVL=3, JAVA_HOME=, XFILESEARCHPATH=/usr/dt/app-defaults/%L/Dt, SSH_CLIENT=172.16.64.55 56370 22, LC_NUMERIC=, LC_MESSAGES=, MAIL=/var/mail/oracle, PWD=/export/software/grid/grid, XTERM_VERSION=XTerm(320), WINDOWID=2097165, LOGNAME=oracle, _=*50727*/export/software/grid/grid/install/.oui, NLSPATH=/usr/dt/lib/nls/msg/%L/%N.cat, SSH_CONNECTION=172.16.64.55 56370 172.16.72.18 22, OLDPWD=/export/oracle, LC_CTYPE=, CLASSPATH=, PATH=/usr/bin:/usr/ccs/bin:/usr/bin:/bin:/export/software/grid/grid/install, LC_ALL=, DISPLAY=localhost:10.0, LC_MONETARY=, USER=oracle, HOME=/export/oracle, XTERM_SHELL=/bin/bash, XAUTHORITY=/tmp/ssh-xauth-mlq21a/xauthfile, A__z="*SHLVL, XTERM_LOCALE=en_US.UTF-8, TZ=localtime, LC_TIME=, LANG=en_US.UTF-8}
INFO: Starting Output Reader Threads for process /tmp/OraInstall2017-03-28_12-50-48PM/ext/bin/lsnodes
INFO: The process /tmp/OraInstall2017-03-28_12-50-48PM/ext/bin/lsnodes exited with code 9

So we can see the problem: lsnodes is not able to list the nodes. Let us try to run that command manually.

-bash-4.1$ export PATH=/usr/bin:/usr/ccs/bin:/usr/bin:/bin:/export/software/grid/grid/install
-bash-4.1$ /tmp/OraInstall2017-03-28_12-50-48PM/ext/bin/lsnodes
ld.so.1: lsnodes: fatal: libskgxn2.so: open failed: No such file or directory
Killed
-bash-4.1$

So it looks like it is unable to find a library called libskgxn2.so. If we do a find for this file name, we can see that it is present at /usr/cluster/lib/sparcv9/libskgxn2.so.
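The search looked something like this:

-bash-4.1$ find / -name libskgxn2.so 2>/dev/null
/usr/cluster/lib/sparcv9/libskgxn2.so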

Some googling and MOS searches revealed that the installer expects the library to be present under /opt/ORCLcluster/lib, and that directory doesn’t exist here. As a workaround we can create the directory manually and add a symbolic link to libskgxn2.so.
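A sketch of that workaround, run as root on both nodes (the source path is the one found above):

# mkdir -p /opt/ORCLcluster/lib
# ln -s /usr/cluster/lib/sparcv9/libskgxn2.so /opt/ORCLcluster/lib/libskgxn2.so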

The lsnodes command worked fine after this workaround, and the installer showed both nodes listed as well.

ERROR: SPFile in diskgroup <> does not match the specified spfile

Just a stupid error. Posting it so that someone else googling for the same thing can get a clue.

An ASM instance was running with default parameters (no pfile, no spfile). I set an spfile for the instance with the asmcmd spset command and bounced CRS. Even after the restart it still wasn’t using the spfile. Puzzled, I checked the GPnP settings again and all looked good. Then I came across this in the alert log:

ERROR: SPFile in diskgroup <> does not match the specified spfile +DATA/asm/asmparameterfile/registry.253.769187275

The problem was that while copying the spfile path, the complete name didn’t get copied: the last character was missed, so the file it was looking for didn’t exist. Updating GPnP with the correct filename and bouncing CRS resolved the issue.
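A sketch of the check and fix with asmcmd: spget prints the spfile location recorded in the GPnP profile, which can be compared character by character against the actual registry file, and spset records the full name (shown here with the path from the message above). Bounce CRS afterwards so ASM restarts with the spfile.

ASMCMD> spget
ASMCMD> spset +DATA/asm/asmparameterfile/registry.253.769187275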