A customer is using an Exadata X8M-2 machine with multiple VMs (hence multiple clusters). I was working on adding a new storage cell to the configuration. After creating griddisks on the new cell and updating cellip.ora on all the VMs, I noticed that none of the clusters was able to see the new griddisks. I checked the usual suspects like if asm_diskstring was set properly, private network subnet mask on new cell was same as the old ones. All looked good. I started searching about the issue and stumbled upon some references mentioning ASM scoped security. I checked on one of the existing cells and that actually was the issue. The existing nodes had it enabled while the new one hadn’t. Running this command on an existing cell
We need to enable ASM scoped security on the new cell as well. There are three things that need to be done. We need to copy /etc/oracle/cell/network-config/cellkey.ora from an existing cell to the new cell, assign the key to the cell and then assign keys to the different ASM clusters. We can use these commands to do it
cellcli -e ASSIGN KEY FOR CELL 'c25a62472a160e28bf15a29c162f1d74'
cellcli -e ASSIGN KEY FOR ASMCLUSTER 'cluster1'='fa292e11b31b210c4b7a24c5f1bb4d32';
cellcli -e ASSIGN KEY FOR ASMCLUSTER 'cluster2'='b67d5587fe728118af47c57ab8da650a';
Once this is done, we need to tag the griddisks for appropriate ASM clusters. If the griddisks aren’t created yet, we can use this command to do it
cellcli -e CREATE GRIDDISK ALL HARDDISK PREFIX=sales, size=75G, availableTo='cluster1'
If the griddisks are already created, we can use the alter command to make this change
cellcli -e alter griddisk griddisk0,gridisk1,.....griddisk11 availableTo='cluster1';
Once this is done, we should be able to see new griddisks as CANDIDATE in v$asm_disk
A customer who is using an Exadata X8M-2 with multiple VMs had Smokescreen deployed in their company recently and they reported an issue that one of the Smokescreen decoy servers in their DC was seeing traffic from one of the Exadata VMs on a certain port. That was rather confusing as that port was the database listener port on that VM and why would a VM with Oracle RAC deployed try to access any random IP on the listener port. Also it was happening only for this VM. Nothing for so many other VMs.
We were just looking at the things and my colleague said that he had seen this IP somewhere and he started looking through the emails. In a minute, we found the issue as he found this IP mentioned in one of the emails. This was the VIP of this VM from where the traffic was reported to be originating. While reserving IPs for Smokescreen decoy servers, someone made the mess and re-used the IP that was already used for one of the VIPs of this RAC system !
I was working on configuring a new database for backup to ZDLRA and hit this issue while testing a controlfile backup via Enterprise Manager -> Schedule backup. It could happen in any environment.
Unable to connect to the database with SQLPlus, either because the database is down or due to an environment issue such as incorrectly specified...
If the database is up, check the database target monitoring properties and verify that the Oracle Home value is correct.
The 2nd line clearly tells the problem but since the Cluster Database status in EM was green, so it took me a while to figure it out. Issue turned out to be a missing / in the end of ORACLE_HOME specified in monitoring configuration of the cluster database. The DB home specified was /u01/app/oracle/product/220.127.116.11/dbhome_1 instead of /u01/app/oracle/product/18.104.22.168/dbhome_1/.
On the server the bash_profile has the home set as /u01/app/oracle/product/22.214.171.124/dbhome_1. When I tried to connect as sysdba there, it gave an error TNS : lost contact. Then I set the environment with .oraenv and I was able to connect. /etc/oratab had correct home specified as /u01/app/oracle/product/126.96.36.199/dbhome_1/. After comparing the value of ORACLE_HOME in these two cases, the issue was identified. Then I updated the ORACLE_HOME value in the target monitoring configuration in Enterprise Manager and it worked as expected.
It was actually funny. So thought about posting it that sometimes how we can miss the absolute basics. This customer is using a virtualized Exadata with multiple VMs. One VM hosts the database meant to be used for dbfs and another VM connects to this DB over IB to mount dbfs file system using dbfs_client. One day VMs were rebooted and due to some reason the dbfs filesystem didn’t mount on startup. It went on for few days and they couldn’t mount it. One day I got a chance to look at it and the error they were facing was:
File system already present at specified mount point /dbfs_direct
If you are familiar with Unix, this clearly indicates some problem with the directory where it is trying to mount the file system. I checked that and there were some files in /dbfs_direct. I moved those files and it was able to mount. Issue resolved.
Then after closing the session, I was thinking that what could have happened as this dbfs mount point has been in use for long. Then it struck me. It was being used to take some RMAN backups and the path was hard coded in the scripts. When it didn’t mount after the reboot (i don’t know why), someone ran that script and whatever directories it didn’t find, it probably created that complete directory structure and tried to write to a log file. Once /dbfs_direct had those files, it anyway was not going to mount.
To put it in bit of an Indian context, database is not your daughter-in-law that you can blame it for every performance issue that occurs in the environment. But it does happen. Most of the time it is the database that is blamed for all such issues. Many times, the issues are in some other layer like OS, network or storage.
Faced this issue recently at one of the customer sites where performance in one of the databases went down suddenly. It was a 2 node RAC on 188.8.131.52 running on Linux 7 using some kind of Hitachi SSD storage array. There were no changes as per DBA, application, OS and storage teams. But something must have changed somewhere. Otherwise why would performance degrade just like that. I & my colleague checked some details and found that something happened in the morning a day before. Starting from that point in time, the execution time for all the commonly run queries shot up. Generally speaking, when all the queries are doing bad and you are sure that nothing has been changed on the database side, the reasons could be outside the database. But being a DBA, it is not easy to prove that. We took AWRs from good and bad times and the wait events section looked like this:
Now there is something clearly and terribly wrong with the details in the second snippet and in the first look it appears to be an IO issue. Av Rd(ms) in the File IO Stats section of the AWR reports was also showing really bad numbers for most of the data files, which have been fine two days ago.
The conference calls continued and we were not reaching anywhere. Storage team as usual said that everything was fine and there were no issues. Finally the discussion moved to multipathing and the teams started checking in that direction. There were errors like this in /var/log/messages
multipathd: asm!.asm_ctl_vbg1: failed to get path uid
multipathd: asm!.asm_ctl_vbg6: failed to get path uid
multipathd: asm!.asm_ctl_vbg9: failed to get path uid
That meant there was a problem with one of the paths from the database nodes to storage. They disabled the bad path for both the DB nodes and voila ! IO performance was back on track. It was multipathing that needed to be fixed.
So it is always not the database. It is unfair to always blame the DBA !