Can’t su to oracle user

Last week, a DBA reported that he wasn't able to su to the oracle user from root on an Oracle Base Database VM in OCI. Logging in as the opc user worked fine and he could sudo su to root, but he couldn't su to oracle. When he did, it just came back to the root shell.

[root@xxx ~]# su - oracle
Last login: Fri Jan 12 10:20:38 UTC 2023
[root@xxx ~]#

There was nothing relevant in /var/log/messages or /var/log/secure. I tried it for another user and it worked fine. Then I suspected something in the oracle user's profile, and voila! The .bashrc looked like this:

[root@xxx oracle]# pwd
/home/oracle
[root@xxx oracle]# more .bashrc
exit
[root@xxx oracle]#

So the moment you logged in as the oracle user, the exit command in .bashrc ended the shell and dropped you straight back to root. Problem solved. I don't know who did it, but it looks like someone's mischief.
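
For the record, the cleanup is as simple as removing that line; a minimal sketch, assuming the literal exit is the only unwanted entry (the .bak backup suffix is my own addition):

[root@xxx oracle]# sed -i.bak '/^exit$/d' /home/oracle/.bashrc    # drop the exit line, keep a .bashrc.bak copy
[root@xxx oracle]# su - oracle
[oracle@xxx ~]$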

Adding a new cell to Exadata with ASM scoped security enabled

A customer is using an Exadata X8M-2 machine with multiple VMs (and hence multiple clusters). I was working on adding a new storage cell to the configuration. After creating griddisks on the new cell and updating cellip.ora on all the VMs, I noticed that none of the clusters was able to see the new griddisks. I checked the usual suspects: whether asm_diskstring was set properly and whether the private network subnet mask on the new cell was the same as on the existing ones. All looked good. I started searching about the issue and stumbled upon some references mentioning ASM scoped security. I checked on one of the existing cells and that was indeed the issue: the existing cells had it enabled while the new one didn't. Running this command on an existing cell showed the keys:

cellcli -e list key detail

name:
key:		c25a62472a160e28bf15a29c162f1d74
type:		CELL

name:		cluster1
key:		fa292e11b31b210c4b7a24c5f1bb4d32
type:		ASMCLUSTER

name:		cluster2
key:		b67d5587fe728118af47c57ab8da650a	
type:		ASMCLUSTER 

We need to enable ASM scoped security on the new cell as well. Three things need to be done: copy /etc/oracle/cell/network-config/cellkey.ora from an existing cell to the new cell, assign the key to the cell, and then assign the keys for the different ASM clusters. We can use these commands to do it:

cellcli -e  ASSIGN KEY FOR CELL 'c25a62472a160e28bf15a29c162f1d74'
cellcli -e  ASSIGN KEY FOR ASMCLUSTER 'cluster1'='fa292e11b31b210c4b7a24c5f1bb4d32';
cellcli -e  ASSIGN KEY FOR ASMCLUSTER 'cluster2'='b67d5587fe728118af47c57ab8da650a';
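
We can then run the same list command on the new cell and compare the keys with those shown on an existing cell:

cellcli -e list key detail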

Once this is done, we need to tag the griddisks for the appropriate ASM clusters. If the griddisks aren't created yet, we can use this command to do it:

cellcli -e CREATE GRIDDISK ALL HARDDISK PREFIX=sales, size=75G, availableTo='cluster1'

If the griddisks are already created, we can use the alter command to make this change:

cellcli -e alter griddisk griddisk0,griddisk1,.....griddisk11 availableTo='cluster1';
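
To confirm the change, we can list the availableTo attribute of the griddisks on the new cell:

cellcli -e list griddisk attributes name, availableTo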

Once this is done, we should be able to see the new griddisks as CANDIDATE in v$asm_disk.
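
For example, from one of the ASM instances (a sketch; run as sysasm with the Grid home environment set):

[grid@vm1 ~]$ sqlplus / as sysasm
SQL> select path, header_status from v$asm_disk where header_status = 'CANDIDATE';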

Smokescreen detects traffic from an Exadata VM

A customer running an Exadata X8M-2 with multiple VMs had recently deployed Smokescreen in their company, and they reported that one of the Smokescreen decoy servers in their DC was seeing traffic from one of the Exadata VMs on a certain port. That was rather confusing: the port was the database listener port on that VM, and why would a VM running Oracle RAC try to reach some random IP on the listener port? Also, it was happening only for this one VM and for none of the many others.

We were just looking around when my colleague said he had seen this IP somewhere and started going through his emails. Within a minute we had found the issue: the decoy server's IP was mentioned in one of the emails as the VIP of the very VM the traffic was reported to be coming from. While reserving IPs for the Smokescreen decoy servers, someone had made a mess and re-used an IP that was already in use as one of the VIPs of this RAC system!
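
If you want to confirm a duplicate IP like this from the cluster side, something along these lines works (a sketch; the node name, interface and VIP address are placeholders for whatever the environment actually uses):

[grid@racvm1 ~]$ srvctl config vip -node racvm1       # the VIP address Clusterware has registered for this node
[root@anyhost ~]# arping -I eth0 -c 3 <vip_ip>        # replies from more than one MAC mean the IP is in use elsewhere too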

Unable to connect to the database with SQLPlus

I was working on configuring a new database for backups to ZDLRA and hit this issue while testing a controlfile backup via Enterprise Manager -> Schedule backup (though it could happen in any environment):

Unable to connect to the database with SQLPlus, either because the database is down or due to an environment issue such as incorrectly specified...
If the database is up, check the database target monitoring properties and verify that the Oracle Home value is correct.

The second line clearly states the problem, but since the Cluster Database status in EM was green, it took me a while to figure it out. The issue turned out to be a missing / at the end of the ORACLE_HOME specified in the monitoring configuration of the cluster database. The DB home was specified as /u01/app/oracle/product/11.2.0.4/dbhome_1 instead of /u01/app/oracle/product/11.2.0.4/dbhome_1/.

On the server, .bash_profile had the home set to /u01/app/oracle/product/11.2.0.4/dbhome_1. When I tried to connect as sysdba with that environment, it failed with TNS: lost contact. Then I set the environment with oraenv and was able to connect; /etc/oratab had the home specified as /u01/app/oracle/product/11.2.0.4/dbhome_1/. Comparing the value of ORACLE_HOME in these two cases identified the issue. I then updated the ORACLE_HOME value in the target's monitoring configuration in Enterprise Manager and it worked as expected.
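
Roughly, the comparison on the server looked like this (a sketch; the hostname and the SID typed at the oraenv prompt are placeholders):

[oracle@host ~]$ echo $ORACLE_HOME          # as set by .bash_profile
/u01/app/oracle/product/11.2.0.4/dbhome_1
[oracle@host ~]$ sqlplus / as sysdba        # failed with TNS: lost contact (ORA-12547)
[oracle@host ~]$ . oraenv                   # picks up the home from /etc/oratab
ORACLE_SID = [oracle] ? orcl
[oracle@host ~]$ echo $ORACLE_HOME
/u01/app/oracle/product/11.2.0.4/dbhome_1/
[oracle@host ~]$ sqlplus / as sysdba        # connects fine now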

File system already present at specified mount point /dbfs_direct

This one was actually funny, so I thought about posting it to show how we can sometimes miss the absolute basics. This customer is using a virtualized Exadata with multiple VMs. One VM hosts the database meant to be used for DBFS, and another VM connects to this DB over InfiniBand to mount the DBFS file system using dbfs_client. One day the VMs were rebooted and, for some reason, the DBFS filesystem didn't mount on startup. It went on for a few days and they couldn't mount it. When I got a chance to look at it, the error they were facing was:

File system already present at specified mount point /dbfs_direct

If you are familiar with Unix, this clearly indicates a problem with the directory where it is trying to mount the file system. I checked, and there were some files in /dbfs_direct. I moved those files out and it was able to mount. Issue resolved.
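
The check and remount were essentially this (a sketch; the leftover directory, DBFS user, connect string, password file and mount option are placeholders for whatever the environment actually uses):

[oracle@vm2 ~]$ ls -A /dbfs_direct                                    # listed the stray files; dbfs_client needs an empty mount point
[oracle@vm2 ~]$ mkdir -p /tmp/dbfs_leftover && mv /dbfs_direct/* /tmp/dbfs_leftover/
[oracle@vm2 ~]$ nohup dbfs_client dbfs_user@dbfsdb -o direct_io /dbfs_direct < passwd.txt &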

After closing the session, I was wondering what could have happened, since this DBFS mount point had been in use for a long time. Then it struck me: it was being used for some RMAN backups, and the path was hard-coded in the scripts. When it didn't mount after the reboot (I don't know why), someone ran that script, and for whatever directories it didn't find, it probably created the complete directory structure and tried to write to a log file. Once /dbfs_direct had those files in it, the mount was never going to succeed.