To put it in bit of an Indian context, database is not your daughter-in-law that you can blame it for every performance issue that occurs in the environment. But it does happen. Most of the time it is the database that is blamed for all such issues. Many times, the issues are in some other layer like OS, network or storage.
Faced this issue recently at one of the customer sites where performance in one of the databases went down suddenly. It was a 2 node RAC on 18.104.22.168 running on Linux 7 using some kind of Hitachi SSD storage array. There were no changes as per DBA, application, OS and storage teams. But something must have changed somewhere. Otherwise why would performance degrade just like that. I & my colleague checked some details and found that something happened in the morning a day before. Starting from that point in time, the execution time for all the commonly run queries shot up. Generally speaking, when all the queries are doing bad and you are sure that nothing has been changed on the database side, the reasons could be outside the database. But being a DBA, it is not easy to prove that. We took AWRs from good and bad times and the wait events section looked like this:
Now there is something clearly and terribly wrong with the details in the second snippet and in the first look it appears to be an IO issue. Av Rd(ms) in the File IO Stats section of the AWR reports was also showing really bad numbers for most of the data files, which have been fine two days ago.
The conference calls continued and we were not reaching anywhere. Storage team as usual said that everything was fine and there were no issues. Finally the discussion moved to multipathing and the teams started checking in that direction. There were errors like this in /var/log/messages
multipathd: asm!.asm_ctl_vbg1: failed to get path uid
multipathd: asm!.asm_ctl_vbg6: failed to get path uid
multipathd: asm!.asm_ctl_vbg9: failed to get path uid
That meant there was a problem with one of the paths from the database nodes to storage. They disabled the bad path for both the DB nodes and voila ! IO performance was back on track. It was multipathing that needed to be fixed.
So it is always not the database. It is unfair to always blame the DBA !
To be honest, Fernando Simon has already documented all the steps needed in ZDLRA patching . So this post is more like a reference post for me and it points to the links on his blog. One thing he could change though are the post titles. He also agrees 😉
ZDLRA patching is broadly divided into two parts. First part is where you patch the RA library and Grid & DB homes. Second part includes compute node & storage cell image patch and patches for IB/RoCE switches. Second part is exactly similar to Exadata except that it is bit restricted in terms of image versions that you can use. Only the versions that are certified for ZDLRA can be used. Also the RA library version and the Exadata image version should be compatible with each other. So if you are planning to patch only one part; RA library or the image, make sure that both the components stay compatible. The MOS note that has all these details is 1927416.1. This note should be the first place to go when you are planning to patch a ZDLRA. The steps for upgrade/patch, image patching are given in MOS note 2028931.1. There is another note 2639262.1 that discusses some of the known issues that you may face while doing the patching. It is important to review all three notes before you plan to patch.
The RA library patching part can be considered of two different types. This is an important difference. Make sure that you follow the right set of commands. When you are jumping between major versions say going from 12.x to 19.x, it is called an upgrade and the commands are like racli upgrade appliance –step=1. Fernando talks about this in detail in this post.
On the other hand, when you are not jumping between versions; say going from 19.x to 19.x only, it is called patching and the commands are like racli patch appliance –step=1. Fernando has discussed this in detail in this post.
The Exadata bit (image & switches patching) of it is exactly the same as we do in Exadata. Fernando talks about this in this post.
The RA library patching bit is pretty much automated and works fine most of the time. If you hit an issue, you may find the solution/workaround documented in one of the MOS notes.
Faced this while running installer for setting up a 2 node RAC setup (version 19.8) on an Oracle SuperCluster. The error reported in the log is:
[FATAL] [INS-44000] Passwordless SSH connectivity is not setup from the local node node1 to the following nodes:
[INS-06006] Passwordless SSH connectivity not set up between the following node(s): [node2]
From the error it appears that the ssh is not setup between two nodes but actually that is not the case. Here the error message is bit misleading. It turned out to be an issue with scp with openssh version 8.x. Running the setup with -debug option gives the clue:
<protocol error: filename does not match request>
The reason is a new check introduced in openssh version 8.x. It is explained here, here and here. MOS note 2555697.1 also talks about it.
Workaround is to pass the -T option to scp to ignore the new checks. You can rename the scp binary to something like scp.original and create a new shell script there like this:
mv scp scp.original
/usr/bin/scp.original -T $*
chmod 555 scp
This time, the install should succeed. You can revert the changes back once the install is done.