Database performance degradation due to multipath issues

To put it in a bit of an Indian context, the database is not your daughter-in-law that you can blame for every performance issue that occurs in the environment. But it does happen: most of the time it is the database that gets blamed for all such issues. Many times, the issue is in some other layer, such as the OS, network or storage.

I faced this issue recently at a customer site, where performance of one of the databases went down suddenly. It was a 2-node RAC on 12.1.0.2 running on Linux 7, using some kind of Hitachi SSD storage array. According to the DBA, application, OS and storage teams, nothing had changed. But something must have changed somewhere; otherwise, why would performance degrade just like that? My colleague and I checked some details and found that something had happened in the morning a day earlier. From that point on, the execution time for all the commonly run queries shot up. Generally speaking, when all the queries are doing badly and you are sure that nothing has changed on the database side, the reason could be outside the database. But as a DBA, it is not easy to prove that. We took AWR reports from good and bad periods, and the wait events section looked like this:

Now there is something clearly and terribly wrong with the numbers in the second snippet, and at first look it appears to be an IO issue. Av Rd(ms) in the File IO Stats section of the AWR reports was also showing really bad numbers for most of the data files, which had been fine two days earlier.
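The comparison of Av Rd(ms) between the good and bad periods can be sketched in a few lines of Python. This is only an illustration: the function name `flag_slow_files`, the `ratio` threshold and the idea of keying the values by file name are assumptions of mine, and the per-file numbers would have to be copied out of the two AWR reports by hand or by a script of your own.

```python
def flag_slow_files(good_ms, bad_ms, ratio=5.0):
    """Return (file, good_ms, bad_ms) for files whose average read time grew by >= ratio.

    good_ms and bad_ms are dicts mapping a data file name to its Av Rd(ms) value
    taken from the File IO Stats section of the good and bad AWR reports.
    The default ratio of 5.0 is an arbitrary illustrative threshold.
    """
    flagged = []
    for name, bad in bad_ms.items():
        good = good_ms.get(name)
        if good is None or good <= 0:
            continue  # no usable baseline for this file, so skip it
        if bad / good >= ratio:
            flagged.append((name, good, bad))
    # Worst regression first
    return sorted(flagged, key=lambda row: row[2] / row[1], reverse=True)
```

If most of the data files come back flagged, as they did here, that points towards something shared underneath them rather than a single hot file or a single bad query.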

The conference calls continued and we were not getting anywhere. The storage team, as usual, said that everything was fine and there were no issues. Finally the discussion moved to multipathing and the teams started checking in that direction. There were errors like this in /var/log/messages:

multipathd: asm!.asm_ctl_vbg1: failed to get path uid
multipathd: asm!.asm_ctl_vbg6: failed to get path uid
multipathd: asm!.asm_ctl_vbg9: failed to get path uid

That meant there was a problem with one of the paths from the database nodes to the storage. They disabled the bad path on both the DB nodes and voila! IO performance was back on track. It was multipathing that needed to be fixed.
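A quick way to look for this kind of error is to scan the log for the multipathd message and count it per device. Below is a minimal Python sketch of that idea. The regular expression is based only on the three lines quoted above, plus an optional `[pid]` after `multipathd` that I am assuming syslog may add; the function names `count_path_uid_errors` and `scan_file` are made up for this example.

```python
import re
from collections import Counter

# Pattern modelled on the log lines quoted in this post. Other multipathd
# versions may word the message differently (assumption).
PATH_UID_ERROR = re.compile(
    r"multipathd(?:\[\d+\])?:\s+(?P<device>\S+): failed to get path uid"
)

def count_path_uid_errors(lines):
    """Return a Counter mapping device name -> number of matching error lines."""
    counts = Counter()
    for line in lines:
        match = PATH_UID_ERROR.search(line)
        if match:
            counts[match.group("device")] += 1
    return counts

def scan_file(path="/var/log/messages"):
    """Count the errors in a log file. Reading this file usually needs root."""
    with open(path, errors="replace") as log:
        return count_path_uid_errors(log)
```

Calling `scan_file()` on each DB node and printing `most_common()` on the result shows which devices are affected. It only tells you that multipathd is complaining, not which physical path is bad, so the OS and storage teams still have to take it from there.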

So it is not always the database. It is unfair to always blame the DBA!

Comments

Comment by Ravinder on 2021-03-22 17:51:55 +0530

Thanks for sharing this information !

Do we have a way to find out that this is not a database issue, and that the issue is with network or storage?

Comment by Sidhu on 2021-03-22 19:19:40 +0530

In this case it was kinda straightforward, but that is not always the case. System-level performance issues can be very complex to diagnose. An AWR report and an ASH report are good starting points. You can also use Tanel Poder’s scripts like snapper and ashtop/dashtop and then move on from there. He has made multiple videos and blog posts on the use of these tools:

https://tanelpoder.com/videos/

Comment by supriyo77 on 2021-03-22 20:18:10 +0530

I had an issue with a DB where the server's RAID battery had a problem. As a result, I/O performance degraded and multiple wait events popped up. The issue was identified using the iotop command.

Comment by Sidhu on 2021-03-23 10:39:56 +0530

Cool !

Comment by sainats@gmail.com on 2021-03-26 12:31:28 +0530

Excellent Amar. Cool and simplified post.

Comment by Sidhu on 2021-03-28 10:23:40 +0530

Thanks Raj !
