How did we solve our “log file sync” ?
Posted by Navdeep Saini on April 24, 2008
It was one of those intriguing days when for almost 2-3 days users/developers/managers has been bugging me for “possible” slowness in our EPR instance. Ofcourse I checked everything: no long running queries eating up resources, no sudden spurt of sessions, nothing in AWR report to show off, nothing to show that there is slowness in the database !!!
I do not normaly loose sleep over performance issues, but it was irritating when apps support guys woke me up in middle of night for two consecutive nights. One batch job (conc. request) very critical to business, was running 5 hours behind schedule. Obvious place to look for was dbconsole “top activity” and tracing the conc. request. The trace file showed that request was just running crazy (tail -f of trace) and dbconsole did not show any particular activity which might be causing the slowness in the database. I tkprofed (sys=no, sort=exeela,prsela,fchela) the trace and found nothing wrong with the sqls that were reported. No wait events like “db file sequential/scattered read”, apart from few occurances of “log file sync” event. Looking more carefuly in the “top sessions” in dbconsole it occured to me that indeed “log file sync” was reported occasionally but only for few seconds.
Observing more keenly I also found that trace of the sessions (tail -f of trace) was freezing occasionaly for few second and showing “log file sync” after it freezes. So it seems “log file sync” is my area of concern, albeit it was very miniscule in the overall database scenario.
As per metalink standard Note:34592.1 on “log file sync”:
The log file sync wait may be broken down into the following components:
1. Wakeup LGWR if idle
2. LGWR gathers the redo to be written and issue the I/O
3. Time for the log write I/O to complete
4. LGWR I/O post processing
5. LGWR posting the foreground/user session that the write has completed
6. Foreground/user session wakeup
Tuning advice based on log file sync component breakdown above:
Steps 2 and 3 are accumulated in the “redo write time” statistic. (i.e. as found under STATISICS section of Statspack and AWR)
Step 3 is the “log file parallel write” wait event. (Note 34583.1:”log file parallel write” Reference Note )
Steps 5 and 6 may become very significant as the system load increases. This is because even after the foreground has been posted it may take a some time for the OS to schedule it to run. May require monitoring from O/S level.
As mentioned earlier, my AWR report was not showing “redo write time” and “log file parallel write” to be that significant. As such Load on the machine was “normal” and there was nothing that could point me to a direction in which I could proceed.
The situation was becoming worse and batch job finally finished at 7:30 AM in the morning, 5:30 hours behind schedule. Blood hounds (business/apps support) were out after me and they were smelling blood. But I was not letting it go and turned to my internals knowledge. During my oracle days one of the thing I learned was that we can 10046 (sql trace) trace background sessions also. They will only show wait events in trace files as they normally do not fire any sqls. LGWR was my guy this time and I turned on the trace. VOILA !!! it was showing “log file parallel write” !!
However my hopes were dashed immiediately as it was normal for LGWR to post this event as it writes (IO) to redo logs. Again I was in dark and could not fathom what could be the reason for “log file sync”. One wild idea that was floating in my mind was that it could be that there are suddenly more commits in the sytem. Apps support guys purged almot 800 millions rows in workflow tables, as they fixed some issue with workflow becuase of which they were not able to purge WF tables for almost a year !!!. Could it be that there are now 1/2 the number of rows in WF tables which might have “oiled” the system and now it was “churing” out more !!! There is suddenly more commits in the database and causing the LGWR to lag behind…..
Keen observation is one common concept that many try to overlook and I was the one who was the victim of it that day. Looking at LGWR trace files again, I saw that elapsed time for “log file parallel write” was sometimes spiking. Simple unix sort on it showed that “elapsed time” was sometimes in secs where-as, as per oracle for 20k of redo it should not take more than 15 milliseconds. It was “log file parallel write” , a simple sutpid IO request on redo log files causing this delay. And since LGWR has to write to redo at every commit it was stalling the whole system !!!
Here is portion of lgwr trace showing “log file parallel write” spiking to sometime to 39 secs !!
WAIT #0: nam=’log file parallel write’ ela= 22324424 files=2 blocks=2 requests=2 obj#=-1 tim=13758521131245
WAIT #0: nam=’rdbms ipc message’ ela= 6595 timeout=294 p2=0 p3=0 obj#=-1 tim=13758521138897
WAIT #0: nam=’log file parallel write’ ela= 2343353 files=2 blocks=64 requests=4 obj#=-1 tim=13758521143128
WAIT #0: nam=’rdbms ipc message’ ela= 97448 timeout=290 p2=0 p3=0 obj#=-1 tim=13758521262712
WAIT #0: nam=’log file parallel write’ ela= 1294 files=2 blocks=80 requests=2 obj#=-1 tim=13758521264788
WAIT #0: nam=’rdbms ipc message’ ela= 9538 timeout=280 p2=0 p3=0 obj#=-1 tim=13758521275843
WAIT #0: nam=’log file parallel write’ ela= 909835 files=2 blocks=8 requests=2 obj#=-1 tim=13758521278533
Mount points for redo logs was targeted and I contacted my SAs for DMX system. They immiediately found that out of four fiber adaptors on database machine , which connects to storage array via SAN switch, one was performing little slower. They asked for some more time to investigate. I waited for almost 1 excrutiating hour and finally heard back from them. It was indeed the IO sub-system. One of the port on SAN switch was “dying” causing frequent IO queues. They swapped the port with another one and it fixed it !!!. So it was just one simple bad port on SAN switch which was giving frequent “heart burns” to the database.
Leason learned…on a very very busy system, esp. an OLTP like Oracle APPS, it is absolutely critical that write throughput on redo logs is maintained at every time. No RAID-5, just simple RAID-1, on faster fiber channels, san switches etc..
These screen shots shows pictographically the situation: