Practical Apps DBA

Oracle Apps DBA’s Blog

How did we solve our “log file sync” ?

Posted by Navdeep Saini on April 24, 2008

It was one of those intriguing days when for almost 2-3 days users/developers/managers has been bugging me for “possible” slowness in our EPR instance. Ofcourse I checked everything: no long running queries eating up resources, no sudden spurt of sessions, nothing in AWR report to show off, nothing to show that there is slowness in the database !!!

I do not normaly loose sleep over performance issues, but it was irritating when apps support guys woke me up in middle of night for two consecutive nights. One batch job (conc. request) very critical to business, was running 5 hours behind schedule. Obvious place to look for was dbconsole “top activity” and tracing the conc. request. The trace file showed that request was just running crazy (tail -f of trace) and dbconsole did not show any particular activity which might be causing the slowness in the database. I tkprofed (sys=no, sort=exeela,prsela,fchela) the trace and found nothing wrong with the sqls that were reported. No wait events like “db file sequential/scattered read”, apart from few occurances of “log file sync” event. Looking more carefuly in the “top sessions” in dbconsole it occured to me that indeed “log file sync” was reported occasionally but only for few seconds.

Observing more keenly I also found that trace of the sessions (tail -f of trace) was freezing occasionaly for few second and showing “log file sync” after it freezes. So it seems “log file sync” is my area of concern, albeit it was very miniscule in the overall database scenario.

As per metalink standard Note:34592.1 on “log file sync”:

The log file sync wait may be broken down into the following components:

    1. Wakeup LGWR if idle
    2. LGWR gathers the redo to be written and issue the I/O
    3. Time for the log write I/O to complete
    4. LGWR I/O post processing
    5. LGWR posting the foreground/user session that the write has completed
    6. Foreground/user session wakeup

Tuning advice based on log file sync component breakdown above:

Steps 2 and 3 are accumulated in the “redo write time” statistic. (i.e. as found under STATISICS section of Statspack and AWR)
Step 3 is the “log file parallel write” wait event. (Note 34583.1:”log file parallel write” Reference Note )
Steps 5 and 6 may become very significant as the system load increases. This is because even after the foreground has been posted it may take a some time for the OS to schedule it to run. May require monitoring from O/S level.

As mentioned earlier, my AWR report was not showing “redo write time” and “log file parallel write” to be that significant. As such Load on the machine was “normal” and there was nothing that could point me to a direction in which I could proceed.

The situation was becoming worse and batch job finally finished at 7:30 AM in the morning, 5:30 hours behind schedule. Blood hounds (business/apps support) were out after me :-) and they were smelling blood. But I was not letting it go and turned to my internals knowledge. During my oracle days one of the thing I learned was that we can 10046 (sql trace) trace background sessions also. They will only show wait events in trace files as they normally do not fire any sqls. LGWR was my guy this time and I turned on the trace. VOILA !!! it was showing “log file parallel write” !!

However my hopes were dashed immiediately as it was normal for LGWR to post this event as it writes (IO) to redo logs. Again I was in dark and could not fathom what could be the reason for “log file sync”. One wild idea that was floating in my mind was that it could be that there are suddenly more commits in the sytem. Apps support guys purged almot 800 millions rows in workflow tables, as they fixed some issue with workflow becuase of which they were not able to purge WF tables for almost a year !!!. Could it be that there are now 1/2 the number of rows in WF tables which might have “oiled” the system and now it was “churing” out more !!! There is suddenly more commits in the database and causing the LGWR to lag behind…..

Keen observation is one common concept that many try to overlook and I was the one who was the victim of it that day. Looking at LGWR trace files again, I saw that elapsed time for “log file parallel write” was sometimes spiking. Simple unix sort on it showed that “elapsed time” was sometimes in secs where-as, as per oracle for 20k of redo it should not take more than 15 milliseconds. It was “log file parallel write” , a simple sutpid IO request on redo log files causing this delay. And since LGWR has to write to redo at every commit it was stalling the whole system !!!

Here is portion of lgwr trace showing “log file parallel write” spiking to sometime to 39 secs !!

WAIT #0: nam=’log file parallel write’ ela= 22324424 files=2 blocks=2 requests=2 obj#=-1 tim=13758521131245
WAIT #0: nam=’rdbms ipc message’ ela= 6595 timeout=294 p2=0 p3=0 obj#=-1 tim=13758521138897
WAIT #0: nam=’log file parallel write’ ela= 2343353 files=2 blocks=64 requests=4 obj#=-1 tim=13758521143128
WAIT #0: nam=’rdbms ipc message’ ela= 97448 timeout=290 p2=0 p3=0 obj#=-1 tim=13758521262712
WAIT #0: nam=’log file parallel write’ ela= 1294 files=2 blocks=80 requests=2 obj#=-1 tim=13758521264788
WAIT #0: nam=’rdbms ipc message’ ela= 9538 timeout=280 p2=0 p3=0 obj#=-1 tim=13758521275843
WAIT #0: nam=’log file parallel write’ ela= 909835 files=2 blocks=8 requests=2 obj#=-1 tim=13758521278533

Mount points for redo logs was targeted and I contacted my SAs for DMX system. They immiediately found that out of four fiber adaptors on database machine , which connects to storage array via SAN switch, one was performing little slower. They asked for some more time to investigate. I waited for almost 1 excrutiating hour and finally heard back from them. It was indeed the IO sub-system. One of the port on SAN switch was “dying” causing frequent IO queues. They swapped the port with another one and it fixed it !!!. So it was just one simple bad port on SAN switch which was giving frequent “heart burns” to the database.

Leason learned…on a very very busy system, esp. an OLTP like Oracle APPS, it is absolutely critical that write throughput on redo logs is maintained at every time. No RAID-5, just simple RAID-1, on faster fiber channels, san switches etc..

These screen shots shows pictographically the situation:

This was before SAs fixed IO sub-system. The orange colors is for “commits”. Observe the frequent spikes of Orange color.

This was after SAs fixed IO sub-system. There is no orange color.

About these ads

25 Responses to “How did we solve our “log file sync” ?”

  1. Atul said

    Navdeep,
    Wonderful post, Hope to see more on this.

    Atul
    http://onlineappsdba.com

  2. Atul
    Thanks,
    Sure I will write on such issues.

    rgds
    Nav

  3. Kalpit said

    Good One Navdeep!!

  4. dhanu said

    hai

    this is dhanu i wnat to learn oracle apps but i need some suggestions and steps how can i learn and from where i have to start as presently iam working in retail software company i have knowledge of retail industry and process there in my company apps are by VB and SQL i need to be ERP professional so pls mail me and give ur contact no i have small queries regarding ,so pls pls send me a mail as i may see this site once again because in my office onle mails are open

  5. dhanu said

    my mail id : is murari_dhanu@yahoo.com

  6. Dhanu
    The following sections in my blog will be helpful to you:

    ‘1 Beginners Guide’ : http://practicalappsdba.wordpress.com/category/1-beginners-guide/
    ‘1 Start Here’ : http://practicalappsdba.wordpress.com/category/1-beginners-guide/1-start-here/
    ‘2 Main Topics’ : http://practicalappsdba.wordpress.com/category/1-beginners-guide/2-main-topics/

    Let me know if you need more information.
    regards
    Nav

  7. Don Courtney said

    Navdeep

    What a great post – congrads. and thanks
    Don C.

  8. Don
    Thanks !
    I am glad that you found my post helpful.

    regards
    Nav

  9. John Hallas said

    Very good post Navdeep.

  10. Puneet Sachar said

    Nav

    Appriciate the effort. Please advice if you have any online blog or Community.

    Thanks
    Puneet !

  11. Abhishek Saxena said

    Good one Navdeep!
    I liked the post and they way you explained it.

    Regards
    Abhishek

  12. Shameer PIT said

    Hello Sir,

    Am a beginner in Apps DBA enivronment. i started as a oracle developer. as you said need is a mother. i became oracle apps DBA. i am some urgent issue.. i need some help to know how to delete all log files and which log files will be deleted and is possible to have a subroutine on 2 days once run and purged all log files.. please sir help me my email id is p.shameer@gmail.com thanks alot.. i like your blog

  13. Anand Sathyamurthy said

    Hi Navdeep,
    Fantastic update.I liked very much the way of your writing.

    It is nice posting and full details the problem, dba real time facing and solution to the problem.

    thanks
    Anand

  14. Dave Abercrombie said

    Wouldn’t this problem have been apparent in v$event_histogram, especially perhaps by taking deltas from periodic observations? If so, this might have been simpler than tracing. Also, Statspack reports include these v$event_histogram deltas, but AWR reports do not (one of the few advantages of Statspack).

  15. Shailendra said

    Excellent post. It helped me to resolve one of the problem I was facing. Thanks Nav

  16. aditya said

    Really informative and interesting post. thanks for sharing.

  17. A said

    It was a good post.
    Keep up the good work.

  18. Peter Wiseman said

    Great post.

  19. ST said

    Good job I know Saini’s are hard workers and smart, even my boss is one -:) and I like working for/with them. Grat work and kepe it up.

  20. internet income university scam3…

    […]How did we solve our “log file sync” ? « Practical Apps DBA[…]…

  21. implantacion de sistemas…

    […]How did we solve our “log file sync” ? « Practical Apps DBA[…]…

  22. Ram said

    Hi,

    My self Ram. I want to become a performance tunning expert. Can you please help me how to analyze and root cause the issues.

  23. Umesh said

    Very nice information. Here are few more steps to solve log file sync.

    http://www.dbas-oracle.com/2013/07/Solve-log-file-sync-wait-event-and-log-file-parallel-wait-event.html

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

 
Follow

Get every new post delivered to your Inbox.

Join 60 other followers

%d bloggers like this: