Practical Apps DBA

Oracle Apps DBA’s Blog

Archive for April, 2008

EBS R12 on IBM P-Series Running RedHat Linux

Posted by Navdeep Saini on April 29, 2008

Few days ago, one of our customer had this interesting question which caught us little offguard. Question was, that whether EBS R12 is certified for RedHat Linux running on IBM P-series (IBM power based system) or not?

Checking metalink it looks that, it should just be fine. If you go to metalink you will see the following:

Which suggests that oracle database on Linux running on IBM Power-pSeries, runs just fine. However a closer look at metalink note:341507.1, you will find that no-where it talks of EBS R11i or R12 or related products. Another look at metalink certify:

Important thing to note here is that for Apps Tier on Linux, only x86 and x86-64 platforms are supported which means that only processor chipsets like x-86 (Intel based), AMD64,EMT64T are supported. Anyother processor chip (e.g. IBM P series which uses IBM POWER5+ or POWER5 processor) is NOT supported for APPS TIER, even though they can run Linux on it (as per RedHat certification: RedHat Enterprise Linux: Server Version comparison chart ).

Jist of the discussion is that on IBM Power with Linux, (pSeries, iSeries, System p5 and System i5) oracle EBS can run only in split configuration. Which means database tier can run on IBM Power with RedHat Linux and Apps Tier has to have AIX as OS software. However If you want to run Apps Tier on Linux, it has to be on supported chipsets only e.g. x-86, x-86-64 etc.

Posted in Advanced configurations | 2 Comments »

How did we solve our “log file sync” ?

Posted by Navdeep Saini on April 24, 2008

It was one of those intriguing days when for almost 2-3 days users/developers/managers has been bugging me for “possible” slowness in our EPR instance. Ofcourse I checked everything: no long running queries eating up resources, no sudden spurt of sessions, nothing in AWR report to show off, nothing to show that there is slowness in the database !!!

I do not normaly loose sleep over performance issues, but it was irritating when apps support guys woke me up in middle of night for two consecutive nights. One batch job (conc. request) very critical to business, was running 5 hours behind schedule. Obvious place to look for was dbconsole “top activity” and tracing the conc. request. The trace file showed that request was just running crazy (tail -f of trace) and dbconsole did not show any particular activity which might be causing the slowness in the database. I tkprofed (sys=no, sort=exeela,prsela,fchela) the trace and found nothing wrong with the sqls that were reported. No wait events like “db file sequential/scattered read”, apart from few occurances of “log file sync” event. Looking more carefuly in the “top sessions” in dbconsole it occured to me that indeed “log file sync” was reported occasionally but only for few seconds.

Observing more keenly I also found that trace of the sessions (tail -f of trace) was freezing occasionaly for few second and showing “log file sync” after it freezes. So it seems “log file sync” is my area of concern, albeit it was very miniscule in the overall database scenario.

As per metalink standard Note:34592.1 on “log file sync”:

The log file sync wait may be broken down into the following components:

    1. Wakeup LGWR if idle
    2. LGWR gathers the redo to be written and issue the I/O
    3. Time for the log write I/O to complete
    4. LGWR I/O post processing
    5. LGWR posting the foreground/user session that the write has completed
    6. Foreground/user session wakeup

Tuning advice based on log file sync component breakdown above:

Steps 2 and 3 are accumulated in the “redo write time” statistic. (i.e. as found under STATISICS section of Statspack and AWR)
Step 3 is the “log file parallel write” wait event. (Note 34583.1:”log file parallel write” Reference Note )
Steps 5 and 6 may become very significant as the system load increases. This is because even after the foreground has been posted it may take a some time for the OS to schedule it to run. May require monitoring from O/S level.

As mentioned earlier, my AWR report was not showing “redo write time” and “log file parallel write” to be that significant. As such Load on the machine was “normal” and there was nothing that could point me to a direction in which I could proceed.

The situation was becoming worse and batch job finally finished at 7:30 AM in the morning, 5:30 hours behind schedule. Blood hounds (business/apps support) were out after me :-) and they were smelling blood. But I was not letting it go and turned to my internals knowledge. During my oracle days one of the thing I learned was that we can 10046 (sql trace) trace background sessions also. They will only show wait events in trace files as they normally do not fire any sqls. LGWR was my guy this time and I turned on the trace. VOILA !!! it was showing “log file parallel write” !!

However my hopes were dashed immiediately as it was normal for LGWR to post this event as it writes (IO) to redo logs. Again I was in dark and could not fathom what could be the reason for “log file sync”. One wild idea that was floating in my mind was that it could be that there are suddenly more commits in the sytem. Apps support guys purged almot 800 millions rows in workflow tables, as they fixed some issue with workflow becuase of which they were not able to purge WF tables for almost a year !!!. Could it be that there are now 1/2 the number of rows in WF tables which might have “oiled” the system and now it was “churing” out more !!! There is suddenly more commits in the database and causing the LGWR to lag behind…..

Keen observation is one common concept that many try to overlook and I was the one who was the victim of it that day. Looking at LGWR trace files again, I saw that elapsed time for “log file parallel write” was sometimes spiking. Simple unix sort on it showed that “elapsed time” was sometimes in secs where-as, as per oracle for 20k of redo it should not take more than 15 milliseconds. It was “log file parallel write” , a simple sutpid IO request on redo log files causing this delay. And since LGWR has to write to redo at every commit it was stalling the whole system !!!

Here is portion of lgwr trace showing “log file parallel write” spiking to sometime to 39 secs !!

WAIT #0: nam=’log file parallel write’ ela= 22324424 files=2 blocks=2 requests=2 obj#=-1 tim=13758521131245
WAIT #0: nam=’rdbms ipc message’ ela= 6595 timeout=294 p2=0 p3=0 obj#=-1 tim=13758521138897
WAIT #0: nam=’log file parallel write’ ela= 2343353 files=2 blocks=64 requests=4 obj#=-1 tim=13758521143128
WAIT #0: nam=’rdbms ipc message’ ela= 97448 timeout=290 p2=0 p3=0 obj#=-1 tim=13758521262712
WAIT #0: nam=’log file parallel write’ ela= 1294 files=2 blocks=80 requests=2 obj#=-1 tim=13758521264788
WAIT #0: nam=’rdbms ipc message’ ela= 9538 timeout=280 p2=0 p3=0 obj#=-1 tim=13758521275843
WAIT #0: nam=’log file parallel write’ ela= 909835 files=2 blocks=8 requests=2 obj#=-1 tim=13758521278533

Mount points for redo logs was targeted and I contacted my SAs for DMX system. They immiediately found that out of four fiber adaptors on database machine , which connects to storage array via SAN switch, one was performing little slower. They asked for some more time to investigate. I waited for almost 1 excrutiating hour and finally heard back from them. It was indeed the IO sub-system. One of the port on SAN switch was “dying” causing frequent IO queues. They swapped the port with another one and it fixed it !!!. So it was just one simple bad port on SAN switch which was giving frequent “heart burns” to the database.

Leason learned…on a very very busy system, esp. an OLTP like Oracle APPS, it is absolutely critical that write throughput on redo logs is maintained at every time. No RAID-5, just simple RAID-1, on faster fiber channels, san switches etc..

These screen shots shows pictographically the situation:

This was before SAs fixed IO sub-system. The orange colors is for “commits”. Observe the frequent spikes of Orange color.

This was after SAs fixed IO sub-system. There is no orange color.

Posted in Performance Tuning | 13 Comments »

Oracle Apps Migration Path from HP-UX PA-RISC to HP-UX Itanium

Posted by Navdeep Saini on April 7, 2008

Most of the HP shops running Applications are mostly on PA-RISC based systems. HP plans to offer PA-RISC-based servers through 2008 and support the systems through 2013, and continue to push for Itanium based servers. Apps customers are asking for possible migration paths to port their applications from PA-RISC to Itanium basesd systems. This post gives detailed summary of how you can achieve this.

For release 11i Oracle had certified HP Itanium under “split configuration”, wherein your database node can run on HP-Itanium and apps tier runs on other supported platforms. Note that as of 11.5.10.2 Apps Tier is NOT supported for HP-UX Itanium.
Currently Oracle supports following platforms for its middle tier:

    1. HP Tru64
    2. HP-UX PA-RISC
    3. IBM AIX Based Systems
    4. Linux x86
    5. Solaris Operating System SPARC
    6. Windows 2000/Windows Server 2003 (x86)

However as of R12, oracle has certified HP Itanium as certified platform for Apps Tier, which means both your db tier and apps tier can run on HP Itanium. Hence if you are planning to port your applications (both db and apps tier) to HP-UX Itanium, then you will have to upgrade to R12 first.

Here are the summarized steps:

Database Tier: The database migration from HP-UX on PA-RISC to HP-UX Itanium (IPF) and from Itanium (IPF) to PA-RISC by copying the datafiles is possible. Starting with RDBMS 9.2.0.7 and 10.2.0.2, it is supported to copy the database files (control files, redolog/archivelog files, data files) from HP-UX on PA-RISC to HP-UX Itanium (IPF) and from Itanium (IPF) to PA-RISC. While HP-UX on PA-RISC is big endian because of the processor, the Itanium processor adapts to the endianess of the OS and is in this case also big endian. Follow these steps to migrate db-tier.

    1. Patch the target operating system to the required level and make sure it has all the required OS patches and packages (e.g. ar, cc,aCC, make etc.) for more details check:Oracle Applications Release 12.0.4 Installation Manual

    2. Configure the target system as per Oracle installation and configuration guide.

    3. Install the required Oracle RDBMS software (9.2.0.7 or above, 10.2.0.2 or above) on the target platform (Itanium)

    4. Patch the Oracle RDBMS to the required level.

    5. For copying the database (using tar, cpio, dd etc.) from the source to the target system, create the required volume groups, logical volumes, file systems, raw devices etc. on the target system

    6. Shutdown the Oracle database on the source system

    7. Copy all the required Oracle database files, including control files, date files, undo, redo logs, Oracle initialization file etc. to the target system.

    8. If required, regenerate the control file (for example when changing the file locations on the target system).

    9. Setup and configure the Oracle listener etc.

    10. Startup the database on the target system.

Apps Tier: For Apps tier you will need to install R12 on itnanium (follow: Note 402307.1 – Oracle Applications Installation and Upgrade Notes Release 12 (12.0) for HP-UX Itanium) and upgrade to R12 running 4440000.drv on HP-UX Itanium platform.

More to follow on R12 upgrade later in my posts.

Posted in Advanced configurations | Leave a Comment »

FND_STATS vs DBMS_STATS

Posted by Navdeep Saini on April 4, 2008

I have been asked this questions many times. Shoud we use fnd_stats or should we use dbms_stats in 11i (or release 12). Most of the Apps DBAs you will face will say use fnd_stats. In fact Oracle also recommends to use fnd_stats in 11i E-Business Environments. But do you know what are the basic difference or you can say advantages or using FND_STATS over DBMS_STATS.
Here I have tried to compare the two…

Posted in Performance Tuning | 2 Comments »

How to recover and open the database if the archive log required for recovery is missing.

Posted by kalpit on April 1, 2008

Last week, I had one interesting recovery scenario at client side; we had to recover one of our development databases from old backup.
As part of recovery process, our restore went fine and also were able to re-create controlfile. During recovery, it asked for Archive logs. We checked with our Unix team for required archivelogs and found out they don’t have required archive logs.

It was critical for us to recover database because of some project deadline.

Error:

SQL> recover database until cancel using backup controlfile;
ORA-00279: change 9867098396261 generated at 03/21/2008 13:37:44 needed for
thread 1
ORA-00289: suggestion : /arcredo/XSCLFY/log1_648355446_2093.arc
ORA-00280: change 9867098396261 for thread 1 is in sequence #2093

Specify log: {=suggested | filename | AUTO | CANCEL}
cancel
ORA-01547: warning: RECOVER succeeded but OPEN RESETLOGS would get error below
ORA-01195: online backup of file 1 needs more recovery to be consistent
ORA-01110: data file 1: ‘/u100/oradata/XSCLFY/SYSTEM01_SCLFY.dbf’
ORA-01112: media recovery not started

SQL> alter database open resetlogs;
alter database open resetlogs
*
ERROR at line 1:
ORA-01195: online backup of file 1 needs more recovery to be consistent
ORA-01110: data file 1: ‘/u100/oradata/XSCLFY/SYSTEM01_SCLFY.dbf’

After doing some research, I found out one hidden parameter (_ALLOW_RESETLOGS_CORRUPTION=TRUE) will allow us to open database even though it’s not properly recovered.

We forced open the database by setting the _ALLOW_RESETLOGS_CORRUPTION=TRUE. It allows us to open database but instance crashed immediately after open. I checked the alert.log file and found out we have undo tablespace corruption.

Alert log shows below error

Errors in file /u01/XSCLFYDB/admin/XSCLFY/udump/xsclfy_ora_9225.trc:
ORA-00600: internal error code, arguments: [4194], [17], [9], [], [], [], [], []
Tue Mar 25 12:45:55 2008
Errors in file /u01/XSCLFYDB/admin/XSCLFY/bdump/xsclfy_smon_24975.trc:
ORA-00600: internal error code, arguments: [4193], [53085], [50433], [], [], [], [], []
Doing block recovery for file 433 block 13525
Block recovery from logseq 2, block 31 to scn 9867098416340

To resolve undo corruption issue, I changed undo_management to “Manual” in init.ora. Now it allowed us to open database successfully. Once database was up and running, I created new undo tablespace and dropped old corrupted undo tablespace. I changed back the undo_management to “Auto” and undo_tablespace to “NewUndoTablespace”.

It resolved our issue and database was up and running without any issue.

_ALLOW_RESETLOGS_CORRUPTION=TRUE allows database to open without consistency checks. This may result in a corrupted database. The database should be recreated.

As per Oracle Metalink, there is no 100% guarantee that setting _ALLOW_RESETLOGS_CORRUPTION=TRUE will open the database. However, once the database is opened, then we must immediately rebuild the database. Database rebuild means doing the following, namely: (1) perform a full-database export, (2) create a brand new and separate database, and finally (3) import the recent export dump. This option can be tedious and time consuming, but once we successfully open the new database, then we expect minimal or perhaps no data loss at all. Before you try this option, ensure that you have a good and valid backup of the current database.Solution:

1) Set _ALLOW_RESETLOGS_CORRUPTION=TRUE in init.ora file.
2) Startup Mount
3) Recover database
4) Alter database open resetlogs.
5) reset undo_management to “manual” in init.ora file.
6) startup database
7) Create new undo tablespace
changed undo_management to “AUTO” and undo_tablespace to “NewTablespace”
9) Bounce database.

Posted in Backup & Recovery | 12 Comments »