Development:Oracle:oraback sh

oraback.sh v1.49 serious race condition...

The race condition can result in an inconsistent backup of datafiles under heavy load.

What's Affected

The free software in question is the following script:

RCSfile: oraback.sh
Revision: 1.49
Date: 2000/12/14 09:29:02

@(#)oraback.sh - (Hot backups for Oracle databases)
Copyright (C) 1996 Curtis Preston - curtis@backupcentral.com (http://www.backupcentral.com/toc-free-backup-software.html)

Workaround

  • The problem only occurs during a HOT database backup (i.e. your DB is in archivelog mode and you don't shut down the database while copying files) when you use the 'parallelism' functionality of this script (i.e. the number of simultaneous copies).
  • A quick way to avoid the issue is to disable the 'parallelism' functionality in your oraback.conf.

Current Algorithm Description

If oraback.conf has parallelism enabled:

  • (i) LINE 126: Already within a loop (over the list of all tablespaces), a call to the function Backup_one_tablespace & (notice the ampersand, which means the call runs in the background, so the script doesn't wait for this command to finish but immediately continues with the next iteration of the loop)
    • Backup_one_tablespace:
    • (ii) LINE 205: Within a sub-loop (over the list of all files within a tablespace), Backup_one_tablespace in turn calls Copy_one_database_file & (note the ampersand again, meaning this command is also started in the background and execution immediately returns to the next loop iteration)
      • Copy_one_database_file:
      • (iii) LINE 276: touch $TMP/$X.$ORACLE_SID.$TBS.files.$DBF_BASENAME - create a dummy file to signal that this database .dbf file is in the process of being backed up
      • (iv) LINE 277-307: back up the .dbf file by copying and compressing it
      • (v) LINE 309: rm -f $TMP/$X.$ORACLE_SID.$TBS.files.$DBF_BASENAME - remove the dummy file, signalling that we're done copying the .dbf file
    • (vi) LINE 214: #As long as there are $X.*.$TBS.* files, this tbs is not done
    • (vii) LINE 216-218: Keep checking for $X.*.$TBS.* files until the tablespace is done
    • (viii) Once there are no more files, take the tablespace out of HOT backup mode.
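
The control flow above can be sketched as a small standalone shell script. This is a simplified model and not the original oraback.sh code: the values of X and ORACLE_SID, the two-file tablespaces, the log file, and the point where the tablespace enters hot backup mode are all assumptions made for illustration, and sleep stands in for the real copy and compress step.

```shell
#!/bin/sh
# Simplified model of the control flow described above; NOT the original oraback.sh code.
TMP=$(mktemp -d)          # scratch directory for marker files
X=demo                    # hypothetical value for the script's $X prefix
ORACLE_SID=DEMO           # hypothetical instance name
LOG="$TMP/events.log"     # hypothetical log, used only by this sketch

Copy_one_database_file() {
    DBF_BASENAME=$1
    touch "$TMP/$X.$ORACLE_SID.$TBS.files.$DBF_BASENAME"   # (iii) marker: copy in progress
    sleep 1                                                # (iv) stand-in for copy + compress
    rm -f "$TMP/$X.$ORACLE_SID.$TBS.files.$DBF_BASENAME"   # (v) marker removed: copy done
}

Backup_one_tablespace() {
    TBS=$1
    echo "begin backup $TBS" >> "$LOG"      # assumed: tablespace enters hot backup mode here
    for DBF in "${TBS}01.dbf" "${TBS}02.dbf"; do
        Copy_one_database_file "$DBF" &     # (ii) background, returns immediately
    done
    # (vi)-(vii) poll until no marker files remain for this tablespace
    while ls "$TMP/$X."*".$TBS."* >/dev/null 2>&1; do
        sleep 1
    done
    echo "end backup $TBS" >> "$LOG"        # (viii) tablespace leaves hot backup mode
    wait                                    # sketch only: let any straggling copies exit
}

for TBS in system users; do
    Backup_one_tablespace "$TBS" &          # (i) background, loop continues at once
done
wait

BEGINS=$(grep -c '^begin backup' "$LOG")
ENDS=$(grep -c '^end backup' "$LOG")
echo "tablespaces started: $BEGINS, finished: $ENDS"
rm -rf "$TMP"
```

Note that nothing in this structure forces the touch at (iii) to complete before the polling loop at (vi) takes its first look.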

Race Condition Analysis

In order to understand why the problem occurs (under high I/O load) we need to take a look at what happens from the OS perspective.

Whenever a function is executed in the background, as in points (i) and (ii) above, a new process is spawned: the shell forks a copy of itself (a subshell) and the function runs inside that copy. The parent shell does not wait for the child to start working; it moves on to its next command straight away.
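
A minimal sketch of this behaviour, using a hypothetical counter variable and bump function that are not part of oraback.sh: a change made by a backgrounded function is invisible to the parent, because it happened in a separate process.

```shell
#!/bin/sh
# Shows that a function started with & runs in a separate process (a forked subshell).
counter=0
bump() { counter=$((counter + 1)); }

bump &                       # background: the increment happens in the child only
wait                         # wait for the child to exit
after_background=$counter
echo "after background call: counter=$after_background"

bump                         # foreground: the increment happens in this shell
echo "after foreground call: counter=$counter"
```

The first line printed shows the counter unchanged, the second shows it incremented. In the same way, the parent in oraback.sh has no direct knowledge of how far a backgrounded copy has progressed; it can only look for the marker files.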

Now, let's use an example database which has 100 tablespaces, each of which has 5 datafiles. As we all know, every Oracle database has a SYSTEM tablespace. Let's assume that at:

  • point (i): We are currently iterating over the SYSTEM tablespace
  • point (ii): We have iterated over system01.dbf, system02.dbf, system03.dbf and system04.dbf and are now iterating over system05.dbf
    • At this point in time there is a lot of I/O on the system. Four DBF files are being copied, some of them are being compressed, and the database is possibly processing transactions, causing increased I/O on the archive logs.
    • We likely reach point (ii) for system05.dbf (the 5th file) 1 or 2 seconds (or less) after we began copying the first file, system01.dbf. Remember that we are starting parallel processes, all of which copy concurrently.
  • point (ii) for system05.dbf: spawning of a new background process is initiated (but not necessarily complete)
  • point (ii): execution returns to the next command immediately (in theory the spawning may not even be complete)
    • The following steps happen concurrently and race against each other:
      • point (iii): LINE 276 is executed (a request to create an operating system file)
        • at this point the copy process blocks (waits) for the I/O request to finish. Since the system's I/O is saturated, this may take some time
      • point (ii): as soon as the spawn of the system05.dbf copy returns, the loop over the files ends and the calling function proceeds to point (vi), where it begins searching for files that indicate a copy is still in progress
      • point (vi): since the other process is still blocked at point (iii), waiting for its file creation request to finish, point (vi) may not see any $X.*.$TBS.* files, because the one for system05.dbf has not yet been created (and no marker from an earlier file is present at that moment)
      • point (vi) concludes that the tablespace backup is finished; point (viii) takes this tablespace out of hot backup mode and the script continues with the next tablespace
      • point (iii): the copy process finally gets its file creation request granted and happily continues to make a copy of the file
      • point (iv): meanwhile the database is likely updating the same file (system05.dbf), making the file copy inconsistent
      • The file copy silently finishes with no error and the DBA is under the impression that he has a consistent backup
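
The sequence above can be reproduced deterministically with a small model. This is a sketch under stated assumptions, not oraback.sh itself: it uses a single datafile, hypothetical marker and log file names, and an artificial sleep 2 in place of the touch request that is stuck behind saturated I/O.

```shell
#!/bin/sh
# Model of the race with an artificial delay; a simplified sketch, NOT oraback.sh itself.
TMP=$(mktemp -d)
MARKER="$TMP/demo.SYSTEM.files.system05.dbf"   # hypothetical marker file name
LOG="$TMP/events.log"                          # hypothetical log, used only by this sketch

Copy_one_database_file() {
    sleep 2                          # stand-in for the touch waiting on saturated I/O
    touch "$MARKER"                  # (iii) marker arrives too late
    echo "copy started" >> "$LOG"    # (iv) copy runs outside hot backup mode
    rm -f "$MARKER"                  # (v)
}

Copy_one_database_file &             # (ii) returns immediately

# (vi)-(vii) no marker exists yet, so this loop exits on its first check
while ls "$TMP"/demo.SYSTEM.files.* >/dev/null 2>&1; do
    sleep 1
done
echo "end backup" >> "$LOG"          # (viii) tablespace leaves hot backup mode

wait
ORDER=$(cat "$LOG")
echo "$ORDER"
rm -rf "$TMP"
```

The log shows "end backup" before "copy started": the tablespace has left hot backup mode before the copy of its datafile has even begun.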

How to fix it

The fix is short and sweet!
LINE 276 (touch $TMP/$X.$ORACLE_SID.$TBS.files.$DBF_BASENAME) should be moved to just before LINE 205, i.e. immediately before the Copy_one_database_file & call.

By creating the 'begin copy' file prior to spawning the copy process (and without the ampersand on the touch command) we are guaranteed that the script won't proceed until the create file command is complete.
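
A minimal sketch of the reordering, under the same kind of assumptions as any toy model: one datafile, hypothetical marker and log file names, and sleep calls standing in for slow I/O and for the copy itself. It is not the patched oraback.sh, only an illustration of why the foreground touch closes the window.

```shell
#!/bin/sh
# Model of the fix: the marker is created in the foreground BEFORE the copy is spawned.
TMP=$(mktemp -d)
MARKER="$TMP/demo.SYSTEM.files.system05.dbf"   # hypothetical marker file name
LOG="$TMP/events.log"                          # hypothetical log, used only by this sketch

Copy_one_database_file() {
    sleep 2                          # stand-in for a slow start under saturated I/O
    echo "copy started" >> "$LOG"
    sleep 1                          # stand-in for copy + compress
    echo "copy finished" >> "$LOG"
    rm -f "$MARKER"                  # marker removed only once the copy is done
}

touch "$MARKER"                      # no ampersand: completes before the next line runs
Copy_one_database_file &             # spawn the copy in the background

# The marker already exists, so this loop keeps polling until the copy removes it
while ls "$TMP"/demo.SYSTEM.files.* >/dev/null 2>&1; do
    sleep 1
done
echo "end backup" >> "$LOG"          # tablespace leaves hot backup mode

wait
ORDER=$(cat "$LOG")
echo "$ORDER"
rm -rf "$TMP"
```

Here "end backup" is always the last entry in the log, however long the background process takes to get going, because the polling loop cannot start before the marker exists.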

The fixed version is available here: Oraback v1.49.custom fix v0.01.zip

How did I discover this?

This script was implemented as a hot backup solution for all our databases. In early 2004, as a test of our disaster recovery strategy, I tried recovering from one of the old backups. To my surprise I was unable to recover. The RECOVER process kept asking for additional archive log files (files which I did not have), which left me scratching my head. Did the backup process forget to copy all the files? Did I screw up during the recovery?

Tracking down the problem was not easy, as it only occurs sometimes. Finally, with a lot of debug statements printed, the problem became apparent.

I contacted the author, as well as the LazyDBA website (which distributes a copy of this software), and then forgot about the problem. Unfortunately, as of Sept 22, 2005, a visit to the author's website still reveals the bug. Maybe if the author hears from enough people he will update his script. I cannot judge, but I know I was appalled to learn that backups of our Oracle Financials 11i, as well as our other systems, were inconsistent.

A piece of advice

Data loss is every DBA's worst fear. In many cases it's enough to ensure you have to update your resume. But the solution is very simple. Design your backup strategy and then test it, and re-test it at least once a year, and if possible every 3 months. It'll keep your DR strategy fresh in your mind and it'll give you peace of mind knowing that in the worst possible case you'll only lose 3 months of data (because you confirmed that your backup from 3 months ago is still good).

The other solution is to always have your resume updated.

The fixed file

For the ME.first generation, before I lose your attention (oops, too late)... Oraback v1.49.custom fix v0.01.zip


Cheers! marijan at marijan dot com