As bad as it gets – A DB2 support nightmare!
Imagine the scene – a broken database on an unsupported version of DB2, with no backups or log files to recover the database. Yes, this one really was the stuff of nightmares!
An erroneous script had deleted a few transaction log files that had a ‘last changed’ date of more than 45 days. The same script had caused other errors and a database restart was required, but the database did not start. The database was looking for an old log file, which had just been deleted by the script. As the policy was to retain the backups and archive logs for 30 days, this log file was deleted from the archive logs too.
The database was tiny – less than 50GB. Nevertheless, it was a very important one, with a number of web facing apps relying on it for important features. To make matters worse, the version of DB2 in use had passed its “End of Service” date, so DB2 support was not willing to investigate (though they were happy to guide).
When we got involved a few hours after the incident, panic had set in. Based on information available (saved snapshots and db2diag.log file), we were able to conclude that there was a transaction which started in the log file the database was looking for. This transaction was never completed. The rate of change of data was so small, the configured log numbers could go on for more than 45 days.
The options available were to extract the data from the latest backup image (using tools like High Performance Unload) or extract the data from the damaged database (using db2dart). The latter option was chosen as this would allow us to recover the most recent data.
Without further delay, we ran db2dart on the database to check for any errors and to get the Tablespace id, Table id and the total number of pages allocated to each table. We were then able to use the information to build the db2dart command with the DDEL option to extract the data in delimited format. db2dart with the DDEL option is interactive (i.e., when the command is run, it prompts for the tablesapce id, tableid and the page range to extract the data). This meant that the extract could not be scripted but had to be done manually for each of the 300+ tables.
Once that mind-numbing task was complete, we created a new database with the DDL that was available (thankfully, they had a db2look output from the production database less than a week before the incident). Finally, we loaded the extracted data to the new database and ran runstats on the table and indexes.
After a few hiccups and 15 hrs of db2dart, import/load, runstats and data fixes, the database was available for the application. The database was down for more than 20 hours, but it was back in one piece with nearly no data loss. Quite an achievement under the circumstances!
Lessons learned:
•Be aware of advanced recovery tools such as db2dart – they can be lifesavers in extreme situations.
•Do not touch the active log directory.
•Configure a log archiving strategy that archives log files to a different location than that of the active log directory.
•Have a scheduled clean up procedure for the archive log directory.
•The most recent active transaction log file can be found by listing the database configuration parameters. Make absolutely sure you don’t need a log file before deleting it.
•If possible, keep at least one backup image on disk.