July | 2021 | Oracle Tips and Tricks -- David Fitzjarrell

July 6, 2021

“Your Transaction Will Be Delayed… “

Filed under: disaster recovery,General — dfitzjarrell @ 12:10

“Now will you tell me where we are?” asked Tock as he looked around the
desolate island.
“To be sure,” said Canby; “you’re on the Island of Conclusions. Make yourself
at home. You’re apt to be here for some time.”
“But how did we get here?” asked Milo, who was still a bit puzzled by being
there at all.
“You jumped, of course,” explained Canby. “That’s the way most everyone gets
here. It’s really quite simple: every time you decide something without having a
good reason, you jump to Conclusions whether you like it or not. It’s such an easy
trip to make that I’ve been here hundreds of times.”
“But this is such an unpleasant-looking place,” Milo remarked.
“Yes, that’s true,” admitted Canby; “it does look much better from a distance.”
— Norton Juster, The Phantom Tollbooth

Standby databases have been both bane and boon to DBAs.  The boon is the ability to quickly recover from a priimary database failure by failing over to the standby essentially allowing the business to “pick up where they left off” when the unthinkable happens.  The bane, many times, is errant transactions which compromise the production data on the primary side; given that in most standby or Data Guard installations the lag period between primary and standby is no more tnan the time it takes to switch logs this can be far too short of an interval for the DBA to respond to prevent the responsible insert/update/delete from being applied at the standby site.  Yes, such a scenario can be a rare occurrence but it’s not beyond the realm  of possibility.  Compound that with customers using a standby database as a near-real-time reporting database and the situation gets more complicated. What’s a DBA to do?  Let’s examine this scenario and see what possible options the DBA has at his/her disposal.

Standby databases, whether or not they are managed via Data Guard, have been key components in most Disaster Recovery configurations.  All data flowing into or out of the primary is systematically repllicated to one or more standby databases by configuring standby redo logs (version 11 and later) or redo transport (all versions).  Let’s look at that last staement in more detail.  When standby redo logs were introduced the behavior changed for the standard ‘alter database recover managed standby database disconnect from session;’ statement; Oracle automatically  uses the standby redo logs for transaction information.  Yes, the old reliable archived redo logs are still transported to the standby but they aren’t the main source of information.  The archived redo logs CAN be used, if desired, by modifying the ‘alter database’ statement by adding the ‘using archived logfile’ directive, and this option opens the door to setting a delay to the redo apply process so that problem transactions can be caught before they can be applied to the standby — problem averted.

Redo log apply delay is configured in the spfile parameter log_archive_dest_N (N being a number from 1 to up to 31, depending on the version of Oracle) using, yes, you guessed it, the DELAY parameter.  This parameter requires an integer value, in minutes.  For example a possible configuration for a standy database located in Kansas City would be:

log_archive_dest_4=’SERVICE=stbyKC, DELAY=1440′
log_archive_dest_state_4=ENABLE

Standby number 4, in Kansas CIty, will now lag the primary database by 24 hours when archived logfiles are the source of truth.  The delay can be set to any value of minutes, from 1 to whatever you desire; if the delay had been set to 2880 there would be 2 days lag between primary and standby.  Yes, 2880 minutes is a bit extreme but it’s not an invalid value for the parameter.  Remember that corrupted transactions can occur at any time of the day or night and because of that I have seen on more than one occasion the DELAY set to 1440.  Take, as an example a situation where transactions hit the primary database 24/7 and at 3 AM on Tuesday, December 7th, the production data is rendered unusable due to a corrupted feed from Klottsylvania.  Additionally Perba Kleppneckle, the company’s intrepid DBA, wisely set a 24 hour delay between primary and standby.  She now has time to research when the transaction was applied so the current standby can be recovered to a time just before the data corrupting transaction would be processed.  Failing over to the standby would now be possible since Perba was able to prevent the corruption that rendered the primary unusable.  Yes, the roles are reversed but the former primary now becomes the standby once it’s recreated from the new primary and the corrupted data is no longer an issue.

One fly in this ointment becomes apparent if the standby is running as an Active Data Guard database; normally such configurations are found when the standby is used as an active reporting database; many times such use requires that the standby remain in lock step with the primary (the finance and investment industries are good examples); delays would render the reports essentially useless.  But, don’t despair, even the finance and investment industries can benefit from using an apply delay; such a configuration would be found on a second or third standby database not used for reporting.  Having more than one standby available can provide at least one source of truth where data corruption can be minimized or eliminated, giving the enterprise a recovery path that doesn’t jeopardize reporting and one that can restore the other standbys should disaster strike.  It’s likely that the current reports are run from a scheduler (on Windows or via cron on Unix/Linux) providing scripts that can be called at will to regenerate reports as necessary.  Fail over, rebuild the remaining standby databases and re-run the problem reports.  Since the old standby is now the new primary the archived logs which contain the suspect transactions are no longer valid and the danger of accidentally applying them is averted.

Can multiple standby databases have different delay settings?  Of course; the example above involves two standbys, one with a delay of 0 minutes and one with a delay of 1440 minutes.  Let’s go back to our wisely cautious DBA, Perba Kleppneckle, and see how she configured 3 standby databases for various uses.  Perba works for the investment firm Alhami Investments, LLC, and the company requires the use of standby databases to feed data to various parts of the organization.  Brokers need current data to generate investment reports for clients; sales can use data up to 24-hours old since they are building historical performance reports and portfolio managers have asked for data no more than two hours old to track futures.  Perba has configured and built the three standbys in separate physical locations — Topeka, KS, Reno, NV and Baltimore, MD — to isolate them from each other and from the primary, located in San Diego, CA.  Standby #1 is in Topeka, and it’s configured absent any delay; this is for the brokers.  Standby #2 is in Reno, configured with a 2-hour delay for the portfolio managers; standby #3 is in Baltimore, configured with a 24-hour delay for the sales team.  Looking at the log_archive_dest_N parameters Perba implemented we see:

log_archive_dest_2=’SERVICE=stbyone’
log_archive_dest_state_2=enable
log_archive_dest_3=’SERVICE=stbytwo, DELAY=120′
log_archive_dest_state_3=enable
log_archive_dest_4=’SERVICE=stbythree, DELAY=1440′
log_archive_dest_state_4=enable

Now that Perba has the configuration parameters set the various standbys can be started.  Standby #1 uses the standby redo logs so it can be started with the following command:

alter database recover managed standby database disconnect from session;

Standbys #2 and #3 will need to use the archive logs rather than the standby redo logs; Perba modifies her startup command:

alter database recover managed standby database using archived logfile disconnect from session;

which allows Oracle to use the configured delay to buffer the recovery should a data-corrupting transaction or set of transactions be applied to the primary.  Note that in this configuration both the primary and standby #1 will be affected by data-corrupting input; having the remaining standbys using different delay settings allows for faster data recovery and a shorter recovery time (the two-hour delay standby requires less redo to recover to just before the corruption hits, bringing the restored primary online in less time thus starting the standby rebuild process quicker).  Of course the standby log destination configuration will change since Reno will be the new primary database; San Diego will now become standby #2.  Once the standby reconfiguration is complete the applications that used Reno as a standby will need to be reconfigured to use San Diego, a simple change taking minutes to complete since the tnsnames.ora file only need to swap the connection information between San Diego and Reno.  The service names can remain the same since no location names are used to define those services.

Delaying standby reecovery may not be standard practice but for mission-critical databases maybe it should be.  The ability to prevent data corruption from spreading across the standby landscape can save the business time and money by providing a means to recover from such corruption in a short amount of time, putting the company back on track quickly and reliably by reducing the amount of duplicated work to enter current transactions again.  The key part of Disaster Recovery isn’t the disaster, it’s the recovery and the smoother that process can go the better it is for the enterprise.  Some thought and investigation beforehand can better prepare the environment, and the DBA, to respond to that which the DBA hopes never happens.

Because jumping to conclusions can land you somewhere you’d rather not be.