Apparently there is still confusion over which Oracle feature provides high availability and which provides disaster recovery. More than one DBA seems to believe that Data Guard is a high-availability (HA) solution; I don’t consider it so, as we’ll soon discover. Let’s define what high availability is, then see which product and/or feature satisfies that definition.
To qualify as a high-availability solution, a product must provide relatively uninterrupted access to the production system by handling failures transparently: the user community remains unaware of failures that could otherwise affect access.
Data Guard, in all of its glory, does not provide such access in my opinion, although the Oracle documentation says otherwise; Real Application Clusters does, as does an older Oracle product called FailSafe and an even older product that was cumbersome to configure and use, Oracle Parallel Server. Still, there are DBAs in the workforce who firmly believe that Data Guard is a valid high-availability solution, even knowing that a failover involves time during which users have no access to the database. [Apparently my idea of HA and Oracle’s differ.] Given the criteria listed above, Data Guard does not, in my mind, fit the bill. So why do some DBAs consider it high availability? Let’s see what Data Guard does do, and maybe we’ll see why I don’t consider it that way.
Data Guard provides a mechanism whereby Oracle will keep one or more databases synchronized with the primary database. For a physical standby configuration in release 10.2, three protection modes are available:
Maximum Protection
Maximum Availability
Maximum Performance
Maximum Protection mode guarantees that the standby remains in ‘lock step’ with the primary: every transaction is written to both the primary redo logs and the standby redo logs, and no transaction can commit until all local and remote redo has been written successfully. The caveat is that if Oracle cannot write a transaction to the standby redo logs, the primary suspends activity until the error causing the write issue is corrected. This ensures a seamless cutover should disaster strike, but it also inconveniences production users should problems with standby redo log writes occur.

Maximum Availability mode works like Maximum Protection mode until a standby redo log write problem occurs; Oracle then switches to Maximum Performance mode until the write issue is resolved, at which time the standby redo log writes catch up with the primary.

Maximum Performance mode allows a transaction to commit as soon as the redo entries are written to the local redo logs, regardless of whether the standby redo log writes have completed.

[In 11.2 a snapshot standby is available, which allows read/write access to the data and can be converted back to a physical standby at any time. Also available in that release is support for redo apply while a physical standby is open for read access.]

A logical standby configuration is also available; it relies upon log shipping to the standby, where LogMiner is used to extract and apply DDL and DML changes, although a few data types (listed in the online documentation) won’t replicate in such a setup. For the purposes of this discussion only the physical standby configuration will be considered, as it replicates all changes made to the primary, providing a byte-for-byte replica of the primary.
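To make the modes concrete, here is a minimal sketch of how a protection mode is configured from the primary. The service name `stby` and destination number are assumptions for illustration; the SYNC/AFFIRM attributes are what distinguish the higher protection modes from Maximum Performance (which uses ASYNC).

```sql
-- Hypothetical: ship redo synchronously to a standby whose
-- DB_UNIQUE_NAME is 'stby'. SYNC AFFIRM is required for
-- Maximum Protection and Maximum Availability; ASYNC suffices
-- for Maximum Performance.
ALTER SYSTEM SET log_archive_dest_2 =
  'SERVICE=stby SYNC AFFIRM
   VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE)
   DB_UNIQUE_NAME=stby';

-- Raise the protection mode (moving up to Maximum Protection
-- requires the primary to be restarted and mounted first).
ALTER DATABASE SET STANDBY DATABASE TO MAXIMIZE AVAILABILITY;

-- Verify the configured mode and the level actually in effect.
SELECT protection_mode, protection_level FROM v$database;
```

Note that `protection_level` can differ from `protection_mode` when standby communication is degraded, which is exactly the Maximum Availability fallback behavior described above.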
A physical standby can provide (depending upon the protection mode) an exact ‘point in time’ copy of the primary so that no transactions are ‘pending’ due to archive log transfers. [In Maximum Performance mode, if standby redo logs are not configured, the standby is synchronized only to the last log transferred from the primary, leaving a gap of several minutes’ worth of transactions at the standby site.] This does NOT provide a high-availability configuration, as failover tasks consume time and take the database out of service until the failover is complete. Since high availability is defined as relatively uninterrupted access to the database even during failure of some resources, Data Guard cannot, and should not, be used if high availability (meaning no downtime, as RAC provides it) is desired or required. It is a Disaster Recovery (DR) solution, and DR and HA are not the same in my book. [GoldenGate provides both DR (with Active Data Guard) and real-time replication solutions through the same interface, neither of which is a high-availability offering, even though many ‘experts’, and Oracle Corporation, offer the product as a high-availability configuration.]
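The failover ‘time out of service’ I keep referring to is not hypothetical; a 10.2 physical standby failover involves steps along these lines, run on the standby (the exact sequence depends on the configuration, so treat this as a sketch rather than a runbook):

```sql
-- On the standby: apply whatever redo remains, then assume the
-- primary role. Users cannot do work until all of this completes.
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE FINISH FORCE;
ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY;
-- A restart may be required in some configurations before the open.
ALTER DATABASE OPEN;
```

The Data Guard broker collapses this into a single `FAILOVER TO <standby>` command in DGMGRL, but the database is still unavailable for the duration; the broker shortens the window, it does not eliminate it.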
Real Application Clusters, or RAC as it’s known industry-wide, is an HA solution, as it provides the uninterrupted access required. Unless this is a single-node RAC (a configuration available in release 11.2 that provides for later expansion and is primarily designed for development and testing purposes), this option can be configured to transparently fail over to a known good node should one node fail, the key term being ‘transparently’. No user interaction or intervention is necessary, as RAC seamlessly transfers work to a good node and continues without inconveniencing the users. The database is available as long as at least one node remains operational; that, of course, could slow down transactional activity depending on the available memory on that node, but there is no loss of service. Contrast that with Data Guard, where the primary database is no longer functioning and a secondary database, in a physically separate location, must be converted from standby to primary before users can resume work. Add the time to reconfigure the old primary as the new standby and it’s clear this is not high availability.
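The ‘transparently’ part is typically delivered on the client side with Transparent Application Failover (TAF). A minimal sketch of a TAF-enabled tnsnames.ora entry follows; the host name `rac-scan` and service name `racprod` are hypothetical placeholders:

```text
# Hypothetical TAF entry: if the node serving a session dies, the
# session reconnects to a surviving instance and, for in-flight
# SELECTs, resumes the query automatically.
RACPROD =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = rac-scan)(PORT = 1521))
    (CONNECT_DATA =
      (SERVICE_NAME = racprod)
      (FAILOVER_MODE =
        (TYPE = SELECT)(METHOD = BASIC)(RETRIES = 20)(DELAY = 3))
    )
  )
```

TYPE=SELECT resumes queries after failover; uncommitted transactions are still rolled back, which is why I say ‘relatively’ uninterrupted rather than invisible.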
As another aspect of this discussion, a RAC configuration involves one database and two or more clustered instances accessing that database, while Data Guard involves two or more separate databases, usually found in two or more physical locations. Yes, the tnsnames.ora files can be configured to ‘fail over’ to the first active production site so that users need not reconfigure SQL*Net to access the former standby database should it be needed, but that isn’t the issue with Data Guard; the issue is the failover time required to exchange the roles of primary and standby, which interrupts service until the transition is complete. Improvements in Data Guard may have decreased the downtime considerably, although I would still have a difficult time recommending Data Guard as an HA offering.
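For completeness, the client-side ‘fail over’ mentioned above looks something like this (host and service names are hypothetical):

```text
# Hypothetical tnsnames.ora entry for a Data Guard pair: new
# connections try the primary host first, then the standby, with
# no client reconfiguration needed after a role transition.
PROD =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (FAILOVER = ON)
      (LOAD_BALANCE = OFF)
      (ADDRESS = (PROTOCOL = TCP)(HOST = primhost)(PORT = 1521))
      (ADDRESS = (PROTOCOL = TCP)(HOST = stbyhost)(PORT = 1521))
    )
    (CONNECT_DATA = (SERVICE_NAME = prod))
  )
```

This only helps new connections find whichever site is currently primary; existing sessions still fail, and nothing connects anywhere until the role transition itself is complete, which is precisely my point.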
Don’t get the wrong idea about Data Guard: it’s an excellent technology that provides data protection in the event of a catastrophic disaster at the primary data center, and it is often used in conjunction with RAC to provide an environment resilient to both node failure and complete disaster. If you want HA (in my opinion) then RAC is the ‘out of the box’ solution provided by Oracle, as it handles node failures with grace (and possibly style) and keeps the work flowing seamlessly. Know that data protection and high availability are different, but compatible, areas that need to be considered when constructing a robust database configuration, and that the former cannot replace the latter (again, in my opinion).
[The Oracle documentation, at first blush, agrees with my definition but later on in the depths of the HA discussion clearly states, without question, that Oracle considers Data Guard a high-availability solution. Far be it from me to argue with Oracle.]
Data Guard and RAC are both well-tested and reliable options to consider when designing and implementing a fault-tolerant configuration, but of the two only RAC, in my estimation, provides high-availability.
Unless you like explaining to upper management why your ‘HA solution’ required an outage.