Thursday, June 23, 2016

High Availability and SLA requirements for Oracle database

[Figure: Oracle database HA - unplanned downtime. Unplanned downtime is primarily the result of computer failures or data failures.]
After my presentation "Oracle database High Availability strategy, architecture and solutions" at the German Oracle User Group (DOAG) meeting in Nuremberg a few days ago, I decided to write about High Availability (HA) solutions for Oracle database on my DBMS Blog. I must admit that when I started preparing this topic, I realized it is so extensive and complex that I decided to start slowly, by describing the things that any DBA or infrastructure architect should understand before building a highly available database system with minimal allowed downtime. This first article focuses on understanding the availability requirements and the Service Level Agreement (SLA).

Understand and develop Service Level Agreement (SLA)

First, the DBA needs to understand the Service Level Agreement (SLA), that is, the customer’s service requirements.
A Service Level Agreement (SLA) is a negotiated agreement between two or more parties, where one is the customer and the others are service providers. An SLA is usually part of a service contract in which the service is formally defined. For example, IT service providers commonly include SLAs in the terms of their contracts with customers to define the level(s) of service being sold in plain-language terms. A database SLA typically has a technical definition in terms of the following elements.

System Availability

This is the main SLA element. It is commonly expressed as a percentage, but it is often more meaningful when expressed in hours. For example, 99.9% availability is roughly equivalent to 8 hours and 45 minutes of maintenance window, or allowed downtime, per year.
Availability Target    Downtime Per Year (approx.)
90 %                   36 days
98 %                   7.3 days
99.7 %                 26 hours
99.99 %                52 minutes
99.999 %               5 minutes
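
The figures in the table can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: the function name `allowed_downtime_hours` and the constant `HOURS_PER_YEAR` are my own, and it assumes a 365-day year with the service expected around the clock.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours; assumption: leap years are ignored


def allowed_downtime_hours(availability_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the downtime budget in hours for an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return period_hours * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Print the yearly downtime budget for the targets discussed in the text.
    for target in (90, 98, 99.7, 99.9, 99.99, 99.999):
        hours = allowed_downtime_hours(target)
        print(f"{target:>7} %  {hours:9.2f} hours  ({hours * 60:9.1f} minutes)")
```

If the SLA covers only a service window rather than the whole year, the same function can be called with a smaller `period_hours`, so the budget is measured against the hours in which the database is actually expected to be up.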

Sometimes system availability is described in plain text, for example: the database servers must be available 5 days a week, from 6 am to midnight, or 24×7.

Acceptable Data Loss

For example: no more than 15 minutes of data entry can be lost.

Mean time to recover (MTTR)

For example: in the event of a disaster, the systems should be back up and running within one hour.

Mean time between failures (MTBF)

For example: a failure should not occur more than once a month.

Performance

For example: transaction response time should not exceed 4 seconds.
The SLA elements above are required in order to develop the SLA, and then to design systems and processes that meet customer expectations.
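
The MTTR and MTBF targets are not independent of the availability target. A common rule of thumb is that steady-state availability equals MTBF / (MTBF + MTTR), where MTBF is taken as the uptime between failures. The sketch below applies it to the example values above. The function name is my own, and reading "once a month" as 730 hours is an assumption.

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up: MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # Assumption: "no more than one failure a month" is read as
    # 8760 / 12 = 730 hours of uptime between failures.
    mtbf = 8760 / 12
    mttr = 1.0  # "back up and running within one hour"
    availability = steady_state_availability(mtbf, mttr)
    print(f"Implied availability: {availability * 100:.3f} %")
```

With these example values the implied availability is roughly 99.86%, so an SLA that allows one failure a month with a one-hour recovery could not also promise 99.9% availability. It is worth checking this kind of consistency before the SLA is signed.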

Design systems and processes to meet SLA expectations

What’s important at this point is to understand the need to make design and infrastructure decisions in the context of meeting the Service Level Agreement (SLA). When architecting a new database system, the SLA targets can usually be achieved with different technical solutions within your available budget (see next chapter).
Designing a highly available system involves taking various elements and combining them to suit your business needs and requirements. Below are the elements that you need to consider, especially when designing an Oracle database environment.

High Availability

Strictly speaking, High Availability (HA) gives consideration to the single points of failure in your system and eliminates them through redundancy.
Examples include redundant hardware, SAN/ASM storage, and RAC databases.

Disaster Recovery

Disaster recovery extends the concept of High Availability (HA) beyond single points of failure by providing secondary elements that can be brought into play when the primary elements fail.
An example is standby databases.

Oracle Maximum Availability Architecture

Implementing High Availability (HA) to address single points of failure, and disaster recovery to address system failure, leads down the path of the Maximum Availability Architecture (MAA).

Downtime

The key to determining which elements of High Availability (HA) are appropriate for your site is how much downtime you can tolerate; this includes unplanned as well as planned downtime. Downtime is usually divided into two types. Unplanned downtime is primarily the result of computer failures or data failures. Planned downtime is primarily due to data changes or system changes.

[Figure: Oracle database High Availability (HA) - planned downtime. Planned database downtime is primarily due to data changes or system changes.]
Choosing the right technical solution when designing a database system from scratch is difficult. You can follow some best practices for building highly available Oracle database systems based on Availability Levels that match database industry standards. I’ll show these Availability Levels in the next article.

Obtain the budget for building the database environment

After you have made the database design proposal, you have to obtain the budget to implement a database environment that can meet the agreed SLA. Considering the different technical solutions, your goal is to build a database system that meets the required SLA level for the least money. A cost comparison of the options helps in setting realistic expectations when developing a proper database system. Bear in mind that the cost of a database system increases considerably the closer you want to get to 100% availability. In other words, removing each “9” from the uptime target significantly reduces the cost of building an environment that meets the target, as the table below helps demonstrate.

Item          SAN Snapshot backup    Native DB backup
Licensing     €28,000                €5,000
Training      €14,000                €0
Storage       €45,000                €10,000
Total Cost    €87,000                €15,000

Cost is a negotiation topic. You might not receive the budget you requested. In that case you have to go into another round of negotiations with the customer, with a plan B in place: fighting for the money, reducing the SLA requirements, or finding a more cost-efficient solution.

Be prepared for a disaster

Generally speaking, there are two options for dealing with a potential disaster: (a) expect and plan for it; or (b) do nothing and hope for the best. I strongly recommend against the second one!
Often IT personnel are not prepared for a disaster or unplanned downtime at all. Either they do not build IT systems to tolerate a disaster right from the beginning, or, if they do, the availability solutions and procedures that were implemented at the beginning simply stop working as the systems change over time. So many times I’ve heard from IT colleagues: “It worked somehow in the past. Why do we need to improve or test it again?” But that is like thinking you do not need to lock your car because it has never been broken into. Once it happens, it is already too late!
Ensure that disaster recovery is planned and tested. I strongly recommend covering every possible unplanned downtime case in your planning and testing. Prepare solutions and procedures to avoid, or at least mitigate, database downtime.
So that was a prelude that gave an idea of the things that should be considered, such as the Service Level Agreement (SLA), Availability Levels, and downtime, before speaking about Oracle database High Availability (HA) solutions in the next articles.