Search

Thursday, June 23, 2016

Oracle database configuration issues that cause downtime

A well-designed high availability (HA) solution accounts for all these factors in preventing unplanned database downtime. One of the true challenges in designing a highly available (HA) solution is examining and addressing all the possible causes of downtime. It is important to consider causes of both unplanned and planned downtime. The diagram shown in the slide, classifies unplanned database failures.

Common Oracle database configuration issues cause downtime

Causes of unplanned database downtime
In this article I tried to list common database configuration issues that can affect Oracle database availability causing unplanned downtime. As an experienced DBA you should proactively diagnose and do preventive actions to avoid these types of situations in your database environment. Furthermore, from my experience even on database environments with high availability (HA) solutions (like clustering, storage mirroring, Oracle RAC and Data Guard, etc.) these cases still persists and negatively affect the database high availability (HA) at most.

 

Archive log destination is full

This issue can happen either on disk level or on the initialization parameter level.
Size your archive log destination properly. I recommend always having enough space to keep so many archive logs in an archive log destination to survive for at least one day without performing an archive log backup.
If the issue occurred, you still have a few options to fix it:
– Backup archive logs and delete those in the archive log destination to allow database log switches
– Add more space to the archive destination
– Dynamically change the log archive destination:
ALTER SYSTEM SET LOG_ARCHIVE_DEST_1 = (‘LOCATION=/oradata/ORCL/archive’);
– etc.

 

Tablespace or associated filesystem is full

Ensure proper configuration of your tablespaces and associated file systems / disks and think carefully about the database file extension strategy, database file options (like file autoextend), file system and tablespace monitoring, etc.

 

ORACLE_BASE file system / disk is full

ORACLE_BASE is the root of the Oracle Database directory tree. Usually it consists of one or more ORACLE_HOME directories where Oracle binaries are installed. How to prevent that issue:
–  Have a separate ORACLE_BASE file system and size it properly
–  Ensure that the Oracle mandatory and system auditing do not fill it in quickly
–  Be careful with default Data Pump dump location (ORACLE_HOME/rdbms/log/). Place it to a separate disk
–  In Oracle 11g keep track of file generation in Automatic Diagnostic Monitor (ADR) location that can also quickly fill the file system in.
When the issue occurs database may become non responsive. In such scenarios you should quickly react to free up the space or / and add more.

 

Database listener is down

Just restart it or fix the issues first

 

Database is in restricted or quiesced state

Be careful changing a database into restricted or quiesced state. And be aware that there are cases when somebody (a process, job or a system operation) can switch database in quiesced state. For example, DB is switched in quiescing for a short time when installing Oracle Enterprise Manager (OEM) Console using EMCA.

 

Database reaches certain parameter limits

Sometimes when a database reaches certain system limits, you have to increase those values quickly or even bounce the instance to make changes effective. Some common cases are below:
– processes – number of processes
– open_cursors – number of open cursors

 

Initialization parameter changes requirement

Think upfront in setting database initialization parameter right at the beginning, since not all of them can be adjusted online. 114 database parameters of 11gR2 are still not modifiable online on the instance level.

 

Control file limit reached

This issue can occur when you reach limits of some DB configuration parameters stored in a control file like MAXLOGFILES, MAXLOGFILEMEMBERS, MAXINSTANCES.
To fix this you need a downtime. 10g has reduced some of those limitations though.

 

Oracle ASM instance limits

Oracle ASM can be considered as another database instance that have own parameter limitations you also need to consider carefully. For example, If ASM instance reaches maximum number of processes, all of your databases on that host will hang.

 

Oracle auditing may cause service unavailability

Be aware that Oracle does by default a lot of auditing as on OS level as inside the database system tablespace. As a matter of fact the latest Oracle versions produce even more auditing records by default. So I suggest reviewing defaults and, based on your auditing strategy, keep only what is required and implement proper storage and housekeeping structures.

 

Oracle bugs

As in all the software, Oracle bugs sometimes cause database unavailability or simply increase the outage length in case of hardware or storage failure for example. Make your robust patching strategy and follow it carefully.

 

Patching may introduce database unavailability

This is the other side of the medal. Do not rush upgrading or applying patches on the production environments, but rather test them carefully on the test databases. It is important to consider not only the time to perform the upgrade, but also the effect the changes may have on the overall database and application.
Sometimes patches introduce new bugs reducing database high availability. In spite Oracle states that Patch Sets are only bug fixes and application vendors should not certify those, I strongly recommend conducting extensive testing and receiving a sign off from application vendors or/and your application owners before going productive.

 

Database recovery situation

This problem potentially can introduce your longest outage and kill your SLA with the customer. Thus to minimize the downtime, you have to be very careful in:
– utilizing modern software and hardware tools for potential recovery situations
– designing your backup and recovery strategy
– documenting it properly
– testing it periodically
– doing preliminary files, database and object consistency checks
– considering values of certain database initialization parameters:
DB_BLOCK_CHECKING
DB_BLOCK_CHECKSUM
DB_LOST_WRITE_PROTECT
DB_ULTRA_SAFE (new in 11g)

 

Database poor security

In spite of Oracle having more than17 or even more security certificates, DBAs have to strengthen database security on all the levels during DB setup, configuration, application installation and maintenance. And this is not only to prevent intruder’s brake-ins. My first rule is simple: minimize the number of users that have direct access to your database and stick to the Least Required Privilege rule. This can reduce number and effect of user errors, recovery situations and system break-ins.

 

Database environment complexity

Yes, even this sometimes become an issue introducing even more database outages. Companies implement complex database infrastructure not taking into account a risk of having more software and hardware components that potentially can fail and/or introduce more bugs and issues compared to simple and robust solutions.

 

DBAs or human factor

Human error, which is a leading cause of failures, includes errors by an operator, user, database administrator, or system administrator. Another type of human error that can cause unplanned down time is sabotage.
Nobody is perfect and every DBA can make mistakes but what you can do here is to train your personnel properly and do periodic testing of your common DBA tasks and procedures. At the end, you need to find an experienced DBA you can trust and rely on… and I think this is the most difficult thing to do.
In the following articles I plan to discuss some Oracle’s solutions to downtime.