Block corruption & Correction: a case study
In one of a production database, I have received following errors in UI ( front-end) and back-end as well though mail. I started investigation though logs and conclude about logical corruption with few requested logs. Please go through my analysis.
1) Error received Front End :
Timestamp: 7/22/2014 4:21:34 PM
Message: OraDataServiceProvider.AddIssueDetails Message : ORA-01476: divisor is equal to zero
ORA-06512: at "SALES.F_UPDATEINVENOTARY", line 517
ORA-00600: internal error code, arguments: [4511], [], [], [], [], [], [], [], [], [], [], []
ORA-06512: at "SALES.P_ADDISSUE", line 334
ORA-06512: at line 1
----------------------------------------
Timestamp: 7/22/2014 4:21:34 PM
Message: OraDataServiceProvider.AddIssueDetails StackTrace : at Oracle.DataAccess.Client.OracleException.HandleErrorHelper(Int32 errCode, OracleConnection conn, IntPtr opsErrCtx,
OpoSqlValCtx* pOpoSqlValCtx, Object src, String procedure, Boolean bCheck)
at Oracle.DataAccess.Client.OracleException.HandleError(Int32 errCode, OracleConnection conn, String procedure,
IntPtr opsErrCtx, OpoSqlValCtx* pOpoSqlValCtx, Object src, Boolean bCheck)
at Oracle.DataAccess.Client.OracleCommand.ExecuteNonQuery()
at eHIS.OracleDataAccessHelper.ORACLEHelper.ExecuteNonQuery(String connectionString, CommandType cmdType, String
sqlCommandText, OracleParameter[] parameterArray)
at eHIS.SALES.DataAccess.OraDataServiceProvider.AddIssueDetails(String Issue, String& IssueCode)
2)Error received back End : ( in alert log)
2.1) from alert Log:
Tue Jul 22 17:05:49 2014
Errors in file /u01/app/oracle/diag/rdbms/prod/PROD1/trace/PROD1_ora_11731176.trc (incident=82147):
ORA-00600: internal error code, arguments: [4511], [], [], [], [], [], [], [], [], [], [], []
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
2.2) from related .trc file:
DDE: Problem Key 'ORA 600 [4511]' was flood controlled (0x6) (incident: 82875)
ORA-00600: internal error code, arguments: [4511], [], [], [], [], [], [], [], [], [], [], []
DDE: Problem Key 'ORA 600 [4511]' was flood controlled (0x6) (incident: 82876)
*** 2014-07-22 16:44:40.284
*** CLIENT ID:() 2014-07-22 16:44:40.284
ORA-00600: internal error code, arguments: [4511], [], [], [], [], [], [], [], [], [], [], []
2.3) from related incident .trc file:
Dump continued from file: /u01/app/oracle/diag/rdbms/prod/PROD2/trace/PROD2_ora_13893936.trc
ORA-00600: internal error code, arguments: [4511], [], [], [], [], [], [], [], [], [], [], []
========= Dump for incident 95001 (ORA 600 [4511]) ========
*** 2014-07-22 16:22:26.385
dbkedDefDump(): Starting incident default dumps (flags=0x2, level=3, mask=0x0)
----- Current SQL Statement for this session (sql_id=8ax21gq9d3rb1) -----
UPDATE BOOKSTORE I SET I.QOHISTORY = (SELECT SUM(QTY) FROM QOHISTORY WHERE BOOKCODE = :B2 AND STORECODE = :B1 ),
I.HOLDQOHISTORY=I.HOLDQOHISTORY+ABS(:B3 ) WHERE I.BOOKCODE = :B2 AND I.STORECODE = :B1
----- PL/SQL Stack -----
----- PL/SQL Call Stack -----
object line object
handle number name
7000003c7a76dd8 241 function SALES.F_UPDATEINVENOTARY
7000004b1c7e110 334 procedure SALES.P_ADDISSUE
7000004aecaf568 1 anonymous block
3) Issue Description:
From the above errors I investigated related procedure and function. From the error, it seems there is a transaction issue with "BOOKSTORE" table. Same issue also logged by support team. But ORA-00600 is related to so many bugs. But this error which is clearly described in related incident .trc file that no update is happening in "BOOKSTORE" table. So initially it is clear. But we need to investigate more and confirm about the issue.
Note: From all all alert logs ( both node) and incident file, I came to know two table corrupted logically. BOOKSTORE and PURCHASEITEMS.
4) Confirmation investigations:
Here are some confirmation investigations:
4.1) Analyze the structure for validation:
e.g.,
While running "analyze table SALES.PURCHASEITEMS validate structure online;", we are getting below error:
SQL> analyze table SALES.PURCHASEITEMS validate structure online;
analyze table SALES.PURCHASEITEMS validate structure online
*
ERROR at line 1:
ORA-01498: block check failure - see trace file
Then I started tracing this with following way:
Analyze the table and upload tracefile generated if analyze fails, trace name should end in _ANALYZE.trc.
SQL> alter session set tracefile_identifier='ANALYZE';
SQL> analyze table SALES.PURCHASEITEMS validate structure online;
Here are some contents from generated .trc file:
*** 2014-07-17 21:33:09.884
*** SESSION ID:(408.36503) 2014-07-17 21:33:09.884
*** CLIENT ID:() 2014-07-17 21:33:09.884
*** SERVICE NAME:(SYS$USERS) 2014-07-17 21:33:09.884
*** MODULE NAME:(sqlplus@ehdb2 (TNS V1-V3)) 2014-07-17 21:33:09.884
*** ACTION NAME:() 2014-07-17 21:33:09.884
Block Checking: DBA = 185398234, Block Type = KTB-managed data block
data header at 0x70000017729c0ac
kdbchk: row locked by non-existent transaction
table=0 slot=19
lockid=4 ktbbhitc=5
Block header dump: 0x0b0cf3da
Object id on Block? Y
seg/obj: 0x122c8 csc: 0x0f.e2b866de itc: 5 flg: E typ: 1 - DATA
brn: 0 bdba: 0xb0cf300 ver: 0x01 opc: 0
inc: 0 exflg: 0
Itl Xid Uba Flag Lck Scn/Fsc
0x01 0x001d.011.000237cd 0x01001e9e.4eb3.2a C--- 0 scn 0x000e.fa71110c
0x02 0x0177.008.000004cd 0x0100022d.0092.11 C--- 0 scn 0x000e.5f6767bd
0x03 0x0092.00b.0003c7b9 0x00c0d3e4.7168.0b C--- 0 scn 0x000e.aebfb808
0x04 0x000d.01f.0006477b 0x0107008d.c219.25 C--- 0 scn 0x000e.d03f9c6f
0x05 0x0017.001.00036313 0x01017ea0.67db.36 C--- 0 scn 0x000e.fa70d161
bdba: 0x0b0cf3da
data_block_dump,data header at 0x70000017729c0ac
===============
tsiz: 0x1f50
hsiz: 0x8c
pbl: 0x70000017729c0ac
76543210
flag=--------
ntab=1
nrow=61
frre=-1
fsbo=0x8c
fseo=0x952
avsp=0x909
tosp=0x924
...........
..........
But this may not understandable directly unless we do readable form. If you have Oracle support ID, then you can raise SR and upload this trace file. If without error stucture validated, probably there is other cause. Then let us proceed
some other tests.
4.2) Using RMAN method:
check the files used by these two tables PURCHASEITEMS and BOOKSTORE with RMAN to see how many blocks are affected
$ rman target / nocatalog
b) RMAN> run {
allocate channel d1 type disk;
allocate channel d2 type disk;
allocate channel d3 type disk;
allocate channel d4 type disk;
backup check logical validate datafile x,y,z;
}
/** where x,y,z should be replaced by the file#
Output:
$ rman target / nocatalog
RMAN>
run {
allocate channel d1 type disk;
allocate channel d2 type disk;
allocate channel d3 type disk;
allocate channel d4 type disk;
backup check logical validate datafile 44,45,46;
} 2> 3> 4> 5> 6> 7>
allocated channel: d1
channel d1: SID=39 instance=PROD1 device type=DISK
allocated channel: d2
channel d2: SID=424 instance=PROD1 device type=DISK
allocated channel: d3
channel d3: SID=629 instance=PROD1 device type=DISK
allocated channel: d4
channel d4: SID=1358 instance=PROD1 device type=DISK
Starting backup at 22-JUL-14
channel d1: starting full datafile backup set
channel d1: specifying datafile(s) in backup set
input datafile file number=00044 name=+DATA/prod/datafile/sales01.dbf
channel d2: starting full datafile backup set
channel d2: specifying datafile(s) in backup set
input datafile file number=00045 name=+DATA/prod/datafile/sales_index01.dbf
channel d3: starting full datafile backup set
channel d3: specifying datafile(s) in backup set
input datafile file number=00046 name=+DATA/prod/datafile/sales_indx_01.dbf
channel d3: backup set complete, elapsed time: 00:00:16
List of Datafiles
=================
File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
---- ------ -------------- ------------ --------------- ----------
46 OK 0 309 18640 24375064492
File Name: +DATA/prod/datafile/sales_indx_01.dbf
Block Type Blocks Failing Blocks Processed
---------- -------------- ----------------
Data 0 0
Index 0 17782
Other 0 549
channel d2: backup set complete, elapsed time: 00:01:19
List of Datafiles
=================
File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
---- ------ -------------- ------------ --------------- ----------
45 OK 0 37946 664640 88021468605
File Name: +DATA/prod/datafile/sales_index01.dbf
Block Type Blocks Failing Blocks Processed
---------- -------------- ----------------
Data 0 0
Index 0 616016
Other 0 10678
channel d1: backup set complete, elapsed time: 00:01:32
List of Datafiles
=================
File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
---- ------ -------------- ------------ --------------- ----------
44 FAILED 0 48213 963536 88021469422
File Name: +DATA/prod/datafile/sales01.dbf
Block Type Blocks Failing Blocks Processed
---------- -------------- ----------------
Data 7 907742
Index 0 1
Other 0 7580
validate found one or more corrupt blocks
See trace file /u01/app/oracle/diag/rdbms/prod/PROD1/trace/PROD1_ora_21823514.trc for details
Finished backup at 22-JUL-14
released channel: d1
released channel: d2
released channel: d3
released channel: d4
RMAN>
RMAN> quit
Recovery Manager complete.
Here, RMAN reported no. of blocks corrupted with this statement, "Block Type Blocks Failing Blocks Processed". i.e., here we have some confirmation about block corruption. Next we will do some acid tests for max clarity.
4.3) Using DBMS packages:
Pre-requisites:
a) create 'REPAIR_TABLE' table :
Examples: Building a Repair Table or Orphan Key Table
The ADMIN_TABLE procedure is used to create, purge, or drop a repair table or an orphan key table.
A repair table provides information about the corruptions that were found by the CHECK_OBJECT procedure and how these will be addressed if the FIX_CORRUPT_BLOCKS procedure is run. Further, it is used to drive the execution of the FIX_CORRUPT_BLOCKS procedure.
Example: Creating a Repair TableThe following example creates a repair table for the users tablespace.
SET SERVEROUTPUT ON
BEGIN
DBMS_REPAIR.ADMIN_TABLES (
TABLE_NAME => 'REPAIR_TABLE',
TABLE_TYPE => dbms_repair.repair_table,
ACTION => dbms_repair.create_action,
TABLESPACE => 'USERS');
END;
/
b) Detecting Corruption :
The CHECK_OBJECT procedure checks the specified object, and populates the repair table with information about corruptions and repair directives. You can optionally specify a range, partition name, or subpartition name when you want to check a portion of an object.
Validation consists of checking all blocks in the object that have not previously been marked corrupt. For each block, the transaction and data layer portions are checked for self consistency. During CHECK_OBJECT, if a block is encountered that has a corrupt buffer cache header, then that block is skipped.
The following is an example of executing the CHECK_OBJECT procedure for the SALES.PURCHASEITEMS table.
SET SERVEROUTPUT ON
DECLARE num_corrupt INT;
BEGIN
num_corrupt := 0;
DBMS_REPAIR.CHECK_OBJECT (
SCHEMA_NAME => 'SALES',
OBJECT_NAME => 'PURCHASEITEMS',
REPAIR_TABLE_NAME => 'REPAIR_TABLE',
CORRUPT_COUNT => num_corrupt);
DBMS_OUTPUT.PUT_LINE('number corrupt: ' || TO_CHAR (num_corrupt));
END;
/
output:
number corrupt: 2
PL/SQL procedure successfully completed
set serveroutput on
DECLARE num_corrupt INT;
BEGIN
num_corrupt := 0;
DBMS_REPAIR.CHECK_OBJECT (
SCHEMA_NAME => 'SALES',
OBJECT_NAME => 'BOOKSTORE',
REPAIR_TABLE_NAME => 'REPAIR_TABLE',
corrupt_count => num_corrupt);
DBMS_OUTPUT.PUT_LINE('number corrupt: ' || TO_CHAR (num_corrupt));
END;
/
output:
number corrupt: 5
PL/SQL procedure successfully completed
Here it is confirmed, no. of corrupted blocks.
Run the below view to find comoplete information:
SQL> SELECT OBJECT_NAME, BLOCK_ID, CORRUPT_TYPE, MARKED_CORRUPT,CORRUPT_DESCRIPTION, REPAIR_DESCRIPTION
FROM REPAIR_TABLE;
OR
select object_id,object_name,BLOCK_ID, CORRUPT_TYPE, CORRUPT_DESCRIPTION from REPAIR_TABLE;
see: corrupt_blocks_output.xls
5) Corrective actions:
5.1) I can recommend some simplest method which needs to perform in downtime: Create copy tables after disbling constraints and truncate the currupted tables and then re-insert them. Example:
Step_1: take the counts
Check counts:
select count(1) from SALES.PURCHASEITEMS -- 491609
select count(1) from SALES.copy_PURCHASEITEMS -- 491609
select count(1) from SALES.BOOKSTORE -- 168806
select count(1) from SALES.copy_BOOKSTORE -- 168806
Step_2: Take logical backup -- when downtime starts, to avoid any other issues if entire schema size is less or take only required tables.
$ expdp directory=DATA_PUMP dumpfile=sales_22Jul14.dmp logfile=sales_22Jul14_exp.log schemas=SALES parallel=3 exclude=statistics cluster=NO
Step_3: Create copy table for two issued tables:
create table copy_PURCHASEITEMS as select * from SALES.PURCHASEITEMS;
alter table SALES.PURCHASEITEMS disable constraint FK_PURCHASEITEMS_PO;
truncate table SALES.PURCHASEITEMS;
insert into SALES.PURCHASEITEMS select * from SALES.copy_PURCHASEITEMS;
alter table SALES.PURCHASEITEMS enable constraint FK_PURCHASEITEMS_PO;
create table SALES.copy_BOOKSTORE as select * from SALES.BOOKSTORE ;
truncate table SALES.BOOKSTORE;
insert into SALES.BOOKSTORE select * from SALES.copy_BOOKSTORE;
Step_4: Analyze tables:
exec dbms_stats.gather_table_stats(ownname => 'SALES',tabname => 'PURCHASEITEMS',cascade => TRUE);
exec dbms_stats.gather_table_stats(ownname => 'SALES',tabname => 'BOOKSTORE',cascade => TRUE);
5.2) Using DBMS Packages
Fixing Corrupt Blocks
Use the FIX_CORRUPT_BLOCKS procedure to fix the corrupt blocks in specified objects based on information in the repair table that was generated by the CHECK_OBJECT procedure. Before changing a block, the block is checked to ensure that the block is still corrupt. Corrupt blocks are repaired by marking the block software corrupt. When a repair is performed, the associated row in the repair table is updated with a timestamp.
This example fixes the corrupt block in table SALES.PURCHASEITEMS that was reported by the CHECK_OBJECT procedure.
SET SERVEROUTPUT ON
DECLARE num_fix INT;
BEGIN
num_fix := 0;
DBMS_REPAIR.FIX_CORRUPT_BLOCKS (
SCHEMA_NAME => 'SALES',
OBJECT_NAME=> 'PURCHASEITEMS',
OBJECT_TYPE => dbms_repair.table_object,
REPAIR_TABLE_NAME => 'REPAIR_TABLE',
FIX_COUNT=> num_fix);
DBMS_OUTPUT.PUT_LINE('num fix: ' || TO_CHAR(num_fix));
END;
/
Simple & strait workaround:
-- start DBMS API to skip corrupt block flag
BEGIN
DBMS_REPAIR.SKIP_CORRUPT_BLOCKS (
SCHEMA_NAME => 'SALES',
OBJECT_NAME => 'BOOKSTORE',
OBJECT_TYPE => dbms_repair.table_object,
FLAGS => dbms_repair.SKIP_FLAG);
END;
/
-- Create a repair table
> create table SALES.BOOKSTORE_REPAIR as select * from SALES.BOOKSTORE;
-- Check counts
select count(*) from SALES.BOOKSTORE;
select count(*) from SALES.BOOKSTORE_REPAIR;
-- If the count is close
>truncate table SALES.BOOKSTORE;
>insert into SALES.BOOKSTORE select * from SALES.BOOKSTORE_REPAIR;
Note: In my experience, I found some junk data updated in table and anonyms row inserted in the table. Manually data corrected and anaonymous record deleted with a pl/sql programming. Here are few errors:
error-1:
insert into SALES.BOOKSTORE
select * from SALES.servicerequestdetails_repair ;
ORA-12899: value too large for column "SALES"."BOOKSTORE"."EXTRACHARGEFLAG" (actual: 83, maximum: 5)
error-2:
insert into SALES.BOOKSTORE
select * from SALES.BOOKSTORE_repair ;
ORA-12899: value too large for column "SALES"."BOOKSTORE"."CONSULTATIONDATE" (actual: 120, maximum: 7)
etc.
-- after correction, check the counts again.
>select count(*) from SALES.BOOKSTORE;
>select count(*) from SALES.BOOKSTORE_REPAIR;
-- Again set "no skip" flag
BEGIN
DBMS_REPAIR.SKIP_CORRUPT_BLOCKS (
SCHEMA_NAME => 'SALES',
OBJECT_NAME => 'BOOKSTORE',
OBJECT_TYPE => dbms_repair.table_object,
FLAGS => dbms_repair.NOSKIP_FLAG);
END;
/
-- Analyze the table
exec DBMS_STATS.gather_table_stats(ownname=> 'SALES',tabname=>'BOOKSTORE',estimate_percent => 100,cascade=>TRUE,degree=>1,granularity=>'AUTO',method_opt=>'FOR ALL INDEXED COLUMNS SIZE AUTO');
Start RMAN and connect to a target database.
Execute the
VALIDATE
command with the desired options.
For example, to validate all datafiles and control files (and the server
parameter file if one is in use), execute the following command at the
RMAN prompt:
RMAN> VALIDATE DATABASE;
RMAN> VALIDATE DATABASE;
Starting validate at 15-MAY-15
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=547 device type=DISK
channel ORA_DISK_1: starting validation of datafile
channel ORA_DISK_1: specifying datafile(s) for validation
input datafile file number=00053 name=/oradata/datafiles/PROD/PAYROLL01.dbf
......
.....
-- Validate backupset:
Alternatively, you can validate a particular backup set by using the
form of the command shown in the following example (sample output
included).
RMAN> VALIDATE BACKUPSET 22;
Below query can be used to find any corrupted block is avialbe in database or not:
set head on;
set pagesize 2000
set linesize 250
select * from v$database_block_corruption;
SELECT e.owner, e.segment_type, e.segment_name, e.partition_name, c.file#
, greatest(e.block_id, c.block#) corr_start_block#
, least(e.block_id+e.blocks-1, c.block#+c.blocks-1) corr_end_block#
, least(e.block_id+e.blocks-1, c.block#+c.blocks-1)
- greatest(e.block_id, c.block#) + 1 blocks_corrupted
, null description
FROM dba_extents e, v$database_block_corruption c
WHERE e.file_id = c.file#
AND e.block_id <= c.block# + c.blocks - 1
AND e.block_id + e.blocks - 1 >= c.block#
UNION
SELECT s.owner, s.segment_type, s.segment_name, s.partition_name, c.file#
, header_block corr_start_block#
, header_block corr_end_block#
, 1 blocks_corrupted
, 'Segment Header' description
FROM dba_segments s, v$database_block_corruption c
WHERE s.header_file = c.file#
AND s.header_block between c.block# and c.block# + c.blocks - 1
UNION
SELECT null owner, null segment_type, null segment_name, null partition_name, c.file#
, greatest(f.block_id, c.block#) corr_start_block#
, least(f.block_id+f.blocks-1, c.block#+c.blocks-1) corr_end_block#
, least(f.block_id+f.blocks-1, c.block#+c.blocks-1)
- greatest(f.block_id, c.block#) + 1 blocks_corrupted
, 'Free Block' description
FROM dba_free_space f, v$database_block_corruption c
WHERE f.file_id = c.file#
AND f.block_id <= c.block# + c.blocks - 1
AND f.block_id + f.blocks - 1 >= c.block#
order by file#, corr_start_block#;
Note: If you have "NOLOGING" block corruption, you can follow aboev method, but in alert log message will come. Below query can be used to find "NOLOGGING" block corruptions.
select * from v$database_block_corruption
where CORRUPTION_TYPE='NOLOGGING';
You can use below query since when this corruption occured:
select file#, block#, first_time, next_time
from v$archived_log, v$database_block_corruption
where CORRUPTION_CHANGE# between first_change# and next_change#
and CORRUPTION_TYPE='NOLOGGING';
Fix: Keep your database in FORCE LOGGING mode.
sql> connect sys as sysdba
sql> alter database force logging;
If you are getting LOB segments as corrupted block, then first find which segment it refers. Use below query:
e.g., Assume, 'SYS_LOB0004142294C00005$$' came as corrupted segment.
select * from dba_lobs where segment_name='SYS_LOB0004142294C00005$$'
Find & repair Corrupt block in Oracle database
-- Find if any block (Table) is corrupted.
SQL> select * from V$DATABASE_BLOCK_CORRUPTION -- will show if any corruped block
Below query can give you information about corrupted block:
set head on;
set pagesize 2000
set linesize 250
select * from v$database_block_corruption;
SELECT e.owner, e.segment_type, e.segment_name, e.partition_name, c.file#
, greatest(e.block_id, c.block#) corr_start_block#
, least(e.block_id+e.blocks-1, c.block#+c.blocks-1) corr_end_block#
, least(e.block_id+e.blocks-1, c.block#+c.blocks-1)
- greatest(e.block_id, c.block#) + 1 blocks_corrupted
, null description
FROM dba_extents e, v$database_block_corruption c
WHERE e.file_id = c.file#
AND e.block_id <= c.block# + c.blocks - 1
AND e.block_id + e.blocks - 1 >= c.block#
UNION
SELECT s.owner, s.segment_type, s.segment_name, s.partition_name, c.file#
, header_block corr_start_block#
, header_block corr_end_block#
, 1 blocks_corrupted
, 'Segment Header' description
FROM dba_segments s, v$database_block_corruption c
WHERE s.header_file = c.file#
AND s.header_block between c.block# and c.block# + c.blocks - 1
UNION
SELECT null owner, null segment_type, null segment_name, null partition_name, c.file#
, greatest(f.block_id, c.block#) corr_start_block#
, least(f.block_id+f.blocks-1, c.block#+c.blocks-1) corr_end_block#
, least(f.block_id+f.blocks-1, c.block#+c.blocks-1)
- greatest(f.block_id, c.block#) + 1 blocks_corrupted
, 'Free Block' description
FROM dba_free_space f, v$database_block_corruption c
WHERE f.file_id = c.file#
AND f.block_id <= c.block# + c.blocks - 1
AND f.block_id + f.blocks - 1 >= c.block#
order by file#, corr_start_block#;
1) Collect file ids
sql> select distinct file_id from dba_extents;
2) Collect details
sql>SELECT file_id,segment_name,segment_type,owner,tablespace_name,block_id,blocks
FROM sys.dba_extents
WHERE (file_id between 1 and 23) AND 658438 BETWEEN block_id AND block_id + blocks - 1;
3) Repair - If Table
a) Collect all data to temporary table and collect all DDL script and grants.
b) drop the table and re-create it with DDL script. (Disable refence key before drop, enable after create table)
c) Insert all records to the table
Note: This entire activity should not be taken in prod databases without Oracle support's recommendation.