Search

Sunday, June 19, 2016

Clustering Factor - a key metrics for SQL Tuing in Oracle 11g

We will discuss the following points in this article:

1) What is clustering factor?
2) How the clustering factor is being calculated?
3) How to interpret clustering factor?
4) Bad Clustering factor Vs Good Clustering factor
5) Demonstration
6) FAQ

What is clustering factor?

The clustering factor is a number which represent the degree to which data is randomly distributed in a table. In simple terms it is the number of “block switches” while reading a table using an index. Alternatively, The clustering factor is a measure of the ordered-ness of an index in comparison to the table that it is based on. It is used to check the cost of a table lookup following an index access (multiplying the clustering factor by index’s selectivity gives you the cost of the operation).

How the clustering factor is being calculated?

To calculate the clustering factor of an index during the gathering of index statistics, Oracle does the following.

For each entry in the index Oracle compares the entry's table rowid block with the block of the previous index entry.

If the block is different, Oracle increments the clustering factor by 1.

The minimum possible clustering factor is equal to the number of blocks identified through the index's list of rowid's -- for an index on a column or columns containing no nulls, this will be equal to the number of blocks in the table that contain data. The maximum clustering factor is the number of entries in the index.

How to interpret clustering factor? 

So this means that Oracle now has a statistic to allow it to estimate how many table blocks would be associated with an index range scan.

If the clustering factor is close to the number of entries in the index, then an index range scan of 1000 index entries may require nearly 1000 blocks to be read from the table.

If the clustering factor is close to the number of blocks in the table, then an index range scan of 1000 index entries may require only 50 blocks to be read from the table.

This can be compared with the cost of reading the entire table up to the high-water mark (using the more efficient multiblock i/o mechanism) to determine whether a full table scan or an index range scan offers the most efficient access mechanism.

Note that where extensive deletes have occurred from the table there may be blocks with no rows in. These will be accounted for in the clustering factor because those blocks will not appear in the index's rowid list. The full table scan will still read all table blocks up to the high water mark, regardless of whether they contain rows or not. So in an extreme case it is possible that Oracle could see from the index and table statistics that although a table has 1,000,000 blocks below the high water mark, reading 100% of the rows in the table might only require reading 10 of those blocks. Providing that "Not Null" constraints tell Oracle that all table rows are present in the index, a query such as "select * from big_table_with_few_rows" might be more efficiently satisfied with an index range scan than with a full table scan.

Note also that calculating the clustering factor is done without reference to the table at all -- it is based solely on information contained in the index.


Bad Clustering factor Vs Good Clustering factor

Fig-1 : Bad CF

The above diagram explains that how scatter the rows of the table are. The first index entry (from left of index) points to the first data block and second index entry points to second data block. So while making index range scan or full index scan, optimizer have to switch between blocks and have to revisit the same block more than once because rows are scatter. So the number of times optimizer will make these switches is actually termed as “Clustering factor”.

Fig-2 : Good CF

The above image represents "Good CF”. In an event of index range scan, optimizer will not have to jump to next data block as most of the index entries points to same data block. This helps significantly in reducing the cost of your SELECT statements.

Clustering factor is stored in data dictionary and can be viewed from dba_indexes (or user_indexes)

Demonstration:

SQL> create table CF_TEST as select * from all_objects;
Table created.

SQL> create index obj_id_indx on CF_TEST(object_id);
Index created.

SQL> select clustering_factor from user_indexes where index_name='OBJ_ID_INDX';

CLUSTERING_FACTOR
-----------------
             2165
SQL> 
SQL> SQL> select count(*) from CF_TEST;

  COUNT(*)
----------
    103090

SQL> select blocks from user_segments where segment_name='OBJ_ID_INDX';

    BLOCKS
----------
       256

The above example shows that index has to jump 2165 times to give you the full data had you performed full table scan using the index.

Note:
- A good CF is equal (or near) to the values of number of blocks of table.
- A bad CF is equal (or near) to the number of rows of table.

Myth:
Rebuilding of index can improve the CF.

Tip:
The clustering of data within the table can be used to improve the performance of statements that perform range scan–type operations. By determining how the column is being used in the statements, indexing these column(s) may provide a great benefit.

The clustering factor records the number of blocks that will be read when scanning the index. If the index being used has a large clustering factor, then more table data blocks have to be visited to get the rows in each index block (because adjacent rows are in different blocks). If the clustering factor is close to the number of blocks in the table, then the index is well ordered, but if the clustering factor is close to the number of rows in the table, then the index is not well ordered. The clustering factor is computed by the following (explained briefly):

  • The index is scanned in order.
  • The block portion of the ROWID pointed at by the current indexed valued is compared to the previous indexed value (comparing adjacent rows in the index).
  • If the ROWIDs point to different TABLE blocks, the clustering factor is incremented (this is done for the entire index).

The CLUSTERING_FACTOR column in the USER_INDEXES view gives an indication as to how organized the data is compared to the indexed columns. If the value of the CLUSTERING_FACTOR column value is close to the number of leaf blocks in the index, the data is well ordered in the table. If the value is not close to the number of leaf blocks in the index, then the data in the table is not well ordered. The leaf blocks of an index store the indexed values as well as the ROWIDs to which they point.

For example, say the CUSTOMER_ID for the CUSTOMERS table is generated from a sequence generator, and the CUSTOMER_ID is the primary key on the table. The index on CUSTOMER_ID would have a clustering factor very close to the number of leaf blocks (well ordered). As the customers are added to the database, they are stored sequentially in the table in the same way the sequence numbers are issued from the sequence generator (well ordered). An index on the CUSTOMER_NAME column would have a very high clustering factor, however, because the arrangement of the customer names is random throughout the table.

The clustering factor can impact SQL statements that perform range scans. With a low clustering factor (relative to the number of leaf blocks), the number of blocks needed to satisfy the query is reduced. This increases the possibility that the data blocks are already in memory. A high clustering factor relative to the number of leaf blocks may increase the number of data blocks required to satisfy a range query based on the indexed column.

FAQ:
1) The clustering_facotr column in the user_indexes view is a measure of how organized the data is compared to the indexed column, is there any way i can improve clustering factor of a index. or how to improve it?

Ans:
Already we discussed. Again trying to elaborate.

It tells us how ordered the rows are in the index. If CLUSTERING_FACTOR approaches the number of blocks in the table, the rows are ordered.  If it approaches the number of rows in the table, the rows are randomly ordered.  In such a case (clustering factor near the number of rows), it is unlikely that index entries in the same leaf block will point to rows in the same data blocks.

Note that typically only 1 index per table will be heavily clustered (if any).  It would be extremely unlikely for 2 indexes to be very clustered. If you want an index to be very clustered -- consider using index organized tables.  They force the rows into a specific physical location based on their index entry.

Otherwise, a rebuild of the table is the only way to get it clustered (but you really don't want to get into that habit for what will typically be of marginal overall improvement).

2) What is well order of Index as part of clustering factor?
Ans:
In general, if all of the index entries in a given leaf block point to the same block, then
the table is well ordered with regards to this index.

If all of the index entries in a given leaf block point to different blocks in the table , then
the table is not well ordered with respect to this index.