Oracle Tips and Tricks — David Fitzjarrell

September 26, 2016

“My Kingdom For A Histogram”

Filed under: General — dfitzjarrell @ 09:55

"Don't you know anything at all about numbers?"
"Well, I don't think they're very important," snapped Milo, too embarrassed to admit the truth.
"NOT IMPORTANT!" roared the Dodecahedron, turning red with fury.  "Could you have tea for two without the two, or three blind mice
without the three? Would there be four corners of the earth if there weren't a four? And how would you sail the seven seas without
a seven?"
"All I meant was..." began Milo, but the Dodecahedron, overcome with emotion and shouting furiously, carried right on.
"If you had high hopes, how would you know how high they were? And did you know that narrow escapes come in all different widths? 
Would you travel the whole wide world without ever knowing how wide it was? And how could you do anything at long last," he concluded,
waving his arms over his head, "without knowing how long the last was? Why, numbers are the most beautiful and valuable things in the
world.  Just follow me and I'll show you." He turned on his heel and stalked off into the cave.
-- Norton Juster, The Phantom Tollbooth

In one of the current Oracle forums an “old saw” was resurrected, namely that a single-column primary key has no need of a histogram. This presumes that the primary key has no ‘holes’ in the data and has a continuous range of data. The reality of that situation is such primary keys are rarely, if ever, unbroken and the range of data may not be continuous. With that last sentence in mind it’s not strange to consider a histogram on a single-column primary key. And depending on the release of Oracle in use that histogram may be an old standard (a height-balanced histogram) or a new histogram offered in 12c, the hybrid histogram. Let’s look at an example how this plays out in 11.2.0.4 and in 12.1.0.2. First, though, let’s look at that new Hybrid histogram.

The Hybrid histogram is a new type of histogram that uses a number of buckets less than the number of distinct values in the table data; it also determines, and utilizes, the repeat frequency of the endpoint values. The bucket size can be adjusted from the initial size based on endpoint values; there are two special buckets which will not change in size, and these are the ‘boundary’ buckets, bucket 0 and the last bucket of the histogram. The Hybrid histogram is based on four apparent ‘rules’, which are:


	1) a value should not be found in more than one bucket
	2) the bucket size is allowed to be extended in order to contain all instances of the same distinct value
	3) adjusted bucket size cannot be less than the original size (not applicable at either end of the data set)
	4) the original number of buckets should not be reduced

One way to create a hybrid histogram is shown in the example below, part of which was provided by Jonathan Lewis, the remainder of this example was adapted from work done by Mohammed Houri. We begin:


SQL>
SQL> --
SQL> -- Create and populate a table so we can
SQL> -- build a hybrid histogram
SQL> --
SQL> -- Thanks to Jonathan Lewis for generating
SQL> -- this table and data set
SQL> --
SQL> create table t1 (id number, n1 number);

Table created.

Elapsed: 00:00:00.01
SQL>
SQL> @InsT1.sql

Elapsed: 00:00:00.00
SQL>

Choosing a number of buckets less than the number of distinct values in the table is one way to create a hybrid histogram:


SQL>
SQL> --
SQL> -- Return the number of distinct values
SQL> -- in column n1
SQL> --
SQL>
SQL> select count(distinct n1) from t1;

COUNT(DISTINCTN1)
-----------------
               37

Elapsed: 00:00:00.00
SQL>

20 buckets will be used for our histogram and use DBMS_STATS to create it:


SQL>
SQL> --
SQL> -- Create a hybrid histogram on column n1
SQL> -- by setting the number of buckets less
SQL> -- than the number of distinct values
SQL> --
SQL> BEGIN
  2  	dbms_stats.gather_table_stats
  3  	  (user, 't1', method_opt => 'for columns n1 size 20');
  4  END;
  5  /

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.43
SQL>

During the creation of this histogram Oracle places the column values in order to find values that occur in more than one bucket. When such values are found new bucket sizes are computed for all but the boundary buckets, in this case bucket 0 and bucket 19:


SQL>
SQL> --
SQL> -- Four rules exist to create a hybrid histogram, and the
SQL> -- first two are:
SQL> --
SQL> -- a value should not be found in more than one bucket
SQL> -- a bucket size is allowed to be extended in order to contain
SQL> --   all instances of the same value
SQL> --
SQL> -- Compute the 'new' bucket size based on the data and
SQL> -- endpoint values
SQL> --
SQL> -- To illustrate how the bucket size adjustment is
SQL> -- calculated let's look at the value 13 in the data
SQL> --
SQL> -- Oracle 'walks' the ordered list of values until
SQL> -- it finds 13 in two buckets
SQL> --
SQL> -- Oracle extends the bucket where 13 is first found (bucket n)
SQL> -- and moves all other instances of 13 (from bucket n+1)
SQL> -- to this bucket (n)
SQL> --
SQL> -- Moving all instances of 13 to a single bucket
SQL> -- also makes 13 an endpoint value allowing Oracle
SQL> -- to compute the repeat count of that value
SQL> --
SQL> -- This continues on through the values so that
SQL> -- Oracle can place all occurrences of a distinct
SQL> -- value so they don't populate more than one bucket
SQL> --
SQL> SELECT
  2  	     (row_number() over(order by ept_nbr)-1) NumBucket
  3  	     ,ept_nbr
  4  	     ,ept_act_val
  5  	     ,rpt_cnt
  6  	     ,ept_nbr - (lag(ept_nbr,1,0) over(order by ept_nbr)) "new bucket size"
  7  	     ,bucket_size "original bucket_size"
  8  	 FROM
  9  	     (SELECT
 10  		  ah.endpoint_number		ept_nbr
 11  		 ,ah.endpoint_actual_value	ept_act_val
 12  		 ,lag(ah.endpoint_number,1,0) over(order by ah.endpoint_number) ept_lag
 13  		 ,ah.endpoint_repeat_count rpt_cnt
 14  		 ,at.sample_size/at.num_buckets bucket_size
 15  	      FROM
 16  		 user_tab_histograms	  ah
 17  		,user_tab_col_statistics  at
 18  	      WHERE ah.table_name  = at.table_name
 19  	      AND ah.column_name = at.column_name
 20  	      AND ah.table_name  = 'T1'
 21  	      AND ah.column_name = 'N1'
 22  	    ) ORDER BY ept_nbr;

 NUMBUCKET    EPT_NBR EPT_ACT_VAL     RPT_CNT new bucket size original bucket_size
---------- ---------- ------------ ---------- --------------- --------------------
         0          1 8                     1               1                    5
         1          6 13                    3               5                    5
         2         12 18                    2               6                    5
         3         20 20                    5               8                    5
         4         26 23                    2               6                    5
         5         32 26                    3               6                    5
         6         38 27                    6               6                    5
         7         44 28                    6               6                    5
         8         50 29                    6               6                    5
         9         58 31                    5               8                    5
        10         69 33                    8              11                    5

 NUMBUCKET    EPT_NBR EPT_ACT_VAL     RPT_CNT new bucket size original bucket_size
---------- ---------- ------------ ---------- --------------- --------------------
        11         79 35                    7              10                    5
        12         86 38                    5               7                    5
        13         90 41                    1               4                    5
        14         92 42                    2               2                    5
        15         95 43                    3               3                    5
        16         96 44                    1               1                    5
        17         97 45                    1               1                    5
        18         98 46                    1               1                    5
        19        100 59                    1               2                    5

20 rows selected.

Elapsed: 00:00:00.21
SQL>

Notice that for buckets 1-12 the bucket size increased; for the remaining non-boundary buckets the bucket size should decrease. But according to Rule 3 no bucket can be adjusted to a size smaller than the original bucket size so those buckets remain unadjusted. The bucket size computation in the query results below is based on the sample size divided by the number of buckets and, thus, displays a constant value for all buckets:


SQL>
SQL> --
SQL> -- Given that boundary values are the exception computing the new
SQL> -- bucket sizes shows that for bucket numbers 1-12 the sizes are
SQL> -- the same or larger than the originally computed size (5)
SQL> --
SQL> -- All remaining non-endpoint buckets compute to a smaller size
SQL> -- than the original
SQL> --
SQL> -- Now the third rule appears:
SQL> --
SQL> -- No non-endpoint bucket can be smaller than the originally
SQL> -- computed size
SQL> --
SQL> select
  2  	      uth.endpoint_number
  3  	     ,uth.endpoint_actual_value
  4  	     ,uth.endpoint_repeat_count
  5  	     ,ucs.sample_size/ucs.num_buckets bucket_size
  6  	     ,(uth.endpoint_repeat_count - ucs.sample_size/ucs.num_buckets) Popularity
  7  	 from
  8  	     user_tab_histograms uth
  9  	    ,user_tab_col_statistics ucs
 10  	where
 11  	     uth.table_name   = ucs.table_name
 12  	 and uth.column_name   = ucs.column_name
 13  	 and uth.table_name    = 'T1'
 14  	 and uth.column_name   = 'N1'
 15  	order by uth.endpoint_number;

ENDPOINT_NUMBER ENDPOINT_ACT ENDPOINT_REPEAT_COUNT BUCKET_SIZE POPULARITY
--------------- ------------ --------------------- ----------- ----------
              1 8                                1           5         -4
              6 13                               3           5         -2
             12 18                               2           5         -3
             20 20                               5           5          0
             26 23                               2           5         -3
             32 26                               3           5         -2
             38 27                               6           5          1
             44 28                               6           5          1
             50 29                               6           5          1
             58 31                               5           5          0
             69 33                               8           5          3

ENDPOINT_NUMBER ENDPOINT_ACT ENDPOINT_REPEAT_COUNT BUCKET_SIZE POPULARITY
--------------- ------------ --------------------- ----------- ----------
             79 35                               7           5          2
             86 38                               5           5          0
             90 41                               1           5         -4
             92 42                               2           5         -3
             95 43                               3           5         -2
             96 44                               1           5         -4
             97 45                               1           5         -4
             98 46                               1           5         -4
            100 59                               1           5         -4

20 rows selected.

Elapsed: 00:00:00.19
SQL>

For a Hybrid histogram three types of column value exist to base cardinality estimates upon: popular values, non-popular values which are endpoint and non-popular values that are not an endpoint. Let’s look at how Oracle estimates the cardinality for each case, starting with popular values:


SQL>
SQL> --
SQL> -- Display the endpoint, value, repeat count,
SQL> -- bucket size and the 'popularity'
SQL> --
SQL> -- Popularity is determined by computing the
SQL> -- difference between the frequency of appearance
SQL> -- (repeat count) and the bucket size (sample size/number of buckets)
SQL> -- Positive values are considered 'popular'
SQL> --
SQL> select
  2  	      endpoint_number
  3  	     ,endpoint_actual_value
  4  	     ,endpoint_repeat_count
  5  	     ,bucket_size
  6  	     ,case when Popularity > 0 then 'Pop'
  7  			else 'Non-Pop'
  8  	       end Popularity
  9  	 from
 10  	(
 11  	  select
 12  	      uth.endpoint_number
 13  	     ,uth.endpoint_actual_value
 14  	     ,uth.endpoint_repeat_count
 15  	     ,ucs.sample_size/ucs.num_buckets bucket_size
 16  	     ,(uth.endpoint_repeat_count - ucs.sample_size/ucs.num_buckets) Popularity
 17  	 from
 18  	     user_tab_histograms uth
 19  	    ,user_tab_col_statistics ucs
 20  	where
 21  	     uth.table_name   = ucs.table_name
 22  	 and uth.column_name   = ucs.column_name
 23  	 and uth.table_name    = 'T1'
 24  	 and uth.column_name   = 'N1'
 25  	 )
 26  	order by endpoint_number;

ENDPOINT_NUMBER ENDPOINT_ACT ENDPOINT_REPEAT_COUNT BUCKET_SIZE POPULAR
--------------- ------------ --------------------- ----------- -------
              1 8                                1           5 Non-Pop
              6 13                               3           5 Non-Pop
             12 18                               2           5 Non-Pop
             20 20                               5           5 Non-Pop
             26 23                               2           5 Non-Pop
             32 26                               3           5 Non-Pop
             38 27                               6           5 Pop
             44 28                               6           5 Pop
             50 29                               6           5 Pop
             58 31                               5           5 Non-Pop
             69 33                               8           5 Pop

ENDPOINT_NUMBER ENDPOINT_ACT ENDPOINT_REPEAT_COUNT BUCKET_SIZE POPULAR
--------------- ------------ --------------------- ----------- -------
             79 35                               7           5 Pop
             86 38                               5           5 Non-Pop
             90 41                               1           5 Non-Pop
             92 42                               2           5 Non-Pop
             95 43                               3           5 Non-Pop
             96 44                               1           5 Non-Pop
             97 45                               1           5 Non-Pop
             98 46                               1           5 Non-Pop
            100 59                               1           5 Non-Pop

20 rows selected.

Elapsed: 00:00:00.19
SQL>

Let’s return only the popular values and work through the cardinality estimate Oracle generates:


SQL> --
SQL> -- Using the popularity Oracle estimates the cardinality
SQL> -- by considering the following three types of values:
SQL> --
SQL> -- popular value
SQL> -- non-popular value with an endpoint number
SQL> -- non-popular value not present in the histogram table
SQL> --
SQL> -- Case 1: Popular values
SQL> --
SQL> -- Display the 'popular' values in this data set
SQL> --
SQL> select
  2  	      endpoint_actual_value
  3  	     ,endpoint_repeat_count
  4  	     ,bucket_size
  5  	     ,Popularity
  6  from
  7  (
  8    select
  9  	      endpoint_number
 10  	     ,endpoint_actual_value
 11  	     ,endpoint_repeat_count
 12  	     ,bucket_size
 13  	     ,case when Popularity > 0 then 'Pop'
 14  			else 'Non-Pop'
 15  	       end Popularity
 16  	 from
 17  	(
 18  	  select
 19  	      uth.endpoint_number
 20  	     ,uth.endpoint_actual_value
 21  	     ,uth.endpoint_repeat_count
 22  	     ,ucs.sample_size/ucs.num_buckets bucket_size
 23  	     ,(uth.endpoint_repeat_count - ucs.sample_size/ucs.num_buckets) Popularity
 24  	 from
 25  	     user_tab_histograms uth
 26  	    ,user_tab_col_statistics ucs
 27  	where
 28  	     uth.table_name   = ucs.table_name
 29  	 and uth.column_name   = ucs.column_name
 30  	 and uth.table_name    = 'T1'
 31  	 and uth.column_name   = 'N1'
 32  	 )
 33    )
 34  where Popularity = 'Pop';

ENDPOINT_ACT ENDPOINT_REPEAT_COUNT BUCKET_SIZE POPULAR
------------ --------------------- ----------- -------
27                               6           5 Pop
28                               6           5 Pop
29                               6           5 Pop
33                               8           5 Pop
35                               7           5 Pop

Elapsed: 00:00:00.20
SQL>

Looking at one of these popular values (33) let’s return the cardinality estimate Oracle calculated:


SQL> --
SQL> -- Using explain plan the cardinality estimation
SQL> -- can be displayed for one of the 'popular' values
SQL> --
SQL> explain plan for select count(1) from t1 where n1 = 33;

Explained.

Elapsed: 00:00:00.00
SQL>
SQL> --
SQL> -- The cardinality estimate displayed below was
SQL> -- calculated using the following formula:
SQL> --
SQL> -- E-Rows = ENDPOINT_REPEAT_COUNT * num_rows/sample_size
SQL> -- E-Rows = 8 * 100/100 = 8
SQL> --
SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
Plan hash value: 3724264953

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     3 |     3   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |     3 |            |          |
|*  2 |   TABLE ACCESS FULL| T1   |     8 |    24 |     3   (0)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------

   2 - filter("N1"=33)

14 rows selected.

Elapsed: 00:00:00.03
SQL>

From the formula found in the above comments our calculated cardinality matches with that provided by the optimizer, so our understanding of the Hybrid histogram seems sound. To continue that verification we consider the case where we have a non-popular value that is an endpoint. The formula for that cardinality calculation is: E-Rows = num_rows * greatest (NewDensity, ENDPOINT_REPEAT_COUNT/sample_size). We calculate the NewDensity with a query based on the work of Alberto Dell’Era. First we return the non-popular values that are endpoints:


SQL> --
SQL> -- Case 2: Non-popular values with an endpoint
SQL> --
SQL> -- First return all non-popular values that are an
SQL> -- endpoint
SQL> --
SQL> select
  2  	      endpoint_actual_value
  3  	     ,endpoint_repeat_count
  4  	     ,bucket_size
  5  	     ,Popularity
  6  from
  7  (
  8    select
  9  	      endpoint_number
 10  	     ,endpoint_actual_value
 11  	     ,endpoint_repeat_count
 12  	     ,bucket_size
 13  	     ,case when Popularity > 0 then 'Pop'
 14  			else 'Non-Pop'
 15  	       end Popularity
 16  	 from
 17  	(
 18  	  select
 19  	      uth.endpoint_number
 20  	     ,uth.endpoint_actual_value
 21  	     ,uth.endpoint_repeat_count
 22  	     ,ucs.sample_size/ucs.num_buckets bucket_size
 23  	     ,(uth.endpoint_repeat_count - ucs.sample_size/ucs.num_buckets) Popularity
 24  	 from
 25  	     user_tab_histograms uth
 26  	    ,user_tab_col_statistics ucs
 27  	where
 28  	     uth.table_name   = ucs.table_name
 29  	 and uth.column_name   = ucs.column_name
 30  	 and uth.table_name    = 'T1'
 31  	 and uth.column_name   = 'N1'
 32  	 )
 33    )
 34  where Popularity = 'Non-Pop';

ENDPOINT_ACT ENDPOINT_REPEAT_COUNT BUCKET_SIZE POPULAR
------------ --------------------- ----------- -------
8                                1           5 Non-Pop
13                               3           5 Non-Pop
18                               2           5 Non-Pop
20                               5           5 Non-Pop
23                               2           5 Non-Pop
26                               3           5 Non-Pop
31                               5           5 Non-Pop
38                               5           5 Non-Pop
41                               1           5 Non-Pop
42                               2           5 Non-Pop
43                               3           5 Non-Pop

ENDPOINT_ACT ENDPOINT_REPEAT_COUNT BUCKET_SIZE POPULAR
------------ --------------------- ----------- -------
44                               1           5 Non-Pop
45                               1           5 Non-Pop
46                               1           5 Non-Pop
59                               1           5 Non-Pop

15 rows selected.

Elapsed: 00:00:00.20
SQL>

Let’s return Oracle’s estimated cardinality for one of these values (45):


SQL> --
SQL> -- Use explain plan to return the estimated cardinality
SQL> --
SQL> -- In this case the estimated cardinality is computed as:
SQL> --
SQL> -- E-Rows = num_rows * greatest (NewDensity, ENDPOINT_REPEAT_COUNT/sample_size)
SQL> --
SQL> explain plan for select count(1) from t1 where n1 = 45;

Explained.

Elapsed: 00:00:00.00
SQL>
SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
Plan hash value: 3724264953

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     3 |     3   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |     3 |            |          |
|*  2 |   TABLE ACCESS FULL| T1   |     2 |     6 |     3   (0)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
---------------------------------------------------

   2 - filter("N1"=45)

14 rows selected.

Elapsed: 00:00:00.02
SQL>

Now let’s use our query to return the value for NewDensity:


SQL> --
SQL> -- The NewDensity is computed by an internal algorithm but
SQL> -- we can get a reliable density using the following query
SQL> -- based on work done by Alberto Dell'Era:
SQL> --
SQL> SELECT
  2  	      BktCnt
  3  	     ,PopBktCnt
  4  	     ,PopValCnt
  5  	     ,NDV
  6  	     ,pop_bucketSize
  7  	     ,trunc(((BktCnt-PopBktCnt)/BktCnt)/(NDV-PopValCnt),10) NewDensity
  8  	  FROM
  9  	     (SELECT
 10  		COUNT(1) PopValCnt,
 11  		SUM(endpoint_repeat_count) PopBktCnt,
 12  		ndv,
 13  		BktCnt,
 14  		pop_bucketSize
 15  	      FROM
 16  	       (SELECT
 17  		 (sample_size - num_nulls) BktCnt,
 18  		 num_distinct ndv,
 19  		 num_buckets,
 20  		 density OldDensity,
 21  		 (sample_size-num_nulls)/num_buckets pop_bucketSize
 22  	       FROM user_tab_col_statistics
 23  	       WHERE
 24  		   table_name  = 'T1'
 25  	       AND column_name = 'N1'
 26  	       ),
 27  	       user_histograms
 28  	     WHERE table_name	      = 'T1'
 29  	     AND column_name	      = 'N1'
 30  	     AND endpoint_repeat_count> pop_bucketSize
 31  	     GROUP BY ndv,
 32  	       BktCnt,
 33  	       pop_bucketSize
 34  	     );

    BKTCNT  POPBKTCNT  POPVALCNT        NDV POP_BUCKETSIZE NEWDENSITY
---------- ---------- ---------- ---------- -------------- ----------
       100         33          5         37              5   .0209375

Elapsed: 00:00:00.19
SQL>

We now see if our calculation returned a valid value by ‘plugging’ it into the following equation: E-Rows = num_rows * greatest (NewDensity, ENDPOINT_REPEAT_COUNT/ sample_size). The cardinality estimate we compute is:


SQL> --
SQL> -- 'Plugging' this new density into the equation above produces:
SQL> --
SQL> -- E-Rows = num_rows * greatest (NewDensity, ENDPOINT_REPEAT_COUNT/ sample_size)
SQL> -- E-Rows = 100 * greatest (.0209375, 1/100) = 2.09375 ~ 2
SQL> --
SQL> -- which matches the value returned by Oracle
SQL> --
SQL>
SQL> --
SQL> -- To validate these findings use a different non-popular value
SQL> -- that's also an endpoint:
SQL> --
SQL> explain plan for select count(1) from t1 where n1 = 43;

Explained.

Elapsed: 00:00:00.00
SQL>
SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
Plan hash value: 3724264953

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     3 |     3   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |     3 |            |          |
|*  2 |   TABLE ACCESS FULL| T1   |     3 |     9 |     3   (0)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------

   2 - filter("N1"=43)

14 rows selected.

Elapsed: 00:00:00.02
SQL>

Oracle returned 3 as the estimated cardinality and so did our calculation (E-Rows = 100 * greatest (.0209375, 3/100) = 3), more proof that our concept of the Hybrid histogram is correct. One final test will prove how well we understand this histogram, considering non-popular values that are not endpoints. Using our calculated value of NewDensity in this equation: E-Rows = num_rows * NewDensity gives us the following result: E-Rows = 100 * .0209375 = 2.09375 ~ 2. Time to see if our calculated value matches that which Oracle has estimated:


SQL>
SQL> --
SQL> -- Case 3: Non-popular value without an endpoint number
SQL> --
SQL> -- This calculation is fairly simple and straightforward:
SQL> --
SQL> -- E-Rows = num_rows * NewDensity = 100 * .0209375 = 2.09375 ~ 2
SQL> --
SQL>
SQL> explain plan for select count(1) from t1 where n1 = 17;

Explained.

Elapsed: 00:00:00.00
SQL>
SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
Plan hash value: 3724264953

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     3 |     3   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |     3 |            |          |
|*  2 |   TABLE ACCESS FULL| T1   |     2 |     6 |     3   (0)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------

   2 - filter("N1"=17)

14 rows selected.

Elapsed: 00:00:00.02
SQL>

Our calculated value, as evidenced by the Rows value in the execution plan, matches Oracle’s estimate, giving further evidence that our concept of the Hybrid histogram is sound.

As mentioned earlier in this post Hybrid histograms can exhibit instability in the form of endpoint changes based on the changing frequency of values as inserts are executed against the table. To illustrate this let’s add another 16 to the data and see what happens:


SQL>
SQL> --
SQL> -- Hybrid histograms can display instability in terms of
SQL> -- endpoints and bucket size
SQL> --
SQL> -- To illustrate this we add 16 to the data, increasing
SQL> -- the repeat count to 3, causing this newly added 16
SQL> -- to shift into the previous bucket leaving 17
SQL> -- as the new endpoint
SQL> --
SQL> insert into t1 values (2, 16);

1 row created.

Elapsed: 00:00:00.00
SQL>
SQL> BEGIN
  2  	dbms_stats.gather_table_stats
  3  	  (user, 't1', method_opt => 'for all columns size 20');
  4  END;
  5  /

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.08
SQL>
SQL> select
  2  	      uth.endpoint_number
  3  	     ,uth.endpoint_actual_value
  4  	     ,uth.endpoint_repeat_count
  5  	 from
  6  	    user_tab_histograms uth
  7  	    ,user_tables ut
  8  	    ,user_tab_col_statistics ucs
  9  	where
 10  	     uth.table_name    = 'T1'
 11  	    and uth.column_name   = 'N1'
 12  	    and uth.table_name	  = ut.table_name
 13  	    and ut.table_name	  = ucs.table_name
 14  	    and uth.column_name   = ucs.column_name
 15  	   order by uth.endpoint_number;

ENDPOINT_NUMBER ENDPOINT_ACT ENDPOINT_REPEAT_COUNT
--------------- ------------ ---------------------
              1 8                                1
              6 13                               3
             11 17                               1 <----
             16 19                               3
             21 20                               5
             27 23                               2
             33 26                               3
             39 27                               6
             45 28                               6
             51 29                               6
             59 31                               5

ENDPOINT_NUMBER ENDPOINT_ACT ENDPOINT_REPEAT_COUNT
--------------- ------------ ---------------------
             70 33                               8
             80 35                               7
             87 38                               5
             91 41                               1
             96 43                               3
             97 44                               1
             98 45                               1
             99 46                               1
            101 59                               1

20 rows selected.

Elapsed: 00:00:00.24

The endpoint value of 18 has been replaced by the value 17, because of Rule 2, that no value can span buckets. Before the insert the histogram bucket looked like this: 8 12 12 13 13 13|15 16 16 17 18 18|19 19 19 20 20 20 20 20| … After the insert the ordered value list looked like this: 8 12 12 13 13|13 15 16 16 16|17 18 18 19 19 19 … After Oracle ‘walks’ through the list to consolidate like values the buckets look like this: 8 12 12 13 13 13| 15 16 16 16 17 |18 18 19 19 19… giving a new endpoint number of 17. If we add yet another 16 to the data set we see:


SQL>
SQL> --
SQL> -- Add one more 16 and the endpoint shifts to 16
SQL> --
SQL> insert into t1 values (3, 16);

1 row created.

Elapsed: 00:00:00.00
SQL>
SQL> BEGIN
  2  	   dbms_stats.gather_table_stats
  3  	       (user, 't1', method_opt => 'for all columns size 20');
  4  	 END;
  5  /

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.03
SQL>
SQL> select
  2  	      uth.endpoint_number
  3  	     ,uth.endpoint_actual_value
  4  	     ,uth.endpoint_repeat_count
  5  	 from
  6  	    user_tab_histograms uth
  7  	    ,user_tables ut
  8  	    ,user_tab_col_statistics ucs
  9  	where
 10  	     uth.table_name    = 'T1'
 11  	    and uth.column_name   = 'N1'
 12  	    and uth.table_name	  = ut.table_name
 13  	    and ut.table_name	  = ucs.table_name
 14  	    and uth.column_name   = ucs.column_name
 15  	   order by uth.endpoint_number;

ENDPOINT_NUMBER ENDPOINT_ACT ENDPOINT_REPEAT_COUNT
--------------- ------------ ---------------------
              1 8                                1
              6 13                               3
             11 16                               4 

The endpoint has shifted again and is now 16.

Now that we know what a Hybrid histogram is and what it can do let’s look at another way to generate such a histogram in 12.1.0.2, and this goes back (finally!) to that “old saw” mentioned earlier. Let’s create a table with a single-column primary key and see what Oracle does with it when we run DBMS_STATS.

This example, again by Jonathan Lewis, creates a table and populates the id column (the eventual primary key for the table) in a rather interesting way:


SQL> 
SQL> create table t1
  2  as
  3  with generator as (
  4  	     select  --+ materialize
  5  		     rownum id
  6  	     from dual
  7  	     connect by
  8  		     level  column maxid1 new_value max1
SQL> column maxid2 new_value max2
SQL> column maxid3 new_value max3
SQL> 
SQL> select min(id), max(id) maxid1
  2  from t1;

   MIN(ID)     MAXID1                                                           
---------- ----------                                                           
         1      10000                                                           

SQL> 
SQL> insert into t1 select id + 1e6 from t1;

10000 rows created.

SQL> 
SQL> select min(id), max(id) maxid2
  2  from t1;

   MIN(ID)     MAXID2                                                           
---------- ----------                                                           
         1    1010000                                                           

SQL> 
SQL> insert into t1 select id + 1e7 from t1;

20000 rows created.

SQL> 
SQL> select min(id), max(id) maxid3
  2  from t1;

   MIN(ID)     MAXID3                                                           
---------- ----------                                                           
         1   11010000                                                           

SQL> 

Notice the values in the column; the values go from 1 to 10000 then jump to 1000001, continuing to 1010000, and jump again to 11000001 continuing on to 11010000. The breaks in the value sequence are what make this data set qualify for a Hybrid histogram in 12.1.0.2; in 11.2.0.4 a height-balanced histogram is created. Let’s add the primary key to the table:

`


SQL> 
SQL> alter table t1 add constraint t1_pk primary key(id);

Table altered.

Now let’s query for data that isn’t in the table:


SQL> 
SQL> select
  2  	     /*+ dynamic_sampling(0) */
  3  	     *
  4  from    t1
  5  where
  6  	     id between 12000 and 13000
  7  ;

no rows selected

SQL> 

Doing nothing special we generate statistics on the table; Oracle will automatically determine that the PK column needs a histogram and create it:


SQL> begin
  2  	     dbms_stats.gather_table_stats(
  3  		     ownname	      => user,
  4  		     tabname	      =>'T1'
  5  	     );
  6  end;
  7  /

PL/SQL procedure successfully completed.

SQL> 
SQL> column column_name format a35
SQL> set linesize 150
SQL> 
SQL> select
  2  	     column_name, sample_size,
  3  	     num_distinct, num_nulls, density,
  4  	     histogram, num_buckets
  5  from
  6  	     user_tab_cols
  7  where
  8  	     table_name = 'T1'
  9  order by
 10  	     column_name
 11  ;

COLUMN_NAME                         SAMPLE_SIZE NUM_DISTINCT  NUM_NULLS    DENSITY HISTOGRAM       NUM_BUCKETS                                        
----------------------------------- ----------- ------------ ---------- ---------- --------------- -----------                                        
ID                                         5502        40000          0    .000025 HYBRID                  254                                        

SQL> 

To prove that 11.2.0.4 will create a height-balanced histogram let’s run the example there:


SQL> 
SQL> create table t1
  2  as
  3  with generator as (
  4  	     select  --+ materialize
  5  		     rownum id
  6  	     from dual
  7  	     connect by
  8  		     level 'T1'
  5  	     );
  6  end;
  7  /

PL/SQL procedure successfully completed.

SQL> 
SQL> column column_name format a35
SQL> set linesize 150
SQL> 
SQL> select
  2  	     column_name, sample_size,
  3  	     num_distinct, num_nulls, density,
  4  	     histogram, num_buckets
  5  from
  6  	     user_tab_cols
  7  where
  8  	     table_name = 'T1'
  9  order by
 10  	     column_name
 11  ;

COLUMN_NAME                         SAMPLE_SIZE NUM_DISTINCT  NUM_NULLS    DENSITY HISTOGRAM       NUM_BUCKETS                                        
----------------------------------- ----------- ------------ ---------- ---------- --------------- -----------                                        
ID                                         5509        40000          0 .000024959 HEIGHT BALANCED         254                                        

SQL> 

So it would appear that having a histogram on a single-column primary key isn’t so far-fetched, after all. And we didn’t have to do anything special to get it created.

It all comes down to numbers.

Advertisements

Leave a Comment »

No comments yet.

RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Blog at WordPress.com.

%d bloggers like this: