Oracle Tips and Tricks — David Fitzjarrell

February 3, 2015

How Can I Compress Thee

Filed under: Exadata,General,Performance — dfitzjarrell @ 10:26

“You can swim all day in the Sea of Knowledge and not get wet.” 
― Norton Juster, The Phantom Tollbooth 

In previous posts compression options have been discussed, and now it’s time to see how Oracle performs basic compression. It isn’t really compression, it’e de-duplication, but it does result in space savings for data that won’t be modified after it’s ‘compressed’. Let’s look at how Oracle saves space with your data.

Oracle de-duplicates the data by finding common strings, tokenizing them and using the token identifier in the string to reduce the row length. So, what does that mean? Looking at an example might help; a table is built and populated as follows:


--
-- Create and populate the table
--
create table comptst(
	tstcol1	varchar2(4),
	tstcol2 varchar2(6),
	tstcol3	number(8,2),
	tstcol4	varchar2(10));

insert into comptst
values ('ZZXZ', 'bbddff', 199.44, 'PENDING');

insert into comptst
values ('ZZXZ', 'ghijkl', 43.08, 'PENDING');

insert into comptst
values ('ZZXZ', 'bbddff', 881.02, 'PENDING');

insert into comptst
values ('ZZXZ', 'bbddff', 54.97, 'PENDING');

commit;

insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;
insert into comptst select * From comptst;

commit;
				--
-- Compress the table with BASIC compression
--
alter table comptst compress basic;
alter table comptst move;

[The table was compressed after the data was inserted which required two steps, the first to set the compression level and the second, a table move in place, to actually compress the data. Had the table been built as compressed and direct path inserts used the data would have been compressed without further action.] Since the initial 4-row insert was re-inserted multiple times there is a lot of duplication in the data, and since Oracle de-duplicates rows to produce the effect of compression there should be a lot of data in a block dump indicating this. There is, and the first piece of that data is the following line:


  perm_9ir2[4]={ 0 2 3 1 }

Oracle builds a token table for each data block; this provides a reference for each data string that occurs more than once in the block. Additionally Oacle can re-arrange the column values in that token table so that multiple column values can be turned into a single token and, thus, a single reference. The line shown above indicates what column values map to the table positions in the token table for this block; in this case column 0 maps to the data in table column 0, column 1 maps to the data in table column 2, column 2 maps to data column 3 and column 3 maps to data column 1. Let’s look at the unique data that was inserted:


('ZZXZ', 'bbddff', 199.44, 'PENDING');
('ZZXZ', 'ghijkl', 43.08, 'PENDING');
('ZZXZ', 'bbddff', 881.02, 'PENDING');
('ZZXZ', 'bbddff', 54.97, 'PENDING');

Since these data rows are duplicated in each block every column is a potential compression token. Two values occur in every row, ‘ZZXZ’ and ‘PENDING’, so it should be expected that tokens for those values will be found in each of the compressed data rows. As mentioned previously Oracle builds a token table in each block so there are two tables in this block, the first, starting at offset 0, is the token table that has 7 rows and the second, starting at offset 7, is the actual table data and there are 721 rows:


0x24:pti[0]	nrow=7		offs=0
0x28:pti[1]	nrow=721	offs=7

Oracle has a clue with this implementation of compression and can create a token that includes a data value and a token, from the same token table, to reduce that row length even further. The examples provided here won’t be demonstrating that but know that it is possible. Now let’s look at the first row in this block for the data table:


tab 1, row 0, @0x1f31
tl: 5 fb: --H-FL-- lb: 0x0  cc: 4
col  0: [ 4]  5a 5a 58 5a
col  1: [ 7]  50 45 4e 44 49 4e 47
col  2: [ 6]  62 62 64 64 66 66
col  3: [ 3]  c1 37 62
bindmp: 2c 00 01 04 02

The actual column lengths are supplied between the square brackets for each column; the total length should be the sum of those values plus 7 bytes, 4 of those for the column lengths, one for the lock byte, one for the flag byte and one for the column count. Using that information the total length should be 24 bytes; the block dump provides a different total length of 5, as reported by the tl entry. There is a line at the end of the row dump labeled bindmp (a binary dump of the row contents) revealing the actual contents of those 5 bytes. As expected there is the lock byte (0x2c), the number of columns at this location (0x01) and two bytes representing the token, reporting that 4 columns are in this token and that the reference row in the token table is row 2. So, let’s look at table 0, row 2:


tab 0, row 2, @0x1f5c
tl: 10 fb: --H-FL-- lb: 0x0  cc: 4
col  0: [ 4]  5a 5a 58 5a
col  1: [ 7]  50 45 4e 44 49 4e 47
col  2: [ 6]  62 62 64 64 66 66
col  3: [ 3]  c1 37 62
bindmp: 00 b3 04 04 05 06 cb c1 37 62

It looks almost like the data row, but the total token length is 10 bytes. Looking at the bindmp the first two bytes indicate this token is used 179 times in this block, the third byte indicates that 4 columns are in this token, the two bytes after that report that the first two columns are also tokens, 0x04 and 0x05. Going back to the token table we see that those tokens are:


tab 0, row 4, @0x1f66
tl: 7 fb: --H-FL-- lb: 0x0  cc: 1
col  0: [ 4]  5a 5a 58 5a
bindmp: 00 04 cc 5a 5a 58 5a
tab 0, row 5, @0x1f76
tl: 10 fb: --H-FL-- lb: 0x0  cc: 1
col  0: [ 7]  50 45 4e 44 49 4e 47
bindmp: 00 04 cf 50 45 4e 44 49 4e 47

These are single-column tokens, and each is used 4 times in this block. This is how Oracle reduced the row length from 24 bytes to 5 to save block space. Working through the block dump it’s now possible to re-construct the 24 bytes of data the row originally contained even though it now is only 5 bytes in length.


"The way you see things depends a great deal on where you look at them from."
 - Norton Juster, The Phantom Tollbooth 

We see that Oracle doesn’t actually compress data, it replaces duplicate values with tokens and, through those tokens, reconstructs the data at query time by using the row directory and the actual row pieces in each block. Depending on the select list some tokens won’t be accessed if that data isn’t required. Of course all of this re-constructing can be expensive at the CPU level, and for full table scans of large tables performance can be an issue, especially with the “cache buffers chains” latch because Oracle is performing fewer “consistent gets – examination”. This is because Oracle has to pin blocks for a longer period due to the reconstruction. On the plus side the number of physical reads can decrease since the data is in a smaller number of blocks and can stay in the cache longer. Using basic compression is a trade-off between size and performance, and for extremely large tables or in cases where the compression savings are quite large (meaning the data is compressed more) queries may become CPU-intensive rather than I/O intensive. The good and the bad need to be weighed carefully when making the decision to use compression; choose wisely. Space is relatively inexpensive when compared to end-user satisfaction. The DBA’s idea of performance and the end-users ideas of performance use different criteria, and it’s really the end-users idea that should take precendence.

Anyone up for a swim?

September 8, 2014

Blooms

Filed under: Exadata,General,Performance — dfitzjarrell @ 09:41

In Oracle releases 10.2.0.x and later join processing can be made more efficient by the use of Bloom filters, primarily to reduce traffic between parallel query slaves. What is a Bloom filter? Named after Burton Howard Bloom, who came up with the concept in the 1970s, it’s an efficient data structure used to quickly determine if an element has a high probability of being a member of a given set. It’s based on a bit array that allows for rapid searches and returns one of two results: either the element is probably in the set (which can produce false positives) or the element is definitely not in the set. The filter cannot produce false negatives, and the incidence of false positives is relatively rare. Another advantage to Bloom filters is their small size relative to other data structures used for similar purposes (self-balancing binary search trees, hash tables, or linked lists). The possibility of false positives necessitates the addition of another filter to eliminate them from the results, yet such a filter doesn’t add appreciably to the process time and, therefore, goes relatively unnoticed.

When Bloom filters are used for a query on an Exadata system, the filter predicate and the storage predicate will list the SYS_OP_Bloom_FILTER function as being called. This function includes the additional filter to eliminate any false positives that could be returned. It’s the storage predicate that provides the real power of the Bloom filter on Exadata. Using a Bloom filter to pre-join the tables at the storage server level reduces the volume of data the database servers need to process and can significantly reduce the execution time of the given query.

An example of Bloom filters in action follows; a script named Bloom_fltr_ex.sql was executed to generate this output on an Exadata system; the script can be emailed on request:


SQL> --
SQL> -- Create sample tables
SQL> --
SQL> -- Create them parallel, necessary 
SQL> -- to get a Smart Scan on these tables
SQL> --
SQL> create table emp(
  2  	empid   number,
  3  	empnmvarchar2(40),
  4  	empsal  number,
  5  	empssn  varchar2(12),
  6  	constraint emp_pk primary key (empid)
  7  ) parallel 4;

Table created.

SQL>
SQL> create table emp_dept(
  2  	empid   number,
  3  	empdept number,
  4  	emploc  varchar2(60),
  5    constraint emp_dept_pk primary key(empid)
  6  ) parallel 4;

Table created.

SQL>
SQL> create table dept_info(
  2  	deptnum number,
  3  	deptnm  varchar2(25),
  4    constraint dept_info_pk primary key(deptnum)
  5  ) parallel 4;

Table created.

SQL>
SQL> --
SQL> -- Load sample tables with data
SQL> --
SQL> begin
  2  	     for i in 1..2000000 loop
  3  		     insert into emp
  4  		     values(i, 'Fnarm'||i, (mod(i, 7)+1)*1000, mod(i,10)||mod(i,10)||mod(i,10)||'-'||mod(i,10)||mod(i,10)||'-'||mod(i,10)||mod(i,10)||mod(i,10)||mod(i,10));
  5  		     insert into emp_dept
  6  		values(i, (mod(i,8)+1)*10, 'Zanzwalla'||(mod(i,8)+1)*10);
  7  		     commit;
  8  	     end loop;
  9  	     insert into dept_info
 10  	     select distinct empdept, case when empdept = 10 then 'SALES'
 11  					   when empdept = 20 then 'PROCUREMENT'
 12  					   when empdept = 30 then 'HR'
 13  					   when empdept = 40 then 'RESEARCH'
 14  					   when empdept = 50 then 'DEVELOPMENT'
 15  					   when empdept = 60 then 'EMPLOYEE RELATIONS'
 16  					   when empdept = 70 then 'FACILITIES'
 17  					   when empdept = 80 then 'FINANCE' end
 18  	     from emp_dept;
 19  
 20  end;
 21  /

PL/SQL procedure successfully completed.

SQL>
SQL> --
SQL> -- Run join query using Bloom filter
SQL> --
SQL> -- Generate execution plan to prove Bloom
SQL> -- filter usage
SQL> --
SQL> -- Also report query execution time
SQL> --
SQL> set autotrace on
SQL> set timing on
SQL>
SQL> select /*+ Bloom join 2 parallel 2 use_hash(empemp_dept) */ e.empid, e.empnm, d.deptnm, e.empsal, count(*) emps
  2  fromemp e join emp_depted on (ed.empid = e.empid) join dept_info d on (ed.empdept = d.deptnum)
  3  whereed.empdept = 20
  4  group by e.empid, e.empnm, d.deptnm, e.empsal;

     EMPID EMPNM                  DEPTNM                        EMPSAL       EMPS
---------- ---------------------- ------------------------- ---------- ----------
    904505 Fnarm904505            PROCUREMENT                     1000          1
    907769 Fnarm907769            PROCUREMENT                     3000          1
    909241 Fnarm909241            PROCUREMENT                     5000          1
    909505 Fnarm909505            PROCUREMENT                     3000          1
    909641 Fnarm909641            PROCUREMENT                     6000          1
    910145 Fnarm910145            PROCUREMENT                     6000          1
...
    155833 Fnarm155833            PROCUREMENT                     7000          1
    155905 Fnarm155905            PROCUREMENT                     2000          1
    151081 Fnarm151081            PROCUREMENT                     1000          1
    151145 Fnarm151145            PROCUREMENT                     2000          1

250000 rows selected.

Elapsed: 00:00:14.27

Execution Plan
----------------------------------------------------------
Plan hash value: 2313925751

------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                | Name         | Rows | Bytes | Cost (%CPU)| Time     |   TQ  |IN-OUT| PQ Distrib |
------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                         |              |  218K|    21M|  1378   (1)| 00:00:01 |       |      |            |
|   1 |  PX COORDINATOR                          |              |      |       |            |          |       |      |            |
|   2 |   PX SEND QC (RANDOM)                    | :TQ10003     |  218K|    21M|  1378   (1)| 00:00:01 | Q1,03 | P->S | QC (RAND)  |
|   3 |    HASH GROUP BY                         |              |  218K|    21M|  1378   (1)| 00:00:01 | Q1,03 | PCWP |            |
|   4 |     PX RECEIVE                           |              |  218K|    21M|  1378   (1)| 00:00:01 | Q1,03 | PCWP |            |
|   5 |      PX SEND HASH                        | :TQ10002     |  218K|    21M|  1378   (1)| 00:00:01 | Q1,02 | P->P | HASH       |
|   6 |       HASH GROUP BY                      |              |  218K|    21M|  1378   (1)| 00:00:01 | Q1,02 | PCWP |            |
|*  7 |        HASH JOIN                         |              |  218K|    21M|  1376   (1)| 00:00:01 | Q1,02 | PCWP |            |
|   8 |         PX RECEIVE                       |              |  218K|    11M|   535   (1)| 00:00:01 | Q1,02 | PCWP |            |
|   9 |          PX SEND BROADCAST               | :TQ10001     |  218K|    11M|   535   (1)| 00:00:01 | Q1,01 | P->P | BROADCAST  |
|  10 |           NESTED LOOPS                   |              |  218K|    11M|   535   (1)| 00:00:01 | Q1,01 | PCWP |            |
|  11 |            BUFFER SORT                   |              |      |       |            |          | Q1,01 | PCWC |            |
|  12 |             PX RECEIVE                   |              |      |       |            |          | Q1,01 | PCWP |            |
|  13 |              PX SEND BROADCAST           | :TQ10000     |      |       |            |          |       | S->P | BROADCAST  |
|  14 |               TABLE ACCESS BY INDEX ROWID| DEPT_INFO    |    1 |    27 |     1   (0)| 00:00:01 |       |      |            |
|* 15 |                INDEX UNIQUE SCAN         | DEPT_INFO_PK |    1 |       |     1   (0)| 00:00:01 |       |      |            |
|  16 |            PX BLOCK ITERATOR             |              |  218K|  5556K|   534   (1)| 00:00:01 | Q1,01 | PCWC |            |
|* 17 |             TABLE ACCESS STORAGE FULL    | EMP_DEPT     |  218K|  5556K|   534   (1)| 00:00:01 | Q1,01 | PCWP |            |
|  18 |         PX BLOCK ITERATOR                |              | 1657K|    75M|   839   (1)| 00:00:01 | Q1,02 | PCWC |            |
|* 19 |          TABLE ACCESS STORAGE FULL       | EMP          | 1657K|    75M|   839   (1)| 00:00:01 | Q1,02 | PCWP |            |
------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   7 - access("ED"."EMPID"="E"."EMPID")
  15 - access("D"."DEPTNUM"=20)
  17 - storage("ED"."EMPDEPT"=20)
       filter("ED"."EMPDEPT"=20)
  19 - storage(SYS_OP_Bloom_FILTER(:BF0000,"E"."EMPID"))
       filter(SYS_OP_Bloom_FILTER(:BF0000,"E"."EMPID"))

Note
-----
   - dynamic sampling used for this statement (level=2)


Statistics
----------------------------------------------------------
     60  recursive calls
    174  db block gets
  40753  consistent gets
  17710  physical reads
   2128  redo size
9437983  bytes sent via SQL*Net to client
 183850  bytes received via SQL*Net from client
  16668  SQL*Net roundtrips to/from client
      6  sorts (memory)
      0  sorts (disk)
 250000  rows processed

SQL>

In less than 15 seconds 250,000 rows were returned from a three table join of over 4 million rows; the Bloom filter made a dramatic difference in how this query was processed and provided exceptional performance given the volume of data queried.

Without the storage level execution of the Bloom filter, the query execution time increased by 2.33 seconds, a 16.3 percent increase. For longer execution times, this difference can be significant. It isn’t the Bloom filter that gives Exadata such power with joins, it’s the fact that Exadata can execute it not only at the database level but also at the storage level, something commodity hardware configurations can’t do.

Of course Bloom filters are not exclusive to Exadata, as the following run of the same script on a non-Exadata system shows (only the abbreviated output section is reproduced here):


SQL> select /*+ Bloom join 2 parallel 2 use_hash(emp emp_dept) */ e.empid, e.empnm, d.deptnm, e.empsal	  --, count(*) emps
  2  from emp e join emp_dept ed on (ed.empid = e.empid) join dept_info d on (ed.empdept = d.deptnum)
  3  where ed.empdept = 20;

     EMPID EMPNM                                    DEPTNM                        EMPSAL                                                              
---------- ---------------------------------------- ------------------------- ----------                                                              
   1670281 Fnarm1670281                             PROCUREMENT                     5000                                                              
   1670289 Fnarm1670289                             PROCUREMENT                     6000                                                              
   1670297 Fnarm1670297                             PROCUREMENT                     7000                                                              
   1670305 Fnarm1670305                             PROCUREMENT                     1000                                                              
   1670313 Fnarm1670313                             PROCUREMENT                     2000                                                              
   1670321 Fnarm1670321                             PROCUREMENT                     3000                                                              
   1670329 Fnarm1670329                             PROCUREMENT                     4000                                                              
   1670337 Fnarm1670337                             PROCUREMENT                     5000                                                              
   1670345 Fnarm1670345                             PROCUREMENT                     6000                                                              
   1670353 Fnarm1670353                             PROCUREMENT                     7000                                                              
   1670361 Fnarm1670361                             PROCUREMENT                     1000                                                              
...
   1857369 Fnarm1857369                             PROCUREMENT                     4000                                                              
   1857377 Fnarm1857377                             PROCUREMENT                     5000                                                              
   1857385 Fnarm1857385                             PROCUREMENT                     6000                                                              

250000 rows selected.

Elapsed: 00:00:54.86

Execution Plan
----------------------------------------------------------                                                                                            
Plan hash value: 2643012915                                                                                                                           
                                                                                                                                                      
----------------------------------------------------------------------------------------------------------------------------------                    
| Id  | Operation                            | Name         | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |                    
----------------------------------------------------------------------------------------------------------------------------------                    
|   0 | SELECT STATEMENT                     |              |   273K|    26M|  1378   (1)| 00:00:01 |        |      |            |                    
|   1 |  PX COORDINATOR                      |              |       |       |            |          |        |      |            |                    
|   2 |   PX SEND QC (RANDOM)                | :TQ10002     |   273K|    26M|  1378   (1)| 00:00:01 |  Q1,02 | P->S | QC (RAND)  |                    
|*  3 |    HASH JOIN                         |              |   273K|    26M|  1378   (1)| 00:00:01 |  Q1,02 | PCWP |            |                    
|   4 |     PX RECEIVE                       |              |   273K|    13M|   536   (1)| 00:00:01 |  Q1,02 | PCWP |            |                    
|   5 |      PX SEND BROADCAST               | :TQ10001     |   273K|    13M|   536   (1)| 00:00:01 |  Q1,01 | P->P | BROADCAST  |                    
|   6 |       NESTED LOOPS                   |              |   273K|    13M|   536   (1)| 00:00:01 |  Q1,01 | PCWP |            |                    
|   7 |        BUFFER SORT                   |              |       |       |            |          |  Q1,01 | PCWC |            |                    
|   8 |         PX RECEIVE                   |              |       |       |            |          |  Q1,01 | PCWP |            |                    
|   9 |          PX SEND BROADCAST           | :TQ10000     |       |       |            |          |        | S->P | BROADCAST  |                    
|  10 |           TABLE ACCESS BY INDEX ROWID| DEPT_INFO    |     1 |    27 |     1   (0)| 00:00:01 |        |      |            |                    
|* 11 |            INDEX UNIQUE SCAN         | DEPT_INFO_PK |     1 |       |     1   (0)| 00:00:01 |        |      |            |                    
|  12 |        PX BLOCK ITERATOR             |              |   273K|  6947K|   535   (1)| 00:00:01 |  Q1,01 | PCWC |            |                    
|* 13 |         TABLE ACCESS FULL            | EMP_DEPT     |   273K|  6947K|   535   (1)| 00:00:01 |  Q1,01 | PCWP |            |                    
|  14 |     PX BLOCK ITERATOR                |              |  2099K|    96M|   840   (1)| 00:00:01 |  Q1,02 | PCWC |            |                    
|* 15 |      TABLE ACCESS FULL               | EMP          |  2099K|    96M|   840   (1)| 00:00:01 |  Q1,02 | PCWP |            |                    
----------------------------------------------------------------------------------------------------------------------------------                    
                                                                                                                                                      
Predicate Information (identified by operation id):                                                                                                   
---------------------------------------------------                                                                                                   
                                                                                                                                                      
   3 - access("ED"."EMPID"="E"."EMPID")                                                                                                               
  11 - access("D"."DEPTNUM"=20)                                                                                                                       
  13 - filter("ED"."EMPDEPT"=20)                                                                                                                      
  15 - filter(SYS_OP_Bloom_FILTER(:BF0000,"E"."EMPID"))                                                                                               
                                                                                                                                                      
Note                                                                                                                                                  
-----                                                                                                                                                 
   - dynamic sampling used for this statement (level=2)                                                                                               


Statistics
----------------------------------------------------------                                                                                            
         33  recursive calls                                                                                                                          
        139  db block gets                                                                                                                            
      36224  consistent gets                                                                                                                          
      17657  physical reads                                                                                                                           
          0  redo size                                                                                                                                
    9526012  bytes sent via SQL*Net to client                                                                                                         
     183846  bytes received via SQL*Net from client                                                                                                   
      16668  SQL*Net roundtrips to/from client                                                                                                        
          5  sorts (memory)                                                                                                                           
          0  sorts (disk)                                                                                                                             
     250000  rows processed                                                                                                                           

SQL>

The only differences between the two plans involve the storage component of Exadata. Notice also the vast difference in elapsed times between Exadata and non-Exadata systems; Exadata ran the query returning 250,000 rows in just over 14 seconds and the non-Exadata system returned the same volume of data in just under 55 seconds. Both times are substantial improvements over not using a Bloom filter so any hardware running 11.2.0.2 or later will benefit from Bloom filters.

Of course there is a ‘gotcha’ with Bloom filters, and Oracle recognizes this as a bug in releases older than 12.1.0.2. Hash left outer joins, hash right outer joins and hash full outer joins do not use a Bloom Filter and for those interested the unpublished bug number is 17716301. MOS document 1919508.1 reports details on this behavior including an example. As reported in that document the patch can be backported to earlier releases when an SR is filed requesting it.

Bloom filters are a welcome addition to both Exadata and non-Exadata installations as they can improve query performance with parallel queries, as evidenced here. Now that you know how to spot when they are used you may find they are utilized more often than you expected.

It’s a bloomin’ miracle.

July 22, 2014

That’s … Huge!

Filed under: Exadata,General,Performance — dfitzjarrell @ 14:27

Recently I’ve noticed the occasional thread in Oracle newsgroups and lists asking about hugepages support in Linux, including ‘best practices’ for hugepages configuration. This information is out on that ‘world-wide web’ in various places; I’d rather put a lot of that information in this post to provide an easier way to get to it. I’ll cover what hugepages are, what they do, what they can’t do and how best to allocate them for your particular installation. Let’s get started.

“Normal” memory pages in Linux are 4 KB in size and are allocated as needed where the memory map will allow so they are likely not contiguous. Hugepages, in comparison, are 2 MB pages locked in memory and are allocated in a contiguous ‘chunk’; these are allocated at boot time using a parameter in the /etc/sysctl.conf file named vm.nr_hugepages for RedHat Linux kernel 2.6 and a parameter named vm.hugetlb_pool for RedHat Linux kernel 2.4. You must remember that each page is 2 MB is size, as this affects how many hugepages you need to allocate to cover all of the SGAs of running Oracle databases. Set this too high and your system might not boot; set this too low and you won’t get the benefit of using hugepages in one or more Oracle instances. [When use_large_pages is set to ONLY and insufficient hugepages are available the database won’t start; changing use_large_pages to TRUE allows the database to start but it won’t be using hugepages, which results in more paging activity, lowering performance.] Since hugepages are contiguous it stands to reason that performance will improve since Oracle won’t need to access memory segments across the entire range of installed memory. Hugepages also reduce the size of ‘page tables’ by reducing the overhead for page tables for each connection to the database. Page tables are part of the Linux memory management, created and used to map a virtual address to a physical memory location. For SGAs 100 MB and greater using hugepages is probably a good idea, and that idea gets better as the SGA size increases. Kevin Closson reported back in 2009 on the page table overhead for an Oracle database with an 8000 MB SGA running 500 simultaneous connections. The page table overhead reported was startling — almost 6.8 GB of memory used for page tables when hugepages were not configured. He ran the same test again, on the same Oracle database, only that test was run after hugepages were allocated and used by the Oracle database. The page table overhead was reduced to slightly less than 23 MB. That is a dramatic change in memory usage, one that can definitely improve performance.

Hugepages are available for IPC (Inter-Process Communication) shared memory; this is the ‘standard’ shared memory model (starting with UNIX System V) allowing multiple processes to access the same shared memory segment. There is also another form of shared memory segment, the memory-mapped file, and currently such segments are not supported by hugepages. Oracle, on Linux, gives you a choice of using hugepages or memory-mapped files and you implement that choice by selecting to use (or not use) Automatic Memory Management (AMM). Choosing AMM disallows hugepages since AMM uses memory-mapped files located in /dev/shm; you cannot use AMM and hugepages together. For databases with SGAs up to 100 MB using AMM is probably not a good idea; a script to determine how many hugepages to allocate won’t consider SGAs of less than 100 MB. For Oracle databases with larger SGAs hugepages is, in my opinion, the choice to make as you can significantly reduce shared memory overhead where page tables are concerned.

Calculating how many hugepages to allocate has been a difficult task in the past; Oracle supplies a script that can calculate that value for you provided you have all of the databases that you want to use hugepages running; the script from Oracle won’t consider databases you have created that aren’t up and available as it uses the ipcs -m command to return the allocated shared memory segments. The script is available from My Oracle Support in Doc ID 401749.1. It generates a list of IPC shared memory segment sizes owned by Oracle and computes the total number of hugepages required. It outputs the result to be used in setting the vm.nr_hugepages parameter in /etc/sysctl.conf. Once that setting is modified the system has to be rebooted before the changes are implemented. Sample output from that script is shown below:


This script is provided by Doc ID 401749.1 from My Oracle Support
(http://support.oracle.com) where it is intended to compute values for
the recommended HugePages/HugeTLB configuration for the current shared
memory segments. Before proceeding with the execution please note following:
 * For ASM instance, it needs to configure ASMM instead of AMM.
 * The 'pga_aggregate_target' is outside the SGA and
   you should accommodate this while calculating SGA size.
 * In case you changes the DB SGA size,
   as the new SGA will not fit in the previous HugePages configuration,
   it had better disable the whole HugePages,
   start the DB with new SGA size and run the script again.
And make sure that:
 * Oracle Database instance(s) are up and running
 * Oracle Database 11g Automatic Memory Management (AMM) is not setup
   (See Doc ID 749851.1)
 * The shared memory segments can be listed by command:
     # ipcs -m


Press Enter to proceed...

Recommended setting: vm.nr_hugepages = 25240

For Linux RedHat kernel 2.6 this is the number of hugepages, not the overall size of the memory allocated for those hugepages. For RedHat kernel 2.4 a different parameter, vm.hugetlb_pool, is set to the actual size that pool needs to be. Fortunately the script knows which kernel is running and reports the correct parameter and value to set.

After you have hugepages allocated you need to tell Oracle to use them exclusively. An init parameter, use_large_pages, should be set to ONLY to keep Oracle from allocating conventionally managed memory segments for the SGA. If your hugepages calculations are correct then all of the databases on your database server will start without error. Allocate too few hugepages and databases with use_large_pages set to ONLY won’t start after all of the existing hugepages are used. Setting use_large_pages to TRUE will allow a database to start without using hugepages but that’s not what you want to get better performance from Oracle.

Earlier I mentioned that allocating too many hugepages can be worse than allocating too few hugepages. Since each hugepage consumes 2 MB of space multiplying the vm.nr_hugepages setting by 2097152 will show how much memory is occupied by hugepages. It is possible to try to allocate all of the system memory as hugepages resulting in a system that will not boot in multi-user mode. The kernel isn’t configured to use hugepages; the more hugepages that are allocated the less conventionally managed memory is available. Take it all away and there is no place for the operating system to start necessary processes after the kernel is loaded at boot time. Using single-user mode will allow the system adminstrator to modify the offending entry and change it to a more reasonable value so the system can be rebooted. Of course you don’t want to end up in that situation; manually calculating such a value can result in errors so use the Oracle-supplied script to calculate the hugepages value you actually need.

Exadata systems run best when hugepages are allocated and used exclusively. The exachk script checks for such settings in each running database on the system and reports those that do not have use_large_pages set to ONLY. It also checks if the currently configured hugepages setting meets or exceeds the estimated number of hugepages required by all running instances on the given database (compute) node. If this check fails it most likely will fail for all available nodes on the system. Failing this check isn’t the end of the world but it does mean that the failing databases aren’t using memory as efficiently as they could. For each node run the script, make the reported changes then reboot. Before starting any of the databases ensure that all have use_large_pages set to ONLY; start the databases across the cluster using srvctl. The next run ef exachk should report that the hugepages check passed.

RAC databases need to be configured the same way Exadata databases are configured (since Exadata databases are RAC databases). Modify kernel parameters on every database server in the cluster, then reboot. Modify the spfile for each database in the cluster then use srvctl to start those databases cluster-wide. Presuming the hugepages settings are correct across the cluster there should be no error when starting the databases.

Using hugepages for Oracle databases on Linux is, to me, the only way to go if you want to increase database performance. Of course, if you have loads memory to spare you may not see the need. Most systems I’ve worked on don’t have that luxury so being as ‘frugal’ with memory as you can is good. Making the most of available resources is key and hugepages on Linux can definitely help with respect to Oracle.

Size does matter.

July 15, 2014

It Pays To Be Smart

Filed under: Exadata,General — dfitzjarrell @ 09:40

Exadata is a powerful system, able to provide exceptional performance. Much of this peformance is due to Smart Scans, Exadata’s mechanism for shifting the workload to the storage cells and having them filter and reduce the amount of data the database servers must process. Not every Smart Scan that starts ends up completing, though. Oracle may decide that a different path, one that doesn’t include a Smart Scan, may be more efficient. Certain conditions must be met and there is a metric that records Smart Scan starts and another that records how many of those starts actually execute Smart Scans. Let’s look at why Oracle may decide a Smart Scan isn’t the ‘smartest’ route to the data and which metrics you can use to see how many Smart Scans end up using another execution path.

Three conditions can trigger Oracle foregoing a smart scan: setting optimizer_mode to either FIRST_ROWS or FIRST_ROWS_n or executing a query using the ‘where rownum’ predicate. Oracle originally opts to do the Smart Scan then can perform a few block I/O operations to see if that pathway can satisfy the query more efficiently. When that happens a metric, cell num fast response sessions, is incremented. If the block I/O is indeed faster Oracle foregoes the Smart Scan and completes the operation with conventional block I/O. If the block I/O proves that correct results won’t be returned to the calling session in a timely fashion Oracle resumes the Smart Scan and continues processing. Simply because this counter is incremented does not mean, by itself, that Oracle didn’t execute a Smart Scan. You’ll need another metric for that.

When Oracle decides to return to Smart Scan execution another metric is incremented, cell num fast response sessions continuing to smart scan. The same conditions that triggered the previous metric also apply here. Thus when Oracle decides to forego a Smart Scan only the previous metric, cell num fast response sessions, is incremented.

It’s fairly easy to see how many Smart Scans followed through and how many didn’t with the following query:


select a.value smart_scan_started, b.value smart_scan_cont, a.value - b.value no_smart_scan
from v$sysstat a, v$sysstat b
where a.statistic#=262
and b.statistic#=263
/

You may find that none of the Smart Scans that were started actually went on to finish:


SMART_SCAN_STARTED SMART_SCAN_CONT NO_SMART_SCAN
------------------ --------------- -------------
              1222               0          1222

Or you may find that all of them finished:


SMART_SCAN_STARTED SMART_SCAN_CONT NO_SMART_SCAN
------------------ --------------- -------------
               407             407             0

You may also find that some did, and some didn’t, continue:


SMART_SCAN_STARTED SMART_SCAN_CONT NO_SMART_SCAN
------------------ --------------- -------------
                53              27            26

What is important to note is that simply because Oracle started a Smart Scan then chose to not continue on with it doesn’t indicate there is any problem with your Exadata system. On the contrary it indicates that Exadata is doing exactly what it’s supposed to do, efficiently return query results to the calling session. That may be through a Smart Scan, it may be through conventional I/O. Think of it as adjusting your route to a destination to avoid heavy traffic. If Oracle can get the correct results through a few conventional I/O operations it will do so; if a Smart Scan is necessary Oracle can again ‘change gears’ and resume it to ensure correct data is sent to the calling session. It all comes down to using the right tool for the job.

Smart Scans are great when you need them; it’s also good to know that Oracle, on Exadata, can decide when to use them and when it’s best not to. It’s also good to realize that when that happens it’s not because of a problem with Exadata. It’s simply Oracle choosing the right method to get the correct data. Sometimes the shorter path is the better path.

Robert Frost may have said it best:


 
"Two roads diverged in a wood, and I—  
I took the one less traveled by,  
And that has made all the difference." 

Oracle, on Exadata, may take the ‘road less traveled by’ when it comes to choosing a Smart Scan. That’s normally nothing to worry about as Oracle is selecting the most efficient path to return the data.

Anyone for a stroll?

December 15, 2013

To DMA, Or Not To DMA

Filed under: Exadata — dfitzjarrell @ 12:35

Exadata is a different system for a DBA to administer. Some tasks in this environment, such as running the exachk script, require root O/S privileges. This script can be run by the system administrator, and this will be the case if you are managing Exadata as a DBA. However, a new role has emerged relative to Exadata, that of the Database Machine Administrator, or DMA. Let’s look at what being a DMA really means.

In addition to the usual DBA skillset, the DMA must also be familiar with, and be able to understand, the following management and monitoring commands on the specified systems.

On the compute nodes (database nodes):

Linux: top , mpstat , vmstat , iostat , fdisk , ustat , sar , sysinfo
Exadata: dcli
ASM: asmcmd , asmca
Clusterware: crsctl , srvctl

On the storage servers/cells:

Linux: top , mpstat , vmstat , iostat , fdisk , ustat , sar , sysinfo
Cell management: cellcli , cellsrvstat

Being a DMA also includes other areas of responsibility not associated with being a DBA. The following table summarizes the areas of responsibility for a DMA:

*******DMA Responsibilities *******
Skill Percent
System Administrator 15
Storage Administrator 0
Network Administrator 5
Database Administrator 60
Cell Administrator 20

The ‘Percent’ column indicates the percentage of the overall Exadata system requiring this knowledge, and as you can see if you’ve been an 11g RAC administrator, you have 60 percent of the skillset required to be a DMA. Notice that storage administration knowledge is not necessary; this is because Exadata uses a different storage model than normal, a model where the database servers (or nodes) have no direct access to the storage and must access it through the storage servers. Also the number of available disks is dependent upon the number of available storage servers, with each storage server managing 12 physical disks. For many Exadata installations a separate server is configured to communicate with Exadata and with Oracle to provide information on hardware and firmware problems and/or failures; this system notifies Oracle Customer Support of the issue, generates any parts orders and notifies the Oracle technician when to pick up the delivered parts and dispatches the techician to the site to effect the repairs/replacements. Unless there is an external SAN or NAS device connected to Exadata to store database backups and exports there isn’t much need for a Storage Administrator. The remaining skills necessary to be a DMA are not difficult to learn and master. The Cell Administrator commands you will need ( cellcli , dcli ) will increase your knowledge to 80 percent of the DMA skillset. CellCLI is the command-line interface to monitor and manage the storage cells. There are three supplied logins to each storage cell and these are ‘root’, ‘cellmonitor’ and ‘celladmin’. As you can probably guess ‘celladmin’ is the most powerful login that isn’t ‘root’ (the superuser in Linux and Unix). You can do most anything to the storage cells, including startup and shutdown, with ‘celladmin’. The ‘cellmonitor’ user can generate reports and list attributes from the storage cells but has no authority to perform management tasks. The full list of available cellcli commands is shown below:

CellCLI> help

 HELP [topic]
   Available Topics:
        ALTER
        ALTER ALERTHISTORY
        ALTER CELL
        ALTER CELLDISK
        ALTER GRIDDISK
        ALTER IBPORT
        ALTER IORMPLAN
        ALTER LUN
        ALTER PHYSICALDISK
        ALTER QUARANTINE
        ALTER THRESHOLD
        ASSIGN KEY
        CALIBRATE
        CREATE
        CREATE CELL
        CREATE CELLDISK
        CREATE FLASHCACHE
        CREATE FLASHLOG
        CREATE GRIDDISK
        CREATE KEY
        CREATE QUARANTINE
        CREATE THRESHOLD
        DESCRIBE
        DROP
        DROP ALERTHISTORY
        DROP CELL
        DROP CELLDISK
        DROP FLASHCACHE
        DROP FLASHLOG
        DROP GRIDDISK
        DROP QUARANTINE
        DROP THRESHOLD
        EXPORT CELLDISK
        IMPORT CELLDISK
        LIST
        LIST ACTIVEREQUEST
        LIST ALERTDEFINITION
        LIST ALERTHISTORY
        LIST CELL
        LIST CELLDISK
        LIST FLASHCACHE
        LIST FLASHCACHECONTENT
        LIST FLASHLOG
        LIST GRIDDISK
        LIST IBPORT
        LIST IORMPLAN
        LIST KEY
        LIST LUN
        LIST METRICCURRENT
        LIST METRICDEFINITION
        LIST METRICHISTORY
        LIST PHYSICALDISK
        LIST QUARANTINE
        LIST THRESHOLD
        SET
        SPOOL
        START

CellCLI>

All of the above commands are available to ‘celladmin’; only the LIST, DESCRIBE, SET and SPOOL commands are available to ‘cellmonitor’.

Networking commands that you may need are ifconfig , iwconfig , netstat , ping , traceroute , and tracepath . You may, at some time, also need ifup and ifdown , to bring up or bring down network interfaces, although using these commands will not be a regular occurrence. The following example shows how to bring up the eth0 interface.

# ifup eth0

It seems like a daunting task, to become a DMA, but it really isn’t that difficult. It does require a slightly different mindset, as you are now looking at, and managing, the entire system, rather than just the database. There will still be a need for a dedicated System Administrator and Network Administrator for your Exadata system, because, as a DMA, you won’t be responsible for configuration of these resources, nor will you be responsible for patching and firmware upgrades. The DMA is, essentially, assisting these dedicated administrators by assuming the day-to-day tasks these resources would provide. Being a DMA is also more useful to you and to the enterprise as the regular tasks for these areas can be performed by the person or persons who do most of the interaction with Exadata on a daily basis. Enterprises vary, however, and it may not be possible to assume the role of DMA as the division of duties is strictly outlined and enforced. It is good to know, though, that such a role exists and may be made available to you at some time in the future.

March 11, 2013

You Can Never Be Too Smart

Filed under: Exadata — dfitzjarrell @ 11:40

Smart Scans are part of what Kerry Osborne calls the ‘secret sauce’ of Exadata, and he’s right. Smart Scans can definitely improve throughput by using the various optimizations to reduce the amount of data read and the volume of data returned to the database servers. You really want to have them working on an Exadata machine. But how do you know they’re working? Let’s look at a couple ways to prove you have Smart Scans working for you.

The easiest way to prove you have a Smart Scan working (from the database server side) is to query the V$SQL view (or GV$SQL if you want to check across the RAC cluster). There are two ‘counters’ available which report on Smart Scan activity; these are IO_CELL_OFFLOAD_ELIGIBLE_BYTES and IO_CELL_OFFLOAD_RETURNED_BYTES. I’ve used this example in another post but it’s a good example for showing Smart Scan activity so I’ll use it again here:

SQL> select *
  2  from emp
  3  where empid = 7934;

     EMPID EMPNAME                                      DEPTNO
---------- ---------------------------------------- ----------
      7934 Smorthorper7934                                  15

Elapsed: 00:00:00.02

Execution Plan
----------------------------------------------------------
Plan hash value: 3956160932

----------------------------------------------------------------------------------
| Id  | Operation                 | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |      |     1 |    26 |  1167   (1)| 00:00:01 |
|*  1 |  TABLE ACCESS STORAGE FULL| EMP  |     1 |    26 |  1167   (1)| 00:00:01 |
----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - storage("EMPID"=7934)
       filter("EMPID"=7934)


Statistics
----------------------------------------------------------
          1  recursive calls
          1  db block gets
       4349  consistent gets
       4276  physical reads
          0  redo size
        680  bytes sent via SQL*Net to client
        524  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

SQL>
SQL> set autotrace off timing off
SQL>
SQL> select       sql_id,
  2          io_cell_offload_eligible_bytes qualifying,
  3          io_cell_offload_returned_bytes actual,
  4          round(((io_cell_offload_eligible_bytes - io_cell_offload_returned_bytes)/io_cell_offload_eligible_bytes)*100, 2) io_saved_pct,
  5          sql_text
  6  from v$sql
  7  where io_cell_offload_returned_bytes > 0
  8  and instr(sql_text, 'emp') > 0
  9  and parsing_schema_name = 'BING';

SQL_ID        QUALIFYING     ACTUAL IO_SAVED_PCT SQL_TEXT
------------- ---------- ---------- ------------ --------------------------------------
gfjb8dpxvpuv6   35028992       6872        99.98 select * from emp where empid = 7934

SQL>

Looking at the output from the V$SQL query we can see that a Smart Scan was executed since we had both eligible and returned bytes from the cell offload process. This is probably the quickest and easiest way to verify Smart Scan activity. There is another way, though, that is a bit more involved but gives more detailed stats on Smart Scan activity. This method involves the storage cells and the cellsrvstat program; since it involves the storage cells it can be a bit more work, especially if passwordless ssh connectivity isn’t configured between the database servers and the storage cells. [I won’t go into configuring that in this post but there are various resources on the Internet available to assist with that task. http://www.google.com is a great place to start looking for those.] I will presume that such connectivity is configured; getting these stats is a fairly straightforward task that can be scripted and run from the command line on one of the database servers. The script would be similar to this:

#!/bin/ksh

ssh celladmin@exasrv1cel01-priv.mydomain.com “cellsrvstat -stat_group=smartio -interval=$1 -count=$2”
ssh celladmin@exasrv1cel02-priv.mydomain.com “cellsrvstat -stat_group=smartio -interval=$1 -count=$2”
ssh celladmin@exasrv1cel03-priv.mydomain.com “cellsrvstat -stat_group=smartio -interval=$1 -count=$2”

The script takes two parameters, the sample interval and the count of samples to execute; as an example of how this would look at the Linux command line:

$ cellsrvstat_smartio.sh 5 10

which would set a sampling interval of 5 seconds with 10 samples gathered. One nice aspect of setting this up with a script and ssh access is that you can redirect the output to a local file on the database server and look at it later:

$ cellsrvstat_smartio.sh 5 10 > cellsrvstat_all_cells_smartio.lst

This makes it easier to scan all of the output later for review. The output is presented in a tabular format, shown below:


===Current Time===                                      Fri Mar  8 14:37:15 2013

== SmartIO related stats ==
Number of active smart IO sessions                              8             11
High water mark of smart IO sessions                            0            130
Number of completed smart IO sessions                           4        8490054
Smart IO offload efficiency (percentage)                        0              7
Size of IO avoided due to storage index (KB)                    0  2511867327696
Current smart IO to be issued (KB)                              0         214456
Total smart IO to be issued (KB)                            69592  5335617009440
Current smart IO in IO (KB)                                     0              0
Total smart IO in IO (KB)                                   69592  5333234484896
Current smart IO being cached in flash (KB)                     0              0
Total smart IO being cached in flash (KB)                       0              0
Current smart IO with IO completed (KB)                         0           3200
Total smart IO with IO completed (KB)                       69592  5427224552840
Current smart IO being filtered (KB)                            0              0
Total smart IO being filtered (KB)                          69592  5426811394136
Current smart IO filtering completed (KB)                       0          24528
Total smart IO filtering completed (KB)                     69592  5332811276704
Current smart IO filtered size (KB)                             0           8075
Total smart IO filtered (KB)                                13258   378946996063
Total cpu passthru output IO size (KB)                          0      449965296
Total passthru output IO size (KB)                              0      472758949
Current smart IO with results in send (KB)                      0              0
Total smart IO with results in send (KB)                    69592  5332478889672
Current smart IO filtered in send (KB)                          0              0
Total smart IO filtered in send (KB)                        13258   378925244226
Total smart IO read from flash (KB)                             0              0
Total smart IO initiated flash population (KB)                  0              0
Total smart IO read from hard disk (KB)                     69592  2786064722152
Total smart IO writes (fcre) to hard disk (KB)                  0    35298961256
Number of smart IO requests < 512KB                            20      134456077
Number of smart IO requests >= 512KB and < 1MB                 65      116155599
Number of smart IO requests >= 1MB and < 2MB                    3        2226841
Number of smart IO requests >= 2MB and < 4MB                    0      188723594
Number of smart IO requests >= 4MB and < 8MB                    0     1085851766
Number of smart IO requests >= 8MB                              0              0
Size of the smart IO 1MB IO quota being used                    0              0
Hwm of the smart IO 1MB IO quota being used                     0           1002
Number of failures to get 1MB IO quota for smart IO             0         674099
Number of times smart IO buffer reserve failures                0              0
Number of times smart IO request misses                         0         296985
Number of times IO for smart IO not allowed to be issued        0    58504757930
Number of times smart IO prefetch limit was reached             2      125315622
Number of times smart scan used unoptimized mode                0              0
Number of times smart fcre used unoptimized mode                0              0
Number of times smart backup used unoptimized mode              0              0

The metrics are, to me, at least, clearly described in the table so you won’t be wondering what an abbreviated metric name actually represents. The left numeric column indicates current Smart I/O activity and the rightmost numeric column is a cumulative value since the storage cell came online.

Another nice aspect of the storage cell method is you can monitor Smart Scan activity in progress, instead of waiting for the process to finish and Oracle to update the stats in the V$SQL/GV$SQL views. To monitor all cells simultaneously you can create three (or more, depending upon how many storage cells you have in your Exadata configuration) separate scripts and run them at basically the same time in the background. To be ‘really cool’ you could wrap the individual storage cell scripts into a single driver script and, at the end of the monitoring, concatenate the log files together into one huge report. Neat. Since I have to leave something for you to do I won’t illustrate those scripts here.

Exadata provides not only Smart Scans but at least a couple of ways to monitor them; which one you use is your choice, really, and it depends on if you want real time monitoring or just a daily report on how much Smart Scan activity occurred on a given day. Either way you’ll know if you’re executing Smart Scans and how much time (and I/O) you’ve saved because of them.

Speedy and smart. That’s a winning combination.

November 8, 2012

I Can’t Find My Keys

Filed under: Exadata — dfitzjarrell @ 10:32

Exadata offers many features to improve performance — Smart Scans, offloading, Infiniband internal network — but the one feature not often mentioned is the storage index. Designed to prevent Exadata from reading data blocks unnecessarily its purpose is sometimes not clearly explained, leading to confusion. So what is a storage index, where is it found and what does it do? Let’s see if we can answer those questions, and possibly a few more.

Exadata storage indexes are … indexes … but not in the usual bitmap/b-tree sense. You can’t use one to pinpoint the exact location of a particular record and you can’t scan one for a particular range of values. You can’t build one with DDL like ‘create storage index …’ because they are dynamically created in memory based on predicates in a where clause. How, then, are they built, what do they contain and what good purpose do they serve?

Exadata divides the storage into 1 MB sections, on a per table basis; for a quarter rack configured with high capacity disks using the out-of-the-box 80/20 data to recovery distribution there is roughly 30 TB of usable storage to be divided into 1 MB ‘chunks’, making approximately 31 million storage segments, each with its own storage index. That’s not all, as each storage index can contain data for up to eight columns referenced in where clauses of queries that satisfy the conditions for a Smart Scan. For example, we can have, in segment 1, the following eight columns ‘indexed’:

Table name		Column name
======================  ======================
ORDERS			ITEM_NO
			SHIP_ID
			SHIP_DT
			SHIP_CD
			ORDER_NO
			ORDER_DT
			INVOICE_NO
			INVOICE_AMT

and in segment 2 we may have the following columns indexed:

Table name		Column name
======================  ======================
ORDERS			QTY
			UNIT_PRICE
			TTL_PRICE
			SHIP_TO
			ORDER_STATUS
			ORDER_DT
			INVOICE_NO
			INVOICE_AMT

Notice that only three columns occur in both segments: ORDER.DT, INVOICE_NO and INVOICE_AMT. This can go on for the other 31 million storage segments, a fact that can make discussing storage indexes confusing. Remember, though, that it’s not really what is in the storage index that makes the difference, it’s what ISN’T in the storage index that provides the performance. Each storage index includes, besides the column each entry references, the minimum and maximum value for that column within that 1 MB segment and whether or not that column contains NULLs. That information is what gives the storage index its power as, during a Smart Scan as tasks are offloaded to the storage cells, Exadata scans these storage indexes for columns and values of interest. If the storage index contains the desired column and the minimum and maximum values don’t fall within the desired range that 1 MB segment is skipped. If you’re looking for a rather narrow range of values such segment skipping can reduce the volume of data scanned by gigabytes, or even terabytes, and that is the power of the storage index. It tells Exadata which data blocks NOT to read so some people have described storage indexes as ‘reverse indexes’ which is, I suppose, an apt description of their behaviour.

Do storage indexes always help? No, there are conditions when a storage index can’t provide any benefit, such as when the columns of interest are not indexed or when the query in question doesn’t qualify for a Smart Scan. Smart scans are necessary to reap the benefits of a storage index because storage indexes exist at the storage cell level and are accessed (if they’re available) as part of the offloading process. It’s important to remember the conditions that trigger Smart Scans:

Full table scans and/or full index scans coupled with direct path reads, at least one predicate and using the following simple comparison operators
=,<,>,>=,=<,BETWEEN,IN,IS NULL,IS NOT NULL

– or –

Parallel query is in use, again with at least one predicate and using the above mentioned simple comparison operators

Notice the first set of conditions — just because a full table scan or full index scan is in use there is no guarantee that direct path reads will be used so not every full table scan or full index scan qualifies for a Smart Scan. Parallel query, by design, does implement direct path reads. [Oracle has supplied a hidden parameter, _serial_direct_read, in 11.2 which, when set to TRUE (the default is auto), will cause Oracle to favor direct path reads in serial access mode. Notice I said ‘favor’; it doesn’t guarantee serial direct path reads will be executed, it simply increases the likelihood that they will occur. Setting this parameter can also introduce performance problems as Oracle attempts to use serial direct path reads, thus it’s best to test this on a non-production system.]

Storage indexes cannot be modified or tuned like traditional b-tree indexes, and they are not physical objects in the database. They are in-memory ‘tables’, if you will, built ‘on-the-fly’ which will need to be rebuilt if the cellsrv program (the operating system for the storage cells) is restarted so on a reboot the storage indexes created after the restart are not likely to be the same as those that existed before cellsrv was rebooted. Because they can’t be tuned or modified there are no hints to ensure their use; the optimizer won’t know and there are no wait events that monitor storage index usage so AWR and ASH won’t be able to report if they are used or not. That is not to say there isn’t SOME way of knowing if a storage index has been used — a single statistic, cell physical IO bytes saved by storage index, reports, in a cumulative manner by session, how many bytes were saved using storage indexes. Since it’s a cumulative statistic it will be necessary to query that statistic, using the v$sesstat view, before running a candidate query, execute the statement then query the statistic again afterwards. Subtracting the latter value from the former will provide the bytes saved by storage indexes for that query.

Do storage indexes work with Hybrid Columnar Compressed (HCC) tables? Yes, in fact storage indexes can work with compressed tables using any valid level of compression (HCC.OLTP). They also work with bind variables as those values are passed to the storage cells during the offload process. Partitioned tables can also make use of storage indexes, so for qualifying queries both partition pruning and storage indexes can combine to dramatically reduce I/O. Joins and multi-column predicates (again the limit is 8 columns per 1 MB segment) will also make use of storage indexes should a Smart Scan be executed. They will even benefit queries using subqueries as the returned value from the subquery will be compared to the minimum/maximum values for the given column.

Are there any other limitations to storage indexes? Yes, a storage index won’t be used for CLOB columns as CLOBS are not indexed, and predicate values using wildcards (‘%’,’_’) cannot be searched. Index-organized tables also won’t qualify for smart scans — with an IOT the row source cannot be offloaded so a Smart Scan won’t occur. Index range scans won’t trigger Smart Scans, nor will queries using reverse-key indexes. As with other, traditional indexes, functions applied to columns disable a storage index, such as using trunc() on a date column. Implicit conversions used to disable storage indexes on occasion, especially when the optimizer didn’t recognize the string as a convertible value (such as date strings). Recent releases of the Exadata storage server software have corrected most of these issues.

Storage indexes are part of the ‘secret sauce’ of Exadata, and can provide dramatic results for qualifying queries. Knowing when they can be used and when they can’t should help when tuning performance issues you may experience with an Exadata configuration.

 Sometimes it’s good to lose your keys.

October 10, 2012

“Ex”-Men

Filed under: Exadata — dfitzjarrell @ 09:00

In 2008 Oracle introduced Exadata, and data warehousing, business intelligence and OLTP applications have benefited from this integrated technology stack. Exadata is not like anything most DBAs have ever seen or managed; it’s a tuned combination of database server, storage server and private interconnects that truly can outperform commodity hardware configurations. Let’s take a peek inside Exadata and get a glimpse of what Oracle hath wrought.

An Exadata machine is more than just the parts that comprise it. In its smallest configuration the rack houses two database servers running Oracle Enterprise Linux, a storage server managing storage ‘cells’ (running its operating systemn known as ‘cellsrv’) and a private interconnect known as InfiniBAND to allow the database servers to talk, albeit indirectly, to the storage server. Typically in this configuration the raw storage adds up to 104 terabytes, and it’s configured with normal redundancy providing, all told, 52 terabytes of storage. Accounting for the operating system and other management software the usable storage runs around 39 TB divided among diskgroups earmarked for database storage, recovery and an interesting feature known as DBFS (more on that later). How does all of this hardware, software, memory and storage work together? Quite nicely as Oracle has done an excellent job of creating a database appliance that can dramatically improve performance over conventional setups.

Exadata is built to efficiently use the resources to return results to the end user. Toward that end it has been designed to offload query tasks to the storage server, when possible, to reduce the amount of data passed to the calling session. Certain criteria must be met to offload to the storage server:

* Full table or index scans must be used
* Direct path reads must be used to access the data

If the first requirement is not met another option can trigger the offloading mechanism:

* Parallel query is in use

Parallel query, by default, uses direct path reads so even if full scans aren’t used offloading can occur. What is offloading? Exadata is smart enough to know that some operations can be transferred to the storage server and serviced by the cells directly resulting in far less data traversing the InfiniBAND ‘pipe’. Offloading itself doesn’t do this, it’s what Exadata does with the offloaded tasks that makes the difference. Offloading triggers a new execution path element called Smart Scans, which use another new item, Storage Indexes. Storage indexes are created in memory (so they’re resident as long as cellsrv is not restarted) and can contain up to 8 columns per storage segment. For Storage Indexes the storage cells are divided into segments of 1 MB and each segment has its own Storage Index. This configuration allows Exadata to create Storage Indexes such that on one segment 8 columns of table A are indexed, on the next segment 8 columns of table B are indexed, on the next segment 8 different columns from table A are indexed, and so on. Each storage index records the lowest and highest values from each column in the index, and also include NULL values. Indexing NULLs gives Storage Indexes their true power, as these are not ‘where do I find this value’ indexes as normal B-Tree indexes are. Storage indexes are used to tell Exadata where to NOT look for data; Exadata will skip over storage segments that have no possibility of containing the desired values and this dramatically speeds searches and data retrieval. If a Smart Scan is effected it can return data in mere seconds rather than the minutes it could take using a conventional database configuration (database server, switch, network storage array); remember, though, that not all queries submitted to Exadata will qualify for Smart Scans. That being the case the integrated design of the Exadata machine enables it to outperform conventional setups in most cases, with or without using a Smart Scan.

How do you know you’ve used a Smart Scan? The ‘cell smart table scan’ wait event or the ‘cell smart index scan’ wait event will show activity indicating a Smart Scan occurred. Nothing in the execution plan would be displayed to prove a Smart Scan was used; querying v$waitstat would be the only way to confirm that a Smart Scan was implemented.

Another interesting aspect of Exadata is that the database servers have no direct access to the storage cells. ASM is configured to be the bridge between the database and storage servers. The storage server has its own operating system and Oracle database which is how Smart Scans are executed. All of the components are integrated, which can be a challenge for organizations which separate UNIX/Linux system administration duties, Oracle DBA responsibilities, network configuration and storage duties as patching and maintenance for the Exadata machine is performed using bundles that ensure no software/firmware mismatches occur. Such mismatches could result in Exadata features not working properly or, at the worst case, failing to work at all. Oracle suggests a new role, the DMA (Database Machine Administrator) that has both Oracle admin rights and O/S admin rights (root) so that patching can be performed and verified by personnel trained in Exadata machine management. It is possible to divide the responsibilities provided there is good communication between the groups during maintenance outages.

Getting back to parallel query there is a new way, in 11.2, to manage parallel executions across sessions which is extremely effective in the Exadata environment — parallel statement queuing. This works in concert with another new feature, Auto Degree of Parallelism, which can dynamically reconfigure the parallelism during query execution. Auto DOP is controlled by the parallel_degree_policy parameter and when set to auto will implement auto DOP. No objects need to have DEGREE set as Oracle will determine, at run time, the optimal DOP for the query provided it will take longer than parallel_min_time_threshold, which defaults to 10 seconds. (As usual this parameter can also be set to a different value to enable more, or fewer, queries to run in parallel.) Oracle also provides some measure of control on how it determines the ideal DOP with the parallel_degree_limit parameter which defaults to CPU but can be set to CPU, IO, AUTO or an integer value. None of this will work on 11.2 if I/O calibration has not been completed. This is accomplished with the DBMS_RESOURCE_MANAGER.CALIBRATE_IO procedure; it takes two IN parameters, the number of disks and the maximum disk latency, and three OUT parameters, placeholders for latency, maximum i/o per second and the maximum MB per second. The procedure can take as long as 15 minutes to run as it generates a random workload spread across all nodes in the RAC cluster. Once completed Auto DOP can be enabled and will function as expected. If the calibration has not been done then Oracle will bypass Auto DOP and report in the execution plan that the IO calibrate statistics are missing. Getting back to parallel statement queuing once parallel_degree_policy is set to auto and parallel_servers_target is set to a ‘reasonable’ value (‘reasonable’ meaning the desired maximum number of parallel servers you want to have running at one time) you’re ready to go. Queries will run using Auto DOP until the system hits the limit you set with parallel_servers_target at which time the remaining queries will be queued until resources (parallel query slaves) are available. By default it’s a First In / First Out queue — this behavior can be changed but I won’t cover that in this article. As slaves are made available by prior queries completing the next query in line is run and this continues until the system has freed enough resources to run the remaining queries in the queue or the queue comes to end. Monitoring the queue progress is fairly simple as the information is found in the V$SQL_MONITOR view; if the STATUS is QUEUED then the query is waiting for resources to be made available. V$PX_PROCESS_SYSSTAT can report how many slaves are in use; using both views will enable you to determine which queries are queued and when the next query will come out of the queue to be executed.

DBFS is a unique aspect of Exadata that allows Oracle to utilize a dedicated tablespace as a file system. It’s configured through the clusterware and mounted to a given directory at the O/S level to provide additional storage space for items such as expdp dump files, documents, patch bundle zip files, etc. It operates, at the O/S level, as a regular file system. One caveat of this configuration is the tablespace must be monitored for free space; if the free space drops below, say, 300 GB in a 2 TB file system the df -h call can hang and access to the mounted directories will be compromised. A simple extension of that datafile restores dbfs to it’s speedy self; since DBFS uses bigfile tablespaces (one per mounted file system) you can’t add another datafile to the tablespace to increase the storage (it’s designed that way to prevent ‘striping’ the O/S data across multiple datafiles which could be a performance hit on reads and writes). On reboots of the server some manual intervention is necessary to start the DBFS resource and mount the file system but this not difficult as the shell script to mount the file system is created in the DBFS setup and configuration procedures. More information on configuring DBFS is found in this Metalink document.

There is so much more to Exadata, more than can be put into a single article which is why there are several books on the subject. (My personal choice is “Expert Oracle Exadata”, from Apress, written by Kerry Osborne, Randy Johnson and Tanel Poder.) The intent of this article is to provide a starting point for the journey into Exadata by discussing some of the features and, well, wonders, of the Exadata machine and I hope my efforts have been successful. I am curently on a project where a client is migrating Oracle ERP databases to Exadata and I’m learning new areas of Exadata every day, and enjoying it immensely. Where once Exadata was a novelty it’s now a proven technology, implemented by many organizations world-wide. Some day soon it may come to you (if it hasn’t already); what I’ve covered here should prepare you for your first experience with Exadata.

X-Men all started somewhere.

Create a free website or blog at WordPress.com.