Oracle Tips and Tricks — David Fitzjarrell

July 22, 2014

That’s … Huge!

Filed under: Exadata,General,Performance — dfitzjarrell @ 14:27

Recently I’ve noticed the occasional thread in Oracle newsgroups and lists asking about hugepages support in Linux, including ‘best practices’ for hugepages configuration. This information is out on that ‘world-wide web’ in various places; I’d rather put a lot of that information in this post to provide an easier way to get to it. I’ll cover what hugepages are, what they do, what they can’t do and how best to allocate them for your particular installation. Let’s get started.

“Normal” memory pages in Linux are 4 KB in size and are allocated as needed where the memory map will allow, so they are likely not contiguous. Hugepages, in comparison, are 2 MB pages locked in memory and allocated in a contiguous ‘chunk’; they are allocated at boot time using a parameter in the /etc/sysctl.conf file named vm.nr_hugepages for RedHat Linux kernel 2.6 and a parameter named vm.hugetlb_pool for RedHat Linux kernel 2.4. Remember that each page is 2 MB in size, as this affects how many hugepages you need to allocate to cover all of the SGAs of running Oracle databases. Set this too high and your system might not boot; set this too low and you won’t get the benefit of using hugepages in one or more Oracle instances. [When use_large_pages is set to ONLY and insufficient hugepages are available the database won’t start; changing use_large_pages to TRUE allows the database to start but it won’t be using hugepages, which results in more paging activity, lowering performance.]

Since hugepages are contiguous it stands to reason that performance will improve, as Oracle won’t need to access memory segments scattered across the entire range of installed memory. Hugepages also reduce the size of the ‘page tables’, cutting the page table overhead for each connection to the database. Page tables are part of Linux memory management, created and used to map a virtual address to a physical memory location. For SGAs of 100 MB and greater using hugepages is probably a good idea, and that idea gets better as the SGA size increases. Kevin Closson reported back in 2009 on the page table overhead for an Oracle database with an 8000 MB SGA running 500 simultaneous connections. The page table overhead reported was startling: almost 6.8 GB of memory used for page tables when hugepages were not configured. He ran the same test again on the same Oracle database, but that test was run after hugepages were allocated and used by the Oracle database. The page table overhead was reduced to slightly less than 23 MB. That is a dramatic change in memory usage, one that can definitely improve performance.
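
To see how hugepages are currently configured and consumed on a server, the hugepage counters in /proc/meminfo give the picture at a glance; the values below are illustrative only:

$ grep -i huge /proc/meminfo
HugePages_Total:   25240
HugePages_Free:      310
HugePages_Rsvd:      295
HugePages_Surp:        0
Hugepagesize:       2048 kB

HugePages_Total multiplied by Hugepagesize is the memory set aside for hugepages; a large HugePages_Free value with little reserved (HugePages_Rsvd) after all instances are up usually means more pages were allocated than the SGAs actually need.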

Hugepages are available for IPC (Inter-Process Communication) shared memory; this is the ‘standard’ shared memory model (dating back to UNIX System V) that allows multiple processes to access the same shared memory segment. There is another form of shared memory segment, the memory-mapped file, and currently such segments are not supported by hugepages. Oracle, on Linux, gives you a choice between the two, and you make that choice by electing to use (or not use) Automatic Memory Management (AMM). Choosing AMM rules out hugepages since AMM uses memory-mapped files located in /dev/shm; you cannot use AMM and hugepages together. For databases with SGAs of 100 MB or less AMM may be acceptable; the script used to determine how many hugepages to allocate won’t even consider SGAs smaller than 100 MB. For Oracle databases with larger SGAs hugepages is, in my opinion, the choice to make, as you can significantly reduce shared memory overhead where page tables are concerned.
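
A quick way to verify whether an instance is using AMM (and therefore can’t use hugepages) is to check the memory_target parameter; a minimal check from sqlplus, with illustrative output:

SQL> show parameter memory_target

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
memory_max_target                    big integer 0
memory_target                        big integer 0

A value of 0 for both parameters means AMM is not in use; a non-zero value means the SGA and PGA are being managed through memory-mapped files in /dev/shm and hugepages won’t be used.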

Calculating how many hugepages to allocate has been a difficult task in the past; Oracle supplies a script that can calculate that value for you, provided all of the databases you want to use hugepages are running. The script won’t consider databases that aren’t up and available, since it uses the ipcs -m command to return the allocated shared memory segments. The script is available from My Oracle Support in Doc ID 401749.1. It generates a list of IPC shared memory segment sizes owned by Oracle and computes the total number of hugepages required, outputting the result to be used in setting the vm.nr_hugepages parameter in /etc/sysctl.conf. Once that setting is modified the system has to be rebooted for the change to take effect. Sample output from that script is shown below:


This script is provided by Doc ID 401749.1 from My Oracle Support
(http://support.oracle.com) where it is intended to compute values for
the recommended HugePages/HugeTLB configuration for the current shared
memory segments. Before proceeding with the execution please note following:
 * For ASM instance, it needs to configure ASMM instead of AMM.
 * The 'pga_aggregate_target' is outside the SGA and
   you should accommodate this while calculating SGA size.
 * In case you changes the DB SGA size,
   as the new SGA will not fit in the previous HugePages configuration,
   it had better disable the whole HugePages,
   start the DB with new SGA size and run the script again.
And make sure that:
 * Oracle Database instance(s) are up and running
 * Oracle Database 11g Automatic Memory Management (AMM) is not setup
   (See Doc ID 749851.1)
 * The shared memory segments can be listed by command:
     # ipcs -m


Press Enter to proceed...

Recommended setting: vm.nr_hugepages = 25240

For Linux RedHat kernel 2.6 this is the number of hugepages, not the overall size of the memory allocated for those hugepages. For RedHat kernel 2.4 a different parameter, vm.hugetlb_pool, is set to the actual size that pool needs to be. Fortunately the script knows which kernel is running and reports the correct parameter and value to set.
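
For the curious, the heart of the calculation on a 2.6 kernel is simply the Oracle-owned shared memory segments reported by ipcs -m, each rounded up to a whole number of 2 MB pages. A rough, hypothetical one-liner along those lines is shown below; the supported script from Doc ID 401749.1 handles different kernels, hugepage sizes and edge cases, so use it for anything real:

$ ipcs -m | awk '$3 == "oracle" { pages += int($5 / (2 * 1024 * 1024)) + 1 } END { print pages }'
25240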

After you have hugepages allocated you need to tell Oracle to use them exclusively. An init parameter, use_large_pages, should be set to ONLY to keep Oracle from allocating conventionally managed memory segments for the SGA. If your hugepages calculations are correct then all of the databases on your database server will start without error. Allocate too few hugepages and databases with use_large_pages set to ONLY won’t start once the existing hugepages are used up. Setting use_large_pages to TRUE will allow a database to start without using hugepages, but that’s not what you want if the goal is better performance from Oracle.
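
Setting the parameter is straightforward; use_large_pages is not dynamic, so the change goes into the spfile and takes effect at the next instance startup (a minimal example):

SQL> alter system set use_large_pages=ONLY scope=spfile sid='*';

System altered.

SQL> shutdown immediate
SQL> startup

On 11.2.0.2 and later releases the alert log reports the large pages allocated for the SGA at startup, which is an easy way to confirm the setting took effect.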

Earlier I mentioned that allocating too many hugepages can be worse than allocating too few. Since each hugepage consumes 2 MB of space, multiplying the vm.nr_hugepages setting by 2097152 shows how much memory is occupied by hugepages. It is possible to try to allocate all of the system memory as hugepages, resulting in a system that will not boot in multi-user mode. The kernel itself isn’t configured to use hugepages, so the more hugepages that are allocated the less conventionally managed memory is available; take it all away and there is no place for the operating system to start necessary processes after the kernel is loaded at boot time. Using single-user mode will allow the system administrator to change the offending entry to a more reasonable value so the system can be rebooted. Of course you don’t want to end up in that situation; manually calculating such a value can result in errors so use the Oracle-supplied script to calculate the hugepages value you actually need.
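
A quick sanity check before rebooting is to compare the memory the configured setting will consume against the physical memory on the server; for example (numbers are illustrative):

$ echo "$(( 25240 * 2 )) MB required for vm.nr_hugepages = 25240"
50480 MB required for vm.nr_hugepages = 25240
$ grep MemTotal /proc/meminfo
MemTotal:       74027752 kB

If the first figure approaches the second there won’t be enough conventionally managed memory left for the operating system, the PGAs and everything else that isn’t the SGA.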

Exadata systems run best when hugepages are allocated and used exclusively. The exachk script checks for such settings in each running database on the system and reports those that do not have use_large_pages set to ONLY. It also checks whether the currently configured hugepages setting meets or exceeds the estimated number of hugepages required by all running instances on the given database (compute) node. If this check fails it will most likely fail for all available nodes on the system. Failing this check isn’t the end of the world but it does mean that the failing databases aren’t using memory as efficiently as they could. For each node run the script, make the reported changes, then reboot. Before starting any of the databases ensure that all have use_large_pages set to ONLY; start the databases across the cluster using srvctl. The next run of exachk should report that the hugepages check passed.
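
A quick way to see which instances are, or are not, configured correctly before exachk flags them is to query gv$parameter from any node; the output below is illustrative:

select inst_id, name, value
from gv$parameter
where name = 'use_large_pages'
order by inst_id
/

   INST_ID NAME                 VALUE
---------- -------------------- --------
         1 use_large_pages      ONLY
         2 use_large_pages      TRUE

Any instance still reporting TRUE (or FALSE) is a candidate for an spfile change before the next restart.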

RAC databases need to be configured the same way Exadata databases are configured (since Exadata databases are RAC databases). Modify kernel parameters on every database server in the cluster, then reboot. Modify the spfile for each database in the cluster then use srvctl to start those databases cluster-wide. Presuming the hugepages settings are correct across the cluster there should be no error when starting the databases.
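
The mechanics, using a database named orcl purely as a placeholder, look something like this: make the spfile change while the database is up, stop it cluster-wide, update /etc/sysctl.conf and reboot each node, then start everything back up with srvctl:

$ sqlplus / as sysdba
SQL> alter system set use_large_pages=ONLY scope=spfile sid='*';
SQL> exit
$ srvctl stop database -d orcl
# update vm.nr_hugepages in /etc/sysctl.conf on every node, then reboot each node
$ srvctl start database -d orcl
$ srvctl status database -d orcl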

Using hugepages for Oracle databases on Linux is, to me, the only way to go if you want to increase database performance. Of course, if you have loads of memory to spare you may not see the need. Most systems I’ve worked on don’t have that luxury, so being as ‘frugal’ with memory as you can is good. Making the most of available resources is key, and hugepages on Linux can definitely help with respect to Oracle.

Size does matter.


July 15, 2014

It Pays To Be Smart

Filed under: Exadata,General — dfitzjarrell @ 09:40

Exadata is a powerful system, able to provide exceptional performance. Much of this performance is due to Smart Scans, Exadata’s mechanism for shifting the workload to the storage cells and having them filter and reduce the amount of data the database servers must process. Not every Smart Scan that starts ends up completing, though. Oracle may decide that a different path, one that doesn’t include a Smart Scan, is more efficient. Certain conditions must be met, and there is a metric that records Smart Scan starts and another that records how many of those starts actually continue as Smart Scans. Let’s look at why Oracle may decide a Smart Scan isn’t the ‘smartest’ route to the data and which metrics you can use to see how many Smart Scans end up using another execution path.

Three conditions can cause Oracle to forego a Smart Scan: setting optimizer_mode to FIRST_ROWS, setting it to FIRST_ROWS_n, or executing a query using a ‘where rownum’ predicate. Oracle initially opts to do the Smart Scan, then can perform a few conventional block I/O operations to see if that pathway can satisfy the query more efficiently. When that happens a metric, cell num fast response sessions, is incremented. If the block I/O is indeed faster Oracle foregoes the Smart Scan and completes the operation with conventional block I/O. If the block I/O shows that correct results won’t be returned to the calling session in a timely fashion Oracle resumes the Smart Scan and continues processing. Simply because this counter is incremented does not mean, by itself, that Oracle didn’t execute a Smart Scan. You’ll need another metric for that.

When Oracle decides to return to Smart Scan execution another metric is incremented, cell num fast response sessions continuing to smart scan. The same conditions that triggered the previous metric also apply here. Thus when Oracle decides to forego a Smart Scan only the previous metric, cell num fast response sessions, is incremented.

It’s fairly easy to see how many Smart Scans followed through and how many didn’t with the following query:


select a.value smart_scan_started, b.value smart_scan_cont, a.value - b.value no_smart_scan
from v$sysstat a, v$sysstat b
where a.name = 'cell num fast response sessions'
and b.name = 'cell num fast response sessions continuing to smart scan'
/

You may find that none of the Smart Scans that were started actually went on to finish:


SMART_SCAN_STARTED SMART_SCAN_CONT NO_SMART_SCAN
------------------ --------------- -------------
              1222               0          1222

Or you may find that all of them finished:


SMART_SCAN_STARTED SMART_SCAN_CONT NO_SMART_SCAN
------------------ --------------- -------------
               407             407             0

You may also find that some did, and some didn’t, continue:


SMART_SCAN_STARTED SMART_SCAN_CONT NO_SMART_SCAN
------------------ --------------- -------------
                53              27            26

What is important to note is that Oracle starting a Smart Scan and then choosing not to continue with it doesn’t, by itself, indicate a problem with your Exadata system. On the contrary, it indicates that Exadata is doing exactly what it’s supposed to do: efficiently return query results to the calling session. That may be through a Smart Scan, or it may be through conventional I/O. Think of it as adjusting your route to a destination to avoid heavy traffic. If Oracle can get the correct results through a few conventional I/O operations it will do so; if a Smart Scan is necessary Oracle can again ‘change gears’ and resume it to ensure correct data is sent to the calling session. It all comes down to using the right tool for the job.

Smart Scans are great when you need them; it’s also good to know that Oracle, on Exadata, can decide when to use them and when it’s best not to. It’s also good to realize that when that happens it’s not because of a problem with Exadata. It’s simply Oracle choosing the right method to get the correct data. Sometimes the shorter path is the better path.

Robert Frost may have said it best:


 
"Two roads diverged in a wood, and I—  
I took the one less traveled by,  
And that has made all the difference." 

Oracle, on Exadata, may take the ‘road less traveled by’ when it comes to choosing a Smart Scan. That’s normally nothing to worry about as Oracle is selecting the most efficient path to return the data.

Anyone for a stroll?

July 1, 2014

What A Performance!

Filed under: General,Performance,stats — dfitzjarrell @ 10:09

Performance is one of the big watchwords in the IT industry; nowhere is it more often used than in the database arena, and it’s usually not a good sign. End users know nothing, really, of the inner workings of the database yet they are the first to point the ‘finger of blame’ at the database when processing slows or queries take longer than they would like to return data. [For end users, performance is time, and the longer a process takes the more dismal performance seems to be. Given that perspective their complaints seem justified.] Yes, there ARE times when the database is at fault: ‘missing’ indexes or stale statistics can cause execution plans to change, as can inserts and updates to heavily transacted tables (which many times goes back to stale statistics, since they no longer accurately reflect the data and its distribution in the table). Bugs in the database software can also cause the gathered statistics to miscalculate values or generate ‘useless’ frequency histograms, such as on single-valued not null columns. Such histograms can direct the optimizer to choose a path it wouldn’t take if the statistics and histograms contained useful data.

Here is where the problem can get complicated: it may not be the database at fault when performance is down. Yes, it’s the database server you’re getting data from. Yes, things were running just fine until today. By ‘logical’ deduction it’s the database at fault, right? Not necessarily. The database also has to deal with the operating system, the storage system, the available memory and the network to get its work done. Problems in any of those areas are likely to adversely affect database performance. So it’s not necessarily the database causing the problem. How do you know where the problem originates? Oracle does provide some tools to help in assessing where a problem is likely to be, but you may also need to involve System Administrators, Network Administrators and/or Storage Administrators to fully investigate some issues. Let’s take a look at the Oracle supplied tools, along with some operating system utilities that can be used to diagnose performance issues.

At the first sign of trouble many look to the DBA to investigate the database for bottlenecks or other performance-robbing possibilities that surely must exist. Oracle provides at least two tools in releases 10.1 and later to investigate the database for performance problems: the AWR report, a general overall ‘picture’ of database ‘health’ that can give direction to such a quest, and the ASH report, a more detailed ‘picture’ for a given window of activity that provides more detailed wait information. These utilities should be used at the first report of performance issues to determine if the problem is database-related or caused by another area, such as the operating system or network.

Starting with the AWR (Automatic Workload Repository) report, the two sections to begin an investigation with, in my estimation, are Wait Events Statistics and SQL Statistics. The first provides a system-wide view of the waits encountered and the times those waits consumed. The Background Wait Events and Wait Event Histogram sections can provide some insight into what was occurring during the report window; waits to watch are I/O related waits (reads, writes) which can signal issues with a query plan or issues with the storage (issues not addressable by the DBA). The SQL Statistics section lists SQL statements ordered by several measures, including overall elapsed time, consumed CPU time, I/O wait time and physical reads. The SQL statistics can usually point to ‘bad’ plans, making the performance problem a database issue since such plans are often the result of bad or stale statistics. In 11.2 and later releases dynamic sampling can mitigate missing statistics by automatically sampling the data prior to generating a query execution plan. We won’t get into the details of using these reports to effect database tuning procedures as that’s a topic for another article; the intent here is to provide a means to determine IF the problem is a database issue or if it’s caused by outside ‘forces’.

For releases earlier than 10.1 Oracle provides Statspack, the precursor to AWR and the next step after the original database snapshot utilities utlbstat and utlestat. This utility can provide a very good snapshot of database activity, although not quite as robust as the AWR report, and it is still available in 11.2 if you would like to install it. AWR and ASH are part of the Diagnostics Pack for OEM, and thus need to be licensed if they are to be used. Statspack doesn’t require such a license.
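
Generating the reports is simple; the scripts ship with the database and are run from sqlplus, prompting for a report format, the snapshot range (AWR/Statspack) or a time window (ASH):

SQL> @?/rdbms/admin/awrrpt.sql
SQL> @?/rdbms/admin/ashrpt.sql
SQL> @?/rdbms/admin/spreport.sql

awrrpt.sql builds the AWR report, ashrpt.sql the ASH report and spreport.sql the Statspack report (the last only if Statspack has been installed); the ‘?’ is sqlplus shorthand for ORACLE_HOME.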

The “Top 5 Timed Foreground Events” section gives an overview of the wait events which consumed the most time during the report window:

Top 5 Timed Foreground Events

Event                      Waits  Time(s)  Avg wait (ms)  % DB time  Wait Class
DB CPU                                  0                     82.42
log file sync                  8        0             11      30.72  Commit
Disk file operations I/O       1        0              1       0.33  User I/O
os thread startup              0        0                      0.00  Concurrency
log file single write          0        0                      0.00  System I/O

Host CPU (CPUs: 4 Cores: 2 Sockets: 1)

This provides a quick look at the ‘worst offenders’ in the wait event category (although they may not be performance-robbing events). Start here then work down to the Operating System Statistics section:

Operating System Statistics – Detail

Snap Time         Load  %busy  %user  %sys  %idle  %iowait
20-Jun 07:04:42   0.00
20-Jun 08:00:45   0.00   7.12   4.65  2.48  92.88     0.00

The Wait Event Histogram Detail can be very informative:

Wait Event Histogram Detail (64 msec to 2 sec)

  • Units for Total Waits column: K is 1000, M is 1000000, G is 1000000000
  • Units for % of Total Waits: ms is milliseconds s is 1024 milliseconds (approximately 1 second)
  • % of Total Waits: total waits for all wait classes, including Idle
  • % of Total Waits: value of .0 indicates value was <.05%; value of null is truly 0
  • Ordered by Event (only non-idle events are displayed)
% of Total Waits
Event Waits 64ms to 2s <32ms <64ms <1/8s <1/4s <1/2s <1s <2s >=2s
ADR block file read 3 81.3 12.5 6.3
Disk file operations I/O 1 98.2 1.8
control file parallel write 15 98.7 .8 .3 .1
control file sequential read 139 94.0 4.7 .8 .3 .1 .0
db file parallel read 2 50.0 50.0
db file parallel write 185 91.4 5.7 1.8 .8 .1 .2
db file scattered read 63 93.5 5.0 1.2 .3
db file sequential read 142 94.2 4.5 1.2 .1 .0
direct path read 1 100.0
enq: CR – block range reuse ckpt 16 82.2 15.6 1.1 1.1
log file parallel write 82 92.5 4.6 2.6 .3 .1
log file switch completion 2 50.0 50.0
log file sync 1 90.0 10.0

Wait Event Histogram Detail (4 sec to 2 min)

No data exists for this section of the report.

Wait Event Histogram Detail (4 min to 1 hr)

No data exists for this section of the report.

Service Statistics

  • ordered by DB Time
Service Name     DB Time (s)  DB CPU (s)  Physical Reads (K)  Logical Reads (K)
SYS$USERS                  0           0                   0                  2
SYS$BACKGROUND             0           0                   8                436
smedley                    0           0                   0                  0
smedleyXDB                 0           0                   0                  0

Such histograms can show where the preponderance of waits lie, making it easier to determine whether they are problematic.

Next on the list is the SQL Statistics report. There are several sections, the one shown is ordered by elapsed time:

SQL ordered by Elapsed Time

  • Resources reported for PL/SQL code includes the resources used by all SQL statements called by the code.
  • % Total DB Time is the Elapsed Time of the SQL statement divided into the Total Database Time multiplied by 100
  • %Total – Elapsed Time as a percentage of Total DB time
  • %CPU – CPU Time as a percentage of Elapsed Time
  • %IO – User I/O Time as a percentage of Elapsed Time
  • Captured SQL account for 1.1E+04% of Total DB Time (s): 0
  • Captured PL/SQL account for 2.3E+03% of Total DB Time (s): 0
Elapsed Time (s) Executions Elapsed Time per Exec (s) %Total %CPU %IO SQL Id SQL Module SQL Text
5.88 1 5.88 2070.31 15.66 79.60 acc988uzvjmmt delete from WRH$_SYSMETRIC_HIS…
4.60 22 0.21 1618.80 27.16 64.39 70vs1d7ywk5m0 MMON_SLAVE begin dbms_stats.copy_table_st…
2.49 44 0.06 876.07 10.66 83.39 2mp99nzd9u1qp delete from histgrm$ where obj…
2.14 769 0.00 753.49 4.38 96.01 db78fxqxwxt7r select /*+ rule */ bucket, end…
1.57 212 0.01 551.64 21.91 74.49 43ruajmwv9fus insert into histgrm$(obj#, int…
1.42 1 1.42 501.54 7.67 86.00 0sfsrf0qzvsjv delete from WRH$_SYSMETRIC_SUM…
1.19 22 0.05 418.37 2.63 93.41 d89c1mh5pvbkz MMON_SLAVE SELECT /*+ rule */ U.NAME, OT….
0.83 1 0.83 291.84 13.18 77.73 2whm2vvjb98k7 delete from WRH$_SQL_PLAN tab …
0.80 22 0.04 281.90 25.34 67.82 8sd37ztuwkyyc MMON_SLAVE begin dbms_swrf_internal.massa…
0.66 1 0.66 231.35 2.38 98.81 28bm5y53nujvz delete from WRH$_RESOURCE_LIMI…
0.54 1 0.54 188.64 0.00 91.89 4y3yvmbav7uwb delete /*+ UNNEST(@SEL$2) UNNE…
0.53 22 0.02 185.34 0.00 96.67 63kf3an2j0pkc insert into sys.wri$_optstat_h…
0.51 1 0.51 179.21 9.20 75.31 96w2cbx7ntmtv delete from WRH$_BG_EVENT_SUMM…
0.50 1 0.50 175.27 6.27 94.54 31sdu97k3baq1 delete from WRH$_SHARED_POOL_A…
0.49 22 0.02 171.58 12.81 66.96 480utgnq0t6mf MMON_SLAVE begin dbms_stats.unlock_partit…
0.44 1 0.44 156.14 17.60 80.45 1h7zt6jks80pa delete from WRH$_ENQUEUE_STAT …
0.42 1 0.42 146.27 15.03 94.38 bzscyq07w79ab delete /*+ dynamic_sampling(4)…
0.38 2,361 0.00 133.20 24.75 80.65 96g93hntrzjtr select /*+ rule */ bucket_cnt,…
0.35 1 0.35 123.71 4.44 94.32 9v9n97qj8z1dg delete /*+ dynamic_sampling(4)…
0.35 22 0.02 122.04 27.02 66.45 2rn08n06qr261 update histgrm$ set obj# = dec…
0.34 1 0.34 121.06 0.00 96.73 6ajkhukk78nsr begin prvt_hdm.auto_execute( :…
0.31 1 0.31 107.67 0.00 97.51 48ffykkmk7sgc delete /*+ dynamic_sampling(4)…
0.28 1 0.28 99.65 0.00 79.94 350myuyx0t1d6 insert into wrh$_tablespace_st…
0.27 375 0.00 94.36 17.47 87.22 bvn085xhxfn9f insert into sys.wri$_optstat_h…
0.26 161 0.00 90.72 6.06 90.14 74anujtt8zw4h select o.owner#, o.name, o.nam…
0.25 1 0.25 86.79 6.33 95.67 47x8f8z2hx1a2 delete from WRH$_MUTEX_SLEEP t…
0.24 1 0.24 85.22 6.45 92.64 az8cj7dfnpp45 delete from WRH$_MEM_DYNAMIC_C…
0.23 1 0.23 81.60 6.73 41.10 3mqvkt9as1phq merge /*+ dynamic_sampling(mm…
0.22 1 0.22 76.17 0.00 87.96 du2bqfpfj9au2 delete from WRH$_LIBRARYCACHE …
0.21 1 0.21 73.67 14.92 93.36 b7ba0zt64wtc7 delete from WRH$_RSRC_CONSUMER…
0.20 44 0.00 71.59 7.68 71.51 as3uq6ggb3gx6 delete from hist_head$ where o…
0.20 89 0.00 71.52 30.73 92.39 3ktacv9r56b51 select owner#, name, namespace…
0.20 10 0.02 71.14 0.00 99.04 c6awqs517jpj0 select /*+ index(idl_char$ i_i…
0.20 1,098 0.00 70.92 23.24 58.47 3c1kubcdjnppq update sys.col_usage$ set equa…
0.19 22 0.01 68.12 56.47 8.79 9dy457uf4uxzd MMON_SLAVE SELECT /*+ all_rows */ ‘”‘||UI…
0.19 1 0.19 67.57 0.00 0.00 f318xdxdn0pdc insert into wrh$_log (snap_id,…
0.19 113 0.00 67.01 0.00 84.34 59vjj34vugaav delete from obj$ where obj# = …
0.18 10 0.02 64.60 0.00 96.35 cvn54b7yz0s8u select /*+ index(idl_ub1$ i_id…
0.18 375 0.00 63.34 43.38 53.87 95mpkn5xz9001 insert into hist_head$(obj#, i…
0.18 22 0.01 62.59 0.00 71.91 dr277b6yv83uy MMON_SLAVE SELECT COLUMN_NAME FROM DBA_PA…
0.18 671 0.00 62.17 17.68 0.00 5ms6rbzdnq16t select job, nvl2(last_date, 1,…
0.18 1 0.18 61.81 8.89 96.19 7f8bfqbjxdssr delete from WRH$_LOG tab where…
0.17 1 0.17 61.43 17.89 95.77 7g732rx16j8jc insert into WRH$_SERVICE_STAT …
0.16 48 0.00 56.37 48.74 48.27 4h8danbxj5nz6 begin dbms_utility.validate(:b…
0.16 1 0.16 54.88 80.11 0.00 dayq182sk41ks insert into wrh$_memory_target…
0.15 252 0.00 54.58 30.21 39.65 7ng34ruy5awxq select i.obj#, i.ts#, i.file#,…
0.15 1 0.15 54.51 80.65 0.00 bm2pwrpcr8ru6 select sga_size s, sga_size_fa…
0.15 1 0.15 51.42 106.87 0.00 bunssq950snhf insert into wrh$_sga_target_ad…
0.14 1 0.14 49.59 0.00 93.41 084z6qq3wwkdc delete from WRH$_SQL_WORKAREA_…
0.14 1 0.14 48.68 0.00 90.79 17g3cqghntwb6 delete from WRH$_PGA_TARGET_AD…
0.13 22 0.01 44.60 0.00 35.38 7kc5p5bw4uz55 update tabpart$ set dataobj# =…
0.12 1 0.12 41.02 26.79 91.44 83hpfc7nhjpvg delete from WRH$_IOSTAT_FILETY…
0.12 22 0.01 40.72 121.45 16.37 267s83vd49cfa MMON_SLAVE SELECT PARTITION_NAME FROM DBA…
0.11 66 0.00 40.29 13.64 87.09 9fg7fjbwmv7v8 insert into sys.wri$_optstat_t…
0.11 1 0.11 39.34 13.97 65.25 9xcfxn794wp5h select 1 from WRH$_LATCH_CHILD…
0.11 1 0.11 37.29 0.00 98.77 3hy3nssu9shxj select /*+ FIRST_ROWS(1) PARAL…
0.10 6 0.02 36.11 0.00 63.61 9wncfacx0nj9h insert into smon_scn_time (thr…
0.10 2 0.05 35.13 0.00 1.98 934ur8r7tqbjx SELECT DBID FROM V$DATABASE
0.10 1 0.10 33.93 0.00 74.29 1v44r7vam2wbt delete from WRH$_IOSTAT_FUNCTI…
0.10 1 0.10 33.68 0.00 96.85 cs78htw2dhb0x delete from WRH$_IC_DEVICE_STA…
0.09 66 0.00 33.17 0.00 92.58 gc7b0drtzbyc6 select max(intcol#) from hist_…
0.09 89 0.00 33.07 0.00 92.64 8swypbbr0m372 select order#, columns, types …
0.09 1 0.09 32.97 16.66 86.94 8h77nkt8bnukh delete from WRH$_IOSTAT_DETAIL…
0.09 1 0.09 31.42 17.49 83.52 7gfhbhxrxx8wf delete /*+ dynamic_sampling(4)…

This section can help determine which queries may be creating performance issues; using the sql_id and text you can retrieve the plans to see if a rewrite might improve the situation. It can also show where statistics may not be current, although it would also be a good idea to run a 10053 trace against each suspect query to report what statistics the optimizer is using.
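
Enabling a 10053 trace is done at the session level before running the suspect statement; a minimal example follows, with the tracefile_identifier value chosen here purely to make the resulting trace file easy to find:

alter session set tracefile_identifier = 'cbo_trace';
alter session set events '10053 trace name context forever, level 1';
-- run the suspect query here; it must be hard parsed for the optimizer trace to be written
alter session set events '10053 trace name context off';

The trace file, containing the statistics and costing decisions the optimizer used, lands in the diagnostic trace directory for the instance.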

The ASH, or Active Session History, report digs deeper into important wait events and timings, providing a better picture of the wait activity during the requested window; a section of this report is shown below:

SQL ID Planhash Sampled # of Executions % Activity Event % Event Top Row Source % RwSrc SQL Text
572fbaj0fdw2b 1648993636 1 7.69 CPU + Wait for CPU 5.77 SELECT STATEMENT 5.77 select output from table(dbms_…
572fbaj0fdw2b 1648993636 1 7.69 db file sequential read 1.92 ** Row Source Not Available ** 1.92
6ajkhukk78nsr 1 3.85 CPU + Wait for CPU 3.85 ** Row Source Not Available ** 3.85 begin prvt_hdm.auto_execute( :…
1bx8mgs8by25x 2223429142 1 1.92 CPU + Wait for CPU 1.92 HASH JOIN 1.92 select coalesce( p1s.parameter…
2asgk01xtb6p0 3702291220 1 1.92 control file sequential read 1.92 FIXED TABLE – FULL 1.92 select d.dbid dbid , d.name db…
317v5hnvvd49h 1656119837 1 1.92 CPU + Wait for CPU 1.92 UPDATE 1.92 UPDATE wrh$_seg_stat_obj ob S…

Again, use this report to initially determine if the problem is database-related; if it is, this report should provide a wealth of data aimed at directing your database/query tuning efforts. If there is no indication of database problems then other areas are likely at fault. Waits to examine from both reports include SQL*Net more data to client, Disk file operations I/O and os thread startup. The first wait in that list can be an indicator of network problems, the second of storage issues (array problems, pending disk failures), and the third of CPU issues (excessive time for O/S thread startup can indicate over-taxed CPUs, which may be caused by other, non-database processes).
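
If you want a quick look at those waits outside of a report window, v$system_event carries cumulative counts and timings since instance startup; a simple query using the event names listed above:

select event, total_waits,
       round(time_waited_micro / total_waits / 1000, 2) avg_wait_ms
from v$system_event
where event in ('SQL*Net more data to client',
                'Disk file operations I/O',
                'os thread startup')
and total_waits > 0
/

Relatively high average waits for the first event point toward the network; high averages for the other two point toward storage or CPU starvation respectively.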

If, after examining reports generated by AWR, ASH or Statspack, you find the database is not the source of the performance issue, you may be able to discern which area outside of the database is the likely suspect. SQL*Net more data to client waits, if the time per wait is relatively high, can indicate problems in the network layer. Utilities such as netstat can show configured network interfaces and network traffic data on those interfaces. Example output is shown below:

$ netstat -ginw
Kernel Interface table
Iface       MTU   Met      RX-OK RX-ERR RX-DRP RX-OVR      TX-OK TX-ERR TX-DRP TX-OVR Flg
bondeth0     1500   0 5084620269      0      0      0 3923352780      0      0      0 BMmRU
bondeth0:1   1500   0      - no statistics available -                                BMmRU
bondeth0:3   1500   0      - no statistics available -                                BMmRU
bondeth0:4   1500   0      - no statistics available -                                BMmRU
eth0         1500   0   42182099      0      0      0   21397203      0      0      0 BMRU
eth4         1500   0     305021      0      0      0     281488      0      0      0 BMsRU
eth5         1500   0 5084315248      0      0      0 3923071292      0      0      0 BMsRU
ib0         64000   0 2419856775      0      1      0 2178061942      3    271      0 BMsRU
ib1         64000   0   21305668      0      0      0          0      0      0      0 BMsRU
lo          16436   0 4495752311      0      0      0 4495752311      0      0      0 LRU
$

The utility reports transmit and receive statistics, including dropped packets and packet errors. Running netstat -ginw may show which local interface is having issues; send such output to the Network Administrator so he or she can interpret the results and take appropriate action.

What if the problem isn’t the database, and it isn’t the network? The next area that can adversely affect performance is the storage tier. Statspack and ASH/AWR reports also provide disk read and write wait information. Again, if the wait time for an individual read or write wait is high (for example, a single wait over 300 milliseconds) it may be the storage that’s at fault. As disks age their performance can deteriorate as the alignment the platter had when new is compromised by wear and tear around the spindle. Firmware can also be an issue, especially when disks are upgraded and the firmware is not. This is not an area for the DBA to assess, but it is one he or she can report to the Storage Administrator for further investigation.
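
An operating system view of the same layer can help frame the conversation with the Storage Administrator; iostat, for example, reports per-device throughput, service times and utilization (the flags below request extended device statistics in megabytes, sampled every five seconds, three times):

$ iostat -dxm 5 3

Consistently high values in the await and %util columns for the devices backing the database files are the sort of evidence worth handing over along with the wait data from the AWR or Statspack report.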

Network-related issues may also be a problem with either the local network interface card (NIC) or the cabling. This is, again, a task for the Network Administrator to assess and address, but you may be able to ‘point’ him or her in the right direction with output from the ping utility. Provided you have two or more systems that can ‘see’ the same endpoint a ping comparison may show network problems local to the database server. For example if server A can ping server C with response times less than 20 milliseconds and server B, using the same subnet, reports ping times in the 100 millisecond or greater range this could indicate either cabling issues or a NIC that may be misconfigured or failing. This isn’t likely to be valid if the two servers are on different subnets which is why it’s necessary to get the Network Administrator involved as soon as you suspect a network issue. He or she can provide needed configuration data and can better analyze and diagnose such problems.

And yet another area where performance can suffer is at the operating system level. A system starved for CPU, memory or both because of a temporarily high load can severely slow down database processes along with many other non-database-related services. The System Administrator may need to be called in on the issue to report what processes were running during a particular period when response time was slow. Remember that end users equate performance with response time; a large, resource-intensive report that runs once a month can seriously slow down other database activity during that same period. System backups can also consume resources normally available for user and database processes. Knowing the period of poor performance can greatly assist the System Administrator investigating the issue. If it’s a defined window, for example from 8 PM to midnight on Tuesdays, the System Administrator can track down the process causing the slowdown by monitoring system activity during that period. It may be a process that can’t be stopped or rescheduled, but at least the end users will know what is causing their normally fast transactions to run much slower than usual. As mentioned before, when you know it’s not a database issue get the Administrators of the suspected systems involved; the earlier you get more eyes on the problem the faster the cause can be found and a solution put in place.
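
vmstat is a simple way for the DBA to capture a first impression of CPU and memory pressure before handing the problem to the System Administrator; the command below samples the system every five seconds, and the output shown is illustrative only:

$ vmstat 5 5
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 9  1  20480  95760  44196 612340   12   18   410   385 1320 2650 78 14  3  5  0

A run queue (the r column) persistently larger than the CPU count, or sustained swap activity in the si and so columns, points to a system-level resource shortage rather than a database problem.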

The DBA is the ‘go-to’ person when the database is to blame for performance issues (long run times for specific queries, as an example). But, when the DBA finds nothing awry in the database, other parties may need to get involved. Again, just because it seems to be the database at fault doesn’t mean that the database IS at fault; it could be other, related areas affecting performance. Knowing what database reports can be of use is the first step in being able to assess a situation so other areas, such as network and storage, can be brought in if need be. Knowing who to involve is just as important as knowing that a performance issue exists. Communication is key in these situations; a network or storage administrator may ask for additional information regarding the problem; this is where the AWR and/or Statspack reports may be helpful. Learn how to generate these reports now (if you don’t already know) so that when you need them you can concentrate on the issues at hand rather than how to generate a report.

Performance is a term that can have a broad interpretation, depending upon who is doing the interpreting. End users may more often report performance issues when a report that ran fine yesterday now takes much longer to complete because all they may see is their process running slowly. Knowing when the problem is the database and when it isn’t is key to getting such issues diagnosed and resolved. Hopefully this article has given you some tools to help in that regard; use them wisely and communicate well so that you can be part of the solution by knowing who else to involve when it isn’t the database that is at fault.

Now, on with the show!
