Performance Event
Monitoring
The Performance Event Monitoring feature uses Solaris
DTrace or Linux SystemTap to "publish" interesting events as
they occur inside of Berkeley DB. The operating system
utilities dtrace or stap
run scripts which select, analyze, and display events. There
is no need to modify the application. Any application which
uses that Berkeley DB library can be monitored. For more
information about these instrumentation tools refer to the
following pages:
Performance Event Monitoring is available for operating
systems where DTrace or SystemTap supports static probe points
in user applications, such as Solaris 10 and OpenSolaris, some
versions of Linux, and Mac OS X 10.6 and later. By including
--enable-dtrace to the configuration options, the resulting
libraries will include probe points for these event
categories:
-
database operations: opening a database or
cursor, get, put, delete.
-
internal operations regarding disk allocation,
transactions, concurrency control, and caching.
-
the beginning and ending of waiting periods due
to conflicting transactional consistency locks (for
example, page locks), mutexes, or shared latches.
-
timing dependent code paths which are expected
to be infrequent. These could raise concerns if they
were to happen too often.
These probe points are implemented as
user-level statically defined traces (USDT's) for DTrace, and
static userspace markers for SystemTap.
To monitor the statistics values as they are updated
include the --enable-perfmon-statistics configuration option.
This option generates probe points for updates to many of the
counters whose values are displayed by the db_stat utility or
returned by the various statistics functions. The "cache"
example script uses a few of these probe points.
Performance Event Monitoring is intended to be suitable for
production applications. Running Berkeley DB with DTrace or
SystemTap support built in has little effect on execution
speed until probes are enabled at runtime by the
dtrace or stap
programs.
The list of available events may be displayed by running
'make listprobes' after building the libdb-18.1
shared library.
Using the DTrace Provider
The DTrace probe provider for Berkeley DB is named
'bdb'. A simple dtrace command to monitor all
dtrace-enabled Berkeley DB activity in the system is:
dtrace -Zn 'bdb*::: { printf("%s", probename); }'
DTrace requires elevated privileges in order to run. On
Solaris you can avoid running as root by giving any users
who need to run dtrace the dtrace_proc
or dtrace_user privilege in
/etc/user_attr.
DTrace works on both 32 and 64 bit applications.
However, when tracing a 32-bit application on a 64-bit
processor it might be necessary to pass a "32 bit" option
to dtrace. Without this option the D
language might use a pointer size of 8 bytes, which could
cause pointer values (and structures containing them) to
be processed incorrectly. Use the -32
option
on Solaris; on Mac OS X use -arch i386
.
SystemTap looks up its static userspace markers in the
library name specified in the stap
script. A simple stap command to list
the probes of the default Berkeley DB installation is:
stap -l 'process("/usr/local/BerkeleyDB.18.1/lib/libdb-18.1.so").mark("*")'
Berkeley DB supports SystemTap version 1.1 or later.
Building with userspace marker support requires
sys/sdt.h
, which is often
available in the package
systemtap-sdt-devel
. Running
stap with userspace markers
requires that the kernel have "utrace" support; see http://sourceware.org/systemtap/wiki/utrace
for more information.
SystemTap needs elevated privileges in order to run.
You can avoid running as root by adding the users who need
to run stap to the group stapdev.
Berkeley DB includes several example scripts, in both
DTrace and SystemTap versions. The DTrace examples end
with a .d
suffix and are located in
util/dtrace
. The SystemTap
examples have a .stp
suffix and can
found be in util/systemtap
. The
Berkeley DB shared library name, including any necessary
path, is expected as the first parameter to the SystemTap
examples.
-
apicalls
-
This script graphs the count of the
main API calls. The result is grouped by
thread of the target process.
-
apitimes
-
This script graphs the time spent in
the main API calls, grouped by thread.
-
apitrace
-
This script displays the entry to and
return from each of the main API calls.
-
cache
-
This script displays overall and
per-file buffer cache statistics every
N (default: 1)
seconds for M (default:
60) intervals. It prints the number of cache
hits, misses, and evictions for each file with
any activity during the interval.
-
dbdefs
-
This contains DTrace-compatible
declarations of Berkeley DB data structures
returned from probe points. There is no Linux
equivalent; SystemTap can obtain type
information directly from the debugging
symbols compiled into the
libdb*-18.1.so shared library.
-
locktimes
-
This script graphs the time spent
waiting for DB page locks. The result times in
nanoseconds are grouped by filename, pgno, and
lock_mode. The optional integer maxcount
parameter directs the script to exit once that
many page lock waits have been measured.
-
locktimesid
-
This is similar to the locktimes script
above, except that it displays the 20 byte
file identifier rather than the file name.
This can be useful when there are several
environments involved, or when database files
are recreated during the monitoring period.
-
mutex
-
This script measures mutex wait
periods, summarizing the results two ways.
-
The first grouping is by mutex,
mode (exclusive or shared), and thread
id.
-
The second grouping is by the
mutex category and mode (exclusive or
shared). The mutex categories are the
MTX_XXX definitions in
dbinc/mutex.h
.
-
showerror
-
This script displays the application
stack when the basic error routines are
called. It provides additional information
about an error, beyond the string sent to the
diagnostic output.
These examples are designed to trace a single process.
They run until interrupted, or the monitored process
exits, or the limit as given in an optional 'maximum
count' argument has been reached.
Performance Events Reference
The events are described below as if they were
functions with ANSI C-style signatures. The values each
event provides to DTrace or SystemTap are the arguments to
the functions.
-
alloc
-
The alloc class covers the allocation of "on disk"
database pages.
-
alloc-new (char *file, char *db, unsigned
pgno, unsigned type, struct _db_page *pg, int
ret);
-
An attempt to allocate a database page
of type 'type' for database 'db' returned
'ret'. If the allocation succeeded then ret is
0, pgno is the location of the new page, and
pg is the address of the new page. Details of
the page can be extracted from the pg pointer.
-
alloc-free (char *file, char *db, unsigned
pgno, unsigned ret);
-
An attempt to free the page 'pgno' of
'db' returned 'ret'. When successful the page
is returned to the free list or the file is
truncated.
-
alloc-btree_split (char *file, char *db,
unsigned pgno, unsigned parent, unsigned
level);
-
A btree split of pgno in db is being
attempted. The parent page number and the
level in the btree are also provided.
-
db
-
These DB API calls provide the name of the file and
database being accessed. In-memory databases will have
a NULL (0) file name address. The db name will be null
unless subdatabases are in use.
-
db-open (char *file, char *db, unsigned
flags, uint8_t *fileid);
-
The database or file name was opened.
The 20 byte unique fileid can be used to keep
track of databases as they are created and
destroyed.
-
db-close (char *file, char *db, unsigned
flags, uint8_t *fileid);
-
The database or file name was closed.
-
db-cursor (char *file, char *db, unsigned
txnid, unsigned flags, uint8_t *fileid);
-
An attempt is being made to open a
cursor on the database or file.
-
db-get (char *file, char *db, unsigned
txnid, DBT *key, DBT *data, unsigned flags);
-
An attempt is being made to get data
from a db.
-
db-put (char *file, char *db, unsigned
txnid, DBT *key, DBT *data, unsigned flags);
-
An attempt is being made to put data to
a db.
-
db-del (char *file, char *db, unsigned
txnid, DBT *key, unsigned flags);
-
An attempt is being made to delete data
from a db.
-
lock
-
The lock class monitors the transactional
consistency locks: page, record, and database. It also
monitors the non-transactional file handle locks.
-
lock-suspend (DBT *lock, db_lockmode_t
lock_mode);
-
The thread is about to suspend itself
because another locker already has a
conflicting lock on object 'lock'. The lock
DBT's data points to a __db_ilock structure,
except for the atypical program which uses
application specific locking.
-
lock-resume (DBT *lock, db_lockmode_t
lock_mode);
-
The thread is awakening from a suspend.
-
lock-put (struct __sh_dbt *lock, unsigned
flags);
-
The lock is being freed.
-
lock-put_reduce_count (struct __sh_dbt
*lock, unsigned flags);
-
The lock would have been freed except
that its refcount was greater than 1.
These lock counters are included by
--enable-perfmon-statistics.
-
lock-deadlock (unsigned st_ndeadlocks,
unsigned locker_id, struct __sh_dbt
*lock_obj);
-
The locker_id's lock request in
lock_obj is about to be aborted in order to
resolve a deadlock. The lock region's
st_ndeadlocks has been incremented.
-
lock-nowait_notgranted (unsigned count, DBT
*lock, unsigned locker_id);
-
A DB_LOCK_NOWAIT lock request by
locker_id would have had to wait. The lock
regions's st_lock_nowait has been incremented
and the request returns DB_LOCK_NOTGRANTED.
-
lock-steal (unsigned st_locksteals,
unsigned from, unsigned to);
-
A lock is being stolen from one
partition for another one. The 'from' lock
partition's st_locksteals has been
incremented.
-
lock-object_steal (unsigned
st_objectsteals, unsigned from, unsigned to);
-
A lock object is being stolen from one
partition for another one. The 'from' lock
partition's st_objectsteals has been
incremented.
-
lock-locktimeout (unsigned
st_nlocktimeouts, const DBT *lock);
-
A lock wait expired due to the lock
request timeout.
-
lock-txntimeout (unsigned st_ntxntimeouts,
const DBT *lock);
-
A lock wait expired due to the
transaction's timeout.
-
lock-nlockers (unsigned active, unsigned
locker_id);
- The allocation or deallocation of the
locker id changed the number of active locker
identifiers.
-
lock-maxnlockers (unsigned new_max_active,
unsigned locker_id);
-
The allocation of the locker id set a
new maximum number of active locker
identifiers.
-
mpool
-
The mpool class monitors the allocation and
management of memory, including the cache.
-
mpool-read (char *file, unsigned pgno,
struct __bh *buf);
-
Read a page from file into buf.
-
mpool-write (char *file, unsigned pgno,
struct __bh *buf);
-
Write a page from buf to file.
-
mpool-env_alloc (unsigned size, unsigned
region_id, unsigned reg_type);
-
This is an attempt to allocate size
bytes from region_id. The reg_type is one of
the reg_type_t enum values.
-
mpool-evict (char *file, unsigned pgno,
struct __bh *buf);
-
The page is about to be removed from
the cache.
-
mpool-alloc_wrap (unsigned alloc_len, int
region_id, int wrap_count, int put_counter);
-
The memory allocator has incremented
wrap_count after searching through the entire
region without being able to fulfill the
request for alloc_len bytes. As wrap_count
increases the library makes more effort to
allocate space.
These mpool counters are included by
--enable-perfmon-statistics.
-
mpool-clean_eviction (unsigned st_ro_evict,
unsigned region_id);
-
The eviction of a clean page from a
cache incremented st_ro_evict.
-
mpool-dirty_eviction (unsigned st_rw_evict,
unsigned region_id);
-
The eviction of a dirty page from a
cache incremented st_rw_evict. The page has
already been written out.
-
mpool-fail (unsigned failure_count,
unsigned alloc_len, unsigned region_id);
-
An attempt to allocate memory from
region_id failed.
-
mpool-hash_search (unsigned
st_hash_searches, char *file, unsigned pgno);
-
A search for pgno of file incremented
st_hash_searches.
-
mpool-hash_examined (unsigned
st_hash_examined, char *file, unsigned pgno);
-
A search for pgno of file increased
st_hash_examined by the number of hash buckets
examined.
-
mpool-hash_longest (unsigned
st_hash_longest, char *file, unsigned pgno);
-
A search for pgno of file set a new
maximum st_hash_longest value.
-
mpool-map (unsigned st_map, char *file,
unsigned pgno);
-
A file's st_map count was incremented
after a page was mapped into memory. The
mapping might have caused disk I/O.
-
mpool-hit (unsigned st_cache_hit, char
*file, unsigned pgno);
-
The hit count was incremented because
pgno from file was found in the cache.
-
mpool-miss (unsigned st_cache_miss, char
*file, unsigned pgno);
-
The miss count was incremented because
pgno from file was not already present in the
cache.
-
mpool-page_create (unsigned st_page_create,
char *file, unsigned pgno);
-
The st_page_create field was
incremented because the pgno of file was
created in the cache.
-
mpool-page_in (unsigned st_page_in, char
*file, unsigned pgno);
-
The st_page_in field was incremented
because the pgno from file was read into the
cache.
-
mpool-page_out (unsigned st_page_out, char
*file, unsigned pgno);
-
The st_page_out field was incremented
because the pgno from file was written out.
-
mutex
-
The mutex category monitors includes shared
latches. The alloc_id value is one of the MTX_XXX
definitions from dbinc/mutex.h
-
mutex-suspend (unsigned mutex, unsigned
excl, unsigned alloc_id, struct __db_mutex_t
*mutexp);
-
This thread is about to suspend itself
because a thread has the mutex or shared latch
locked in a mode which conflicts with the this
request.
-
mutex-resume (unsigned mutex, unsigned excl,
unsigned alloc_id, struct __db_mutex_t
*mutexp);
-
The thread is returning from a suspend
and will attempt to obtain the mutex or shared
latch again. It might need to suspend again.
These mutex counters are included by
--enable-perfmon-statistics.
-
mutex-set_nowait (unsigned
mutex_set_nowait, unsigned mutex);
-
Increment the count of times that the
mutex was free when trying to lock it.
-
mutex-set_wait (unsigned mutex_set_wait,
unsigned mutex);
-
Increment the count of times that the
mutex was busy when trying to lock it.
-
mutex-set_rd_nowait (unsigned
mutex_set_rd_nowait, unsigned mutex);
-
Increment the count of times that the
shared latch was free when trying to get a
shared lock on it.
-
mutex-set_rd_wait (unsigned
mutex_set_rd_wait, unsigned mutex);
-
Increment the count of times that the
shared latch was already exclusively latched
when trying to get a shared lock on it.
-
mutex-hybrid_wait (unsigned hybrid_wait,
unsigned mutex);
-
Increment the count of times that a
hybrid mutex had to block on its condition
variable. n a busy system this might happen
several times before the corresponding
hybrid_wakeup.
-
mutex-hybrid_wakeup (unsigned
hybrid_wakeup, unsigned mutex);
-
Increment the count of times that a
hybrid mutex finished one or more waits for
its condition variable.
-
txn
-
The txn category covers the basic transaction
operations.
-
txn-begin (unsigned txnid, unsigned flags);
-
A transaction was successfully begun.
-
txn-commit (unsigned txnid, unsigned
flags);
-
A transaction is starting to commit.
-
txn-prepare (unsigned txnid, uint8_t *gid);
-
The transaction is starting to prepare,
flushing the log so that a future commit can
be guaranteed to succeed. The global
identifier field is 128 bytes long.
-
txn-abort (unsigned txnid);
-
The transaction is about to abort.
These txn counters are included by
--enable-perfmon-statistics.
-
txn-nbegins (unsigned st_nbegins, unsigned
txnid);
-
Beginning the transaction incremented
st_nbegins.
-
txn-naborts (unsigned st_nbegins, unsigned
txnid);
-
Aborting the transaction incremented
st_naborts.
-
txn-ncommits (unsigned st_ncommits,
unsigned txnid);
-
Committing the transaction incremented
st_ncommits.
-
txn-nactive (unsigned st_nactive, unsigned
txnid);
-
Beginning or ending the transaction
updated the number of active transactions.
-
txn-maxnactive (unsigned st_maxnactive,
unsigned txnid);
-
The creation of the transaction set a
new maximum number of active transactions.