Timer and Profiler Units
Timers
MPINative
Flash-X includes an interface to a set of stopwatch-like timing routines for monitoring performance. The interface is defined in the monitors/Timers unit, and an implementation that uses the timing functionality provided by MPI is available in monitors/Timers/TimersMain/MPINative. Future implementations might use the PAPI framework to track hardware counter details.
The performance routines start or stop a timer at the beginning or end of a section of code to be monitored, and accumulate performance information in dynamically assigned accounting segments. The code also has an interface to write the timing summary to the Flash-X logfile. These routines are not recommended for timing very short segments of code due to the overhead in accounting.
There are two ways of using the Timers routines in your code. The first is to pass timer names as strings to the start and stop routines. In this mode, a timer with the given name is created if it does not exist; otherwise the existing timer is used. The second mode references timers by an integer key rather than by name. This offers potentially faster access if a timer is started and stopped many times (although such use is still not recommended because of the accounting overhead). The integer key is obtained by calling monitors/Timers/Timers_create with a string name; the routine creates the timer only if it does not already exist and returns its integer key. This key can then be passed to the start and stop routines.
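The difference between the two modes can be illustrated with a small Python model. This is only a sketch of the idea, not the Flash-X Fortran implementation: the class `TimerRegistry` and its methods are hypothetical names chosen for illustration, and the real routines may differ in signature and behavior.

```python
import time


class TimerRegistry:
    """Toy model of name- and key-addressed stopwatch timers (not Flash-X code)."""

    def __init__(self):
        self._keys = {}      # timer name -> integer key
        self._elapsed = []   # accumulated seconds, indexed by key
        self._started = []   # start timestamp while running, else None

    def create(self, name):
        """Return the key for `name`, creating the timer only if it is new."""
        if name not in self._keys:
            self._keys[name] = len(self._elapsed)
            self._elapsed.append(0.0)
            self._started.append(None)
        return self._keys[name]

    def _resolve(self, timer):
        # A string costs a name lookup on every call; an integer key does not.
        return self.create(timer) if isinstance(timer, str) else timer

    def start(self, timer):
        self._started[self._resolve(timer)] = time.perf_counter()

    def stop(self, timer):
        key = self._resolve(timer)
        self._elapsed[key] += time.perf_counter() - self._started[key]
        self._started[key] = None

    def elapsed(self, timer):
        return self._elapsed[self._resolve(timer)]


timers = TimerRegistry()

# Mode 1: address the timer by name; it is created on first use.
timers.start("hydro")
timers.stop("hydro")

# Mode 2: look the key up once, then reuse it inside a hot loop.
key = timers.create("hydro")   # same timer, so no new entry is created
for _ in range(1000):
    timers.start(key)
    timers.stop(key)
```

In this model the key-based calls skip the dictionary lookup, which is the kind of saving the integer-key mode is meant to provide.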
The typical usage pattern for the timers is implemented in the default Driver implementation. This pattern is: call monitors/Timers/Timers_init once at the beginning of a run, call monitors/Timers/Timers_start and monitors/Timers/Timers_stop around sections of code, and call monitors/Timers/Timers_getSummary at the end of the run to write the timing summary to the end of the logfile. However, it is also possible to call monitors/Timers/Timers_reset in the middle of a run to reset all timing information. Combined with writing the summary once per timestep, this reports code times on a per-timestep basis, which might be relevant, for instance, for solvers whose operation count is not fixed. Since monitors/Timers/Timers_reset does not reset the integer key mappings, it is safe to obtain a key through monitors/Timers/Timers_create once, keep it in a saved variable, and continue to use it after calling monitors/Timers/Timers_reset.
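The per-timestep pattern, and the fact that a saved key stays valid across a reset, can be sketched with another small Python model. The class `Timers` below is a hypothetical stand-in written for this illustration; it mimics the described behavior and is not the Flash-X API.

```python
import time


class Timers:
    """Toy stand-in for a timer unit; names and methods are illustrative only."""

    def __init__(self):
        self.keys = {}     # name -> integer key; survives reset()
        self.total = []    # accumulated seconds per key
        self.calls = []    # number of start/stop pairs per key
        self._t0 = []      # timestamp of the most recent start per key

    def create(self, name):
        if name not in self.keys:
            self.keys[name] = len(self.total)
            self.total.append(0.0)
            self.calls.append(0)
            self._t0.append(0.0)
        return self.keys[name]

    def start(self, key):
        self._t0[key] = time.perf_counter()

    def stop(self, key):
        self.total[key] += time.perf_counter() - self._t0[key]
        self.calls[key] += 1

    def reset(self):
        # Clear accumulated data but keep the name -> key mapping intact.
        self.total = [0.0] * len(self.total)
        self.calls = [0] * len(self.calls)

    def summary(self):
        return {name: (self.calls[k], self.total[k]) for name, k in self.keys.items()}


timers = Timers()
solver_key = timers.create("solver")   # obtained once, reused every step

per_step = []
for step in range(3):
    iterations = step + 2               # pretend the iteration count varies
    for _ in range(iterations):
        timers.start(solver_key)
        timers.stop(solver_key)
    per_step.append(timers.summary()["solver"][0])
    timers.reset()                      # the saved key is still valid afterwards

print(per_step)  # [2, 3, 4]
```

Each entry of `per_step` counts only the calls made since the previous reset, which is what makes a per-timestep report meaningful for a solver whose work varies from step to step.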
Two runtime parameters control the Timers unit and are described below.
Parameter | Type | Default value | Description |
---|---|---|---|
eachProcWritesSummary | logical | .true. | Should each process write its summary to its own file? If true, each process will write its summary to a file named timer_summary_<process id> |
writeStatSummary | logical | .true. | Should timers write the max/min/avg values for timers to the logfile? |
monitors/Timers/TimersMain/MPINative writes two summaries to the logfile: the first gives the timer execution of the master processor, and the second gives the statistics of max, min, and avg times for timers on all processors. The second summary (max, min, and avg times) will not be written if some process executed timers differently than another. For example, this anomaly happens if not all processors contain at least one block. In this case, the Hydro timers only execute on the processors that possess blocks. See for an example of this type of output. The max, min, and avg summary can be disabled by setting the runtime parameter Timers/writeStatSummary to false. In addition, each process can write its summary to its own file named timer_summary_<process id>.
To prohibit each process from writing its summary to its own file, set the runtime parameter Timers/eachProcWritesSummary to false.
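The rule for the statistics summary can be illustrated with a short Python sketch. The function `stat_summary` and the sample timings are hypothetical. The text does not describe how Flash-X detects that processes executed timers differently, so comparing the sets of timer names is an assumption made for this example.

```python
def stat_summary(per_process):
    """Return {timer: (max, min, avg)} or None if processes ran different timers.

    `per_process` is a list with one {timer name: seconds} dict per process.
    """
    names = set(per_process[0])
    if any(set(times) != names for times in per_process[1:]):
        return None  # mirrors the documented case where no statistics are written
    summary = {}
    for name in sorted(names):
        values = [times[name] for times in per_process]
        summary[name] = (max(values), min(values), sum(values) / len(values))
    return summary


# All three processes ran the same timers, so statistics are available.
uniform = [
    {"evolution": 10.0, "Hydro": 6.0},
    {"evolution": 10.5, "Hydro": 4.0},
    {"evolution": 11.0, "Hydro": 5.0},
]
print(stat_summary(uniform))

# The third process owns no blocks, so it never started the Hydro timer.
uneven = [
    {"evolution": 10.0, "Hydro": 6.0},
    {"evolution": 10.5, "Hydro": 4.0},
    {"evolution": 11.0},
]
print(stat_summary(uneven))  # None
```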
Tau
In Flash-X 3.1 we add an alternative Timers implementation which is designed to be used with the Tau framework (http://acts.nersc.gov/tau/). Here, we use Tau API calls to time the Flash-X labeled code sections (marked by Timers_start and Timers_stop). After running the simulation, the Tau profile contains timing information for both Flash-X labeled code sections and all individual subroutines / functions. This is useful because fine-grained subroutine / function level data can be overwhelming in a huge code like Flash-X. Also, the callpaths are preserved, meaning we can see how much time is spent in individual subroutines / functions when they are called from within a particular Flash-X labeled code section.
Another reason to use the Tau version is that the MPINative version (see the MPINative section above) is implemented using recursion, and so incurs significant overhead for fine-grained measurements.
To use this implementation we must compile the Flash-X source code with the Tau compiler wrapper scripts. These are set as the default compilers automatically whenever we specify the -tau option (see ) to the setup script. In addition to the -tau option we must specify --with-unit=monitors/Timers/TimersMain/Tau, as this Timers implementation is not the default.
Profiler
In addition to an interface for simple timers, Flash-X includes a
generic interface for third-party profiling or tracing libraries. This
interface is defined in the monitors/Profiler
unit.
In Flash-X we created an interface to the IBM profiling libraries libmpihpm.a and libmpihpm_smp.a and also to HPCToolkit http://hpctoolkit.org/ (Rice University). We make use of this interface to profile Flash-X evolution only, i.e. not initialization. To use this style of profiling, add -unit=monitors/Profiler/ProfilerMain/mpihpm or -unit=monitors/Profiler/ProfilerMain/hpctoolkit to your setup line and also set the Flash-X runtime parameter profileEvolutionOnly = .true.
For the IBM profiling library (mpihpm) you need to add LIB_MPIHPM and LIB_MPIHPM_SMP macros to your Makefile.h to link Flash-X to the profiling libraries. The actual macro used in the link line depends on whether you set up Flash-X with multithreading support (LIB_MPIHPM for MPI-only Flash-X and LIB_MPIHPM_SMP for multithreaded Flash-X). Example values from sites/miralac1/Makefile.h follow:
LIB_MPI =
HPM_COUNTERS = /bgsys/drivers/ppcfloor/bgpm/lib/libbgpm.a
LIB_MPIHPM = -L/soft/perftools/hpctw -lmpihpm
For HPCToolkit you need to set the environment variable HPCRUN_DELAY_SAMPLING=1 at job launch to enable selective profiling (see the HPCToolkit user guide).