Timer and Profiler Units
Timers
MPINative
Flash-X includes an interface to a set of stopwatch-like timing routines
for monitoring performance. The interface is defined in the
monitors/Timers unit, and an implementation that uses the timing
functionality provided by MPI is provided in
monitors/Timers/TimersMain/MPINative. Future implementations might
use the PAPI framework to track hardware counter details.
The performance routines start or stop a timer at the beginning or end of a section of code to be monitored, and accumulate performance information in dynamically assigned accounting segments. The code also has an interface to write the timing summary to the Flash-X logfile. These routines are not recommended for timing very short segments of code due to the overhead in accounting.
There are two ways of using the Timers routines in your code. One
mode is to simply pass timer names as strings to the start and stop
routines. In this first way, a timer with the given name will be created
if it doesn’t exist, or otherwise reference the one already in
existence. The second mode of using the timers references them not by
name but by an integer key. This technique offers potentially faster
access if a timer is to be started and stopped many times (although
still not recommended because of the overhead). The integer key is
obtained by calling with a string name monitors/Timers/Timers_create
which will only create the timer if it doesn’t exist and will return the
integer key. This key can then be passed to the start and stop routines.
The typical usage pattern for the timers is implemented in the default
Driver implementation. This pattern is: call
monitors/Timers/Timers_init once at the beginning of a run, call
monitors/Timers/Timers_start and monitors/Timers/Timers_stop
around sections of code, and call monitors/Timers/Timers_getSummary
at the end of the run to report the timing summary at the end of the
logfile. However, it is possible to call
monitors/Timers/Timers_reset in the middle of a run to reset all
timing information. This could be done along with writing the summary
once per-timestep to report code times on a per-timestep basis, which
might be relevant, for instance, for certain non-fixed operation count
solvers. Since monitors/Timers/Timers_reset does not reset the
integer key mappings, it is safe to obtain a key through
monitors/Timers/Timers_create once in a saved variable, and continue
to use it after calling monitors/Timers/Timers_reset.
Two runtime parameters control the Timer unit and are described
below.
Parameter |
Type |
Default value |
Description |
|---|---|---|---|
Type |
Default value |
Description |
|
|
|
|
Should each process write its summary to its own file? If true, each process will write its summary to a file named tim er_summary_:m ath:<process id\(>\) |
|
|
|
Should timers write the max/min/avg values for timers to the logfile? |
monitors/Timers/TimersMain/MPINative writes two summaries to the
logfile: the first gives the timer execution of the master processor,
and the second gives the statistics of max, min, and avg times for
timers on all processors. The secondary max, min, and avg times will not
be written if some process executed timers differently than another. For
example, this anomaly happens if not all processors contain at least one
block. In this case, the Hydro timers only execute on the processors
that possess blocks. See for an example of this type of output. The max,
min, and avg summary can be disabled by setting the runtime parameter
Timers/writeStatSummary to false. In addition, each process can
write its summary to its own file named timer_summary_<process id>.
To prohibit each process from writing its summary to its own file, set
the runtime parameter Timers/eachProcWritesSummary to false.
Tau
In |flashx|3.1 we add an alternative Timers implementation which
is designed to be used with the Tau framework
(http://acts.nersc.gov/tau/). Here, we use Tau API calls to time the
|flashx| labeled code sections (marked by Timers_start and
Timers_stop). After running the simulation, the Tau profile
contains timing information for both |flashx| labeled code sections
and all individual subroutines / functions. This is useful because fine
grained subroutine / function level data can be overwhelming in a huge
code like |flashx|. Also, the callpaths are preserved, meaning we can
see how long is spent in individual subroutines / functions when they
are called from within a particular |flashx| labeled code section.
Another reason to use the Tau version is that the MPINative
version (See ) is implemented using recursion, and so incurs significant
overhead for fine grain measurements.
To use this implementation we must compile the |flashx| source code
with the Tau compiler wrapper scripts. These are set as the default
compilers automatically whenever we specify the -tau option (see )
to the setup script. In addition to the -tau option we must specify
–with-unit=monitors/Timers/TimersMain/Tau as this Timers
implementation is not the default.
Profiler
In addition to an interface for simple timers, Flash-X includes a
generic interface for third-party profiling or tracing libraries. This
interface is defined in the monitors/Profiler unit.
In Flash-X we created an interface to the IBM profiling libraries
libmpihpm.a and libmpihpm_smp.a and also to HPCToolkit
http://hpctoolkit.org/ (Rice University). We make use of this interface
to profile Flash-X evolution only, i.e. not initialization. To use this
style of profiling add -unit=monitors/Profiler/ProfilerMain/mpihpm
or -unit=monitors/Profiler/ProfilerMain/hpctoolkit to your setup
line and also set the Flash-X runtime parameter profileEvolutionOnly
= .true.
For the IBM profiling library (mpihpm) you need to add LIB_MPIHPM and LIB_MPIHPM_SMP macros to your Makefile.h to link Flash-X to the profiling libraries. The actual macro used in the link line depends on whether you setup Flash-X with multithreading support (LIB_MPIHPM for MPI-only Flash-X and LIB_MPIHPM_SMP for multithreaded Flash-X). Example values from sites/miralac1/Makefile.h follow
LIB_MPI = HPM_COUNTERS = /bgsys/drivers/ppcfloor/bgpm/lib/libbgpm.a LIB_MPIHPM = -L/soft/perftools/hpctw -lmpihpm \((HPM_COUNTERS)\)(LIB_MPI) LIB_MPIHPM_SMP = -L/soft/perftools/hpctw -lmpihpm_smp \((HPM_COUNTERS)\)(LIB_MPI)
For HPCToolkit you need to set the environmental variable HPCRUN_DELAY_SAMPLING=1 at job launch to enable selective profiling (see the HPCToolkit user guide).