5. Configuration details

5.1. Hardware performance counters

Most processor models include performance monitor units that can be configured to monitor (count) various types of hardware events. This section is where you can find architecture-specific information to help you use these events for profiling. You do not really need to read this section unless you are interested in using events other than the default event chosen by OProfile.

Note

Your CPU type may not include the requisite support for hardware performance counters, in which case you must use OProfile in timer mode (see Section 5.2, “OProfile in timer interrupt mode”).

The Intel hardware performance counters are detailed in the Intel IA-32 Architecture Manual, Volume 3, available from http://developer.intel.com/. The AMD Athlon/Opteron/Phenom/Turion implementation is detailed in http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22007.pdf. For IBM PowerPC processors, documentation is available at https://www.power.org/. For example, https://www.power.org/events/Power7 contains specific information on the performance monitor unit for the IBM POWER7.

These processors are capable of delivering an interrupt when a counter overflows. This is the basic mechanism on which OProfile is based. The delivery mode is NMI, so blocking interrupts in the kernel does not prevent profiling. When the interrupt handler is called, the current PC value and the current task are recorded into the profiling structure. This allows the overflow event to be attached to a specific assembly instruction in a binary image. OProfile receives this data from the kernel and writes it to the sample files.

If we use an event such as CPU_CLK_UNHALTED or INST_RETIRED (GLOBAL_POWER_EVENTS or INSTR_RETIRED, respectively, on the Pentium 4), we can use the overflow counts as an estimate of actual time spent in each part of code. Alternatively, we can profile interesting data such as the cache behaviour of routines with the other available counters.
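As a rough illustration of how overflow counts translate into time, the sketch below assumes a cycle-counting event such as CPU_CLK_UNHALTED, a constant clock frequency, and one sample per counter overflow. The function name, the sample numbers, and the clock frequency are hypothetical and are not part of OProfile.

```python
def estimated_seconds(samples: int, reset_count: int, cpu_hz: float) -> float:
    """Estimate the time attributed to a piece of code from its sample count.

    Each sample stands for roughly `reset_count` events. For a
    cycle-counting event, that is `reset_count` clock cycles.
    """
    if reset_count <= 0 or cpu_hz <= 0:
        raise ValueError("reset_count and cpu_hz must be positive")
    return samples * reset_count / cpu_hz


# Hypothetical numbers: 1200 samples, reset count 100000, 2 GHz clock.
print(estimated_seconds(1200, 100_000, 2.0e9))
```

The result is only an estimate: it inherits the statistical error of sampling and ignores any change in clock frequency during the run.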

However, there are several caveats. First, there are the issues listed in the Intel manual. There is a delay between the counter overflow and the interrupt delivery that can skew results on a small scale, which means you cannot rely on the profiles at the instruction level as being perfectly accurate. If you are using an "event-mode" counter such as the cache counters, a count registered against an instruction doesn't mean that the instruction is responsible for that event. It does imply, however, that the counter overflowed in the dynamic vicinity of that instruction, to within a few instructions. Further details on this problem can be found in Chapter 5, Interpreting profiling results and also in the Digital paper "ProfileMe: A Hardware Performance Counter".

Each counter has several configuration parameters. First, there is the unit mask, which further specifies what to count. Second, there is the counter value, discussed below. Third, there is a parameter that controls whether counts are incremented whilst in kernel space, in user space, or both. You can configure these separately for each counter.

After each overflow event, the counter is re-initialized so that another overflow occurs after the number of events given by the counter value has been counted. Thus, higher values mean less-detailed profiling, and lower values mean more detail, but higher overhead. Picking a good value for this parameter is, unfortunately, somewhat of a black art. It is of course dependent on the event you have chosen. Specifying too large a value means that not enough interrupts are generated to give a realistic profile (though this problem can be ameliorated by profiling for longer). Specifying too small a value can lead to higher performance overhead.
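The trade-off can be made concrete with a small calculation. The sketch below assumes the chosen event occurs at a steady rate, which real workloads rarely do; the function names and all numbers are hypothetical.

```python
def samples_per_second(event_rate_hz: float, reset_count: int) -> float:
    """Interrupts per second if the event fires `event_rate_hz` times a second."""
    if reset_count <= 0:
        raise ValueError("reset_count must be positive")
    return event_rate_hz / reset_count


def expected_samples(event_rate_hz: float, reset_count: int, run_seconds: float) -> int:
    """Whole number of overflows expected over a run of `run_seconds`."""
    return int(samples_per_second(event_rate_hz, reset_count) * run_seconds)


# Hypothetical event rate of 2e9 per second, 10 second run.
for count in (10_000, 100_000, 1_000_000):
    rate = samples_per_second(2.0e9, count)
    total = expected_samples(2.0e9, count, 10)
    print(f"count={count}: {rate:.0f} interrupts/s, {total} samples in 10 s")
```

A larger count lowers the interrupt rate (and the overhead) but yields fewer samples, which is why profiling for longer compensates for a large value.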

5.2. OProfile in timer interrupt mode

Some CPU types do not provide the needed hardware support to use the hardware performance counters. This includes some laptops, classic Pentiums, and other CPU types not yet supported by OProfile (such as Cyrix). On these machines, OProfile falls back to using the timer interrupt to collect samples. In timer mode, OProfile is not able to profile code that has interrupts disabled.

You can force use of the timer interrupt by using the timer=1 module parameter (or oprofile.timer=1 on the boot command line if OProfile is built-in). If OProfile was built as a kernel module, then you must pass the 'timer=1' parameter with the modprobe command. Do this before executing 'opcontrol --init' or edit the opcontrol command's invocation of modprobe to pass the 'timer=1' parameter.

Note

Timer mode is only available using the legacy opcontrol command.

5.3. Pentium 4 support

The Pentium 4 / Xeon performance counters are organized around 3 types of model specific registers (MSRs): 45 event selection control registers (ESCRs), 18 counter configuration control registers (CCCRs) and 18 counters. ESCRs describe a particular set of events which are to be recorded, and CCCRs bind ESCRs to counters and configure their operation. Unfortunately the relationship between these registers is quite complex; they cannot all be used with one another at any time. There is, however, a subset of 8 counters, 8 ESCRs, and 8 CCCRs which can be used independently of one another, so OProfile only accesses those registers, treating them as a bank of 8 "normal" counters, similar to those in the P6 or Athlon/Opteron/Phenom/Turion families of CPU.

There is currently no support for Precision Event-Based Sampling (PEBS), nor any advanced uses of the Debug Store (DS). Current support is limited to the conservative extension of OProfile's existing interrupt-based model described above.

5.4. Intel Itanium 2 support

The Itanium 2 performance monitoring unit (PMU) organizes the counters as four pairs of performance event monitoring registers. Each pair is composed of a Performance Monitoring Configuration (PMC) register and Performance Monitoring Data (PMD) register. The PMC selects the performance event being monitored and the PMD determines the sampling interval. The IA64 Performance Monitoring Unit (PMU) triggers sampling with maskable interrupts. Thus, samples will not occur in sections of the IA64 kernel where interrupts are disabled.

None of the advanced features of the Itanium 2 performance monitoring unit, such as opcode matching, address range matching, or precise event sampling, are supported by this version of OProfile. The Itanium 2 support only maps OProfile's existing interrupt-based model to the PMU hardware.

5.5. PowerPC64 support

The performance monitoring unit (PMU) for the IBM PowerPC 64-bit processors consists of between 4 and 8 counters (depending on the model), plus three special purpose registers used for programming the counters -- MMCR0, MMCR1, and MMCRA. Advanced features such as instruction matching and thresholding are not supported by this version of OProfile.

Note

Later versions of the IBM POWER5+ processor (beginning with revision 3.0) run the performance monitor unit in POWER6 mode, effectively removing OProfile's access to counters 5 and 6. These two counters are dedicated to counting instructions completed and cycles, respectively. In POWER6 mode, however, the counters do not generate an interrupt on overflow and so are unusable by OProfile. Kernel versions 2.6.23 and higher will recognize this mode and export "ppc64/power5++" as the cpu_type to the oprofilefs pseudo filesystem. OProfile userspace responds to this cpu_type by removing these counters from the list of potential events to count. Without this kernel support, attempts to profile using an event from one of these counters will yield incorrect results -- typically, zero (or near zero) samples in the generated report.

5.6. Cell Broadband Engine support

The Cell Broadband Engine (CBE) processor core consists of a PowerPC Processing Element (PPE) and 8 Synergistic Processing Elements (SPE). PPEs and SPEs each consist of a processing unit (PPU and SPU, respectively) and other hardware components, such as memory controllers.

A PPU has two hardware threads (aka "virtual CPUs"). The performance monitor unit of the CBE collects event information on one hardware thread at a time. Therefore, when profiling PPE events, OProfile collects the profile based on the selected events by time slicing the performance counter hardware between the two threads. The user must ensure the collection interval is long enough so that the time spent collecting data for each PPU is sufficient to obtain a good profile.

To profile an SPU application, the user should specify the SPU_CYCLES event. When starting OProfile with SPU_CYCLES, the opcontrol script enforces certain separation parameters (separate=cpu,lib) to ensure that sufficient information is collected in the sample data in order to generate a complete report. The --merge=cpu option can be used to obtain a more readable report if analyzing the performance of each separate SPU is not necessary.

Profiling with an SPU event (events 4100 through 4163) is not compatible with any other event. Furthermore, only one SPU event can be specified at a time. The hardware only supports profiling on one SPU per node at a time. The OProfile kernel code time slices between the eight SPUs to collect data on all SPUs.
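Because the hardware is time sliced, each unit is observed for only a fraction of the run. The sketch below assumes equal slices and no switching overhead, neither of which is stated by this manual; the function names and durations are hypothetical.

```python
def observed_seconds_per_unit(run_seconds: float, units: int) -> float:
    """Time each unit is actually sampled when the run is split into equal slices."""
    if units <= 0:
        raise ValueError("units must be positive")
    return run_seconds / units


def run_seconds_needed(target_seconds_per_unit: float, units: int) -> float:
    """Run length needed so that each unit is sampled for the target time."""
    if units <= 0:
        raise ValueError("units must be positive")
    return target_seconds_per_unit * units


# Two PPU hardware threads, or eight SPUs, sharing the profiling hardware.
print(observed_seconds_per_unit(80.0, 8))  # per-SPU share of an 80 s run
print(run_seconds_needed(10.0, 2))         # run length for 10 s per PPU thread
```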

SPU profile reports have some unique characteristics compared to reports for standard architectures:

  • Typically no "app name" column. This is really standard OProfile behavior when the report contains samples for just a single application, which is commonly the case when profiling SPUs.
  • "CPU" equates to "SPU"
  • Specifying '--long-filenames' on the opreport command does not always result in long filenames. This happens when the SPU application code is embedded in the PPE executable or shared library. The embedded SPU ELF data contains only the short filename (i.e., no path information) for the SPU binary file that was used as the source for embedding. Only the short filename is used because the original SPU binary file may not exist or be accessible at runtime. The performance analyst must have sufficient knowledge of the application to be able to correlate the SPU binary image names found in the report to the application's source files.

    Note

    Compile the application with -g and generate the OProfile report with -g to facilitate finding the right source file(s) on which to focus.

5.7. AMD64 (x86_64) Instruction-Based Sampling (IBS) support

Instruction-Based Sampling (IBS) is a new performance measurement technique available on AMD Family 10h processors. Traditional performance counter sampling is not precise enough to isolate performance issues to individual instructions. IBS, however, precisely identifies instructions which are not making the best use of the processor pipeline and memory hierarchy. For more information, please refer to "Instruction-Based Sampling: A New Performance Analysis Technique for AMD Family 10h Processors" (http://developer.amd.com/assets/AMD_IBS_paper_EN.pdf). There are two IBS profile types, described in the following sections.

Note

Profiling on IBS events is only supported with legacy mode profiling (i.e., with opcontrol).

5.7.1. IBS Fetch

IBS fetch sampling is a statistical sampling method which counts completed fetch operations. When the number of completed fetch operations reaches the maximum fetch count (the sampling period), IBS tags the fetch operation and monitors that operation until it either completes or aborts. When a tagged fetch completes or aborts, a sampling interrupt is generated and an IBS fetch sample is taken. An IBS fetch sample contains a timestamp, the identifier of the interrupted process, the virtual fetch address, and several event flags and values that describe what happened during the fetch operation.

5.7.2. IBS Op

IBS op sampling selects, tags, and monitors macro-ops as issued from AMD64 instructions. Two options are available for selecting ops for sampling:

  • Cycles-based selection counts CPU clock cycles. The op is tagged and monitored when the count reaches a threshold (the sampling period) and a valid op is available.
  • Dispatched op-based selection counts dispatched macro-ops. When the count reaches a threshold, the next valid op is tagged and monitored.

In both cases, an IBS sample is generated only if the tagged op retires. Thus, IBS op event information does not measure speculative execution activity. The execution stages of the pipeline monitor the tagged macro-op. When the tagged macro-op retires, a sampling interrupt is generated and an IBS op sample is taken. An IBS op sample contains a timestamp, the identifier of the interrupted process, the virtual address of the AMD64 instruction from which the op was issued, and several event flags and values that describe what happened when the macro-op executed.
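The selection rule for dispatched op-based sampling can be illustrated with a toy model. This is a sketch of the rule as described above, not of the hardware: the `Op` type, the `sample_ops` function, and the choice to restart counting after the tagged op are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Op:
    address: int   # virtual address of the originating instruction
    retires: bool  # False models an op that never retires (speculative work)


def sample_ops(ops: Iterable[Op], period: int) -> List[int]:
    """Count dispatched ops; when the count reaches `period`, tag the next op.

    A sample (here just the address) is recorded only if the tagged op retires.
    """
    if period <= 0:
        raise ValueError("period must be positive")
    samples: List[int] = []
    count = 0
    tag_next = False
    for op in ops:
        if tag_next:
            # This op is the tagged one; counting restarts after it (assumption).
            tag_next = False
            count = 0
            if op.retires:
                samples.append(op.address)
            continue
        count += 1
        if count >= period:
            tag_next = True
    return samples


# Ten ops at addresses 0..9; the op at address 7 does not retire.
stream = [Op(address=a, retires=(a != 7)) for a in range(10)]
print(sample_ops(stream, period=3))
```

In this model the op at address 7 is tagged but produces no sample, which mirrors the statement that IBS op data does not measure speculative execution activity.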

IBS profiling is enabled simply by specifying IBS performance events through the "--event=" option. These events are listed in the output of opcontrol --list-events.

opcontrol --event=IBS_FETCH_XXX:<count>:<um>:<kernel>:<user>
opcontrol --event=IBS_OP_XXX:<count>:<um>:<kernel>:<user>

Note: All IBS fetch events must have the same event count and unit mask,
      as must all IBS op events.
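The event specification format and the constraint in the note can be checked before invoking opcontrol. The following is a minimal sketch: `EventSpec`, `check_ibs_group`, and the event names IBS_FETCH_A and IBS_FETCH_B are hypothetical placeholders, not real OProfile names; consult opcontrol --list-events for the events your processor supports.

```python
from typing import List, NamedTuple


class EventSpec(NamedTuple):
    name: str
    count: int
    unit_mask: int
    kernel: int  # 1 to count in kernel space, 0 otherwise
    user: int    # 1 to count in user space, 0 otherwise

    def option(self) -> str:
        """Format as --event=<name>:<count>:<um>:<kernel>:<user>."""
        return (f"--event={self.name}:{self.count}:{self.unit_mask}"
                f":{self.kernel}:{self.user}")


def check_ibs_group(events: List[EventSpec], prefix: str) -> None:
    """Raise ValueError if events sharing `prefix` differ in count or unit mask."""
    settings = {(e.count, e.unit_mask) for e in events if e.name.startswith(prefix)}
    if len(settings) > 1:
        raise ValueError(f"{prefix}* events must share one count and unit mask")


# Placeholder event names, used only to show the check.
events = [
    EventSpec("IBS_FETCH_A", 250_000, 0, 1, 1),
    EventSpec("IBS_FETCH_B", 250_000, 0, 1, 1),
]
check_ibs_group(events, "IBS_FETCH_")
print(" ".join(e.option() for e in events))
```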

5.8. IBM System z hardware sampling support

IBM System z provides a facility which does instruction sampling as part of the CPU. This has great advantages over the timer-based sampling approach, such as better sampling resolution with less overhead and the possibility of getting samples within code sections where interrupts are disabled (especially useful for Linux kernel code).

Note

Profiling with the instruction sampling facility is currently only supported with legacy mode profiling (i.e., with opcontrol).

A public description of the System z CPU-Measurement Facilities can be found in the document "The Load-Program-Parameter and CPU-Measurement Facilities".

System z hardware sampling can be used for Linux instances in LPAR mode. The hardware sampling support used by OProfile was introduced for System z10 in October 2008.

To enable hardware sampling for an LPAR you must activate the LPAR with authorization for basic sampling control. See the "Support Element Operations Guide" for your mainframe system for more information.

The hardware sampling facility can be enabled and disabled using the event interface. A `virtual' counter 0 has been defined that only supports a single event, HWSAMPLING. By default the HWSAMPLING event is enabled on machines providing the facility. For both events only the `count', `kernel' and `user' options are evaluated by the kernel module.

The `count' value is the sampling rate as it is passed to the CPU measurement facility. A sample will be taken by the hardware every `count' cycles. Low values quickly fill up the sampling buffers and generate CPU load, because the OProfile daemon and the kernel module are kept busy flushing the hardware buffers. This might considerably impact the workload to be profiled.

The unit mask `um' is required to be zero.

The opcontrol tool provides a new option specific to System z hardware sampling:

  • --s390hwsampbufsize="num": Number of 2MB areas used per CPU for storing sample data. The best size for the sample memory depends on the particular system and the workload to be measured. Providing the sampler with too little memory results in lost samples. Reserving too much system memory for the sampler impacts the overall performance and, hence, also the workload to be measured.
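To judge how much memory a given setting reserves, multiply the number of areas by the area size and the number of CPUs. The sketch below assumes that "2MB" means 2 × 1024 × 1024 bytes; the function name and the example values are hypothetical.

```python
AREA_BYTES = 2 * 1024 * 1024  # one sample area, assuming 2MB = 2 * 2**20 bytes


def sample_buffer_bytes(areas_per_cpu: int, cpus: int) -> int:
    """Total memory reserved for sample data across all CPUs."""
    if areas_per_cpu < 0 or cpus < 0:
        raise ValueError("areas_per_cpu and cpus must be non-negative")
    return areas_per_cpu * AREA_BYTES * cpus


# Hypothetical setting: 4 areas per CPU on a 16-CPU LPAR.
total = sample_buffer_bytes(4, 16)
print(f"{total} bytes ({total // (1024 * 1024)} MB)")
```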

A special counter /dev/oprofile/timer is provided by the kernel module, which allows switching back to timer mode sampling dynamically. The TIMER event can only be used with this counter. The TIMER event can be specified using the --event= option, as with any other event.

opcontrol --event=TIMER:1

On z10 or later machines the default event is set to TIMER in case the hardware sampling facility is not available.

Although required, the 'count' parameter of the TIMER event is ignored. The value may eventually be used for timer based sampling with a configurable sampling frequency, but this is currently not supported.

5.9. Dangerous counter settings

OProfile is a low-level profiler which allows continuous profiling at a low overhead cost. When using OProfile legacy mode profiling, it may be possible to configure such a low counter reset value (i.e., high sampling rate) that the system becomes overloaded with counter interrupts and its responsiveness is severely impacted. Whilst some validation is done on the count values you pass to opcontrol with your event specification, it is not foolproof.

Note

This can happen as follows: When the profiler count reaches zero, an NMI handler is called which stores the sample values in an internal buffer, then resets the counter to its original value. If the reset count you specified is very low, a pending NMI can be sent before the NMI handler has completed. Due to the priority of the NMI, the pending interrupt is delivered immediately after completion of the previous interrupt handler, and control never returns to other parts of the system. If all processors are stuck in this mode, the system will appear to be frozen.

If this happens, it will be impossible to bring the system back to a workable state. There is no way to provide real security against this happening, other than making sure to use a reasonable value for the counter reset. For example, setting CPU_CLK_UNHALTED event type with a ridiculously low reset count (e.g. 500) is likely to freeze the system.
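A back-of-the-envelope check can flag reset counts that are clearly too low. The sketch below assumes a cycle-counting event that keeps counting whilst the NMI handler runs, and it uses a made-up handler cost; the real cost depends on the machine and kernel, and the function names are hypothetical.

```python
def handler_share(reset_count: int, handler_cycles: int) -> float:
    """Fraction of each sampling period spent inside the handler, capped at 1.0."""
    if reset_count <= 0:
        raise ValueError("reset_count must be positive")
    return min(1.0, handler_cycles / reset_count)


def is_dangerous(reset_count: int, handler_cycles: int) -> bool:
    """True if the next overflow is due before the handler can finish."""
    return handler_cycles >= reset_count


HANDLER_CYCLES = 2_000  # hypothetical cost of one NMI handler invocation

for count in (500, 100_000):
    share = handler_share(count, HANDLER_CYCLES)
    print(f"count={count}: handler share {share:.0%}, "
          f"dangerous={is_dangerous(count, HANDLER_CYCLES)}")
```

Under these assumptions a reset count of 500 leaves no time for anything but the handler, whereas a count of 100000 spends only a small share of each period in it.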

In short: don't try a foolish sample count value. Unfortunately, the definition of a foolish value depends heavily on the event type. If ever in doubt, post a message to

Note

The scenario described above cannot occur if you use operf for profiling instead of opcontrol, because the perf_events kernel subsystem automatically detects when performance monitor interrupts are arriving at a dangerous level and will throttle back the sampling rate.