3. Configuration details

3.1. Hardware performance counters

Most processor models include performance monitor units that can be configured to monitor (count) various types of hardware events. This section provides architecture-specific information to help you use these events for profiling. You need to read it only if you want to use events other than the default event chosen by OProfile.

Note

Your CPU type may not include the requisite support for hardware performance counters, in which case you must use OProfile in timer mode (see Section 3.2, “OProfile timer interrupt mode”), which is only available in OProfile releases prior to 1.0.

The Intel hardware performance counters are detailed in the Intel IA-32 Architecture Manual, Volume 3, available from http://developer.intel.com/. The AMD Athlon/Opteron/Phenom/Turion implementation is detailed in http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22007.pdf. For IBM PowerPC processors, documentation is available at https://www.power.org/. For example, https://www.power.org/events/Power7 contains specific information on the performance monitor unit for the IBM POWER7.

A physical performance monitor counter (PMC) is configured by a profiling tool to count a particular type of event. When the counter overflows, an interrupt is delivered to the processor. This is the basic mechanism on which OProfile is based. The delivery mode is NMI, so blocking interrupts in the kernel does not prevent profiling. When the interrupt handler is called, the current PC (program counter) value and the current task are recorded into the profiling structure. This allows the overflow event to be attributed to a specific assembly instruction in a specific binary image. OProfile receives this data (commonly referred to as a "sample") from the kernel and writes it to the sample files.
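The overflow-driven sampling described above can be illustrated with a small simulation. This is only a sketch in Python, not OProfile's implementation: the function name sample_by_overflow, the trace format, and the instruction addresses are hypothetical, and the simulated interrupt is delivered with no delay.

```python
from collections import Counter

def sample_by_overflow(event_stream, count):
    """Simulate sampling driven by counter overflow.

    event_stream: iterable of (pc, events) pairs, one per executed
                  instruction; `events` is the number of monitored
                  events that instruction caused (hypothetical format).
    count:        number of events between two samples.
    """
    remaining = count           # events left until the next overflow
    samples = Counter()         # stands in for the profiling structure
    for pc, events in event_stream:
        remaining -= events
        while remaining <= 0:   # overflow: the handler records the current PC
            samples[pc] += 1
            remaining += count  # counter is re-armed for the next sample
    return samples

# Made-up trace: a loop of three instructions executed 1000 times.
trace = [(0x400, 1), (0x404, 3), (0x408, 1)] * 1000
samples = sample_by_overflow(trace, count=97)
for pc, n in sorted(samples.items()):
    print(hex(pc), n)
```

In this made-up trace the instruction at 0x404 causes three fifths of the events and therefore collects the most samples, which is the kind of attribution the paragraph above describes.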

If we use an event such as CPU_CLK_UNHALTED or INST_RETIRED (GLOBAL_POWER_EVENTS or INSTR_RETIRED, respectively, on the Pentium 4), we can use the overflow counts (samples) as an estimate of the actual time spent in each part of the code. Alternatively, we can use the other available events to profile interesting data such as the cache behaviour of routines.
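As a rough illustration of treating sample counts as a time estimate, the sketch below converts per-symbol sample counts into shares of the total and into approximate seconds. The symbol names, sample counts, count value, and clock frequency are invented for the example, and the conversion to seconds assumes a clock-cycle event such as CPU_CLK_UNHALTED and a constant clock frequency.

```python
def time_shares(samples):
    """Return each symbol's fraction of all recorded samples."""
    total = sum(samples.values())
    if total == 0:
        return {}
    return {symbol: n / total for symbol, n in samples.items()}

def estimated_seconds(n_samples, count, cpu_hz):
    """Approximate time, assuming each sample stands for `count` cycles."""
    return n_samples * count / cpu_hz

# Hypothetical per-symbol sample counts from one profiling run.
samples = {"memcpy": 600, "parse_input": 300, "main": 100}
shares = time_shares(samples)
secs = estimated_seconds(samples["memcpy"], count=100_000, cpu_hz=2_000_000_000)
for symbol, share in shares.items():
    print(f"{symbol}: {share:.1%}")
```

The shares are only estimates: with few samples per symbol, the statistical noise can be larger than the differences being compared.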

However, there are several caveats. First, there are the issues listed in the Intel manual. There is a delay between the counter overflow and the interrupt delivery that can skew results on a small scale; this means you cannot rely on the profiles being perfectly accurate at the instruction level. For example, if you are profiling an application with an event that counts L1 cache misses, a sample attributed to a particular instruction in the application doesn't necessarily mean that exact instruction is responsible for that event; instead, it means the sample was taken in the dynamic vicinity of that instruction, usually with a margin of error of a few instructions. Further details on this problem can be found in Chapter 5, Interpreting profiling results and also in the Digital paper "ProfileMe: A Hardware Performance Counter".

Each counter has several configuration parameters besides the type of event to count. First, there is the unit mask, which is used to further qualify exactly what to count. Second, there is the count field, discussed below. Third, there are parameters to specify whether to increment counts whilst in kernel or user space. You can configure these separately for each counter.
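The per-counter parameters listed above can be pictured as one small record per counter. The sketch below is illustrative only: the CounterConfig class and parse_event_spec helper are hypothetical names, and the colon-separated form name:count[:unitmask[:kernel[:user]]] is modeled on the event specification accepted by operf's --events option as we understand it, so check the operf man page for your release before relying on it.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CounterConfig:
    event: str                       # type of event to count
    count: int                       # events between samples
    unit_mask: Optional[str] = None  # further qualifies what is counted
    kernel: bool = True              # count while in kernel space
    user: bool = True                # count while in user space

def parse_event_spec(spec):
    """Parse 'name:count[:unitmask[:kernel[:user]]]' (assumed format)."""
    fields = spec.split(":")
    if not 2 <= len(fields) <= 5:
        raise ValueError(f"expected 2 to 5 fields, got {len(fields)}")
    count = int(fields[1])
    if count <= 0:
        raise ValueError("count must be a positive integer")
    unit_mask = fields[2] if len(fields) > 2 and fields[2] else None
    kernel = fields[3] != "0" if len(fields) > 3 else True
    user = fields[4] != "0" if len(fields) > 4 else True
    return CounterConfig(fields[0], count, unit_mask, kernel, user)

# Kernel-only cycle counting with a sample every 100000 events.
cfg = parse_event_spec("CPU_CLK_UNHALTED:100000:0:1:0")
print(cfg)
```

Keeping one record per counter mirrors the point made above: each counter's unit mask, count, and kernel/user settings are configured separately.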

When the profiler is initially set up, a performance monitor counter is chosen for counting the event, and it is initialized using the count value. Once profiling begins, the counter increments with each event detected, and the counter overflows when the count value is reached. As described above, the counter overflow generates an interrupt, and the sample is recorded. After each overflow event, the counter is re-initialized using the count value, and counting begins anew for the next sample. Higher values for count result in samples being taken less frequently, and therefore in less detailed (and, potentially, less accurate) profiles. Lower values mean more detail, but higher overhead. Picking a good value for this parameter is, unfortunately, somewhat of a black art, and it depends on the event you have chosen. Specifying too large a value means that not enough interrupts are generated to give a realistic profile (though this problem can be ameliorated by profiling for longer time periods). Specifying too small a value can lead to higher performance overhead.
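The trade-off between detail and overhead can be made concrete with some back-of-the-envelope arithmetic. The sketch below is not taken from OProfile: the function names are hypothetical, and the event rate, clock frequency, and the cost of 2000 cycles per sample are assumed figures chosen only to show how the count value scales both quantities.

```python
def samples_per_second(event_rate_hz, count):
    """One sample is taken every `count` events."""
    return event_rate_hz / count

def overhead_fraction(event_rate_hz, count, cycles_per_sample, cpu_hz):
    """Fraction of CPU time spent taking samples (rough estimate)."""
    rate = samples_per_second(event_rate_hz, count)
    return rate * cycles_per_sample / cpu_hz

CPU_HZ = 2_000_000_000       # assumed 2 GHz clock
EVENT_RATE = CPU_HZ          # a cycle event on a fully busy CPU
CYCLES_PER_SAMPLE = 2_000    # assumed cost of handling one interrupt

for count in (10_000, 100_000, 1_000_000):
    rate = samples_per_second(EVENT_RATE, count)
    cost = overhead_fraction(EVENT_RATE, count, CYCLES_PER_SAMPLE, CPU_HZ)
    print(f"count={count}: {rate:.0f} samples/s, overhead {cost:.2%}")
```

Under these assumptions, raising the count tenfold cuts both the sample rate and the overhead tenfold, so a longer profiling run is needed to collect the same number of samples.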

3.2. OProfile timer interrupt mode

Some CPU types do not provide the hardware support needed for performance counters. Additionally, some older architectures are not supported by the perf_events kernel subsystem. On such machines, the operf and ocount commands will exit with a message indicating that the processor type is not supported. However, you can install OProfile 0.9.9 and use the legacy opcontrol-based profiler, which will fall back to using timer interrupts for profiling. Note that in timer mode, OProfile is not able to profile code that has interrupts disabled.

Note

Timer mode is only available using the legacy opcontrol command, available in releases prior to 1.0.

3.3. Architecture-specific configuration notes

3.3.1. Pentium 4 support

The Pentium 4 / Xeon performance counters are organized around 3 types of model-specific registers (MSRs): 45 event selection control registers (ESCRs), 18 counter configuration control registers (CCCRs) and 18 counters. ESCRs describe a particular set of events which are to be recorded, and CCCRs bind ESCRs to counters and configure their operation. Unfortunately, the relationship between these registers is quite complex; they cannot all be used with one another at any time. There is, however, a subset of 8 counters, 8 ESCRs, and 8 CCCRs which can be used independently of one another, so OProfile only accesses those registers, treating them as a bank of 8 "normal" counters, similar to those in the P6 or Athlon/Opteron/Phenom/Turion CPU families.

There is currently no support for Precision Event-Based Sampling (PEBS), nor any advanced uses of the Debug Store (DS). Current support is limited to the conservative extension of OProfile's existing interrupt-based model described above.

3.3.2. PowerPC64 support

The performance monitoring unit (PMU) for the IBM PowerPC 64-bit processors consists of between 4 and 8 counters (depending on the model). Advanced features such as instruction matching and thresholding are not supported by OProfile.