Your CPU type may not include the requisite support for hardware performance counters, in which case you must use OProfile in RTC mode in 2.4 (see Section 4.2, “OProfile in RTC mode”), or timer mode in 2.6 (see Section 4.3, “OProfile in timer interrupt mode”). You do not really need to read this section unless you are interested in using events other than the default event chosen by OProfile.
The Intel hardware performance counters are detailed in the Intel IA-32 Architecture Manual, Volume 3, available from http://developer.intel.com/. The AMD Athlon/Opteron/Phenom/Turion implementation is detailed in http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22007.pdf. For PowerPC64 processors in IBM iSeries, pSeries, and blade server systems, processor documentation is available at http://www-01.ibm.com/chips/techlib/techlib.nsf/productfamilies/PowerPC. (For example, the specific publication containing information on the performance monitor unit for the PowerPC970 is "IBM PowerPC 970FX RISC Microprocessor User's Manual.")
These processors are capable of delivering an interrupt when a counter overflows. This is the basic mechanism on which OProfile is based. The delivery mode is NMI, so blocking interrupts in the kernel does not prevent profiling. When the interrupt handler is called, the current PC value and the current task are recorded into the profiling structure. This allows the overflow event to be attached to a specific assembly instruction in a binary image. The daemon receives this data from the kernel, and writes it to the sample files.
If we use an event such as CPU_CLK_UNHALTED or INST_RETIRED
(GLOBAL_POWER_EVENTS or INSTR_RETIRED, respectively, on the Pentium 4), we can
use the overflow counts as an estimate of actual time spent in each part of code. Alternatively we can profile interesting
data such as the cache behaviour of routines with the other available counters.
However there are several caveats. First, there are those issues listed in the Intel manual. There is a delay between the counter overflow and the interrupt delivery that can skew results on a small scale, so you cannot rely on the profiles being perfectly accurate at the instruction level. If you are using an "event-mode" counter such as the cache counters, a count registered against an instruction does not mean that the instruction is responsible for that event. It does, however, imply that the counter overflowed in the dynamic vicinity of that instruction, to within a few instructions. Further details on this problem can be found in Chapter 5, Interpreting profiling results and also in the Digital paper "ProfileMe: A Hardware Performance Counter".
Each counter has several configuration parameters. First, there is the unit mask: this simply further specifies what to count. Second, there is the counter value, discussed below. Third, there is a parameter controlling whether to increment counts whilst in kernel or user space. You can configure these separately for each counter.
After each overflow event, the counter is re-initialized so that another overflow occurs once the number of events given by the counter value has been counted. Thus, higher values mean less-detailed profiling, and lower values mean more detail, but higher overhead. Picking a good value for this parameter is, unfortunately, somewhat of a black art. It is of course dependent on the event you have chosen. Specifying too large a value will mean not enough interrupts are generated to give a realistic profile (though this problem can be ameliorated by profiling for longer). Specifying too small a value can lead to higher performance overhead.
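The trade-off described above can be made concrete with a little arithmetic. The following is a minimal sketch in Python, not part of OProfile: the function names are hypothetical, and the event rate must be estimated by you (for a cycle-counting event it is roughly the clock frequency while the CPU is busy).

```python
def samples_per_second(event_rate_hz, count):
    """Expected overflow interrupts per second for a given counter value.

    event_rate_hz: how often the chosen event occurs per second (an estimate).
    count: the counter value, i.e. events between two overflows.
    """
    if count <= 0:
        raise ValueError("count must be positive")
    return event_rate_hz / count


def expected_samples(event_rate_hz, count, duration_s):
    """Rough number of samples collected over a profiling run."""
    return samples_per_second(event_rate_hz, count) * duration_s


if __name__ == "__main__":
    # Assumed example: a cycle event on a 2 GHz CPU that is fully busy.
    rate = 2_000_000_000
    for count in (10_000_000, 1_000_000, 100_000):
        print(count, samples_per_second(rate, count), expected_samples(rate, count, 30))
```

Under these assumptions, a large counter value yields few samples unless the run is long, while a small one multiplies the interrupt rate and therefore the overhead.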
This section applies to 2.2/2.4 kernels only.
Some CPU types do not provide the needed hardware support to use the hardware performance counters. This includes some laptops, classic Pentiums, and other CPU types not yet supported by OProfile (such as Cyrix). On these machines, OProfile falls back to using the real-time clock interrupt to collect samples. This interrupt is also used by the rtc module: you cannot have both the OProfile and rtc modules loaded, nor can rtc support be compiled into the kernel.
RTC mode is less capable than the hardware counters mode; in particular, it is unable to profile sections of the kernel where interrupts are disabled. There is just one available event, "RTC interrupts", and its value corresponds to the number of interrupts generated per second (that is, a higher number means a better profiling resolution, and higher overhead). The current implementation of the real-time clock supports only power-of-two sampling rates from 2 to 4096 per second. Other values within this range are rounded to the nearest power of two.
You can force use of the RTC interrupt with the force_rtc=1 module parameter.
Setting the value from the GUI should be straightforward. On the command line, you need to specify the event to opcontrol, e.g.:
opcontrol --event=RTC_INTERRUPTS:256
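The power-of-two rounding of the RTC sampling rate can be illustrated as follows. This is a sketch only, not the driver's code: the text does not say how "nearest" is measured or how ties are broken, so the function (a hypothetical name) assumes linear distance and rounds ties upward.

```python
RTC_MIN_RATE = 2
RTC_MAX_RATE = 4096


def round_rtc_rate(requested):
    """Round a requested rate (interrupts/second) to a power of two.

    Assumptions: linear distance decides which power of two is nearest,
    and an exact tie is rounded up. The real driver may differ.
    """
    if not RTC_MIN_RATE <= requested <= RTC_MAX_RATE:
        raise ValueError("rate must be between 2 and 4096 per second")
    lower = 1 << (requested.bit_length() - 1)  # largest power of two <= requested
    upper = lower if lower == requested else lower * 2
    return lower if requested - lower < upper - requested else upper


if __name__ == "__main__":
    for rate in (256, 300, 400, 4096):
        print(rate, "->", round_rtc_rate(rate))
```

With these assumptions, a request of 300 would be served at 256 interrupts per second and a request of 400 at 512.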
This section applies to 2.6 kernels and above only.
In 2.6 kernels on CPUs without OProfile support for the hardware performance counters, the driver falls back to using the timer interrupt for profiling. Like the RTC mode in 2.4 kernels, this is not able to profile code that has interrupts disabled. Note that there are no configuration parameters for setting this, unlike the RTC and hardware performance counter setup.
You can force use of the timer interrupt by using the timer=1 module
parameter (or oprofile.timer=1 on the boot command line if OProfile is
built-in).
The Pentium 4 / Xeon performance counters are organized around 3 types of model-specific registers (MSRs): 45 event selection control registers (ESCRs), 18 counter configuration control registers (CCCRs) and 18 counters. ESCRs describe a particular set of events which are to be recorded, and CCCRs bind ESCRs to counters and configure their operation. Unfortunately the relationship between these registers is quite complex; they cannot all be used with one another at any time. There is, however, a subset of 8 counters, 8 ESCRs, and 8 CCCRs which can be used independently of one another, so OProfile only accesses those registers, treating them as a bank of 8 "normal" counters, similar to those in the P6 or Athlon/Opteron/Phenom/Turion families of CPU.
There is currently no support for Precision Event-Based Sampling (PEBS), nor any advanced uses of the Debug Store (DS). Current support is limited to the conservative extension of OProfile's existing interrupt-based model described above. Performance monitoring hardware on Pentium 4 / Xeon processors with hyper-threading enabled (multiple logical processors on a single die) is not supported in 2.4 kernels (you can use OProfile if you disable hyper-threading, though).
The Itanium 2 performance monitoring unit (PMU) organizes the counters as four pairs of performance event monitoring registers. Each pair is composed of a Performance Monitoring Configuration (PMC) register and Performance Monitoring Data (PMD) register. The PMC selects the performance event being monitored and the PMD determines the sampling interval. The IA64 Performance Monitoring Unit (PMU) triggers sampling with maskable interrupts. Thus, samples will not occur in sections of the IA64 kernel where interrupts are disabled.
None of the advanced features of the Itanium 2 performance monitoring unit, such as opcode matching, address range matching, or precise event sampling, are supported by this version of OProfile. The Itanium 2 support only maps OProfile's existing interrupt-based model to the PMU hardware.
The performance monitoring unit (PMU) for the IBM PowerPC 64-bit processors consists of between 4 and 8 counters (depending on the model), plus three special purpose registers used for programming the counters -- MMCR0, MMCR1, and MMCRA. Advanced features such as instruction matching and thresholding are not supported by this version of OProfile.
The Cell Broadband Engine (CBE) processor core consists of a PowerPC Processing Element (PPE) and 8 Synergistic Processing Elements (SPE). PPEs and SPEs each consist of a processing unit (PPU and SPU, respectively) and other hardware components, such as memory controllers.
A PPU has two hardware threads (aka "virtual CPUs"). The performance monitor unit of the CBE collects event information on one hardware thread at a time. Therefore, when profiling PPE events, OProfile collects the profile based on the selected events by time slicing the performance counter hardware between the two threads. The user must ensure the collection interval is long enough so that the time spent collecting data for each PPU is sufficient to obtain a good profile.
To profile an SPU application, the user should specify the SPU_CYCLES event. When starting OProfile with SPU_CYCLES, the opcontrol script enforces certain separation parameters (separate=cpu,lib) to ensure that sufficient information is collected in the sample data in order to generate a complete report. The --merge=cpu option can be used to obtain a more readable report if analyzing the performance of each separate SPU is not necessary.
Profiling with an SPU event (events 4100 through 4163) is not compatible with any other event. Furthermore, only one SPU event can be specified at a time. The hardware only supports profiling on one SPU per node at a time. The OProfile kernel code time slices between the eight SPUs to collect data on all SPUs.
SPU profile reports have some unique characteristics compared to reports for standard architectures:
Instruction-Based Sampling (IBS) is a new performance measurement technique available on AMD Family 10h processors. Traditional performance counter sampling is not precise enough to isolate performance issues to individual instructions. IBS, however, precisely identifies instructions which are not making the best use of the processor pipeline and memory hierarchy. For more information, please refer to "Instruction-Based Sampling: A New Performance Analysis Technique for AMD Family 10h Processors" (http://developer.amd.com/assets/AMD_IBS_paper_EN.pdf). There are two IBS profile types, described in the following sections.
IBS fetch sampling is a statistical sampling method which counts completed fetch operations. When the number of completed fetch operations reaches the maximum fetch count (the sampling period), IBS tags the fetch operation and monitors that operation until it either completes or aborts. When a tagged fetch completes or aborts, a sampling interrupt is generated and an IBS fetch sample is taken. An IBS fetch sample contains a timestamp, the identifier of the interrupted process, the virtual fetch address, and several event flags and values that describe what happened during the fetch operation.
IBS op sampling selects, tags, and monitors macro-ops as issued from AMD64 instructions. Two options are available for selecting ops for sampling:
In both cases, an IBS sample is generated only if the tagged op retires. Thus, IBS op event information does not measure speculative execution activity. The execution stages of the pipeline monitor the tagged macro-op. When the tagged macro-op retires, a sampling interrupt is generated and an IBS op sample is taken. An IBS op sample contains a timestamp, the identifier of the interrupted process, the virtual address of the AMD64 instruction from which the op was issued, and several event flags and values that describe what happened when the macro-op executed.
Enabling IBS profiling is done simply by specifying IBS performance events
through the "--event=" options. These events are listed in the output of
opcontrol --list-events.
opcontrol --event=IBS_FETCH_XXX:<count>:<um>:<kernel>:<user>
opcontrol --event=IBS_OP_XXX:<count>:<um>:<kernel>:<user>
Note: All IBS fetch events must have the same event count and unit mask,
as must all IBS op events.
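The rule in the note above can be checked before invoking opcontrol. The sketch below is an illustration in Python and is not part of OProfile: the class and function names are hypothetical, it assumes all five colon-separated fields are present, and IBS_FETCH_XXX, IBS_FETCH_YYY and IBS_OP_XXX are placeholders rather than real event names.

```python
from typing import NamedTuple


class EventSpec(NamedTuple):
    name: str
    count: int
    unit_mask: int
    kernel: bool
    user: bool


def parse_event_spec(spec):
    """Parse 'NAME:count:um:kernel:user' (all five fields assumed present)."""
    name, count, um, kernel, user = spec.split(":")
    # int(um, 0) accepts decimal ("0") or hex ("0x01") unit masks.
    return EventSpec(name, int(count), int(um, 0), kernel == "1", user == "1")


def check_ibs_consistency(specs):
    """Raise ValueError if IBS fetch (or IBS op) events disagree on count or unit mask."""
    for prefix in ("IBS_FETCH_", "IBS_OP_"):
        settings = {(s.count, s.unit_mask) for s in specs if s.name.startswith(prefix)}
        if len(settings) > 1:
            raise ValueError(prefix + "events must share one count and unit mask")


if __name__ == "__main__":
    events = [
        parse_event_spec("IBS_FETCH_XXX:250000:0:1:1"),  # placeholder event names
        parse_event_spec("IBS_FETCH_YYY:250000:0:1:1"),
        parse_event_spec("IBS_OP_XXX:500000:0:1:1"),
    ]
    check_ibs_consistency(events)
    print("event list is consistent")
```

Fetch and op events are checked as separate groups, so the two groups may use different counts from each other.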
OProfile is a low-level profiler which allows continuous profiling at a low overhead cost. If too low a count reset value is set for a counter, the system can become overloaded with counter interrupts and appear to have frozen. Whilst some validation is done, it is not foolproof.
This can happen as follows: When the profiler count reaches zero an NMI handler is called which stores the sample values in an internal buffer, then resets the counter to its original value. If the count is very low, a pending NMI can be sent before the NMI handler has completed. Due to the priority of the NMI, the local APIC delivers the pending interrupt immediately after completion of the previous interrupt handler, and control never returns to other parts of the system. In this way the system seems to be frozen.
If this happens, it will be impossible to bring the system back to a workable state.
There is no way to provide real security against this happening, other than making sure to use a reasonable value
for the counter reset. For example, setting CPU_CLK_UNHALTED event type with a ridiculously low reset count (e.g. 500)
is likely to freeze the system.
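To see why such a value is dangerous, compare the time between two counter overflows with the time the NMI handler needs. The sketch below is only an illustration in Python: the function names are hypothetical, and the handler cost of 1000 ns and the 2 GHz event rate are assumptions, not figures measured on OProfile.

```python
def overflow_interval_ns(event_rate_hz, count):
    """Nanoseconds between two counter overflows at the given event rate."""
    if count <= 0 or event_rate_hz <= 0:
        raise ValueError("rate and count must be positive")
    return count * 1e9 / event_rate_hz


def may_starve_system(event_rate_hz, count, handler_cost_ns):
    """True if the next overflow is due before the handler can finish."""
    return overflow_interval_ns(event_rate_hz, count) <= handler_cost_ns


if __name__ == "__main__":
    rate = 2_000_000_000      # assumed: cycle event on a 2 GHz CPU
    handler_cost_ns = 1000.0  # assumed handler cost, for illustration only
    for count in (500, 100_000):
        print(count, overflow_interval_ns(rate, count),
              may_starve_system(rate, count, handler_cost_ns))
```

Under these assumptions a reset count of 500 leaves 250 ns between overflows, less than the assumed handler cost, so a new NMI would always be pending when the handler returns.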
In short: don't try a foolish sample count value. Unfortunately the definition of a foolish value is really dependent on the event type; if ever in doubt, e-mail