Chapter 1. Introduction

Table of Contents

1. Overview
2. Components of the OProfile system
2.1. Architecture-specific components
2.2. oprofilefs
2.3. Generic kernel driver
2.4. The OProfile daemon
2.5. Post-profiling tools

This document is current for OProfile version 1.2.0. It provides some details on the internal workings of OProfile for the interested hacker, and assumes strong C, working C++, and some knowledge of kernel internals and CPU hardware.


Only the "new" implementation associated with kernel 2.6 and above is covered here. Kernel 2.4 uses a very different kernel module implementation and daemon to produce the sample files.

1. Overview

OProfile is a statistical continuous profiler. In other words, profiles are generated by regularly sampling the current registers on each CPU (an interrupt handler stores the PC value that was saved at the time of the interrupt), and converting that runtime PC value into something meaningful to the programmer.

OProfile achieves this by taking the stream of sampled PC values, along with the detail of which task was running at the time of the interrupt, and converting each value into a file offset against a particular binary file. Because applications mmap() the code they run (be it /bin/bash, /lib/ or whatever), it's possible to find the relevant binary file and offset by walking the task's list of mapped memory areas. Each PC value is thus converted into a (binary image, offset) tuple. The userspace tools can use this tuple directly to reconstruct where the code came from, including the particular assembly instructions, symbol, and source line (via the binary's debug information, if present).
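The lookup described above can be sketched in a few lines of C. This is only an illustration under stated assumptions: struct mapping, struct sample and pc_to_sample() are hypothetical names invented here, not OProfile or kernel identifiers, and a flat array stands in for the task's real list of mapped memory areas.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for one mapped memory area of a task. */
struct mapping {
    uintptr_t start;       /* first address covered by the mapping */
    uintptr_t end;         /* one past the last address covered */
    uint64_t file_offset;  /* offset into the image at which the mapping begins */
    const char *image;     /* backing binary image, e.g. "/bin/bash" */
};

/* Hypothetical result type: the (binary image, offset) tuple. */
struct sample {
    const char *image;     /* NULL if the PC fell inside no known mapping */
    uint64_t offset;       /* file offset of the sampled instruction */
};

/* Walk the mappings and convert a runtime PC into (binary image, offset). */
static struct sample pc_to_sample(const struct mapping *maps, size_t n,
                                  uintptr_t pc)
{
    struct sample s = { NULL, 0 };

    for (size_t i = 0; i < n; i++) {
        assert(maps[i].start < maps[i].end);
        if (pc >= maps[i].start && pc < maps[i].end) {
            s.image = maps[i].image;
            s.offset = (uint64_t)(pc - maps[i].start) + maps[i].file_offset;
            return s;
        }
    }
    return s; /* no mapping matched: image stays NULL */
}
```

With a mapping of a library at 0x7f0000-0x7f4000 that begins at file offset 0x2000, a PC of 0x7f0123 would resolve to offset 0x2123 in that library. A PC outside every mapping yields a NULL image, which a caller would have to handle separately.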

Regularly sampling the PC value like this approximates what was actually executed and how often; more often than not, this statistical approximation is good enough to reflect reality. In common operation, the time between sample interrupts is regulated by a fixed number of clock cycles. This implies that the results will reflect where the CPU is spending the most time, which is obviously a very useful source of information for performance analysis.

Sometimes, though, an application programmer needs different kinds of information: for example, "which of the source routines cause the most cache misses?". The rise in importance of such metrics in recent years has led many CPU manufacturers to provide hardware performance counters capable of measuring these events at the hardware level. Typically, these counters increment once per event, and generate an interrupt on reaching some pre-defined number of events. OProfile can use these interrupts to generate samples: the profile results are then a statistical approximation of which code caused how many of the given events.
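As a rough model of such a counter, the sketch below counts events in software and credits one sample to whichever routine was running when the threshold was reached. Everything in it is an assumption made for illustration: struct event_counter and count_events() are hypothetical names, routines are identified by an array index, and a real counter raises an interrupt rather than incrementing an array.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one hardware event counter. */
struct event_counter {
    unsigned long count;      /* events seen since the last overflow */
    unsigned long threshold;  /* events needed to trigger one sample */
};

/*
 * Account for `events` events caused while routine `id` was running.
 * Every time the threshold is reached, one sample is credited to `id`;
 * this stands in for the interrupt a real counter would generate.
 */
static void count_events(struct event_counter *ctr, unsigned long events,
                         size_t id, unsigned long *samples)
{
    assert(ctr->threshold > 0);

    ctr->count += events;
    while (ctr->count >= ctr->threshold) {
        ctr->count -= ctr->threshold; /* leftover events carry over */
        samples[id]++;
    }
}
```

For instance, if routine 0 causes 30 cache misses per iteration and routine 1 causes 70, a threshold of 7 over 7 iterations credits 30 samples to routine 0 and 70 to routine 1, matching the true 30/70 split. Other thresholds can give a coarser approximation.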

Consider a simplified system that only executes two functions, A and B, each called equally often. A takes one cycle to execute, whereas B takes 99 cycles. Imagine we run at 100 cycles a second, and we've set the performance counter to create an interrupt after a set number of "events" (in this case an event is one clock cycle). It should be clear that the chance of the interrupt occurring in function A is 1/100, and 99/100 for function B. Thus, we statistically approximate the actual relative performance characteristics of the two functions over time. This same analysis works for other types of events, provided that the interrupt is tied to the number of events occurring (that is, after N events, an interrupt is generated).
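The A/B example can be checked with a toy simulation. The sketch assumes a strictly repeating 100-cycle loop in which cycle 0 belongs to A and cycles 1 to 99 belong to B; struct counts and sample_profile() are hypothetical names introduced only for this illustration.

```c
#include <assert.h>

/* Toy model: each 100-cycle loop spends cycle 0 in A and cycles 1..99 in B. */
enum { LOOP_CYCLES = 100 };

struct counts {
    unsigned long a;  /* samples that landed in function A */
    unsigned long b;  /* samples that landed in function B */
};

/* Take nr_samples samples, one every `period` cycles, crediting A or B. */
static struct counts sample_profile(unsigned long period,
                                    unsigned long nr_samples)
{
    struct counts c = { 0, 0 };
    unsigned long cycle = 0;

    assert(period > 0);

    for (unsigned long i = 0; i < nr_samples; i++) {
        cycle += period;              /* counter overflows after `period` cycles */
        if (cycle % LOOP_CYCLES == 0)
            c.a++;                    /* interrupt landed on A's single cycle */
        else
            c.b++;                    /* interrupt landed in B's 99 cycles */
    }
    return c;
}
```

Calling sample_profile(7, 700) puts 7 samples in A and 693 in B, the expected 1/100 and 99/100 split. Note that in this perfectly regular toy loop a period that is a multiple of 100 would land every sample in A, so the period here is deliberately chosen to share no common factor with the loop length.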

There is typically more than one of these counters, so it's possible to set up profiling for several different event types. Using these counters gives us a powerful, low-overhead way of gaining performance metrics. If OProfile, or the CPU, does not support performance counters, then a simpler method is used: the kernel timer interrupt feeds samples into OProfile itself.

The rest of this document concerns itself with how we get from receiving samples at interrupt time to producing user-readable profile information.