Intel Nehalem microarchitecture performance counter events

Intel Nehalem Microarchitecture events

This is a list of all Intel Nehalem Microarchitecture performance counter event types. Please see the Intel Architecture Developer's Manual Volume 3B, Appendix A, and the Intel Architecture Optimization Reference Manual (730795-001).
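
Several rows in the table below list composite unit masks that equal the bitwise OR of the single-bit masks in the same row (for example, L2_RQSTS lists 0x03 for all loads next to 0x01 for ld_hit and 0x02 for ld_miss). The following is a minimal Python sketch of that relationship, not part of any profiling tool. The helper names `combine` and `decode` are hypothetical, and the sketch assumes the event's masks are independent bits; events whose masks are not single bits (such as LLC_MISSES with 0x41) do not combine this way.

```python
# Single-bit unit masks for L2_RQSTS, copied from the table below.
L2_RQSTS = {
    "ld_hit": 0x01,
    "ld_miss": 0x02,
    "rfo_hit": 0x04,
    "rfo_miss": 0x08,
    "ifetch_hit": 0x10,
    "ifetch_miss": 0x20,
    "prefetch_hit": 0x40,
    "prefetch_miss": 0x80,
}


def combine(masks, *names):
    """OR together the named single-bit unit masks (hypothetical helper)."""
    value = 0
    for name in names:
        value |= masks[name]
    return value


def decode(masks, value):
    """Return the names of the single-bit masks that are set in value."""
    return [name for name, bit in masks.items() if value & bit]


# Composite masks rebuilt from their single-bit parts.
loads = combine(L2_RQSTS, "ld_hit", "ld_miss")
misses = combine(L2_RQSTS, "ld_miss", "rfo_miss", "ifetch_miss", "prefetch_miss")

print(hex(loads), hex(misses))   # 0x3 0xaa
print(decode(L2_RQSTS, 0x30))    # ['ifetch_hit', 'ifetch_miss']
```

The two composite values match the 0x03 (loads) and 0xaa (miss) entries listed for L2_RQSTS.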

Name	Description	Counters usable	Unit mask options
UNHALTED_REFERENCE_CYCLES	Unhalted reference cycles	all	0x01: No unit mask
LLC_MISSES	Last level cache demand requests from this core that missed the LLC	all	0x41: No unit mask
LLC_REFS	Last level cache demand requests from this core	all	0x4f: No unit mask
INST_RETIRED	number of instructions retired	all	0x01: (name=any_p) instructions retired 0x02: (name=x87) Counts the number of floating point computational operations retired: floating point computational operations executed by the assist handler and sub-operations of complex floating point instructions like transcendental instructions
BR_INST_RETIRED	number of branch instructions retired	all	0x00: (name=all_branches) See Table A-1 0x01: (name=conditional) Counts the number of conditional branch instructions retired 0x02: (name=near_call) Counts the number of direct & indirect near unconditional calls retired 0x04: (name=all_branches) Counts the number of branch instructions retired
BR_MISS_PRED_RETIRED	number of mispredicted branches retired (precise)	all	0x00: (name=all_branches) See Table A-1 0x02: (name=near_call) Counts mispredicted direct & indirect near unconditional retired calls
SB_FORWARD	Counts the number of store forwards.	all	0x01: (name=any) Counts the number of store forwards
LOAD_BLOCK	Counts the number of loads blocked	all	0x01: (name=std) Counts the number of loads blocked by a preceding store with unknown data 0x04: (name=address_offset) Counts the number of loads blocked by a preceding store address
SB_DRAIN	Counts the cycles of store buffer drains.	all	0x01: (name=cycles) Counts the cycles of store buffer drains
MISALIGN_MEM_REF	Counts the number of misaligned load references	all	0x01: (name=load) Counts the number of misaligned load references 0x02: (name=store) Counts the number of misaligned store references 0x03: (name=any) Counts the number of misaligned memory references
STORE_BLOCKS	This event counts the number of load operations delayed by preceding stores.	all	0x01: (name=not_sta) This event counts the number of load operations delayed by preceding stores whose addresses are known but whose data is unknown, and preceding stores that conflict with the load but which incompletely overlap the load 0x02: (name=sta) This event counts load operations delayed by preceding stores whose addresses are unknown (STA block) 0x04: (name=at_ret) Counts number of loads delayed with at-Retirement block code 0x08: (name=l1d_block) Cacheable loads delayed with L1D block code 0x0f: any All loads delayed due to store blocks
PARTIAL_ADDRESS_ALIAS	Counts false dependency due to partial address aliasing	all	0x01: No unit mask
DTLB_LOAD_MISSES	Counts dtlb page walks	all	0x01: (name=any) Counts all load misses that cause a page walk 0x02: (name=walk_completed) Counts number of completed page walks due to load miss in the STLB 0x10: (name=stlb_hit) Number of cache load STLB hits 0x20: (name=pde_miss) Number of DTLB cache load misses where the low part of the linear to physical address translation was missed 0x40: (name=pdp_miss) Number of DTLB cache load misses where the high part of the linear to physical address translation was missed 0x80: (name=large_walk_completed) Counts number of completed large page walks due to load miss in the STLB
MEMORY_DISAMBIGURATION	Counts memory disambiguation events	all	0x01: (name=reset) Counts memory disambiguation reset cycles 0x02: (name=success) Counts the number of loads for which memory disambiguation succeeded 0x04: (name=watchdog) Counts the number of times the memory disambiguation watchdog kicked in 0x08: (name=watch_cycles) Counts the cycles that the memory disambiguation watchdog is active
MEM_INST_RETIRED	Counts the number of instructions with an architecturally-visible load/store retired on the architected path.	all	0x01: (name=loads) Counts the number of instructions with an architecturally-visible load retired on the architected path 0x02: (name=stores) Counts the number of instructions with an architecturally-visible store retired on the architected path
MEM_STORE_RETIRED	The event counts the number of retired stores that missed the DTLB. The DTLB miss is not counted if the store operation causes a fault. Does not count prefetches. Counts both primary and secondary misses to the TLB	all	0x01: (name=dtlb_miss) The event counts the number of retired stores that missed the DTLB
UOPS_ISSUED	Counts the number of Uops issued by the Register Allocation Table to the Reservation Station, i.e. the UOPs issued from the front end to the back end.	all	0x01: (name=any) Counts the number of Uops issued by the Register Allocation Table to the Reservation Station, i.e. the UOPs issued from the front end to the back end 0x01: (name=stalled_cycles) Counts the number of cycles no Uops issued by the Register Allocation Table to the Reservation Station, i.e. from the front end to the back end 0x02: (name=fused) Counts the number of fused Uops that were issued from the Register Allocation Table to the Reservation Station
MEM_UNCORE_RETIRED	Counts number of memory load instructions retired where the memory reference hit modified data in another core	all	0x02: (name=other_core_l2_hitm) Counts number of memory load instructions retired where the memory reference hit modified data in a sibling core residing on the same socket 0x08: (name=remote_cache_local_home_hit) Counts number of memory load instructions retired where the memory reference missed the L1, L2 and LLC caches and HIT in a remote socket's cache 0x10: (name=remote_dram) Counts number of memory load instructions retired where the memory reference missed the L1, L2 and LLC caches and was remotely homed 0x20: (name=local_dram) Counts number of memory load instructions retired where the memory reference missed the L1, L2 and LLC caches and required a local socket memory reference
FP_COMP_OPS_EXE	Counts the number of FP Computational Uops Executed.	all	0x01: (name=x87) Counts the number of FP Computational Uops Executed 0x02: (name=mmx) Counts number of MMX Uops executed 0x04: (name=sse_fp) Counts number of SSE and SSE2 FP uops executed 0x08: (name=sse2_integer) Counts number of SSE2 integer uops executed 0x10: (name=sse_fp_packed) Counts number of SSE FP packed uops executed 0x20: (name=sse_fp_scalar) Counts number of SSE FP scalar uops executed 0x40: (name=sse_single_precision) Counts number of SSE* FP single precision uops executed 0x80: (name=sse_double_precision) Counts number of SSE* FP double precision uops executed
SIMD_INT_128	Counts number of 128 bit SIMD integer operations.	all	0x01: (name=packed_mpy) Counts number of 128 bit SIMD integer multiply operations 0x02: (name=packed_shift) Counts number of 128 bit SIMD integer shift operations 0x04: (name=pack) Counts number of 128 bit SIMD integer pack operations 0x08: (name=unpack) Counts number of 128 bit SIMD integer unpack operations 0x10: (name=packed_logical) Counts number of 128 bit SIMD integer logical operations 0x20: (name=packed_arith) Counts number of 128 bit SIMD integer arithmetic operations 0x40: (name=shuffle_move) Counts number of 128 bit SIMD integer shuffle and move operations
LOAD_DISPATCH	Counts number of loads dispatched from the Reservation Station that bypass the Memory Order Buffer.	all	0x01: (name=rs) Counts number of loads dispatched from the Reservation Station that bypass the Memory Order Buffer 0x02: (name=rs_delayed) Counts the number of delayed RS dispatches at the stage latch 0x04: (name=mob) Counts the number of loads dispatched from the Reservation Station to the Memory Order Buffer 0x07: (name=any) Counts all loads dispatched from the Reservation Station
ARITH	Counts division cycles and number of multiplies. Includes integer and FP, but excludes DPPS/MPSAD.	all	0x01: (name=cycles_div_busy) Counts the number of cycles the divider is busy executing divide or square root operations 0x02: (name=mul) Counts the number of multiply operations executed
INST_QUEUE_WRITES	Counts the number of instructions written into the instruction queue every cycle.	all	0x01: No unit mask
INST_DECODED	Counts number of instructions that require decoder 0 to be decoded. Usually, this means that the instruction maps to more than 1 uop	all	0x01: (name=dec0) Counts number of instructions that require decoder 0 to be decoded
TWO_UOP_INSTS_DECODED	An instruction that generates two uops was decoded	all	0x01: No unit mask
HW_INT	Counts hardware interrupt events.	all	0x01: (name=rcv) Number of interrupts received 0x02: (name=cycles_masked) Number of cycles interrupts are masked 0x04: (name=cycles_pending_and_masked) Number of cycles interrupts are pending and masked
INST_QUEUE_WRITE_CYCLES	This event counts the number of cycles during which instructions are written to the instruction queue. Dividing the number of instructions written to the instruction queue (INST_QUEUE_WRITES) by this counter yields the average number of instructions decoded each cycle. If this number is less than four and the pipe stalls, this indicates that the decoder is failing to decode enough instructions per cycle to sustain the 4-wide pipeline.	all	0x01: No unit mask
L2_RQSTS	Counts number of L2 data loads	all	0x01: (name=ld_hit) Counts number of loads that hit the L2 cache 0x02: (name=ld_miss) Counts the number of loads that miss the L2 cache 0x03: (name=loads) Counts all L2 load requests 0x04: (name=rfo_hit) Counts the number of store RFO requests that hit the L2 cache 0x08: (name=rfo_miss) Counts the number of store RFO requests that miss the L2 cache 0x0c: rfos Counts all L2 store RFO requests 0x10: (name=ifetch_hit) Counts number of instruction fetches that hit the L2 cache 0x20: (name=ifetch_miss) Counts number of instruction fetches that miss the L2 cache 0x30: (name=ifetches) Counts all instruction fetches 0x40: (name=prefetch_hit) Counts L2 prefetch hits for both code and data 0x80: (name=prefetch_miss) Counts L2 prefetch misses for both code and data 0xc0: prefetches Counts all L2 prefetches for both code and data 0xaa: miss Counts all L2 misses for both code and data 0xff: references Counts all L2 requests for both code and data
L2_DATA_RQSTS	More L2 data loads.	all	0x01: (name=i_state) Counts number of L2 data demand loads where the cache line to be loaded is in the I (invalid) state 0x02: (name=s_state) Counts number of L2 data demand loads where the cache line to be loaded is in the S (shared) state 0x04: (name=e_state) Counts number of L2 data demand loads where the cache line to be loaded is in the E (exclusive) state 0x08: (name=m_state) Counts number of L2 data demand loads where the cache line to be loaded is in the M (modified) state 0x0f: mesi Counts all L2 data demand requests 0x10: (name=i_state) Counts number of L2 prefetch data loads where the cache line to be loaded is in the I (invalid) state 0x20: (name=s_state) Counts number of L2 prefetch data loads where the cache line to be loaded is in the S (shared) state 0x40: (name=e_state) Counts number of L2 prefetch data loads where the cache line to be loaded is in the E (exclusive) state 0x80: (name=m_state) Counts number of L2 prefetch data loads where the cache line to be loaded is in the M (modified) state 0xf0: mesi Counts all L2 prefetch requests 0xff: any Counts all L2 data requests
L2_WRITE	Counts number of L2 writes	all	0x01: (name=i_state) Counts number of L2 demand store RFO requests where the cache line to be loaded is in the I (invalid) state 0x02: (name=s_state) Counts number of L2 store RFO requests where the cache line to be loaded is in the S (shared) state 0x04: (name=e_state) Counts number of L2 store RFO requests where the cache line to be loaded is in the E (exclusive) state 0x08: (name=m_state) Counts number of L2 store RFO requests where the cache line to be loaded is in the M (modified) state 0x0e: hit Counts number of L2 store RFO requests where the cache line to be loaded is in either the S, E or M states 0x0f: mesi Counts all L2 store RFO requests 0x10: (name=i_state) Counts number of L2 demand lock RFO requests where the cache line to be loaded is in the I (invalid) state 0x20: (name=s_state) Counts number of L2 lock RFO requests where the cache line to be loaded is in the S (shared) state 0x40: (name=e_state) Counts number of L2 demand lock RFO requests where the cache line to be loaded is in the E (exclusive) state 0x80: (name=m_state) Counts number of L2 demand lock RFO requests where the cache line to be loaded is in the M (modified) state 0xe0: hit Counts number of L2 demand lock RFO requests where the cache line to be loaded is in either the S, E, or M state 0xf0: mesi Counts all L2 demand lock RFO requests
L1D_WB_L2	Counts number of L1 writebacks to the L2.	all	0x01: (name=i_state) Counts number of L1 writebacks to the L2 where the cache line to be written is in the I (invalid) state 0x02: (name=s_state) Counts number of L1 writebacks to the L2 where the cache line to be written is in the S state 0x04: (name=e_state) Counts number of L1 writebacks to the L2 where the cache line to be written is in the E (exclusive) state 0x08: (name=m_state) Counts number of L1 writebacks to the L2 where the cache line to be written is in the M (modified) state 0x0f: mesi Counts all L1 writebacks to the L2
LONGEST_LAT_CACHE	Count LLC cache reference latencies.	all	0x4f: reference This event counts requests originating from the core that reference a cache line in the last level cache 0x41: (name=miss) This event counts each cache miss condition for references to the last level cache
CPU_CLK_UNHALTED	Counts the number of thread cycles while the thread is not in a halt state.	all	0x00: (name=thread_p) Counts the number of thread cycles while the thread is not in a halt state 0x01: (name=ref_p) Increments at the frequency of a slower reference clock when not halted
UOPS_DECODED_DEC0	Counts micro-ops decoded by decoder 0.	all	0x01: No unit mask
L1D_CACHE_LD	Counts L1 data cache read requests.	0, 1	0x01: (name=i_state) Counts L1 data cache read requests where the cache line to be loaded is in the I (invalid) state 0x02: (name=s_state) Counts L1 data cache read requests where the cache line to be loaded is in the S (shared) state 0x04: (name=e_state) Counts L1 data cache read requests where the cache line to be loaded is in the E (exclusive) state 0x08: (name=m_state) Counts L1 data cache read requests where the cache line to be loaded is in the M (modified) state 0x0f: mesi Counts L1 data cache read requests
L1D_CACHE_ST	Counts L1 data cache stores.	0, 1	0x01: (name=i_state) Counts L1 data cache store RFO requests where the cache line to be loaded is in the I state 0x02: (name=s_state) Counts L1 data cache store RFO requests where the cache line to be loaded is in the S (shared) state 0x04: (name=e_state) Counts L1 data cache store RFO requests where the cache line to be loaded is in the E (exclusive) state 0x08: (name=m_state) Counts L1 data cache store RFO requests where cache line to be loaded is in the M (modified) state 0x0f: mesi Counts L1 data cache store RFO requests
L1D_CACHE_LOCK	Counts retired load locks in the L1D cache.	0, 1	0x01: (name=hit) Counts retired load locks that hit in the L1 data cache or hit in an already allocated fill buffer 0x02: (name=s_state) Counts L1 data cache retired load locks that hit the target cache line in the shared state 0x04: (name=e_state) Counts L1 data cache retired load locks that hit the target cache line in the exclusive state 0x08: (name=m_state) Counts L1 data cache retired load locks that hit the target cache line in the modified state
L1D_ALL_REF	Counts all references to the L1 data cache.	0, 1	0x01: (name=any) Counts all references (uncached, speculated and retired) to the L1 data cache, including all loads and stores with any memory types 0x02: (name=cacheable) Counts all data reads and writes (speculated and retired) from cacheable memory, including locked operations
DTLB_MISSES	Counts the number of misses in the STLB	all	0x01: (name=any) Counts the number of misses in the STLB which causes a page walk 0x02: (name=walk_completed) Counts number of misses in the STLB which resulted in a completed page walk 0x10: (name=stlb_hit) Counts the number of DTLB first level misses that hit in the second level TLB 0x20: (name=pde_miss) Number of DTLB cache misses where the low part of the linear to physical address translation was missed 0x40: (name=pdp_miss) Number of DTLB misses where the high part of the linear to physical address translation was missed 0x80: (name=large_walk_completed) Counts number of completed large page walks due to misses in the STLB
SSE_MEM_EXEC	Counts number of SSE instructions which missed the L1 data cache.	all	0x01: (name=nta) Counts number of SSE NTA prefetch/weakly-ordered instructions which missed the L1 data cache 0x08: (name=streaming_stores) Counts number of SSE nontemporal stores
LOAD_HIT_PRE	Counts load operations sent to the L1 data cache while a previous SSE prefetch instruction to the same cache line has started prefetching but has not yet finished.	all	0x01: No unit mask
SFENCE_CYCLES	Counts store fence cycles	all	0x01: No unit mask
L1D_PREFETCH	Counts number of hardware prefetch requests.	all	0x01: (name=requests) Counts number of hardware prefetch requests dispatched out of the prefetch FIFO 0x02: (name=miss) Counts number of hardware prefetch requests that miss the L1D 0x04: (name=triggers) Counts number of prefetch requests triggered by the Finite State Machine and pushed into the prefetch FIFO
EPT	Counts Extended Page Directory Entry accesses. The Extended Page Directory cache is used by Virtual Machine operating systems while the guest operating systems use the standard TLB caches.	all	0x02: (name=epde_miss) Counts Extended Page Directory Entry misses 0x04: (name=epdpe_hit) Counts Extended Page Directory Pointer Entry hits 0x08: (name=epdpe_miss) Counts Extended Page Directory Pointer Entry misses
L1D	Counts the number of lines brought from/to the L1 data cache.	0, 1	0x01: (name=repl) Counts the number of lines brought into the L1 data cache 0x02: (name=m_repl) Counts the number of modified lines brought into the L1 data cache 0x04: (name=m_evict) Counts the number of modified lines evicted from the L1 data cache due to replacement 0x08: (name=m_snoop_evict) Counts the number of modified lines evicted from the L1 data cache due to snoop HITM intervention
L1D_CACHE_PREFETCH_LOCK_FB_HIT	Counts the number of cacheable load lock speculated instructions accepted into the fill buffer.	all	0x01: No unit mask
L1D_CACHE_LOCK_FB_HIT	Counts the number of cacheable load lock speculated or retired instructions accepted into the fill buffer.	all	0x01: No unit mask
OFFCORE_REQUESTS_OUTSTANDING	Counts weighted cycles of offcore requests.	all	0x01: (name=read_data) Counts weighted cycles of offcore demand data read requests 0x02: (name=read_code) Counts weighted cycles of offcore demand code read requests 0x04: (name=rfo) Counts weighted cycles of offcore demand RFO requests 0x08: (name=read) Counts weighted cycles of offcore read requests of any kind
CACHE_LOCK_CYCLES	Cycle count during which the L1/L2 caches are locked. A lock is asserted when there is a locked memory access, due to uncacheable memory, a locked operation that spans two cache lines, or a page walk from an uncacheable page table.	0, 1	0x01: (name=l1d_l2) Cycle count during which the L1D and L2 are locked 0x02: (name=l1d) Counts the number of cycles that cacheline in the L1 data cache unit is locked
IO_TRANSACTIONS	Counts the number of completed I/O transactions.	all	0x01: No unit mask
L1I	Counts L1i instruction cache accesses.	all	0x01: (name=hits) Counts all instruction fetches that hit the L1 instruction cache 0x02: (name=misses) Counts all instruction fetches that miss the L1I cache 0x03: (name=reads) Counts all instruction fetches, including uncacheable fetches that bypass the L1I 0x04: (name=cycles_stalled) Cycle counts for which an instruction fetch stalls due to a L1I cache miss, ITLB miss or ITLB fault
IFU_IVC	Instruction Fetch unit events	all	0x01: (name=full) Instruction Fetch unit victim cache full 0x02: (name=l1i_eviction) L1 Instruction cache evictions
LARGE_ITLB	Counts number of large ITLB accesses	all	0x01: (name=hit) Counts number of large ITLB hits
L1I_OPPORTUNISTIC_HITS	Opportunistic hits in streaming.	all	0x01: No unit mask
ITLB_MISSES	Counts the number of ITLB misses in various variants	all	0x01: (name=any) Counts the number of misses in all levels of the ITLB which causes a page walk 0x02: (name=walk_completed) Counts number of misses in all levels of the ITLB which resulted in a completed page walk 0x04: (name=walk_cycles) Counts ITLB miss page walk cycles 0x04: (name=pmh_busy_cycles) Counts PMH busy cycles 0x10: (name=stlb_hit) Counts the number of ITLB misses that hit in the second level TLB 0x20: (name=pde_miss) Number of ITLB misses where the low part of the linear to physical address translation was missed 0x40: (name=pdp_miss) Number of ITLB misses where the high part of the linear to physical address translation was missed 0x80: (name=large_walk_completed) Counts number of completed large page walks due to misses in the STLB
ILD_STALL	Cycles Instruction Length Decoder stalls	all	0x01: (name=lcp) Cycles Instruction Length Decoder stalls due to length changing prefixes: 66, 67 or REX 0x02: (name=mru) Instruction Length Decoder stall cycles due to Branch Prediction Unit (BPU) Most Recently Used (MRU) bypass 0x04: (name=iq_full) Stall cycles due to a full instruction queue 0x08: (name=regen) Counts the number of regen stalls 0x0f: any Counts any cycles the Instruction Length Decoder is stalled
BR_INST_EXEC	Counts the number of near branch instructions executed, but not necessarily retired.	all	0x01: (name=cond) Counts the number of conditional near branch instructions executed, but not necessarily retired 0x02: (name=direct) Counts all unconditional near branch instructions excluding calls and indirect branches 0x04: (name=indirect_non_call) Counts the number of executed indirect near branch instructions that are not calls 0x07: (name=non_calls) Counts all non call near branch instructions executed, but not necessarily retired 0x08: (name=return_near) Counts indirect near branches that have a return mnemonic 0x10: (name=direct_near_call) Counts unconditional near call branch instructions, excluding non call branch, executed 0x20: (name=indirect_near_call) Counts indirect near calls, including both register and memory indirect, executed 0x30: (name=near_calls) Counts all near call branches executed, but not necessarily retired 0x40: (name=taken) Counts taken near branches executed, but not necessarily retired 0x7f: any Counts all near executed branches (not necessarily retired)
BR_MISP_EXEC	Counts the number of mispredicted conditional near branch instructions executed, but not necessarily retired.	all	0x01: (name=cond) Counts the number of mispredicted conditional near branch instructions executed, but not necessarily retired 0x02: (name=direct) Counts mispredicted macro unconditional near branch instructions, excluding calls and indirect branches (should always be 0) 0x04: (name=indirect_non_call) Counts the number of executed mispredicted indirect near branch instructions that are not calls 0x07: (name=non_calls) Counts mispredicted non call near branches executed, but not necessarily retired 0x08: (name=return_near) Counts mispredicted indirect branches that have a near return mnemonic 0x10: (name=direct_near_call) Counts mispredicted non-indirect near calls executed, (should always be 0) 0x20: (name=indirect_near_call) Counts mispredicted indirect near calls executed, including both register and memory indirect 0x30: (name=near_calls) Counts all mispredicted near call branches executed, but not necessarily retired 0x40: (name=taken) Counts executed mispredicted near branches that are taken, but not necessarily retired 0x7f: any Counts the number of mispredicted near branch instructions that were executed, but not necessarily retired
RESOURCE_STALLS	Counts the number of Allocator resource related stalls. Includes register renaming buffer entries, memory buffer entries. In addition to resource related stalls, this event counts some other events. Includes stalls arising during branch misprediction recovery, such as if retirement of the mispredicted branch is delayed and stalls arising while store buffer is draining from synchronizing operations.	all	0x01: (name=any) Counts the number of Allocator resource related stalls 0x02: (name=load) Counts the cycles of stall due to lack of load buffer for load operation 0x04: (name=rs_full) This event counts the number of cycles when the number of instructions in the pipeline waiting for execution reaches the limit the processor can handle 0x08: (name=store) This event counts the number of cycles that a resource related stall will occur due to the number of store instructions reaching the limit of the pipeline 0x10: (name=rob_full) Counts the cycles of stall due to reorder buffer full 0x20: (name=fpcw) Counts the number of cycles while execution was stalled due to writing the floating-point unit (FPU) control word 0x40: (name=mxcsr) Stalls due to the MXCSR register rename occurring too close to a previous MXCSR rename 0x80: (name=other) Counts the number of cycles while execution was stalled due to other resource issues
MACRO_INSTS_FUSED	Counts the number of instructions decoded that are macro-fused but not necessarily executed or retired.	all	0x01: No unit mask
BACLEAR_FORCE_IQ	Counts number of times a BACLEAR was forced by the Instruction Queue. The IQ is also responsible for providing conditional branch prediction direction based on a static scheme and dynamic data provided by the L2 Branch Prediction Unit. If the conditional branch target is not found in the Target Array and the IQ predicts that the branch is taken, then the IQ will force the Branch Address Calculator to issue a BACLEAR. Each BACLEAR asserted by the BAC generates approximately an 8 cycle bubble in the instruction fetch pipeline.	all	0x01: No unit mask
LSD	Counts the number of micro-ops delivered by loop stream detector	all	0x01: No unit mask
ITLB_FLUSH	Counts the number of ITLB flushes	all	0x01: No unit mask
OFFCORE_REQUESTS	Counts number of offcore data requests.	all	0x01: (name=demand_read_data) Counts number of offcore demand data read requests 0x02: (name=demand_read_code) Counts number of offcore demand code read requests 0x04: (name=demand_rfo) Counts number of offcore demand RFO requests 0x08: (name=any_read) Counts number of offcore read requests 0x10: (name=any_rfo) Counts number of offcore RFO requests 0x20: (name=uncached_mem) Counts number of offcore uncached memory requests 0x40: (name=l1d_writeback) Counts number of L1D writebacks to the uncore 0x80: (name=any) Counts all offcore requests
UOPS_EXECUTED	Counts number of Uops executed that were issued on various ports	all	0x01: (name=port0) Counts number of Uops executed that were issued on port 0 0x02: (name=port1) Counts number of Uops executed that were issued on port 1 0x04: (name=port2_core) Counts number of Uops executed that were issued on port 2 0x08: (name=port3_core) Counts number of Uops executed that were issued on port 3 0x10: (name=port4_core) Counts number of Uops executed that were issued on port 4 0x20: (name=port5) Counts number of Uops executed that were issued on port 5 0x40: (name=port015) Counts number of Uops executed that were issued on port 0, 1, or 5 0x80: (name=port234) Counts number of Uops executed that were issued on port 2, 3, or 4
OFFCORE_REQUESTS_SQ_FULL	Counts number of cycles the SQ is full to handle off-core requests.	all	0x01: No unit mask
SNOOPQ_REQUESTS_OUTSTANDING	Counts weighted cycles of snoopq requests.	all	0x01: (name=data) Counts weighted cycles of snoopq requests for data 0x02: (name=invalidate) Counts weighted cycles of snoopq invalidate requests 0x04: (name=code) Counts weighted cycles of snoopq requests for code
OOF_CORE_RESPONSE_0	Off-core Response Performance Monitoring in the Processor Core. Requires special setup.	all	0x01: No unit mask
SNOOP_RESPONSE	Counts HIT snoop response sent by this thread in response to a snoop request.	all	0x01: (name=hit) Counts HIT snoop response sent by this thread in response to a snoop request 0x02: (name=hite) Counts HIT E snoop response sent by this thread in response to a snoop request 0x04: (name=hitm) Counts HIT M snoop response sent by this thread in response to a snoop request
PIC_ACCESSES	Counts number of TPR accesses	all	0x01: (name=tpr_reads) Counts number of TPR reads 0x02: (name=tpr_writes) Counts number of TPR writes
UOPS_RETIRED	Counts the number of micro-ops retired, (macro-fused=1, micro-fused=2, others=1; maximum count of 8 per cycle). Most instructions are composed of one or two microops. Some instructions are decoded into longer sequences such as repeat instructions, floating point transcendental instructions, and assists	all	0x01: (name=any) Counts the number of micro-ops retired, (macro-fused=1, micro-fused=2, others=1; maximum count of 8 per cycle) 0x02: (name=retire_slots) Counts the number of retirement slots used each cycle 0x04: (name=macro_fused) Counts number of macro-fused uops retired
MACHINE_CLEARS	Counts the cycles machine clear is asserted.	all	0x01: (name=cycles) Counts the cycles machine clear is asserted 0x02: (name=mem_order) Counts the number of machine clears due to memory order conflicts 0x04: (name=smc) Counts the number of times that a program writes to a code section 0x10: (name=fusion_assist) Counts the number of macro-fusion assists
SSEX_UOPS_RETIRED	Counts SIMD packed single-precision floating point Uops retired.	all	0x01: (name=packed_single) Counts SIMD packed single-precision floating point Uops retired 0x02: (name=scalar_single) Counts SIMD scalar single-precision floating point Uops retired 0x04: (name=packed_double) Counts SIMD packed double-precision floating point Uops retired 0x08: (name=scalar_double) Counts SIMD scalar double-precision floating point Uops retired 0x10: (name=vector_integer) Counts 128-bit SIMD vector integer Uops retired
ITLB_MISS_RETIRED	Counts the number of retired instructions that missed the ITLB when the instruction was fetched.	all	0x20: No unit mask
MEM_LOAD_RETIRED	Counts number of retired loads.	all	0x01: (name=l1d_hit) Counts number of retired loads that hit the L1 data cache 0x02: (name=l2_hit) Counts number of retired loads that hit the L2 data cache 0x04: (name=llc_unshared_hit) Counts number of retired loads that hit their own, unshared lines in the LLC cache 0x08: (name=other_core_l2_hit_hitm) Counts number of retired loads that hit in a sibling core's L2 (on die core) 0x10: (name=llc_miss) Counts number of retired loads that miss the LLC cache 0x40: (name=hit_lfb) Counts number of retired loads that miss the L1D and the address is located in an allocated line fill buffer and will soon be committed to cache 0x80: (name=dtlb_miss) Counts the number of retired loads that missed the DTLB
FP_MMX_TRANS	Counts transitions between MMX and x87 state.	all	0x01: (name=to_fp) Counts the first floating-point instruction following any MMX instruction 0x02: (name=to_mmx) Counts the first MMX instruction following a floating-point instruction 0x03: (name=any) Counts all transitions from floating point to MMX instructions and from MMX instructions to floating point instructions
MACRO_INSTS	Counts the number of instructions decoded, (but not necessarily executed or retired).	all	0x01: (name=decoded) Counts the number of instructions decoded, (but not necessarily executed or retired)
UOPS_DECODED	Counts the number of Uops decoded by various subsystems.	all	0x02: (name=ms) Counts the number of Uops decoded by the Microcode Sequencer, MS 0x04: (name=esp_folding) Counts number of stack pointer (ESP) instructions decoded: push , pop , call , ret, etc 0x08: (name=esp_sync) Counts number of stack pointer (ESP) sync operations where an ESP instruction is corrected by adding the ESP offset register to the current value of the ESP register
RAT_STALLS	Counts the number of cycles during which execution stalled due to several reasons	all	0x01: (name=flags) Counts the number of cycles during which execution stalled due to several reasons, one of which is a partial flag register stall 0x02: (name=registers) This event counts the number of cycles instruction execution latency became longer than the defined latency because the instruction used a register that was partially written by previous instruction 0x04: (name=rob_read_port) Counts the number of cycles when ROB read port stalls occurred, which did not allow new micro-ops to enter the out-of-order pipeline 0x08: (name=scoreboard) Counts the cycles where we stall due to microarchitecturally required serialization 0x0f: (name=any) Counts all Register Allocation Table stall cycles due to: Cycles when ROB read port stalls occurred, which did not allow new micro-ops to enter the execution pipe
SEG_RENAME_STALLS	Counts the number of stall cycles due to the lack of renaming resources for the ES, DS, FS, and GS segment registers. If a segment is renamed but not retired and a second update to the same segment occurs, a stall occurs in the front-end of the pipeline until the renamed segment retires.	all	0x01: No unit mask
ES_REG_RENAMES	Counts the number of times the ES segment register is renamed.	all	0x01: No unit mask
UOP_UNFUSION	Counts unfusion events due to floating point exception to a fused uop.	all	0x01: No unit mask
BR_INST_DECODED	Counts the number of branch instructions decoded.	all	0x01: No unit mask
BOGUS_BR	Counts the number of bogus branches.	all	0x01: No unit mask
BPU_MISSED_CALL_RET	Counts number of times the Branch Prediction Unit missed predicting a call or return branch.	all	0x01: No unit mask
BACLEAR	Counts the number of times the front end is resteered.	all	0x01: (name=clear) Counts the number of times the front end is resteered, mainly when the Branch Prediction Unit cannot provide a correct prediction and this is corrected by the Branch Address Calculator at the front end 0x02: (name=bad_target) Counts number of Branch Address Calculator clears (BACLEAR) asserted due to conditional branch instructions in which there was a target hit but the direction was wrong
BPU_CLEARS	Counts Branch Prediction Unit clears.	all	0x01: (name=early) Counts early (normal) Branch Prediction Unit clears: BPU predicted a taken branch after incorrectly assuming that it was not taken 0x02: (name=late) Counts late Branch Prediction Unit clears due to Most Recently Used conflicts 0x03: (name=any) Counts all BPU clears
L2_TRANSACTIONS	Counts L2 transactions	all	0x01: (name=load) Counts L2 load operations due to HW prefetch or demand loads 0x02: (name=rfo) Counts L2 RFO operations due to HW prefetch or demand RFOs 0x04: (name=ifetch) Counts L2 instruction fetch operations due to HW prefetch or demand ifetch 0x08: (name=prefetch) Counts L2 prefetch operations 0x10: (name=l1d_wb) Counts L1D writeback operations to the L2 0x20: (name=fill) Counts L2 cache line fill operations due to load, RFO, L1D writeback or prefetch 0x40: (name=wb) Counts L2 writeback operations to the LLC 0x80: (name=any) Counts all L2 cache operations
L2_LINES_IN	Counts the number of cache lines allocated in the L2 cache in various states.	all	0x02: (name=s_state) Counts the number of cache lines allocated in the L2 cache in the S (shared) state 0x04: (name=e_state) Counts the number of cache lines allocated in the L2 cache in the E (exclusive) state 0x07: (name=any) Counts the number of cache lines allocated in the L2 cache
L2_LINES_OUT	Counts L2 cache lines evicted.	all	0x01: (name=demand_clean) Counts L2 clean cache lines evicted by a demand request 0x02: (name=demand_dirty) Counts L2 dirty (modified) cache lines evicted by a demand request 0x04: (name=prefetch_clean) Counts L2 clean cache lines evicted by a prefetch request 0x08: (name=prefetch_dirty) Counts L2 modified cache lines evicted by a prefetch request 0x0f: (name=any) Counts all L2 cache lines evicted for any reason
L2_HW_PREFETCH	Counts L2 HW prefetcher events	all	0x01: (name=hit) Counts L2 HW prefetcher detector hits 0x02: (name=alloc) Counts L2 HW prefetcher allocations 0x04: (name=data_trigger) Counts L2 HW data prefetcher triggered 0x08: (name=code_trigger) Counts L2 HW code prefetcher triggered 0x10: (name=dca_trigger) Counts L2 HW DCA prefetcher triggered 0x20: (name=kick_start) Counts L2 HW prefetcher kick started
SQ_MISC	Counts events in the Super Queue below the L2.	all	0x01: (name=promotion) Counts the number of L2 secondary misses that hit the Super Queue 0x02: (name=promotion_post_go) Counts the number of L2 secondary misses during the Super Queue filling L2 0x04: (name=lru_hints) Counts number of Super Queue LRU hints sent to L3 0x08: (name=fill_dropped) Counts the number of SQ L2 fills dropped due to L2 busy 0x10: (name=split_lock) Counts the number of SQ lock splits across a cache line
SQ_FULL_STALL_CYCLES	Counts cycles the Super Queue is full. Neither of the threads on this core will be able to access the uncore.	all	0x01: No unit mask
FP_ASSIST	Counts the number of floating point operations executed that required micro-code assist intervention.	all	0x01: (name=all) Counts the number of floating point operations executed that required micro-code assist intervention 0x02: (name=output) Counts number of floating point micro-code assist when the output value (destination register) is invalid 0x04: (name=input) Counts number of floating point micro-code assist when the input value (one of the source operands to an FP instruction) is invalid
SEGMENT_REG_LOADS	Counts number of segment register loads	all	0x01: No unit mask
SIMD_INT_64	Counts number of SIMD integer 64 bit packed multiply operations.	all	0x01: (name=packed_mpy) Counts number of SIMD integer 64 bit packed multiply operations 0x02: (name=packed_shift) Counts number of SIMD integer 64 bit packed shift operations 0x04: (name=pack) Counts number of SIMD integer 64 bit pack operations 0x08: (name=unpack) Counts number of SIMD integer 64 bit unpack operations 0x10: (name=packed_logical) Counts number of SIMD integer 64 bit logical operations 0x20: (name=packed_arith) Counts number of SIMD integer 64 bit arithmetic operations 0x40: (name=shuffle_move) Counts number of SIMD integer 64 bit shift or move operations
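
For several events in the table, the "any" unit mask appears to be the bitwise OR of the individual unit-mask bits (for example FP_MMX_TRANS, BPU_CLEARS and L2_LINES_OUT). The following is a minimal Python sketch of that relationship, using values copied from the table. The dictionary layout and the helper name combine_masks are illustrative assumptions and are not part of any profiling tool. The pattern does not hold for every event (L2_TRANSACTIONS lists "any" as a separate 0x80 bit), so check the table entry before combining bits.

```python
# Unit-mask bits copied from the table above. The dict layout and the helper
# name `combine_masks` are hypothetical, chosen only for this illustration.
UNIT_MASKS = {
    "FP_MMX_TRANS": {"to_fp": 0x01, "to_mmx": 0x02},
    "BPU_CLEARS": {"early": 0x01, "late": 0x02},
    "L2_LINES_OUT": {
        "demand_clean": 0x01,
        "demand_dirty": 0x02,
        "prefetch_clean": 0x04,
        "prefetch_dirty": 0x08,
    },
}


def combine_masks(event, names):
    """OR together the named unit-mask bits of one event."""
    bits = UNIT_MASKS[event]
    mask = 0
    for name in names:
        mask |= bits[name]
    return mask


if __name__ == "__main__":
    # Combining every listed bit reproduces the "any" value for these events.
    for event, bits in UNIT_MASKS.items():
        print(f"{event}: 0x{combine_masks(event, bits):02x}")
```

Selecting a subset works the same way: combine_masks("L2_LINES_OUT", ["demand_clean", "demand_dirty"]) gives 0x03, which would count only demand evictions under the assumption that the bits combine as described.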

2020/07/20