Understanding Embedded Trace
Embedded Trace is an integral part of nearly all modern processors. This post summarizes what application engineers, test engineers and project managers should know about this powerful but still underused functional unit in order to test, optimize and debug a system efficiently.
Observing Embedded Processors
Monitoring embedded processors can be a difficult task. However, it is essential both during the development process and during deployment:
- Functional Testing: e.g., measure and validate timing behavior;
- Structural Testing: e.g., measure code coverage or data and control flow coupling;
- Debugging: pin down the root cause of anomalies.
Software Instrumentation
One obvious method of observation, which has been in use for a long time, is software instrumentation. The application software is modified to produce the desired information, such as control flow indicators, via standard output channels. Unfortunately, this approach can have a massive impact on the timing behavior and the memory footprint of an application. In the field of embedded systems, this method has serious disadvantages, including:
- Due to the changed timing behavior, the significance of functional tests is very limited.
- Unintended phantom synchronization may be introduced to concurrent programs by the access arbitration to shared output channels.
- Safety-critical software already deployed in the field cannot simply be modified to produce debug information as this would risk functional degradation or the triggering of Heisenbugs.
- The common mitigation of observability constraints, namely first performing the coverage validation tests with software instrumentation and then repeating them without it, doubles the test effort. This is a major disadvantage, especially when limited and expensive test resources, such as HIL test benches, are involved.
Even though the software instrumentation approach is archaic, it is still widely used. The Embedded Trace presented in the following sections often also offers functions for hardware-supported software instrumentation. These will be presented in detail in one of our upcoming whitepapers.
The Motivation Behind Embedded Trace
In the early days, processors were slow (some 10 MHz) and only had external program memory. Eavesdropping on the memory bus was sufficient to observe the memory addresses of the instructions fetched by the CPU for execution. Following the program flow was thus rather trivial.

Since then, CPUs have become faster, and their architecture has changed:
- Instructions and data are cached near the CPU. Individual accesses are, therefore, no longer visible on the external bus system.
- Embedded devices in particular integrate more and more memory (RAM or Flash) internally to achieve faster access times and lower system cost. Accesses to such memory are not observable on the external bus system at all.
- To meet the demand for more integrated computing power and to reduce energy consumption, several CPUs are integrated into one processor. Whatever external effects remain observable must now also be attributed to the correct control flow among those running concurrently on truly parallel cores.
All these advances in processor architecture require a radically new approach to observing what is happening inside the processor: embedded trace.

Embedded trace is the integration of functional units that make the activity of the CPUs observable. However, there is a significant bandwidth problem: monitoring a single CPU comprehensively requires, at the very least, information about the executed instructions and the changing CPU registers. For a 32-bit CPU, a naive encoding would amount to a bandwidth of roughly:
[instruction address] + [changing CPU registers and flags] = 32 + ~48 = ~80 bits per CPU clock cycle
For a CPU clocked at 1 GHz, this implies 80 Gbit/s of trace data, which is far beyond an economically reasonable solution. The trace data stream must therefore be compressed.
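The estimate above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: the function name and its default widths (32 address bits, about 48 bits of register and flag changes) are assumptions taken from the naive encoding described here, not properties of any real trace unit.

```python
def naive_trace_bandwidth_bits_per_s(clock_hz, address_bits=32, state_bits=48):
    """Uncompressed trace bandwidth: bits emitted per cycle times cycles per second."""
    return (address_bits + state_bits) * clock_hz


# A single 32-bit CPU clocked at 1 GHz, one trace record per clock cycle.
bandwidth = naive_trace_bandwidth_bits_per_s(1_000_000_000)
print(f"{bandwidth / 1e9:.0f} Gbit/s")  # 80 Gbit/s
```

The result scales linearly with both the clock frequency and the number of CPUs, which is why an uncompressed output quickly becomes impractical on multicore devices.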
Embedded Trace Protocol
Actual trace implementations such as Arm® CoreSight™ Program Flow Trace™, Arm® CoreSight™ Embedded Trace Macrocell ETMv4.x, Intel® Processor Trace, and Nexus 5001 Forum™ must compress the transmitted control flow trace rigorously. They inject relevant OS-related information, such as context switches, into their output to facilitate a highly efficient context-aware compression. System observability can be improved further by also tracing the data flow. However, the resulting trace data is much harder to compress. This leads to significantly higher bandwidth requirements and, hence, more costly trace interfaces. Therefore, most implementations refrain from implementing this capability.

In the following, the most relevant trace message types are briefly introduced.
Control-Flow Messages
As already mentioned, the continuous output of the program counter alone would consume significant trace bandwidth. This is overcome (a) by assuming that the executed application is known to the observer and (b) by exploiting the default sequential execution of instructions therein.
This enables the following strategy for trace data compression:
- Synchronization messages are generated at longer intervals, e.g., every 1000 messages. Only they establish a concrete value of the program counter, which serves as the reference point for interpreting the subsequent trace data. The execution of the sequential code following this point is implied.
- Branch messages communicate actual control flow decisions. A single bit indicates whether or not a conditional branch instruction has been taken to leave the sequential execution path. Branches not taken imply the sequential continuation of the execution and do not require any further trace data. Neither do taken direct branches, as their fixed target can be inferred from the executed application binary. Only taken indirect branches trigger an additional message that enables the observer to reconstruct the dynamically computed branch target address.
- Unconditional branches are handled differently by different embedded trace architectures. Some treat them as waypoints in the control flow and produce execution bits just as they do for executed conditional direct branches (e.g., Arm® CoreSight™ Program Flow Trace™). Others omit this clearly implied control flow from the emitted trace altogether (e.g., Intel® Processor Trace).
- Exception messages are generated for externally induced control flow diversions such as interrupts. They typically indicate the nature of the exception and contain all information necessary to resume the control flow reconstruction.
This highly efficient compression results in an average of significantly less than one bit of trace data per executed instruction.
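The reconstruction principle can be illustrated with a small sketch. It assumes a toy program image with fixed 4-byte instructions and a simplified message format (`"sync"`, `"branch"`, `"indirect"`); these names, the `Insn` type and the `reconstruct` function are hypothetical and do not correspond to the packet format of any real trace protocol. Unconditional direct branches are followed without consuming trace data, as in the second variant described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Insn:
    kind: str                     # "seq", "cond", "jump" or "indirect"
    target: Optional[int] = None  # static target of direct branches


# Hypothetical program image known to the observer: address -> instruction.
PROGRAM = {
    0x100: Insn("seq"),
    0x104: Insn("cond", 0x110),
    0x108: Insn("seq"),
    0x10C: Insn("jump", 0x104),
    0x110: Insn("indirect"),
    0x200: Insn("seq"),
    0x204: Insn("seq"),
}


def reconstruct(program, messages, insn_size=4):
    """Replay trace messages against the program image; return executed addresses."""
    msgs = iter(messages)
    tag, pc = next(msgs)
    if tag != "sync":
        raise ValueError("trace must start with a synchronization message")
    executed = []
    # Simplification: the walk ends when the image ends or the trace runs dry.
    while pc in program:
        insn = program[pc]
        msg = None
        if insn.kind in ("cond", "indirect"):
            msg = next(msgs, None)  # only these instructions consume trace data
            if msg is None:
                break               # trace exhausted: outcome unknown, stop here
        executed.append(pc)
        if insn.kind == "seq":
            pc += insn_size         # sequential execution is implied
        elif insn.kind == "jump":
            pc = insn.target        # direct target is inferred from the binary
        elif insn.kind == "cond":
            tag, taken = msg
            if tag != "branch":
                raise ValueError("expected a branch message")
            pc = insn.target if taken else pc + insn_size
        else:
            tag, target = msg
            if tag != "indirect":
                raise ValueError("expected an indirect branch message")
            pc = target             # dynamic target is carried in the trace
    return executed


trace = [("sync", 0x100), ("branch", False), ("branch", True), ("indirect", 0x200)]
path = reconstruct(PROGRAM, trace)
print([hex(a) for a in path])
```

In this toy run, eight executed instructions are recovered from one synchronization message, two branch bits and one target address. A production decoder additionally has to handle periodic resynchronization, exceptions and variable-length instructions.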
Timing Messages
Depending on the embedded trace architecture, different message types conveying timing information may be available. In addition to wall-clock messages, cycle-count messages are of special importance. They indicate how many CPU clock cycles have elapsed since the last timing update. Observers can typically choose to receive cycle-count messages with each branch message, in programmable time intervals or not at all. This way, the significant trace bandwidth consumption by high-frequency cycle-count messages can be balanced against other desired trace quality properties.
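As a minimal sketch of how an observer might use cycle-count messages, the following snippet turns a sequence of cycle deltas into absolute times. The function name and the sample values are assumptions for illustration; real protocols encode the counts in protocol-specific packets.

```python
from itertools import accumulate


def timestamps_ns(cycle_deltas, clock_hz):
    """Convert cycle-count deltas into times (ns) relative to the first timing reference."""
    return [cycles * 1e9 / clock_hz for cycles in accumulate(cycle_deltas)]


# Three cycle-count messages received from a CPU running at 1 GHz.
print(timestamps_ns([100, 250, 50], 1_000_000_000))  # [100.0, 350.0, 400.0]
```

Because each message only carries the cycles elapsed since the previous update, a lost message shifts all later timestamps until the next absolute time reference is received.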
Data-Flow Messages
Data trace is difficult to compress. Depending on the trace architecture, the address of a data access, the transmitted value and the type of access can be communicated. In a 32-bit system, each access can result in a trace message of more than 60 bits in length. Due to its high bandwidth requirements, data trace is not available in all architectures and, where it is available, it is often restricted. For example, it may be constrained to designated address regions or to producing partial information such as the addresses of write accesses only.
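The effect of such restrictions on the trace volume can be estimated with a short sketch. The field widths below (32-bit address, 32-bit value, one type bit) and all names are assumptions chosen to match the 32-bit example in the text; actual message layouts differ between architectures.

```python
from dataclasses import dataclass

ADDRESS_BITS = 32  # assumed width of the access address
VALUE_BITS = 32    # assumed width of the transferred value
TYPE_BITS = 1      # assumed: one bit to tell reads from writes


@dataclass(frozen=True)
class Access:
    address: int
    value: int
    is_write: bool


def full_trace_bits(accesses):
    """Trace volume if every access is reported with address, value and type."""
    return len(accesses) * (ADDRESS_BITS + VALUE_BITS + TYPE_BITS)


def restricted_trace_bits(accesses, region_start, region_end):
    """Trace volume if only write addresses inside [region_start, region_end) are reported."""
    selected = [a for a in accesses
                if a.is_write and region_start <= a.address < region_end]
    return len(selected) * ADDRESS_BITS


accesses = [
    Access(0x2000_0000, 1, True),
    Access(0x2000_0004, 2, False),
    Access(0x4000_0000, 3, True),
    Access(0x2000_0008, 4, True),
]
print(full_trace_bits(accesses))                                  # 260
print(restricted_trace_bits(accesses, 0x2000_0000, 0x2000_1000))  # 64
```

In this made-up example, limiting the trace to write addresses within one region reduces the volume to roughly a quarter, at the price of losing the read accesses and all data values.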
Other Messages
There are a number of other trace message types. Some establish the trace context, for example by allowing the OS to communicate context switches. Others convey trace-specific information, for example by signaling overflows of internal trace buffers.
Trace Data Processing
Offline Capture
There are several options for processing trace data.

Option 1 captures the trace data stream in system memory. This solution is often used in desktop environments. However, it has a significant impact on system behavior and allows only short-term observations limited by memory capacity.
To prevent behavioral feedback to the system under observation, trace data can also be captured by an external trace tool via a designated trace interface. Traditionally, this trace tool is essentially a large memory buffer that collects the trace data for later offline processing on a PC (Option 2).

The approach of buffering the trace data (Options 1 and 2) has two decisive disadvantages:
- The observation time is always strictly limited by the buffer size. Depending on the trace bandwidth, this typically allows trace snapshots of a few milliseconds or, at most, seconds. As a result, analyses of long-running integration and system tests, statistically significant measurements of worst-case execution times, and searches for complex non-deterministic errors cannot be scaled to longer time frames and are only possible to a limited extent.
- The later offline processing of the recorded trace data leads to a long latency between the observation and the availability of results. On the one hand, this is an inconvenience for the engineers involved. On the other hand, this precludes any innovative exploitation of the gained system observability that would require a prompt reaction.
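To put the first limitation into numbers, the following sketch computes how long a trace buffer lasts at a constant trace bandwidth. The function name and the example values (1 GiB of trace memory, 8 Gbit/s trace stream) are assumptions for illustration only.

```python
def observation_time_s(buffer_bytes, trace_bits_per_s):
    """Time until a buffer of the given size is full at a constant trace bandwidth."""
    return buffer_bytes * 8 / trace_bits_per_s


# Assumed example: 1 GiB of trace memory filled by an 8 Gbit/s trace stream.
print(f"{observation_time_s(2**30, 8e9):.2f} s")  # 1.07 s
```

Doubling the buffer only doubles the observation time, so buffering alone cannot scale to test runs that last minutes or hours.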
Online Analysis
Our latest innovations now enable the live processing of processor trace data. A large buffer memory decoupling the downstream trace processing is no longer necessary.

To be processed on the fly, the highly compressed trace data stream must be decompressed and the control flow executed by the CPU(s) must be reconstructed. This demanding computation must often cope with the execution traces from multiple fast CPUs that are running at nominal clock speeds above 1 GHz.
This decoding may be further complicated by additional abstractions and indirections introduced by the operating system in use. The reconstructed control flow must then be abstracted into an event stream that is suitable for driving the desired backend task.
For example, (a) branch information may be recorded for a coverage analysis or (b) dynamic properties over the event stream may be computed and validated against a temporal logic specification.
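A backend of type (a) can be sketched as a consumer that processes each event once and keeps only a small summary, so that no trace buffer is needed. The class, the generator standing in for the decoded live trace, and the event format `(branch address, taken)` are hypothetical and only illustrate the idea.

```python
from collections import defaultdict


class BranchCoverageMonitor:
    """Accumulates which outcomes of each conditional branch have been observed."""

    def __init__(self):
        self._outcomes = defaultdict(set)

    def observe(self, branch_address, taken):
        self._outcomes[branch_address].add(bool(taken))

    def fully_covered(self):
        """Branches observed both taken and not taken."""
        return sorted(a for a, seen in self._outcomes.items() if len(seen) == 2)

    def partially_covered(self):
        """Branches observed with only one of the two outcomes."""
        return sorted(a for a, seen in self._outcomes.items() if len(seen) == 1)


def event_stream():
    """Stand-in for the decoded live trace: yields (branch address, taken) events."""
    yield from [(0x104, False), (0x104, True), (0x120, True), (0x120, True)]


monitor = BranchCoverageMonitor()
for address, taken in event_stream():
    monitor.observe(address, taken)  # each event is consumed once and then dropped

print([hex(a) for a in monitor.fully_covered()])      # ['0x104']
print([hex(a) for a in monitor.partially_covered()])  # ['0x120']
```

The memory used by such a monitor grows with the number of distinct branches in the program, not with the length of the observation, which is what makes arbitrarily long measurement runs feasible.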

A comparison between the embedded trace monitoring techniques is given below.

The fundamental ability to observe a system for an arbitrarily long time span, combined with the extremely short latency until results are available, leads to a new quality of non-intrusive observability of embedded processors.
Physical Trace Interfaces
Trace data can be output either via standard interfaces (e.g., PCIe) or via dedicated interfaces. Several industry standards for dedicated trace interfaces exist.

Summary
Embedded Trace, an integral part of almost all modern processors, is the key component for the non-intrusive and continuous monitoring of processors, especially in safety-critical embedded systems. Embedded Trace enables instruction-accurate control flow reconstruction and non-intrusive monitoring of operating system activity. Optionally, exact timing information with nanosecond resolution can be extracted and data accesses can be observed. Therefore, Embedded Trace is crucial for testing, performance optimization and efficient debugging of embedded systems. Responsible project planning should ensure that the trace interfaces are accessible as early as the requirements specification.


© 2025 Accemic GmbH. All rights reserved.