
Understanding Embedded Trace

Embedded Trace is an integral part of nearly all modern processors. This post summarizes the essential facts about this powerful yet far too seldom used functional unit that application engineers, test engineers and project managers should know in order to test, optimize and debug a system efficiently.

A Few Words About Embedded Processor Observation

Monitoring embedded processors can be a difficult task. However, it is essential both during the development process and during deployment:

  • Functional Testing: e.g., measure and validate timing behavior;
  • Structural Testing: e.g., measure code coverage or data and control flow coupling;
  • Debugging: pin down the root cause of anomalies.

Software Instrumentation

One obvious method of observation, which has been in use for a long time, is software instrumentation. The application software is modified to produce the desired information, such as control flow indicators, via standard output channels. Unfortunately, this approach can have a massive impact on the timing behavior and the memory footprint of an application. In the field of embedded systems, this method has serious disadvantages, including:

  • Due to the changed timing behavior, the significance of functional tests is very limited.
  • Unintended phantom synchronization may be introduced to concurrent programs by the access arbitration to shared output channels.
  • Safety-critical software already deployed in the field cannot simply be modified to produce debug information as this would risk functional degradation or the triggering of Heisenbugs.
  • The common mitigation of observability constraints, first performing transparent coverage validation tests with software instrumentation and then repeating them without instrumentation, doubles the test effort. This is a major disadvantage, especially when limited and expensive test resources, such as HIL test benches, are involved.

Even though the software instrumentation approach is archaic, it is still widely used. The Embedded Trace presented in the next chapters often also offers functions for hardware-supported software instrumentation. These will be presented in detail in one of our following whitepapers.

The Motivation Behind Embedded Trace

In the olden days, processors were slow (some 10 MHz) and only had external program memory. Eavesdropping on the memory bus was sufficient to observe the memory addresses of the instructions fetched by the CPU for execution. Following the program flow was, thus, rather trivial.

The olden days – all CPU activity is visible on the external bus.

CPUs have become faster, and processor architectures have changed:

  • Instructions and data are cached near the CPU. Individual accesses are, therefore, no longer visible on the external bus system.
  • Particularly embedded devices integrate more and more memory (RAM or Flash) internally so as to achieve faster access times and lower the system cost. Accesses to such memory are not at all observable on the external bus system.
  • To meet the demand for more and more integrated computing power and to reduce energy consumption, several CPUs are integrated into one processor. Whatever external effects remain observable now also pose the challenge of attributing them to the correct concurrent control flow among those scheduled across truly parallel cores.

All these advances in processor architecture require a radically new approach to observing what is happening inside the processor: embedded trace.

State-of-the-Art: Embedded Trace

Embedded trace is the integration of functional units that make the activity of the CPUs observable. However, there is a significant bandwidth problem: Monitoring a single CPU comprehensively requires, at a minimum, information about the executed instructions and the changing CPU registers. For a 32-bit CPU, a naive encoding would amount to a bandwidth of roughly:

[instruction address] + [changing CPU registers and flags] = 32 + ~48 ≈ 80 bits / CPU clock cycle

If the CPU is now clocked at 1 GHz, this will imply 80 Gbit/s of trace data. This is far beyond an economically reasonable solution. So, the trace data stream must be compressed.

Embedded Trace Protocol

Actual trace implementations such as Arm® CoreSight™ Program Flow Trace™, Arm® CoreSight™ Embedded Trace Macrocell ETMv4.x, Intel® Processor Trace, and the Nexus 5001™ standard must compress the transmitted control flow trace rigorously. They inject relevant OS-related information, such as context switches, into their output to facilitate highly efficient context-aware compression. System observability can be boosted by also tracing the data flow. However, the resulting trace data is much harder to compress, which leads to significantly higher bandwidth requirements and, hence, more costly trace interfaces. Therefore, most implementations refrain from implementing this capability.

Most Relevant Trace Message Types

In the following, the most relevant trace message types are briefly introduced.

Looking for support on how embedded trace could help to solve your challenges? Our experts at Accemic Technologies are here to help. Feel free to contact us.

Control-Flow Messages

As already mentioned, the continuous output of the program counter alone would consume significant trace bandwidth. This is overcome (a) by assuming that the executed application is known to the observer and (b) by exploiting the default sequential execution of instructions therein.

This enables the following strategy for trace data compression:

Synchronization messages are generated at longer intervals, e.g., every 1,000 messages. They alone establish a concrete value of the program counter, which serves as the reference point for further trace data interpretation. The execution of the sequential code following this point is implied.

Branch messages communicate actual control flow decisions. A single bit indicates whether or not a conditional branch instruction has been taken to leave the sequential execution path. Branches not taken imply the sequential continuation of the execution; they do not require any further trace data. Neither do taken direct branches, as their fixed continuation target can be inferred from the executed application binary. Only taken indirect branches trigger the generation of an additional message that enables the observer to reconstruct the dynamically computed branch target address.

Unconditional branches are handled differently by the embedded trace architectures. Some choose to establish waypoints in the control flow and produce execution bits just as they do for executed conditional direct branches (e.g., Arm® CoreSight™ Program Flow Trace™). Others leave this clearly implied control flow out of the emitted trace altogether (e.g., Intel® Processor Trace).

Exception messages are generated for externally induced control flow diversions such as by interrupts. They typically provide a hint on the nature of the exception and contain all information necessary to resume the control flow reconstruction.

This highly efficient compression results in an average of significantly less than one bit of trace data per executed instruction.

Timing Messages

Depending on the embedded trace architecture, different message types conveying timing information may be available. In addition to wall-clock messages, cycle-count messages are of special importance. They indicate how many CPU clock cycles have elapsed since the last timing update. Observers can typically choose to receive cycle-count messages with each branch message, at programmable time intervals, or not at all. This way, the significant trace bandwidth consumption of high-frequency cycle-count messages can be balanced against other desired trace quality properties.

Data-Flow Messages

Data trace is difficult to compress. Depending on the trace architecture, the address of a data access, the transmitted value and the type of access can be communicated. In a 32-bit system, each access can result in a trace message of more than 60 bits in length. Due to its high bandwidth requirements, data trace is not available in all architectures and is often limited where it is offered. For example, it may be constrained to designated address regions or to producing partial information, such as the addresses of write accesses only.

Other Messages

There are a number of other trace message types. They establish the trace context, e.g., by allowing the OS to communicate context switches, and convey trace-specific information, e.g., for signaling internal trace buffer overflows.

Trace Data Processing

Offline Capture

There are several options for processing trace data.

Option 1: Buffering trace data within the processor system

Option 1 shows a solution that captures the trace data stream in system memory. This solution is often used in the desktop environment. However, it has a significant impact on system behavior and allows only short-term observations limited by memory capacity.

To prevent behavioral feedback to the system under observation, trace data can also be captured by an external trace tool via a designated trace interface. Traditionally, this trace tool is essentially a large memory buffer that collects the trace data for later offline processing on a PC (Option 2).

Option 2: Buffering trace data in trace tool, offline processing in PC

The approach of buffering the trace data (Option 1 & 2) has two decisive disadvantages:

  1. The observation time is always strictly limited by the buffer size. Depending on the trace bandwidth, this typically allows trace snapshots of a few milliseconds or, at most, seconds. As a consequence, analyses of long-running integration and system tests, statistically significant measurements of worst-case execution times, and searches for complex non-deterministic errors cannot be scaled to longer time frames and are thus only possible to a limited extent.
  2. The later offline processing of the recorded trace data leads to a long latency between the observation and the availability of results. On the one hand, this is an inconvenience for the engineers involved. On the other hand, this precludes any innovative exploitation of the gained system observability that would require a prompt reaction.

Online Analysis

Our latest innovations now enable the live processing of processor trace data. A large buffer memory decoupling the downstream trace processing is no longer necessary.

Option 3: Trace data online processing by trace tool

For processing on the fly, the highly compressed trace data stream must be decompressed, and the control flow executed by the CPU(s) must be reconstructed. This demanding computation must often cope with the execution traces of multiple fast CPUs running at nominal clock speeds above 1 GHz.
This decoding may be further challenged by additional abstractions and indirections introduced by the operating system in use. The reconstructed control flow must then be condensed into an apt event stream abstraction, suitable to drive any of various possible backend tasks.

For example, (a) branch information for a coverage analysis may be recorded or (b) dynamic properties over the event stream may be computed and validated against a temporal logic specification.

CEDARtools® trace data online processing tool with high-speed serial trace connection to an Infineon AURIX™ processor

A comparison between the embedded trace monitoring techniques is given below.

Comparison of embedded trace techniques (green: good).

It is obvious that the fundamental ability to observe a system for an arbitrarily long timespan, combined with the extremely short latency of result availability, leads to a new quality of non-intrusive observability of embedded processors.

Physical Trace Interfaces

Trace data can be output either via standard interfaces (e.g., PCIe) or via dedicated interfaces. Several industry standards for dedicated trace interfaces exist.

Comparison of parallel and high-speed serial trace interfaces. Short conclusion: Prefer the high-speed serial interface whenever possible.

Could you use expert guidance to select the ideal interface for your hardware design? Our specialists at Accemic Technologies are ready to assist—please don’t hesitate to reach out.

Summary

Embedded Trace, an integral part of almost all modern processors, is the key component for the non-intrusive and continuous monitoring of processors, especially in safety-critical embedded systems. Embedded Trace enables instruction-accurate control flow reconstruction and non-intrusive monitoring of operating system activity. Optionally, exact timing information with nanosecond resolution can be extracted and data accesses can be observed. Embedded Trace is therefore crucial for testing, performance optimization and efficient debugging of embedded systems. For responsible project planning, it is essential to ensure the accessibility of the trace interfaces as early as the creation of the requirements specification.

Embedded Trace – The Hidden Gem Inside Your Processor

You’ve probably heard of CoreSight™ Embedded Trace Macrocell (ETM), Intel® PT, or Nexus 5001™—they’re built-in hardware features that quietly record what your processor is up to. Unlike software logs or added profiling code, embedded trace, sometimes called hardware trace or processor trace, is generated in silicon. You don’t slow down your app, and you don’t need to sprinkle printf statements everywhere.

Still, many teams leave it unused—an unfortunate oversight, considering how much it could simplify debugging and embedded development. Let’s see what embedded trace is, why it’s often ignored, and all the cool stuff you can do with it.


So, What Exactly Is Embedded Trace?

Think of your CPU whispering its every move:

  • Program Counter: the address of each instruction as it runs.
  • Branches Taken: every jump, call, or return your code makes.
  • Memory Reads/Writes: data loads and stores.
  • Cycle Counts: how many clock cycles tick between instructions.

Tiny on-chip circuits capture all this. Gigahertz-speed cores churn out massive amounts of data, so the embedded trace IP compresses and packages it, then outputs it over a dedicated trace port (not to be confused with JTAG/DAP) or holds it in on-chip buffers at ~500 Mbps or more.


Where You’ll Find Trace Support

  • Intel® NUC & Laptops: Intel® PT is baked into modern Core and Atom CPUs.
  • Raspberry Pi 4: BCM2711 includes CoreSight™ ETM/PTM.
  • …and most Cortex®‑M/R/A, TriCore™, PowerPC®-based boards you see in industry.

Want to know the individual trace capabilities of your processor? Our experts at Accemic Technologies are here to help. Feel free to contact us.


Why You’ll Love Using It

  1. Profiling: Find the functions or tasks eating up cycles.
  2. Code Coverage: Show auditors you hit every branch without adding any extra code.
  3. Replay Debugging: Record once, then replay sporadic bugs afterwards.
  4. Real-Time Guarantees: Get cycle-accurate logs to estimate or prove your functions’ worst-case execution times.

Why Do People Skip It?

  • Mystery Feature: A lot of developers don’t even realize their board has trace built in.
  • Feels Overkill: Setting up trace drivers or fiddling with low-level registers seems daunting. However, user-friendly tooling exists.
  • Data Overload: The trace churns out tons of information, and on-chip decoders or FIFOs can’t always keep up—so you often only grab a few seconds before buffers fill up.

For short execution times, on-chip trace capture is easily accessible. To get hands-on with trace quickly, you can use the tracing features of your own x86 laptop and the built-in Linux support.

1. Create a file called “call_trace.c”

#include <stdio.h>

static void func3(void) {
    // deepest leaf
    for (volatile int i = 0; i < 1000000; i++);
}

static void func2(void) {
    func3();
}

static void func1(void) {
    func2();
    func3();
}

int main(void) {
    // calls func1, then func2, then func1 again
    func1();
    func2();
    func1();
    return 0;
}

2. Compile the example program.

gcc -g -O0 -o call_trace call_trace.c

3. Run & record trace.

sudo perf record -e intel_pt//u ./call_trace

4. View trace – for illustrative purposes a simple call-return trace.

sudo perf script --call-ret-trace --dsos=call_trace -i perf.data | grep -Ev 'psb offs:|cbr'
call_trace 12776 [005]  5007.781813585:   tr end  async       (/home/albert/perf-intelpt/call_trace)	_start                          
call_trace 12776 [005]  5007.781814835:   call                (/home/albert/perf-intelpt/call_trace)	    __libc_start_main           
call_trace 12776 [005]  5007.781816918:   call                (/home/albert/perf-intelpt/call_trace)	            _init               
call_trace 12776 [005]  5007.781817127:   return              (/home/albert/perf-intelpt/call_trace)	            _init               
call_trace 12776 [005]  5007.781817127:   call                (/home/albert/perf-intelpt/call_trace)	            frame_dummy         
call_trace 12776 [005]  5007.781817127:   return              (/home/albert/perf-intelpt/call_trace)	            register_tm_clones  
call_trace 12776 [005]  5007.781817127:   return              (/home/albert/perf-intelpt/call_trace)	        __libc_csu_init         
call_trace 12776 [005]  5007.781817335:   call                (/home/albert/perf-intelpt/call_trace)	            func1               
call_trace 12776 [005]  5007.781817335:   call                (/home/albert/perf-intelpt/call_trace)	                func2           
call_trace 12776 [005]  5007.781817335:   call                (/home/albert/perf-intelpt/call_trace)	                    func3       
call_trace 12776 [005]  5007.782483168:   return              (/home/albert/perf-intelpt/call_trace)	                    func3       
call_trace 12776 [005]  5007.782483168:   return              (/home/albert/perf-intelpt/call_trace)	                func2           
call_trace 12776 [005]  5007.782483168:   call                (/home/albert/perf-intelpt/call_trace)	                func3           
call_trace 12776 [005]  5007.782989627:   return              (/home/albert/perf-intelpt/call_trace)	                func3           
call_trace 12776 [005]  5007.782989627:   return              (/home/albert/perf-intelpt/call_trace)	            func1               
call_trace 12776 [005]  5007.782989627:   call                (/home/albert/perf-intelpt/call_trace)	            func2               
call_trace 12776 [005]  5007.782989627:   call                (/home/albert/perf-intelpt/call_trace)	                func3           
call_trace 12776 [005]  5007.783481293:   return              (/home/albert/perf-intelpt/call_trace)	                func3           
call_trace 12776 [005]  5007.783481293:   return              (/home/albert/perf-intelpt/call_trace)	            func2               
call_trace 12776 [005]  5007.783481293:   call                (/home/albert/perf-intelpt/call_trace)	            func1               
call_trace 12776 [005]  5007.783481293:   call                (/home/albert/perf-intelpt/call_trace)	                func2           
call_trace 12776 [005]  5007.783481293:   call                (/home/albert/perf-intelpt/call_trace)	                    func3       
call_trace 12776 [005]  5007.783926085:   return              (/home/albert/perf-intelpt/call_trace)	                    func3       
call_trace 12776 [005]  5007.783926085:   return              (/home/albert/perf-intelpt/call_trace)	                func2           
call_trace 12776 [005]  5007.783926085:   call                (/home/albert/perf-intelpt/call_trace)	                func3           
call_trace 12776 [005]  5007.784296085:   tr end  async       (/home/albert/perf-intelpt/call_trace)	                func3           
call_trace 12776 [005]  5007.784381293:   return              (/home/albert/perf-intelpt/call_trace)	                func3           
call_trace 12776 [005]  5007.784381293:   return              (/home/albert/perf-intelpt/call_trace)	            func1               
call_trace 12776 [005]  5007.784381293:   return              (/home/albert/perf-intelpt/call_trace)	        main

That’s it—perf now speaks silicon-native trace.


No-Compromise: High-End External Analysis

For rugged industrial gear and safety-certified systems—think ISO 26262 automotive, aerospace, or critical embedded applications (DO-178C, IEC 61508)—on-chip trace buffers just can’t keep pace. You get limited capture windows, and live tracing can even interfere with CPU timing. The solution? Stream trace off the chip for full-power analysis:

Unlimited Observation: Send raw trace off-board to record minutes (or hours) of execution.

Live Decoding with CEDARtools: CEDARtools grabs that off-chip feed and decodes it on the fly—even with high-end Cortex‑A72 silicon—so you see execution flow and hotspots in real time.

Infinite Coverage & WCET: Point CEDARtools.Coverage at your trace target for instant coverage results. Switch to CEDARtools.SmartTrace to nail down worst-case execution times you can bank on.

Speed Up Testing: Hook coverage results into Teamscale to slash system-test runs from days to minutes. Only tests touching changed code paths run, so you get fast feedback after every commit.


Ready to Unlock Embedded Trace?

If you want to explore how our CEDARtools-powered solutions can supercharge your system, our Accemic Technologies experts are here to help.

Everything you always wanted to know about Embedded Trace

After our colleagues had to explain again and again what “embedded trace” actually is and how essential this great feature is for the development, testing and debugging of embedded systems, we have summarised an overview for you in an article that was published on Feb 15, 2022 by IEEE. We thank our co-authors from Virginia Commonwealth University for the great cooperation on this article.