Recording a Program's Execution

Learn about tracing for a single program and across multiple programs in a distributed system.

Tracing is a specialized use of logging that records information about a program’s execution for troubleshooting or diagnostic purposes.

Achieving tracing

We can achieve tracing in its simplest form by associating every request to the program with a unique identifier and recording logs for the most important operations of the program alongside the request identifier.

In this way, when we try to diagnose a specific customer issue, we can filter the logs down to a chronologically ordered list of the operations recorded under the associated request identifier. These logs summarize the operations the program executed and the steps it went through for that request.
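The idea can be sketched in a few lines of Python using the standard logging module. This is a minimal illustration, not a prescribed implementation: the handler name handle_request, the helper logs_for_request, the logged messages, and the in-memory log stream are all assumptions made for the example.

```python
import io
import logging
import uuid

# Capture log output in memory so the sketch is self-contained
# (a real program would write to a file or a log service).
log_stream = io.StringIO()
handler = logging.StreamHandler(log_stream)
handler.setFormatter(
    logging.Formatter("%(asctime)s request_id=%(request_id)s %(message)s")
)

logger = logging.getLogger("tracing_sketch")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False


def handle_request(payload):
    """Hypothetical handler: tags every log line with one unique request ID."""
    request_id = str(uuid.uuid4())
    log = logging.LoggerAdapter(logger, {"request_id": request_id})
    log.info("received payload=%s", payload)
    log.info("validated input")
    log.info("stored result")
    return request_id


def logs_for_request(request_id):
    """Return the log lines of one request, in the order they were written."""
    marker = f"request_id={request_id} "
    return [line for line in log_stream.getvalue().splitlines() if marker in line]


first = handle_request("order-1")
second = handle_request("order-2")
for line in logs_for_request(first):
    print(line)
```

Filtering on the identifier returns only the three operations of the first request, even though the log also contains entries for the second one.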

Collating traces from multiple programs

A distributed system typically serves each client request through multiple, different applications. As a result, one needs to collate traces from multiple programs to fully understand how a request was served and where something might have gone wrong.

Problem

Collating traces from multiple programs is not that simple because every application might be using its own request identifiers, and the applications are most likely processing multiple requests concurrently. This makes it harder to determine which of an application's internal requests, and therefore which log entries, correspond to a specific client request.

Solution

We can solve the above problem through the use of correlation identifiers.

Correlation identifier

A correlation identifier is a unique identifier that corresponds to a top-level client request. This identifier might be automatically generated, or some external system or manual process might provide it. It is then propagated through the various applications that are involved in serving this request. These applications can then include that correlation identifier in their logs along with their request identifiers. In this way, it is easier to identify all the operations across all applications corresponding to a specific client request by filtering their logs based on this correlation identifier.
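A minimal in-process sketch of this propagation follows. It simulates two applications as plain functions and passes the identifier in a header dictionary. The service names, the shared LOGS list standing in for a central log store, and the X-Correlation-ID header name are assumptions for illustration; the header name is a common convention rather than a standard.

```python
import uuid

LOGS = []  # stand-in for a central log store (assumption for this sketch)
CORRELATION_HEADER = "X-Correlation-ID"  # conventional name, not a standard


def log(service, correlation_id, request_id, message):
    """Record one entry carrying both the correlation ID and the local request ID."""
    LOGS.append({
        "service": service,
        "correlation_id": correlation_id,
        "request_id": request_id,
        "message": message,
    })


def service_b(headers):
    """Downstream application: reuses the propagated correlation ID."""
    correlation_id = headers[CORRELATION_HEADER]
    request_id = str(uuid.uuid4())  # service B's own request identifier
    log("B", correlation_id, request_id, "looked up inventory")


def service_a(headers):
    """Entry point: accepts an externally provided ID or generates a new one."""
    correlation_id = headers.get(CORRELATION_HEADER) or str(uuid.uuid4())
    request_id = str(uuid.uuid4())  # service A's own request identifier
    log("A", correlation_id, request_id, "received client request")
    service_b({CORRELATION_HEADER: correlation_id})  # propagate downstream
    log("A", correlation_id, request_id, "sent response")
    return correlation_id


def trace(correlation_id):
    """Collate the entries of all applications for one client request."""
    return [entry for entry in LOGS if entry["correlation_id"] == correlation_id]


cid_1 = service_a({})  # identifier generated automatically
cid_2 = service_a({CORRELATION_HEADER: "client-supplied-42"})  # provided externally
for entry in trace(cid_1):
    print(entry["service"], entry["request_id"], entry["message"])
```

Each application still logs its own request identifier, but filtering on the shared correlation identifier is what ties the entries of both applications back to one client request.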

Note: By incorporating timing data in this logging, one can use this technique to also retrieve performance data, such as the time spent on every operation.
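As a rough sketch of how timing data could be attached, the example below wraps each operation in a context manager that records its duration together with the correlation identifier. The timed helper, the SPANS list, the operation names, and the sleep calls that simulate work are all hypothetical.

```python
import time
from contextlib import contextmanager

SPANS = []  # stand-in for wherever timing records are stored (assumption)


@contextmanager
def timed(correlation_id, operation):
    """Record how long the wrapped block took, tagged with the correlation ID."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        SPANS.append({
            "correlation_id": correlation_id,
            "operation": operation,
            "duration_ms": elapsed_ms,
        })


def serve(correlation_id):
    """Hypothetical request: service A calls B and then C; sleeps simulate work."""
    with timed(correlation_id, "service A: total"):
        with timed(correlation_id, "service B: lookup"):
            time.sleep(0.02)
        with timed(correlation_id, "service C: render"):
            time.sleep(0.01)


serve("req-1")
for span in SPANS:
    print(f'{span["operation"]}: {span["duration_ms"]:.1f} ms')
```

Because the inner operations finish first, they are recorded before the enclosing one. Comparing the total against its parts shows how much latency each application contributed.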

The following illustration shows distributed tracing via correlation IDs and an example of a trace that shows latency contribution of every application:

In the above illustration, the colored horizontal bars represent the latency of each request, as indicated by their labels. For example, the green bar corresponds to service A's request, which took 870 milliseconds.

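There are several libraries and tools for implementing distributed tracing, such as OpenTracing (a vendor-neutral tracing API that has since been merged into the OpenTelemetry project) and Zipkin (a distributed tracing system).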
