JEP draft: Optionally Record Thread Context in JFR

Author	Ludovic Henry
Owner	Jaroslav Bachorík
Type	Feature
Scope	JDK
Status	Draft
Component	hotspot / jfr
Effort	M
Duration	S
Created	2022/04/06 15:20
Updated	2023/09/25 14:54
Issue	8284453

Summary

When JFR generates an event, the contextual information it records is the stack trace, the thread ID, and the time. This JEP adds the ability to attach user-defined context to relevant events. This context supports user-driven analysis by letting users slice the large volume of data they get through JFR along dimensions that matter to their application. It also lets users control the generation of events more finely, based on the given context.

Goals

Contextualize events with custom, user-provided data
Filter events based on the context and user-defined filters

Non-Goals

Attaching arbitrary data to any event. Context entries are limited to a String key and a String value.

Motivation

Users often need to correlate the information received through JFR with specific actions in the application. For example, when a user's web application receives a request, the developer wants to attribute relevant execution or allocation samples to specific application endpoints. That is only possible if we can construct a context that records when a given endpoint is executing on any given thread. Then, given that context, we can reconstruct which events happened during the execution of the given endpoint.

In addition to helping analysis, the context also helps reduce the amount of data captured by restricting recording to the data generated in the context that matters to the user (for example, while a specific endpoint is executing).

Description

Content

The context is conceptually an immutable Map<String, String>. A newly created context extends the currently installed context: it contains the entries of that context plus the new ones.
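The "immutable map that extends its parent" idea can be sketched as follows. This is a minimal illustration, not the proposed implementation: the class ContextContentSketch and its methods are hypothetical names, and letting a new value override an inherited key is an assumption the draft does not state.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the context content; not a class from the draft.
final class ContextContentSketch {
    private final Map<String, String> entries;

    private ContextContentSketch(Map<String, String> entries) {
        // Map.copyOf returns an unmodifiable copy, so the content is frozen.
        this.entries = Map.copyOf(entries);
    }

    static ContextContentSketch empty() {
        return new ContextContentSketch(Map.of());
    }

    // Returns a new context holding this context's entries plus one more.
    // The receiver is left untouched. Assumption: a repeated key is overridden.
    ContextContentSketch extend(String key, String value) {
        Map<String, String> merged = new HashMap<>(entries);
        merged.put(key, value);
        return new ContextContentSketch(merged);
    }

    Map<String, String> entries() {
        return entries;
    }
}
```

Extending a context never changes the parent, which is what allows the same parent to be shared safely by many threads.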

Lifecycle

The lifecycle of the context is as follows.

    +----------------+
    |    Creation    |
    +-------+--------+
            |
            v
    +-------+--------+    +----------------+
+-->+  Installation  +<-->+  Snapshotting  |
|   +-------+--------+    +----------------+
|           |
|           v
|   +-------+--------+
|   |   Sampling &   |
|   |    Filtering   |
|   +-------+--------+
|           |
|           v
|   +-------+--------+
+-->| Uninstallation |
    +-------+--------+
            |
            v
    +-------+--------+
    |  Destruction   |
    +-------+--------+

Creation

Creation happens only once per context, for example once for each new request to a web service. Such a context might contain the web service endpoint and an ID that uniquely identifies the request (as in distributed tracing).

Because the context is immutable and is created only once, we want to pay most of the cost of setting up the context eagerly. Creation therefore:

creates a final map of the context based on the content of the currently installed context and the user-provided keys and values.
saves this map in the JfrContextRepository and gets a unique ID back.
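The two creation steps above can be sketched in plain Java. The draft does not show the interface of JfrContextRepository, so ContextRepositorySketch, its method names, and the choice of 0 as a reserved "no context" ID are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for JfrContextRepository.
final class ContextRepositorySketch {
    // ID 0 is reserved here to mean "no context" (an assumption of this sketch).
    private final AtomicLong nextId = new AtomicLong(1);
    private final Map<Long, Map<String, String>> contexts = new ConcurrentHashMap<>();

    // Merges the installed context with the new entries, freezes the result,
    // stores it, and returns the ID that later stages use in place of the map.
    long create(Map<String, String> installed, Map<String, String> added) {
        Map<String, String> merged = new HashMap<>(installed);
        merged.putAll(added);
        long id = nextId.getAndIncrement();
        contexts.put(id, Map.copyOf(merged));
        return id;
    }

    Map<String, String> lookup(long id) {
        return contexts.get(id);
    }

    // Destruction: release the stored map once the context is no longer used.
    void destroy(long id) {
        contexts.remove(id);
    }
}
```

All map building happens in create, so installing the context later only needs to publish a single long.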

Installation/Uninstallation

Given the unique ID of the context in the JfrContextRepository, the context is installed and uninstalled by setting and unsetting a thread-local variable in Java and the VM. This keeps sampling fast and async-safe. Performance is essential because contexts can be installed and uninstalled millions of times per second, as in reactive applications using Netty.

When filters are installed, installation and uninstallation match the context against the filter to eagerly compute which events to sample. Again, this guarantees that sampling can be done cheaply and in an async-safe manner. The match is evaluated lazily, at installation rather than at creation, because the user may change the filter after the context has been created. The result of matching a context against the filter is cached because the computation is not guaranteed to be cheap when the context has many entries, there are many filters, or both.
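A rough Java-only sketch of this install-time work is shown below. Everything in it is hypothetical: the class InstalledContextSketch, the use of a Predicate as the filter, and the ID-keyed cache are illustrative choices. In the draft the thread-local is also mirrored in the VM so that a sampler can read it without running Java code, which this sketch does not model.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

// Hypothetical sketch of installation with an eagerly computed filter verdict.
final class InstalledContextSketch {
    static final long NONE = 0L; // assumed marker for "no context installed"

    // Per-thread state: the installed context ID and the precomputed verdict.
    private static final class State {
        long contextId = NONE;
        boolean sample = true;
    }

    private static final ThreadLocal<State> STATE = ThreadLocal.withInitial(State::new);

    // Caches the filter verdict per context ID. A real implementation would
    // have to invalidate this cache whenever the filter changes.
    private static final Map<Long, Boolean> MATCH_CACHE = new ConcurrentHashMap<>();

    static void install(long id, Map<String, String> entries,
                        Predicate<Map<String, String>> filter) {
        // The potentially expensive match runs here, not at sampling time.
        boolean verdict = MATCH_CACHE.computeIfAbsent(id, k -> filter.test(entries));
        State s = STATE.get();
        s.contextId = id;
        s.sample = verdict;
    }

    static void uninstall(Predicate<Map<String, String>> filter) {
        State s = STATE.get();
        s.contextId = NONE;
        s.sample = filter.test(Map.of()); // verdict for "no context"
    }

    // What sampling needs: two plain field reads, no map lookup, no matching.
    static long currentId() { return STATE.get().contextId; }
    static boolean shouldSample() { return STATE.get().sample; }
}
```

The design point is that the sampling side only reads values that installation has already computed.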

Snapshotting

Snapshotting is needed to propagate the context between threads or, more generally, between execution contexts. It consists of capturing a reference to the context so that the context can be installed later.

Sampling & Filtering

When a context is installed, it can be sampled or used to filter event generation. Sampling and filtering can happen on threads that are suspended at arbitrary points (for example, with Unix signals or Win32 SuspendThread/ResumeThread), so both must be async-safe. This rules out calling into Java or arbitrary VM code.

Destruction

Once a context is no longer used (when the request to the web service has finished, for example), it can safely be destroyed. Destruction cleans up managed and VM resources, such as the context's entry in the JfrContextRepository and its cached filter results.

API

Building and using contexts

To create, snapshot, install, uninstall, and destroy a context, the API is the following:

public final class RecordingContext implements AutoCloseable {
    // create
    public static class Builder {
        public Builder where(RecordingContextKey key, String value);
        public RecordingContext build();
    }
    public static Builder where(RecordingContextKey key, String value);

    // snapshot + install/uninstall
    public static class Snapshot {
        public Activation activate();
    }
    public static class Activation implements AutoCloseable {
        public void close();
    }
    public static Snapshot snapshot();
    public static <R> R callWithSnapshot(Callable<R> op, Snapshot s) throws Exception;
    public static void runWithSnapshot(Runnable op, Snapshot s);

    // close
    public void close();
}
public final class RecordingContextKey {
    public static RecordingContextKey forName(String name);
    public boolean isBound();
    public String name();
}

There is no user-accessible API to capture the context as this is done as part of the JFR event generation.

There is also no API to read or modify the values after creation. This prevents the mechanism from being abused to propagate arbitrary data that is unrelated to JFR.

Filtering

The API to build and install filters is the following:

public final class RecordingContextFilter {
    public static class Config {
        public static void setContextFilter(RecordingContextFilter filter);
        public static RecordingContextFilter contextFilter();
        public static RecordingContextFilter.Builder createFilter();
    }
    public static class Builder {
        public Builder forAllTypes(Consumer<PerTypeBuilder> callback);
        public Builder forType(EventType type, Consumer<PerTypeBuilder> callback);
        public RecordingContextFilter build();
    }
    public static class PerTypeBuilder {
        public PerTypeBuilder reset();
        public PerTypeBuilder hasContext();
        public PerTypeBuilder hasNoContext();
        public PerTypeBuilder hasKey(RecordingContextKey key);
        public PerTypeBuilder hasEntry(RecordingContextKey key, String value);
    }
}

Propagation

The context needs to propagate automatically to all platform and virtual threads, executor and thread pool tasks, and asynchronous tasks in general. The class library takes care of this for all internal APIs (ForkJoinPool, CompletableFuture, etc.).

For external libraries, we need an API that allows propagation to be done manually. For example, Netty implements a custom thread pool that doesn't use the ForkJoinPool. In such cases, the framework needs to propagate the context to the thread pool task execution itself. Instrumentation libraries like OpenTelemetry could also automatically propagate the context across appropriate boundaries.
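Manual propagation by a framework with its own thread pool could follow the usual capture-then-restore pattern, sketched here with plain Java. The class PropagationSketch and its ThreadLocal are hypothetical stand-ins for the installed context; a real integration would use the snapshot API proposed in this draft rather than a raw ThreadLocal.

```java
import java.util.concurrent.Executor;

// Hypothetical sketch of manual context propagation across an executor.
final class PropagationSketch {
    // Stand-in for the installed context ID; 0 means "no context" here.
    static final ThreadLocal<Long> CURRENT = ThreadLocal.withInitial(() -> 0L);

    // Wraps an executor so that each task runs under the context that was
    // installed on the submitting thread at submission time.
    static Executor propagating(Executor delegate) {
        return task -> {
            long snapshot = CURRENT.get();          // snapshot on the submitter
            delegate.execute(() -> {
                long previous = CURRENT.get();
                CURRENT.set(snapshot);              // install on the worker
                try {
                    task.run();
                } finally {
                    CURRENT.set(previous);          // uninstall, restore prior state
                }
            });
        };
    }
}
```

Restoring the previous value in the finally block matters for pooled threads, which would otherwise leak one task's context into the next task.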

Capture

Capture relies on a mechanism similar to the one used for stack traces. A context attribute is added to applicable events, which triggers the capture of the context when the event is committed.

For the native events ExecutionSampleEvent and NativeMethodSample, the capture must be async-safe and must therefore be done without transitioning to Java. Moreover, the cost of transitioning between Java and the VM is prohibitive, especially for latency-sensitive and native events.

Serialization

The name and value strings are stored in the constant pool, which allows them to be reused across the potentially many events that reference the context. Each event then refers to the context through a single long key into the JfrContextRepository.

For example, the ExecutionSampleEvent would now look like the following:

jdk.ExecutionSampleEvent {
  startTime = long
  sampledThread = long:threadId
  context = long:contextId
  stackTrace = long:stackTraceId
  state = long:threadStateId
}
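The space saving from pooling the name and value strings can be illustrated with a small sketch. It does not reproduce the JFR constant pool format: StringPoolSketch and the flat key/value ID encoding are hypothetical and only show why repeated strings are stored once.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical string pool; not the actual JFR constant pool.
final class StringPoolSketch {
    private final Map<String, Long> ids = new HashMap<>();
    private final List<String> strings = new ArrayList<>();

    // Returns the existing ID for s, or assigns the next free one.
    long intern(String s) {
        Long existing = ids.get(s);
        if (existing != null) {
            return existing;
        }
        long id = strings.size();
        strings.add(s);
        ids.put(s, id);
        return id;
    }

    // Encodes a context as alternating key and value IDs.
    long[] encode(Map<String, String> context) {
        long[] out = new long[context.size() * 2];
        int i = 0;
        for (Map.Entry<String, String> e : context.entrySet()) {
            out[i++] = intern(e.getKey());
            out[i++] = intern(e.getValue());
        }
        return out;
    }

    String resolve(long id) {
        return strings.get((int) id);
    }

    int size() {
        return strings.size();
    }
}
```

Two request contexts that share the same endpoint but differ in request ID add only one new string for the second request, and each event carries just the context ID.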

Risks and assumptions

Alternatives

We could generate transition events when a specific context starts and ends on each thread, but this has several downsides. In async-heavy codebases (like Akka or reactive programming in general), a single context starts and stops executing many times (on one or many threads). That would generate many of these transition events and bloat the JFR recording.

Moreover, these transition events would be generated even when neither the JVM nor the application emits any JFR event (execution sample, allocation sample, socket read/write, etc.) while the context is installed. In that case, two transition events are generated for no added value. We cannot avoid these events because we cannot know in advance whether any other JFR event will be generated, and once they are emitted we cannot retract them if it turns out that no other JFR event was generated.