JEP 279: Improve Test-Failure Troubleshooting

Owner	Igor Ignatyev
Type	Feature
Scope	Implementation
Status	Closed / Delivered
Release	9
Discussion	hotspot dash dev at openjdk dot java dot net, core dash libs dash dev at openjdk dot java dot net
Effort	XS
Duration	XS
Relates to	JEP 228: Add More Diagnostic Commands
	JEP 102: Process API Updates
Reviewed by	Alexandre Iline, Brian Goetz
Endorsed by	Mikael Vidstedt
Created	2015/03/20 17:09
Updated	2024/05/20 14:54
Issue	8075621

Summary

Automatically collect diagnostic information which can be used for further troubleshooting in case of test failures and timeouts.

Goals

Gather the following information to help diagnose test failures and timeouts:

For Java processes which are still running on a host after test failure or timeout:
- C and Java stacks
- Core dumps (minidumps on Windows)
- Heap statistics
Environment information:
- Running processes
- CPU and I/O loads
- Open files and sockets
- Free disk space and memory
- Most recent system messages and events

We will develop a library that provides this functionality and co-locate the library sources with the product code.

Motivation

It is difficult to troubleshoot intermittent test failures when there is no information about the testing environment. Such failures often depend on test execution order and concurrency, which makes them extremely difficult to reproduce.

Description

Currently, there are two extension points in the jtreg test harness. The first one is the timeout handler, which jtreg runs when a test times out. The second one is the observer, which implements the observer design pattern to track different events in a test run. We will use these extension points to gather diagnostic information and develop a custom observer and timeout handler for jtreg.

Information about environment and non-Java processes will be collected by running platform-specific commands. Gathering information about Java processes will be done via available diagnostic commands which are heavily extended by JEP 228, e.g., the print_vm_state command which collects information similar to hs_err files. The information gathered will be stored for later inspection together with test results. The observer will collect the information on finishedTest events when tests fail.
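As an illustration of how such a command might be run and its output stored, here is a minimal sketch. The class name CommandRunner, the method run, and the output file name are hypothetical and are not part of the library described here. The sketch uses only the standard ProcessBuilder API, and the usage in main assumes that the jcmd tool from the JDK is on the PATH. The command is given an allotted time and is forcibly terminated if it exceeds it.

```java
import java.io.File;
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not the actual library code.
final class CommandRunner {

    /**
     * Runs a command, writing its stdout and stderr to the given file.
     * Returns true if the command exited within the allotted time,
     * false if it had to be terminated.
     */
    static boolean run(List<String> command, File output, long timeout, TimeUnit unit)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .redirectErrorStream(true)   // merge stderr into stdout
                .redirectOutput(output)      // store the output for later inspection
                .start();
        if (process.waitFor(timeout, unit)) {
            return true;
        }
        // The command exceeded its allotted time: terminate it and move on.
        process.destroyForcibly();
        process.waitFor(1, TimeUnit.SECONDS);
        return false;
    }

    public static void main(String[] args) throws Exception {
        // Target pid: first argument if given, otherwise this JVM (assumption for the demo).
        String pid = args.length > 0 ? args[0] : Long.toString(ProcessHandle.current().pid());
        // Thread.print is an existing diagnostic command that dumps Java thread stacks.
        boolean finished = run(List.of("jcmd", pid, "Thread.print"),
                               new File("threads.txt"), 10, TimeUnit.SECONDS);
        System.out.println(finished ? "collected" : "timed out");
    }
}
```

The same helper could be pointed at platform-specific tools for environment information; which tools to run on which platform is left open here.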

Since tests may create other processes, information about test processes and their child processes will be collected. To find such processes, the library will create a process tree with the original test process at the root.
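A process tree of this kind could be built roughly as follows. This is only a sketch: it assumes the ProcessHandle API introduced by JEP 102 is available in the JDK that runs the code, and the class name ProcessTree is hypothetical. It returns the root process followed by all of its live descendants.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not the actual library code.
final class ProcessTree {

    /** Returns the root process followed by all of its currently live descendants. */
    static List<ProcessHandle> of(ProcessHandle root) {
        return Stream.concat(Stream.of(root), root.descendants())
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // For the demo, treat the current JVM as the "test process".
        for (ProcessHandle p : of(ProcessHandle.current())) {
            System.out.println(p.pid() + " " + p.info().command().orElse("<unknown>"));
        }
    }
}
```

The result is a snapshot: processes may start or exit between taking the snapshot and collecting information from them, so callers should tolerate handles that are no longer alive.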

Library sources will be placed in the test directory in the top-level repository, and makefiles will be updated to build them and bundle them as a part of test bundles.

Testing

We will schedule regular testing which uses this library. When the results and test execution become stable, we will extend the use of the library to other components.

Risks and Assumptions

- Execution of some commands can hang: To minimize this risk, each command will be executed only for an allotted time and interrupted after that.
- Running out of disk space on a host: The plan is to archive information, restrict the amount of saved information, and check free disk space before information collection.
- Tools unavailable on a platform or host: If a tool is not available on a particular host or platform, the commands which depend on the missing tool will be skipped and a warning message will be added to the log file. Another possible solution is to download required tools from a known tools repository.
- System resource exhaustion: Some failures can cause exhaustion of different types of system resources (CPU, memory, disk space, etc.) or be caused by a lack of resources. Since it won't be possible to run commands to gather information in these situations, command execution will be skipped to prevent further system degradation.
- Getting process trees in Java: Getting the process tree in Java requires the new process API described in JEP 102. Using the JDK under test as the stable JDK (i.e., the JDK which runs the jtreg test harness) may interfere with test results. To mitigate this, we will develop an alternative process-tree implementation. That implementation will also simplify backporting this project to JDK 8.
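
Two of the mitigations above, checking free disk space before collection and skipping commands whose tool is missing, can be sketched as simple pre-flight checks. The class name Preflight, its methods, and the 100 MB threshold are hypothetical, chosen only for illustration. The tool lookup is deliberately simplistic: it scans a PATH-style string and does not handle Windows executable extensions such as .exe.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical helper, not the actual library code.
final class Preflight {

    /** True if the partition holding dir has at least minBytes of usable space. */
    static boolean hasFreeSpace(Path dir, long minBytes) {
        return dir.toFile().getUsableSpace() >= minBytes;
    }

    /** Searches a PATH-style string for an executable file with the given name. */
    static Optional<Path> findTool(String name, String pathVar) {
        if (pathVar == null) {
            return Optional.empty();
        }
        for (String dir : pathVar.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            Path candidate = Paths.get(dir, name);
            if (Files.isRegularFile(candidate) && Files.isExecutable(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Path outputDir = Paths.get(".");
        long minBytes = 100L * 1024 * 1024; // illustrative threshold, not from the JEP
        if (!hasFreeSpace(outputDir, minBytes)) {
            System.err.println("WARNING: not enough free disk space, skipping collection");
            return;
        }
        String tool = args.length > 0 ? args[0] : "jcmd";
        if (findTool(tool, System.getenv("PATH")).isPresent()) {
            System.out.println(tool + " found, commands using it can run");
        } else {
            System.err.println("WARNING: " + tool + " not found, skipping commands that need it");
        }
    }
}
```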