JEP draft: Support Markdown in Documentation Comments

Author: jjg
Owner: Jonathan Gibbons
Type: Feature
Scope: JDK
Status: Closed / Withdrawn
Component: tools / javadoc(tool)
Discussion: javadoc dash dev at openjdk dot org
Effort: S
Duration: S
Created: 2023/01/10 18:24
Updated: 2023/02/20 23:55
Issue: 8299906

Summary

Support the use of Markdown syntax in documentation comments.

Goals

Non-Goals

Motivation

Markdown is a popular documentation format: plain text that is easy to read and easy to write, and that can easily be transformed to HTML. Documentation comments are typically not complicated structured documents, and for the constructs that typically appear in them, such as paragraphs, styled text, and lists, Markdown provides simpler forms than the corresponding forms in HTML. For those constructs that are not directly supported in Markdown, Markdown also allows the use of HTML. This makes it easier to read and write documentation comments in source code, while retaining the ability to generate the same sort of API documentation as before.

As an example of the use of Markdown in a documentation comment, consider the comment for java.lang.Object.hashCode. Using Markdown, this could be rewritten as follows:

/**md
 * Returns a hash code value for the object. This method is
 * supported for the benefit of hash tables such as those provided by
 * {@link java.util.HashMap}.
 *
 * The general contract of `hashCode` is:
 *
 * -   Whenever it is invoked on the same object more than once during
 *     an execution of a Java application, the `hashCode` method
 *     must consistently return the same integer, provided no information
 *     used in `equals` comparisons on the object is modified.
 *     This integer need not remain consistent from one execution of an
 *     application to another execution of the same application.
 * -   If two objects are equal according to the {@link
 *     #equals(Object) equals} method, then calling the
 *     `hashCode` method on each of the two objects must produce the
 *     same integer result.
 * -   It is _not_ required that if two objects are unequal
 *     according to the {@link #equals(Object) equals} method, then
 *     calling the `hashCode` method on each of the two objects
 *     must produce distinct integer results.  However, the programmer
 *     should be aware that producing distinct integer results for
 *     unequal objects may improve the performance of hash tables.
 *
 * @implSpec
 * As far as is reasonably practical, the `hashCode` method defined
 * by class `Object` returns distinct integers for distinct objects.
 *
 * @return  a hash code value for this object.
 * @see     java.lang.Object#equals(java.lang.Object)
 * @see     java.lang.System#identityHashCode
 */

(For the purpose of this example, cosmetic changes like reflowing the text are deliberately avoided, to aid any before and after comparison.)

In that example, note the following:

Description

Documentation comments are used to provide the specification of Java declarations in source code. They are represented by block comments, beginning with /**, and contain a description, followed by a series of block tags, which provide additional details. The description may contain rich text, such as plain text, HTML, entities, and inline tags. Some block tags and inline tags may also contain descriptive rich text.

Documentation comments can be processed by the standard doclet provided as part of the JDK javadoc tool, to generate HTML pages containing the documentation for an API. The comments may also be processed by other tools when browsing code, such as IDEs and the JDK jshell tool.

While it is possible to write HTML in Markdown, Markdown is not a superset of HTML, and so it is necessary to distinguish comments using Markdown from those that do not. Also, since the overall API for a library may be derived from multiple sources, not all of which may use the same format for their documentation comments, it is necessary to be able to support both formats while generating the documentation for any API, and even for any one class, especially when documentation may be inherited from a supertype.

To enable the use of Markdown in documentation comments, use the characters md immediately after the /** at the beginning of the comment, followed by a whitespace character, such as space or a newline. When so enabled, Markdown can be used anywhere that HTML can be used in the comment, such as in the main description, in block tags like @param and @return, or in inline tags like {@link}.
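The marker rule above can be sketched as a small check on the raw text of a comment. This is only an illustration of the rule as stated here, not code from javadoc; the class and method names are hypothetical.

```java
/** Hypothetical helper illustrating the marker rule; not part of any JDK API. */
final class MarkdownMarker {
    private static final String PREFIX = "/**md";

    /**
     * Returns true if the raw comment text begins with the opening delimiter,
     * immediately followed by "md" and then a whitespace character.
     */
    static boolean isMarkdownComment(String rawComment) {
        return rawComment.startsWith(PREFIX)
                && rawComment.length() > PREFIX.length()
                && Character.isWhitespace(rawComment.charAt(PREFIX.length()));
    }

    public static void main(String[] args) {
        System.out.println(isMarkdownComment("/**md\n * Uses `Markdown`.\n */")); // true
        System.out.println(isMarkdownComment("/** Uses <code>HTML</code>. */"));  // false
        System.out.println(isMarkdownComment("/**mdash is not a marker */"));     // false
    }
}
```

Requiring whitespace after the marker keeps an ordinary comment whose first word happens to begin with "md" from being treated as Markdown.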

Markdown content in a documentation comment must be written according to the specification for CommonMark.

Restrictions

There are some minor restrictions on the use of Markdown in documentation comments, arising from some syntactic conflicts when processing the comment.

Escape sequences are described in more detail in the section Escape Sequences in the latest Documentation Comment Specification.

Errors

While Markdown is a convenient way to write text that is easy to read, it only works well for text that is valid; it is less good when it comes to text that contains errors of one sort or another. Furthermore, it will not be possible to utilize the existing support in the doclet for recognizing bad HTML without building a Markdown parser into the doclet (or doclint) to understand the surrounding context, such as code spans and code blocks.

Markdown is also notorious for not reporting errors in its input, and may either pass bad input through to the output, to be caught by downstream validators, or may transform the bad input in a way that can only be caught by manually reading and checking the generated output.

For example, the following invalid HTML, with a misspelled tag, will be propagated to the output, without any warning or error message being produced, although the issue could be caught by a downstream validator, if one were run on the generated output:

In contrast, the following invalid HTML will be escaped into valid HTML, and will appear directly in the output when viewed in a browser:

Since the output will be valid HTML, it will not be detected by any downstream validator, even though it is presumably not what the author originally intended to appear.

Examples of errors are not limited to bad HTML. Generally, if text is not recognized as a valid Markdown construct, CommonMark treats it as literal text. Here is an example of a bad link, with a typo in the name of the reference:

Here is an [example][bad-link] of a bad link.

[badlink]: http://www.example.com

This is rendered as the following, because of the error in the reference:

Here is an [example][bad-link] of a bad link.

As is the case with many other errors, the output is valid HTML and can only be caught by proofreading the output, and not by any automated validator.

For reference, here is the intended output, seen when the typo is fixed:

Here is an <a href="http://www.example.com">example</a> of a bad link.
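An error like the mistyped reference above could, in principle, be caught by checking the source text rather than the output. The following is a minimal sketch of such a check, not a feature of javadoc, DocLint, or commonmark-java. It assumes a simplified regex view of full reference links and link reference definitions, ignores code spans and the collapsed and shortcut reference forms, and uses hypothetical class and method names.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical lint sketch: reports reference labels that have no definition. */
final class ReferenceLinkCheck {
    // Full reference link: [text][label]
    private static final Pattern USE =
            Pattern.compile("\\[[^\\]]+\\]\\[([^\\]]+)\\]");
    // Link reference definition at the start of a line: [label]: destination
    private static final Pattern DEF =
            Pattern.compile("(?m)^ {0,3}\\[([^\\]]+)\\]:\\s*\\S+");

    static Set<String> undefinedLabels(String markdown) {
        Set<String> defined = new LinkedHashSet<>();
        Matcher d = DEF.matcher(markdown);
        while (d.find()) {
            defined.add(normalize(d.group(1)));
        }
        Set<String> missing = new LinkedHashSet<>();
        Matcher u = USE.matcher(markdown);
        while (u.find()) {
            String label = normalize(u.group(1));
            if (!defined.contains(label)) {
                missing.add(label);
            }
        }
        return missing;
    }

    // Simplified label matching: trimmed and case-insensitive.
    private static String normalize(String label) {
        return label.trim().toLowerCase(Locale.ROOT);
    }
}
```

For the example above, `undefinedLabels` returns a set containing `bad-link`; once the definition is corrected to `[bad-link]: ...`, it returns an empty set. A real check would need a Markdown parser to avoid false matches inside code spans and code blocks.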

While these examples may seem trivial, regrettably there is a long history of errors like these leaking into API documentation, because authors often do not proofread the generated output for their documentation. Indeed, this was a motivation for introducing the DocLint feature in JEP 172, and having to disable it in Markdown comments will be a retrograde step.

Implementation

The design so far has been presented in terms of CommonMark and its specification, although reference has been made to using a third-party implementation of that specification.

The chosen implementation is commonmark-java, which is self-described as a Java library for parsing and rendering Markdown text according to the CommonMark specification (and some extensions).

Using the library can be as simple as the following:

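import org.commonmark.node.Node;
import org.commonmark.parser.Parser;
import org.commonmark.renderer.html.HtmlRenderer;

// Parse Markdown text into a document tree, then render the tree as HTML.
public String parseAndRender(String input) {
    Parser parser = Parser.builder().build();
    Node document = parser.parse(input);
    HtmlRenderer renderer = HtmlRenderer.builder().build();
    return renderer.render(document);
}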

Paragraphs

CommonMark defines that a simple run of characters will be treated as a paragraph, implying that the run of characters will be enclosed with <p> and </p> when rendered in HTML.

However, there are places in a documentation comment where there are effectively restrictions on the Markdown that can be used, because the corresponding HTML is restricted to be phrasing content. A prime example of this is a tag like

{@link element description}

which informally maps to

<a href="link/to/element">rendered-description</a>

In situations like this, care must be taken to ensure there are no inappropriate paragraph tags wrapping the original description.
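One way to picture the care required is a post-rendering step that removes the enclosing paragraph element when the rendered description is a single paragraph. The following is a minimal sketch under that assumption, operating on the rendered HTML as a string; it is not how javadoc or commonmark-java handles the issue, and the class and method names are hypothetical.

```java
/** Hypothetical helper for descriptions that must be phrasing content. */
final class InlineHtml {
    private static final String OPEN = "<p>";
    private static final String CLOSE = "</p>";

    /**
     * Removes one enclosing paragraph element if the HTML consists of exactly
     * one paragraph; otherwise returns the input unchanged.
     */
    static String unwrapSingleParagraph(String html) {
        String s = html.trim();
        boolean singleParagraph = s.startsWith(OPEN)
                && s.endsWith(CLOSE)
                && s.indexOf(OPEN, OPEN.length()) < 0; // no second paragraph
        if (singleParagraph) {
            return s.substring(OPEN.length(), s.length() - CLOSE.length());
        }
        return html;
    }
}
```

For example, `unwrapSingleParagraph("<p>the <code>equals</code> method</p>\n")` yields `the <code>equals</code> method`, which can be placed inside an `<a>` element. Input with more than one paragraph is returned unchanged, leaving the caller to decide how to report or handle it.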

Headings

Some Markdown implementations, including commonmark-java, have a feature (extension) to automatically generate ids for user-defined headings.

Starting in JDK 20, javadoc provides a similar ability (JDK-8289332) but necessarily using a different algorithm to generate the ids, to guard against the possibility of conflicts with other ids generated in other parts of the page.

If we are to maintain parity with the feature as defined for non-Markdown comments, we may want to define an extension that can be used to generate headings similar to those in non-Markdown comments.

Tables

CommonMark does not provide support for tables, and yet tables are one of the more onerous constructs to write in raw HTML. They are also reasonably common, with over 300 appearing in the documentation comments for the modules in OpenJDK. However, commonmark-java does provide an extension for tables, as used in GitHub Flavored Markdown, and while we should generally resist the use of arbitrary extensions, this may be one to consider.

Opinions differ whether it is better to use HTML or a Markdown extension for a table. While Markdown tables may be easier to read, they can be a lot more tedious to write, without appropriate support in authoring tools. Conversely, HTML tables may be easier to write, although there can be a lot of visual clutter mixed in with the information being presented in the table.

Note that the HTML output generated from Markdown tables typically does not follow Section 508 Accessibility Guidelines or other web accessibility guidelines. This may be an issue for generating documentation that is to be used in some environments.

Compiler Tree API

The documentation comments for any module, package, class or interface can be accessed using the Compiler Tree API, in the com.sun.source.* packages, in the jdk.compiler module.

Markdown content is represented by a new subtype of DocTree, called com.sun.source.doctree.RawTextTree, using a new kind, DocTree.Kind.MARKDOWN.

No other support for analyzing Markdown in more detail is provided.

Alternatives

Why Markdown?

An alternative to Markdown is AsciiDoctor. Each claims to have advantages over the other. One is more popular; the other has less need to use HTML. Both resort to providing extensions to support otherwise-missing features.

The use of an initial marker string /**md could be extended to support other languages, such as AsciiDoctor, by using additional, different marker strings. Allowing for such a future enhancement is one reason for keeping the integration with the underlying language shallow instead of deep.

Which Version of Markdown?

There are many "variants", or "flavors" of Markdown.

Initially released in 2004, the original version of Markdown did not specify the syntax unambiguously, and the implementation was "quite buggy". This led to divergent implementations, as different authors attempted to fix the issues, and to address some shortcomings of the specification, which in turn led to an effort at commonmark.org to develop a formal specification for Markdown, known as CommonMark. That specification is the basis for the Markdown constructs that can be used in documentation comments.

If we did not specify the use of CommonMark, and allowed the use of arbitrary other variants or flavors, that would lead to potential inconsistencies between different tools that might process documentation comments, or for documentation comments written by different authors, who might be using different variants or have different extensions enabled. API documentation is often derived from comments in different source files, or even different libraries, and so it would be difficult to guarantee consistent behavior for all authors.

Enabling Markdown

Markdown comments are indicated by beginning the comment with /**md so that it is possible to differentiate between Markdown comments and non-Markdown comments.

Pragma

Java has long eschewed the notion of specialized comments or other character sequences to define the syntactic features that may appear in the rest of the file. It would be a big change to head in that direction for identifying whether documentation comments are in Markdown format or not.

Annotations

The initial marker allows an author to opt in to using Markdown on each comment individually. That is certainly convenient when updating existing API documentation, but might be seen as onerous if all the documentation comments in a file use Markdown syntax. One way to mark the content of a source file as following a non-default convention could be to use annotations. However, there are a couple of reasons why this is not a good idea.

  1. Annotations require at least some amount of semantic analysis of a source file, to determine their value and whether they are applicable as intended. Having to do semantic analysis to determine the kind of syntax used in a documentation comment is highly undesirable and would complicate what might otherwise be simple tools for analysing documentation comments.

    One aspect of this is that the existing Compiler Tree API allows access to the parsed form of a comment.

  2. While common and effectively standard, documentation comments are not a feature of the Java programming language, and only superficially part of Java SE, in the Language Model API. Likewise, javadoc is a JDK tool, not a Java SE tool, and it would be strange to see annotations pertaining to the use of a JDK feature in Java SE APIs. More specifically, the annotation interface for any such annotation appearing in the APIs of the java.base module would itself have to be declared in the java.base module, because that module by definition has no dependencies on other modules.

For these reasons, annotations are not supported as a way to indicate the format of documentation comments in a file.

Command-Line and/or Configuration Options

If all the source files in a project adopt the use of Markdown syntax in their documentation comments, it could be possible to provide a command-line or other form of configuration option to specify the use of Markdown as a default, even if the initial marker string md is not present. Such a command-line option could be seen as similar to the --source option used by javac, javadoc and other tools to specify the version of the Java programming language used in the source files being read.

However, if the initial marker is not used and an author relies on a command-line or configuration option to specify the format, it will never be possible to process a mixture of source files, some using Markdown in documentation comments, some not, without having more complicated options to specify which files use which format.

In addition, any such mechanism would have to be supported in some form by all tools that want to read and analyze documentation comments. Here is a list of some of such tools:

For these reasons, support for command-line or other configuration options to indicate which files use Markdown format in documentation comments is not provided at this time, but may be reconsidered in the future.

Different Comment Syntax

Instead of adding a marker string at the beginning of the comment, an alternative would be to use a different form of comment, such as a stylized form of a series of line (//) comments. For example, we could extend the definition of a documentation comment to include a series of adjacent line comments beginning with ///, or maybe a series of adjacent line comments of which at least the first begins with ///. Note that unlike block comments, the presence of a marker at the beginning of each line (after any optional whitespace) cannot be avoided.

Such a change would require an update to the Java SE API used to access the content of a documentation comment.

It is left as an exercise for the reader to determine whether a series of stylized line comments is more or less visually intrusive than a block comment with a short initial marker string. However, using line comments would obviate the need for some of the escape sequences that may be required when using block comments.
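To make the line-comment alternative concrete, the following sketch collects the first run of adjacent lines beginning with /// and removes the marker from each. It is an illustration only: it works on source lines as plain strings rather than on compiler tokens, the handling of a single space after the marker is an assumption, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of reading a documentation comment written as /// lines. */
final class LineDocComments {
    /**
     * Returns the content of the first run of adjacent lines that begin
     * with the /// marker (after optional whitespace), marker removed.
     */
    static List<String> firstRun(List<String> sourceLines) {
        List<String> content = new ArrayList<>();
        for (String line : sourceLines) {
            String trimmed = line.stripLeading();
            if (trimmed.startsWith("///")) {
                String body = trimmed.substring(3);
                // Assumption: one space after the marker is a separator, not content.
                content.add(body.startsWith(" ") ? body.substring(1) : body);
            } else if (!content.isEmpty()) {
                break; // the run of adjacent marked lines has ended
            }
        }
        return content;
    }
}
```

Given the lines `/// Returns the size.`, `///`, `/// @return the size`, followed by a declaration, `firstRun` returns the three content lines `Returns the size.`, an empty string, and `@return the size`. Because each line carries its own marker, nothing in the content can prematurely end the comment, which is the point made above about escape sequences.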

When to Process Markdown?

Just as existing documentation comments can contain a mixture of HTML and javadoc tags, so too will we want to embed javadoc tags in a Markdown documentation comment. Since the goal is to leverage a third-party implementation of Markdown, and not develop our own implementation, that raises the question of when and how we should use such an implementation.

As attractive and simple as it may seem, it is not enough to simply preprocess the entire comment through a Markdown processor. For example, consider the following:

Multiplication is commutative, which means that
{@code a*b} is the same as {@code b*a}.

After preprocessing (in this case, with the original Markdown.pl implementation), that will become:

<p>Multiplication is commutative, which means that
{@code a<em>b} is the same as {@code b</em>a}.</p>

Note that the asterisks got translated into <em> and </em> respectively, because there was no recognition of the {@code ...} tags.

You can try out this and other examples for yourself using the "dingus" at https://spec.commonmark.org/dingus.

The same reasoning and a similar example indicate that we should not post-process the output generated from the comment either, since we will have lost any contextual information available in the original comment, and again, might find inadvertent matches for Markdown syntactic forms. For example, the doclet might initially transform the above comment to the following:

Multiplication is commutative, which means that
<code>a*b</code> is the same as <code>b*a</code>.

Post-processing that output to handle the Markdown constructs yields the following invalid HTML, with partially overlapping HTML elements:

<code>a<em>b</code> is the same as <code>b</em>a</code>

Thus, it does not work either to preprocess the entire comment as Markdown or to post-process the result of converting the comment to HTML, and a more sophisticated strategy is required.
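One possible shape for such a strategy is to hide the inline tags from the Markdown processor behind placeholders and restore them afterwards. The following is a minimal sketch of that idea, not a description of the actual design. It assumes inline tags without nested braces, uses a toy emphasis-only transform as a stand-in for a real Markdown processor, and all names in it are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: protect inline tags while Markdown is processed. */
final class InlineTagMasking {
    // Simplified: an inline tag is {@name ...} with no nested braces.
    private static final Pattern INLINE_TAG = Pattern.compile("\\{@[^{}]*\\}");
    // Stand-in for a real Markdown processor: handles *emphasis* only.
    private static final Pattern EMPHASIS = Pattern.compile("\\*([^*]+)\\*");
    // Placeholder delimiter, assumed not to occur in the comment text.
    private static final char MARK = '\uFFFC';

    static String toyMarkdown(String text) {
        return EMPHASIS.matcher(text).replaceAll("<em>$1</em>");
    }

    static String renderWithMaskedTags(String comment) {
        List<String> tags = new ArrayList<>();
        StringBuilder masked = new StringBuilder();
        Matcher m = INLINE_TAG.matcher(comment);
        int last = 0;
        while (m.find()) {
            masked.append(comment, last, m.start());
            masked.append(MARK).append(tags.size()).append(MARK);
            tags.add(m.group());
            last = m.end();
        }
        masked.append(comment, last, comment.length());

        // The Markdown step never sees the content of the inline tags.
        String html = toyMarkdown(masked.toString());

        // Restore each tag, to be handled by the usual tag processing.
        for (int i = 0; i < tags.size(); i++) {
            html = html.replace("" + MARK + i + MARK, tags.get(i));
        }
        return html;
    }
}
```

With the comment from the example above, `toyMarkdown` alone reproduces the broken `{@code a<em>b} ... {@code b</em>a}` result, while `renderWithMaskedTags` returns the text with both `{@code ...}` tags intact. Emphasis outside the tags, as in `*Note*: {@code a*b}`, is still converted.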

Processing Markdown

The general assumption here is the use of a fixed lightweight processor to handle Markdown content.

Because there are different "flavors" of Markdown available for authors to use for plain Markdown documents, it might be seen as a possibility to allow an author to specify the Markdown processor to be used to transform Markdown content to HTML. However, such an ability would perpetuate the differences between different versions of Markdown. Moreover, at least one popular Markdown processor (pandoc) is not written in Java, and so would have to be invoked in a separate process. Given that this would likely have to be done on each individual documentation comment, the overhead to exec an external tool for each comment would be prohibitively expensive, compared to the ability to directly invoke a Markdown processor written in Java and running in the same virtual machine as javadoc or related tools.

One workaround for that issue could be to batch Markdown comments together, to reduce the number of times the external tool needs to be invoked. There are various reasons why that would be undesirable:

For all those reasons, we focus on using a well-defined, well-supported Java library to transform Markdown to HTML.

Escape Sequences

Escape sequences can be used in situations that would otherwise cause the unescaped character to be misinterpreted.

Testing

Testing will focus on the ability to read and recognize Markdown comments, and to invoke the third-party library to transform Markdown comments to HTML.

Testing will also focus on the use of Markdown comments in a hybrid world using both regular documentation comments and Markdown documentation comments. This will include testing situations where parts of a comment are inherited from a supertype, such that the inherited comment is in a different format to that of the inheriting comment.

Because of the use of a third-party library to transform Markdown to HTML, there will not be detailed testing of individual Markdown constructs. It is assumed the library performs as specified.

There are no special environmental issues that need to be tested.

Risks and Assumptions

The primary assumption is that it is unreasonable and impractical to build direct support for Markdown into either the standard doclet, or the Compiler Tree API on which it depends. Thus, the primary risk is the use of a third-party library to transform Markdown to HTML, in the manner described above. If external support for the library is dropped, we will have to maintain a fork of the library for use in our tools, unless we can find an equivalent alternative.

An additional risk is that of an increased presence of errors in generated API specifications, because of the reduced ability to check for bad code, and because authors sometimes omit to check the generated form of their documentation. (JBS reports there have been over forty issues in JDK containing the words "bad" or "malformed" "HTML" in their Summary description, with over ten in the main core-libs area. Many involve mismatched tags that render the page effectively unusable. Presumably, none of them were detected up front by the authors or their reviewers.)

Dependencies

The primary dependency is on the chosen third-party library. There are no JDK features that are dependent on this work at this time.