JEP draft: Support Markdown in Documentation Comments

Author	jjg
Owner	Jonathan Gibbons
Type	Feature
Scope	JDK
Status	Closed / Withdrawn
Component	tools / javadoc(tool)
Discussion	javadoc dash dev at openjdk dot org
Effort	S
Duration	S
Created	2023/01/10 18:24
Updated	2023/02/20 23:55
Issue	8299906

Summary

Support the use of Markdown syntax in documentation comments.

Goals

Introduce the ability to use Markdown syntax in documentation comments, alongside HTML elements and JavaDoc tags, without adversely affecting the interpretation of any existing documentation comments.
Leverage an existing third-party solution to transform Markdown to HTML, to avoid having to build detailed knowledge of Markdown syntax into the standard doclet.
Facilitate other tools that analyze documentation comments to also be able to handle Markdown content in those comments.

Non-Goals

It is not a goal to define and support yet another variant of Markdown.
It is not a goal to support any possible new features specific to javadoc, such as extending Markdown support for links to refer to other program elements, as is currently possible with the {@link ...} tag and other related tags.
It is not a goal to reinvent or redefine what it means to be a documentation comment beyond the ability to use Markdown syntax.

Motivation

Markdown is a popular, easy-to-read, easy-to-write plain-text documentation format that can easily be transformed to HTML. Documentation comments are typically not complicated structured documents, and for the constructs that typically appear in documentation comments, such as paragraphs, styled text, and lists, Markdown provides simpler forms than the corresponding forms in HTML. For those constructs that are not directly supported in Markdown, Markdown allows the use of HTML. This makes it easier to read and write documentation comments in source code, while retaining the ability to generate the same sort of API documentation as before.

As an example of the use of Markdown in a documentation comment, consider the comment for java.lang.Object.hashCode. Using Markdown, this could be rewritten as follows:

/**md
 * Returns a hash code value for the object. This method is
 * supported for the benefit of hash tables such as those provided by
 * {@link java.util.HashMap}.
 *
 * The general contract of `hashCode` is:
 *
 * -   Whenever it is invoked on the same object more than once during
 *     an execution of a Java application, the `hashCode` method
 *     must consistently return the same integer, provided no information
 *     used in `equals` comparisons on the object is modified.
 *     This integer need not remain consistent from one execution of an
 *     application to another execution of the same application.
 * -   If two objects are equal according to the {@link
 *     #equals(Object) equals} method, then calling the
 *     `hashCode` method on each of the two objects must produce the
 *     same integer result.
 * -   It is _not_ required that if two objects are unequal
 *     according to the {@link #equals(Object) equals} method, then
 *     calling the `hashCode` method on each of the two objects
 *     must produce distinct integer results.  However, the programmer
 *     should be aware that producing distinct integer results for
 *     unequal objects may improve the performance of hash tables.
 *
 * @implSpec
 * As far as is reasonably practical, the `hashCode` method defined
 * by class `Object` returns distinct integers for distinct objects.
 *
 * @return  a hash code value for this object.
 * @see     java.lang.Object#equals(java.lang.Object)
 * @see     java.lang.System#identityHashCode
 */

(For the purpose of this example, cosmetic changes like reflowing the text are deliberately avoided, to aid any before and after comparison.)

In that example, note the following:

The use of Markdown is enabled by the presence of the characters md immediately after the initial /**.
The HTML <p> element is not required; a blank line is enough to indicate a paragraph break.
The HTML <ul> and <li> elements are replaced by Markdown bullet-list markers, using - to indicate the beginning of each bullet in the list.
The HTML <em> element is replaced by using underscores (_) to indicate the font change.
Instances of the {@code...} tag are replaced by the equivalent use of backticks (`) to indicate monospace font.
Use of {@link ...} to link to other program elements is unchanged, avoiding the need for a new Markdown construct for such links.
Use of block tags, like @implSpec, @return and @see is generally unaffected, although because Markdown is enabled in the comment as a whole, that includes using it in the content of these tags, such as the backticks in the @implSpec tag.

Description

Documentation comments are used to provide the specification of Java declarations in source code. They are represented by block comments, beginning with /**, and contain a description, followed by a series of block tags, which provide additional details. The description may contain rich text, such as plain text, HTML, entities, and inline tags. Some block tags and inline tags may also contain descriptive rich text.

Documentation comments can be processed by the standard doclet provided as part of the JDK javadoc tool, to generate HTML pages containing the documentation for an API. The comments may also be processed by other tools when browsing code, such as IDEs and the JDK jshell tool.

While it is possible to write HTML in Markdown, Markdown is not a superset of HTML, and so it is necessary to distinguish comments using Markdown from those that do not. Also, since the overall API for a library may be derived from multiple sources, not all of which may use the same format for their documentation comments, it is necessary to be able to support both formats while generating the documentation for any API, and even for any one class, especially when documentation may be inherited from a supertype.

To enable the use of Markdown in documentation comments, use the characters md immediately after the /** at the beginning of the comment, followed by a whitespace character, such as space or a newline. When so enabled, Markdown can be used anywhere that HTML can be used in the comment, such as in the main description, in block tags like @param and @return, or in inline tags like {@link}.
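
As an illustration only, the marker rule described above could be checked along the following lines. This is a minimal sketch, not the javadoc implementation: the names MarkdownMarker and isMarkdownComment are hypothetical, and the input is assumed to be the raw comment text, including the opening /**.

```java
final class MarkdownMarker {
    private static final String PREFIX = "/**md";

    /**
     * Returns true if the raw comment text begins with the marker
     * followed by a whitespace character such as a space or newline.
     * (Hypothetical helper; an assumption for illustration.)
     */
    static boolean isMarkdownComment(String rawComment) {
        if (!rawComment.startsWith(PREFIX) || rawComment.length() == PREFIX.length()) {
            return false;
        }
        return Character.isWhitespace(rawComment.charAt(PREFIX.length()));
    }

    public static void main(String[] args) {
        System.out.println(isMarkdownComment("/**md\n * Some *text*.\n */")); // true
        System.out.println(isMarkdownComment("/** md is just a word here */")); // false
    }
}
```

Requiring whitespace after the marker means that a traditional comment whose first word merely starts with the letters md is not treated as a Markdown comment by this check.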

Markdown content in a documentation comment must be written according to the specification for CommonMark.

Restrictions

There are some minor restrictions on the use of Markdown in documentation comments, arising from some syntactic conflicts when processing the comment.

You cannot use the character sequence */ anywhere within any documentation comment, because to do so would terminate the comment. This includes the use of those characters in code fragments in the comment, such as in examples of regular expressions and glob patterns, or example code containing block comments. There are a few solutions that can be used:
- replace either or both characters with an entity,
- separate the characters with an entity such as a zero-width space,
- use an external snippet for larger code fragments, or
- use the escape sequence *@/.
It is common practice to prefix the lines of a documentation comment with zero or more spaces followed by one or more asterisks (*). Any such characters are ignored when reading the content of a comment. This affects the use of * as a bullet list marker in bulleted lists.
- If you follow the common practice of beginning the line with spaces and asterisks, there must be at least one space character before the bullet list marker.
- If you do not follow the common practice of beginning the line with spaces and asterisks, you cannot simply use * as a bullet list marker. Instead, you should either use one of the alternative available characters (+ or -), or you can use the escape sequence @*.
Each documentation comment is analyzed by identifying the overall structure of the description and any block tags. Then, any inline tags are identified within any descriptive text, and are treated as opaque objects within the surrounding text. This affects the use of character sequences that might incorrectly be interpreted as beginning an inline or block tag. Thus, you cannot put Java code containing annotations, or other constructs with an @ character at the beginning of a line, in a Markdown code block, such as a fenced code block or indented code block. If an annotation appears at the beginning of a line, as is common practice, it will be seen as a block tag. Instead, use the {@snippet ...} tag to include code samples containing annotations, or use the escape sequence @@. Note that the {@snippet ...} tag provides better control of the indentation of the code sample, and the ability to link to other program elements. You can also use @@ as an escape sequence to prevent {@ being interpreted as the beginning of an inline tag.
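
To make the restriction on * as a bullet list marker concrete, the following sketch approximates the line-prefix rule as it is worded above: optional spaces followed by one or more asterisks are dropped. It is an assumption-laden illustration, not the actual javadoc code; the class name LinePrefix and the method strip are hypothetical.

```java
import java.util.regex.Pattern;

final class LinePrefix {
    // Approximation of the rule in the text: zero or more spaces (or tabs)
    // followed by one or more asterisks at the start of the line.
    private static final Pattern PREFIX = Pattern.compile("^[ \\t]*\\*+");

    /** Removes the conventional line prefix, if present. (Hypothetical helper.) */
    static String strip(String line) {
        return PREFIX.matcher(line).replaceFirst("");
    }

    public static void main(String[] args) {
        // Conventional prefix, then a space, then the bullet: the bullet survives.
        System.out.println("[" + strip(" * * item") + "]"); // [ * item]
        // No conventional prefix: the bullet itself is taken to be the prefix.
        System.out.println("[" + strip("* item") + "]");    // [ item]
        // An alternative bullet character is unaffected.
        System.out.println("[" + strip("- item") + "]");    // [- item]
    }
}
```

The second case shows why a line that starts with a bare * cannot be used as a list item when the conventional prefix is omitted.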

Escape sequences are described in more detail in the section Escape Sequences in the latest Documentation Comment Specification.
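
The effect of the escape sequences mentioned above can be sketched as a simple left-to-right scan. This is a simplified illustration under stated assumptions, not the javadoc implementation: it assumes that @@ stands for @, that @* stands for *, and that @/ stands for / only when it follows an asterisk. The class name EscapeSequences is hypothetical, and the Documentation Comment Specification remains the authority on where each sequence is recognized.

```java
final class EscapeSequences {
    /** Replaces the escape sequences with the characters they stand for. (Hypothetical helper.) */
    static String unescape(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '@' && i + 1 < text.length()) {
                char next = text.charAt(i + 1);
                boolean afterStar = i > 0 && text.charAt(i - 1) == '*';
                // Assumption: "@/" is only an escape as part of "*@/".
                if (next == '@' || next == '*' || (next == '/' && afterStar)) {
                    out.append(next);
                    i++; // skip the escaped character
                    continue;
                }
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(unescape("/* a block comment *@/")); // /* a block comment */
        System.out.println(unescape("@@Override"));             // @Override
    }
}
```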

Errors

While Markdown is a convenient way to write text that is easy to read, it only works well for text that is valid; it is less good when it comes to text that contains errors of one sort or another. Furthermore, it will not be possible to utilize the existing support in the doclet for recognizing bad HTML without building a Markdown parser into the doclet (or doclint) to understand the surrounding context, such as code spans and code blocks.

Markdown is also notorious for not reporting errors in its input, and may either pass bad input through to the output, to be caught by downstream validators, or may transform the bad input in a way that can only be caught by manually reading and checking the generated output.

For example, the following invalid HTML, with a misspelled tag, will be propagated to the output, without any warning or error message being produced, although the issue could be caught by a downstream validator, if one were run on the generated output:

input: <tabel><tr><td>data</table>
output: <tabel><tr><td>data</table>

In contrast, the following invalid HTML will be escaped into valid HTML, and will appear directly in the output when viewed in a browser:

input: Unfinished link: <a href="example.com"
output: Unfinished link: &lt;a href=&quot;example.com&quot;

Since the output will be valid HTML, it will not be detected by any downstream validator, even though it is presumably not what the author originally intended to appear.

Examples of errors are not limited to bad HTML. Generally, CommonMark treats any text that is not recognized as a valid Markdown construct as literal text. Here is an example of a bad link, with a typo in the name of the reference:

Here is an [example][bad-link] of a bad link.

[badlink]: http://www.example.com

This is rendered as the following, because of the error in the reference:

Here is an [example][bad-link] of a bad link.

As is the case with many other errors, the output is valid HTML and can only be caught by proofreading the output, and not by any automated validator.

For reference, here is the intended output, seen when the typo is fixed:

Here is an <a href="http://www.example.com">example</a> of a bad link.
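
No such check is part of this proposal, but purely as an illustration of how mechanical this class of error is, the following sketch reports full reference links whose label has no matching definition. It is a rough approximation under stated assumptions: it uses regular expressions rather than a CommonMark parser, it only handles the [text][label] form, and its label normalization (trimming, lower-casing, collapsing whitespace) is a simplification. The class name ReferenceLinkCheck is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ReferenceLinkCheck {
    // "[label]:" at the start of a line, indented by at most three spaces.
    private static final Pattern DEFINITION = Pattern.compile("(?m)^ {0,3}\\[([^\\]]+)\\]:");
    // "[text][label]" anywhere in the text.
    private static final Pattern REFERENCE = Pattern.compile("\\[[^\\]]+\\]\\[([^\\]]+)\\]");

    /** Returns the labels that are referenced but never defined. (Hypothetical helper.) */
    static List<String> undefinedLabels(String markdown) {
        Set<String> defined = new HashSet<>();
        Matcher d = DEFINITION.matcher(markdown);
        while (d.find()) {
            defined.add(normalize(d.group(1)));
        }
        List<String> missing = new ArrayList<>();
        Matcher r = REFERENCE.matcher(markdown);
        while (r.find()) {
            if (!defined.contains(normalize(r.group(1)))) {
                missing.add(r.group(1));
            }
        }
        return missing;
    }

    // Simplified label matching; an assumption, not the full CommonMark rule.
    private static String normalize(String label) {
        return label.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String text = "Here is an [example][bad-link] of a bad link.\n\n"
                + "[badlink]: http://www.example.com\n";
        System.out.println(undefinedLabels(text)); // [bad-link]
    }
}
```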

While these examples may seem trivial, regrettably there is a long history of errors like these leaking into API documentation, because authors often do not proofread the generated output for their documentation. Indeed, this was a motivation for introducing the DocLint feature in JEP 172, and having to disable it in Markdown comments will be a retrograde step.

Implementation

The design so far has been presented in terms of CommonMark and its specification, although reference has been made to using a third-party implementation of that specification.

The chosen implementation is commonmark-java, which is self-described as a Java library for parsing and rendering Markdown text according to the CommonMark specification (and some extensions).

Using the library can be as simple as the following:

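import org.commonmark.node.Node;
import org.commonmark.parser.Parser;
import org.commonmark.renderer.html.HtmlRenderer;

// Parses Markdown text and renders it as HTML, using the library's default settings.
public String parseAndRender(String input) {
    Parser parser = Parser.builder().build();
    Node document = parser.parse(input);
    HtmlRenderer renderer = HtmlRenderer.builder().build();
    return renderer.render(document);
}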

Paragraphs

CommonMark defines that a simple run of characters will be treated as a paragraph, implying that the run of characters will be enclosed with <p> and </p> when rendered in HTML.

However, there are places in a documentation comment where there are effectively restrictions on the Markdown that can be used, because the corresponding HTML is restricted to be phrasing content. A prime example of this is a tag like

{@link element description}

which informally maps to

<a href="link/to/element">rendered-description</a>

In situations like this, care must be taken to ensure there are no inappropriate paragraph tags wrapping the original description.
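
One possible way to take that care is sketched below. It assumes that the Markdown renderer wraps a single run of text in <p> and </p>, possibly followed by a newline, and it removes that wrapper only when the rendered output is exactly one paragraph. This is an illustration, not the approach taken by the standard doclet; the names PhrasingContent and unwrapSingleParagraph are hypothetical.

```java
final class PhrasingContent {
    /**
     * If the rendered HTML is exactly one paragraph, returns its content
     * without the enclosing paragraph tags; otherwise returns the input
     * unchanged. (Hypothetical helper; an assumption for illustration.)
     */
    static String unwrapSingleParagraph(String html) {
        String s = html.trim();
        if (s.startsWith("<p>") && s.endsWith("</p>")) {
            String inner = s.substring("<p>".length(), s.length() - "</p>".length());
            if (!inner.contains("<p>")) {
                return inner;
            }
        }
        return html;
    }

    public static void main(String[] args) {
        // The description of a link, as it might come back from a renderer.
        System.out.println(unwrapSingleParagraph("<p>the <code>equals</code> method</p>\n"));
        // prints: the <code>equals</code> method
    }
}
```

Leaving multi-paragraph output unchanged keeps the sketch conservative: content that is not phrasing content is left for the caller to handle or report.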

Headings

Some Markdown implementations, including commonmark-java, have a feature (extension) to automatically generate ids for user-defined headings.

Starting in JDK 20, javadoc provides a similar ability (JDK-8289332) but necessarily using a different algorithm to generate the ids, to guard against the possibility of conflicts with other ids generated in other parts of the page.

If we are to maintain parity with the feature as defined for non-Markdown comments, we may want to define an extension that can be used to generate headings similar to those in non-Markdown comments.

Tables

CommonMark does not provide support for tables, and yet tables are one of the more onerous constructs to write in raw HTML. They are also reasonably common, with over 300 appearing in the documentation comments for the modules in OpenJDK. However, commonmark-java does provide an extension for tables, as used in GitHub Flavored Markdown, and while we should generally resist the use of arbitrary extensions, this may be one to consider.

Opinions differ whether it is better to use HTML or a Markdown extension for a table. While Markdown tables may be easier to read, they can be a lot more tedious to write, without appropriate support in authoring tools. Conversely, HTML tables may be easier to write, although there can be a lot of visual clutter mixed in with the information being presented in the table.

Note that the HTML output generated from Markdown tables typically does not follow Section 508 Accessibility Guidelines or other web accessibility guidelines. This may be an issue for generating documentation that is to be used in some environments.

Compiler Tree API

The documentation comments for any module, package, class or interface can be accessed using the Compiler Tree API, in the com.sun.source.* packages, in the jdk.compiler module.

Markdown content is represented by a new subtype of DocTree, called com.sun.source.doctree.RawTextTree, using a new kind, DocTree.Kind.MARKDOWN.

No other support for analyzing Markdown in more detail is provided.

Alternatives

Why Markdown?

An alternative to Markdown is AsciiDoctor. Each claims to have advantages over the other. One is more popular, the other has less need to use HTML. Both resort to providing extensions to support otherwise-missing features.

The use of an initial marker string /**md could be extended to support other languages, such as AsciiDoctor, by using additional, different marker strings. Allowing for such a future enhancement is one reason for keeping the integration with the underlying language shallow instead of deep.

Which Version of Markdown?

There are many "variants", or "flavors" of Markdown.

Initially released in 2004, the original version of Markdown did not specify the syntax unambiguously, and the implementation was "quite buggy". This led to divergent implementations, as different authors attempted to fix the issues, and to address some shortcomings of the specification, which in turn led to an effort at commonmark.org to develop a formal specification for Markdown, known as CommonMark. That specification is the basis for the Markdown constructs that can be used in documentation comments.

If we did not specify the use of CommonMark, and allowed the use of arbitrary other variants or flavors, that would lead to potential inconsistencies between different tools that might process documentation comments, or for documentation comments written by different authors, who might be using different variants or have different extensions enabled. API documentation is often derived from comments in different source files, or even different libraries, and so it would be difficult to guarantee consistent behavior for all authors.

Enabling Markdown

Markdown comments are indicated by beginning the comment with /**md so that it is possible to differentiate between Markdown comments and non-Markdown comments.

Pragma

Java has long eschewed the notion of specialized comments or other character sequences to define the syntactic features that may appear in the rest of the file. It would be a big change to head in that direction for identifying whether documentation comments are in Markdown format or not.

Annotations

The initial marker allows an author to opt in to using Markdown on each comment individually. That is certainly convenient when updating existing API documentation, but might be seen as onerous if all the documentation comments in a file use Markdown syntax. One way to mark the content of a source file as following a non-default convention could be to use annotations. However, there are a couple of reasons why this is not a good idea.

Annotations require at least some amount of semantic analysis of a source file, to determine their value and whether they are applicable as intended. Having to do semantic analysis to determine the kind of syntax used in a documentation comment is highly undesirable and would complicate what might otherwise be simple tools for analysing documentation comments.

One aspect of this is that the existing Compiler Tree API allows access to the parsed form of a comment
While common and effectively standard, documentation comments are not a feature of the Java programming language, and only superficially part of Java SE, in the Language Model API. Likewise, javadoc is a JDK tool, not a Java SE tool, and it would be strange to see annotations pertaining to the use of a JDK feature in Java SE APIs. And more specifically, the annotation interface for any such annotation appearing in the APIs of the java.base module would itself have to be declared in the java.base module, because that module by definition has no dependencies on other modules.

For these reasons, annotations are not supported as a way to indicate the format of documentation comments in a file.

Command-Line and/or Configuration Options

If all the source files in a project adopt the use of Markdown syntax in their documentation comments, it could be possible to provide a command-line or other form of configuration option to specify the use of Markdown as a default, even if the initial marker string md is not present. Such a command-line option could be seen as similar to the --source option used by javac, javadoc and other tools to specify the version of the Java programming language used in the source files being read.

However, if the initial marker is not used and an author relies on a command-line or configuration option to specify the format, it will never be possible to process a mixture of source files, some using Markdown in documentation comments, some not, without having more complicated options to specify which files use which format.

In addition, any such mechanism would have to be supported in some form by all tools that want to read and analyze documentation comments. Here is a list of some such tools:

javadoc: for generating API documentation,
javac: for the doclint feature to detect and report common errors in documentation comments,
jshell: for its ability to render the documentation comments of imported APIs,
IDEs: for their ability to render documentation comments, such as IntelliJ IDEA's Reader Mode,
any other custom utility used to analyze documentation comments, such as the one used to count occurrences of different Markdown constructs, and
the JDK Compiler Tree API, which can be used to parse documentation comments.

For these reasons, support for command-line or other configuration options to indicate which files use Markdown format in documentation comments is not provided at this time, but may be reconsidered in the future.

Different Comment Syntax

Instead of adding a marker string at the beginning of the comment, an alternative would be to use a different form of comment, such as a stylized form of a series of line (//) comments. For example, we could extend the definition of a documentation comment to include a series of adjacent line comments beginning with ///, or maybe a series of adjacent line comments of which at least the first begins with ///. Note that unlike block comments, the presence of a marker at the beginning of each line (after any optional whitespace) cannot be avoided.

Such a change would require an update to the Java SE API used to access the content of a documentation comment.

It is left as an exercise for the reader to determine whether a series of stylized line comments is more or less visually intrusive than a block comment with a short initial marker string. However, using line comments would obviate the need for some of the escape sequences that may be required when using block comments.
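
To give a feel for this alternative, the following sketch collects the content of a run of adjacent /// line comments. It is hypothetical throughout: the alternative is not adopted here, the class name LineDocComments and the method extract are inventions for illustration, and dropping one optional space after the marker is an assumption rather than a specified rule.

```java
import java.util.List;

final class LineDocComments {
    /**
     * Collects the content of the leading run of lines that begin with "///"
     * (after optional whitespace), stopping at the first line that does not.
     * (Hypothetical helper; an assumption for illustration.)
     */
    static String extract(List<String> lines) {
        StringBuilder content = new StringBuilder();
        for (String line : lines) {
            String trimmed = line.replaceFirst("^\\s+", "");
            if (!trimmed.startsWith("///")) {
                break;
            }
            String text = trimmed.substring(3);
            if (text.startsWith(" ")) {
                text = text.substring(1); // assumption: drop one separator space
            }
            content.append(text).append('\n');
        }
        return content.toString();
    }

    public static void main(String[] args) {
        List<String> source = List.of(
                "    /// Matches names like `*.java`, or comments like /* this */.",
                "    ///",
                "    /// * a bullet using an asterisk",
                "    boolean matches(String name);");
        System.out.print(extract(source));
    }
}
```

In this form, neither the */ sequence nor an asterisk used as a bullet list marker needs an escape sequence, because the only line prefix is the /// marker itself.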

When to Process Markdown?

Just as existing documentation comments can contain a mixture of HTML and javadoc tags, so too will we want to embed javadoc tags in a Markdown documentation comment. Since the goal is to leverage a third-party implementation of Markdown, and not develop our own implementation, that raises the question of when and how we should use such an implementation.

As attractive and simple as it may seem, it is not enough to simply preprocess the entire comment through a Markdown processor. For example, consider the following:

Multiplication is commutative, which means that
{@code a*b} is the same as {@code b*a}.

After preprocessing (in this case, with the original Markdown.pl implementation), that will become:

<p>Multiplication is commutative, which means that
{@code a<em>b} is the same as {@code b</em>a}.</p>

Note that the asterisks got translated into <em> and </em> respectively, because there was no recognition of the {@code ...} tags.

You can try out this and other examples for yourself using the "dingus" at https://spec.commonmark.org/dingus.

The same reasoning and a similar example indicate that we should not post-process the output generated from the comment either, since we will have lost any contextual information available in the original comment, and again, might find inadvertent matches for Markdown syntactic forms. For example, the doclet might initially transform the above comment to the following:

Multiplication is commutative, which means that
<code>a*b</code> is the same as <code>b*a</code>.

Post-processing that output to handle the Markdown constructs yields the following invalid HTML, with partially overlapping HTML elements:

<code>a<em>b</code> is the same as <code>b</em>a</code>

Thus, it works neither to preprocess an entire comment as Markdown nor to post-process the result of converting the comment to HTML; a more sophisticated strategy is required.
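
One such strategy, consistent with the idea of treating inline tags as opaque objects within the surrounding text, is sketched below: each inline tag is replaced by a placeholder character before the text is handed to the Markdown processor, and restored afterwards. This is a minimal sketch under stated assumptions, not the javadoc implementation. The toyMarkdown method is a stand-in that only handles *emphasis*, the choice of U+FFFC as the placeholder is an assumption, the inline-tag pattern ignores nested braces, and the class name InlineTagProtection is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class InlineTagProtection {
    // Simplification: an inline tag is "{@" up to the next "}", with no nested braces.
    private static final Pattern INLINE_TAG = Pattern.compile("\\{@[^{}]*\\}");
    private static final Pattern EMPHASIS = Pattern.compile("\\*([^*]+)\\*");
    private static final char PLACEHOLDER = '\uFFFC'; // assumed not to occur in the comment

    /** Stand-in for a real Markdown processor: handles *emphasis* only. */
    static String toyMarkdown(String text) {
        return EMPHASIS.matcher(text).replaceAll("<em>$1</em>");
    }

    /** Masks inline tags, processes the rest as Markdown, then restores the tags. */
    static String render(String comment) {
        List<String> tags = new ArrayList<>();
        StringBuilder masked = new StringBuilder();
        Matcher m = INLINE_TAG.matcher(comment);
        while (m.find()) {
            tags.add(m.group());
            m.appendReplacement(masked, String.valueOf(PLACEHOLDER));
        }
        m.appendTail(masked);

        String html = toyMarkdown(masked.toString());

        StringBuilder out = new StringBuilder();
        int next = 0;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == PLACEHOLDER && next < tags.size()) {
                out.append(tags.get(next++)); // restore tags in their original order
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String comment = "It is *always* true that {@code a*b} is the same as {@code b*a}.";
        System.out.println(toyMarkdown(comment)); // naive preprocessing mangles the tags
        System.out.println(render(comment));      // tags survive; emphasis is still handled
    }
}
```

In a real tool the restored tags would then be rendered by the doclet in the usual way; the sketch returns them verbatim to keep the example short.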

Processing Markdown

The general assumption here is the use of a fixed lightweight processor to handle Markdown content.

Because there are different "flavors" of Markdown available for authors to use for plain Markdown documents, it might be seen as a possibility to allow an author to specify the Markdown processor to be used to transform Markdown content to HTML. However, such an ability would perpetuate the differences between different versions of Markdown. Moreover, at least one popular Markdown processor (pandoc) is not written in Java, and so would have to be invoked in a separate process. Given that this would likely have to be done on each individual documentation comment, the overhead to exec an external tool for each comment would be prohibitively expensive, compared to the ability to directly invoke a Markdown processor written in Java and running in the same virtual machine as javadoc or related tools.

One workaround for that issue could be to batch Markdown comments together, to reduce the number of times the external tool needs to be invoked. There are various reasons why that would be undesirable:

the cost and complexity to batch the inputs together and break apart the resulting output,
the risk of creating conflicts in the comments being batched together, such as the definitions for reference links, and
the similar cost and complexity this would imply for other tools that want to analyse documentation comments.

For all those reasons, we focus on using a well-defined, well-supported Java library to transform Markdown to HTML.

Escape Sequences

Escape sequences can be used in situations that would otherwise cause the unescaped character to be misinterpreted.

Avoiding the use of */:

There is no alternative to the issue of */ terminating a block comment, thus preventing its use anywhere within any such comment.

Short of changing the definition of a comment in the Java Language Specification, the only alternative that would permit direct use of */ in a documentation comment is to use a different form of enclosing comment, suggesting a new stylized use of // comments, which do not provide any restrictions for the characters that follow on the same line. Such a significant change seems totally out of line with any resulting benefits.

Avoiding whitespace and asterisks at the beginning of a line being ignored:

The behavior to ignore leading whitespace and asterisks on each line of a documentation comment is defined by a Java SE API (Elements.getDocComment). Changing the behavior would be an incompatible change to that method. Using an escape sequence for an asterisk as a bullet marker at the beginning of a line is only required when not following the common convention to prefix each line with whitespace and asterisks, or when wanting to place an asterisk list bullet marker immediately after the initial line prefix. Thus, there are already sufficient alternatives to avoid using the escape sequence and so no additional alternatives seem necessary.

Avoiding the interpretation of @ as part of a block or inline tag:

It is often unavoidable that there may be problems when embedding the use of one language inside another, arising from conflicting syntactic interpretations of certain character sequences. The primary problem case here is the desire to write a Java code fragment containing annotations inside a Markdown code block in the DSL for a documentation comment in a Java block comment! The issue is that the common coding practice of placing annotations at the beginning of a line conflicts with the interpretation of any such characters as the introduction of a block tag.

While it would be possible to avoid the conflict by detecting the presence of a Markdown code span or code block surrounding the relevant characters, that would require integrating a full Markdown parser into the low-level code to parse documentation comments, and that would be too disruptive to the code, and to the existing public JDK API for reading and representing documentation comments. Another alternative would be to change the syntax for block tags to avoid any conflict, but that would lead to unnecessary inconsistency between traditional comments and Markdown comments.

Testing

Testing will focus on the ability to read and recognize Markdown comments, and to invoke the third-party library to transform Markdown comments to HTML.

Testing will also focus on the use of Markdown comments in a hybrid world using both regular documentation comments and Markdown documentation comments. This will include testing situations where parts of a comment are inherited from a supertype, such that the inherited comment is in a different format to that of the inheriting comment.

Because of the use of a third-party library to transform Markdown to HTML, there will not be detailed testing of individual Markdown constructs. It is assumed the library performs as specified.

There are no special environmental issues that need to be tested.

Risks and Assumptions

The primary assumption is that it is unreasonable and impractical to build direct support for Markdown into either the standard doclet, or the Compiler Tree API on which it depends. Thus, the primary risk is the use of a third-party library to transform Markdown to HTML, in the manner described above. If external support for the library is dropped, we will have to maintain a fork of the library for use in our tools, unless we can find an equivalent alternative.

An additional risk is that of an increased presence of errors in generated API specifications, because of the reduced ability to check for bad code, and because authors sometimes omit to check the generated form of their documentation. (JBS reports there have been over forty issues in JDK containing the words "bad" or "malformed" "HTML" in their Summary description, with over ten in the main core-libs area. Many involve mismatched tags that render the page effectively unusable. Presumably, none of these were detected up front by the authors or their reviewers.)

Dependencies

The primary dependency is on the chosen third-party library. There are no JDK features that are dependent on this work at this time.