How to use and layer semantic tags in PDF documents

Introduction

I've found that user-friendly information on PDF tags is surprisingly hard to come by. Most resources dive deep into technical standards, like the PDF ISO specifications, which can be overwhelming if you're just trying to make your documents more accessible.

This guide is designed to help bridge that gap. My hope is that it offers a straightforward reference for common tags, along with a handful of practical examples of how to nest them properly to improve accessibility.

What are PDF tags?

In a PDF, tags define the document's structure and content for assistive technologies. Each element is assigned a tag that describes the type of content it holds, such as a heading, paragraph, table, or list.

Although tags aren't visible to the end user, they're essential for creating an accessible PDF. The tag tree helps screen readers understand the document's structure and ensures the content is presented in the correct reading order.
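
To make the idea of a tag tree and its reading order more concrete, here is a minimal sketch in Python. It doesn't read a real PDF or use any PDF library. The `Tag` class and the `reading_order` function are hypothetical names I made up for illustration, and the sketch assumes reading order simply follows a depth-first walk of the tree.

```python
from dataclasses import dataclass, field


@dataclass
class Tag:
    """One node in a simplified, hypothetical tag tree (not a real PDF API)."""
    name: str                      # tag name, e.g. "H1", "P", "L"
    text: str = ""                 # text held directly by this tag, if any
    children: list = field(default_factory=list)


def reading_order(tag):
    """Yield (tag name, text) pairs by walking the tree depth-first."""
    if tag.text:
        yield (tag.name, tag.text)
    for child in tag.children:
        yield from reading_order(child)


doc = Tag("Document", children=[
    Tag("H1", "Annual report"),
    Tag("P", "Intro paragraph."),
    Tag("L", children=[
        Tag("LI", children=[Tag("Lbl", "1."), Tag("LBody", "First item")]),
    ]),
])

order = list(reading_order(doc))
for name, text in order:
    print(f"{name}: {text}")
```

The content comes out in the same order the tags appear in the tree, which is why a badly ordered tag tree can scramble what a screen reader announces.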

The table below highlights some of the most common PDF tags you're likely to come across in your documents. I've included brief descriptions for each tag, though many are fairly self-explanatory.

Some tags have been left out because, while they may still appear in documents, they're either no longer supported or their use is discouraged according to the Tagged PDF Best Practice Guide: Syntax.

Tag	Description
`<Art>`	Separates individual articles within the same document.
`<Aside>`	Content that is indirectly related to the current topic, like a side note.
`<BlockQuote>`	Block-level quotation. Can contain paragraphs and a caption.
`<Caption>`	Used to provide a title to an element, found before or after the element it titles.
`<Code>`	Inline snippet of programming code. Found in a block-level element.
`<Div>`	Semantically-empty block-level container. Typically used to apply styles to grouped elements.
`<Document>`	The container of a complete document. A PDF can contain multiple documents, and a `<Document>` can be left empty to indicate a blank page.
`<Figure>`	Images, charts, and other graphical elements. May contain various elements but will be interpreted as a single image by screen readers.
`<Form>`	Form elements. Can contain text when multiple fields are grouped together, but typically only contains an attributed object.
`<Formula>`	Mathematical or scientific notations. Can be used inline or at block level.
`<H(X)>`	Section, document, or page titles. Should appear in order, from `<H1>` to `<H6>`, without skipping a level.
`<Index>`	Container for a subject index list, typically found near the end of a publication.
`<Lbl>`	Labels for list markers, like bullets or numbers. Found in the `<LI>` element. Unlike the HTML label element, it is not used to label an input in a form.
`<Link>`	A link to a web page or another location in the document.
`<L>`	Parent list container. Contains `<LI>` children.
`<LI>`	Individual list items found in a parent `<L>`. Parent to the `<Lbl>` and `<LBody>`, and to a child `<L>` element when using nested lists.
`<LBody>`	The contents of a list item, found in the `<LI>` element, at the same level as the `<Lbl>`.
`<Note>`	An explanatory passage, like a footnote or endnote. Typically found in a `<Reference>`.
`<P>`	An ordinary paragraph.
`<Part>`	Used to divide documents. Sub-sections should use `<Sect>`.
`<Quote>`	Inline quote in a block-level parent.
`<Reference>`	A citation to text or data found elsewhere in the document. Can include a `<Link>` element.
`<Sect>`	Used to divide a document into small sections. Often found in `<Part>` or `<Art>` elements.
`<Span>`	Semantically-empty inline container, often wrapped around styled text.
`<Table>`	Table parent container. Contains `<TR>` rows, which hold the `<TD>` and `<TH>` cells.
`<TBody>`	Designates a section of the table as the content area. Optional.
`<TD>`	Table data cell. Found in the `<TR>` rows of `<Table>`, `<TBody>`, or `<TFoot>` elements.
`<TFoot>`	Designates a table section as the footer, typically a total row. Optional.
`<TH>`	Table header cell. Found in the `<TR>` rows of `<Table>` elements. Can be assigned a scope.
`<THead>`	Designates a table section as the header. Typically contains the table's `<TH>` cells. Optional.
`<TOC>`	Parent container of a table of contents.
`<TOCI>`	Individual table of contents items. Can contain another `<TOC>` for nested tables.
`<TR>`	Table row used to group `<TD>` or `<TH>` cells in a row.
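
The `<H(X)>` row above asks for headings that run in order without skipping a level. As a rough sketch of how that rule could be checked, the Python below takes a plain list of tag names in reading order. The function name `skipped_heading_levels` is hypothetical, and I'm assuming the first heading should be an `<H1>` and that jumping back up to a higher level is always fine.

```python
import re


def skipped_heading_levels(tag_names):
    """Return (previous_level, level) pairs wherever a heading level is skipped."""
    problems = []
    previous = 0  # no heading seen yet, so the first heading should be H1
    for name in tag_names:
        match = re.fullmatch(r"H([1-6])", name)
        if not match:
            continue  # not a heading tag, ignore it
        level = int(match.group(1))
        if level > previous + 1:
            problems.append((previous, level))
        previous = level
    return problems


tags = ["H1", "P", "H2", "P", "H4", "P", "H2", "H3"]
print(skipped_heading_levels(tags))  # [(2, 4)]
```

Here the jump from `H2` straight to `H4` is reported, while stepping back from `H4` to `H2` is not.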

Syntax and hierarchy

Below are sample tag trees for commonly nested elements. Keep in mind that some tags can be structured in different ways since PDF formatting is fairly flexible. However, the examples here reflect the layouts I personally prefer to use.
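
If you want to experiment with indented tag trees like the ones in this section, the sketch below turns an indented outline into nested Python lists. This is only a toy parser for outlines written by hand. It is not how tags are stored inside a PDF, and the name `parse_outline` is my own invention. It assumes indentation is consistent (all tabs or all spaces).

```python
def parse_outline(text):
    """Turn an indented outline into nested [name, children] lists."""
    root = ["root", []]
    stack = [(-1, root)]  # (indent width, node) for each open ancestor
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        indent = len(line) - len(line.lstrip())
        node = [line.strip(), []]
        # Close every ancestor that is indented as deep as, or deeper than, this line.
        while stack[-1][0] >= indent:
            stack.pop()
        stack[-1][1][1].append(node)
        stack.append((indent, node))
    return root[1]


outline = "<L>\n  <LI>\n    <Lbl>\n    <LBody>\n  <LI>\n"
tree = parse_outline(outline)
print(tree)
```

Each node is a two-item list: the tag text and a list of its children, nested to match the indentation.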

Structure

The <Part> and <Sect> elements can be used to divide a document into logical sections, and the <Art> element denotes individual articles that are part of a larger collection.


								<Document>
									<Part>
										<Art>
											<Sect>

Captions

The <Caption> element gives a title to another element, most commonly a <Figure> or <Table>.

It can be placed before or after the element it titles.


								<Figure>
									<Caption>
										<P>Figure 1. Example of a caption


								<Caption>
									<P>Table 1. Example of a caption
								<Table>

Data tables

A three-column table with a caption, a row of column header cells, a row of data cells, and a footer row. Note that the <THead>, <TBody> and <TFoot> elements are optional.


								<Table>
									<Caption>
									<THead>
										<TR>
											<TH>
												<P>Row 1, column 1
											<TH>
												<P>Row 1, column 2
											<TH>
									<TBody>
										<TR>
											<TD>
												<P>Row 2, column 1
											<TD>
												<P>Row 2, column 2
											<TD>
									<TFoot>
										<TR>
											<TD>
												<P>Row 3, column 1
											<TD>
												<P>Row 3, column 2
											<TD>
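
The nesting of a data table can be checked mechanically. The sketch below is a hedged illustration, not a real accessibility checker. The tree is a hand-built `(name, children)` tuple rather than something read from a PDF, the `ALLOWED` mapping is my own reading of which children belong where, and `table_problems` is a hypothetical helper name.

```python
# Children I would expect under each table tag (an assumption, not a full rule set).
ALLOWED = {
    "Table": {"Caption", "THead", "TBody", "TFoot", "TR"},
    "THead": {"TR"},
    "TBody": {"TR"},
    "TFoot": {"TR"},
    "TR": {"TH", "TD"},
}


def table_problems(node, path=""):
    """Return a message for every child that sits under an unexpected parent."""
    name, children = node
    here = f"{path}/{name}"
    problems = []
    allowed = ALLOWED.get(name)  # None means "no rule for this tag"
    for child in children:
        if allowed is not None and child[0] not in allowed:
            problems.append(f"{child[0]} is not expected inside {here}")
        problems.extend(table_problems(child, here))
    return problems


good = ("Table", [
    ("Caption", []),
    ("THead", [("TR", [("TH", [("P", [])])])]),
    ("TBody", [("TR", [("TD", [("P", [])])])]),
])
bad = ("Table", [("TD", [])])

print(table_problems(good))  # []
print(table_problems(bad))   # ['TD is not expected inside /Table']
```

Tags without an entry in `ALLOWED`, like `<TH>` and `<TD>`, accept any children in this sketch, so the `<P>` inside each cell passes without comment.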

Forms

Every input must have its own <Form> element unless it belongs to a group, such as a set of checkboxes or radio buttons. The <Form> element should appear at the same level as its primary label, with both inside a common parent element. The OBJR notation stands for Object Reference, which means the tag represents the actual field element.

Note that there is no mechanism to assign an input to a particular label. Tooltip values should be provided for users tabbing through inputs.

Text inputs

A single parent element can contain multiple inputs.


									<P>
										Label text
										<Form>
											Field Name - OBJR
										Label text 2
										<Form>
											Field 2 Name - OBJR

Checkboxes and radio buttons

Individual form labels should appear directly before or after their object.


									<P>
										Label text:
										<Form>
											Checkbox 1 Name - OBJR
											Checkbox 1 label text
											Checkbox 2 Name - OBJR
											Checkbox 2 label text
											Checkbox 3 Name - OBJR
											Checkbox 3 label text

Lists

A standard list uses an <Lbl> for the marker and an <LBody> for the item text. Another way I've seen to denote bullets is to include the bullet character in the <LBody> and wrap the text in a non-semantic element like a <Span>.


								<L>
									<LI>
										<Lbl>•
										<LBody>List item text

You can also nest a list in another list.


								<L>
									<LI>
										<Lbl>•
										<LBody>List item text
										<L>
											<LI>
												<Lbl>•
												<LBody>List item text
									<LI>
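
As a small illustration of the list layout above, this Python sketch checks a hand-built `(name, children)` tree. It assumes every `<L>` holds only `<LI>` children and every `<LI>` holds exactly one `<LBody>` and at most one `<Lbl>`, with a nested `<L>` allowed inside the `<LI>`. Those rules and the name `list_problems` are my own simplifications, not a complete validator.

```python
def list_problems(node):
    """Return messages for list tags that break the simple rules assumed above."""
    name, children = node
    child_names = [child[0] for child in children]
    problems = []
    if name == "L":
        for child_name in child_names:
            if child_name != "LI":
                problems.append(f"L contains {child_name}, expected LI")
    elif name == "LI":
        if child_names.count("LBody") != 1:
            problems.append("LI should contain exactly one LBody")
        if child_names.count("Lbl") > 1:
            problems.append("LI should contain at most one Lbl")
    for child in children:
        problems.extend(list_problems(child))
    return problems


nested = ("L", [
    ("LI", [
        ("Lbl", []),
        ("LBody", []),
        ("L", [("LI", [("Lbl", []), ("LBody", [])])]),
    ]),
    ("LI", [("Lbl", []), ("LBody", [])]),
])
broken = ("L", [("P", []), ("LI", [("Lbl", [])])])

print(list_problems(nested))  # []
print(list_problems(broken))
```

The nested example passes, while the broken one is flagged twice: once for the stray `<P>` directly inside the `<L>`, and once for the `<LI>` that has no `<LBody>`.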

Table of contents

Tables of contents can be nested or presented in a single, flat level. Both approaches are acceptable.

Nested

The <TOC> element can be nested as a child of another <TOC> or in a <TOCI>.


									<TOC>
										<TOCI>
											<TOC>
												<TOCI>
												<TOCI>
												<TOCI>
										<TOCI>

Flat

Tables of contents can also be flattened and displayed linearly.


									<TOC>
										<TOCI>
										<TOCI>
										<TOCI>
										<TOCI>
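
To show how the nested and flat layouts relate, here is a short Python sketch that flattens a nested table of contents into a single level. The data is a made-up list of `(title, nested_entries)` pairs standing in for `<TOCI>` items, and `flatten_toc` is a hypothetical helper, not part of any PDF tool.

```python
def flatten_toc(toc, depth=0):
    """Flatten nested (title, nested_entries) pairs into (depth, title) pairs."""
    flat = []
    for title, nested in toc:
        flat.append((depth, title))
        flat.extend(flatten_toc(nested, depth + 1))
    return flat


nested_toc = [
    ("Chapter 1", [("Section 1.1", []), ("Section 1.2", [])]),
    ("Chapter 2", []),
]

for depth, title in flatten_toc(nested_toc):
    print("  " * depth + title)
```

The entries keep their original order, so the flat version reads the same way from top to bottom. Only the nesting is lost, which the sketch keeps as a depth number in case you still want to indent the output.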