Pitfalls of Converting TEI XML Standoff Annotations to Inline, and a DOM-Based Solution

Digital Engishiki is a project that encodes the Engishiki, a collection of supplementary regulations for the ritsuryō legal system completed in 927 CE, in TEI (Text Encoding Initiative) XML and makes it browsable and searchable on the web. Led by the National Museum of Japanese History, the project provides TEI markup for critical editions, modern Japanese translations, and English translations, served through a Nuxt.js (Vue.js)-based viewer.

During development, we encountered a bug where converting TEI XML standoff annotations to inline annotations caused the XML document structure to collapse. This article records the cause and the DOM-based solution.

What Are Standoff Annotations?

In TEI XML, standoff markup is a common way to record variant readings across manuscripts. Digital Engishiki records textual variants between manuscripts with <app> elements: a pair of <anchor> elements marks the start and end of the affected range in the text, while the corresponding <app> element is placed elsewhere and points back to the anchors through its from and to attributes:

<p>
  preceding text
  <anchor xml:id="app001"/>
  text with variant
  <anchor xml:id="app001e"/>
  following text
</p>

<!-- variant information placed separately -->
<app from="#app001" to="#app001e">
  <lem>text with variant</lem>
  <rdg wit="#manuscript_A">different text</rdg>
</app>

This approach has the advantage of bypassing XML’s nesting constraint. Even when a variant’s range crosses element boundaries (the overlapping hierarchy problem), anchors can be placed anywhere.
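To make the pointer relationship concrete, here is a minimal sketch that resolves each standoff <app> to its anchor pair. It uses Python's standard-library xml.dom.minidom purely for illustration (the project itself uses BeautifulSoup), and the function name resolve_apps is hypothetical:

```python
from xml.dom import minidom

TEI = """<body>
<p>preceding text<anchor xml:id="app001"/>text with variant<anchor xml:id="app001e"/>following text</p>
<app from="#app001" to="#app001e">
  <lem>text with variant</lem>
  <rdg wit="#manuscript_A">different text</rdg>
</app>
</body>"""

def resolve_apps(doc):
    """Return (app, start_anchor, end_anchor) triples for every standoff <app>."""
    # index all anchors by their xml:id so each lookup is a dictionary access
    anchors = {a.getAttribute("xml:id"): a
               for a in doc.getElementsByTagName("anchor")}
    triples = []
    for app in doc.getElementsByTagName("app"):
        # from/to hold fragment references such as "#app001"; drop the "#"
        start = anchors.get(app.getAttribute("from").lstrip("#"))
        end = anchors.get(app.getAttribute("to").lstrip("#"))
        triples.append((app, start, end))
    return triples

doc = minidom.parseString(TEI)
app, start, end = resolve_apps(doc)[0]
print(start.getAttribute("xml:id"), end.getAttribute("xml:id"))
print(start.nextSibling.data)  # the text node between the two anchors
```

In this simple case the annotated text is just the sibling that follows the start anchor, which is what makes the same-parent situation easy to handle.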

Why Convert to Inline?

Mapping XML Trees to UI Component Trees

The Digital Engishiki viewer is built with Vue.js. In component-based frameworks like Vue.js and React, the UI is described as a tree structure. Since TEI XML is also a tree, a natural approach is to recursively render each XML element by mapping it 1:1 to a UI component:

<!-- TEI.vue: recursively map XML elements to components -->
<template>
  <component v-for="child in element.children"
             :is="getComponent(child.tagName)"
             :element="child" />
</template>

With this design, if the <app> element exists inline within the text, rendering completes through tree traversal alone:

<!-- after inlining: renderable by tree traversal -->
<p>
  preceding text
  <app xml:id="app001">
    <note type="base">text with variant</note>
    <lem>text with variant</lem>
    <rdg wit="#manuscript_A">different text</rdg>
  </app>
  following text
</p>

With standoff markup, every time the renderer encounters an <anchor> mid-text, it needs to search for the corresponding <app> in a different part of the tree and inject it. Referencing a different node mid-traversal does not fit well with recursive rendering.
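The dispatch-by-tag recursion behind this design can also be sketched outside Vue. The following Python version is only an analogy to the viewer's component mapping, not its code: the RENDERERS table and the bracket notation for variants are assumptions made for the example.

```python
from xml.dom import minidom

def render_children(el):
    """Default behaviour: render every child in order and concatenate."""
    return "".join(render(child) for child in el.childNodes)

def render_app(el):
    """Show only the lemma of an inline <app>, wrapped in brackets."""
    lem = el.getElementsByTagName("lem")[0]
    return "[" + render_children(lem) + "]"

# hypothetical tag-to-renderer table, playing the role of getComponent()
RENDERERS = {"app": render_app}

def render(node):
    if node.nodeType == node.TEXT_NODE:
        return node.data
    # unknown tags fall back to plain recursion over their children
    return RENDERERS.get(node.nodeName, render_children)(node)

INLINE = ('<p>preceding text <app xml:id="app001">'
          '<lem>text with variant</lem>'
          '<rdg wit="#manuscript_A">different text</rdg>'
          '</app> following text</p>')

doc = minidom.parseString(INLINE)
print(render(doc.documentElement))
# preceding text [text with variant] following text
```

Every renderer only ever looks at the node it was handed and that node's children. A standoff <app> would break this property, because rendering an <anchor> would require a lookup elsewhere in the document.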

Additional benefits include easier search index construction (a full-text scan of each <p> element suffices) and good compatibility with static site generation (no complex runtime logic needed).

That said, this is an architectural choice. Alternatives such as the Web Annotation model, which overlays annotations at the display layer while keeping them in standoff form, are also viable.

Convertibility Between Standoff and Inline

It is worth clarifying whether standoff and inline can be freely converted between each other.

Inline → Standoff: Always Possible

Any inline annotation can be mechanically converted to standoff by inserting anchors at the start/end positions and moving the element body elsewhere. No information is lost.
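As a rough illustration of how mechanical this direction is, the sketch below converts an inline <app> into an anchor pair plus a standoff <app>. It uses the standard-library xml.dom.minidom rather than BeautifulSoup. The function name, the choice to keep the lemma text in the running text, and the "e" suffix for the end anchor (copied from the earlier example) are assumptions.

```python
from xml.dom import minidom

INLINE = ('<body><p>preceding text<app xml:id="app001">'
          '<lem>text with variant</lem>'
          '<rdg wit="#manuscript_A">different text</rdg>'
          '</app>following text</p></body>')

def inline_to_standoff(doc):
    """Replace each inline <app> with two anchors and move its body to the root."""
    root = doc.documentElement
    for app in list(doc.getElementsByTagName("app")):
        app_id = app.getAttribute("xml:id")
        parent = app.parentNode
        start = doc.createElement("anchor")
        start.setAttribute("xml:id", app_id)
        end = doc.createElement("anchor")
        end.setAttribute("xml:id", app_id + "e")
        # the lemma text stays in the running text, between the two anchors
        parent.insertBefore(start, app)
        lem = app.getElementsByTagName("lem")[0]
        for child in lem.childNodes:
            parent.insertBefore(child.cloneNode(True), app)
        parent.insertBefore(end, app)
        parent.removeChild(app)
        # rebuild the <app> as a standoff element that points at the anchors
        standoff = doc.createElement("app")
        standoff.setAttribute("from", "#" + app_id)
        standoff.setAttribute("to", "#" + app_id + "e")
        while app.firstChild is not None:
            standoff.appendChild(app.firstChild)  # appendChild moves the node
        root.appendChild(standoff)
    return doc

doc = inline_to_standoff(minidom.parseString(INLINE))
p = doc.getElementsByTagName("p")[0]
print([n.nodeName for n in p.childNodes])
# ['#text', 'anchor', '#text', 'anchor', '#text']
```

No range analysis is needed here: an inline element is by construction properly nested, so its start and end positions are always known.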

Standoff → Inline: Not Always Possible

When an annotation range partially overlaps another element or annotation, the result cannot be expressed as well-formed XML. The TEI community calls this constraint the “overlapping hierarchy problem”:

<!-- two annotation ranges partially overlap -->
Text:         A B C D E
Annotation 1:  [B C D]
Annotation 2:    [C D E]

<!-- attempting to inline... -->
A <ann1>B <ann2>C D</ann1> E</ann2>   ← not well-formed XML

XML requires proper nesting, and two partially overlapping elements cannot be represented. Standoff markup was designed precisely to work around this constraint.
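The condition that makes inlining impossible can be stated as a small predicate. The sketch below assumes ranges are given as half-open (start, end) token offsets, a representation chosen for the example only:

```python
def partially_overlap(a, b):
    """True if two half-open ranges cross each other without nesting."""
    (a_start, a_end), (b_start, b_end) = sorted([a, b])
    # after sorting, a starts first (or at the same point);
    # the ranges cross only if b starts inside a and ends outside it
    return a_start < b_start < a_end < b_end

# token offsets for "A B C D E": A=0 ... E=4, end exclusive
ann1 = (1, 4)   # B C D
ann2 = (2, 5)   # C D E
print(partially_overlap(ann1, ann2))       # True: cannot be inlined as-is
print(partially_overlap((1, 4), (2, 3)))   # False: nested
print(partially_overlap((0, 2), (2, 5)))   # False: adjacent
```

Nested and adjacent ranges map directly onto XML elements. Only the crossing case needs a workaround.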

The Digital Engishiki Case

The variant annotation (<app>) that triggered our bug crossed the boundary of existing structural elements (<measure>):

<measure>...<anchor start/>...</measure>...<measure>...<anchor end/>...</measure>

The range of <app> and the range of <measure> overlap, so <app> cannot be properly nested inside or outside <measure>. The current implementation therefore copies the content between anchors into a <note> element — an approximate conversion involving content duplication rather than strict inlining.

Direction	Feasibility	Condition
Inline → Standoff	Always possible	Unconditional
Standoff → Inline	Conditional	Only when annotation ranges do not overlap with other elements
Standoff → Inline (with overlap)	Approximately possible	Requires workarounds such as content duplication or element splitting

This asymmetry is why standoff is the safer canonical form for the data: it can represent any range, including overlapping ones, whereas inlining is possible only under conditions and carries inherent complexity.

Anchor Pairs Crossing Element Boundaries

Most variant annotations fit within a single parent element, but some cross element boundaries. In the Engishiki, tribute quantities are marked up with <measure> elements, and some variant ranges cross these <measure> boundaries:

<!-- Engishiki, Book 24 (Shukei-ryō jō): tribute from Iyo Province -->
<p>
  <measure quantity="4">
    <anchor xml:id="app001"/>
    四
    <unit unitRef="#疋"> 疋 </unit>
  </measure>
  、
  <measure commodity="#緋帛">
    緋
    <anchor xml:id="app001e"/>
  </measure>
</p>

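The start anchor (anchor_start in the code below) is inside the first <measure>, and the end anchor (anchor_end) is inside the second. Following next_sibling from the start anchor stops at the end of the first <measure>, so a sibling walk alone can never reach the end anchor.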
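The limitation is easy to reproduce. The sketch below walks the sibling chain on a whitespace-free copy of the example, using the standard-library xml.dom.minidom (whose nextSibling plays the role of BeautifulSoup's next_sibling):

```python
from xml.dom import minidom

XML = ('<p><measure quantity="4"><anchor xml:id="app001"/>四'
       '<unit unitRef="#疋">疋</unit></measure>、'
       '<measure commodity="#緋帛">緋<anchor xml:id="app001e"/></measure></p>')

doc = minidom.parseString(XML)
anchors = {a.getAttribute("xml:id"): a
           for a in doc.getElementsByTagName("anchor")}
start, end = anchors["app001"], anchors["app001e"]

# Walk the sibling chain from the start anchor, as a same-parent algorithm would.
seen = []
node = start.nextSibling
while node is not None and node is not end:
    seen.append(node.nodeName)
    node = node.nextSibling

print(seen)          # ['#text', 'unit']
print(node is end)   # False: the walk ran off the end of the first <measure>
```

The walk collects only the content of the first <measure>. The separator 、 and the content of the second <measure> are siblings of ancestors, not of the anchor itself.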

Previous Implementation: String Manipulation + Re-parsing

The original implementation assumed that DOM operations alone could not handle this case. It serialized the entire XML document to a string, edited the string, and re-parsed the result:

# previous implementation (simplified)
soup_str = str(self.soup)                          # serialize entire XML
between = soup_str.split(start_str)[1].split(end_str)[0]  # extract between anchors

# remove content between anchors from string
start_pos = soup_str.index(start_str) + len(start_str)
end_pos = soup_str.index(end_str)
soup_str = soup_str[:start_pos] + soup_str[end_pos:]

# rebuild DOM from string
self.soup.__init__(soup_str, "xml")  # ← the problem

Why It Breaks

The content between anchors includes structural tag boundaries:

四 <unit>疋</unit> </measure> 、 <measure ...> 緋
                    ^^^^^^^^^     ^^^^^^^^^^^^
                    closing tag   opening tag

Removing this from the string breaks the XML nesting structure:

<!-- before removal -->
<measure quantity="4">...<anchor start/>四<unit>疋</unit></measure>、<measure>緋<anchor end/>...</measure>

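<!-- after removal (two measure elements fused into one) -->
<measure quantity="4">...<anchor start/><anchor end/>...</measure>

With the closing </measure> and the opening <measure> removed, the tags in the string no longer correspond to the original structure: the two <measure> elements are fused into one, and the extracted fragment carries an unmatched </measure> and an unmatched <measure>, so it is not well-formed on its own. When unbalanced markup of this kind is re-parsed, BeautifulSoup’s parser attempts to “repair” it, and in doing so rearranges the document structure in unintended ways.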
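The effect of slicing through tag boundaries can be reproduced on a toy string. The sketch below uses the standard library's strict expat-based parser instead of BeautifulSoup's recovering one, so the damage surfaces as an error rather than a silent repair. The sample markup and anchor strings are simplified assumptions.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

START = '<anchor xml:id="app001"/>'
END = '<anchor xml:id="app001e"/>'
SRC = ('<p><measure quantity="4">' + START + '四<unit>疋</unit></measure>、'
       '<measure>緋' + END + '</measure></p>')

def is_well_formed(xml_text):
    """True if a strict XML parser accepts the text."""
    try:
        minidom.parseString(xml_text)
        return True
    except ExpatError:
        return False

# same slicing as the string-based approach
start_pos = SRC.index(START) + len(START)
end_pos = SRC.index(END)
between = SRC[start_pos:end_pos]
remainder = SRC[:start_pos] + SRC[end_pos:]

print(between)  # 四<unit>疋</unit></measure>、<measure>緋
print(is_well_formed("<note>" + between + "</note>"))          # False
print(SRC.count("<measure"), "->", remainder.count("<measure"))  # 2 -> 1
```

The extracted span contains one closing tag and one opening tag that belong to different elements, so it cannot be wrapped in a new element, and the string it was cut from has lost one of its two <measure> elements.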

In Digital Engishiki, the critical edition (<div type="original">) and modern Japanese translation (<div type="japanese">) are managed in the same file. The re-parsing caused the contents of the Japanese translation <div> to migrate into the critical edition <div>. As a result, 84 out of 87 items from Book 24 (Shukei-ryō jō) were missing from the search index, and queries for terms like “鰒” (abalone) or “緋” (scarlet silk) returned no results for that book.

Solution: DOM-Based Range Operations

We eliminated string manipulation entirely and reimplemented using only BeautifulSoup’s DOM operations.

Phase Decomposition from the Common Ancestor

When two anchors are in different parent elements, we decompose the operation based on their Lowest Common Ancestor (LCA):

def _collect_between_content(self, anchor_start, anchor_end):
    common = self._find_common_ancestor(anchor_start, anchor_end)
    start_path = self._path_to_ancestor(anchor_start, common)
    end_path = self._path_to_ancestor(anchor_end, common)

          common (p)
         /     |     \
   measure_A   、   measure_B
   /    \              |
anchor_s  unit      anchor_e
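The two helpers named above are not shown in the excerpt. One plausible shape for them, sketched with the standard-library xml.dom.minidom instead of BeautifulSoup and with hypothetical function bodies, is:

```python
from xml.dom import minidom

def ancestors(node):
    """Return [node, parent, grandparent, ...] up to the document node."""
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parentNode
    return chain

def find_common_ancestor(a, b):
    """Lowest node whose subtree contains both a and b."""
    a_chain = ancestors(a)
    for candidate in ancestors(b):
        if any(candidate is n for n in a_chain):
            return candidate
    return None

def path_to_ancestor(node, ancestor):
    """Nodes from `node` up to, but not including, `ancestor`."""
    path = []
    while node is not ancestor:
        path.append(node)
        node = node.parentNode
    return path

XML = ('<p><measure><anchor xml:id="app001"/>四<unit>疋</unit></measure>、'
       '<measure>緋<anchor xml:id="app001e"/></measure></p>')
doc = minidom.parseString(XML)
start, end = doc.getElementsByTagName("anchor")

common = find_common_ancestor(start, end)
print(common.nodeName)                                      # p
print([n.nodeName for n in path_to_ancestor(start, common)])  # ['anchor', 'measure']
print([n.nodeName for n in path_to_ancestor(end, common)])    # ['anchor', 'measure']
```

The last entry of each path is the child of the common ancestor on that side (start_child and end_child in the table of phases), which is what the later phases key on.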

Collection and removal are performed in five phases:

Phase	Target	Operation
1	Siblings after anchor_start	Collect within parent (partially collect elements containing anchor_end)
2	Siblings after nodes on start-side path	Collect while traversing upward at each level
3	Intermediate siblings under common	Collect between start_child and end_child
4	Siblings before nodes on end-side path	Collect while traversing downward at each level
5	Siblings before anchor_end	Collect within parent
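The phases in the table can be condensed into a short collection routine. The following is a simplified sketch, not the project's code: it uses the standard-library xml.dom.minidom, folds phases 1-2 and 4-5 into one loop each, collects deep copies into a flat list, and assumes the start anchor precedes the end anchor in document order.

```python
from xml.dom import minidom

def path_up(node, stop):
    """Nodes from `node` up to, but not including, `stop`."""
    path = []
    while node is not stop:
        path.append(node)
        node = node.parentNode
    return path

def common_ancestor(a, b):
    chain = path_up(a, None)            # a, its parent, ..., the document
    node = b
    while not any(node is n for n in chain):
        node = node.parentNode
    return node

def collect_between(start, end):
    """Copy every node lying between two anchors, in document order."""
    common = common_ancestor(start, end)
    start_path = path_up(start, common)   # [start, ..., start_child]
    end_path = path_up(end, common)       # [end, ..., end_child]
    collected = []

    # Phases 1-2: following siblings at each level below start_child
    for node in start_path[:-1]:
        sib = node.nextSibling
        while sib is not None:
            collected.append(sib.cloneNode(True))
            sib = sib.nextSibling

    # Phase 3: siblings strictly between start_child and end_child
    sib = start_path[-1].nextSibling
    while sib is not end_path[-1]:
        collected.append(sib.cloneNode(True))
        sib = sib.nextSibling

    # Phases 4-5: preceding siblings at each level below end_child, top down
    for node in reversed(end_path[:-1]):
        sib = node.parentNode.firstChild
        while sib is not node:
            collected.append(sib.cloneNode(True))
            sib = sib.nextSibling
    return collected

XML = ('<p><measure><anchor xml:id="app001"/>四<unit>疋</unit></measure>、'
       '<measure>緋<anchor xml:id="app001e"/></measure></p>')
doc = minidom.parseString(XML)
start, end = doc.getElementsByTagName("anchor")
print("".join(n.toxml() for n in collect_between(start, end)))
# 四<unit>疋</unit>、緋
```

Only whole nodes are copied, so the collected content never contains a stray </measure> or <measure>. The structural elements that merely enclose an anchor are left out of the copy.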

Partial Processing of Elements Containing anchor_end

In Phase 1, a sibling of anchor_start may contain anchor_end as a descendant (e.g., anchor_end inside a <unit> element). In this case, rather than collecting the entire element, we recursively collect only the content before anchor_end:

def _collect_siblings_after_safe(self, element, anchor_end, collected):
    current = element.next_sibling
    while current:
        if self._contains(current, anchor_end):
            # element contains anchor_end → partial collection
            self._collect_up_to_anchor(current, anchor_end, collected)
            break
        collected.append(self._deep_copy_node(current))
        current = current.next_sibling

Removal follows the same pattern, with partial removal for elements containing anchor_end. This enables content collection and removal without breaking structural tags.
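The helpers called in the excerpt (_contains, _collect_up_to_anchor) are not shown. A minimal sketch of how they might work, written against the standard-library xml.dom.minidom and with hypothetical function bodies, follows. Note that this version flattens the partially collected element, dropping its wrapper tag, which is a simplification chosen for the example.

```python
from xml.dom import minidom

def contains(node, target):
    """True if `target` is `node` itself or one of its descendants."""
    while target is not None:
        if target is node:
            return True
        target = target.parentNode
    return False

def collect_up_to_anchor(element, anchor_end, collected):
    """Copy the children of `element` that precede `anchor_end`."""
    for child in element.childNodes:
        if child is anchor_end:
            return
        if contains(child, anchor_end):
            # the anchor is deeper down: recurse instead of copying the whole child
            collect_up_to_anchor(child, anchor_end, collected)
            return
        collected.append(child.cloneNode(True))

def collect_siblings_after(element, anchor_end, collected):
    current = element.nextSibling
    while current is not None:
        if contains(current, anchor_end):
            collect_up_to_anchor(current, anchor_end, collected)
            break
        collected.append(current.cloneNode(True))
        current = current.nextSibling

XML = '<p><anchor xml:id="s"/>四<unit>疋<anchor xml:id="e"/>以上</unit>。</p>'
doc = minidom.parseString(XML)
start, end = doc.getElementsByTagName("anchor")

collected = []
collect_siblings_after(start, end, collected)
print("".join(n.toxml() for n in collected))   # 四疋
```

The text after the end anchor (以上 and 。) is never visited, and the original tree is untouched because only copies are collected.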

Well-formedness Guarantee

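Every DOM operation moves, copies, or removes whole nodes, so the tree can never contain an unmatched tag and the serialized output is always well-formed. Well-formedness alone does not prove that the structure is the intended one, so we also added tests that re-parse the output and verify structural consistency: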

def test_cross_parent_preserves_document_structure():
    # run conversion
    replacer = AppReplacer(soup, verbose=False)
    replacer.process()

    # re-parse and verify structure
    reparsed = BeautifulSoup(str(soup), "xml")

    original = reparsed.find("div", type="original")
    japanese = reparsed.find("div", type="japanese")
    assert original is not None
    assert japanese is not None

    # no translation items should leak into the critical edition div
    for item in original.find_all("p", ana="項"):
        assert not item.get("xml:id", "").startswith("ja-")

Design Observations

Risks of Editing XML via String Manipulation

Partial string manipulation of XML always carries the risk of breaking structural tag correspondence. Even when it appears to work in some cases, certain data patterns can cause collapse, and as in our case, the problem may go undetected for some time. DOM operations are more verbose but guarantee structural integrity.

Testing Build-Time Conversions

When incorporating standoff-to-inline conversion into a build pipeline, complexity concentrates in the conversion step. The following types of tests proved effective as a safety net:

Structure preservation tests: re-parse the converted XML and verify that div structures are maintained
Item count tests: verify that the number of critical edition and translation items per volume matches expectations
Leak tests: verify that no items from one language div appear in another
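The item count and leak checks can share one small helper. The sketch below uses the standard-library xml.dom.minidom. It assumes a flat layout of <div type="..."> elements and reuses the "ja-" xml:id prefix from the test above as the marker for translation items. The function name structure_report is hypothetical.

```python
from xml.dom import minidom

def structure_report(xml_text):
    """Count <p ana="項"> items per div type and list ids found in the wrong div."""
    doc = minidom.parseString(xml_text)
    counts, leaks = {}, []
    for div in doc.getElementsByTagName("div"):
        div_type = div.getAttribute("type")
        items = [p for p in div.getElementsByTagName("p")
                 if p.getAttribute("ana") == "項"]
        counts[div_type] = len(items)
        for p in items:
            item_id = p.getAttribute("xml:id")
            # a "ja-" item belongs in the japanese div, and nothing else does
            if item_id.startswith("ja-") != (div_type == "japanese"):
                leaks.append(item_id)
    return counts, leaks

GOOD = ('<body>'
        '<div type="original"><p ana="項" xml:id="item1">四疋</p></div>'
        '<div type="japanese"><p ana="項" xml:id="ja-item1">四疋</p></div>'
        '</body>')
BAD = ('<body>'
       '<div type="original"><p ana="項" xml:id="item1">四疋</p>'
       '<p ana="項" xml:id="ja-item1">四疋</p></div>'
       '<div type="japanese"/>'
       '</body>')

print(structure_report(GOOD))  # ({'original': 1, 'japanese': 1}, [])
print(structure_report(BAD))   # ({'original': 2, 'japanese': 0}, ['ja-item1'])
```

Comparing the returned counts against expected per-volume numbers catches missing items, while a non-empty leak list catches the kind of cross-div migration described earlier.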

Generality of DOM Range Operations

The pattern implemented here — collecting and removing content between two points via LCA identification, path decomposition, and phased processing — is essentially the same operation as the browser’s DOM Range API. Since BeautifulSoup and lxml do not provide this functionality built-in, it may be reusable for cross-element-boundary processing in XML/HTML beyond TEI.

Summary

Conversion between standoff and inline is asymmetric: inline to standoff is always possible, but the reverse requires workarounds when annotation ranges overlap (the overlapping hierarchy problem)
Build-time inlining is a practical choice due to its compatibility with Vue.js/React recursive rendering and ease of search index construction
String manipulation + re-parsing risks breaking structural tag correspondence; DOM-based phase decomposition from the common ancestor provides a safe conversion path
When adopting build-time conversion, structure preservation tests are essential for quality assurance
