Digital Engishiki is a project that encodes the Engishiki — a collection of supplementary regulations for the ritsuryō legal system, completed in 927 CE — in TEI (Text Encoding Initiative) XML, making it browsable and searchable on the web. Led by the National Museum of Japanese History, the project provides TEI markup for critical editions, modern Japanese translations, and English translations, served through a Nuxt.js (Vue.js) based viewer.
During development, we encountered a bug where converting TEI XML standoff annotations to inline annotations caused the XML document structure to collapse. This article records the cause and the DOM-based solution.
What Are Standoff Annotations?
In TEI XML, standoff markup is a common approach for recording variant readings across manuscripts. In Digital Engishiki, textual variants between multiple manuscripts are recorded using <app> elements. <anchor> elements mark the range in the text, while the corresponding <app> element is placed elsewhere:
<p>
  preceding text
  <anchor xml:id="app001"/>
  text with variant
  <anchor xml:id="app001e"/>
  following text
</p>
<!-- variant information placed separately -->
<app from="#app001" to="#app001e">
  <lem>text with variant</lem>
  <rdg wit="#manuscript_A">different text</rdg>
</app>
This approach has the advantage of bypassing XML’s nesting constraint. Even when a variant’s range crosses element boundaries (the overlapping hierarchy problem), anchors can be placed anywhere.
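As a rough illustration of how a consumer resolves these pointers, the following sketch looks up the two anchors that an <app> refers to. It uses Python's standard-library xml.dom.minidom rather than the project's own tooling; the <root> wrapper and the helper name resolve_app are assumptions made for the example.

```python
from xml.dom import minidom

# Hypothetical sample: a <root> wrapper is added so the fragment parses on its own.
SAMPLE = """<root>
<p>preceding text<anchor xml:id="app001"/>text with variant<anchor xml:id="app001e"/>following text</p>
<app from="#app001" to="#app001e">
<lem>text with variant</lem>
<rdg wit="#manuscript_A">different text</rdg>
</app>
</root>"""


def resolve_app(doc, app):
    """Return the (start, end) <anchor> elements that an <app> points at."""
    # Index anchors by xml:id by hand; minidom only treats an attribute
    # as an ID when it has been told to.
    anchors = {a.getAttribute("xml:id"): a
               for a in doc.getElementsByTagName("anchor")}
    start = anchors[app.getAttribute("from").lstrip("#")]
    end = anchors[app.getAttribute("to").lstrip("#")]
    return start, end


doc = minidom.parseString(SAMPLE)
app = doc.getElementsByTagName("app")[0]
start, end = resolve_app(doc, app)
print(start.getAttribute("xml:id"), end.getAttribute("xml:id"))
```

The lookup itself is cheap; the difficulty discussed in the rest of this article comes from what lies between the two anchors.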
Why Convert to Inline?
Mapping XML Trees to UI Component Trees
The Digital Engishiki viewer is built with Vue.js. In component-based frameworks like Vue.js and React, the UI is described as a tree structure. Since TEI XML is also a tree, a natural approach is to recursively render each XML element by mapping it 1:1 to a UI component:
<!-- TEI.vue: recursively map XML elements to components -->
<template>
  <component v-for="child in element.children"
             :is="getComponent(child.tagName)"
             :element="child" />
</template>
With this design, if the <app> element exists inline within the text, rendering completes through tree traversal alone:
<!-- after inlining: renderable by tree traversal -->
<p>
  preceding text
  <app xml:id="app001">
    <note type="base">text with variant</note>
    <lem>text with variant</lem>
    <rdg wit="#manuscript_A">different text</rdg>
  </app>
  following text
</p>
With standoff markup, every time the renderer encounters an <anchor> mid-text, it needs to search for the corresponding <app> in a different part of the tree and inject it. Referencing a different node mid-traversal does not fit well with recursive rendering.
Additional benefits include easier search index construction (a full-text scan of each <p> element suffices) and good compatibility with static site generation (no complex runtime logic needed).
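The search-index point can be sketched as follows. This is a minimal illustration with Python's standard-library xml.dom.minidom, not the project's indexer; the sample ids (item-1, item-2) and the function names text_of and build_index are hypothetical.

```python
from xml.dom import minidom

# Hypothetical inlined output with two items.
SAMPLE = """<div type="original">
<p xml:id="item-1">preceding text <app xml:id="app001"><lem>text with variant</lem><rdg wit="#manuscript_A">different text</rdg></app> following text</p>
<p xml:id="item-2">another item</p>
</div>"""


def text_of(node):
    """Concatenate all descendant text nodes in document order."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(text_of(child) for child in node.childNodes)


def build_index(doc):
    """Map each <p> id to its whitespace-normalised text."""
    index = {}
    for p in doc.getElementsByTagName("p"):
        index[p.getAttribute("xml:id")] = " ".join(text_of(p).split())
    return index


index = build_index(minidom.parseString(SAMPLE))
hits = [pid for pid, text in index.items() if "variant" in text]
print(hits)
```

Because the <app> is inline, the scan picks up the variant readings together with the base text. Whether readings should be searchable alongside the lemma is an editorial decision that this sketch does not settle.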
That said, this is an architectural choice. Alternatives such as the Web Annotation model, which overlays annotations at the display layer while keeping them in standoff form, are also viable.
Convertibility Between Standoff and Inline
Before looking at the bug, it is worth clarifying whether standoff and inline markup can be freely converted into each other.
Inline → Standoff: Always Possible
Any inline annotation can be mechanically converted to standoff by inserting anchors at the start/end positions and moving the element body elsewhere. No information is lost.
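A minimal sketch of that mechanical conversion, again with the standard-library xml.dom.minidom. It assumes every inline <app> has an xml:id and a <lem> child, and it appends the standoff <app> to the document root; the function name inline_to_standoff and the <root> wrapper are hypothetical.

```python
from xml.dom import minidom

INLINE = ('<root><p>preceding text<app xml:id="app001">'
          '<lem>text with variant</lem>'
          '<rdg wit="#manuscript_A">different text</rdg>'
          '</app>following text</p></root>')


def inline_to_standoff(doc):
    """Replace each inline <app> with an anchor pair and store the <app> apart from the text."""
    root = doc.documentElement
    for app in list(doc.getElementsByTagName("app")):
        app_id = app.getAttribute("xml:id")
        parent = app.parentNode
        start = doc.createElement("anchor")
        start.setAttribute("xml:id", app_id)
        end = doc.createElement("anchor")
        end.setAttribute("xml:id", app_id + "e")
        lem = app.getElementsByTagName("lem")[0]
        # The base text stays in place, now delimited by the two anchors.
        parent.insertBefore(start, app)
        for child in list(lem.childNodes):
            parent.insertBefore(child.cloneNode(True), app)
        parent.insertBefore(end, app)
        # Build the standoff <app>, move the readings into it, drop the inline one.
        standoff = doc.createElement("app")
        standoff.setAttribute("from", "#" + app_id)
        standoff.setAttribute("to", "#" + app_id + "e")
        while app.firstChild is not None:
            standoff.appendChild(app.firstChild)
        parent.removeChild(app)
        root.appendChild(standoff)
    return doc


doc = inline_to_standoff(minidom.parseString(INLINE))
print(doc.documentElement.toxml())
```

Nothing in this direction depends on what surrounds the annotation, which is why it always succeeds.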
Standoff → Inline: Not Always Possible
When annotation ranges overlap, they cannot be expressed as well-formed XML. This is the constraint known as the “overlapping hierarchy problem” in the TEI community:
<!-- two annotation ranges partially overlap -->
Text: A B C D E
Annotation 1: [B C D]
Annotation 2: [C D E]
<!-- attempting to inline... -->
A <ann1>B <ann2>C D</ann1> E</ann2> ← not well-formed XML
XML requires proper nesting, and two partially overlapping elements cannot be represented. Standoff markup was designed precisely to work around this constraint.
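Whether two ranges can be nested is easy to check once they are expressed as offsets. The sketch below is an illustration only: it models ranges as (start, end) token offsets with an exclusive end, and the function name partially_overlap is hypothetical.

```python
def partially_overlap(a, b):
    """True when ranges a and b cross each other without one containing the other."""
    (a_start, a_end), (b_start, b_end) = a, b
    disjoint = a_end <= b_start or b_end <= a_start
    a_in_b = b_start <= a_start and a_end <= b_end
    b_in_a = a_start <= b_start and b_end <= a_end
    return not (disjoint or a_in_b or b_in_a)


# Token offsets in "A B C D E": annotation 1 covers B..D, annotation 2 covers C..E.
ann1 = (1, 4)
ann2 = (2, 5)
print(partially_overlap(ann1, ann2))      # True: the ranges cannot be nested
print(partially_overlap((1, 4), (2, 3)))  # False: the second range nests inside the first
print(partially_overlap((0, 1), (3, 5)))  # False: the ranges are disjoint
```

Only the first case blocks strict inlining; disjoint and nested ranges map directly onto XML elements.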
The Digital Engishiki Case
The variant annotation (<app>) that triggered our bug crossed the boundary of existing structural elements (<measure>):
<measure>...<anchor start/>...</measure>...<measure>...<anchor end/>...</measure>
The range of <app> and the range of <measure> overlap, so <app> cannot be properly nested inside or outside <measure>. The current implementation therefore copies the content between anchors into a <note> element — an approximate conversion involving content duplication rather than strict inlining.
| Direction | Feasibility | Condition |
|---|---|---|
| Inline → Standoff | Always possible | Unconditional |
| Standoff → Inline | Conditional | Only when the annotation range does not partially overlap other elements or other annotations |
| Standoff → Inline (with overlap) | Approximately possible | Requires workarounds such as content duplication or element splitting |
This asymmetry is why standoff works well as the canonical form of the data, and why inlining carries inherent complexity.
Anchor Pairs Crossing Element Boundaries
Most variant annotations fit within a single parent element, but some cross element boundaries. In the Engishiki, tribute quantities are marked up with <measure> elements, and some variant ranges cross these <measure> boundaries:
<!-- Engishiki, Book 24 (Shukei-ryō jō): tribute from Iyo Province -->
<p>
  <measure quantity="4">
    <anchor xml:id="app001"/>
    四
    <unit unitRef="#疋"> 疋 </unit>
  </measure>
  、
  <measure commodity="#緋帛">
    緋
    <anchor xml:id="app001e"/>
  </measure>
</p>
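The start anchor (app001) sits inside the first <measure>, and the end anchor (app001e) sits inside the second. Because the two anchors have different parents, following next_sibling from the start anchor never reaches the end anchor.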
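The limitation is easy to reproduce. The sketch below uses the standard-library xml.dom.minidom (whose nextSibling plays the role of BeautifulSoup's next_sibling) on a whitespace-free version of the example; the helper name reaches_by_siblings is hypothetical.

```python
from xml.dom import minidom

SAMPLE = ('<p><measure quantity="4"><anchor xml:id="app001"/>四'
          '<unit unitRef="#疋">疋</unit></measure>、'
          '<measure commodity="#緋帛">緋<anchor xml:id="app001e"/></measure></p>')

doc = minidom.parseString(SAMPLE)
start, end = doc.getElementsByTagName("anchor")


def reaches_by_siblings(start, end):
    """Walk forward through siblings only and report whether `end` is met."""
    node = start.nextSibling
    while node is not None:
        if node is end:
            return True
        node = node.nextSibling
    return False


print(reaches_by_siblings(start, end))     # False
print(start.parentNode is end.parentNode)  # False
```

The walk ends at the last child of the first <measure>, so any sibling-only algorithm stops before it has seen the end anchor.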
Previous Implementation: String Manipulation + Re-parsing
The original implementation assumed that DOM operations alone could not handle this case, so it serialized the entire XML document to a string, edited the string, and re-parsed the result:
# previous implementation (simplified)
soup_str = str(self.soup) # serialize entire XML
between = soup_str.split(start_str)[1].split(end_str)[0] # extract between anchors
# remove content between anchors from string
start_pos = soup_str.index(start_str) + len(start_str)
end_pos = soup_str.index(end_str)
soup_str = soup_str[:start_pos] + soup_str[end_pos:]
# rebuild DOM from string
self.soup.__init__(soup_str, "xml") # ← the problem
Why It Breaks
The content between anchors includes structural tag boundaries:
四 <unit>疋</unit> </measure> 、 <measure ...> 緋
                   ^^^^^^^^^^    ^^^^^^^^^^^^^
                   closing tag   opening tag
Removing this from the string breaks the XML nesting structure:
<!-- before removal -->
<measure quantity="4">...<anchor start/>四<unit>疋</unit></measure>、<measure>緋<anchor end/>...</measure>
<!-- after removal (broken XML) -->
<measure quantity="4">...<anchor start/><anchor end/>...</measure>
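With the closing </measure> and the opening <measure> gone, the surviving tags no longer describe the original structure. In this simplified example they happen to pair up, which silently fuses the two <measure> elements into one; when a range crosses an unequal number of opening and closing tags, the string is no longer well-formed at all. BeautifulSoup’s parser attempts to “repair” such input, but in doing so, rearranges the entire document structure in unintended ways.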
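The effect of the slicing can be reproduced on the simplified example with the standard library alone. Two caveats: xml.etree.ElementTree is a strict parser that raises an error, whereas the project's BeautifulSoup setup recovers silently, and wrapping the extracted fragment in <note> is an assumption about how such a fragment would be reused.

```python
import xml.etree.ElementTree as ET

START = '<anchor xml:id="app001"/>'
END = '<anchor xml:id="app001e"/>'
SOURCE = ('<p><measure quantity="4">' + START + '四<unit>疋</unit></measure>'
          + '、<measure commodity="#緋帛">緋' + END + '</measure></p>')

# Same slicing idea as the previous implementation, applied to a plain string.
between = SOURCE.split(START)[1].split(END)[0]
start_pos = SOURCE.index(START) + len(START)
end_pos = SOURCE.index(END)
remaining = SOURCE[:start_pos] + SOURCE[end_pos:]


def is_well_formed(xml_text):
    """Return True when a strict XML parser accepts the text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


measures_before = len(ET.fromstring(SOURCE).findall("measure"))
measures_after = len(ET.fromstring(remaining).findall("measure"))
fragment_ok = is_well_formed("<note>" + between + "</note>")
print(fragment_ok)                      # False: the fragment has unmatched tags
print(measures_before, measures_after)  # 2 1: the two <measure> elements fused
```

In this reduced case the leftover string still parses, but it no longer has the structure of the source, and the extracted fragment cannot stand on its own. Either symptom is enough to corrupt a document once a recovering parser starts guessing.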
In Digital Engishiki, the critical edition (<div type="original">) and modern Japanese translation (<div type="japanese">) are managed in the same file. The re-parsing caused the contents of the Japanese translation <div> to migrate into the critical edition <div>. As a result, 84 out of 87 items from Book 24 (Shukei-ryō jō) were missing from the search index, and queries for terms like “鰒” (abalone) or “緋” (scarlet silk) returned no results for that book.
Solution: DOM-Based Range Operations
We eliminated string manipulation entirely and reimplemented using only BeautifulSoup’s DOM operations.
Phase Decomposition from the Common Ancestor
When two anchors are in different parent elements, we decompose the operation based on their Lowest Common Ancestor (LCA):
def _collect_between_content(self, anchor_start, anchor_end):
    # lowest element that contains both anchors
    common = self._find_common_ancestor(anchor_start, anchor_end)
    # chains of nodes leading from each anchor up to the common ancestor
    start_path = self._path_to_ancestor(anchor_start, common)
    end_path = self._path_to_ancestor(anchor_end, common)
            common (p)
          /     |     \
  measure_A     、     measure_B
   /     \               |
anchor_s  unit        anchor_e
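The two helpers are not shown in this article, so the following is only a sketch of what they might look like, written as free functions against the standard-library xml.dom.minidom instead of BeautifulSoup. It assumes that neither node contains the other, which holds for empty <anchor> elements.

```python
from xml.dom import minidom

SAMPLE = ('<p><measure quantity="4"><anchor xml:id="app001"/>四'
          '<unit unitRef="#疋">疋</unit></measure>、'
          '<measure commodity="#緋帛">緋<anchor xml:id="app001e"/></measure></p>')


def find_common_ancestor(a, b):
    """Lowest node that contains both a and b."""
    ancestors = []
    node = a.parentNode
    while node is not None:
        ancestors.append(node)
        node = node.parentNode
    node = b.parentNode
    while node is not None:
        if any(node is candidate for candidate in ancestors):
            return node
        node = node.parentNode
    raise ValueError("nodes are not in the same tree")


def path_to_ancestor(node, ancestor):
    """Nodes from `node` up to, but excluding, `ancestor` (nearest first)."""
    path = []
    while node is not None and node is not ancestor:
        path.append(node)
        node = node.parentNode
    if node is None:
        raise ValueError("ancestor is not above node")
    return path


doc = minidom.parseString(SAMPLE)
start, end = doc.getElementsByTagName("anchor")
common = find_common_ancestor(start, end)
print(common.tagName)                                          # p
print([n.nodeName for n in path_to_ancestor(start, common)])   # ['anchor', 'measure']
```

The last entry of each path is the child of the common ancestor on that side, which is the pivot for the phases described next.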
Collection and removal are performed in five phases:
| Phase | Target | Operation |
|---|---|---|
| 1 | Siblings after anchor_start | Collect within parent (partially collect elements containing anchor_end) |
| 2 | Siblings after nodes on start-side path | Collect while traversing upward at each level |
| 3 | Intermediate siblings under common | Collect between start_child and end_child |
| 4 | Siblings before nodes on end-side path | Collect while traversing downward at each level |
| 5 | Siblings before anchor_end | Collect within parent |
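The phases in the table can be sketched as three loops. This is an illustration built on the standard-library xml.dom.minidom, not the project's BeautifulSoup code; it assumes the start anchor precedes the end anchor in document order, and it returns a flat list of deep copies. All function names are hypothetical.

```python
from xml.dom import minidom

SAMPLE = ('<p><measure quantity="4"><anchor xml:id="app001"/>四'
          '<unit unitRef="#疋">疋</unit></measure>、'
          '<measure commodity="#緋帛">緋<anchor xml:id="app001e"/></measure></p>')


def ancestors_of(node):
    """Ancestors of `node`, nearest first."""
    result = []
    node = node.parentNode
    while node is not None:
        result.append(node)
        node = node.parentNode
    return result


def find_common_ancestor(a, b):
    candidates = ancestors_of(a)
    for node in ancestors_of(b):
        if any(node is c for c in candidates):
            return node
    raise ValueError("nodes are not in the same tree")


def path_to_ancestor(node, ancestor):
    """Nodes from `node` up to, but excluding, `ancestor`."""
    path = []
    while node is not ancestor:
        path.append(node)
        node = node.parentNode
    return path


def collect_between(start, end):
    """Deep copies of every node strictly between two empty anchors, in document order."""
    common = find_common_ancestor(start, end)
    start_path = path_to_ancestor(start, common)  # [start, ..., child of common]
    end_path = path_to_ancestor(end, common)      # [end, ..., child of common]
    collected = []
    # Phases 1-2: climbing from the start anchor, take the siblings after each node.
    for node in start_path[:-1]:
        sibling = node.nextSibling
        while sibling is not None:
            collected.append(sibling.cloneNode(True))
            sibling = sibling.nextSibling
    # Phase 3: siblings under the common ancestor, between the two branches.
    sibling = start_path[-1].nextSibling
    while sibling is not None and sibling is not end_path[-1]:
        collected.append(sibling.cloneNode(True))
        sibling = sibling.nextSibling
    # Phases 4-5: descending towards the end anchor, take the siblings before each node.
    for parent, child in zip(reversed(end_path[1:]), reversed(end_path[:-1])):
        sibling = parent.firstChild
        while sibling is not child:
            collected.append(sibling.cloneNode(True))
            sibling = sibling.nextSibling
    return collected


doc = minidom.parseString(SAMPLE)
start, end = doc.getElementsByTagName("anchor")
collected = collect_between(start, end)
print([node.nodeName for node in collected])
```

Because the paths are split at the common ancestor, an element that contains the end anchor is never copied whole in this sketch: it is the end-side branch, and only the children that precede the anchor are taken from it.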
Partial Processing of Elements Containing anchor_end
In Phase 1, a sibling of anchor_start may contain anchor_end as a descendant (e.g., anchor_end inside a <unit> element). In this case, rather than collecting the entire element, we recursively collect only the content before anchor_end:
def _collect_siblings_after_safe(self, element, anchor_end, collected):
    current = element.next_sibling
    # Compare against None explicitly: an empty text node is falsy
    # and would otherwise end the loop early.
    while current is not None:
        if self._contains(current, anchor_end):
            # element contains anchor_end → partial collection
            self._collect_up_to_anchor(current, anchor_end, collected)
            break
        collected.append(self._deep_copy_node(current))
        current = current.next_sibling
Removal follows the same pattern, with partial removal for elements containing anchor_end. This enables content collection and removal without breaking structural tags.
Well-formedness Guarantee
Because every step operates on whole nodes rather than on serialized text, the tree stays well-formed by construction, and the serialized output cannot contain mismatched tags. We also added tests that re-parse the output and verify structural consistency:
def test_cross_parent_preserves_document_structure():
    # `soup` is the parsed TEI document under test (setup omitted in this excerpt)
    # run conversion
    replacer = AppReplacer(soup, verbose=False)
    replacer.process()
    # re-parse and verify structure
    reparsed = BeautifulSoup(str(soup), "xml")
    original = reparsed.find("div", type="original")
    japanese = reparsed.find("div", type="japanese")
    assert original is not None
    assert japanese is not None
    # no translation items should leak into the critical edition div
    for item in original.find_all("p", ana="項"):
        assert not item.get("xml:id", "").startswith("ja-")
Design Observations
Risks of Editing XML via String Manipulation
Partial string manipulation of XML always carries the risk of breaking structural tag correspondence. Even when it appears to work in some cases, certain data patterns can cause collapse, and as in our case, the problem may go undetected for some time. DOM operations are more verbose but guarantee structural integrity.
Testing Build-Time Conversions
When incorporating standoff-to-inline conversion into a build pipeline, complexity concentrates in the conversion step. The following types of tests proved effective as a safety net:
- Structure preservation tests: re-parse the converted XML and verify that div structures are maintained
- Item count tests: verify that the number of critical edition and translation items per volume matches expectations
- Leak tests: verify that no items from one language div appear in another
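The three kinds of test above can be combined into one check. The sketch below is an assumption-laden illustration using the standard-library xml.dom.minidom: the sample document, the expected counts, and the rule that every translation id starts with "ja-" (borrowed from the test shown earlier) are hypothetical, as are the function names.

```python
from xml.dom import minidom

# Hypothetical converted output with two items per language.
CONVERTED = """<body>
<div type="original"><p ana="項" xml:id="item-1">a</p><p ana="項" xml:id="item-2">b</p></div>
<div type="japanese"><p ana="項" xml:id="ja-item-1">c</p><p ana="項" xml:id="ja-item-2">d</p></div>
</body>"""


def items_by_div(xml_text):
    """Map each div's type attribute to the xml:id values of its item paragraphs."""
    doc = minidom.parseString(xml_text)
    result = {}
    for div in doc.getElementsByTagName("div"):
        ids = [p.getAttribute("xml:id")
               for p in div.getElementsByTagName("p")
               if p.getAttribute("ana") == "項"]
        result[div.getAttribute("type")] = ids
    return result


def check_conversion(xml_text, expected_counts):
    """Raise AssertionError when structure, counts, or language separation is off."""
    items = items_by_div(xml_text)
    # Structure preservation: every expected div survives a re-parse.
    assert set(expected_counts) <= set(items), "a div is missing"
    # Item counts: each div holds the expected number of items.
    for div_type, count in expected_counts.items():
        assert len(items[div_type]) == count, f"{div_type}: {len(items[div_type])} items"
    # Leak test: translation ids stay out of the critical edition, and vice versa.
    assert not any(i.startswith("ja-") for i in items["original"]), "translation leaked"
    assert all(i.startswith("ja-") for i in items["japanese"]), "original leaked"
    return True


print(check_conversion(CONVERTED, {"original": 2, "japanese": 2}))
```

Running a check of this kind per volume in the build would have flagged the Book 24 regression as soon as the item counts diverged.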
Generality of DOM Range Operations
The pattern implemented here — collecting and removing content between two points via LCA identification, path decomposition, and phased processing — is essentially the same operation as the browser’s DOM Range API. Since BeautifulSoup and lxml do not provide this functionality built-in, it may be reusable for cross-element-boundary processing in XML/HTML beyond TEI.
Summary
- Conversion between standoff and inline is asymmetric: inline to standoff is always possible, but standoff to inline requires workarounds when annotation ranges overlap (the overlapping hierarchy problem)
- Build-time inlining is a practical choice due to its compatibility with Vue.js/React recursive rendering and ease of search index construction
- String manipulation + re-parsing risks breaking structural tag correspondence; DOM-based phase decomposition from the common ancestor provides a safe conversion path
- When adopting build-time conversion, structure preservation tests are essential for quality assurance