This article was created by AI with some human modifications.

Introduction

In the world of digital humanities, it has become common to store documents in TEI (Text Encoding Initiative) format. TEI is a standard for structuring scholarly texts. This article explains how to convert documents created in Microsoft Word to TEI XML format using Python.

What is TEIgarage?

TEIgarage is an online service for converting documents in various formats to TEI XML. The service provides an API that can be called directly from programs. In this article, we will call this API from Python to convert Word files.
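
The conversion endpoint used later in this article encodes the source and target formats directly in the URL path. As a minimal sketch of how such a URL could be assembled, the helper below percent-encodes colon-separated format identifiers with the standard library. The function name build_conversion_url and the constant TEIGARAGE_BASE are hypothetical names introduced for illustration; the base URL and the two format identifiers are the ones used in the script in this article.

```python
from urllib.parse import quote

# Base URL taken from the script in this article
TEIGARAGE_BASE = "https://teigarage.tei-c.org/ege-webservice/Conversions"


def build_conversion_url(input_type, output_type):
    """Build a conversion URL from two format identifiers (hypothetical helper)."""
    # safe="" makes quote() percent-encode ":" as %3A
    return f"{TEIGARAGE_BASE}/{quote(input_type, safe='')}/{quote(output_type, safe='')}/"


url = build_conversion_url(
    "docx:application:vnd.openxmlformats-officedocument.wordprocessingml.document",
    "TEI:text:xml",
)
print(url)
```

Building the URL this way keeps the format identifiers readable in the source code, at the cost of one extra helper.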

Requirements

  • Python 3.6 or higher
  • requests library (for API requests)
  • Internet connection
  • A Word file (.docx format) to convert

Steps

1. Install Required Libraries

First, install the necessary libraries. Run the following command in your command prompt or terminal.

pip install requests

2. Create the Python Script

Next, save the following Python code with a name like word_to_tei.py.

import requests
import os
import zipfile
from io import BytesIO

def convert_docx_to_tei_xml(file_path, output_path):
    """Convert the Word file at file_path to TEI XML and save it to output_path."""

    # TEIgarage conversion endpoint (the source and target formats are URL-encoded in the path)
    input_document_type = "docx%3Aapplication%3Avnd.openxmlformats-officedocument.wordprocessingml.document"
    output_document_type = "TEI%3Atext%3Axml"
    TEIGARAGE_URL = f"https://teigarage.tei-c.org/ege-webservice/Conversions/{input_document_type}/{output_document_type}/"

    # Open the .docx file and send it to the API
    with open(file_path, "rb") as file:
        files = {"file": file}
        response = requests.post(TEIGARAGE_URL, files=files)

    # Process the response in memory instead of saving the zip archive to disk
    if response.status_code == 200:
        # Open the returned zip archive in memory
        with zipfile.ZipFile(BytesIO(response.content)) as zip_ref:
            # Find the tei.xml member and write it directly to output_path
            for member in zip_ref.namelist():
                if member.endswith("tei.xml"):
                    # Create the output directory if it does not exist yet
                    os.makedirs(os.path.dirname(output_path) or ".", exist_ok=True)
                    with open(output_path, "wb") as out_file:
                        out_file.write(zip_ref.read(member))
                    print(f"TEI/XML conversion successful! Saved to {output_path}.")
                    break
            else:
                print("Error: tei.xml file not found.")
    else:
        print("Error:", response.status_code, response.text)

# Main processing
if __name__ == "__main__":
    # Specify the path to the Word file you want to convert
    word_file = "documents/sample.docx"  # Change this to the actual file path

    # Specify the output file path
    output_file = "output/sample_tei.xml"  # Specify the output destination

    try:
        # Convert the Word file
        convert_docx_to_tei_xml(word_file, output_file)
    except Exception as e:
        print(f"An error occurred: {e}")
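
The script handles the returned zip archive entirely in memory. To try that step without calling the service, the sketch below builds a small stand-in archive and reads the tei.xml member from it. The helper name read_tei_from_zip and the archive contents are assumptions made up for this demonstration; a real TEIgarage response may contain different files.

```python
import zipfile
from io import BytesIO


def read_tei_from_zip(zip_bytes):
    """Return the bytes of the first member ending in 'tei.xml', or None (hypothetical helper)."""
    with zipfile.ZipFile(BytesIO(zip_bytes)) as archive:
        for member in archive.namelist():
            if member.endswith("tei.xml"):
                return archive.read(member)
    return None


# Build a stand-in archive in memory; its contents are invented for this demo
buffer = BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("media/readme.txt", "placeholder")
    archive.writestr("tei.xml", "<TEI/>")

tei_bytes = read_tei_from_zip(buffer.getvalue())
print(tei_bytes)
```

Returning the bytes instead of writing a file leaves the decision about where to save the result to the caller.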

3. Run the Script

Change the word_file variable in the script to the actual path of the Word file you want to convert. Similarly, change the output_file variable to your desired output destination.

Then, run the following command in your command prompt or terminal.

python word_to_tei.py
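
As an alternative to editing the variables in the script, the two paths could be read from the command line. The following is a minimal sketch using the standard argparse module; the function name parse_args and the argument names are assumptions chosen to match the variables in the script, not part of the original code.

```python
import argparse


def parse_args(argv=None):
    """Parse the input and output paths (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Convert a Word file to TEI XML via TEIgarage."
    )
    parser.add_argument("word_file", help="path to the .docx file to convert")
    parser.add_argument("output_file", help="path where the TEI XML should be written")
    # argv=None makes argparse read sys.argv; a list is passed here for demonstration
    return parser.parse_args(argv)


args = parse_args(["documents/sample.docx", "output/sample_tei.xml"])
print(args.word_file, args.output_file)
```

In the script itself, the parsed values would then be passed to convert_docx_to_tei_xml in place of the hard-coded paths.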

Summary

With the TEIgarage API, a short Python script is enough to convert Word files to TEI XML. Use this script to streamline text processing in your digital humanities projects.

TEI is an XML-based encoding standard for scholarly texts, and the converted XML files are suitable for long-term preservation and detailed analysis. TEI-formatted data can also be used with various digital humanities tools.
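
As a first step toward such analysis, the converted file can be read with Python's standard XML parser. The sketch below collects the text of every paragraph element from a small inline sample. It assumes the output uses the TEI namespace (http://www.tei-c.org/ns/1.0) and p elements for paragraphs; the sample document and the function name paragraph_texts are made up for illustration, and the actual structure of a converted file may differ.

```python
import xml.etree.ElementTree as ET

# Namespace of TEI P5 documents
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def paragraph_texts(tei_xml):
    """Return the plain text of each <p> element in a TEI string (hypothetical helper)."""
    root = ET.fromstring(tei_xml)
    # itertext() also collects text nested inside child elements such as <hi>
    return ["".join(p.itertext()).strip() for p in root.findall(".//tei:p", TEI_NS)]


# Invented sample standing in for a converted file
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <p>First paragraph.</p>
    <p>Second <hi rend="italic">paragraph</hi>.</p>
  </body></text>
</TEI>"""

print(paragraph_texts(sample))
```

For a file on disk, ET.parse(path).getroot() could replace the ET.fromstring call.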

Feel free to customize this script for your own projects!