TL;DR

Transkribus is an AI-based Handwritten Text Recognition (HTR) platform. It recognizes both handwritten and printed text and supports over 100 languages. Its custom model training feature lets you tune recognition accuracy for specific handwriting styles and scripts, and it has become a widely used tool among DH researchers transcribing historical documents.

What is Transkribus?

Transkribus originated as a project at the University of Innsbruck, Austria, and is currently managed by READ-COOP SCE (a European cooperative). Its development has been supported by funding from the EU’s Horizon 2020 programme and other sources.

Key features include:

  • HTR (Handwritten Text Recognition): Deep learning-based handwriting recognition engine
  • 100+ languages: Supports diverse writing systems including Latin, Cyrillic, Arabic, and Hebrew scripts
  • Custom model training: Train recognition models on your own data to improve accuracy for a specific hand, script, or document type
  • Layout analysis: Automatic detection of text regions, lines, and paragraphs within pages
  • Collaborative work: Supports team-based collaboration for efficient large-scale transcription projects

Key Features

Text Recognition (HTR/OCR)

The core functionality of Transkribus. Pre-trained general models allow you to start text recognition immediately. Available public models cover various periods and languages, including medieval Latin manuscripts, early modern German Kurrent script, and English cursive handwriting.

Custom Model Training

One of the most powerful features. With approximately 50 pages of Ground Truth (images with correct transcription text), you can train a model specialized for specific handwriting or scripts. Trained models can also be shared with other users.
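
Whether a custom model improves on a public one is commonly judged by character error rate (CER): the edit distance between the model output and the Ground Truth, divided by the length of the Ground Truth. The following is a minimal sketch of that calculation in plain Python. It is not Transkribus's own evaluation code, and the function names and sample strings are illustrative assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(ground_truth: str, prediction: str) -> float:
    """CER = edit distance / number of characters in the ground truth."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ground_truth, prediction) / len(ground_truth)


# Hypothetical line of Ground Truth and a hypothetical model output for it.
gt_line = "In the year of our Lord"
htr_line = "In the yeer of our Lard"
print(f"CER: {character_error_rate(gt_line, htr_line):.3f}")
```

In practice the metric would be averaged over a held-out set of pages that were not used for training, so that the figure reflects how the model handles unseen material.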

Layout Analysis

Automatically analyzes document image layouts, detecting text regions (TextRegion), text lines (TextLine), and baselines. It handles complex layouts including multi-column text, tables, and marginal notes.
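
The TextRegion, TextLine, and baseline structure described above is what a PAGE XML file records. As a rough illustration, the sketch below reads that hierarchy with Python's standard library. The sample document is hand-written and hypothetical, and it assumes the 2013-07-15 PAGE namespace; a real export carries more attributes and may use a different schema version.

```python
import xml.etree.ElementTree as ET

# Assumption: the 2013-07-15 PAGE schema; other versions use a different date suffix.
PAGE_NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

# Hypothetical, hand-written sample for illustration only.
SAMPLE_PAGE_XML = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="page_001.jpg" imageWidth="2000" imageHeight="3000">
    <TextRegion id="r1">
      <Coords points="100,100 1900,100 1900,600 100,600"/>
      <TextLine id="r1l1">
        <Coords points="120,120 1880,120 1880,200 120,200"/>
        <Baseline points="120,190 1000,192 1880,190"/>
        <TextEquiv><Unicode>Anno Domini 1650</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="r1l2">
        <Coords points="120,220 1880,220 1880,300 120,300"/>
        <Baseline points="120,290 1880,291"/>
        <TextEquiv><Unicode>den 3. Martii</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def parse_points(points: str) -> list[tuple[int, int]]:
    """Turn a PAGE 'points' attribute ("x1,y1 x2,y2 ...") into coordinate pairs."""
    pairs = (pair.split(",") for pair in points.split())
    return [(int(x), int(y)) for x, y in pairs]


def read_text_lines(xml_text: str) -> list[dict]:
    """Collect region id, line id, baseline, and text for every TextLine."""
    root = ET.fromstring(xml_text)
    lines = []
    for region in root.iterfind(".//pc:TextRegion", PAGE_NS):
        for line in region.iterfind("pc:TextLine", PAGE_NS):
            baseline = line.find("pc:Baseline", PAGE_NS)
            text = line.findtext("pc:TextEquiv/pc:Unicode", default="", namespaces=PAGE_NS)
            lines.append({
                "region": region.get("id"),
                "line": line.get("id"),
                "baseline": parse_points(baseline.get("points", "")) if baseline is not None else [],
                "text": text,
            })
    return lines


for entry in read_text_lines(SAMPLE_PAGE_XML):
    print(entry["region"], entry["line"], entry["baseline"][0], entry["text"])
```

Keeping the baseline coordinates alongside the text makes it possible to link each transcribed line back to its position on the page image.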

Transkribus Lite

A browser-based interface that requires no installation. It provides basic HTR functionality and layout analysis, making it suitable for quick trials.

How to Use

Basic Workflow

  1. Create an account: Register at Transkribus
  2. Upload documents: Upload image files (JPEG, PNG, TIFF) or PDFs
  3. Layout analysis: Run automatic layout analysis to detect text regions and lines
  4. Select a model: Choose an appropriate HTR model (searchable from the public model list)
  5. Run text recognition: Execute HTR with the selected model
  6. Review and correct: Check the recognition results against the original images and fix any errors
  7. Export: Export in formats such as TEI-XML, PAGE XML, ALTO XML, or plain text
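
Once step 7 has produced an export, downstream tools often need only the running text. The sketch below shows one way to flatten an ALTO XML file into plain text with Python's standard library. The sample is hand-written and hypothetical, and it assumes word-level String elements inside each TextLine; element names are matched by local name so that the ALTO schema version does not matter.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written ALTO sample for illustration only.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout>
    <Page ID="p1">
      <PrintSpace>
        <TextBlock ID="b1">
          <TextLine ID="l1">
            <String CONTENT="Anno"/><SP/><String CONTENT="Domini"/>
          </TextLine>
          <TextLine ID="l2">
            <String CONTENT="1650"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def alto_to_text(xml_text: str) -> str:
    """Join the CONTENT of each String per TextLine, one output line per TextLine."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local_name(element.tag) == "TextLine":
            words = [
                child.get("CONTENT", "")
                for child in element
                if local_name(child.tag) == "String"
            ]
            lines.append(" ".join(words))
    return "\n".join(lines)


print(alto_to_text(SAMPLE_ALTO))
```

For files on disk, `ET.parse(path).getroot()` could replace `ET.fromstring`, with the rest of the function unchanged.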

Pricing

Transkribus uses a pay-per-use model. A free tier (500 credits per month) is available, allowing small-scale use at no cost. Subscription plans are available for large-scale projects.

Practical Applications in DH Research

Historical Document Transcription

Transcribe handwritten documents from archives to build full-text searchable digital archives. Examples include Edo-period historical documents and Meiji-era handwritten government records.

Large-Scale Corpus Building

Combining custom model training with automated recognition makes it practical to transcribe thousands of pages and build corpora for text mining and linguistic analysis.

Comparative Document Studies

Transcribe different manuscript copies of the same text to analyze textual variations in stemmatological studies. TEI-XML export facilitates the creation of critical editions.
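
To give a sense of what comparing witnesses involves once the transcriptions exist, here is a minimal word-level comparison using Python's standard `difflib` module. It is only a sketch: the two readings are invented, the function name `collate` is hypothetical, and dedicated collation software handles normalization, transpositions, and more than two witnesses far better.

```python
from difflib import SequenceMatcher


def collate(witness_a: str, witness_b: str) -> list[tuple[str, str, str]]:
    """Return (operation, reading in A, reading in B) for every point of variation."""
    tokens_a = witness_a.split()
    tokens_b = witness_b.split()
    # autojunk=False keeps frequent words from being treated as noise in long texts.
    matcher = SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    variants = []
    for tag, a0, a1, b0, b1 in matcher.get_opcodes():
        if tag != "equal":
            variants.append((tag, " ".join(tokens_a[a0:a1]), " ".join(tokens_b[b0:b1])))
    return variants


# Hypothetical readings from two manuscript copies of the same passage.
copy_a = "in principio erat verbum"
copy_b = "in principio fuit verbum"
for operation, reading_a, reading_b in collate(copy_a, copy_b):
    print(f"{operation}: A has '{reading_a}', B has '{reading_b}'")
```

The operation labels come directly from `difflib` ("replace", "delete", "insert"), so an empty reading on one side marks an omission or addition in that witness.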

Citizen Science Projects

Use Transkribus’s collaborative features to run crowdsourced transcription projects with volunteers. Its quality control features help keep results consistent even when many citizen contributors are involved.

Comparison with Other Tools

Feature              | Transkribus   | Google Cloud Vision | Tesseract OCR
---------------------+---------------+---------------------+--------------
HTR (handwriting)    | High accuracy | Basic               | Not supported
Custom models        | Yes           | Via AutoML          | Trainable
Historical documents | Specialized   | General             | General
Layout analysis      | Advanced      | Basic               | Basic
Pricing              | Pay-per-use   | Pay-per-use         | Free (OSS)
Output formats       | TEI/ALTO/PAGE | JSON                | Text/hOCR

Conclusion

Transkribus is one of the most established platforms for historical document transcription. Its AI-based HTR engine and custom model training capabilities handle handwritten documents across various periods and languages. In DH research, transcription is the starting point for many analyses, and Transkribus provides a solid foundation for that work.

References