Background

When extracting the transparent text layer from PDF files, we encountered the problem of “text order differing from the original PDF.” This article explains the cause of this issue and solutions in both JavaScript and Python. There may be some inaccuracies, but we hope this serves as a useful reference.

What is PDF Transparent Text?

The transparent text layer of a PDF is searchable text embedded in the PDF file. OCR-processed PDFs carry it as invisible text placed over the scanned image, and digitally generated PDFs carry equivalent text information, which enables the following features:

  • Text search
  • Copy and paste
  • Screen reader narration
  • Machine translation

The Problem: Why Text Order Gets Disrupted

PDF Internal Structure

PDF files store text in a format called “content streams.” These streams contain text and its position information, but the content is not necessarily stored in reading order.

Example: Conceptual diagram of PDF content stream
[Position: x=100, y=200, Text="Heading"]
[Position: x=300, y=400, Text="Footnote"]
[Position: x=100, y=300, Text="Body text"]
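
For a more concrete picture, the sketch below reads the same three items from a hand-written, heavily simplified content stream. The stream text, the regular expression, and the function name read_in_stream_order are illustrative assumptions; real content streams are normally compressed and use many more operators than the BT / Tf / Td / Tj / ET subset shown here.

```python
import re

# Hypothetical, simplified content stream holding the three items above.
# Each text object sets a font (Tf), moves to a position (Td) and shows text (Tj).
CONTENT_STREAM = """
BT /F1 12 Tf 100 200 Td (Heading) Tj ET
BT /F1 8 Tf 300 400 Td (Footnote) Tj ET
BT /F1 10 Tf 100 300 Td (Body text) Tj ET
"""

# Matches "<x> <y> Td (<text>) Tj"; this only covers the toy stream above.
TEXT_OP = re.compile(
    r"(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s+Td\s+\((.*?)\)\s+Tj"
)


def read_in_stream_order(stream):
    """Return (x, y, text) tuples in the order they appear in the stream."""
    return [(float(x), float(y), text) for x, y, text in TEXT_OP.findall(stream)]


for x, y, text in read_in_stream_order(CONTENT_STREAM):
    print(f"x={x:g}, y={y:g}, text={text!r}")
```

The items come back in the order they were written (Heading, Footnote, Body text), which is independent of where they sit on the page.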

Problems with Common Extraction Methods

Many PDF processing libraries extract text through the following procedure:

  1. Retrieve text and position information from the content stream
  2. Sort by coordinates (top to bottom, left to right)
  3. Output the sorted result

This “sort by coordinates” process is the main cause of text order disruption.

Specific Problem Examples

  • Mixed vertical and horizontal text: Commonly seen in Japanese documents
  • Multi-column layouts: Newspaper or magazine formats
  • Inserted figures and tables: Elements that interrupt the flow of body text
  • Headers and footers: Elements that repeat on every page, outside the body text
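
To make the multi-column case concrete, here is a minimal sketch in plain Python. The coordinates are invented for illustration, and the code assumes a top-left origin (larger y means lower on the page); TextItem and naive_coordinate_sort are hypothetical names, not part of any library.

```python
from typing import NamedTuple


class TextItem(NamedTuple):
    text: str
    x: float
    y: float  # assumed top-left origin: larger y is lower on the page


# Content stream order for a two-column page: the whole left column first,
# then the whole right column.
stream_order = [
    TextItem("Left column, line 1", 50, 100),
    TextItem("Left column, line 2", 50, 120),
    TextItem("Right column, line 1", 300, 100),
    TextItem("Right column, line 2", 300, 120),
]


def naive_coordinate_sort(items):
    """Sort top to bottom, then left to right, ignoring column structure."""
    return sorted(items, key=lambda item: (item.y, item.x))


sorted_texts = [item.text for item in naive_coordinate_sort(stream_order)]
for text in sorted_texts:
    print(text)
```

The sorted output alternates between the left and right columns line by line, so a sentence that continues down the left column is interrupted by text from the right column.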

Solutions: Language-Specific Approaches

JavaScript (PDF.js) Solution

PDF.js is a JavaScript-based PDF rendering library developed by Mozilla.

Implementation that Preserves Order

// Order-preserving text extraction using PDF.js
async function extractTextWithOrder(page) {
  // getTextContent() maintains the content stream order
  const textContent = await page.getTextContent();

  // items is an array preserving the original order
  const orderedText = textContent.items.map(item => {
    return {
      text: item.str,
      x: item.transform[4],
      y: item.transform[5],
      width: item.width,
      height: item.height
    };
  });

  // Use the array order as-is (no coordinate sorting)
  return orderedText;
}

Key Points

  • The getTextContent() method returns text in order faithful to the PDF’s internal structure
  • Array indices represent the original order
  • No re-sorting by coordinates is performed

Python (PyMuPDF) Solution

PyMuPDF (fitz) is a Python binding for the MuPDF library.

Implementation that Preserves Order

import fitz  # PyMuPDF

def extract_text_with_order(pdf_path):
    doc = fitz.open(pdf_path)

    for page_idx, page in enumerate(doc):
        # Method 1: Raw text extraction (content stream order)
        raw_text = page.get_text("text")

        # Method 2: Extraction preserving detailed structure
        if not raw_text.strip():
            text_dict = page.get_text("dict")
            page_texts = []

            # Process blocks in original order
            for block in text_dict.get("blocks", []):
                if block.get("type") == 0:  # Text block
                    for line in block.get("lines", []):
                        line_text = ""
                        for span in line.get("spans", []):
                            line_text += span.get("text", "")
                        if line_text.strip():
                            page_texts.append(line_text)

            text = '\n'.join(page_texts)
        else:
            text = raw_text

        yield page_idx, text

    doc.close()

Key Points

  • get_text("text") preserves the PDF content stream order
  • get_text("dict") retrieves detailed structure information
  • Avoids coordinate-based sorting

Python (pdfplumber) Issues

pdfplumber is a popular Python library, but it performs coordinate-based processing by default:

# pdfplumber example (problematic approach)
import pdfplumber

pdf_path = "example.pdf"  # hypothetical path, replace with your own file

with pdfplumber.open(pdf_path) as pdf:
    for page in pdf.pages:
        # extract_text() lays text out by character coordinates internally
        text = page.extract_text()  # Order may differ from the content stream

Implementation Comparison Table

Feature                             PDF.js (JavaScript)   PyMuPDF (Python)   pdfplumber (Python)
Content stream order preservation   Yes                   Yes                No
Coordinate information retrieval    Yes                   Yes                Yes
Processing speed                    Medium                High               Low
Memory usage                        Medium                Low                High
Japanese support                    Yes                   Yes                Partial
Browser support                     Yes                   No                 No

Practical Example: Hybrid Approach

An implementation example that leverages both order preservation and coordinate information:

JavaScript Implementation

class PDFTextExtractor {
  constructor() {
    this.textItems = [];
  }

  async extractWithMetadata(page) {
    const textContent = await page.getTextContent();

    // Preserve original order while also recording position information
    this.textItems = textContent.items.map((item, index) => ({
      originalIndex: index,  // Original order
      text: item.str,
      x: item.transform[4],
      y: item.transform[5],
      width: item.width,
      height: item.height
    }));

    return this.textItems;
  }

  // Sort by coordinates when needed
  getTextByPosition() {
    return [...this.textItems].sort((a, b) => {
      if (Math.abs(a.y - b.y) < 5) {  // Considered same line
        return a.x - b.x;  // Left to right
      }
      return b.y - a.y;  // Top to bottom
    });
  }

  // Get in original order
  getTextByOriginalOrder() {
    return this.textItems;  // Already in original order
  }
}

Python Implementation

import fitz  # PyMuPDF

class PDFTextExtractor:
    def __init__(self):
        self.text_items = []

    def extract_with_metadata(self, pdf_path):
        doc = fitz.open(pdf_path)

        for page_num, page in enumerate(doc):
            # Get detailed information in dictionary format
            text_dict = page.get_text("dict")

            item_index = 0
            for block in text_dict.get("blocks", []):
                if block.get("type") == 0:  # Text block
                    for line in block.get("lines", []):
                        for span in line.get("spans", []):
                            self.text_items.append({
                                'original_index': item_index,
                                'page': page_num,
                                'text': span.get("text", ""),
                                'bbox': span.get("bbox", []),  # [x0, y0, x1, y1]
                                'font': span.get("font", ""),
                                'size': span.get("size", 0)
                            })
                            item_index += 1

        doc.close()
        return self.text_items

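    def get_text_by_position(self):
        # Sort by page, then top to bottom, then left to right.
        # PyMuPDF's y axis grows downward, so ascending y0 is top to bottom.
        return sorted(self.text_items,
                      key=lambda item: (item['page'], item['bbox'][1], item['bbox'][0]))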

    def get_text_by_original_order(self):
        # Maintain original order
        return self.text_items
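
The JavaScript getTextByPosition() treats items with nearly equal y values as being on the same line, while the Python version sorts on the raw bbox values. If the same tolerance is wanted in Python, one possible sketch is shown below. It assumes spans from a single page, each a dict with a 'bbox' of [x0, y0, x1, y1] in a top-left-origin coordinate system; sort_spans_into_lines and y_tolerance are hypothetical names chosen for this example.

```python
def sort_spans_into_lines(spans, y_tolerance=3.0):
    """Order spans top to bottom, then left to right within each visual line.

    Assumes each span is a dict with a 'bbox' of [x0, y0, x1, y1] where y
    grows downward, and that all spans belong to the same page.
    """
    lines = []  # each entry is a list of spans treated as one visual line
    for span in sorted(spans, key=lambda s: s["bbox"][1]):
        # Join the current line if this span's top edge is close to the
        # top edge of the line's first span; otherwise start a new line.
        if lines and abs(span["bbox"][1] - lines[-1][0]["bbox"][1]) <= y_tolerance:
            lines[-1].append(span)
        else:
            lines.append([span])
    # Within each line, order spans by their left edge.
    return [s for line in lines for s in sorted(line, key=lambda s: s["bbox"][0])]


example_spans = [
    {"text": "world", "bbox": [60, 99, 100, 111]},   # slightly higher baseline
    {"text": "Hello", "bbox": [10, 100, 50, 112]},
    {"text": "Next", "bbox": [10, 120, 50, 132]},
]
ordered = [s["text"] for s in sort_spans_into_lines(example_spans)]
print(ordered)
```

A plain sort on (y0, x0) would put "world" before "Hello" here because its top edge is one unit higher; grouping with a tolerance keeps them on one line in left-to-right order.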

Best Practices

1. Choose Based on Use Case

def choose_extraction_method(use_case):
    if use_case == "full_text_search":
        # Full-text search: prioritize original order
        return "original_order"
    elif use_case == "layout_analysis":
        # Layout analysis: prioritize coordinate information
        return "position_based"
    elif use_case == "content_extraction":
        # Content extraction: hybrid
        return "hybrid"
    else:
        raise ValueError(f"Unknown use case: {use_case!r}")
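
As one possible reading of the "hybrid" option, the sketch below keeps the original order but uses the coordinates to mark items whose position would change under a coordinate sort, so that they can be reviewed. The function name flag_order_mismatches and the keys position_rank and moved are assumptions made for this example; items are assumed to be dicts with a 'bbox' of [x0, y0, x1, y1] (top-left origin), listed in content stream order.

```python
def flag_order_mismatches(items):
    """Keep the original order and annotate each item with its coordinate rank.

    Returns new dicts with two extra keys:
      position_rank: index the item would have after a (y0, x0) sort
      moved:         True if that index differs from the original index
    """
    by_position = sorted(
        range(len(items)),
        key=lambda i: (items[i]["bbox"][1], items[i]["bbox"][0]),
    )
    position_rank = {original: rank for rank, original in enumerate(by_position)}
    return [
        {**item, "position_rank": position_rank[i], "moved": position_rank[i] != i}
        for i, item in enumerate(items)
    ]


stream_items = [
    {"text": "Heading", "bbox": [100, 100, 300, 120]},
    {"text": "Footnote", "bbox": [100, 700, 300, 710]},
    {"text": "Body text", "bbox": [100, 200, 300, 215]},
]
flagged = flag_order_mismatches(stream_items)
for item in flagged:
    print(item["text"], item["position_rank"], item["moved"])
```

Pages where no item is marked as moved can be used as-is, and pages with many moved items are candidates for a closer look at the layout.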

2. Error Handling

async function safeExtractText(page) {
  try {
    const textContent = await page.getTextContent();

    // Check for empty text
    if (!textContent.items || textContent.items.length === 0) {
      console.warn('No text found in page');
      return [];
    }

    // Check for garbled characters
    const items = textContent.items.filter(item => {
      // Drop items that contain a '(cid:' placeholder (a sign of unmapped glyphs)
      return !item.str.includes('(cid:');
    });

    return items;
  } catch (error) {
    console.error('Text extraction failed:', error);
    return [];
  }
}
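
The same filtering idea can be sketched in Python for text that has already been extracted. Placeholders of the form (cid:123) typically show up in the output of pdfminer-style extractors when a glyph has no Unicode mapping. The helper name strip_cid_placeholders is hypothetical, and counting the removed placeholders is an addition made here so that callers can judge how badly a page is affected.

```python
import re

# Matches placeholders such as "(cid:12)" that stand in for unmapped glyphs.
CID_PLACEHOLDER = re.compile(r"\(cid:\d+\)")


def strip_cid_placeholders(text):
    """Remove (cid:N) placeholders and report how many were removed.

    Returns a (cleaned_text, placeholder_count) tuple.
    """
    cleaned, count = CID_PLACEHOLDER.subn("", text)
    return cleaned, count


cleaned, count = strip_cid_placeholders("PDF(cid:12)(cid:7) text")
print(repr(cleaned), count)
```

Unlike dropping a whole item, this keeps the readable part of the string; a high placeholder count relative to the text length suggests the page needs OCR or a different extractor.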

3. Performance Optimization

# Processing large PDFs
def extract_large_pdf(pdf_path, batch_size=10):
    doc = fitz.open(pdf_path)
    total_pages = len(doc)

    for start_idx in range(0, total_pages, batch_size):
        end_idx = min(start_idx + batch_size, total_pages)
        batch_texts = []

        for page_idx in range(start_idx, end_idx):
            page = doc[page_idx]
            # Memory-efficient processing
            text = page.get_text("text")
            batch_texts.append(text)
            # Release page object
            page = None

        yield batch_texts

    doc.close()

Troubleshooting

Common Problems and Solutions

  1. Japanese character garbling
# get_text() already returns a Unicode str; this round trip only drops
# characters that cannot be encoded as UTF-8 (such as lone surrogates)
text = page.get_text("text").encode('utf-8', errors='ignore').decode('utf-8')
  2. Vertical text processing
// Determine writing direction
function getWritingDirection(item) {
  // Heuristic: a large transform[1] means the text run is rotated
  return Math.abs(item.transform[1]) > 0.5 ? 'vertical' : 'horizontal';
}
  3. Out of memory
# Streaming processing
def stream_pdf_text(pdf_path):
    doc = fitz.open(pdf_path)
    for page in doc:
        yield page.get_text("text")
        page = None  # Explicit release
    doc.close()
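
For the garbling case, it can help to measure how much of the extracted text looks broken before trying to repair it. The sketch below is one rough heuristic, not a definitive test: it counts U+FFFD replacement characters and private-use code points, both of which often appear when a font's glyphs cannot be mapped to Unicode. The function name suspicious_char_ratio is an assumption, and any threshold applied to its result would need tuning on real documents.

```python
import unicodedata


def suspicious_char_ratio(text):
    """Return the fraction of characters that are U+FFFD or private-use.

    An empty string yields 0.0.
    """
    if not text:
        return 0.0
    suspicious = sum(
        1
        for ch in text
        if ch == "\ufffd" or unicodedata.category(ch) == "Co"
    )
    return suspicious / len(text)


print(suspicious_char_ratio("日本語"))
print(suspicious_char_ratio("日本\ufffd\ue000"))
```

Pages with a high ratio are candidates for re-extraction with another library or for OCR, since re-encoding the string cannot recover characters that were never mapped.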

Summary

Preserving order is an important challenge in PDF transparent text extraction. Key points are:

  1. Understanding the problem: Coordinate-based sorting is the main cause of order disruption
  2. Choosing the right library: Use PDF.js (JavaScript) or PyMuPDF (Python)
  3. Implementation approach: Maintain the content stream order
  4. Hybrid approach: Leverage both order and coordinate information

By applying this knowledge, you can build more accurate and reliable PDF text extraction systems.

References