Challenges and Solutions for Preserving Order in PDF Transparent Text Extraction

Background

When extracting the transparent text layer from PDF files, we encountered the problem of “text order differing from the original PDF.” This article explains the cause of the problem and presents solutions in both JavaScript and Python. There may be some inaccuracies, but we hope it serves as a useful reference.

What is PDF Transparent Text?

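The transparent text layer of a PDF is searchable text information embedded within the PDF file. In OCR-processed PDFs this text is drawn invisibly over the scanned page image; digitally generated PDFs carry the same kind of text information in their ordinary text objects. In both cases it enables the following features: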

Text search
Copy and paste
Screen reader narration
Machine translation

The Problem: Why Text Order Gets Disrupted

PDF Internal Structure

PDF files store text in a format called “content streams.” These streams contain text and its position information, but the content is not necessarily stored in reading order.

Example: Conceptual diagram of PDF content stream
[Position: x=100, y=200, Text="Heading"]
[Position: x=300, y=400, Text="Footnote"]
[Position: x=100, y=300, Text="Body text"]

Problems with Common Extraction Methods

Many PDF processing libraries extract text through the following procedure:

Retrieve text and position information from the content stream
Sort by coordinates (top to bottom, left to right)
Output the sorted result

This “sort by coordinates” process is the main cause of text order disruption.
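To make the effect concrete, here is a minimal sketch of a two-column page. The `TextItem` class, the coordinates, and the helper name `sort_by_coordinates` are illustrative assumptions and not part of any library; y is assumed to increase downward from the top of the page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextItem:
    text: str
    x: float  # left edge of the text
    y: float  # distance from the top of the page (assumed top-left origin)


# Hypothetical two-column page, listed in content stream order:
# the left column is written first, then the right column.
STREAM_ITEMS = [
    TextItem("Left column, line 1", x=50, y=100),
    TextItem("Left column, line 2", x=50, y=120),
    TextItem("Right column, line 1", x=320, y=100),
    TextItem("Right column, line 2", x=320, y=120),
]


def sort_by_coordinates(items):
    """Naive ordering: top to bottom, then left to right."""
    return sorted(items, key=lambda item: (item.y, item.x))


stream_order = [item.text for item in STREAM_ITEMS]
sorted_order = [item.text for item in sort_by_coordinates(STREAM_ITEMS)]
```

In this sketch the coordinate sort interleaves the two columns line by line, while the stream order keeps each column together.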

Specific Problem Examples

Mixed vertical and horizontal text: Commonly seen in Japanese documents
Multi-column layouts: Newspaper or magazine formats
Inserted figures and tables: Elements that interrupt the flow of body text
Headers and footers: Elements repeated on every page, outside the flow of body text

Solutions: Language-Specific Approaches

JavaScript (PDF.js) Solution

PDF.js is a JavaScript-based PDF rendering library developed by Mozilla.

Implementation that Preserves Order

// Order-preserving text extraction using PDF.js
async function extractTextWithOrder(page) {
  // getTextContent() maintains the content stream order
  const textContent = await page.getTextContent();

  // items is an array preserving the original order
  const orderedText = textContent.items.map(item => {
    return {
      text: item.str,
      x: item.transform[4],
      y: item.transform[5],
      width: item.width,
      height: item.height
    };
  });

  // Use the array order as-is (no coordinate sorting)
  return orderedText;
}

Key Points

The getTextContent() method returns text in order faithful to the PDF’s internal structure
Array indices represent the original order
No re-sorting by coordinates is performed

Python (PyMuPDF) Solution

PyMuPDF (fitz) is a Python binding for the MuPDF library.

Implementation that Preserves Order

import fitz  # PyMuPDF

def extract_text_with_order(pdf_path):
    # The context manager closes the document even if iteration stops early
    with fitz.open(pdf_path) as doc:
        for page_idx, page in enumerate(doc):
            # Method 1: Raw text extraction (content stream order)
            raw_text = page.get_text("text")

            # Method 2: Extraction preserving detailed structure
            if not raw_text.strip():
                text_dict = page.get_text("dict")
                page_texts = []

                # Process blocks in original order
                for block in text_dict.get("blocks", []):
                    if block.get("type") == 0:  # Text block
                        for line in block.get("lines", []):
                            # Concatenate the spans that make up one line
                            line_text = "".join(
                                span.get("text", "") for span in line.get("spans", [])
                            )
                            if line_text.strip():
                                page_texts.append(line_text)

                text = '\n'.join(page_texts)
            else:
                text = raw_text

            yield page_idx, text

Key Points

get_text("text") preserves the PDF content stream order
get_text("dict") retrieves detailed structure information
Avoids coordinate-based sorting
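One way to check whether a given page is affected is to compare the order in which words are returned with a coordinate sort of the same words. The sketch below is only an illustration: it assumes word tuples shaped like the output of PyMuPDF's page.get_text("words"), that is (x0, y0, x1, y1, word, block_no, line_no, word_no), and the helper name `order_differs_from_coordinates` is hypothetical.

```python
def order_differs_from_coordinates(words):
    """Return True if the given word order is not top-to-bottom, left-to-right.

    `words` is assumed to be a list of tuples shaped like PyMuPDF's
    page.get_text("words") output: (x0, y0, x1, y1, word, ...).
    PyMuPDF's y axis grows downward, so ascending y0 means top to bottom.
    """
    as_returned = [w[4] for w in words]
    by_position = [w[4] for w in sorted(words, key=lambda w: (w[1], w[0]))]
    return as_returned != by_position


# Hypothetical usage with PyMuPDF (requires `pip install pymupdf`):
#   import fitz
#   with fitz.open("sample.pdf") as doc:
#       for page in doc:
#           if order_differs_from_coordinates(page.get_text("words")):
#               print(f"page {page.number}: order differs from a coordinate sort")
```

The comparison is deliberately strict, so words on the same visual line with slightly different y0 values are also reported as a difference. Treat a True result as a hint to inspect the page, not as proof of a layout problem.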

Python (pdfplumber) Issues

pdfplumber is a popular Python library, but it performs coordinate-based processing by default:

# pdfplumber example (problematic approach)
import pdfplumber

pdf_path = "sample.pdf"  # Placeholder path

with pdfplumber.open(pdf_path) as pdf:
    for page in pdf.pages:
        # extract_text() performs coordinate sorting internally
        text = page.extract_text()  # Order may be disrupted

Implementation Comparison Table

Feature	PDF.js (JavaScript)	PyMuPDF (Python)	pdfplumber (Python)
Content stream order preservation	Yes	Yes	No
Coordinate information retrieval	Yes	Yes	Yes
Processing speed	Medium	High	Low
Memory usage	Medium	Low	High
Japanese support	Yes	Yes	Partial
Browser support	Yes	No	No

Practical Example: Hybrid Approach

An implementation example that leverages both order preservation and coordinate information:

JavaScript Implementation

class PDFTextExtractor {
  constructor() {
    this.textItems = [];
  }

  async extractWithMetadata(page) {
    const textContent = await page.getTextContent();

    // Preserve original order while also recording position information
    this.textItems = textContent.items.map((item, index) => ({
      originalIndex: index,  // Original order
      text: item.str,
      x: item.transform[4],
      y: item.transform[5],
      width: item.width,
      height: item.height
    }));

    return this.textItems;
  }

  // Sort by coordinates when needed
  getTextByPosition() {
    return [...this.textItems].sort((a, b) => {
      if (Math.abs(a.y - b.y) < 5) {  // Considered same line
        return a.x - b.x;  // Left to right
      }
      return b.y - a.y;  // Top to bottom
    });
  }

  // Get in original order
  getTextByOriginalOrder() {
    return this.textItems;  // Already in original order
  }
}

Python Implementation

import fitz  # PyMuPDF

class PDFTextExtractor:
    def __init__(self):
        self.text_items = []

    def extract_with_metadata(self, pdf_path):
        doc = fitz.open(pdf_path)

        for page_num, page in enumerate(doc):
            # Get detailed information in dictionary format
            text_dict = page.get_text("dict")

            item_index = 0
            for block in text_dict.get("blocks", []):
                if block.get("type") == 0:  # Text block
                    for line in block.get("lines", []):
                        for span in line.get("spans", []):
                            self.text_items.append({
                                'original_index': item_index,
                                'page': page_num,
                                'text': span.get("text", ""),
                                'bbox': span.get("bbox", []),  # [x0, y0, x1, y1]
                                'font': span.get("font", ""),
                                'size': span.get("size", 0)
                            })
                            item_index += 1

        doc.close()
        return self.text_items

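    def get_text_by_position(self):
        # Sort by coordinates when needed. PyMuPDF's origin is the top-left
        # corner, so ascending y0 runs top to bottom; pages are kept in order.
        return sorted(self.text_items,
                      key=lambda item: (item['page'], item['bbox'][1], item['bbox'][0]))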

    def get_text_by_original_order(self):
        # Maintain original order
        return self.text_items
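
The Python get_text_by_position() compares raw y0 values, so spans on the same visual line with slightly different baselines can end up in the wrong order. A minimal sketch of tolerance-based grouping follows, mirroring the 5-unit threshold in the JavaScript version. It assumes items shaped like those built by extract_with_metadata() ('page' plus 'bbox' as [x0, y0, x1, y1] with a top-left origin); the function name `sort_with_line_tolerance` is hypothetical.

```python
def sort_with_line_tolerance(items, y_tolerance=5.0):
    """Order items top to bottom, treating nearby y0 values as one line.

    Each item is assumed to be a dict with 'page' and 'bbox' = [x0, y0, x1, y1],
    where y grows downward (PyMuPDF convention).
    """
    ordered = sorted(items, key=lambda it: (it["page"], it["bbox"][1]))

    lines = []  # each entry is the list of items on one visual line
    for item in ordered:
        if lines:
            first = lines[-1][0]  # the first item anchors the line's y position
            same_page = first["page"] == item["page"]
            close_in_y = abs(item["bbox"][1] - first["bbox"][1]) < y_tolerance
            if same_page and close_in_y:
                lines[-1].append(item)
                continue
        lines.append([item])

    result = []
    for line in lines:
        # Within a line, order left to right
        result.extend(sorted(line, key=lambda it: it["bbox"][0]))
    return result
```

Grouping into lines first and sorting within each line afterwards avoids the inconsistent results that a pairwise "same line" comparison can produce when three or more items sit just around the tolerance boundary.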

Best Practices

1. Choose Based on Use Case

def choose_extraction_method(use_case):
    if use_case == "full_text_search":
        # Full-text search: prioritize original order
        return "original_order"
    elif use_case == "layout_analysis":
        # Layout analysis: prioritize coordinate information
        return "position_based"
    elif use_case == "content_extraction":
        # Content extraction: hybrid
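        return "hybrid"
    # Fail loudly instead of silently returning None
    raise ValueError(f"Unknown use case: {use_case!r}")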

2. Error Handling

async function safeExtractText(page) {
  try {
    const textContent = await page.getTextContent();

    // Check for empty text
    if (!textContent.items || textContent.items.length === 0) {
      console.warn('No text found in page');
      return [];
    }

    // Check for garbled characters
    const items = textContent.items.filter(item => {
      // Remove CID characters
      return !item.str.includes('(cid:');
    });

    return items;
  } catch (error) {
    console.error('Text extraction failed:', error);
    return [];
  }
}

3. Performance Optimization

# Processing large PDFs
def extract_large_pdf(pdf_path, batch_size=10):
    # The context manager closes the document even if the caller stops early
    with fitz.open(pdf_path) as doc:
        total_pages = len(doc)

        for start_idx in range(0, total_pages, batch_size):
            end_idx = min(start_idx + batch_size, total_pages)
            batch_texts = []

            for page_idx in range(start_idx, end_idx):
                page = doc[page_idx]
                # Memory-efficient processing
                text = page.get_text("text")
                batch_texts.append(text)
                # Release page object
                page = None

            yield batch_texts

Troubleshooting

Common Problems and Solutions

Japanese character garbling

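# Explicit UTF-8 encoding
# Note: get_text() already returns a Python str, so this round trip only
# drops characters that cannot be encoded as UTF-8 (such as lone surrogates)
text = page.get_text("text").encode('utf-8', errors='ignore').decode('utf-8')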
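
Because the extracted text is already Unicode, garbling often originates in the PDF's font information and not in Python's encoding step. A rough way to flag affected pages is to measure how much of the text consists of the Unicode replacement character U+FFFD. This is a sketch under the assumption that unmappable glyphs show up as U+FFFD in the extracted text; the names `replacement_char_ratio` and `looks_garbled` and the 0.1 threshold are illustrative choices.

```python
REPLACEMENT_CHAR = "\ufffd"  # Unicode replacement character


def replacement_char_ratio(text):
    """Fraction of non-whitespace characters that are U+FFFD (0.0 for empty text)."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    return sum(ch == REPLACEMENT_CHAR for ch in visible) / len(visible)


def looks_garbled(text, threshold=0.1):
    """Heuristic: True when replacement characters exceed the threshold share."""
    return replacement_char_ratio(text) > threshold
```

Pages flagged this way are candidates for re-running OCR or for checking the embedded fonts; the heuristic does not repair the text itself.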

Vertical text processing

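// Determine writing direction
function getWritingDirection(item) {
  // Heuristic based on the transform matrix: text rotated by about 90 degrees
  // has a dominant b component (transform[1]). The matrix is scaled by the
  // font size, so compare components instead of using a fixed threshold.
  const [a, b] = item.transform;
  return Math.abs(b) > Math.abs(a) ? 'vertical' : 'horizontal';
}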
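
For the Python side, a comparable check can be sketched on the line dictionaries returned by page.get_text("dict"), which carry a wmode value (0 for horizontal, 1 for vertical writing mode) and a dir vector for the writing direction. The function name `line_direction` and the "rotated" label are assumptions made for this illustration.

```python
def line_direction(line):
    """Classify a line dictionary from PyMuPDF's page.get_text("dict").

    Uses 'wmode' (0 = horizontal, 1 = vertical writing mode) and
    'dir' (a (cos, sin) writing-direction vector). Missing keys fall
    back to horizontal, left-to-right text.
    """
    if line.get("wmode", 0) == 1:
        return "vertical"
    cos, sin = line.get("dir", (1.0, 0.0))
    if abs(sin) > abs(cos):
        # Horizontal writing mode, but drawn along a mostly vertical axis
        return "rotated"
    return "horizontal"
```

Separating "vertical" from "rotated" keeps true vertical writing (as in Japanese documents) apart from horizontal text that was merely rotated on the page.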

Out of memory

# Streaming processing
def stream_pdf_text(pdf_path):
    # Yield one page at a time so only a single page's text is held in memory
    with fitz.open(pdf_path) as doc:
        for page in doc:
            yield page.get_text("text")

Summary

Preserving order is an important challenge in PDF transparent text extraction. Key points are:

Understanding the problem: Coordinate-based sorting is the main cause of order disruption
Choosing the right library: Use PDF.js (JavaScript) or PyMuPDF (Python)
Implementation approach: Maintain the content stream order
Hybrid approach: Leverage both order and coordinate information

By applying this knowledge, you can build more accurate and reliable PDF text extraction systems.
