Background
When extracting the transparent text layer from PDF files, we encountered the problem of “text order differing from the original PDF.” This article explains the cause of this issue and solutions in both JavaScript and Python. There may be some inaccuracies, but we hope this serves as a useful reference.
What is PDF Transparent Text?
The transparent text layer of a PDF is searchable text information embedded within the PDF file. In OCR-processed PDFs it is an invisible layer placed over the scanned page image; digitally generated PDFs carry the same kind of information in their ordinary text objects. In both cases it enables the following features:
- Text search
- Copy and paste
- Screen reader narration
- Machine translation
The Problem: Why Text Order Gets Disrupted
PDF Internal Structure
PDF files store text in a format called “content streams.” These streams contain text and its position information, but the content is not necessarily stored in reading order.
Example: Conceptual diagram of PDF content stream
[Position: x=100, y=200, Text="Heading"]
[Position: x=300, y=400, Text="Footnote"]
[Position: x=100, y=300, Text="Body text"]
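To make the diagram concrete, here is a minimal Python sketch that models the three entries as records and compares the order in which they are stored with the order a top-to-bottom sort would produce. The `TextItem` class and its field names are illustrative assumptions, not real PDF syntax, and the sketch assumes y grows downward (a top-left origin).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextItem:
    """Hypothetical record for one positioned piece of text."""
    x: float
    y: float
    text: str


# Items in the order they appear in the conceptual stream above.
stream_items = [
    TextItem(100, 200, "Heading"),
    TextItem(300, 400, "Footnote"),
    TextItem(100, 300, "Body text"),
]

# Order as stored in the stream.
stream_order = [item.text for item in stream_items]

# Assumption: y increases downward, so ascending y means top to bottom.
position_order = [
    item.text for item in sorted(stream_items, key=lambda i: (i.y, i.x))
]

print(stream_order)
print(position_order)
```

The two lists differ, which is the point of the diagram: storage order and visual order are independent pieces of information.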
Problems with Common Extraction Methods
Many PDF processing libraries extract text through the following procedure:
- Retrieve text and position information from the content stream
- Sort by coordinates (top to bottom, left to right)
- Output the sorted result
This “sort by coordinates” process is the main cause of text order disruption.
Specific Problem Examples
- Mixed vertical and horizontal text: Commonly seen in Japanese documents
- Multi-column layouts: Newspaper or magazine formats
- Inserted figures and tables: Elements that interrupt the flow of body text
- Headers and footers: Elements that repeat on every page and sit outside the flow of body text
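The multi-column case is the easiest to reproduce without a real PDF. The sketch below uses made-up coordinates and a hypothetical `sort_by_position` helper to show how a top-to-bottom, left-to-right sort interleaves two columns that were stored column by column. It assumes y grows downward.

```python
def sort_by_position(items, line_tolerance=5):
    """Naive reading-order guess: group items into rows by y, then sort each row by x.

    `items` is a list of (x, y, text) tuples; y is assumed to grow downward.
    """
    rows = []  # each entry is [reference_y, [items in that row]]
    for item in sorted(items, key=lambda it: it[1]):
        if rows and abs(item[1] - rows[-1][0]) < line_tolerance:
            rows[-1][1].append(item)  # close enough to the current row
        else:
            rows.append([item[1], [item]])  # start a new row
    ordered = []
    for _, row_items in rows:
        ordered.extend(sorted(row_items, key=lambda it: it[0]))
    return ordered


# Two columns stored column by column (left column first, then right).
two_column_page = [
    (50, 100, "Left 1"),
    (50, 120, "Left 2"),
    (300, 101, "Right 1"),
    (300, 121, "Right 2"),
]

stored = [text for _, _, text in two_column_page]
sorted_texts = [text for _, _, text in sort_by_position(two_column_page)]

print(stored)
print(sorted_texts)
```

In the stored order each column reads straight through, while the sorted order alternates between the columns line by line.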
Solutions: Language-Specific Approaches
JavaScript (PDF.js) Solution
PDF.js is a JavaScript-based PDF rendering library developed by Mozilla.
Implementation that Preserves Order
// Order-preserving text extraction using PDF.js
async function extractTextWithOrder(page) {
  // getTextContent() maintains the content stream order
  const textContent = await page.getTextContent();
  // items is an array preserving the original order
  const orderedText = textContent.items.map(item => {
    return {
      text: item.str,
      x: item.transform[4],
      y: item.transform[5],
      width: item.width,
      height: item.height
    };
  });
  // Use the array order as-is (no coordinate sorting)
  return orderedText;
}
Key Points
- The getTextContent() method returns text in an order faithful to the PDF’s internal structure
- Array indices represent the original order
- No re-sorting by coordinates is performed
Python (PyMuPDF) Solution
PyMuPDF (fitz) is a Python binding for the MuPDF library.
Implementation that Preserves Order
import fitz  # PyMuPDF
def extract_text_with_order(pdf_path):
    doc = fitz.open(pdf_path)
    for page_idx, page in enumerate(doc):
        # Method 1: Raw text extraction (content stream order)
        raw_text = page.get_text("text")
        # Method 2: Extraction preserving detailed structure
        if not raw_text.strip():
            text_dict = page.get_text("dict")
            page_texts = []
            # Process blocks in original order
            for block in text_dict.get("blocks", []):
                if block.get("type") == 0:  # Text block
                    for line in block.get("lines", []):
                        line_text = ""
                        for span in line.get("spans", []):
                            line_text += span.get("text", "")
                        if line_text.strip():
                            page_texts.append(line_text)
            text = '\n'.join(page_texts)
        else:
            text = raw_text
        yield page_idx, text
    doc.close()
Key Points
- get_text("text") preserves the PDF content stream order
- get_text("dict") retrieves detailed structure information
- Avoids coordinate-based sorting
Python (pdfplumber) Issues
pdfplumber is a popular Python library, but it performs coordinate-based processing by default:
# pdfplumber example (problematic approach)
import pdfplumber
pdf_path = "sample.pdf"  # Placeholder path for illustration
with pdfplumber.open(pdf_path) as pdf:
    for page in pdf.pages:
        # extract_text() performs coordinate sorting internally
        text = page.extract_text()  # Order may be disrupted
Implementation Comparison Table
| Feature | PDF.js (JavaScript) | PyMuPDF (Python) | pdfplumber (Python) |
|---|---|---|---|
| Content stream order preservation | Yes | Yes | No |
| Coordinate information retrieval | Yes | Yes | Yes |
| Processing speed | Medium | High | Low |
| Memory usage | Medium | Low | High |
| Japanese support | Yes | Yes | Partial |
| Browser support | Yes | No | No |
Practical Example: Hybrid Approach
An implementation example that leverages both order preservation and coordinate information:
JavaScript Implementation
class PDFTextExtractor {
  constructor() {
    this.textItems = [];
  }
  async extractWithMetadata(page) {
    const textContent = await page.getTextContent();
    // Preserve original order while also recording position information
    this.textItems = textContent.items.map((item, index) => ({
      originalIndex: index, // Original order
      text: item.str,
      x: item.transform[4],
      y: item.transform[5],
      width: item.width,
      height: item.height
    }));
    return this.textItems;
  }
  // Sort by coordinates when needed
  getTextByPosition() {
    return [...this.textItems].sort((a, b) => {
      if (Math.abs(a.y - b.y) < 5) { // Considered same line
        return a.x - b.x; // Left to right
      }
      // PDF y coordinates grow upward, so the larger y comes first
      return b.y - a.y; // Top to bottom
    });
  }
  // Get in original order
  getTextByOriginalOrder() {
    return this.textItems; // Already in original order
  }
}
Python Implementation
import fitz  # PyMuPDF
class PDFTextExtractor:
    def __init__(self):
        self.text_items = []
    def extract_with_metadata(self, pdf_path):
        doc = fitz.open(pdf_path)
        for page_num, page in enumerate(doc):
            # Get detailed information in dictionary format
            text_dict = page.get_text("dict")
            item_index = 0  # Restarts on every page
            for block in text_dict.get("blocks", []):
                if block.get("type") == 0:  # Text block
                    for line in block.get("lines", []):
                        for span in line.get("spans", []):
                            self.text_items.append({
                                'original_index': item_index,
                                'page': page_num,
                                'text': span.get("text", ""),
                                'bbox': span.get("bbox", []),  # [x0, y0, x1, y1]
                                'font': span.get("font", ""),
                                'size': span.get("size", 0)
                            })
                            item_index += 1
        doc.close()
        return self.text_items
    def get_text_by_position(self):
        # Sort by coordinates when needed. PyMuPDF uses a top-left origin,
        # so ascending y0 is top to bottom; pages are kept in order first.
        return sorted(self.text_items,
                      key=lambda x: (x['page'], x['bbox'][1], x['bbox'][0]))
    def get_text_by_original_order(self):
        # Maintain original order
        return self.text_items
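One way to put both orderings to work is to measure how far they disagree on a page. The following is a rough sketch rather than an established method: `order_disagreement` is a hypothetical helper, it expects single-page dictionaries shaped like those built by `extract_with_metadata`, and it assumes y grows downward as in PyMuPDF.

```python
from itertools import combinations


def order_disagreement(items):
    """Fraction of item pairs whose stream order and position order differ.

    `items` are dicts with 'original_index' and 'bbox' ([x0, y0, x1, y1]),
    all taken from one page. Returns 0.0 when both orders agree and 1.0
    when the position order is the exact reverse of the stream order.
    """
    by_position = sorted(items, key=lambda it: (it["bbox"][1], it["bbox"][0]))
    # Stream indices listed in position order.
    indices = [it["original_index"] for it in by_position]
    pairs = list(combinations(range(len(indices)), 2))
    if not pairs:
        return 0.0
    inversions = sum(1 for i, j in pairs if indices[i] > indices[j])
    return inversions / len(pairs)


single_column = [
    {"original_index": 0, "bbox": [50, 100, 200, 112]},
    {"original_index": 1, "bbox": [50, 120, 200, 132]},
    {"original_index": 2, "bbox": [50, 140, 200, 152]},
]
two_columns = [
    {"original_index": 0, "bbox": [50, 100, 200, 112]},   # left, line 1
    {"original_index": 1, "bbox": [50, 120, 200, 132]},   # left, line 2
    {"original_index": 2, "bbox": [300, 100, 450, 112]},  # right, line 1
    {"original_index": 3, "bbox": [300, 120, 450, 132]},  # right, line 2
]

print(order_disagreement(single_column))
print(order_disagreement(two_columns))
```

A value of 0 means the stream is already in top-to-bottom order, so either ordering gives the same result. Larger values mark pages where the two orders diverge and deserve a closer look; what threshold to act on is a judgment call for each document set.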
Best Practices
1. Choose Based on Use Case
def choose_extraction_method(use_case):
    if use_case == "full_text_search":
        # Full-text search: prioritize original order
        return "original_order"
    elif use_case == "layout_analysis":
        # Layout analysis: prioritize coordinate information
        return "position_based"
    elif use_case == "content_extraction":
        # Content extraction: hybrid
        return "hybrid"
    # Fail loudly instead of silently returning None
    raise ValueError(f"Unknown use case: {use_case!r}")
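The method names returned above still have to be turned into an actual ordering. A minimal sketch of that step follows, assuming items shaped like the hybrid extractor's dictionaries. `order_items` is a hypothetical helper, and its reading of "hybrid" (stream order, with each item annotated by its rank in position order) is only one possible interpretation.

```python
def order_items(items, method):
    """Return `items` ordered according to the chosen method name.

    Assumes each item has 'page', 'original_index' and 'bbox' keys and that
    y grows downward (top-left origin).
    """
    stream = sorted(items, key=lambda it: (it["page"], it["original_index"]))
    if method == "original_order":
        return stream
    by_position = sorted(
        items, key=lambda it: (it["page"], it["bbox"][1], it["bbox"][0])
    )
    if method == "position_based":
        return by_position
    if method == "hybrid":
        # Keep stream order, but record where each item falls visually.
        rank = {id(it): n for n, it in enumerate(by_position)}
        return [{**it, "position_rank": rank[id(it)]} for it in stream]
    raise ValueError(f"unknown method: {method!r}")


sample = [
    {"page": 0, "original_index": 0, "text": "Heading", "bbox": [100, 200, 180, 212]},
    {"page": 0, "original_index": 1, "text": "Footnote", "bbox": [300, 400, 380, 412]},
    {"page": 0, "original_index": 2, "text": "Body text", "bbox": [100, 300, 180, 312]},
]

for method in ("original_order", "position_based"):
    print(method, [it["text"] for it in order_items(sample, method)])
print("hybrid", [(it["text"], it["position_rank"]) for it in order_items(sample, "hybrid")])
```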
2. Error Handling
async function safeExtractText(page) {
  try {
    const textContent = await page.getTextContent();
    // Check for empty text
    if (!textContent.items || textContent.items.length === 0) {
      console.warn('No text found in page');
      return [];
    }
    // Check for garbled characters
    const items = textContent.items.filter(item => {
      // Remove CID characters
      return !item.str.includes('(cid:');
    });
    return items;
  } catch (error) {
    console.error('Text extraction failed:', error);
    return [];
  }
}
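The same kind of check can be applied on the Python side. pdfminer.six, which pdfplumber builds on, writes `(cid:N)` placeholders for glyphs it cannot map to Unicode, so filtering them is most relevant there. The sketch below is a rough Python counterpart that works on plain strings; `filter_extracted` is a hypothetical helper, and unlike the JavaScript version it returns the rejected strings as well so they can be logged or inspected.

```python
import re

# Matches placeholders such as "(cid:123)".
CID_PATTERN = re.compile(r"\(cid:\d+\)")


def filter_extracted(texts):
    """Split extracted strings into usable ones and ones containing (cid:N).

    Empty or whitespace-only strings are skipped. Order is preserved in
    both returned lists.
    """
    kept, rejected = [], []
    for text in texts:
        if not text.strip():
            continue  # nothing useful in this fragment
        if CID_PATTERN.search(text):
            rejected.append(text)
        else:
            kept.append(text)
    return kept, rejected


kept, rejected = filter_extracted(["Title", "", "(cid:123)(cid:45)", "Body", "   "])
print(kept)
print(rejected)
```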
3. Performance Optimization
# Processing large PDFs
def extract_large_pdf(pdf_path, batch_size=10):
    doc = fitz.open(pdf_path)
    total_pages = len(doc)
    for start_idx in range(0, total_pages, batch_size):
        end_idx = min(start_idx + batch_size, total_pages)
        batch_texts = []
        for page_idx in range(start_idx, end_idx):
            page = doc[page_idx]
            # Memory-efficient processing
            text = page.get_text("text")
            batch_texts.append(text)
            # Release page object
            page = None
        yield batch_texts
    doc.close()
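The batching logic can also be separated from PDF access, which makes it testable without a file. The sketch below assumes nothing about the source beyond it being an iterable of page texts; `batched_texts` is a hypothetical helper name.

```python
from itertools import islice


def batched_texts(page_texts, batch_size=10):
    """Lazily yield lists of up to `batch_size` texts from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(page_texts)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # source exhausted
        yield batch


# Stand-in for real page texts; a generator keeps only one batch in memory.
fake_pages = (f"page {n}" for n in range(5))
batches = list(batched_texts(fake_pages, batch_size=2))
print(batches)
```

With PyMuPDF, the input could be a generator such as `(page.get_text("text") for page in doc)`, so that pages are read only as each batch is requested.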
Troubleshooting
Common Problems and Solutions
- Japanese character garbling
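# Explicit UTF-8 round trip: drops code points that cannot be encoded
# (such as lone surrogates); other characters pass through unchanged
text = page.get_text("text").encode('utf-8', errors='ignore').decode('utf-8')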
- Vertical text processing
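// Determine writing direction
function getWritingDirection(item) {
  // Determine from transform matrix: compare the rotation terms so the
  // result does not depend on font size (this detects rotated text)
  const [a, b] = item.transform;
  return Math.abs(b) > Math.abs(a) ? 'vertical' : 'horizontal';
}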
- Out of memory
# Streaming processing
def stream_pdf_text(pdf_path):
    doc = fitz.open(pdf_path)
    for page in doc:
        yield page.get_text("text")
        page = None  # Explicit release
    doc.close()
Summary
Preserving order is an important challenge in PDF transparent text extraction. Key points are:
- Understanding the problem: Coordinate-based sorting is the main cause of order disruption
- Choosing the right library: Use PDF.js (JavaScript) or PyMuPDF (Python)
- Implementation approach: Maintain the content stream order
- Hybrid approach: Leverage both order and coordinate information
By applying this knowledge, you can build more accurate and reliable PDF text extraction systems.