Creating a Transparent Text PDF from a Single Page Using Google Cloud Vision API

Overview

I recently needed to create a transparent text PDF (a PDF with an invisible, searchable text layer placed over the page) from an existing PDF using the Google Cloud Vision API, so this is a personal note for future reference.

For example, the resulting PDF can be searched for the word "simple".

Background

This note targets PDFs that consist of a single page.

Procedure

Creating the Image

Create an image to be used as the OCR target.

With the default settings the rendered image was blurry, so I doubled the resolution (a zoom factor of 2). Because the OCR coordinates then refer to the enlarged image, they are divided by the same factor when the text is positioned on the page, as described below.

Install the following packages. tqdm is included because the script below imports it for its progress bar.

PyMuPDF
Pillow
tqdm

import fitz  # PyMuPDF
from PIL import Image
import json
from tqdm import tqdm
import io

# Input and output PDF files
input_pdf_path = "./input.pdf"  # single-page PDF file
output_pdf_path = "./output.pdf"

# Open the input PDF and load its single page
pdf_document = fitz.open(input_pdf_path)
page = pdf_document[0]  # select the first page

# Render the page as an image to use as the OCR target
# pix = page.get_pixmap()  # default rendering (72 DPI), which came out blurry

zoom = 2.0

# Apply a zoom matrix to raise the rendering resolution
mat = fitz.Matrix(zoom, zoom)
pix = page.get_pixmap(matrix=mat)

img = Image.open(io.BytesIO(pix.tobytes("png")))
img.save("./image.png")
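
As a rough guide to choosing the zoom factor: PDF page coordinates are measured in points (72 per inch), and PyMuPDF renders at 72 DPI when no matrix is given, so a zoom of 2.0 corresponds to 144 DPI. The sketch below only illustrates that arithmetic. The helper names are my own and are not part of PyMuPDF, and the pixel size is an approximation of what the renderer produces.

```python
PDF_POINTS_PER_INCH = 72  # PDF user-space units per inch


def zoom_to_dpi(zoom: float) -> float:
    """Effective rendering resolution for a given zoom factor."""
    return zoom * PDF_POINTS_PER_INCH


def dpi_to_zoom(dpi: float) -> float:
    """Zoom factor needed to render at the requested DPI."""
    return dpi / PDF_POINTS_PER_INCH


def rendered_size(width_pt: float, height_pt: float, zoom: float) -> tuple:
    """Approximate pixel size of a page of width_pt x height_pt points."""
    return round(width_pt * zoom), round(height_pt * zoom)


if __name__ == "__main__":
    print(zoom_to_dpi(2.0))                # 144.0
    print(rendered_size(612, 792, 2.0))    # (1224, 1584) for a US Letter page
```

A higher zoom gives the OCR more pixels to work with at the cost of a larger image, and the same factor must be used later when converting OCR coordinates back to page coordinates.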

Google Cloud Vision API

Run the Google Cloud Vision API on the image created above. The result contains entries like the following (excerpt).

{
    "textAnnotations": [
        {
            "boundingPoly": {
                "vertices": [
                    {
                        "x": 141,
                        "y": 152
                    },
                    {
                        "x": 1082,
                        "y": 152
                    },
                    {
                        "x": 1082,
                        "y": 1410
                    },
                    {
                        "x": 141,
                        "y": 1410
                    }
                ]
            },
            "description": "Sample PDF...",
            "locale": "la"
        },
        {
            "boundingPoly": {
                "vertices": [
                    {
                        "x": 141,
                        "y": 159
                    },
                    {
                        "x": 363,
                        "y": 156
                    },
                    {
                        "x": 364,
                        "y": 216
                    },
                    {
                        "x": 142,
                        "y": 219
                    }
                ]
            },
            "description": "Sample"
        },
        {
            "boundingPoly": {
                "vertices": [
                    {
                        "x": 382,
                        "y": 156
                    },
                    {
                        "x": 506,
                        "y": 154
                    },
                    {
                        "x": 507,
                        "y": 213
                    },
                    {
                        "x": 383,
                        "y": 215
                    }
                ]
            },
            "description": "PDF"
        },
...

Save the output JSON file with a name such as ./google_ocr.json.
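
This note does not show how the API was called. As one possible approach, the sketch below builds the request body for the REST method images:annotate (POSTed to https://vision.googleapis.com/v1/images:annotate with valid credentials, which are not shown here). The function names are hypothetical. It also assumes that the saved file holds the first element of the "responses" array, since the sample above has textAnnotations at the top level.

```python
import base64
import json


def build_annotate_request(image_bytes: bytes, feature_type: str = "TEXT_DETECTION") -> dict:
    """Build the JSON body for images:annotate from raw image bytes."""
    return {
        "requests": [
            {
                "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
                "features": [{"type": feature_type}],
            }
        ]
    }


def save_first_response(api_response: dict, out_path: str) -> None:
    """Save the first per-image result so textAnnotations sits at the top level."""
    first = api_response["responses"][0]
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(first, f, ensure_ascii=False, indent=4)


if __name__ == "__main__":
    # Placeholder bytes; in practice these would be the contents of ./image.png.
    body = build_annotate_request(b"example image bytes")
    print(json.dumps(body)[:60])
```

Sending the request is left out on purpose, because authentication depends on how the Google Cloud project is set up.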

Then, retrieve the OCR results as follows.

json_path = "./google_ocr.json"

# Load the OCR text data from the JSON file
with open(json_path, "r", encoding="utf-8") as f:
    response = json.load(f)

texts = response["textAnnotations"]
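
The first entry of textAnnotations holds the text of the whole page, and the remaining entries hold the individual words with their own bounding boxes. A minimal sketch of separating the two, using a hypothetical helper name and a shortened copy of the sample data above:

```python
def split_annotations(texts: list) -> tuple:
    """Return (full_page_text, word_annotations) from a textAnnotations list."""
    if not texts:
        return "", []
    return texts[0]["description"], texts[1:]


if __name__ == "__main__":
    sample = [
        {"description": "Sample PDF...", "locale": "la"},
        {"description": "Sample"},
        {"description": "PDF"},
    ]
    full_text, words = split_annotations(sample)
    print(full_text)                          # Sample PDF...
    print([w["description"] for w in words])  # ['Sample', 'PDF']
```

Only the word-level entries are placed on the page later; the full-page entry would cover the entire text area with a single box.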

Creating the Transparent Text

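Apply the results to the PDF using the following script. The key point is the font size: insert_textbox returns a negative value when the text does not fit in the rectangle, so the script reduces the font size step by step until the text fits.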

# Get the page size
rect = page.rect

# Add the OCR text as transparent text
if texts:
    for text in tqdm(texts[1:]):  # skip texts[0], which holds the text of the whole page
        vertices = text["boundingPoly"]["vertices"]

        # The API may omit a coordinate whose value is 0, so default to 0
        x_min = min(v.get("x", 0) for v in vertices)
        y_min = min(v.get("y", 0) for v in vertices)
        x_max = max(v.get("x", 0) for v in vertices)
        y_max = max(v.get("y", 0) for v in vertices)

        x_min = x_min / zoom
        y_min = y_min / zoom
        x_max = x_max / zoom
        y_max = y_max / zoom

        # Define the bounding box
        bbox_rect = fitz.Rect(x_min, y_min, x_max, y_max)
        content = text["description"]

        # Initial font size
        fontsize = 10
        fits = False

        # Shrink the font size until the text fits in the box
        while fontsize > 0:
            # insert_textbox returns a negative value when the text does not fit
            res = page.insert_textbox(bbox_rect, content, fontsize=fontsize, color=(0, 0, 0, 0), render_mode=3, align=1)
            if res >= 0:
                fits = True
                break
            fontsize -= 1  # reduce the font size

        if not fits:
            print(f"'{content}' could not fit in the rectangle.")

# Save the modified PDF
pdf_document.save(output_pdf_path)
pdf_document.close()
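
The coordinate handling in the script above can be isolated as a small function. This is a sketch with a hypothetical function name. It assumes an unrotated page and that both the image and the page coordinates have their origin at the top left, and it treats a missing x or y key as 0.

```python
def vertices_to_pdf_bbox(vertices: list, zoom: float) -> tuple:
    """Convert OCR pixel vertices to an axis-aligned box in PDF points."""
    xs = [v.get("x", 0) for v in vertices]
    ys = [v.get("y", 0) for v in vertices]
    # The image was rendered at `zoom` times the page size, so divide to undo it.
    return (min(xs) / zoom, min(ys) / zoom, max(xs) / zoom, max(ys) / zoom)


if __name__ == "__main__":
    # Vertices of the word "Sample" from the OCR result shown earlier.
    sample_vertices = [
        {"x": 141, "y": 159},
        {"x": 363, "y": 156},
        {"x": 364, "y": 216},
        {"x": 142, "y": 219},
    ]
    print(vertices_to_pdf_bbox(sample_vertices, zoom=2.0))
    # (70.5, 78.0, 182.0, 109.5)
```

Taking the minimum and maximum over all four vertices turns a slightly tilted quadrilateral into the smallest upright rectangle that contains it, which is the shape insert_textbox expects.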

Results

The PDF available at the following link was used as the input.

https://pdfobject.com/pdf/sample.pdf

As a result, I was able to create a transparent text PDF from it.

Summary

I hope this article serves as a useful reference when OCR is needed for specific pages only.
