Comparing Local Archiving Methods for Yahoo News Articles (SingleFile, Playwright, ArchiveBox, WARC, yt-dlp)

Yahoo News articles may be deleted after a certain period. When archiving them locally for personal records, several tools are available.

Here, the following five methods were tested against the same article, and the results were compared.

SingleFile CLI — saves as a single HTML file
Playwright PDF — converts the page to PDF
ArchiveBox — batch saves in multiple formats (including WARC)
WARC — a standard web archive format
yt-dlp — downloads embedded videos

Comparison Results

Method	Format	Folder Size	Ads	Video
SingleFile CLI	Single HTML	1.3MB	Included	Not saved
Playwright PDF	PDF	2.5MB	Mostly excluded	Not saved
ArchiveBox	Multiple formats	43MB	Included	Partial
yt-dlp	MP4	27MB	N/A	Saved

The 43MB from ArchiveBox includes SingleFile, PDF, WARC, and extracted text. When all methods are used together, a single article consumes approximately 74MB of storage.

SingleFile CLI

SingleFile is a tool that saves a web page as a single HTML file with embedded images and CSS.

While the Chrome extension is well known, a CLI version (SingleFile CLI, on GitHub) is also available.

Installation and Execution

npm install -g single-file-cli
single-file 'https://news.yahoo.co.jp/articles/xxxxx' output.html

Removing Unwanted Elements

The --removed-elements-selector option can be used to remove specific elements.

single-file 'https://news.yahoo.co.jp/articles/xxxxx' output.html \
  --removed-elements-selector='header, footer, nav, aside, [id^="yads_"]'

However, depending on the CSS selectors specified, article components such as the source attribution and update timestamps may be unintentionally removed. It is advisable to verify the saved output when using element removal.
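One way to run that check is to confirm that a few strings known to appear in the article are still present in the saved file. The following is a minimal sketch using only the Python standard library. The function name `missing_snippets` and the sample strings are hypothetical, and a plain substring search over the extracted text is assumed to be good enough for a spot check.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def missing_snippets(html_text, expected):
    """Return the expected strings that do not appear in the page text."""
    collector = _TextCollector()
    collector.feed(html_text)
    collector.close()
    # Collapse whitespace so line breaks in the markup do not hide a match.
    page_text = " ".join(" ".join(collector.chunks).split())
    return [s for s in expected if " ".join(s.split()) not in page_text]


# Hypothetical example: the element holding the source attribution was stripped.
saved = "<html><body><article><h1>Title</h1><p>Body text</p></article></body></html>"
print(missing_snippets(saved, ["Title", "Source: Example TV"]))
```

In practice the HTML would be read from the saved output.html, and the expected strings would be copied from the live article before archiving.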

Characteristics

Images are embedded as Base64, so everything is contained in a single HTML file
Can be opened directly in a browser
Reproduces the page appearance almost exactly
Ads are preserved as-is
Dynamically loaded elements (e.g., lazy-loaded images via JavaScript) may be missed
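The Base64 embedding mentioned in the list above can be illustrated with a short sketch. This is not SingleFile's own code. It only shows the general data: URI technique, with a hypothetical helper `to_data_uri` and the 8-byte PNG signature standing in for real image data.

```python
import base64
import mimetypes


def to_data_uri(data: bytes, filename: str) -> str:
    """Encode raw bytes as a data: URI, guessing the MIME type from the name."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None:
        mime = "application/octet-stream"
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"


# The first 8 bytes of any PNG file (its signature), used as stand-in image data.
png_signature = b"\x89PNG\r\n\x1a\n"
uri = to_data_uri(png_signature, "photo.png")
print(f'<img src="{uri}">')

# Base64 turns every 3 bytes into 4 characters, so embedded data grows by about a third.
print(len(base64.b64encode(bytes(300))))
```

The roughly one-third size overhead is the cost of keeping everything in a single file that opens directly in a browser.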

Playwright PDF

This method opens the page with Playwright and converts it to PDF using page.pdf().

Script

import html
from datetime import datetime
from playwright.sync_api import sync_playwright

def save_article_pdf(url: str, output_path: str):
    """Render the page at url and save it as an A4 PDF with a provenance footer."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # Wait until network activity settles so late-loading content is rendered.
        page.goto(url, wait_until="networkidle")

        saved_at = datetime.now().strftime("%Y-%m-%d %H:%M")
        # Escape the URL because it is inserted into the footer's HTML template.
        safe_url = html.escape(url)
        page.pdf(
            path=output_path,
            format="A4",
            margin={"top": "20mm", "right": "15mm", "bottom": "25mm", "left": "15mm"},
            print_background=True,
            display_header_footer=True,
            # Header: page number / total pages, right-aligned.
            header_template=(
                '<div style="font-size:8px; width:100%; text-align:right;'
                ' padding-right:15mm; color:#888;">'
                '<span class="pageNumber"></span> / <span class="totalPages"></span></div>'
            ),
            # Footer: source URL and save timestamp, for provenance.
            footer_template=(
                '<div style="font-size:7px; width:100%; padding:0 15mm;'
                ' color:#888; display:flex; justify-content:space-between;">'
                f'<span>Source: {safe_url}</span>'
                f'<span>Saved: {saved_at}</span></div>'
            ),
        )
        browser.close()

Why Ads Are Mostly Excluded

The PDF generated by Playwright contained almost no ads. This is likely due to the following reasons:

page.pdf() renders using @media print CSS, so elements hidden in print mode (such as ad slots) are not included in the output
Playwright operates in a clean browser context without cookies or login state, so ad delivery scripts behave differently from a regular browser

This is not intentional element removal — ads are excluded as a side effect of print mode behavior.

Characteristics

PDF format makes it easy to view
The footer can include the source URL and save timestamp, preserving provenance information
The entire page is saved while ads are mostly excluded due to print mode behavior
Compact at 2.5MB

ArchiveBox

ArchiveBox is a self-hosted tool that generates archives in multiple formats from a single URL.

Docker-based deployment is recommended.

Setup and Execution

mkdir -p archivebox/data && cd archivebox

# Initialize
docker run --rm -v "$(pwd)/data:/data" archivebox/archivebox:latest init --setup

# Archive an article
docker run --rm -v "$(pwd)/data:/data" archivebox/archivebox:latest \
  add 'https://news.yahoo.co.jp/articles/xxxxx'

# Browse via Web UI
docker run -d -p 8100:8000 -v "$(pwd)/data:/data" archivebox/archivebox:latest server 0.0.0.0:8000

Generated Files

The following files were generated for a single article archive.

File	Content	Size
`singlefile.html`	SingleFile format single HTML	6.5MB
`output.pdf`	Full-page PDF	6.7MB
`output.html`	DOM snapshot	764KB
`readability/content.html`	Article body extracted by Readability	14KB
`readability/content.txt`	Plain text body	3KB
`mercury/`	Article body extracted by Mercury Parser	-
`htmltotext.txt`	Full text	58KB
`warc/*.warc.gz`	WARC format	619KB
`media/`	Media files	-
`archive.org.txt`	Wayback Machine registration URL	-

Characteristics

A single command saves multiple formats including SingleFile, PDF, WARC, and extracted text
The Web UI allows searching and browsing archived articles
Readability / Mercury text extraction is performed automatically
Automatic Wayback Machine registration is enabled by default
The entire page is saved as-is, including ads and comments
At 43MB per article, storage consumption is significant

WARC Format

WARC (Web ARChive) is a standard archive format (ISO 28500) used by the Internet Archive. In this test, the WARC file automatically generated by ArchiveBox (via wget internally) was examined.

A 619KB .warc.gz file was generated, containing 54 WARC records (HTTP request/response pairs). Since this format records HTTP-level communications as-is, it can reproduce the state of the page at the time of retrieval.
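A record count like the one above can be reproduced without dedicated tooling, because a WARC file is a sequence of records that each start with a version line, followed by headers, a blank line, and a payload whose size is given by Content-Length. The sketch below is a minimal reader built on that layout using only the standard library. The function name `count_warc_records` is hypothetical, folded header lines are not handled, and the demo file is synthetic rather than the one produced in this test.

```python
import gzip
import os
import tempfile
from collections import Counter


def count_warc_records(path):
    """Count records in a .warc.gz file, grouped by their WARC-Type header."""
    counts = Counter()
    # gzip.open reads concatenated gzip members (one per record) as one stream.
    with gzip.open(path, "rb") as stream:
        while True:
            line = stream.readline()
            if not line:
                break  # end of file
            if not line.strip():
                continue  # blank separator lines between records
            if not line.startswith(b"WARC/"):
                raise ValueError(f"unexpected line: {line[:40]!r}")
            headers = {}
            # Header lines run until the first empty line.
            for raw in iter(stream.readline, b""):
                if raw in (b"\r\n", b"\n"):
                    break
                name, _, value = raw.decode("utf-8", "replace").partition(":")
                headers[name.strip().lower()] = value.strip()
            # Skip the payload; its size is given by Content-Length.
            stream.read(int(headers.get("content-length", "0")))
            counts[headers.get("warc-type", "unknown")] += 1
    return counts


def _record(warc_type, body):
    """Build one gzip-compressed synthetic WARC record for the demo."""
    head = (f"WARC/1.0\r\nWARC-Type: {warc_type}\r\n"
            f"Content-Length: {len(body)}\r\n\r\n").encode()
    return gzip.compress(head + body + b"\r\n\r\n")


with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.warc.gz")
    with open(sample, "wb") as f:
        f.write(_record("request", b"GET / HTTP/1.1\r\n\r\n"))
        f.write(_record("response", b"HTTP/1.1 200 OK\r\n\r\nhello"))
    sample_counts = count_warc_records(sample)

print(dict(sample_counts))
```

Grouping by WARC-Type also shows how many of the records are requests and responses and how many are other record types.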

wget captures only the HTML and the resources it references directly (CSS, images, etc.) and does not execute JavaScript, so ads loaded dynamically by JavaScript are generally not included.

Viewing requires a dedicated tool such as ReplayWeb.page, a web application where WARC files can be opened by drag-and-drop in the browser.

Since WARC is included in ArchiveBox’s output, there is no need to generate it separately.

yt-dlp (Video Archiving)

Embedded videos in articles are not fully saved by any of the above methods. When video preservation is needed, yt-dlp can be used separately.

yt-dlp has an extractor for Yahoo Japan News (yahoo:japannews), allowing video retrieval by simply specifying the article URL.

Installation

brew install yt-dlp ffmpeg

Or:

pip install yt-dlp

ffmpeg is required for HLS decoding.

Execution

yt-dlp 'https://news.yahoo.co.jp/articles/xxxxx'

To specify an output path:

yt-dlp -o '~/Downloads/%(title)s.%(ext)s' 'https://news.yahoo.co.jp/articles/xxxxx'

How It Works

Yahoo News article pages contain server-side rendered JSON (__PRELOADED_STATE__) with video metadata.

"video":{"autostart":1,"vid":XXXXXXXX,"credit":"...","durationString":"X:XX"}

yt-dlp extracts the vid from this JSON and retrieves the HLS manifest (.m3u8) from Yahoo’s video delivery API. The stream is encrypted with AES-128, but the manifest supplies the key information, so ffmpeg can decrypt it.
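The vid lookup described above can be sketched as follows. This is not yt-dlp's extractor code. It assumes the state appears in the page source as the marker `__PRELOADED_STATE__` followed by a plain JSON object, and the nesting around "video" in the sample is hypothetical, since only the fragment itself is shown in this article.

```python
import json


def extract_preloaded_state(html_text, marker="__PRELOADED_STATE__"):
    """Parse the first JSON object that follows the marker in the HTML source."""
    start = html_text.find(marker)
    if start == -1:
        return None
    brace = html_text.find("{", start)
    if brace == -1:
        return None
    # raw_decode parses one JSON value and ignores whatever follows it.
    state, _end = json.JSONDecoder().raw_decode(html_text, brace)
    return state


def find_video_ids(node):
    """Recursively collect every "vid" found under a "video" key."""
    found = []
    if isinstance(node, dict):
        video = node.get("video")
        if isinstance(video, dict) and "vid" in video:
            found.append(video["vid"])
        for value in node.values():
            found.extend(find_video_ids(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_video_ids(item))
    return found


# Hypothetical page fragment with a made-up vid.
sample_html = (
    '<script>window.__PRELOADED_STATE__ = {"article":{"video":'
    '{"autostart":1,"vid":12345678,"durationString":"1:02"}}};</script>'
)
state = extract_preloaded_state(sample_html)
print(find_video_ids(state))
```

An empty result from `find_video_ids` would correspond to an article with no embedded video.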

In this test, an HD (1280x720) video of approximately one minute was saved as a 27MB MP4 file.
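For the AES-128 step described in this section, HLS playlists in general announce encryption with an #EXT-X-KEY tag that names the method and a URI from which the key is obtained. The sketch below parses that tag from a playlist. The function name `parse_key_tags` is hypothetical, and the sample playlist and its URIs are placeholders, not what Yahoo's API returns.

```python
import re


def parse_key_tags(manifest_text):
    """Return the attributes of each #EXT-X-KEY tag in an HLS playlist."""
    keys = []
    for line in manifest_text.splitlines():
        if not line.startswith("#EXT-X-KEY:"):
            continue
        attr_text = line[len("#EXT-X-KEY:"):]
        # Attributes are NAME=value pairs; quoted values may contain commas.
        pairs = re.findall(r'([A-Z0-9-]+)=("[^"]*"|[^,]*)', attr_text)
        keys.append({name: value.strip('"') for name, value in pairs})
    return keys


# Hypothetical media playlist for illustration only.
sample_manifest = """#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:6
#EXT-X-KEY:METHOD=AES-128,URI="https://example.com/key?id=1,2",IV=0x00000000000000000000000000000001
#EXTINF:6.0,
segment0.ts
#EXT-X-ENDLIST
"""
for key in parse_key_tags(sample_manifest):
    print(key["METHOD"], key["URI"])
```

Tools such as ffmpeg read the same tag to fetch the key and decrypt each segment while downloading.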

Articles Without Video

When a text-only article is specified, yt-dlp fails with an error as no video is detected.

Output Folder Structure

The files generated during this test, organized by method:

output/
├── singlefile/          1.3MB   Single HTML
├── playwright-pdf/      2.5MB   PDF (with source info)
├── archivebox/           43MB   All formats (SingleFile, PDF, WARC, text extraction, etc.)
└── yt-dlp/               27MB   Video (MP4)
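Per-folder sizes like those in the tree above can be measured with a few lines of Python. This is a minimal sketch with a hypothetical helper `folder_sizes`. The demo builds a throwaway directory rather than reading the actual output/ folder, so the byte counts are illustrative only.

```python
import tempfile
from pathlib import Path


def folder_sizes(root):
    """Return {subfolder name: total bytes of all files beneath it}."""
    sizes = {}
    for child in sorted(Path(root).iterdir()):
        if child.is_dir():
            sizes[child.name] = sum(
                f.stat().st_size for f in child.rglob("*") if f.is_file()
            )
    return sizes


# Demo with a temporary directory standing in for output/.
with tempfile.TemporaryDirectory() as tmp:
    for name, size in [("singlefile", 1300), ("playwright-pdf", 2500)]:
        folder = Path(tmp) / name
        folder.mkdir()
        (folder / "article.bin").write_bytes(b"\0" * size)
    demo_sizes = folder_sizes(tmp)

for name, size in demo_sizes.items():
    print(f"{name:<16}{size:>8} bytes")
```

Calling `folder_sizes("output")` on a real archive directory would give the per-method totals in bytes.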

When all methods are used together, a single article amounts to approximately 74MB. Since ArchiveBox alone includes SingleFile, PDF, and WARC at 43MB, the combination of ArchiveBox + yt-dlp (~70MB/article) provides the broadest coverage.

For 100 articles, the estimate is approximately 4.3GB with ArchiveBox alone, or about 7GB including videos.
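The arithmetic behind these estimates can be written out as a small sketch. The per-article sizes come from the comparison table in this article. The helper `estimate_gb` is hypothetical, and it assumes every article has a video of similar size and that 1 GB = 1000 MB, so the result is a rough upper bound rather than a forecast.

```python
# Per-article sizes in MB, taken from the comparison table in this article.
SIZES_MB = {
    "singlefile": 1.3,
    "playwright-pdf": 2.5,
    "archivebox": 43,
    "yt-dlp": 27,
}


def estimate_gb(methods, articles):
    """Estimate storage in GB (1 GB = 1000 MB) for the given methods and article count."""
    per_article_mb = sum(SIZES_MB[m] for m in methods)
    return per_article_mb * articles / 1000


all_methods_mb = sum(SIZES_MB.values())
print(f"all methods, 1 article: {all_methods_mb:.1f} MB")
print(f"ArchiveBox only, 100 articles: {estimate_gb(['archivebox'], 100):.1f} GB")
print(f"ArchiveBox + yt-dlp, 100 articles: {estimate_gb(['archivebox', 'yt-dlp'], 100):.1f} GB")
```

Because text-only articles produce no video file, the real total for a mixed set of articles would fall between the two 100-article figures.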

Choosing by Use Case

Based on the results, no single method covers everything. A practical approach is to choose based on the intended purpose.

Faithful preservation of the original → ArchiveBox (retains a complete record of the page including ads; WARC and text extraction are auto-generated)
PDF for reading → Playwright PDF (ads excluded via print mode, with source information)
Quick standalone save → SingleFile CLI (self-contained in a single HTML file)
Video preservation → yt-dlp (downloads embedded video as MP4)

With any method, when using CSS selectors to remove elements, there is a risk of unintentionally deleting article components (source attribution, update timestamps, etc.). For archival reliability, saving the entire page as-is is the safer approach.

Copyright Notice

The copyright of downloaded content belongs to the original distributors (TV stations, media outlets, etc.). Use beyond the scope of private use (Article 30 of the Japanese Copyright Act) — such as redistribution, republication, or commercial use — constitutes copyright infringement.
