Yahoo News articles may be deleted after a certain period. When archiving them locally for personal records, several tools are available.

The following five methods were tested against the same article and their results compared.

  • SingleFile CLI — saves as a single HTML file
  • Playwright PDF — converts the page to PDF
  • ArchiveBox — batch saves in multiple formats (including WARC)
  • WARC — a standard web archive format
  • yt-dlp — downloads embedded videos

Comparison Results

Method          Format            Folder Size  Ads              Video
SingleFile CLI  Single HTML       1.3MB        Included         ×
Playwright PDF  PDF               2.5MB        Mostly excluded  ×
ArchiveBox      Multiple formats  43MB         Included
yt-dlp          MP4               27MB         -

The 43MB from ArchiveBox includes SingleFile, PDF, WARC, and extracted text. When all methods are used together, a single article consumes approximately 74MB of storage.

SingleFile CLI

SingleFile is a tool that saves a web page as a single HTML file with embedded images and CSS.

While the Chrome extension is well known, a CLI version is also available.

Installation and Execution

npm install -g single-file-cli
single-file 'https://news.yahoo.co.jp/articles/xxxxx' output.html

Removing Unwanted Elements

The --removed-elements-selector option can be used to remove specific elements.

single-file 'https://news.yahoo.co.jp/articles/xxxxx' output.html \
  --removed-elements-selector='header, footer, nav, aside, [id^="yads_"]'

However, depending on the CSS selectors specified, article components such as the source attribution and update timestamps may be unintentionally removed. It is advisable to verify the saved output when using element removal.
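
One way to carry out this verification is to check that text visible in the live article is still present in the saved file. The following is a minimal sketch, not part of SingleFile: the function names and the example strings are hypothetical, and a plain substring match will miss text that was saved with HTML entities or split across tags.

```python
from pathlib import Path


def missing_snippets(html: str, expected: list[str]) -> list[str]:
    """Return the expected snippets that do not appear in the saved HTML."""
    return [snippet for snippet in expected if snippet not in html]


def check_saved_file(path: str, expected: list[str]) -> list[str]:
    """Read a saved page from disk and report which snippets are missing."""
    html = Path(path).read_text(encoding="utf-8", errors="replace")
    return missing_snippets(html, expected)


# In-memory demonstration with made-up content.
saved = "<article><p>Body text</p><p>Example Publisher</p></article>"
print(missing_snippets(saved, ["Example Publisher", "Updated 12:34"]))
# ['Updated 12:34']
```

For a real check, call check_saved_file('output.html', [...]) with strings copied from the article, such as the publisher name and the displayed update time.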

Characteristics

  • Images are embedded as Base64, so everything is contained in a single HTML file
  • Can be opened directly in a browser
  • Reproduces the page appearance almost exactly
  • Ads are preserved as-is
  • Dynamically loaded elements (e.g., lazy-loaded images via JavaScript) may be missed
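
Base64 embedding means each image is stored inside the HTML as a data: URI instead of a separate file. The sketch below illustrates the general technique only; it is not SingleFile's code, and to_data_uri is a hypothetical helper name.

```python
import base64


def to_data_uri(data: bytes, mime_type: str) -> str:
    """Encode raw bytes as a data: URI that can be placed in an src attribute."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Three arbitrary bytes stand in for an image file.
print(to_data_uri(b"\x00\x01\x02", "image/png"))
# data:image/png;base64,AAEC

# Base64 turns every 3 bytes into 4 characters, so inlined
# resources take up about a third more space than the originals.
print(len(base64.b64encode(bytes(300))))
# 400
```

This overhead is one reason a self-contained HTML file can be larger than the sum of the resources it was built from.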

Playwright PDF

This method opens the page with Playwright and converts it to PDF using page.pdf().

Script

from datetime import datetime
from html import escape

from playwright.sync_api import sync_playwright


def save_article_pdf(url: str, output_path: str) -> None:
    """Render the page at url to an A4 PDF with a provenance footer."""
    with sync_playwright() as p:
        # PDF generation requires Chromium
        browser = p.chromium.launch()
        try:
            page = browser.new_page()
            # Wait until network activity settles so late-loading content is rendered
            page.goto(url, wait_until="networkidle")

            saved_at = datetime.now().strftime("%Y-%m-%d %H:%M")
            page.pdf(
                path=output_path,
                format="A4",
                margin={"top": "20mm", "right": "15mm", "bottom": "25mm", "left": "15mm"},
                print_background=True,
                display_header_footer=True,
                # Header: page number / total pages, right-aligned
                header_template=(
                    '<div style="font-size:8px; width:100%; text-align:right;'
                    ' padding-right:15mm; color:#888;">'
                    '<span class="pageNumber"></span> / <span class="totalPages"></span></div>'
                ),
                # Footer: source URL (HTML-escaped) and save time
                footer_template=(
                    '<div style="font-size:7px; width:100%; padding:0 15mm;'
                    ' color:#888; display:flex; justify-content:space-between;">'
                    f'<span>Source: {escape(url)}</span>'
                    f'<span>Saved: {saved_at}</span></div>'
                ),
            )
        finally:
            # Close the browser even if navigation or rendering fails
            browser.close()

Why Ads Are Mostly Excluded

The PDF generated by Playwright contained almost no ads. This is likely due to the following reasons:

  • page.pdf() renders using @media print CSS, so elements hidden in print mode (such as ad slots) are not included in the output
  • Playwright operates in a clean browser context without cookies or login state, so ad delivery scripts behave differently from a regular browser

No elements are removed intentionally. Ads are excluded as a side effect of print mode behavior.

Characteristics

  • PDF format makes it easy to view
  • The footer can include the source URL and save timestamp, preserving provenance information
  • The entire page is saved while ads are mostly excluded due to print mode behavior
  • Compact at 2.5MB

ArchiveBox

ArchiveBox is a self-hosted tool that generates archives in multiple formats from a single URL.

Docker-based deployment is recommended.

Setup and Execution

mkdir -p archivebox/data && cd archivebox

# Initialize
docker run --rm -v "$(pwd)/data:/data" archivebox/archivebox:latest init --setup

# Archive an article
docker run --rm -v "$(pwd)/data:/data" archivebox/archivebox:latest \
  add 'https://news.yahoo.co.jp/articles/xxxxx'

# Browse via Web UI
docker run -d -p 8100:8000 -v "$(pwd)/data:/data" archivebox/archivebox:latest server 0.0.0.0:8000

Generated Files

The following files were generated for a single article archive.

File                      Content                                   Size
singlefile.html           SingleFile format single HTML             6.5MB
output.pdf                Full-page PDF                             6.7MB
output.html               DOM snapshot                              764KB
readability/content.html  Article body extracted by Readability     14KB
readability/content.txt   Plain text body                           3KB
mercury/                  Article body extracted by Mercury Parser  -
htmltotext.txt            Full text                                 58KB
warc/*.warc.gz            WARC format                               619KB
media/                    Media files                               -
archive.org.txt           Wayback Machine registration URL          -
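
Per-file sizes like those listed above can be collected with a short script that walks one snapshot folder. This is a sketch under assumptions: folder_sizes and format_size are hypothetical helper names, sizes are reported in decimal units (1MB = 1,000,000 bytes), and the snapshot path has to be supplied by the reader.

```python
from pathlib import Path


def folder_sizes(root: str) -> dict[str, int]:
    """Map each top-level entry under root to its total size in bytes."""
    sizes: dict[str, int] = {}
    for entry in sorted(Path(root).iterdir()):
        if entry.is_dir():
            # Sum every file below the directory, at any depth.
            sizes[entry.name + "/"] = sum(
                f.stat().st_size for f in entry.rglob("*") if f.is_file()
            )
        else:
            sizes[entry.name] = entry.stat().st_size
    return sizes


def format_size(num_bytes: int) -> str:
    """Render a byte count as KB or MB."""
    if num_bytes >= 1_000_000:
        return f"{num_bytes / 1_000_000:.1f}MB"
    return f"{num_bytes / 1000:.0f}KB"


print(format_size(6_500_000))  # 6.5MB
print(format_size(619_000))    # 619KB
```

Calling folder_sizes() on a snapshot directory and passing each value through format_size() produces a listing in the same shape as the table.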

Characteristics

  • A single command saves multiple formats including SingleFile, PDF, WARC, and extracted text
  • The Web UI allows searching and browsing archived articles
  • Readability / Mercury text extraction is performed automatically
  • Automatic Wayback Machine registration is enabled by default
  • The entire page is saved as-is, including ads and comments
  • At 43MB per article, storage consumption is significant

WARC Format

WARC (Web ARChive) is a standard archive format (ISO 28500) used by the Internet Archive. In this test, the WARC file automatically generated by ArchiveBox (via wget internally) was examined.

A 619KB .warc.gz file was generated, containing 54 WARC records (HTTP request/response pairs). Since this format records HTTP-level communications as-is, it can reproduce the state of the page at the time of retrieval.
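
A record count like this can be reproduced with the standard library alone, because each WARC record is a version line, a block of headers, a blank line, and a body whose length is given by Content-Length. The following is a simplified sketch with hypothetical function names; it ignores folded header lines, and a dedicated library such as warcio would be more robust for real work.

```python
import gzip
import io
from collections import Counter
from typing import BinaryIO


def count_warc_records(stream: BinaryIO) -> Counter:
    """Count records by WARC-Type in an uncompressed WARC byte stream."""
    counts: Counter = Counter()
    while True:
        line = stream.readline()
        if not line:
            break  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers = {}
        while True:
            header_line = stream.readline()
            if header_line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header block
            name, _, value = header_line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        counts[headers.get("warc-type", "unknown")] += 1
        # Skip the body without interpreting it.
        stream.read(int(headers.get("content-length", "0")))
    return counts


def count_warc_gz(path: str) -> Counter:
    """Count records in a .warc.gz file (gzip.open handles multi-member gzip)."""
    with gzip.open(path, "rb") as f:
        return count_warc_records(f)


# Synthetic two-record stream for demonstration.
sample = (
    b"WARC/1.0\r\nWARC-Type: request\r\nContent-Length: 5\r\n\r\nhello\r\n\r\n"
    b"WARC/1.0\r\nWARC-Type: response\r\nContent-Length: 3\r\n\r\nabc\r\n\r\n"
)
print(dict(count_warc_records(io.BytesIO(sample))))
# {'request': 1, 'response': 1}
```

Grouping by WARC-Type shows how many of the records are requests and responses and how many are other types such as warcinfo or metadata.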

wget captures the HTML and the resources it references (CSS, images, etc.) but does not execute JavaScript, so ads that are loaded dynamically by scripts are generally not included.

Viewing requires a dedicated tool such as ReplayWeb.page, a web application where WARC files can be opened by drag-and-drop in the browser.

Since WARC is included in ArchiveBox’s output, there is no need to generate it separately.

yt-dlp (Video Archiving)

Embedded videos in articles are not saved by any of the above methods. When video preservation is needed, yt-dlp can be used separately.

yt-dlp has an extractor for Yahoo Japan News (yahoo:japannews), allowing video retrieval by simply specifying the article URL.

Installation

brew install yt-dlp ffmpeg

Or:

pip install yt-dlp

ffmpeg is required for HLS decoding.

Execution

yt-dlp 'https://news.yahoo.co.jp/articles/xxxxx'

To specify an output path:

yt-dlp -o '~/Downloads/%(title)s.%(ext)s' 'https://news.yahoo.co.jp/articles/xxxxx'

How It Works

Yahoo News article pages contain server-side rendered JSON (__PRELOADED_STATE__) with video metadata.

"video":{"autostart":1,"vid":XXXXXXXX,"credit":"...","durationString":"X:XX"}

yt-dlp extracts the vid from this JSON and retrieves the HLS manifest (.m3u8) from Yahoo’s video delivery API. The stream is encrypted with AES-128, but the key is included in the manifest and ffmpeg handles decryption.
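
The first step, locating the vid in the page source, can be illustrated with a regular expression. This is a simplified sketch of the idea, not yt-dlp's extractor code: find_video_id is a hypothetical name, the sample string is a made-up fragment shaped like the JSON above, and 12345678 is a placeholder ID.

```python
import re
from typing import Optional

# Matches the numeric "vid" inside a "video":{...} object with no nested braces.
VID_PATTERN = re.compile(r'"video"\s*:\s*\{[^{}]*?"vid"\s*:\s*"?(\d+)"?')


def find_video_id(page_source: str) -> Optional[str]:
    """Return the video ID found in the page source, or None if absent."""
    match = VID_PATTERN.search(page_source)
    return match.group(1) if match else None


sample = '{"article":{"video":{"autostart":1,"vid":12345678,"credit":"..."}}}'
print(find_video_id(sample))              # 12345678
print(find_video_id("<p>text only</p>"))  # None
```

A production extractor would parse the JSON properly instead of pattern matching, since the field order and nesting may change.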

In this test, an HD (1280x720) video of approximately one minute was saved as a 27MB MP4 file.

Articles Without Video

When a text-only article is specified, yt-dlp fails with an error as no video is detected.
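
When archiving many URLs in a batch, this failure can be treated as "nothing to download" so that the run continues. The sketch below wraps the yt-dlp command with subprocess and is an assumption-laden example: download_video is a hypothetical helper, the run parameter exists only so the function can be exercised without yt-dlp installed, and a non-zero exit status does not distinguish a missing video from other errors such as network failures.

```python
import subprocess
from typing import Callable


def download_video(
    url: str,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> bool:
    """Run yt-dlp for one URL; return False instead of raising if it fails."""
    result = run(["yt-dlp", url], capture_output=True, text=True)
    if result.returncode != 0:
        # Show the last line of stderr, which usually carries the error message.
        lines = result.stderr.strip().splitlines()
        print(f"skipped {url}: {lines[-1] if lines else 'yt-dlp failed'}")
        return False
    return True
```

In a batch loop, URLs for which download_video() returns False can be logged and revisited later.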

Output Folder Structure

The files generated during this test, organized by method:

output/
├── singlefile/          1.3MB   Single HTML
├── playwright-pdf/      2.5MB   PDF (with source info)
├── archivebox/           43MB   All formats (SingleFile, PDF, WARC, text extraction, etc.)
└── yt-dlp/               27MB   Video (MP4)

When all methods are used together, a single article amounts to approximately 74MB. Since ArchiveBox alone includes SingleFile, PDF, and WARC at 43MB, the combination of ArchiveBox + yt-dlp (~70MB/article) provides the broadest coverage.

For 100 articles, the estimate is approximately 4.3GB with ArchiveBox alone, or about 7GB including videos.
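
The arithmetic behind these estimates is shown below, using the per-article sizes measured in this test. The helper name estimate_gb is hypothetical, decimal units (1GB = 1000MB) are assumed, and real sizes will vary with article length, image count, and video duration.

```python
# Per-article sizes in MB, taken from the comparison results.
SIZES_MB = {"singlefile": 1.3, "playwright-pdf": 2.5, "archivebox": 43, "yt-dlp": 27}


def estimate_gb(methods: list[str], articles: int) -> float:
    """Estimate total storage in GB for the given methods and article count."""
    per_article_mb = sum(SIZES_MB[m] for m in methods)
    return per_article_mb * articles / 1000


print(round(sum(SIZES_MB.values()), 1))            # 73.8 (all methods, one article)
print(estimate_gb(["archivebox"], 100))            # 4.3
print(estimate_gb(["archivebox", "yt-dlp"], 100))  # 7.0
```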

Choosing by Use Case

Based on the results, no single method covers everything. A practical approach is to choose based on the intended purpose.

  • Faithful preservation of the original → ArchiveBox (retains a complete record of the page including ads; WARC and text extraction are auto-generated)
  • PDF for reading → Playwright PDF (ads excluded via print mode, with source information)
  • Quick standalone save → SingleFile CLI (self-contained in a single HTML file)
  • Video preservation → yt-dlp (downloads embedded video as MP4)

With any method, when using CSS selectors to remove elements, there is a risk of unintentionally deleting article components (source attribution, update timestamps, etc.). For archival reliability, saving the entire page as-is is the safer approach.

The copyright of downloaded content belongs to the original distributors (TV stations, media outlets, etc.). Use beyond the scope of private use (Article 30 of the Japanese Copyright Act), such as redistribution, republication, or commercial use, may constitute copyright infringement.