Comparing Local Archiving Methods for Yahoo News Articles (SingleFile, Playwright, ArchiveBox, WARC, yt-dlp)

Yahoo News articles may be deleted after a certain period. When archiving them locally for personal records, several tools are available. Here, the following five methods were tested against the same article, and the results were compared.

SingleFile CLI: saves as a single HTML file
Playwright PDF: converts the page to PDF
ArchiveBox: batch saves in multiple formats (including WARC)
WARC: a standard web archive format
yt-dlp: downloads embedded videos

Comparison Results

Method          Format            Folder Size  Ads              Video
SingleFile CLI  Single HTML       1.3MB        Included         ×
Playwright PDF  PDF               2.5MB        Mostly excluded  ×
ArchiveBox      Multiple formats  43MB         Included         △
yt-dlp          MP4               27MB         -                ○

The 43MB from ArchiveBox includes SingleFile, PDF, WARC, and extracted text. When all methods are used together, a single article consumes approximately 74MB of storage. ...
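The 74MB figure is the sum of the four folder sizes in the comparison results. The short sketch below reproduces that arithmetic and scales it to a collection of articles. The article count of 100 and the assumption that every article has a similar size (including an embedded video) are hypothetical and not from the test itself.

```python
# Folder sizes in MB, copied from the comparison results.
SIZES_MB = {
    "SingleFile CLI": 1.3,
    "Playwright PDF": 2.5,
    "ArchiveBox": 43.0,
    "yt-dlp": 27.0,
}


def total_storage_mb(sizes, article_count=1):
    """Return the storage needed to archive `article_count` articles with every method.

    Assumes (hypothetically) that each article takes about the same space.
    """
    return sum(sizes.values()) * article_count


if __name__ == "__main__":
    per_article = total_storage_mb(SIZES_MB)
    print(f"Per article: {per_article:.1f} MB")

    # Hypothetical collection of 100 similar articles, using 1 GB = 1024 MB.
    collection_gb = total_storage_mb(SIZES_MB, 100) / 1024
    print(f"100 articles: {collection_gb:.1f} GB")
```

Because the ArchiveBox folder already contains its own SingleFile and PDF outputs, running all methods together stores some formats twice. Dropping the standalone SingleFile and Playwright runs would save only about 4MB per article, so the video and the ArchiveBox folder dominate the total.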

ReplayWeb.page: A Browser-Based Web Archive Replay Tool

Introduction

In Digital Humanities, preserving and reproducing web content is a critical challenge. Websites are constantly updated and can disappear, so web pages used as research subjects need mechanisms for long-term preservation. ReplayWeb.page is a browser-based web archive replay tool developed by the Webrecorder project. It allows you to view archive files in WARC (Web ARChive) and WACZ (Web Archive Collection Zipped) formats directly in your browser.

Key Features of ReplayWeb.page

Client-Side Processing

The most distinctive feature is its client-side processing using Service Workers. Traditional web archive replay tools (like the Wayback Machine) require server-side processing, but ReplayWeb.page completes all processing within the browser. This eliminates the need to build and maintain servers. ...
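As the name suggests, a WACZ file is a ZIP package, so its contents can be inspected with ordinary ZIP tooling before loading it into ReplayWeb.page. The following is a minimal sketch using only the Python standard library. The function names are hypothetical helpers introduced for illustration, and detecting WARC data by file extension is a simplifying assumption rather than part of any official tooling.

```python
import zipfile


def list_wacz_entries(path):
    """Return the names of all files packaged inside a WACZ (ZIP) archive."""
    with zipfile.ZipFile(path) as wacz:
        return wacz.namelist()


def find_warc_files(path):
    """Return the entries that look like WARC files, judged by extension only."""
    return [
        name
        for name in list_wacz_entries(path)
        if name.endswith((".warc", ".warc.gz"))
    ]
```

For a hypothetical file named `example.wacz`, calling `find_warc_files("example.wacz")` would list the WARC data bundled in the package. This only reads the ZIP directory; it does not parse the WARC records themselves.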