Building an Automated DH Tool Awareness System with Playwright, RSS, and AI

Why track DH tools

In the Digital Humanities (DH) field, new tools are continuously developed and released. OCR engines for historical documents, IIIF viewers, text transcription platforms, and kuzushiji (classical Japanese cursive) recognition systems are just a few examples.

In Japan, several organizations actively develop and publish such tools:

- NDL (National Diet Library of Japan) develops OCR tools for digitized materials.
- CODH (Center for Open Data in the Humanities, ROIS-DS) maintains kuzushiji recognition models and the IIIF Curation Platform.
- National Museum of Japanese History develops Minna de Honkoku (a crowdsourced transcription platform) and related IIIF tools.

Keeping up with these releases manually is time-consuming. The goal was to build a system that systematically collects new DH tool releases and generates weekly summary articles, similar to a “current awareness” service. ...
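As a rough illustration of the collection step, the sketch below filters an RSS 2.0 feed down to the items published within the last week, using only the Python standard library. It is not the article's actual implementation: the feed content, the URLs, and the function name `recent_items` are all hypothetical, and a real system would fetch the feed text over HTTP before parsing it.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime
import xml.etree.ElementTree as ET

# Hypothetical feed content, for illustration only (not a real feed).
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example DH tool feed</title>
    <item>
      <title>Example OCR tool v2.0 released</title>
      <link>https://example.org/ocr-v2</link>
      <pubDate>Mon, 06 Jan 2025 09:00:00 +0000</pubDate>
    </item>
    <item>
      <title>Example IIIF viewer v1.0 released</title>
      <link>https://example.org/viewer-v1</link>
      <pubDate>Mon, 02 Dec 2024 09:00:00 +0000</pubDate>
    </item>
  </channel>
</rss>
"""


def recent_items(rss_text, now, days=7):
    """Return RSS items whose pubDate falls within the last `days` days."""
    cutoff = now - timedelta(days=days)
    items = []
    for item in ET.fromstring(rss_text).iter("item"):
        pub_date = item.findtext("pubDate")
        if pub_date is None:
            continue  # skip entries that carry no publication date
        published = parsedate_to_datetime(pub_date)
        if published.tzinfo is None:
            # Assumption: treat dates without a timezone as UTC.
            published = published.replace(tzinfo=timezone.utc)
        if cutoff <= published <= now:
            items.append({
                "title": item.findtext("title"),
                "link": item.findtext("link"),
                "published": published,
            })
    return items


if __name__ == "__main__":
    now = datetime(2025, 1, 8, tzinfo=timezone.utc)
    for entry in recent_items(SAMPLE_RSS, now):
        print(entry["published"].date(), entry["title"], entry["link"])
```

Passing `now` explicitly rather than reading the clock inside the function keeps the weekly window reproducible, which makes the filter easy to test.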

Comparing Local Archiving Methods for Yahoo News Articles (SingleFile, Playwright, ArchiveBox, WARC, yt-dlp)

Yahoo News articles may be deleted after a certain period. When archiving them locally for personal records, several tools are available. Here, the following five methods were tested against the same article, and the results were compared.

- SingleFile CLI: saves as a single HTML file
- Playwright PDF: converts the page to PDF
- ArchiveBox: batch saves in multiple formats (including WARC)
- WARC: a standard web archive format
- yt-dlp: downloads embedded videos

Comparison Results

| Method | Format | Folder Size | Ads | Video |
| --- | --- | --- | --- | --- |
| SingleFile CLI | Single HTML | 1.3MB | Included | × |
| Playwright PDF | PDF | 2.5MB | Mostly excluded | × |
| ArchiveBox | Multiple formats | 43MB | Included | △ |
| yt-dlp | MP4 | 27MB | - | ○ |

The 43MB from ArchiveBox includes SingleFile, PDF, WARC, and extracted text. When all methods are used together, a single article consumes approximately 74MB of storage. ...
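The per-article total can be checked by summing the folder sizes from the comparison table. The short sketch below does that arithmetic; the dictionary and the helper `estimate_storage_mb` are hypothetical names introduced here, and the projection assumes every article is roughly the same size as the one tested, which will not hold in general.

```python
# Folder sizes in MB, taken from the comparison table.
SIZES_MB = {
    "SingleFile CLI": 1.3,
    "Playwright PDF": 2.5,
    "ArchiveBox": 43.0,
    "yt-dlp": 27.0,
}

# Total when all four outputs are kept for one article.
TOTAL_MB = sum(SIZES_MB.values())


def estimate_storage_mb(article_count, per_article_mb=TOTAL_MB):
    """Rough storage estimate, assuming every article matches the tested one."""
    return article_count * per_article_mb


if __name__ == "__main__":
    print(f"Per article: {TOTAL_MB:.1f} MB")
    print(f"100 articles: {estimate_storage_mb(100):.0f} MB")
```

The sum comes to 73.8MB, which rounds to the approximately 74MB quoted above.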

Automating researchmap KAKENHI-Achievement Linking with Playwright

Introduction

researchmap is a platform for researchers in Japan to manage and publish their academic achievements. In addition to registering publications, presentations, and other works, researchers can link them to KAKENHI (Grants-in-Aid for Scientific Research) projects to aggregate outputs per research project.

I looked into whether this linking could be done in bulk via the API or CSV import. As far as I could tell, it appeared to be limited to manual operations through the Web UI. So I tried automating it with Playwright. ...