Why track DH tools

In the Digital Humanities (DH) field, new tools are continuously developed and released. OCR engines for historical documents, IIIF viewers, text transcription platforms, and kuzushiji (classical Japanese cursive) recognition systems are just a few examples. In Japan, several organizations actively develop and publish such tools:

  • NDL (National Diet Library of Japan) develops OCR tools for digitized materials.
  • CODH (Center for Open Data in the Humanities, ROIS-DS) maintains kuzushiji recognition models and the IIIF Curation Platform.
  • National Museum of Japanese History develops Minna de Honkoku (a crowdsourced transcription platform) and related IIIF tools.

Keeping up with these releases manually is time-consuming. The goal was to build a system that systematically collects new DH tool releases and generates weekly summary articles, similar to a “current awareness” service.

Collection targets

The system collects information from three types of sources:

  • X (Twitter): accounts of researchers and organizations actively developing and releasing DH tools.
  • RSS feeds: the Current Awareness Portal (NDL) at https://current.ndl.go.jp/rss.xml.
  • GitHub: public repositories of DH-related organizations and individuals, via the GitHub API.

Methods investigated and decisions

Collecting X posts

Finding a reliable, free way to collect tweets required testing several approaches.

| Method | Cost | Result |
| --- | --- | --- |
| X API (Basic plan) | $100/month | Reliable but too expensive for this use case. Rejected. |
| Web search (site:x.com via Google) | Free | Only returned an indexed subset of tweets; many recent posts were missing. Rejected. |
| RSSHub (self-hosted) | Free | Uses X's internal API and retrieves all tweets, but requires Docker. Considered running it temporarily in a GitHub Actions container, but the Playwright approach turned out to be simpler. Rejected. |
| Playwright (no login) | Free | Some accounts returned 0 tweets because X shows a login wall to unauthenticated visitors. Rejected. |
| Playwright (with cookie auth) | Free | All three accounts were successfully scraped. Running daily minimizes the chance of missing tweets that scroll off the timeline. Adopted. |

AI summarization

| Method | Cost | Result |
| --- | --- | --- |
| Claude Code (manual) | Included in subscription | Cannot be automated in a CI pipeline. Used for initial testing only. |
| Claude API (direct Anthropic) | Pay-per-use | Works, but requires managing a separate API key. Rejected. |
| OpenRouter | Pay-per-use (~$0.05/run) | Access to multiple models through a single API key, with an OpenAI-compatible endpoint. Adopted. |

RSS feeds

  • Current Awareness Portal: the /feed endpoint returned 0 items, but /rss.xml returned 30 items. The system uses the latter.
  • CODH: the website was under maintenance during development and RSS was unavailable. X posts serve as a substitute source.

Final architecture

GitHub Actions (daily at JST 7:00)
  +-- Playwright + Cookie --> X posts (3 accounts)
  +-- RSS --> Current Awareness Portal
  +-- GitHub API --> Repository updates
  +-- Save to data/daily/YYYY-MM-DD.json

GitHub Actions (weekly, Sunday JST 9:00)
  +-- Aggregate week's daily/*.json files
  +-- AI via OpenRouter generates article (topic: new tool releases only)
  +-- Generate both Zenn and Hugo articles
  +-- Create PR --> Review and merge

The daily workflow runs collect_daily.py, commits the JSON data file, and pushes. The weekly workflow runs generate_weekly.py, sets the Zenn article to published, runs sync_to_hugo.py to create the Hugo version, and opens a pull request for human review before merging.

Implementation details

Daily collection (collect_daily.py)

The daily script collects from all three source types in a single run.

For X, it launches a headless Chromium browser via Playwright, sets auth_token and ct0 cookies on the .x.com domain, then navigates to each account’s profile page. It scrolls the timeline five times to load more tweets, then extracts tweet text, URL, and datetime from article[data-testid="tweet"] elements in the DOM. The time element’s parent a tag provides the permalink, and the datetime attribute provides the timestamp.
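The X collection step can be sketched as follows. This is a minimal illustration of the approach described above, not the actual collect_daily.py: the function names (build_cookies, scrape_profile) are hypothetical, and the selectors and scroll counts follow the description in the text.

```python
def build_cookies(auth_token: str, ct0: str) -> list[dict]:
    """Cookie dicts in the shape Playwright's context.add_cookies() expects."""
    return [
        {"name": "auth_token", "value": auth_token, "domain": ".x.com", "path": "/"},
        {"name": "ct0", "value": ct0, "domain": ".x.com", "path": "/"},
    ]

def scrape_profile(handle: str, auth_token: str, ct0: str, scrolls: int = 5) -> list[dict]:
    # Deferred import so the module also loads where Playwright is absent.
    from playwright.sync_api import sync_playwright

    tweets = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context()
        context.add_cookies(build_cookies(auth_token, ct0))
        page = context.new_page()
        page.goto(f"https://x.com/{handle}")
        page.wait_for_selector('article[data-testid="tweet"]', timeout=15000)
        for _ in range(scrolls):  # scroll to load more of the timeline
            page.mouse.wheel(0, 3000)
            page.wait_for_timeout(1500)
        for article in page.query_selector_all('article[data-testid="tweet"]'):
            time_el = article.query_selector("time")
            if time_el is None:
                continue
            # The <time> element's parent <a> carries the tweet permalink.
            permalink = time_el.evaluate("el => el.parentElement.href")
            tweets.append({
                "text": article.inner_text(),
                "url": permalink,
                "datetime": time_el.get_attribute("datetime"),
            })
        browser.close()
    return tweets
```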

For Current Awareness, the script fetches the RSS XML with urllib.request and parses it with xml.etree.ElementTree from the standard library. It extracts title, link, pubDate, and content:encoded (falling back to description) from each <item>.
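The RSS parsing step might look like this, using only the standard library as the text describes. The helper name parse_rss is illustrative; the content:encoded namespace is the standard RSS content module.

```python
import xml.etree.ElementTree as ET

# Fully qualified tag for <content:encoded> in the RSS content module namespace.
CONTENT_ENCODED = "{http://purl.org/rss/1.0/modules/content/}encoded"

def parse_rss(xml_text: str) -> list[dict]:
    """Extract title, link, pubDate, and body text from each <item>."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        # Prefer the full content:encoded body, fall back to <description>.
        body = item.findtext(CONTENT_ENCODED) or item.findtext("description") or ""
        items.append({
            "title": item.findtext("title", ""),
            "link": item.findtext("link", ""),
            "pubDate": item.findtext("pubDate", ""),
            "content": body,
        })
    return items
```

In the real pipeline the XML would first be fetched with urllib.request before being handed to a parser like this.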

For GitHub, it calls the public events API (/users/{owner}/events/public) and filters for CreateEvent (repository creation only), ReleaseEvent, and PushEvent. When the API response omits the commits field (which happens for unauthenticated requests), the script falls back to extracting the branch name from the ref field.
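The event filtering could be sketched as a pure function over the API response. The helper name and output fields are hypothetical; the event and payload field names follow the public GitHub events API.

```python
def extract_updates(events: list[dict]) -> list[dict]:
    """Keep only repository creations, releases, and pushes from a
    /users/{owner}/events/public response."""
    updates = []
    for ev in events:
        etype = ev.get("type")
        payload = ev.get("payload", {})
        if etype == "CreateEvent" and payload.get("ref_type") == "repository":
            kind, summary = "new-repo", "repository created"
        elif etype == "ReleaseEvent":
            kind, summary = "release", payload.get("action", "")
        elif etype == "PushEvent":
            kind = "push"
            commits = payload.get("commits")
            if commits:
                summary = "; ".join(c.get("message", "") for c in commits)
            else:
                # Unauthenticated responses may omit commits; fall back to
                # the branch name from the ref field.
                branch = payload.get("ref", "").rsplit("/", 1)[-1]
                summary = f"push to {branch}"
        else:
            continue  # ignore stars, forks, issue activity, etc.
        updates.append({
            "repo": ev.get("repo", {}).get("name", ""),
            "type": kind,
            "summary": summary,
        })
    return updates
```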

The only external dependency is Playwright. Everything else uses Python’s standard library.

Weekly article generation (generate_weekly.py)

The weekly script loads all data/daily/*.json files for the past seven days, deduplicates items by URL, and sends them to the OpenRouter API. The API call uses urllib.request with an OpenAI-compatible endpoint at https://openrouter.ai/api/v1/chat/completions.
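The deduplication and the request construction can be sketched like this. Both helper names are illustrative, and the request is only built, not sent; the endpoint URL and the OpenAI-compatible payload shape are as described above.

```python
import json
import urllib.request

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def dedupe_by_url(items: list[dict]) -> list[dict]:
    """Keep the first occurrence of each URL, preserving input order."""
    seen, unique = set(), []
    for item in items:
        url = item.get("url")
        if url in seen:
            continue
        seen.add(url)
        unique.append(item)
    return unique

def build_chat_request(api_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-compatible chat completion request."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        OPENROUTER_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending the request is then a single `urllib.request.urlopen(req)` call, with the article text read out of the JSON response.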

The prompt strictly filters for “new tool development or release” topics. The exclusion rules are explicit:

  • Digital archive launches or renewals (these are not tool development)
  • Metadata service changes
  • General news, event announcements, personnel changes
  • Git pushes alone do not warrant an article entry – only items with clear feature additions are included
  • Vague language like “it is speculated” or “it is possible” is prohibited
  • If no qualifying topics exist for the week, the article states that plainly

The writing style rules are embedded directly in the prompt: polite Japanese (desu/masu) style, no bold text, no assertions, no template closings.

Hugo integration

sync_to_hugo.py reads published Zenn articles matching dh-weekly-*.md, parses the Zenn-style frontmatter, and converts it to Hugo frontmatter with a DH-Weekly category and appropriate tags. It extracts the date from the filename and generates a Zenn source URL.
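A simplified sketch of that conversion is below. It assumes flat `key: value` Zenn frontmatter and a dh-weekly-YYYY-MM-DD.md filename; the function name, category, and tag values are illustrative, not the actual sync_to_hugo.py.

```python
import re

def zenn_to_hugo(zenn_md: str, filename: str) -> str:
    """Convert a Zenn article (frontmatter + body) to a Hugo article."""
    # Split off the frontmatter block delimited by "---" lines.
    head, body = zenn_md.split("---\n", 2)[1:3]
    meta = {}
    for line in head.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip().strip('"')
    # The publication date comes from the filename, as described above.
    date = re.search(r"(\d{4}-\d{2}-\d{2})", filename).group(1)
    hugo_head = "\n".join([
        "---",
        f'title: "{meta.get("title", "")}"',
        f"date: {date}",
        "categories: [DH-Weekly]",
        'tags: ["digital-humanities", "tools"]',
        "---",
    ])
    return hugo_head + "\n" + body
```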

During development, a bug surfaced in the Hugo template schema_json.html: the safeJS pipeline in Hugo 0.157 caused a “slice bounds out of range” panic when processing multibyte characters (common in Japanese text). The fix was to replace the safeJS pipeline with .Plain | truncate 5000 | jsonify.

Cookie authentication

X shows a login wall on most profile pages to unauthenticated visitors. The system therefore uses cookie-based authentication rather than the official API.

Initial setup requires manually extracting auth_token and ct0 values from the browser’s DevTools (Application tab, Cookies for x.com). These are stored locally in .x_cookies.json, which is gitignored.

In GitHub Actions, the cookie values are stored as a repository secret named TWITTER_COOKIE in the format auth_token=xxx;ct0=yyy. The script reads this from the environment variable and parses it.
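Parsing that secret format is straightforward; a sketch with a hypothetical helper name:

```python
def parse_cookie_secret(raw: str) -> dict:
    """Parse the TWITTER_COOKIE secret format, e.g. 'auth_token=xxx;ct0=yyy'."""
    cookies = {}
    for pair in raw.split(";"):
        if "=" in pair:
            name, _, value = pair.partition("=")
            cookies[name.strip()] = value.strip()
    return cookies

# In CI the secret arrives via the environment, e.g.:
# cookies = parse_cookie_secret(os.environ["TWITTER_COOKIE"])
```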

Automated login via Playwright was attempted but X’s bot detection blocked it consistently. Manual cookie extraction was the more reliable path. The cookies expire after a few months and need periodic renewal.

Prompt tuning process

Getting the AI to generate articles with the right scope required several iterations.

The first generation mixed in off-topic items. Digital archive renewals (which are not tool development) appeared alongside actual tool releases. Adding explicit exclusion rules to the prompt fixed this.

Vague speculative language (“this may indicate future development”) appeared in early outputs. Adding a rule to ban speculative phrasing and to skip items with insufficient information resolved it.

Git push events from GitHub initially generated article entries even when no meaningful feature change was evident. Adding the rule that pushes alone do not warrant coverage – only items with concrete, identifiable feature changes – eliminated these false positives.

After these adjustments, the system produced 0 off-topic items across 6 generated articles.

Costs

| Component | Cost |
| --- | --- |
| X posts (Playwright + cookie) | Free |
| Current Awareness (RSS) | Free |
| GitHub (public API) | Free |
| AI article generation (OpenRouter) | ~$0.05/run, ~$0.20/month |
| GitHub Actions | Within the free tier (2,000 min/month; each run takes ~2 minutes) |

The total recurring cost is approximately $0.20 per month.