ReplayWeb.page: A Browser-Based Web Archive Replay Tool

Introduction

In Digital Humanities, preserving and reproducing web content is a critical challenge. Websites are constantly updated or taken offline, so researchers who treat web pages as research subjects need mechanisms for preserving them over the long term.

ReplayWeb.page is a browser-based web archive replay tool developed by the Webrecorder project. It allows you to view archive files in WARC (Web ARChive) and WACZ (Web Archive Collection Zipped) formats directly in your browser.

Key Features of ReplayWeb.page

Client-Side Processing

The most distinctive feature is client-side processing using Service Workers. Traditional web archive replay tools (like the Wayback Machine) require server-side processing, but ReplayWeb.page completes all replay processing within the browser. Static file hosting is enough; there is no dedicated replay server to build and maintain.

WARC/WACZ Format Support

It supports both the international standard WARC format and the WACZ format proposed by Webrecorder. The WACZ format packages WARC files in a ZIP container together with indexes and metadata, enabling efficient random access.
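To make the packaging concrete, here is a minimal sketch that builds a toy WACZ-like ZIP in memory and reads back its directory. The member names (datapackage.json, archive/, indexes/, pages/) follow the layout described in the WACZ specification as far as I understand it; the file contents are placeholders rather than a valid capture, and the helper names are hypothetical.

```python
import io
import json
import zipfile


def build_toy_wacz() -> bytes:
    """Build a tiny WACZ-like ZIP in memory (placeholder contents only)."""
    buf = io.BytesIO()
    # ZIP_STORED keeps members uncompressed at the ZIP level, so a byte
    # offset inside the package maps directly onto the member's content.
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        zf.writestr("datapackage.json", json.dumps({"resources": []}))
        zf.writestr("archive/data.warc.gz", b"placeholder WARC bytes")
        zf.writestr("indexes/index.cdx", "placeholder index line\n")
        zf.writestr("pages/pages.jsonl", '{"id": "pages"}\n')
    return buf.getvalue()


def list_members(wacz_bytes: bytes) -> dict:
    """Map member name -> (header offset, stored size) from the ZIP directory."""
    with zipfile.ZipFile(io.BytesIO(wacz_bytes)) as zf:
        return {
            info.filename: (info.header_offset, info.compress_size)
            for info in zf.infolist()
        }


if __name__ == "__main__":
    for name, (offset, size) in list_members(build_toy_wacz()).items():
        print(f"{name}: offset={offset} size={size}")
```

Because the ZIP directory records where each member starts, a reader can jump straight to the index or to a single WARC file instead of scanning the whole package.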

Loading from Multiple Data Sources

Archive files can be loaded from various sources including local files, URLs, Google Drive, Dropbox, and S3. Thanks to HTTP range request support, even large archive files can be partially accessed without downloading the entire file.
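The partial access described above relies on the standard HTTP Range header. The sketch below only prepares such a request and simulates the server side locally, so it runs without a network; the URL and both function names are illustrative assumptions, not part of ReplayWeb.page.

```python
import urllib.request


def build_range_request(url: str, start: int, length: int) -> urllib.request.Request:
    """Prepare (but do not send) a GET request for `length` bytes from `start`."""
    if start < 0 or length <= 0:
        raise ValueError("start must be >= 0 and length must be > 0")
    end = start + length - 1  # HTTP byte ranges are inclusive on both ends
    return urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})


def slice_like_server(body: bytes, range_header: str) -> bytes:
    """Local stand-in for a server honoring a single 'bytes=first-last' range."""
    unit, _, spec = range_header.partition("=")
    if unit != "bytes":
        raise ValueError("only byte ranges are handled in this sketch")
    first, _, last = spec.partition("-")
    return body[int(first): int(last) + 1]


if __name__ == "__main__":
    # Hypothetical archive location, used only to build the request object.
    req = build_range_request("https://example.org/archive.wacz", 100, 50)
    print(req.get_header("Range"))
    print(len(slice_like_server(bytes(200), req.get_header("Range"))))
```

A client that already knows a record's offset and length from the index can fetch just those bytes this way, provided the host serving the file supports range requests.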

Embedding Support

Using the <replay-web-page> custom element, you can embed archive replay widgets on any web page. This is useful for publishing research outcomes and creating digital exhibitions.
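For sites generated from scripts, the embed markup can be produced programmatically. The following is a small sketch, assuming the element's source attribute points at the archive file and its url attribute names the archived page to open; render_embed is a hypothetical helper, and the example URLs are placeholders. The hosting page must also load ReplayWeb.page's UI script and serve its service worker file as described in the project's embedding documentation, which this sketch omits.

```python
from html import escape


def render_embed(source: str, url: str) -> str:
    """Return <replay-web-page> markup for one archived page.

    Attribute values are HTML-escaped so that query strings containing
    '&' or quotes do not break the markup.
    """
    return (
        f'<replay-web-page source="{escape(source, quote=True)}" '
        f'url="{escape(url, quote=True)}"></replay-web-page>'
    )


if __name__ == "__main__":
    print(render_embed("https://example.org/exhibit.wacz", "https://example.org/"))
```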

Use Cases in DH Research

Long-Term Website Preservation

Research-relevant websites can be saved in WARC/WACZ format, maintaining them in a referenceable state for the future. Web content from specific points in time can be accurately reproduced without worrying about broken links or alterations.

Building Digital Exhibitions

In museum and library digital exhibitions, you can build displays that faithfully reproduce past websites. Using ReplayWeb.page’s embedding feature, archived content can be seamlessly integrated within exhibition websites.

Preserving Social Media

ReplayWeb.page is well-suited for preserving and reproducing ephemeral content such as social media posts and threads. Combined with Webrecorder’s capture tool (ArchiveWeb.page), you can build a complete workflow from capture to replay.

Basic Workflow

1. Prepare archives: Capture pages using ArchiveWeb.page, wget, or Browsertrix to create WARC/WACZ files
2. Access ReplayWeb.page: Open replayweb.page in your browser
3. Load files: Select a local file or specify a URL to load the archive
4. Browse: Browse saved web pages just as they appeared originally
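As one illustration of the first step, wget can write a WARC file while it downloads a page. The sketch below only assembles the command line and does not run it; the URL and output name are placeholders, and wget_warc_command is a hypothetical helper. It uses wget's --warc-file and --page-requisites options.

```python
import shlex


def wget_warc_command(url: str, warc_name: str) -> list:
    """Build a wget command that saves `url` and its page requisites to a WARC.

    wget appends the WARC extension to `warc_name` itself.
    """
    return ["wget", f"--warc-file={warc_name}", "--page-requisites", url]


if __name__ == "__main__":
    cmd = wget_warc_command("https://example.org/", "example")
    # Print a shell-safe version; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

The resulting WARC file can then be loaded in the third step as a local file.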

Related Tools

The Webrecorder project provides the following tools in addition to ReplayWeb.page:

ArchiveWeb.page: Browser extension for capturing web pages
Browsertrix: Large-scale web crawling and archive automation
py-wacz: Python library for working with WACZ files

Technical Architecture

ReplayWeb.page registers a Service Worker that intercepts the HTTP requests made by the replayed page. When a user accesses a URL within the archive, the Service Worker looks up the corresponding response in the WARC/WACZ file and returns the saved content without sending a request to the original server.

Because scripts and stylesheets are served from the archive in the same way, this mechanism can reproduce dynamic web pages, including their JavaScript and CSS, as they were at the time of capture, provided those resources were captured.
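The lookup logic can be modelled in a few lines. The real implementation is a JavaScript Service Worker running in the browser; the Python sketch below is only a conceptual model of "answer from the archive, never from the origin", and every name in it (ArchivedResponse, ReplayHandler, the sample URLs) is hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ArchivedResponse:
    status: int
    headers: Dict[str, str]
    body: bytes


class ReplayHandler:
    """Answers requests from an in-memory index instead of the live web."""

    def __init__(self, index: Dict[str, ArchivedResponse]):
        self._index = index

    def handle(self, url: str) -> ArchivedResponse:
        hit = self._index.get(url)
        if hit is None:
            # Not captured: report a miss rather than contacting the origin.
            return ArchivedResponse(404, {"Content-Type": "text/plain"}, b"Not in archive")
        return hit


if __name__ == "__main__":
    index = {
        "https://example.org/": ArchivedResponse(
            200, {"Content-Type": "text/html"}, b"<h1>Archived page</h1>"
        ),
        "https://example.org/app.js": ArchivedResponse(
            200, {"Content-Type": "text/javascript"}, b"console.log('archived');"
        ),
    }
    handler = ReplayHandler(index)
    print(handler.handle("https://example.org/app.js").status)
    print(handler.handle("https://example.org/missing.css").status)
```

In this model the page's scripts and stylesheets go through the same lookup as the HTML, which is why a capture that omits a resource produces a miss at replay time.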

Conclusion

ReplayWeb.page completes web archive replay entirely within the browser. Its serverless operation and support for the standard WARC and WACZ formats address the challenges of preserving, reproducing, and sharing web content in DH research, making it a valuable tool for researchers interested in digital preservation.
