This repository contains the source code, datasets, and experimental results for the IMC 2025 paper "Toward Better Efficiency vs. Fidelity Tradeoffs in Web Archives".
In this paper, we make the case that a web archive does not have to make a binary choice between dynamic and static crawling. Instead, by using a browser for a carefully chosen small subset of crawls, an archive can significantly improve its ability to serve statically crawled pages with high fidelity. First, we show how to reuse crawled resources, both across pages and across multiple crawls of the same page over time. Second, by leveraging a dynamic crawl of a page, we show that subsequent static crawls of the page can be augmented to fetch resources without executing the scripts that request them.
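The two ideas above can be illustrated with a toy sketch. This is not the implementation in this repository; the function `plan_static_crawl` and its three arguments are hypothetical names chosen for illustration, and the sketch assumes resources can be matched by exact URL, which a real archive cannot always rely on.

```python
def plan_static_crawl(html_discovered_urls, dynamic_crawl_urls, archived_urls):
    """Return the URLs a static crawl should fetch, in a stable order.

    html_discovered_urls: URLs found by parsing the page's HTML (no scripts run).
    dynamic_crawl_urls:   URLs observed during an earlier browser-based crawl.
    archived_urls:        URLs already stored by the archive (hypothetical input).
    """
    to_fetch = []
    seen = set(archived_urls)  # already-archived resources are reused, not refetched
    for url in list(html_discovered_urls) + list(dynamic_crawl_urls):
        if url not in seen:
            seen.add(url)
            to_fetch.append(url)
    return to_fetch


plan = plan_static_crawl(
    html_discovered_urls=["https://example.com/", "https://example.com/app.js"],
    dynamic_crawl_urls=["https://example.com/app.js", "https://example.com/data.json"],
    archived_urls=["https://example.com/app.js"],
)
print(plan)
```

Here `data.json` is only requested by a script, so a plain static crawl would miss it; the earlier dynamic crawl supplies its URL, and `app.js` is skipped because it is already archived.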
- Python 3.8
- Node.js 22
- Chrome 130
  - Chrome for Testing is recommended
  - The Webrecorder extension needs to be installed
git clone https://github.com/[your-username]/IMC-25-Artifact.git
cd IMC-25-Artifact
# Install Python dependencies
pip install -r requirements.txt
pip install -e .
# Install Node.js dependencies
npm install

After installation, update the configuration file to match your environment:
Edit warctradeoff/config.json and update the following settings:
{
"host": "localhost:8080",
"host_proxy": "localhost:8079",
"host_proxy_test": "localhost:8078",
"host_proxy_patch": "localhost:8078",
"collection": "static_replay",
"pywb_env": ". /path/to/your/pywb/env/bin/activate",
"chrome_data_dir": "/path/to/your/chrome/data/",
"archive_dir": "/path/to/your/archive/files/"
}

Note: Make sure the directories specified in chrome_data_dir and archive_dir exist and are writable.
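The directory requirement in the note can be checked before running any measurements. The following is a minimal sketch, not part of this repository: the helper name `check_config` is hypothetical, and it only inspects the two keys named in the note.

```python
import json
import os
from pathlib import Path


def check_config(config_path):
    """Return a list of human-readable problems found in the config file."""
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)

    problems = []
    # Only the two directory settings mentioned in the note are checked.
    for key in ("chrome_data_dir", "archive_dir"):
        value = config.get(key)
        if not value:
            problems.append(f"{key}: missing from config")
            continue
        path = Path(value)
        if not path.is_dir():
            problems.append(f"{key}: {path} is not an existing directory")
        elif not os.access(path, os.W_OK):
            problems.append(f"{key}: {path} is not writable")
    return problems
```

Calling `check_config("warctradeoff/config.json")` from the repository root returns an empty list when both directories exist and are writable, and one message per problem otherwise.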
Basic usage examples and command-line instructions can be found in measurements/README.md. This includes information on recording, replaying, analyzing differences, and distributed measurement workflows. Please refer to that file for detailed usage and measurement instructions.
For detailed information about the available datasets, please refer to measurements/dataset.md.
If you use this code or the datasets in your research, please cite our paper:
@inproceedings{zhu2025toward,
title={Toward Better Efficiency vs. Fidelity Tradeoffs in Web Archives},
author={Zhu, Jingyuan and Sun, Huanchen and Madhyastha, Harsha V},
booktitle={Proceedings of the 2025 ACM Internet Measurement Conference (IMC)},
year={2025}
}