WARC files
Storing native format web archives
by: Mark Williamson
How does Hanzo collect and preserve web and social media content?
Hanzo collects web and social media content by crawling the original websites and social media accounts online. Each component that makes up a webpage is downloaded and stored inside a native format web archive. The original web content is stored exactly as it was captured, along with a rich collection of metadata, inside WARC (web archive) files.
What is a Web Archive File (WARC)?
WARC files are defined by ISO 28500
The standard was created by an international body of experts in digital preservation, including people from the Internet Archive and the Library of Congress
A WARC file is a container for web archives
It preserves web data exactly as it was returned from the webserver. This is called “native format”
It also contains a host of relevant metadata that allows a forensic examiner to verify the integrity of all that has been captured
Who uses WARCs?
The world's largest memory institutions use WARC, including:
Internet Archive, Library of Congress, British Library, Bibliothèque nationale de France, California Digital Library, and many others
The standard is developed and maintained by the International Internet Preservation Consortium
And of course Hanzo Archives
So what is actually inside a WARC file?
A WARC file is a collection of records
Each record relates to an element of a webpage
Record types include request, response and metadata
Each record carries a digital hash (digest) that can be recomputed later to show the record hasn’t been altered in any way
Each WARC contains a great number of records, combined and compressed into a file
A full archive of a site is made up of many WARC files
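The hash check described above can be sketched in a few lines of Python. This is a minimal illustration, not Hanzo's implementation: it assumes the digest is a SHA-1 value written in the "sha1:<base32>" form that is commonly seen in WARC digest headers such as WARC-Block-Digest, and the captured response and function names are hypothetical.

```python
import base64
import hashlib


def block_digest(block: bytes) -> str:
    """Return a labelled digest in the 'sha1:<base32>' form (an assumed convention)."""
    sha1 = hashlib.sha1(block).digest()
    return "sha1:" + base64.b32encode(sha1).decode("ascii")


def is_unaltered(block: bytes, recorded_digest: str) -> bool:
    """Recompute the digest of a record block and compare it with the stored value."""
    return block_digest(block) == recorded_digest


# Hypothetical captured HTTP response, used only for illustration.
captured = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>Hello</html>"

# The digest is computed once at capture time and stored in the record header.
recorded = block_digest(captured)

# Changing even a single byte later produces a different digest.
tampered = captured.replace(b"Hello", b"Hullo")

print(is_unaltered(captured, recorded))
print(is_unaltered(tampered, recorded))
```

Because the digest is stored alongside the content, anyone holding the WARC can repeat the check without access to the original website.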
How do the records get into the WARC?
What does the WARC file look like?
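As a rough illustration, a single record is a block of plain-text header lines, a blank line, and then the captured content exactly as the web server returned it. The Python sketch below builds and prints one response record. The header names come from the WARC standard, but the URL, the captured response and the build_record helper are made up for this example, and a real crawler writes more fields than are shown here.

```python
import uuid
from datetime import datetime, timezone


def build_record(target_uri: str, http_response: bytes) -> bytes:
    """Serialize one WARC 'response' record: header lines, blank line, block, blank lines."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {now}",
        f"WARC-Target-URI: {target_uri}",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_response)}",
    ]
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # The block is stored byte for byte; two CRLF pairs close the record.
    return head + http_response + b"\r\n\r\n"


# Hypothetical capture of a tiny web page.
response = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"<html><body>Hello</body></html>"
)
record = build_record("http://example.com/", response)
print(record.decode("utf-8"))
```

The Content-Length header counts the bytes of the captured block, which is how reading software knows where one record ends and the next begins.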
What can I do with a WARC file?
Access software allows you to recreate the web content inside the WARC, and view it just like the original page
Hanzo Archives software is the most sophisticated access tool to date
Open source software like the Java Wayback Machine from the Internet Archive can also display content stored in WARCs
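Before content can be displayed, access software first has to read the records back out of the file. The Python sketch below shows the idea using only the standard library. It is a simplified reader rather than a full implementation of the standard: it ignores folded header lines, the demo records leave out mandatory fields such as WARC-Record-ID and WARC-Date for brevity, and the iter_records and _demo_record helpers are hypothetical names. It assumes each record was compressed as its own gzip member, which is a common way of storing WARCs.

```python
import gzip
import io


def iter_records(stream):
    """Yield (headers, block) for each record in an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return                      # end of stream
        if not line.strip():
            continue                    # blank separator lines between records
        headers = {"version": line.strip().decode("utf-8")}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break                   # blank line ends the header section
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        block = stream.read(int(headers["Content-Length"]))
        yield headers, block


def _demo_record(uri, body):
    """Build a deliberately minimal record for the demonstration below."""
    head = (
        "WARC/1.0\r\n"
        "WARC-Type: resource\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + body + b"\r\n\r\n"


# Each record is compressed separately, then the gzip members are concatenated.
archive = b"".join(
    gzip.compress(_demo_record(uri, body))
    for uri, body in [
        ("http://example.com/", b"<html>home</html>"),
        ("http://example.com/logo.png", b"\x89PNG..."),
    ]
)

# GzipFile reads through concatenated members as one continuous stream.
with gzip.GzipFile(fileobj=io.BytesIO(archive)) as stream:
    records = list(iter_records(stream))

for headers, block in records:
    print(headers["WARC-Type"], headers["WARC-Target-URI"], len(block))
```

A replay tool builds on this step by indexing the records by URL and capture date, then serving the stored responses back to a browser.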
Can I make my own WARC file?
WARC Tools: An open source package for reading and writing WARCs
wget-WARC: An open source tool for crawling websites and storing them in WARCs
Heritrix: An open source archival crawler for collecting websites and storing them in WARCs
What are WARC Tools?
WARC Tools is an open source package for reading and writing WARCs
The package is written by Hanzo Archives
These tools are used as a library of code for Hanzo and other archiving technologies
WARC Tools are designed to showcase the ISO standard and to encourage the use of WARC files
WARC Tools and wget-WARC are used by Archive Team
Are there alternatives?
Many archiving companies use proprietary formats for web archives, which can lead to vendor lock-in and are a dead end from a preservation perspective
Some companies make images of webpages, rather than capturing the native content. Unless your website is entirely static and non-interactive, you’ll lose much of the user experience by choosing this option
Some older archiving tools write the original content directly to disk. This loses all the metadata that the WARC format uses to ensure the integrity and authenticity of the archived content
WARCs at-a-glance
High definition, legally defensible capture of web content
Open standard means no vendor lock-in
Open source tools mean you can work directly with the data, should you want to
Native format archive means you get the full user experience of a website, long after the original has changed or is gone