hanzo Archives

WARC files

Storing native format web archives

by: Mark Williamson

How does Hanzo collect and preserve web and social media content?

Hanzo collects web and social media content by crawling the original websites and social media accounts online. Each component that makes up a webpage is downloaded and stored inside a native format web archive. The original web content is stored exactly as it was captured, along with a rich collection of metadata, inside WARC (web archive) files.

What is a Web Archive File (WARC)?

  • WARC files are defined by ISO 28500
  • The standard was created by an international body of experts in digital preservation, including people from the Internet Archive and the Library of Congress
  • A WARC file is a container for web archives
  • It preserves web data exactly as it was returned from the web server. This is called “native format”
  • It also contains metadata, such as the capture date, the target URI and content digests, that allows a forensic examiner to verify the integrity of all that has been captured

Who uses WARCs?

The world's largest memory institutions use WARC, including:

  • Internet Archive, Library of Congress, British Library, Bibliothèque nationale de France, California Digital Library, and many others
  • The standard is developed and maintained by the International Internet Preservation Consortium
  • And of course Hanzo Archives

So what is actually inside a WARC file?

  • A WARC file is a collection of records
  • Each record relates to an element of a webpage
  • Record types include request, response and metadata
  • Each record has a digest (a cryptographic hash of its content), so any later alteration can be detected
  • Each WARC contains a great number of records, combined and compressed into a file
  • A full archive of a site comprises many WARC files
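
The record structure described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not Hanzo's implementation: the helper names (sha1_base32, build_record, parse_record, block_is_intact) and the sample values are hypothetical, and the "sha1:" plus Base32 digest label is one commonly seen convention rather than the only one the standard allows.

```python
import base64
import hashlib


def sha1_base32(data: bytes) -> str:
    """Label a SHA-1 digest in the 'sha1:<base32>' style often seen in WARC headers."""
    return "sha1:" + base64.b32encode(hashlib.sha1(data).digest()).decode("ascii")


def build_record(warc_type: str, uri: str, block: bytes) -> bytes:
    """Serialise one record: version line, named fields, blank line, block."""
    headers = [
        "WARC/1.0",
        f"WARC-Type: {warc_type}",
        # Fixed ID and date keep this example reproducible; real tools generate them.
        "WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000001>",
        "WARC-Date: 2012-01-01T00:00:00Z",
        f"WARC-Target-URI: {uri}",
        "Content-Type: application/http; msgtype=response",
        f"WARC-Block-Digest: {sha1_base32(block)}",
        f"Content-Length: {len(block)}",
    ]
    return "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n" + block + b"\r\n\r\n"


def parse_record(raw: bytes):
    """Split a serialised record back into version, header fields, and block."""
    head, _, rest = raw.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    fields = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    # Content-Length says exactly how many bytes belong to the block.
    block = rest[: int(fields["Content-Length"])]
    return lines[0], fields, block


def block_is_intact(fields: dict, block: bytes) -> bool:
    """Recompute the digest and compare it with the one stored in the header."""
    return fields.get("WARC-Block-Digest") == sha1_base32(block)


captured = b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nhello"
version, fields, block = parse_record(
    build_record("response", "http://example.com/", captured)
)
print(version, fields["WARC-Type"], block_is_intact(fields, block))
```

Changing even one byte of the block makes the recomputed digest differ from the stored one, which is the property a forensic check relies on.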

How do the records get into the WARC?

[Figure: WARC contents]

What does the WARC file look like?

[Figure: WARC content]

What can I do with a WARC file?

Access software allows you to recreate the web content inside the WARC, and view it just like the original page.

  • Hanzo Archives software is the most sophisticated access tool to date
  • Open source software like the Java Wayback Machine from the Internet Archive can also display content stored in WARCs
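
As a rough idea of the first step an access tool has to take, the sketch below pulls the original page out of the block of a "response" record. It assumes the block holds a plain HTTP response exactly as captured; the function name split_http_response and the sample data are hypothetical. Real replay software does far more (link rewriting, chunked and compressed bodies, and so on), none of which is shown here.

```python
def split_http_response(block: bytes):
    """Split a captured HTTP response into status line, headers, and body."""
    head, _, body = block.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so normalise them to lower case.
        headers[name.strip().lower()] = value.strip()
    return lines[0], headers, body


# Hypothetical block of a "response" record, stored as it came off the wire.
page = b"<html><p>Hello</p></html>"
sample_block = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"Content-Length: " + str(len(page)).encode("ascii") + b"\r\n"
    b"\r\n" + page
)

status, headers, payload = split_http_response(sample_block)
print(status)
print(headers["content-type"])
print(payload.decode("utf-8"))
```

Because the block is stored in native format, the payload recovered here is byte-for-byte what the web server sent at capture time.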

Can I make my own WARC file?

  • WARC Tools: An open source package for reading and writing WARCs
  • wget-WARC: An open source tool for crawling websites and storing them in WARCs
  • Heritrix: An open source archival crawler for collecting websites and storing them in WARCs
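
For a feel of what these tools do when they write a compressed WARC, here is a minimal standard-library sketch. It is not the API of WARC Tools, wget or Heritrix. It assumes the common convention of compressing each record as its own gzip member, so a single record can be read back from a known byte offset; make_record, append_record and read_record_at are hypothetical names.

```python
import gzip
import uuid
import zlib
from datetime import datetime, timezone


def make_record(uri: str, payload: bytes) -> bytes:
    """Build a minimal 'resource' record holding the payload as-is."""
    headers = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        "WARC-Date: " + datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        f"WARC-Target-URI: {uri}",
        "Content-Type: text/plain",
        f"Content-Length: {len(payload)}",
    ]
    return "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n" + payload + b"\r\n\r\n"


def append_record(archive: bytearray, record: bytes) -> int:
    """Compress one record as its own gzip member; return its start offset."""
    offset = len(archive)
    archive.extend(gzip.compress(record))
    return offset


def read_record_at(archive: bytes, offset: int) -> bytes:
    """Decompress only the gzip member that starts at the given offset."""
    decoder = zlib.decompressobj(wbits=31)  # 31 tells zlib to expect a gzip header
    return decoder.decompress(bytes(archive[offset:]))


archive = bytearray()
first = append_record(archive, make_record("http://example.com/a", b"first page"))
second = append_record(archive, make_record("http://example.com/b", b"second page"))

# Jump straight to the second record without touching the first.
print(read_record_at(archive, second).decode("utf-8"))
```

The bytes in archive could be saved under a name such as example.warc.gz. Recording each offset alongside its URI is, in essence, what an index for an access tool needs.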

What are WARC Tools?

  • WARC Tools is an open source package for reading and writing WARCs
  • The tools are written by Hanzo Archives
  • These tools are used as a library of code for Hanzo and other archiving technologies
  • WARC Tools are designed to showcase the ISO standard and to encourage the use of WARC files
  • WARC Tools and wget-WARC are used by The Archive Team

Are there alternatives?

  • Many archiving companies use proprietary formats for web archives, which can lead to vendor lock-in, in addition to being a dead-end from a preservation perspective
  • Some companies make images of webpages, rather than capturing the native content. Unless your website is entirely static and non-interactive, you’ll lose much of the user experience by choosing this option
  • Some older archiving tools write the original content directly to disk. This loses all the metadata that the WARC format uses to ensure the integrity and authenticity of the archived content

WARCs at-a-glance

  • High definition, legally defensible capture of web content
  • Open standard means no vendor lock-in
  • Open source tools mean you can work directly with the data, should you want to
  • Native format archive means you get the full user experience of a website, long after the original has changed or is gone

For more information…

Contact us for a live demo or with any questions.