Web Archiving Overview
Web archiving is the process of collecting portions of the World Wide Web and ensuring the collection is preserved in an archive. Collection is best carried out using an archival web crawler. The content is stored in a suitable form for preservation, with tools for accessing the content for playback.
The Theory

Web archiving is best carried out using an archival web crawler optimised for link discovery, so that the crawler can find all assets within the scope of the crawl, including dynamic links, Ajax-generated links, and links embedded in Flash and other proprietary formats. All content found is downloaded, including HTML files, CSS, and embedded media such as video and Flash. The content is then stored in a suitable form for preservation, optionally with forensic metadata for authenticity.
The web archive should provide tools for accessing the content, preferably in its native form, for authentic playback.
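As an illustration of link discovery within a crawl scope, the following is a minimal Python sketch using only the standard library. It is an assumption-laden example, not any particular crawler's implementation: the names `LinkCollector`, `discover_links` and `URL_ATTRS` are hypothetical, scope is simplified to a single host name, and only links present in static HTML are found. Links generated by Ajax or embedded in Flash would need a far more capable approach than this.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse

# Attributes that commonly carry URLs of pages or embedded assets
# (an illustrative, incomplete list).
URL_ATTRS = {"href", "src", "poster", "data"}


class LinkCollector(HTMLParser):
    """Collects absolute URLs from URL-bearing attributes of any tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in URL_ATTRS and value:
                # Resolve relative URLs and drop the #fragment part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.add(absolute)


def discover_links(base_url, html, scope_host):
    """Return the sorted in-scope links found in one HTML document."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return sorted(
        link for link in collector.links
        if urlparse(link).hostname == scope_host
    )


if __name__ == "__main__":
    page = (
        '<a href="/about#team">About</a>'
        '<img src="logo.png">'
        '<a href="https://other.example/">Elsewhere</a>'
    )
    for link in discover_links(
        "https://www.example.com/index.html", page, "www.example.com"
    ):
        print(link)
```

A crawler built on this idea would feed each discovered in-scope link back into its queue, so the scope test decides where the crawl stops.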
Hanzo’s Implementation
Hanzo’s archival web crawler is based on a radical design that departs significantly from search-engine-style crawlers. It is optimised for link discovery within the modern web, enterprise web, and social media. Hanzo’s crawler can find all links and assets within the scope of the crawl, including those on highly dynamic websites with Ajax/JavaScript-generated links and links embedded in Flash and other proprietary formats, enabling us to capture dynamic HTML, CSS, embedded media, video, Flash, social media, etc.
All downloaded content is stored in ISO 28500 WARC files, together with environmental data from the HTTP conversation between the crawlers and the web servers, as well as the other environmental data and metadata needed for strong forensic preservation. All downloaded and generated data is processed to provide a range of indexes for search and e-discovery, comprehensive reports, digests, and visualisations.
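To make the WARC storage step more concrete, here is a minimal Python sketch of how one captured HTTP response could be serialised as a WARC `response` record. It is an illustration of the record layout only, not Hanzo’s code: the function name `build_warc_response_record` is hypothetical, the record is left uncompressed, and the use of a Base32-encoded SHA-1 payload digest is an assumption based on common practice among WARC tools.

```python
import base64
import hashlib
import uuid
from datetime import datetime, timezone


def build_warc_response_record(target_uri, http_response_bytes):
    """Serialise one captured HTTP response as a WARC 'response' record."""
    # The payload is the HTTP entity body: everything after the header block.
    _headers, _sep, payload = http_response_bytes.partition(b"\r\n\r\n")
    payload_digest = base64.b32encode(hashlib.sha1(payload).digest()).decode("ascii")
    capture_time = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

    warc_headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {capture_time}",
        f"WARC-Target-URI: {target_uri}",
        # Assumed digest scheme; lets a reader verify the payload later.
        f"WARC-Payload-Digest: sha1:{payload_digest}",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_response_bytes)}",
    ]
    # Header lines end in CRLF; a blank line separates them from the block.
    header_block = "\r\n".join(warc_headers).encode("utf-8") + b"\r\n\r\n"
    # Each record is terminated by two CRLFs after the content block.
    return header_block + http_response_bytes + b"\r\n\r\n"


if __name__ == "__main__":
    response = b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nhello"
    record = build_warc_response_record("https://www.example.com/", response)
    print(record.decode("utf-8"))
```

Storing the full HTTP response, headers included, as the content block is what preserves the server side of the conversation alongside the content itself.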
All of Hanzo’s archive content is preserved in its native format, together with all environmental data and metadata, unchanged since the day it was archived. Access tools enable a user to browse the archived websites in real time, in a standard web browser. The experience is the same as browsing the original website. Additionally, a user may read reports on the content, search the content, and export it.
Hanzo’s archive export capability is comprehensive. It allows the export of a fully native archive that can be installed anywhere and accessed in a standard browser. It is also possible to export in other formats, such as PDFs, images, text, etc. Related to this, Hanzo has integrated with enterprise archives and discovery solutions, enabling those solutions to treat website content like any other electronic information.
With Hanzo Enterprise web archiving solutions, businesses can:
- Archive any type of web material to a clearly defined and managed archive policy
- Capture the whole of their web presence in a time-structured archive
- Retain the archive according to their records management policies
- Preserve archive content, metadata and other records in an authentic, admissible form
- Organise, search, review and export to litigation support teams and opposing parties
Finally, Hanzo provides web archiving services as SaaS or as an appliance. Our SaaS model is complete with professional services, including quality assurance, crawl engineering and technical support, enabling customers to achieve the high-quality archive content that is critical for meeting legal requirements.
Find Out More
For more information on web archiving, see the following: