This post examines the anatomy of websites and web pages as it relates to selecting a web archiving solution.
Archiving Websites is a Business Necessity
Online marketing and web-based communications are a highly optimized and competitive business.
For many companies across many industry segments, the Web is the most important means of communication with clients, partners, and investors. As a result, websites attract considerable budget and creativity, yielding sophisticated, media-rich content: Flash, video, dashboards, data visualizations, Ajax interactions, and so on.
These sophisticated, media-rich websites and social media still require traditional records management for compliance, legal, and business reasons.
Archiving such records is still a key component of any corporate Records Management program.
Therefore, when considering how to archive this content, it is important to consider the native format of web content itself, and how it will be used for compliance and records management. Archiving such content comprehensively requires a native-format web archive, to ensure the record is forensically sound and complete. Native format matters most when the original website is complex, interactive, social, and rich in media. Such an archive is purpose-built for the web, not a compromise.
Fortunately, Hanzo takes this native-format approach and archives in a form consistent with the nature of the web itself.
So, what is the nature of the web?
Anatomy of Websites
What is a Website?
From a user's perspective, a website is a collection of linked pages, and each page can be interacted with directly in the browser. From a regulator's perspective, a website is electronically stored information that has context, links, and data. Both are correct.
The graph below shows a typical crawl, where each node is a page or asset within the site and each line is a link to another page or asset. The diagram shows the structure and complexity of the pages within the overall site. Furthermore, any page or asset may be hosted on a separate physical server.
As you can see, modern websites are increasingly complex in scope, interaction, media types, sophistication of presentation, and physical structure: all of which must be taken into account when considering how a site is to be archived.
What is a Web Page?
We also need to understand a little about web pages, which are not always what they seem.
Data may exist in HTML pages, Flash content, media files, or responses to form posts that query one or many databases.
The diagram below shows what may be behind a simple web page.
In the diagram above, the browser shows text and a photograph: essentially a simple blog post. Behind the scenes, this simple page may be hugely complex, requiring the functionality of many systems:
- Image server
- Comment system
- User account system
- Web application / template to assemble it all
Different elements of the page may belong to different departments within the organisation, and possibly even to different organisations. Bringing all of these elements together is critical for an archive, and failing to do so properly may be expensive and subject to interpretation.
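As a rough illustration of how many separate systems can sit behind one "simple" page, here is a minimal sketch that parses a page's markup and lists the distinct hosts its elements come from. The markup and every hostname in it are hypothetical, standing in for the image server, comment system, and account system described above:

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class AssetCollector(HTMLParser):
    """Collect every URL a page references via src or href attributes."""
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("src", "href") and value:
                self.urls.append(value)

# Hypothetical markup for the "simple blog post" described above;
# all hostnames here are illustrative assumptions.
page = """
<html>
  <head><link rel="stylesheet" href="https://cdn.example.com/site.css"></head>
  <body>
    <img src="https://images.example.com/photo.jpg">
    <script src="https://comments.example.net/embed.js"></script>
    <a href="https://accounts.example.com/login">Log in</a>
  </body>
</html>
"""

collector = AssetCollector()
collector.feed(page)
hosts = sorted({urlparse(u).netloc for u in collector.urls})
print(hosts)  # four distinct hosts behind one "simple" page
```

Even this toy page depends on four separate hosts; an archive that captures only the HTML, and none of the systems behind it, is incomplete.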
One of my favourite examples comes from a client: the investment arm of a giant insurance company, whose broker/dealer website is powered by SharePoint. Their web pages are so sophisticated that many pages have multiple valid views. In other words, a page displays differently depending on where you navigated from, where your mouse is on the page, and so on. There are pages with a single URL whose content differs significantly depending on Ajax calls back into SharePoint.
It's not possible, or at least not affordable, to simply back such a site up and restore it when needed. As suggested in the diagram above, how can you back up files you don't have access to? And even if you could, how would you reassemble them for browsing? With corporate websites more sophisticated than ever, this is clearly a complex, time-consuming, error-prone endeavour.
The client example described above highlights how websites and web pages have evolved to be more like applications than documents. This trend is a significant driver for Hanzo web archiving services, particularly in the financial services, pharmaceutical, medical, and government sectors.
The only way to archive such websites comprehensively is to gather all the content into one place – a native-format web archive.
This requires sophisticated crawler technology to automatically discover website content and links. The crawler must be capable of parsing and interacting with the considerable range of content types typical of websites today, in order to discover hidden or hard-to-find links: non-HTTP video, Flash content, form POSTs, Ajax-generated links, and so on.
With modern websites, such links, content, and structures are the rule rather than the exception.
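The core discovery loop can be sketched as a breadth-first crawl: fetch a page, parse out its links, and queue anything new. The sketch below uses a hypothetical in-memory site in place of real HTTP fetches; a production crawler such as Hanzo's must also handle Flash, Ajax, and form POSTs, which this simple HTML-only sketch does not:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Extract absolute link targets (href/src) from one page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(urljoin(self.base_url, value))

# Hypothetical in-memory site standing in for real HTTP fetches.
SITE = {
    "http://example.com/": '<a href="/about">About</a><img src="/logo.png">',
    "http://example.com/about": '<a href="/">Home</a><a href="/team">Team</a>',
    "http://example.com/team": '<a href="/about">About</a>',
    "http://example.com/logo.png": "",  # an asset with no outgoing links
}

def crawl(start_url):
    """Breadth-first discovery: fetch a page, parse its links, repeat."""
    archived, queue = set(), deque([start_url])
    while queue:
        url = queue.popleft()
        if url in archived or url not in SITE:
            continue
        archived.add(url)
        parser = LinkParser(url)
        parser.feed(SITE[url])
        queue.extend(parser.links)
    return archived

archive = crawl("http://example.com/")
print(len(archive))  # 4 pages and assets discovered from one start URL
```

Starting from a single URL, the crawl discovers every linked page and asset; the archive is complete only when this discovery handles every content type the site actually uses.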
The nature of the web itself requires a native-format web archive. This is the key reason why some of the largest corporations in the world have selected Hanzo Archives to provide this new vital component of their corporate Records Management.
You can obtain more information on native-format web archiving in our Web Archiving white paper.
(Thanks to Marc Spaniol at Max Planck Internet Institute for the original graph)