How Websites Differ From Other Electronic Files

Regarding my previous post, covering Judge Hedges’ decision:

[From Hanzo Archives - Finding “No Reason to Treat Websites Differently than Other Electronic Files,” Court Grants Adverse Inference for Failure to Preserve Website : Electronic Discovery Law]

I think it is important to recognise that while Judge Hedges is absolutely right from a legal perspective, it doesn’t automatically follow from a technical perspective. This needs some explaining.

Websites are compound, complex, interconnected and hyperlinked collections of compound, complex, interconnected and hyperlinked documents. Lots of moving parts (and syllables).

This is quite different to other ESI. An electronic file, such as a Word or Excel document, corresponds to a single document; an email is an envelope containing a single message, with metadata and attachments. A web-based document, on the other hand, more often than not consists of many files: an HTML page, JavaScript code, style sheet(s), images, embedded media (possibly streaming), and links to different parts of the HTML file, other HTML files or documents, or other websites. Websites are not the same!
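To make the "many files" point concrete, here is a minimal sketch that lists the separate resources a single page refers to. It uses only the Python standard library; the class name `ComponentCollector`, the function `list_components` and the example URLs are hypothetical, chosen for illustration, and it only looks at a handful of common tags rather than everything a real crawler would need to follow.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ComponentCollector(HTMLParser):
    """Collects the URLs of the separate files an HTML page refers to."""

    # (tag, attribute) pairs that point at another resource.
    # Illustrative subset only; real pages reference resources in many more ways.
    RESOURCE_ATTRS = {
        ("script", "src"): "script",
        ("img", "src"): "image",
        ("video", "src"): "media",
        ("audio", "src"): "media",
        ("source", "src"): "media",
        ("a", "href"): "link",
    }

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.components = []  # list of (kind, absolute_url)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # <link rel="stylesheet" href="..."> needs a check on rel as well.
        if tag == "link":
            rel = (attrs.get("rel") or "").lower().split()
            if "stylesheet" in rel and attrs.get("href"):
                url = urljoin(self.base_url, attrs["href"])
                self.components.append(("stylesheet", url))
            return
        for (name, attr), kind in self.RESOURCE_ATTRS.items():
            if tag == name and attrs.get(attr):
                url = urljoin(self.base_url, attrs[attr])
                self.components.append((kind, url))


def list_components(html, base_url):
    """Return (kind, absolute_url) pairs for the resources a page refers to."""
    collector = ComponentCollector(base_url)
    collector.feed(html)
    return collector.components


if __name__ == "__main__":
    page = (
        '<html><head><link rel="stylesheet" href="/site.css">'
        '<script src="app.js"></script></head>'
        '<body><img src="logo.png"><a href="#top">Top</a></body></html>'
    )
    for kind, url in list_components(page, "http://www.example.com/index.html"):
        print(kind, url)
```

Even this toy page resolves to four further URLs beyond the HTML file itself, which is the difference from a single Word or Excel file.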

At the human level, courts view web-based documents the same way as your average human reader does, i.e. as the compound document described above, not as its individual component parts. To preserve such documents in their native form, it is necessary to collect all the component parts correctly and store each of them unchanged.

As such, file-oriented methods for preservation are clearly inappropriate for websites.
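One way to show that each component part has been stored unchanged is to record a cryptographic digest at capture time and recompute it later. The sketch below assumes SHA-256 and a plain dictionary as the record; the function names and the record layout are hypothetical and are not a description of any particular archive format or product.

```python
import hashlib
from datetime import datetime, timezone


def make_record(url, payload, captured_at=None):
    """Build a capture record for one component, keeping the bytes untouched.

    `payload` must be the raw bytes exactly as received from the server.
    """
    captured_at = captured_at or datetime.now(timezone.utc)
    return {
        "url": url,
        "captured_at": captured_at.isoformat(),
        "length": len(payload),
        "sha256": hashlib.sha256(payload).hexdigest(),
        "payload": payload,
    }


def is_unchanged(record):
    """Recompute the digest and compare it with the one stored at capture."""
    payload = record["payload"]
    return (
        len(payload) == record["length"]
        and hashlib.sha256(payload).hexdigest() == record["sha256"]
    )


if __name__ == "__main__":
    record = make_record("http://www.example.com/site.css", b"body { margin: 0 }")
    print(is_unchanged(record))   # the stored bytes still match their digest

    record["payload"] = b"body { margin: 1px }"  # simulate later alteration
    print(is_unchanged(record))   # the mismatch is detected
```

A compound web document would need one such record per component, which is why treating the page as a single file is not enough.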

Finding “No Reason to Treat Websites Differently than Other Electronic Files,” Court Grants Adverse Inference for Failure to Preserve Website : Electronic Discovery Law

Re: Arteria Prop. Pty Ltd. v. Universal Funding V.T.O., Inc., 2008 WL 4513696 (D.N.J. Oct. 1, 2008):

[From Finding "No Reason to Treat Websites Differently than Other Electronic Files," Court Grants Adverse Inference for Failure to Preserve Website : Electronic Discovery Law]

This is a great decision for Hanzo customers. Here are the key issues raised:

  1. You are responsible for your website, no matter who maintains or hosts it
  2. If you reasonably anticipate litigation, you are required to preserve your website - “litigation hold”

This decision clearly underlines our product strategy.

We’ve designed our web archiving tool for exactly this scenario. As the responsible owner of your websites, you should have records and information management policies in place to systematically archive them in a “web archive”. You can’t rely on the developers or agencies involved in a site’s development or hosting.

Hanzo archives any number of websites, from multiple URLs, CMS, databases and technologies, according to an agreed archive policy, and stores them in a secure, authenticating web archive. The web archive is an independent store for all your website content, enabling you to retain it according to your information management policies. This requires no additional effort by or consent from your developers, website designers, marketing agencies or hosting partners.

Secondly, the web archive provides a litigation hold for any or all of its content. Moreover, it is fully browsable, searchable and exportable, enabling discovery of your web resources in a fraction of the time taken using traditional preservation methods.

A more complete description of the case and the decision is on the Electronic Discovery Law blog.

Producer required to re-produce .TIFF documents in a “reasonably searchable” format : Electronic Discovery Blog

Producing web resources in native format can be burdensome. But we’ve changed that dramatically with our web archiving products, such as HE. Don’t expect to get away with this anymore…

[From Producer required to re-produce .TIFF documents in a “reasonably searchable” format : Electronic Discovery Blog]

HE uses client-side archiving technology, including web crawlers, APIs and plug-ins, that enables preservation of web resources in their native format: exactly the same format presented to browsers.

These resources are stored in archive files together with metadata verifying their authenticity. The archive files are ingested into a web archive and indexed. These can then be browsed the same way as the original website, along any captured timeline, and searched across full text, metadata and time.

More information on this is in our white paper “E-discovery: Why Archiving Your Web Presence is a Business Necessity”.

If a Web page, blog, thread in a customer forum, or your whole website were required by the courts, how would you be able to obtain the exact version required and present it as it was originally? How would you verify its authenticity? As regulations concerning corporate records and e-discovery proceedings are extended to include Web content, can you be certain you are compliant?

This white paper looks at compliance and e-discovery issues relating to Web content, and assesses the technologies you need to archive your Total Web Presence.
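Obtaining "the exact version required" comes down to asking which capture was current at a given moment. As a rough illustration, and not a description of how any product implements it, the sketch below keeps a sorted list of captures per URL and uses a binary search to find the latest capture at or before a requested time. `CaptureTimeline`, `archive_ref` and the example values are hypothetical.

```python
from bisect import bisect_right
from datetime import datetime


class CaptureTimeline:
    """Maps each URL to its captures, ordered by capture time."""

    def __init__(self):
        self._captures = {}  # url -> sorted list of (captured_at, archive_ref)

    def add(self, url, captured_at, archive_ref):
        """Register a capture; archive_ref is an opaque pointer into storage."""
        entries = self._captures.setdefault(url, [])
        entries.append((captured_at, archive_ref))
        entries.sort()

    def as_of(self, url, moment):
        """Return the latest (captured_at, archive_ref) at or before `moment`.

        Returns None if the URL had not been captured by then.
        """
        entries = self._captures.get(url, [])
        times = [captured_at for captured_at, _ in entries]
        index = bisect_right(times, moment)
        if index == 0:
            return None
        return entries[index - 1]


if __name__ == "__main__":
    timeline = CaptureTimeline()
    url = "http://www.example.com/terms.html"
    timeline.add(url, datetime(2008, 1, 10), "capture-a")
    timeline.add(url, datetime(2008, 6, 2), "capture-b")
    timeline.add(url, datetime(2008, 9, 15), "capture-c")

    # Which version of the page was live on 1 July 2008?
    print(timeline.as_of(url, datetime(2008, 7, 1)))
```

The answer is only as good as the capture frequency: a change made and reverted between two captures would not appear in the timeline.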

A new classical music model emerges >> Great Leaps Forward

A new classical music model emerges >> Great Leaps Forward: To help you and everyone European with their music collection, try this archive of hundreds of public domain recordings of classical music - many thanks to the splendid and innovative European Archive. All available for free - in Europe anyway (where our IP laws are ever-so-slightly better than elsewhere) - this really is a classical music model worth investigating.

Amazon Web Services Success Story : Hanzo

Amazon published a case study on Hanzo today, in which they describe our use of EC2, SQS and S3 to host Hanzoweb and our web archive, see here for more information:

Hanzo turned to Amazon Web Services as an all inclusive, web-scale solution for hosting, data processing and storage. They are currently using Amazon EC2 to host their website Hanzoweb.com and process their busy web crawlers and Amazon Simple Storage Service (S3) to store the endless web data they collect

Hanzo at Web Archiving Training Session in Paris 2nd March 2007

Web Archiving Training Session: Hanzo are presenting two of our web archiving systems at the European Archive’s Web Archiving Training Session on 2nd March.

  1. Hanzoweb - our social web archiving system for small businesses and individuals - designed for people who need to archive web resources using the simplest, minimally intrusive, user interface.
  2. Hanzo Enterprise - our corporate web archiving system - a sophisticated system designed for archiving professionals who need to archive intranet resources and public web sites. Aimed at organisations with ambitious archiving requirements and for meeting compliance obligations for their archives.

Internet access permitting, we will demo key features of both systems. See you in Paris!

The Nations’ Memorybank

The Nations’ Memorybank:

We’re looking forward to this…

The Nations’ Memorybank is “the on-line archive for your personal heritage…. collect and store all of your personal and family heritage in one place – and your relatives around the world will be able to join in.”

The persistent blogosphere >> Jon Udell

The persistent blogosphere >> Jon Udell archive, and this… A conversation with Tony Hammond about digital object identifiers

Interesting posts; the sentiment is absolutely spot on, but the discussion was focused on Handles, DOI and other identifiers, not persistence at all (although DSpace is mentioned, it’s hardly a beacon for archives). I had this and this to say.

Delicious archiving

If you have a del.icio.us account, you can use Hanzo’s feed archiving tool to capture your bookmarks and archive them. Ditto, of course, for a wide range of other bookmarking tools, provided they generate a feed. This means you can use del.icio.us and many of its cool tools and plug-ins as your Hanzoweb collection tool. Assuming you have an account on del.icio.us already, here are a few examples.


Archiving Feeds

Hanzoweb now has a feed archiving crawler! This means you can archive RSS and Atom feeds - used to distribute content and update notifications around the web - such as blogs, photo streams, news, content updates and Hanzoweb itself. Some good examples:

  1. Your family’s Flickr accounts
  2. Your del.icio.us feed
  3. Your Technorati keyword alerts
  4. Your kids’ first blogs
  5. Your Google News alerts (beats paying a cuttings service)

Quick how to…

To archive a feed in Hanzoweb, use the familiar Collect This bookmarklet to grab the URL, tag it and describe it, then click the Advanced tab, select the feed URL you want to archive, pick your preferred scope and click Collect. This creates an archive Agent, which monitors the feed every hour or so and archives all items in the feed as they appear.
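For the curious, the monitoring idea can be sketched in a few lines: on each poll, parse the feed, skip the items already seen, and hand only the new ones on for archiving. This is an illustration under stated assumptions, not Hanzoweb's actual Agent: the `FeedAgent` class is hypothetical, it handles plain RSS 2.0 only (Atom uses XML namespaces and would need extra handling), and it takes the feed XML as a string rather than fetching it on a schedule.

```python
import xml.etree.ElementTree as ET


class FeedAgent:
    """Remembers which feed items it has seen and reports only new ones."""

    def __init__(self):
        self.seen = set()     # item identifiers already archived
        self.archived = []    # every item handed on for archiving, in order

    def poll(self, feed_xml):
        """Parse one snapshot of an RSS 2.0 feed and return the new items."""
        root = ET.fromstring(feed_xml)
        new_items = []
        for item in root.iter("item"):
            link = (item.findtext("link") or "").strip()
            # Prefer the item's guid as its identity; fall back to its link.
            key = (item.findtext("guid") or link).strip()
            if not key or key in self.seen:
                continue
            self.seen.add(key)
            title = (item.findtext("title") or "").strip()
            new_items.append({"id": key, "link": link, "title": title})
        self.archived.extend(new_items)
        return new_items


if __name__ == "__main__":
    first = """<rss version="2.0"><channel><title>Example</title>
      <item><title>One</title><link>http://www.example.com/1</link></item>
    </channel></rss>"""
    later = """<rss version="2.0"><channel><title>Example</title>
      <item><title>Two</title><link>http://www.example.com/2</link></item>
      <item><title>One</title><link>http://www.example.com/1</link></item>
    </channel></rss>"""

    agent = FeedAgent()
    print([i["title"] for i in agent.poll(first)])  # first poll: everything is new
    print([i["title"] for i in agent.poll(later)])  # later poll: only the new item
```

A real agent would call `poll` on a timer (hourly, say) and would also fetch and store the page each item links to, not just the feed entry.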
