Because Hanzo is a service provider for eDiscovery needs, I find legal blogs an invaluable source of industry content. Take this recent web article, on the Paralegal Knowledge Institute website, which caught my attention: “How to Defensibly Collect Web Pages and Social Media Posts When Risk of Spoilation is High or it is Infeasible to Collect from the Web Host,” authored by Paul Easton, Esq., Managing Director, Global Colleague LLC. and Tom Klaff, CEO, Surety LLC.
As a web archivist, I’m always interested in discovering what legal counsel considers to be the best way to present web-based evidence in court. The main reason is to understand how Hanzo can continue to serve and educate our growing eDiscovery client base. So, naturally, I read this article with considerable interest.
Easton and Klaff discuss how web-based evidence used to have no bearing in court. How times have changed. Today, web-based evidence is becoming a necessity in a growing number of cases.
The thing is, dynamic website content and complex social media conversations are difficult to capture. The article mentions PDFs and screen-video captures as a possible route to follow. I would suggest that either of these is, at best, a second-hand representation of the content.
At Hanzo Archives, we have a better solution: native format archiving. We capture all of the data the web servers send out, and we play that back exactly as the user would have seen it at that moment in time. In addition, we capture a host of supporting metadata to ensure that the authenticity of the capture is beyond question.
Therefore, I would like to suggest that while Easton and Klaff have fully understood the question and the problems, what they are lacking is a great answer: an answer that Hanzo Archives has already provided.
Easton and Klaff identify authenticity as the key issue. Authenticity is how we tell that the material really did come from where we said it did when we said it did. Our technology doesn’t just capture a picture of the content and add metadata to it – we capture the actual content itself. We make and record the same requests to the web server that your browser might on your behalf when you are viewing it, and we capture each response along with all the headers and metadata from the servers. What’s more, we digitally hash each one independently. We also extract a set of metadata, the viewable text, extracted links, etc., all of which are also hashed and stored in the same way at the same time.
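To make the idea of hashing each piece independently more concrete, here is a minimal Python sketch. It is an illustration only, not our production code: the record shape (`CapturedResponse`), the function names and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass
class CapturedResponse:
    """One server response as captured (hypothetical record shape)."""
    url: str
    status: int
    headers: dict
    body: bytes


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string (algorithm is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def canonical_bytes(value) -> bytes:
    """Serialize a JSON-compatible value deterministically before hashing."""
    return json.dumps(value, sort_keys=True, separators=(",", ":")).encode("utf-8")


def hash_response(resp: CapturedResponse) -> dict:
    """Hash the headers and the body separately so each can be checked on its own."""
    return {
        "url": resp.url,
        "status": resp.status,
        "headers_sha256": sha256_hex(canonical_bytes(resp.headers)),
        "body_sha256": sha256_hex(resp.body),
    }


def hash_capture(responses, viewable_text: str, links: list) -> dict:
    """Hash every response plus the extracted text and links at capture time."""
    return {
        "responses": [hash_response(r) for r in responses],
        "text_sha256": sha256_hex(viewable_text.encode("utf-8")),
        "links_sha256": sha256_hex(canonical_bytes(sorted(links))),
    }
```

Calling `hash_capture` with the list of responses that make up a page, its viewable text and its extracted links yields one digest per record, which is the raw material for the cross-checking described next.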
Last but not least, we also make both a PDF and a PNG representation of the page as it appeared at capture time, and these are also hashed and stored along with all the other data. Because we can play back the native format, it’s easy to check whether the picture we make matches the native format content replayed from our access software, or vice versa.
This “mesh” of hashes (each web page you see on the web is made up of possibly hundreds of separate responses) adds up to a picture that is very hard to argue with. When you understand that we archive this straight into WORM-equivalent storage at capture time, you can be sure that what we have captured, and what we reproduce, is exactly what was there.
We also make and store repeated requests to, and responses from, NTP servers. This ensures that our timestamps, and the timestamps from the WORM storage, are supported by further independent verification directly from third-party servers at the time of capture. These exchanges are stored alongside all the other records we have captured.
To tamper with or corrupt the data to change what it says, you would need to generate several hash collisions for multiple independent records generated from the content, including all of the native responses that make up the content, the metadata and the visual representation in the PDF (and likely for other pages in the archive too, since web servers rarely serve each page in isolation). The computing cost of this, not to mention the financial cost, is well beyond the reach of current computing capabilities. And all of this ignores the fact that the WORM-equivalent storage would also need to be tampered with to hide the modification.
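As a rough illustration of what recording an NTP exchange could look like, the Python sketch below builds a minimal client request, reads the server's transmit timestamp out of a response packet, and hashes both packets for storage. The function names and the record layout are assumptions for this example, and the network step (sending the request over UDP to port 123 of an NTP server) is deliberately left out.

```python
import hashlib
import struct
from datetime import datetime, timezone

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_UNIX_OFFSET = 2_208_988_800


def build_ntp_request() -> bytes:
    """Build a minimal 48-byte NTP client request (LI=0, version 3, mode 3)."""
    return b"\x1b" + 47 * b"\x00"


def parse_transmit_time(packet: bytes) -> datetime:
    """Read the server's transmit timestamp (bytes 40-47) as a UTC datetime."""
    if len(packet) < 48:
        raise ValueError("NTP packet must be at least 48 bytes")
    seconds, fraction = struct.unpack("!II", packet[40:48])
    unix_time = seconds - NTP_UNIX_OFFSET + fraction / 2**32
    return datetime.fromtimestamp(unix_time, tz=timezone.utc)


def record_ntp_exchange(server: str, request: bytes, response: bytes) -> dict:
    """Build a storable record of one exchange (hypothetical layout)."""
    return {
        "server": server,
        "server_time_utc": parse_transmit_time(response).isoformat(),
        "request_sha256": hashlib.sha256(request).hexdigest(),
        "response_sha256": hashlib.sha256(response).hexdigest(),
    }
```

Keeping the raw request and response next to this record means the timestamp can later be re-derived from the packets themselves rather than taken on trust.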
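The verification side of this can be sketched in a few lines of Python. Again, this is a simplified illustration under stated assumptions rather than our actual tooling: the record names, the helper functions and the use of SHA-256 are hypothetical.

```python
import hashlib


def build_manifest(records: dict) -> dict:
    """Map each record name to the SHA-256 hex digest of its bytes."""
    return {name: hashlib.sha256(data).hexdigest() for name, data in records.items()}


def find_mismatches(records: dict, manifest: dict) -> list:
    """Return the names of records that are missing, added, or no longer match."""
    current = build_manifest(records)
    names = set(current) | set(manifest)
    return sorted(name for name in names if current.get(name) != manifest.get(name))


# Hypothetical capture: native responses, extracted metadata and a rendered PDF.
records = {
    "response:/index.html": b"<html>original page</html>",
    "response:/style.css": b"body { color: black; }",
    "metadata:text": b"original page",
    "render:page.pdf": b"%PDF-1.4 placeholder bytes",
}
manifest = build_manifest(records)  # written once, at capture time

# Altering a single record is enough to break verification.
tampered = dict(records)
tampered["response:/index.html"] = b"<html>altered page</html>"
changed = find_mismatches(tampered, manifest)
```

Here `changed` names the altered record. To go unnoticed, someone would have to alter the page, its metadata and its rendering consistently and still match every stored digest, which is the multiple-collision problem described above.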
If you look at Easton and Klaff’s proposed solution, you can see they have arrived in part at a lot of this logic: the hashing and timestamping, and capturing the ‘conversation’ packets with the server. What the Hanzo Archives solution does is bring all of this and more together into a simple, easy-to-use service that is demonstrably and unarguably authentic.
The question I’d like to pose is: Would you rather bet on the methods Easton and Klaff recommend, which may work in presenting your web-based evidence but require interpretation and debate, or check into Hanzo web and social media archiving? The web archives we provide to our clients capture all elements of dynamic websites and social media conversations along with related metadata (meaning your archived material is immune to browser and software obsolescence), and include hashes and proof of authenticity. Hanzo web archives fully integrate into Symantec Enterprise Vault™ and other eDiscovery tools and deliver archived data as a full, contextual experience to the end user.
In conclusion, my recommendation is that you investigate all of your options. Attend our webinar and weigh them for yourself. After all, when it comes to the volatility of the web, it’s your data to lose.