| Spotlight |
|---|
|
| Open Source Web Archiving Tools |
|
Hanzo Open Source Projects

Hanzo WARC Tools

The main goal of WARC Tools is to promote adoption of the WARC file format for storing web archives among the mainstream web development community. The project provides an open source software library, a set of command line tools, web server plug-ins and technical documentation for manipulating and managing web archive files (WARC files). WARC files are produced by web archiving crawlers such as Heritrix, the open-source, extensible, web-scale, archival-quality web crawler developed by the Internet Archive with the Nordic national libraries, and by Hanzo's own commercial crawlers.

The project is led by Hanzo Archives, in collaboration with the Internet Archive Web team, and supported by the International Internet Preservation Consortium (IIPC).

WARC Tools are to be implemented as a set of core libraries, with the functionality made available to end users as command line tools, extensions to existing tools, and simple web applications for accessing WARC content. In addition, all the libraries will have APIs and dynamic language bindings and will be made available as software libraries for developers. The library and tools will be scriptable (command lines in shell scripts, dynamic language bindings to the library) and programmable (dynamic language bindings, Java packages, and the C library itself).

Migration and interoperability with legacy tools are important, and this project will implement functionality for these tools and the artefacts they create, in order to enable rapid progression and adoption of WARC by the mainstream. To this end, integration with or links to HTTrack, curl and wget, for example, will all be part of Hanzo WARC Tools. These will be released as open source code, together with installers, documentation and man pages.

The library and tools will be implemented in ANSI C and will be highly portable, building and installing on various Linux and Unix distributions as well as Windows. They will ship with Unix man pages, build and installation guides, developer guides, etc.

Hanzo Search Tools

The main goal of Search Tools is to promote adoption of the WARC file format for storing web archives among the mainstream web development community. The project provides an open source software library, a set of command line tools, web server plug-ins and technical documentation for full-text and metadata search of web archive files (WARC files). The project is led by Hanzo Archives.

Hanzo s3 Tools

A set of *nix-like command line tools for working with S3. They are backed by a simple Python library that can be used to work with the material you have uploaded to S3.

Installation

Note that the tools require Python 2.4. Check out the tools from Subversion, then run the installer and change permissions on the tools. As root:

    svn checkout http://hanzo-s3-tools.googlecode.com/svn/trunk/ hanzo-s3-tools

Setting your keys

The tools can pick up your S3 keys from your environment, or you can specify them explicitly each time. If you are using a secure account (i.e. your own, which no one else shares), the environment variables are a great convenience. If you specify the keys explicitly, don't forget that the command might be recorded in your shell history. Set these variables:
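
The variable names are not listed here. As an illustration only, the following Python 3 sketch assumes the conventional AWS names `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`, which may differ from what the Hanzo tools actually read. The helper `load_s3_keys` is hypothetical; it shows the general pattern of preferring explicitly supplied keys and falling back to the environment.

```python
import os

# Assumed names: the conventional AWS variables, not confirmed for these tools.
ACCESS_KEY_VAR = "AWS_ACCESS_KEY_ID"
SECRET_KEY_VAR = "AWS_SECRET_ACCESS_KEY"


def load_s3_keys(access_key=None, secret_key=None, environ=None):
    """Return (access_key, secret_key), preferring explicit values
    over the environment."""
    env = os.environ if environ is None else environ
    access_key = access_key or env.get(ACCESS_KEY_VAR)
    secret_key = secret_key or env.get(SECRET_KEY_VAR)
    if not access_key or not secret_key:
        raise ValueError(
            "S3 keys not found: pass them explicitly or set %s and %s"
            % (ACCESS_KEY_VAR, SECRET_KEY_VAR)
        )
    return access_key, secret_key
```

Reading the keys from the environment keeps them out of the command line, and therefore out of shell history.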

Getting Started

The first thing to do, if you haven't already, is to get an S3 account. Each command comes with built-in help using the '-h' option:
If you have not used S3 before, you need to make yourself a bucket. Bucket names share a single global namespace, so you will need to think of a unique name for your bucket:
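
Because the namespace is shared by all S3 users, one common approach is to add a random suffix to a short prefix of your own. The Python 3 sketch below is not part of the Hanzo tools. The helpers `is_valid_bucket_name` and `make_bucket_name` are hypothetical, and the validity check assumes the DNS-style naming rules (3 to 63 characters; lowercase letters, digits, dots and hyphens; not shaped like an IP address).

```python
import re
import uuid

# DNS-style rules (an assumption of this sketch): 3-63 characters, lowercase
# letters, digits, dots and hyphens, starting and ending with a letter or digit.
_BUCKET_RE = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")
_IP_RE = re.compile(r"\d{1,3}(\.\d{1,3}){3}")


def is_valid_bucket_name(name):
    """Return True if name satisfies the DNS-style bucket naming rules."""
    if not _BUCKET_RE.fullmatch(name):
        return False
    if ".." in name:
        return False
    return not _IP_RE.fullmatch(name)


def make_bucket_name(prefix):
    """Build a candidate bucket name from a prefix plus a random suffix."""
    suffix = uuid.uuid4().hex[:12]
    name = "%s-%s" % (prefix.lower(), suffix)
    if not is_valid_bucket_name(name):
        raise ValueError("cannot build a valid bucket name from %r" % prefix)
    return name
```

A random suffix makes a collision unlikely rather than impossible; the service still has the final say when the bucket is created.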
You should now be able to see your bucket:
To see what is in the bucket, type:
Of course, there will be nothing in there at present. So upload a file:
If you run s3ls again on your bucket, you will see that the bucket now contains your file. If you want to get your file, you can use s3get:
This will stream the contents of your file to the terminal, just like the *nix 'cat' command. If you want to save your file you can type:
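
Since the contents arrive as a stream, saving them amounts to sending that stream to a file instead of the terminal. The Python 3 sketch below illustrates chunked streaming in general, not the Hanzo library's actual code. The function `stream_copy` and the chunk size are assumptions of this sketch, and in-memory buffers stand in for the remote object and the local file.

```python
import io

CHUNK_SIZE = 64 * 1024  # arbitrary chunk size chosen for this sketch


def stream_copy(source, dest, chunk_size=CHUNK_SIZE):
    """Copy source to dest one chunk at a time, so memory use stays
    bounded by chunk_size. Returns the number of bytes copied."""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # an empty read means end of stream
            break
        dest.write(chunk)
        total += len(chunk)
    return total


# Demo: in-memory buffers stand in for the remote object and the local file.
source = io.BytesIO(b"x" * 200000)
dest = io.BytesIO()
copied = stream_copy(source, dest)
```

Pointing `dest` at `sys.stdout.buffer` gives the 'cat'-like behaviour, while a file opened with `open(path, "wb")` saves the data to disk.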
You can also just take a peek at the file without downloading it, using:
to show you the headers set on the content. Look at the help for each command for more options.

Notes

The s3put and s3get commands stream your files, so you can upload and download files right up to the S3 limit (currently 2 GB). s3ls allows you to specify files to look for in glob format:
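
How s3ls implements its matching is not described here. As a rough guide to what glob patterns select, the Python 3 sketch below uses the standard library's `fnmatch` module on a made-up list of key names; `filter_keys` and the keys themselves are hypothetical.

```python
import fnmatch


def filter_keys(keys, pattern):
    """Return the keys that match a shell-style glob pattern (case-sensitive)."""
    return [key for key in keys if fnmatch.fnmatchcase(key, pattern)]


# Hypothetical key names, for illustration only.
keys = [
    "crawl-2008-01.warc.gz",
    "crawl-2008-02.warc.gz",
    "notes.txt",
    "logs/crawl.log",
]

warc_files = filter_keys(keys, "*.warc.gz")            # '*' matches any run of characters
first_months = filter_keys(keys, "crawl-2008-0?.warc.gz")  # '?' matches one character
```

Note that `fnmatch` lets `*` match across `/` as well, so `*.log` in this sketch would also select `logs/crawl.log`; the real tool may treat path separators differently.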
More features, including support for prefixes and metadata, are coming soon. |