This announcement about the release of a bundle of file format and metadata-related tools might be of interest to the list.

-----Original Message-----
From: [log in to unmask] [mailto:[log in to unmask]] On Behalf Of Andrea Goethals
Sent: 06 August 2009 19:40
To: [log in to unmask]; [log in to unmask]; [log in to unmask]
Cc: Spencer McEwen; Vitaly Zakuta
Subject: [DIGLIB] New tool available: File Information Tool Set (FITS)

File Information Tool Set (FITS): http://fits.googlecode.com

With the increase in web archiving and other born-digital projects that introduce new formats and genres to our digital preservation repositories, it is becoming more important that our tools support a wide range of file formats. In particular, our file format identification, validation and metadata extraction tools should work with a broad range of formats and genres. There are a number of these file tools in existence, but none of these tools individually can both support a wide range of formats and extract the technical metadata necessary to fully characterize digital content.

In the fall of 2008 Harvard University Library began development on the File Information Tool Set (FITS) in response to this need. FITS acts as a wrapper around multiple open source file format identification, validation and metadata extraction tools. FITS invokes and manages the output of these tools. The native output from these tools is converted into a common format, "FITS XML", compared to one another and consolidated into a single XML output file.

The tools currently wrapped by FITS are:

* JHOVE
* Exiftool from Phil Harvey
* National Library of New Zealand Metadata Extractor
* DROID from the UK National Archives
* Ffident from Marco Schmidt
* File Utility

In addition, FITS includes two original tools: FileInfo and XmlMetadata.

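To illustrate the convert, compare and consolidate step that FITS performs on the wrapped tools' output, here is a minimal sketch. It is written in Python for brevity and is not FITS code: FITS itself is Java and works on XML. The tool names (`tool_a` and so on), the `ALIASES` table and the `normalize` and `consolidate` functions are all hypothetical, invented for this illustration.

```python
from collections import defaultdict

# Hypothetical lookup table mapping tool-specific spellings to one canonical
# value. This is an assumption for illustration, not part of FITS.
ALIASES = {
    "png": "Portable Network Graphics",
    "portable network graphics": "Portable Network Graphics",
}


def normalize(value):
    """Map a raw tool value to its canonical form; unknown values pass through."""
    cleaned = value.strip()
    return ALIASES.get(cleaned.lower(), cleaned)


def consolidate(tool_outputs):
    """Merge per-tool {field: value} dicts into a single record.

    For each field, collect the distinct normalized values together with the
    tools that reported them, and flag whether all reporting tools agree.
    """
    merged = defaultdict(dict)
    for tool, fields in tool_outputs.items():
        for field, raw in fields.items():
            merged[field].setdefault(normalize(raw), []).append(tool)
    return {
        field: {"values": values, "agree": len(values) == 1}
        for field, values in merged.items()
    }


# Three made-up tools reporting on the same file.
outputs = {
    "tool_a": {"format": "PNG"},
    "tool_b": {"format": "Portable Network Graphics"},
    "tool_c": {"format": "PNG", "width": "640"},
}
result = consolidate(outputs)
print(result["format"])
```

In this sketch the two spellings of the format name collapse to one value before comparison, so the three made-up tools are treated as agreeing rather than conflicting.
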
There are a number of tools that will be evaluated for incorporation into FITS in the future, including:

* Apache Tika
* JHOVE 2
* Aduna Aperture
* MediaInfo

FITS is written in Java and is compatible with Java 1.5 or higher. FITS can be invoked by its command-line interface or through its Java API.

FITS produces a "status" value for each format identification it makes. When the status is SINGLE_RESULT, all tools that were able to identify the format agree on the file's format. When the status is CONFLICT, there is more than one purported format identified for the file.

Because FITS combines the output of multiple tools, it has to be able to handle conflicts among the tools' output when they disagree. It handles these conflicts in several ways:

* Tool output is normalized before it is compared for conflicts. For example, one tool might report a file's format as "PNG", while another tool might report it as "Portable Network Graphics". In another example, one tool might output the resolution unit as "2"; another tool might output it as "inches". These values are normalized in the XSLT file that converts each tool's native output to FITS XML, before the FITS XML from the different tools is compared.
* Users configure a tool ordering preference. In cases of format identification conflicts, the format identified by the preferred tools will determine the format FITS reports.
* Tools can be excluded from reporting on particular formats and/or on particular metadata elements if their output is found in testing to be incorrect or buggy. This makes it possible to incorporate a tool into FITS for the things it does well without having to accept known unreliable information from it.
* FITS consults a configurable "format tree" to know when two reported formats for a file are not really conflicts because one of the formats is a more specific form of the other format.
  For example, the format tree documents that the OpenDocument Text format is a more specific form of the Zip format. If a file is identified as being in both of these formats by FITS tools, it is not reported as a conflict because technically they are both correct. Instead the more specific format, OpenDocument Text, is reported as the format.

FITS is available to the public under the LGPL license. Harvard University Library (HUL) plans to use FITS in production in 2010 within its ingest service, but is making an early release of it available now for testing at http://fits.googlecode.com. Additional tools are being written at HUL to convert FITS XML into MIX, textMD, documentMD and other technical metadata schemas.

We invite you to download and try using FITS. Any issues using it can be reported on the FITS website on the Issues web page (http://code.google.com/p/fits/issues/list). For more information please see the FITS website (http://fits.googlecode.com) or contact me directly.

--
Andrea Goethals
Digital Preservation and Repository Services Manager
HUL - Office for Information Systems
90 Mt. Auburn Street
Cambridge, MA 02138
phone: (617) 495-3724
[log in to unmask]

--
R. John Robertson
Research Fellow/ Open Education Resources programme support officer (JISC CETIS),
Centre for Academic Practice and Learning Enhancement,
University of Strathclyde

The University of Strathclyde is a charitable body, registered in Scotland, with registration number SC015263