Things I might have tweeted, part 3: New tools to address electronic records challenges #marac

Peter Bajcsy, National Center for Supercomputing Applications:

What do you do if you come across an unsupported file format? Move the file, get a new format from the creator, or buy the software.

Presidential erecords: Reagan, 200k; Clinton, 33 million; GWB, 300 million.

Almost 80k different file name extensions.

We need to figure out how to automate and scale systems dealing with erecords.

His preferred solution: cloud services. (I have questions about this, but hopefully he will address them.)

Conversion Software Registry tells you how to convert from one file type to another.

Polyglot is their software to convert files, available in the cloud and as a download for your repository.

Universal content viewer is part of this; the software tries to display content in any format.

Content-based file comparison: compare two files and evaluate information loss over multiple metrics, not just checksums or something similar.

With these metrics, you can tell how much data you will lose due to conversion.
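
To make the checksum-versus-content distinction concrete, here is a minimal Python sketch. It is not the NCSA comparison tool and does not use its metrics (the talk did not spell those out); the function names are my own, and the word-level similarity ratio from the standard library's difflib is just a stand-in for a content-based measure.

```python
import difflib
import hashlib


def checksum_equal(a: bytes, b: bytes) -> bool:
    """Byte-level check: True only if the two byte strings are identical."""
    return hashlib.sha256(a).digest() == hashlib.sha256(b).digest()


def text_similarity(a: str, b: str) -> float:
    """Content-level check: ratio in [0, 1] of matching words in order."""
    return difflib.SequenceMatcher(None, a.split(), b.split()).ratio()


# Hypothetical example: a conversion that only changed the line endings.
original = "Quarterly report\r\nRevenue rose 4 percent."
converted = "Quarterly report\nRevenue rose 4 percent."

# A second hypothetical conversion that dropped the last sentence.
lossy = "Quarterly report"

print(checksum_equal(original.encode(), converted.encode()))
print(text_similarity(original, converted))
print(text_similarity(original, lossy))
```

The checksum says the first pair differs, while the word-level ratio says nothing readable was lost; the second pair scores lower because words really did go missing. A real tool would need format-aware metrics for images, spreadsheets, and so on.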

They are providing a prototype of Polyglot for free, but it can only deal with open formats. You have to buy the system and licenses for proprietary formats to convert them.

Universal viewer, if possible for all formats, could lessen the need for DIPs.

William Underwood, Georgia Tech:
Tools for file format id

Unix file command is the most widely used, but it is limited.

Created a file signature database to extend the file command.
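
For anyone who has not seen signature-based identification, here is a rough Python sketch of the idea. It is not Underwood's database and not the magic-file format that the Unix file command reads; the table and function names are my own, and the handful of signatures listed are just well-known leading byte sequences.

```python
# Tiny illustrative signature table: (leading bytes, label).
SIGNATURES = [
    (b"%PDF-", "PDF document"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"GIF87a", "GIF image"),
    (b"GIF89a", "GIF image"),
    (b"\xff\xd8\xff", "JPEG image"),
    (b"PK\x03\x04", "ZIP archive (also the container for DOCX, ODT, EPUB)"),
]


def identify(header: bytes) -> str:
    """Return the label of the first signature the header starts with."""
    for magic, label in SIGNATURES:
        if header.startswith(magic):
            return label
    return "unknown"


def identify_file(path: str) -> str:
    """Read the first few bytes of a file and identify them."""
    with open(path, "rb") as f:
        return identify(f.read(16))


print(identify(b"%PDF-1.4\n"))
print(identify(b"plain old text"))
```

The ZIP entry shows why a bigger signature database matters: several formats share the same leading bytes, so telling them apart means looking deeper into the file than a short prefix.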

Automatic markup of emails with XML allows for searching and organization. Would be useful for legacy finding aid conversion, transcriptions, mass digitization, etc.

Manually created grammars for 14 categories of documents and used those grammars to apply the XML markup to each document.

Not practical if we have to create grammars for every type of document, but they are working on creating automatic grammar generators.
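
To give a feel for what XML markup of an email looks like, here is a small Python sketch. It swaps in a different technique from the talk: it leans on the standard library's email parser rather than a document grammar, and the element names are my own invention, not the project's schema. It only handles simple single-part messages.

```python
import xml.etree.ElementTree as ET
from email import message_from_string


def email_to_xml(raw: str) -> str:
    """Wrap the main header fields and the body of a plain email in XML."""
    msg = message_from_string(raw)
    root = ET.Element("email")
    for field in ("From", "To", "Date", "Subject"):
        value = msg.get(field)
        if value is not None:
            ET.SubElement(root, field.lower()).text = value
    body = msg.get_payload()
    if isinstance(body, str):  # multipart messages return a list; skipped here
        ET.SubElement(root, "body").text = body
    return ET.tostring(root, encoding="unicode")


# Hypothetical message for illustration.
raw = (
    "From: a@example.org\n"
    "To: b@example.org\n"
    "Subject: Budget draft\n"
    "\n"
    "Please review.\n"
)
xml_text = email_to_xml(raw)
print(xml_text)
```

Once the fields are tagged, searching by sender or subject is a simple query. Emails are the easy case because the headers are already labeled, which is presumably why less structured documents need grammars.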

Maria Esteva, Texas Advanced Computing Center:
Mapping archival processing to visualization

Analyze large and multivariable collections for archival processing. Lots of different file formats.

Trying to extract as much information as possible from the records themselves to help create finding aids.

Visualization of archival collections: a visual representation of archival collections. Colors and size used to represent size and formats in different parts of collections. A non-textual way to see collections.

Visualization of file types can also help with processing of large collections of electronic records. Can show arrangement as well as preservation risk.
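
The data behind a picture like that can be gathered with very little code. Here is a Python sketch, not the TACC software: the function name and the demo folder layout are made up, and it only tallies bytes per file extension for each top-level folder, which is the kind of table you could then map to color and size.

```python
import os
import tempfile
from collections import Counter, defaultdict


def format_profile(root: str) -> dict:
    """Map each top-level folder under root to total bytes per extension."""
    profile = defaultdict(Counter)
    for dirpath, _dirs, files in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        top = "." if rel == "." else rel.split(os.sep)[0]
        for name in files:
            ext = os.path.splitext(name)[1].lower() or "(none)"
            profile[top][ext] += os.path.getsize(os.path.join(dirpath, name))
    return profile


# Build a throwaway collection so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "series_a", "drafts"))
    os.makedirs(os.path.join(root, "series_b"))
    samples = [
        (("series_a", "memo.PDF"), b"abc"),
        (("series_a", "drafts", "memo_v1.pdf"), b"ab"),
        (("series_b", "README"), b"a"),
    ]
    for parts, data in samples:
        with open(os.path.join(root, *parts), "wb") as f:
            f.write(data)
    demo = format_profile(root)
    for folder, counts in sorted(demo.items()):
        print(folder, dict(counts))
```

Extensions are a weak proxy for format, so a serious version would pair this with signature-based identification before flagging preservation risk.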

The demonstrations were pretty cool, and I can’t wait to get my hands on these various pieces of software and play around. My one critique is that the presenters made it a little difficult to find out where we could get more information about their projects. Hopefully with some google-fu, I can figure out more.
