Researchers at the Georgia Tech Research Institute (GTRI) are sharing results of advanced file-format recognition research with The National Archives of the United Kingdom. The effort could enhance worldwide capability to manage the vast array of file formats created since the computer age began.
Improving archivists' ability to categorize and access hundreds of different computer file formats is critical in the digital age. Increasingly, archives receive large quantities of government and other records in a wide variety of digital formats.
"The ultimate problem we're addressing here is technical obsolescence," said William Underwood, a principal research scientist leading the file-recognition effort for GTRI. "As software programs have been superseded over the years, it’s become critical to automate the enormous task of categorizing, verifying and viewing hundreds of past and present file formats."
One major facilitator of that task is the PRONOM service, developed by The National Archives of the U.K. This file-format registry, which can be utilized online by archivists and others worldwide, employs a database containing details of more than 750 different digital file formats. Those formats, in turn, are accessed by a file-format identification tool called DROID.
Underwood explained that archivists face the task of distinguishing among data files in hundreds of different formats. At the most basic level, categorizing these data formats requires software tools that examine file extensions, which are the identifying characters such as "doc" or "pdf" found at the end of filenames.
Yet a file extension -- an external identifier that is easily modified or deleted -- can be inaccurate. More critical is the capability to identify correctly the distinctive internal signature that characterizes a file's format.
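The difference between the two identifiers can be shown with a minimal Python sketch. It is an illustration only, not the DROID implementation: the function names and the small signature table are hypothetical, and the table holds just a few widely known leading byte sequences rather than data from PRONOM. Signatures in the actual registry may be more elaborate than a fixed prefix.

```python
from pathlib import Path
from typing import Optional, Tuple

# Hypothetical, illustrative table of well-known leading bytes ("magic numbers").
# This is NOT PRONOM data; it only demonstrates the idea of an internal signature.
SIGNATURES = {
    "pdf": b"%PDF-",
    "png": b"\x89PNG\r\n\x1a\n",
    "gif": b"GIF8",
    "zip": b"PK\x03\x04",
}


def format_from_extension(filename: str) -> Optional[str]:
    """Guess the format from the external identifier (the file extension)."""
    suffix = Path(filename).suffix.lower().lstrip(".")
    return suffix or None


def format_from_signature(data: bytes) -> Optional[str]:
    """Guess the format from the leading bytes of the file's content."""
    for name, magic in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return None


def identify(filename: str, data: bytes) -> Tuple[Optional[str], Optional[str]]:
    """Return (extension guess, signature guess) so mismatches are visible."""
    return format_from_extension(filename), format_from_signature(data)


if __name__ == "__main__":
    # A PDF that has been renamed with a misleading extension.
    by_extension, by_signature = identify("report.doc", b"%PDF-1.4\n")
    print("extension says:", by_extension)
    print("signature says:", by_signature)
```

In this sketch a PDF renamed to "report.doc" is labeled "doc" by its extension but "pdf" by its content, which is the kind of mismatch an internal signature is meant to catch.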
GTRI, in cooperation with the U.S. National Archives and Records Administration (NARA), is helping the United Kingdom expand the roster of internal signatures in the PRONOM database. GTRI has added more than 50 such signatures to PRONOM in recent months, increasing the number of signatures in the database by almost a quarter, with more additions expected next year. This work is being performed at the request of the National Archives Center for Advanced Systems and Technologies (NCAST), a NARA unit.
Currently, about a third of PRONOM's 750 file formats have internal signatures. Increasing the number of internal signatures is important, Underwood said, because it helps the DROID tool identify files more accurately. In turn, increased accuracy enables digital archivists to better identify older, obsolete file formats and develop appropriate migration strategies and preservation tools.
"We are grateful to NARA and the Georgia Tech Research Institute for the work they have recently undertaken on file-format research," said David Thomas, director of technology at The National Archives of the U.K. "The decision to share their work...has significantly improved the PRONOM database and will be of enormous benefit to the wider digital preservation community."
The technology contributed to The National Archives of the U.K. is derived from GTRI's research into Advanced Language Processing Technology Applied to Digital Records, a project sponsored by the U.S. Army Research Laboratory and by NCAST. This work applies computational linguistics technology to summarizing, accessing, reviewing and preserving electronic records of the Department of Defense, federal agencies and presidential administrations.
"In PRONOM/DROID, The National Archives of the U.K. has responded to an essential need for preserving and providing sustained access to valuable digital information," said Kenneth Thibodeau, director of NCAST. "We are happy to be able to contribute to enhancing a tool that we use in NARA's Electronic Records Archives system. This helps us and also benefits anyone who needs to preserve digital assets."
The first version of PRONOM was developed by The National Archives' Digital Preservation Department for internal use in March 2002 and was launched as a free online service to the public in February 2004. In 2007 The National Archives won the Digital Preservation Award for its development of the PRONOM and DROID tools.
In 2011, PRONOM data will be released in a linked, open format. This move will make it easier for others to reuse the data, and will provide a means to extend and develop the dataset. More information is available at http://labs.nationalarchives.gov.uk/wordpress/.
"The GTRI computational-linguistics team will certainly continue to contribute to PRONOM," Underwood said. "We're eager to use our experience in language-processing technology to support the evolution of this internationally important file format database."
Research News & Publications Office
Georgia Institute of Technology
75 Fifth Street, N.W., Suite 314
Atlanta, Georgia 30308 USA
Writer: Rick Robinson