Minimizing the DROID signature file

2017-01-08 | Martin Hoppenheit | 9 min read

The DROID file format identification tool is fuelled by a so-called signature file containing information from the PRONOM technical registry. For the average use case, the official signature files provided by PRONOM are perfectly fine as they allow DROID to identify a large range of different formats. The drawback: Analysis based on these signature files is rather slow, presumably because DROID has to crawl through so many file format signatures.

This article investigates whether DROID runs faster with a signature file that has been restricted to a limited number of file formats. In other words, we will see if it is worth “minimizing” a signature file. (Of course, restricting a signature file to certain file formats will lead DROID to identify all other file formats as “Unknown”.)

Basic ideas

In a nutshell, a DROID signature file consists of two parts: a set of InternalSignature definitions and a set of FileFormat definitions. The signatures describe bit patterns (a.k.a. magic numbers) that DROID looks for when it tries to identify a file format, and each file format definition lists the format’s name, version, PUID, MIME type, common file name extensions and similar properties as well as references to the InternalSignature definitions that are characteristic of this file format. If you want to see some XML, the InternalSignature and FileFormat entries for the TIFF format look like this:

<FFSignatureFile>
    <InternalSignatureCollection>
        <InternalSignature ID="9" Specificity="Specific">
            <ByteSequence Reference="BOFoffset">
                <SubSequence MinFragLength="0" Position="1"
                    SubSeqMaxOffset="0" SubSeqMinOffset="0">
                    <Sequence>49492A00</Sequence>
                    <DefaultShift>5</DefaultShift>
                    <Shift Byte="00">1</Shift>
                    <Shift Byte="2A">2</Shift>
                    <Shift Byte="49">3</Shift>
                </SubSequence>
            </ByteSequence>
        </InternalSignature>
        <InternalSignature ID="10" Specificity="Specific">
            <!-- Content omitted for brevity. -->
        </InternalSignature>
    </InternalSignatureCollection>
    <FileFormatCollection>
        <FileFormat ID="1099" MIMEType="image/tiff"
            Name="Tagged Image File Format" PUID="fmt/353">
            <InternalSignatureID>9</InternalSignatureID>
            <InternalSignatureID>10</InternalSignatureID>
            <Extension>tif</Extension>
            <Extension>tiff</Extension>
        </FileFormat>
    </FileFormatCollection>
</FFSignatureFile>

With this structure in mind, the basic idea of filtering (and thus minimizing) a signature file is simple:

  1. Define a list of interesting PUIDs.
  2. Collect all FileFormat definitions whose PUIDs are on the list.
  3. Collect all InternalSignature definitions that are referenced by any of the FileFormat definitions collected in step 2.

This process could be carried out manually by editing the XML file by hand, but of course it is more convenient to employ some automation. Enter the DROID Signature File Minimizer, or droidsfmin for short, a tool that does exactly this. It takes a list of PUIDs and filters a signature file so that only the relevant entries remain.
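To make the three steps concrete, here is a minimal Python sketch. It is not the droidsfmin implementation; the function names `local_name`, `find_child` and `minimize` are made up for this illustration. It assumes the element and attribute names shown in the XML excerpt above and compares tags by local name so that it does not depend on whether the file declares an XML namespace. Any other cross-references a real signature file may contain are ignored.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def find_child(parent, name):
    """Return the first direct child with the given local name, or None."""
    for child in parent:
        if local_name(child.tag) == name:
            return child
    return None


def minimize(root, puids):
    """Filter a parsed FFSignatureFile element in place and return it.

    Step 1 (the list of interesting PUIDs) is the `puids` argument.
    """
    formats = find_child(root, "FileFormatCollection")
    signatures = find_child(root, "InternalSignatureCollection")

    # Step 2: keep only FileFormat entries whose PUID is on the list.
    for fmt in list(formats):
        if fmt.get("PUID") not in puids:
            formats.remove(fmt)

    # Step 3: keep only InternalSignature entries that are still referenced.
    referenced = {
        ref.text.strip()
        for fmt in formats
        for ref in fmt
        if local_name(ref.tag) == "InternalSignatureID"
    }
    for sig in list(signatures):
        if sig.get("ID") not in referenced:
            signatures.remove(sig)
    return root
```

Calling `minimize(ET.parse(path).getroot(), {"fmt/353"})` on a parsed signature file would leave only the TIFF entry and the signatures it references. Writing the result back to disk (including namespace handling) is left out of the sketch.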

What’s next? Using this approach, we now assemble a collection of filtered signature files for different groups of file formats. Then we run DROID on a set of test files using the different signature files and compare its respective performance.

Filter criteria

We compare the following file format selections.

  • Original: This is the original DROID signature file v86 as provided by the National Archives. It contains the whole wisdom of PRONOM, which in this version means 1403 different format entries.
  • PDF: Contains 8 entries for the basic PDF formats, i.e., PDF 1.0 to 1.7. No PDF/A, PDF/X or other PDF derivatives are included. PUIDs fmt/14, fmt/15, fmt/16, fmt/17, fmt/18, fmt/19, fmt/20, fmt/276.
  • PDF/A: Contains 8 entries for all PDF/A variations. PUIDs fmt/95, fmt/354, fmt/476, fmt/477, fmt/478, fmt/479, fmt/480, fmt/481.
  • TIFF: Contains 9 entries for the TIFF format, including the deprecated PUIDs for TIFF 3–6 and some specialized derivatives like GeoTIFF. PUIDs fmt/7, fmt/8, fmt/9, fmt/10, fmt/153, fmt/154, fmt/155, fmt/156, fmt/353.
  • Word+Excel: Contains 58 entries for a wide variety of Microsoft Word and Excel formats. PUIDs fmt/55, fmt/56, fmt/57, fmt/58, fmt/59, fmt/61, fmt/62, fmt/39, fmt/40, fmt/37, fmt/38, fmt/172, fmt/173, fmt/174, fmt/175, fmt/176, fmt/177, fmt/178, fmt/214, fmt/346, fmt/412, fmt/445, fmt/523, fmt/553, fmt/554, fmt/555, fmt/556, fmt/595, fmt/597, fmt/598, fmt/599, fmt/609, fmt/627, fmt/628, fmt/754, fmt/755, x-fmt/1, x-fmt/2, x-fmt/17, x-fmt/23, x-fmt/45, x-fmt/46, x-fmt/58, x-fmt/64, x-fmt/65, x-fmt/74, x-fmt/97, x-fmt/123, x-fmt/124, x-fmt/125, x-fmt/126, x-fmt/128, x-fmt/129, x-fmt/204, x-fmt/273, x-fmt/274, x-fmt/275, x-fmt/276.
  • XML: Contains 3 entries for XML, XML Schema and XSLT. PUIDs fmt/101, x-fmt/280, x-fmt/281.
  • WAVE: Contains 16 entries for different variants of the WAVE format. PUIDs fmt/6, fmt/2, fmt/1, fmt/141, fmt/142, fmt/143, fmt/527, fmt/703, fmt/704, fmt/705, fmt/706, fmt/707, fmt/708, fmt/709, fmt/710, fmt/711. This signature file will be tested with the Digital Preservation Stage Boss One corpus only (see below), since the Govdocs Selected corpus seems to contain no WAVE files.

Test corpora

For the main benchmark, we run DROID on the Govdocs Selected corpus which contains over 26,000 files with a total size of around 32 gigabytes. Additionally, we take a short glance at the WAVE files used by Ross Spencer in his recent Digital Preservation Stage Boss One experiment.

Benchmark environment

The hardware and software environment on which we run the benchmark is defined by the following characteristics:

  • Intel Core i5-5200U 2.2 GHz dual core CPU (four threads)
  • 4 GB RAM
  • SanDisk SD7SB6S SSD
  • Debian 8
  • Java OpenJDK Runtime Environment 1.7.0 64 bit
  • DROID 6.2.1

We use the command line version of DROID in no profile mode with commands like the following:

$ droid -R -Nr ./govdocs_selected/ -Ns signature_file.xml

No maximum bytes to scan limitation is defined and no container signature file is used. (N.B., both of these factors have a significant impact on DROID’s accuracy and performance themselves. Setting a max bytes to scan limit usually makes DROID much faster but can cause it to miss, for example, PDF/A files. Using a container signature file improves the identification of some office files and other container-based formats.) With each format selection (i.e., filtered signature file) we run DROID ten times and measure the average number of CPU seconds. (CPU seconds is the amount of time DROID has actively been executed by the computer’s processor(s). This is different from just comparing the wall-clock times at which DROID was started and finished, because the DROID process may have been interrupted by other processes on the system.)

Since running DROID again and again and measuring its execution time is rather tedious, we automate the process with some scripts. Details can be found in the benchmark directory of the droidsfmin GitHub repository.
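For illustration only, the measurement idea (CPU seconds rather than wall-clock time, averaged over several runs) could be sketched in Python as follows. This is not the code from the repository; `cpu_seconds` and `average_cpu_seconds` are hypothetical helper names, and the sketch assumes a Unix-like system where `os.times()` reports the CPU time of finished child processes.

```python
import os
import subprocess


def cpu_seconds(command):
    """Run a command and return the CPU seconds (user + system) it consumed.

    Assumption: a Unix-like system. The children_* fields of os.times()
    only cover child processes that have terminated and been waited for,
    which subprocess.run() takes care of.
    """
    before = os.times()
    subprocess.run(command, check=True, stdout=subprocess.DEVNULL)
    after = os.times()
    user = after.children_user - before.children_user
    system = after.children_system - before.children_system
    return user + system


def average_cpu_seconds(command, runs=10):
    """Average the CPU seconds over several runs of the same command."""
    return sum(cpu_seconds(command) for _ in range(runs)) / runs
```

With DROID on the path, a call such as `average_cpu_seconds(["droid", "-R", "-Nr", "./govdocs_selected/", "-Ns", "signature_file.xml"])` would repeat the command shown above ten times and return the mean.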

Benchmark results

With the Govdocs Selected corpus and the format selections (filtered signature files) described above, we get the following results:

format selection   number of PUIDs   average CPU seconds
Original           1403              174.89
PDF                8                 7.65
PDF/A              8                 58.77
TIFF               9                 7.68
Word+Excel         58                22.47
XML                3                 7.17

We can see that DROID is considerably faster with the filtered signature files. In the best case (XML) it beats the original signature file by a factor of ~24, and even in the worst case (PDF/A) it is still ~3 times as fast as with the original signature file.
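The speed-up factors follow directly from the table: divide the CPU seconds of the original signature file by those of each filtered one. A short Python sketch of that arithmetic, using only the numbers reported above (the variable names are arbitrary):

```python
# Average CPU seconds from the Govdocs Selected results above.
cpu_seconds_by_selection = {
    "Original": 174.89,
    "PDF": 7.65,
    "PDF/A": 58.77,
    "TIFF": 7.68,
    "Word+Excel": 22.47,
    "XML": 7.17,
}

baseline = cpu_seconds_by_selection["Original"]

# Speed-up factor = CPU seconds with the original file / CPU seconds filtered.
speedups = {
    name: baseline / seconds
    for name, seconds in cpu_seconds_by_selection.items()
    if name != "Original"
}

# Print the selections from largest to smallest speed-up.
for name, factor in sorted(speedups.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<12}{factor:5.1f}x")
```

This gives roughly 24.4 for XML and 3.0 for PDF/A, with the other selections in between.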

However, the increase in speed varies widely across the different format selections. This deserves a closer look.

It should be no surprise that DROID is slower with the Word+Excel selection (58 PUIDs) than with the XML selection (3 PUIDs) – after all, the premise of this whole endeavour was that fewer format entries mean better performance. But contrary to what one might expect, performance is not tied to the number of PUIDs in the signature file alone. Compare the PDF/A selection to the PDF and the Word+Excel selections: Although PDF/A and PDF both contain 8 PUIDs, the PDF selection is much faster. And although the Word+Excel selection contains 58 PUIDs, it is not slower but faster than the PDF/A selection. (That might, however, be different if we used a container signature file, which we have not done here!) This becomes even more obvious when we use the WAVE selection on the Digital Preservation Stage Boss One corpus. Although the selection contains only 1.14% of the number of PUIDs in the original signature file, it is only slightly faster!

format selection   number of PUIDs   average CPU seconds
Original           1403              105.22
WAVE               16                93.34

So while the number of format entries in a signature file does have a significant impact on DROID’s performance, it is not the only aspect to keep in mind. Some signatures are more complex than others, and not every signature can be processed in constant time.

One such problem is wildcard signatures. Consider two bit patterns FFCC and FF*CC. Let’s say both of them identify some file format and both of them have offset 0, so they should be found at the very beginning of a file. If we look for the first pattern in a file we have to check at most two bytes (at position 0 and at position 1) to decide if the pattern matches or not. Two bytes to read and we are done. But if we look for the second pattern things become awkward: After we have found the byte FF at position 0 we have to search for the byte CC anywhere in the rest of the file because the wildcard * means that any number of bytes may appear between FF and CC! So we check the byte at position 1, and if it’s not CC then we check the byte at position 2, and if it’s not CC then we check the byte at position 3, and if it’s not CC then … In the worst case, that means looking at each and every byte in the whole file instead of just the first two bytes. So these two very similar bit patterns will cause DROID to spend very different amounts of time analyzing a file. And in fact, all format entries in the PDF/A selection and 13 out of 16 in the WAVE selection are based on signatures containing wildcards.
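The difference can be made concrete with a small Python sketch that counts how many bytes each kind of pattern has to look at. It uses a deliberately naive scan and hypothetical function names (`match_at`, `match_wildcard`); DROID’s actual matching is more elaborate, so this only illustrates the worst-case behaviour described above.

```python
def match_at(data, pattern, offset=0):
    """Compare pattern with data at offset; return (matched, bytes_examined)."""
    examined = 0
    for i, expected in enumerate(pattern):
        if offset + i >= len(data):
            return False, examined
        examined += 1
        if data[offset + i] != expected:
            return False, examined
    return True, examined


def match_wildcard(data, head, tail):
    """Match head at offset 0, then tail anywhere after it (the FF*CC case)."""
    matched, examined = match_at(data, head)
    if not matched:
        return False, examined
    # Naive scan: try every later position until the tail is found.
    for start in range(len(head), len(data) - len(tail) + 1):
        found, looked_at = match_at(data, tail, start)
        examined += looked_at
        if found:
            return True, examined
    return False, examined


# A file that starts with FF and contains CC only as its very last byte.
data = b"\xff" + b"\x00" * 1000 + b"\xcc"
print(match_at(data, b"\xff\xcc"))             # fixed pattern FFCC
print(match_wildcard(data, b"\xff", b"\xcc"))  # wildcard pattern FF*CC
```

For this 1002-byte example the fixed pattern is rejected after two bytes, while the wildcard pattern matches only after every byte of the file has been examined.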

Conclusion

So is it worth it? Generally, yes.

If your archival policy accepts only a well-defined set of file formats for ingest then you can easily restrict the DROID signature file to just these formats. This way it may be faster, but even better, all formats that your policy does not allow will be flagged as “Unknown”, which stands out far more than having to scan a list of PUIDs manually for the disallowed ones.

If you are processing a set of files which you expect to be of a certain format, you can (with the help of the droidsfmin tool) quickly build a custom signature file for just this project. This may considerably speed up the file format identification phase and lets you deal with the unexpected “Unknown” formats in a second step, using another signature file that is able to identify them.

However, as we have seen, this will not always work out. Depending on the file formats you choose and the complexity of their signatures, the speed increase may be close to zero. If you encounter such a case, take a closer look at the offending PRONOM signature and see if you can improve it! And last but not least, the size and complexity of the signature file is of course only one factor that influences DROID’s performance. Another (probably even more important) factor lies in its algorithms and implementation details. Minimizing the signature file is certainly not the holy grail of file format identification performance, but it can be helpful.

To summarize, if you want to minimize a DROID signature file for speed (or other reasons), there are basically two things to do:

  1. Filter the file format entries in the signature file (with the droidsfmin tool) to contain only those you need, based on your archival policy or your current analysis.
  2. Avoid complex signatures, in particular wildcard signatures, or improve them where possible.