Overview

While investigating Archivematica, I wanted to verify some aspects of how File Information Tool Set (FITS) behaves, so I tried running it with Docker. This is a memo of that process.

https://github.com/harvard-lts/fits

Installation

Installation with Docker is described on the following page.

https://github.com/harvard-lts/fits?tab=readme-ov-file#docker-installation

However, I could not download the latest release (1.6.0), which includes the Dockerfile, from the following page mentioned in the manual.

https://projects.iq.harvard.edu/fits/downloads

Instead, I downloaded the latest zip file from the following GitHub releases page.

https://github.com/harvard-lts/fits/releases/tag/1.6.0

I then extracted the archive and built the image according to the README instructions.

However, on an M1 Mac, following the steps as written produced the error below.

% docker run --rm -v `pwd`:/work fits -i fits.sh
2024-01-26 11:41:10 - ERROR - MediaInfo:95 - Error loading native library for this operating system for tool: MediaInfo. ostype=[Linux] -- jvmModel=[64] -- nativeLibPath=[/opt/fits/tools/mediainfo/linux] -- No native MediaInfo library for this OS
java.lang.UnsatisfiedLinkError: Unable to load library 'mediainfo':
libmediainfo.so: cannot open shared object file: No such file or directory
libmediainfo.so: cannot open shared object file: No such file or directory
/opt/fits/tools/mediainfo/linux/libmediainfo.so.0: cannot open shared object file: No such file or directory
...

When I consulted ChatGPT 4 about this, it suggested adding the following to the Dockerfile.

RUN apt-get update && \
    apt-get install -yqq \
    # Other dependencies
    mediainfo libmediainfo-dev \
    && rm -rf /var/lib/apt/lists/*

After adding the above and rebuilding the image, FITS ran without the error.
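
The error means that the dynamic loader could not find libmediainfo. The following is a minimal sketch of one way to check for the library from inside the container. It assumes a Python 3 interpreter is available in the image, which may not be the case, and the helper name has_native_library is hypothetical. Note also that this check goes through the system loader configuration, which is not necessarily the same lookup that FITS performs from Java.

```python
import ctypes.util


def has_native_library(name: str) -> bool:
    """Return True if the system's library search can locate lib<name>.

    Hypothetical helper for illustration; it wraps the standard
    ctypes.util.find_library call, which returns None when nothing is found.
    """
    return ctypes.util.find_library(name) is not None


if __name__ == "__main__":
    # "mediainfo" is the library name taken from the error message above.
    found = has_native_library("mediainfo")
    print("libmediainfo found" if found else "libmediainfo NOT found")
```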

Trying It Out

Because I wanted to test a file with Japanese characters in its name, I used “A Very Understandable Guide to Copyright and Classes.pdf” (Hiroshima University Information Media Education Research Center), which is published online under a CC BY license.

https://www.media.hiroshima-u.ac.jp/wp-content/uploads/2023/05/すごくわかる著作権と授業.pdf

Then, I executed the following.

docker run --rm -v `pwd`:/work fits -i すごくわかる著作権と授業.pdf
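
To run the same command from a script, a sketch like the following may help. The image name fits and the /work mount point come from the command above; the function names build_fits_command and run_fits are hypothetical, and the sketch assumes that FITS writes its XML report to standard output. Passing the arguments as a list avoids shell quoting problems with non-ASCII filenames.

```python
import subprocess
from pathlib import Path


def build_fits_command(workdir: Path, filename: str, image: str = "fits") -> list[str]:
    """Build the `docker run` invocation shown above as an argument list."""
    return [
        "docker", "run", "--rm",
        "-v", f"{workdir.resolve()}:/work",  # equivalent of `pwd`:/work
        image,
        "-i", filename,
    ]


def run_fits(workdir: Path, filename: str) -> str:
    """Run FITS in the container and return whatever it printed to stdout.

    Assumption: the XML report is written to standard output.
    Raises subprocess.CalledProcessError if docker exits with an error.
    """
    result = subprocess.run(
        build_fits_command(workdir, filename),
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8")


if __name__ == "__main__":
    print(run_fits(Path.cwd(), "すごくわかる著作権と授業.pdf"))
```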

As a result, the following output was obtained.

<?xml version="1.0" encoding="UTF-8"?>
<fits xmlns="http://hul.harvard.edu/ois/xml/ns/fits/fits_output" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://hul.harvard.edu/ois/xml/ns/fits/fits_output http://hul.harvard.edu/ois/xml/xsd/fits/fits_output.xsd" version="1.6.0" timestamp="1/26/24, 12:49 PM">
  <identification>
    <identity format="PDF/X" mimetype="application/pdf" toolname="FITS" toolversion="1.6.0">
      <tool toolname="Droid" toolversion="6.5.2" />
      <tool toolname="Exiftool" toolversion="12.50" />
      <tool toolname="Tika" toolversion="2.6.0" />
      <version toolname="Tika" toolversion="2.6.0">PDF/X-4</version>
      <externalIdentifier toolname="Droid" toolversion="6.5.2" type="puid">fmt/488</externalIdentifier>
    </identity>
  </identification>
  <fileinfo>
    <size toolname="Jhove" toolversion="1.26.1">13845166</size>
    <creatingApplicationName toolname="Jhove" toolversion="1.26.1">Adobe PDF Library 17.0/Adobe InDesign 18.1 (Macintosh)</creatingApplicationName>
    <lastmodified toolname="Exiftool" toolversion="12.50">2023-01-14T06:28:17Z</lastmodified>
    <created toolname="Exiftool" toolversion="12.50">2023-01-14T05:31:26Z</created>
    <filepath toolname="OIS File Information" toolversion="1.0" status="SINGLE_RESULT">/work/すごくわかる著作権と授業.pdf</filepath>
    <filename toolname="OIS File Information" toolversion="1.0" status="SINGLE_RESULT">すごくわかる著作権と授業.pdf</filename>
    <md5checksum toolname="OIS File Information" toolversion="1.0" status="SINGLE_RESULT">1f6a11a1b23607a0e29e10efdc153584</md5checksum>
    <fslastmodified toolname="OIS File Information" toolversion="1.0" status="SINGLE_RESULT">1684976952000</fslastmodified>
  </fileinfo>
  <filestatus>
    <well-formed toolname="Jhove" toolversion="1.26.1" status="SINGLE_RESULT">true</well-formed>
    <valid toolname="Jhove" toolversion="1.26.1" status="SINGLE_RESULT">true</valid>
  </filestatus>
  <metadata>
    <document>
      <title toolname="Jhove" toolversion="1.26.1">すごくわかる著作権と授業_PDF版.indd</title>
      <author toolname="Exiftool" toolversion="12.50" status="SINGLE_RESULT">Adobe InDesign 18.1 (Macintosh)</author>
      <language toolname="Jhove" toolversion="1.26.1">ja-JP</language>
      <pageCount toolname="Jhove" toolversion="1.26.1">64</pageCount>
      <hasOutline toolname="Jhove" toolversion="1.26.1" status="SINGLE_RESULT">no</hasOutline>
      <hasAnnotations toolname="Jhove" toolversion="1.26.1" status="SINGLE_RESULT">yes</hasAnnotations>
      <graphicsCount toolname="Jhove" toolversion="1.26.1" status="SINGLE_RESULT">7</graphicsCount>
      <font toolname="Jhove" toolversion="1.26.1" status="SINGLE_RESULT">
        <fontName>DINNextLTPro-Medium</fontName>
      </font>
      <font toolname="Jhove" toolversion="1.26.1" status="SINGLE_RESULT">
        <fontName>DINNextRoundedLTPro-Regular</fontName>
      </font>
      <font toolname="Jhove" toolversion="1.26.1" status="SINGLE_RESULT">
        <fontName>GaramondPremrPro</fontName>
      </font>
      <font toolname="Jhove" toolversion="1.26.1" status="SINGLE_RESULT">
        <fontName>GaramondPremrPro-Smbd</fontName>
      </font>
      <font toolname="Jhove" toolversion="1.26.1" status="SINGLE_RESULT">
        <fontName>GothicMB101Pro-DeBold</fontName>
      </font>
      <font toolname="Jhove" toolversion="1.26.1" status="SINGLE_RESULT">
        <fontName>HiraginoUDSansFStdN-W3</fontName>
      </font>
      <font toolname="Jhove" toolversion="1.26.1" status="SINGLE_RESULT">
        <fontName>HiraginoUDSansFStdN-W4</fontName>
      </font>
      <font toolname="Jhove" toolversion="1.26.1" status="SINGLE_RESULT">
        <fontName>HiraginoUDSansFStdN-W5</fontName>
      </font>
      <font toolname="Jhove" toolversion="1.26.1" status="SINGLE_RESULT">
        <fontName>HiraginoUDSansFStdN-W6</fontName>
      </font>
      <font toolname="Jhove" toolversion="1.26.1" status="SINGLE_RESULT">
        <fontName>HiraginoUDSansRStdN-W4</fontName>
      </font>
      <font toolname="Jhove" toolversion="1.26.1" status="SINGLE_RESULT">
        <fontName>HiraginoUDSansRStdN-W5</fontName>
      </font>
      <font toolname="Jhove" toolversion="1.26.1" status="SINGLE_RESULT">
        <fontName>HiraginoUDSansRStdN-W6</fontName>
      </font>
      <font toolname="Jhove" toolversion="1.26.1" status="SINGLE_RESULT">
        <fontName>HiraginoUDSansStd-W3</fontName>
      </font>
      <font toolname="Jhove" toolversion="1.26.1" status="SINGLE_RESULT">
        <fontName>ReimPro-ExBold</fontName>
      </font>
      <font toolname="Jhove" toolversion="1.26.1" status="SINGLE_RESULT">
        <fontName>RyuminPro-ExBold</fontName>
      </font>
      <font toolname="Jhove" toolversion="1.26.1" status="SINGLE_RESULT">
        <fontName>RyuminPro-Medium</fontName>
      </font>
      <font toolname="Jhove" toolversion="1.26.1" status="SINGLE_RESULT">
        <fontName>ShueiMGoStd-B</fontName>
      </font>
    </document>
  </metadata>
  <statistics fitsExecutionTime="2256">
    <tool toolname="MediaInfo" toolversion="22.09" status="did not run" />
    <tool toolname="OIS Audio Information" toolversion="0.1" status="did not run" />
    <tool toolname="ADL Tool" toolversion="0.1" status="did not run" />
    <tool toolname="VTT Tool" toolversion="0.1" status="did not run" />
    <tool toolname="Droid" toolversion="6.5.2" executionTime="299" />
    <tool toolname="jpylyzer" toolversion="2.1.0" status="did not run" />
    <tool toolname="Jhove" toolversion="1.26.1" executionTime="1201" />
    <tool toolname="embARC" toolversion="0.2" status="did not run" />
    <tool toolname="file utility" toolversion="5.43" executionTime="963" />
    <tool toolname="Exiftool" toolversion="12.50" executionTime="983" />
    <tool toolname="NLNZ Metadata Extractor" toolversion="3.6GA" status="did not run" />
    <tool toolname="OIS File Information" toolversion="1.0" executionTime="231" />
    <tool toolname="OIS XML Metadata" toolversion="0.2" status="did not run" />
    <tool toolname="ffident" toolversion="0.2" executionTime="738" />
    <tool toolname="Tika" toolversion="2.6.0" executionTime="2137" />
  </statistics>
</fits>
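
Since the report is plain XML in the FITS output namespace, individual values can be pulled out with a standard XML parser. The following is a minimal sketch using Python's standard library; SAMPLE is a trimmed copy of the output above, and the function name summarize and the keys of the returned dictionary are my own choices. It also assumes that fslastmodified is expressed in milliseconds since the Unix epoch, which the 13-digit value suggests but which I have not confirmed in the FITS documentation.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Namespace declared on the <fits> root element of the report.
NS = {"fits": "http://hul.harvard.edu/ois/xml/ns/fits/fits_output"}

# Trimmed copy of the report above, kept only to make the sketch runnable.
SAMPLE = """<fits xmlns="http://hul.harvard.edu/ois/xml/ns/fits/fits_output" version="1.6.0">
  <identification>
    <identity format="PDF/X" mimetype="application/pdf" toolname="FITS" toolversion="1.6.0">
      <externalIdentifier toolname="Droid" toolversion="6.5.2" type="puid">fmt/488</externalIdentifier>
    </identity>
  </identification>
  <fileinfo>
    <fslastmodified toolname="OIS File Information" toolversion="1.0" status="SINGLE_RESULT">1684976952000</fslastmodified>
  </fileinfo>
  <filestatus>
    <well-formed toolname="Jhove" toolversion="1.26.1" status="SINGLE_RESULT">true</well-formed>
    <valid toolname="Jhove" toolversion="1.26.1" status="SINGLE_RESULT">true</valid>
  </filestatus>
</fits>"""


def summarize(report_xml: str) -> dict:
    """Extract a few headline values from a FITS report."""
    root = ET.fromstring(report_xml)
    identity = root.find("fits:identification/fits:identity", NS)
    puid = identity.findtext("fits:externalIdentifier[@type='puid']", namespaces=NS)
    # Assumption: fslastmodified is milliseconds since the Unix epoch.
    modified_ms = int(root.findtext("fits:fileinfo/fits:fslastmodified", namespaces=NS))
    modified = datetime.fromtimestamp(modified_ms / 1000, tz=timezone.utc)
    return {
        "format": identity.get("format"),
        "mimetype": identity.get("mimetype"),
        "puid": puid,
        "well_formed": root.findtext("fits:filestatus/fits:well-formed", namespaces=NS) == "true",
        "valid": root.findtext("fits:filestatus/fits:valid", namespaces=NS) == "true",
        "fs_last_modified": modified.isoformat(),
    }


if __name__ == "__main__":
    for key, value in summarize(SAMPLE).items():
        print(f"{key}: {value}")
```

Under the milliseconds assumption, the fslastmodified value in this report corresponds to 2023-05-25T01:09:12 UTC.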

The following is an explanation by ChatGPT 4.


The provided XML is an example of output generated by FITS (File Information Tool Set). FITS is a tool used to identify and report characteristics and metadata of digital files. This XML output shows the analysis results of a specific PDF file. Each section is explained below:

<identification> Section

  • Contains information about the file format, MIME type (in this case application/pdf), and the tools used to identify it (Droid, Exiftool, Tika).
  • The version tag gives the format version reported by Tika (PDF/X-4).
  • The externalIdentifier tag shows the PUID (PRONOM Persistent Unique Identifier) of the file format identified by the Droid tool.

<fileinfo> Section

  • Contains basic information about the file, such as file size, creating application, last modification date, creation date, file path, filename, MD5 checksum, and filesystem last modification date.

<filestatus> Section

  • Indicates whether the file is well-formed and valid according to Jhove, that is, whether it conforms to the format specification.

<metadata> Section

  • Contains metadata about the document, such as title, author, language, page count, presence of outlines and annotations, graphics count, and fonts used.

<statistics> Section

  • Shows the execution time and status of each tool. This reveals how long each tool took to analyze the file.

This output presents a detailed analysis of a PDF file and provides important information for the fields of archiving, libraries, and digital preservation. FITS is used to integrate the results of various tools to provide a comprehensive view of file characteristics.
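
The statistics section in particular is easy to process mechanically. The following sketch separates the tools that ran from those reported as "did not run". SAMPLE is a trimmed copy of the statistics element from the output above, and the function name split_tools is hypothetical. The executionTime values are used as reported, without assuming a unit.

```python
import xml.etree.ElementTree as ET

# Namespace declared on the <fits> root element of the report.
NS = {"fits": "http://hul.harvard.edu/ois/xml/ns/fits/fits_output"}

# Trimmed copy of the <statistics> element from the report above.
SAMPLE = """<fits xmlns="http://hul.harvard.edu/ois/xml/ns/fits/fits_output" version="1.6.0">
  <statistics fitsExecutionTime="2256">
    <tool toolname="MediaInfo" toolversion="22.09" status="did not run" />
    <tool toolname="Droid" toolversion="6.5.2" executionTime="299" />
    <tool toolname="Jhove" toolversion="1.26.1" executionTime="1201" />
    <tool toolname="Tika" toolversion="2.6.0" executionTime="2137" />
  </statistics>
</fits>"""


def split_tools(report_xml: str) -> tuple[dict[str, int], list[str]]:
    """Return ({tool name: executionTime}, [names of tools that did not run])."""
    root = ET.fromstring(report_xml)
    ran: dict[str, int] = {}
    skipped: list[str] = []
    for tool in root.findall("fits:statistics/fits:tool", NS):
        name = tool.get("toolname")
        if tool.get("status") == "did not run":
            skipped.append(name)
        elif tool.get("executionTime") is not None:
            ran[name] = int(tool.get("executionTime"))
    return ran, skipped


if __name__ == "__main__":
    ran, skipped = split_tools(SAMPLE)
    # List the tools that ran, slowest first.
    for name, elapsed in sorted(ran.items(), key=lambda item: item[1], reverse=True):
        print(f"{name}: {elapsed}")
    print("did not run:", ", ".join(skipped))
```

For the PDF above, this shows that MediaInfo was among the tools that did not run, which is expected since it targets audio and video files.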


Summary

With FITS, I was able to obtain format identification, validation results, and embedded metadata for a digital file in a single report. I plan to keep investigating how FITS can be put to further use.