Overview
While investigating Archivematica, I wanted to verify some aspects of how the File Information Tool Set (FITS) behaves, so I tried running it with Docker. These are my notes from that process.
https://github.com/harvard-lts/fits
Installation
Installation with Docker is described in the following section of the README.
https://github.com/harvard-lts/fits?tab=readme-ov-file#docker-installation
However, I could not download the latest release (1.6.0), which includes the Dockerfile, from the following page referenced in the README.
https://projects.iq.harvard.edu/fits/downloads
Instead, I downloaded the latest zip file from the following GitHub releases page.
https://github.com/harvard-lts/fits/releases/tag/1.6.0
I then extracted the archive and built the image following the README instructions.
However, on an M1 Mac, running the steps as described produced the following error.
% docker run --rm -v `pwd`:/work fits -i fits.sh
2024-01-26 11:41:10 - ERROR - MediaInfo:95 - Error loading native library for this operating system for tool: MediaInfo. ostype=[Linux] -- jvmModel=[64] -- nativeLibPath=[/opt/fits/tools/mediainfo/linux] -- No native MediaInfo library for this OS
java.lang.UnsatisfiedLinkError: Unable to load library 'mediainfo':
libmediainfo.so: cannot open shared object file: No such file or directory
libmediainfo.so: cannot open shared object file: No such file or directory
/opt/fits/tools/mediainfo/linux/libmediainfo.so.0: cannot open shared object file: No such file or directory
...
When I consulted ChatGPT 4 about this, it suggested adding the following to the Dockerfile.
RUN apt-get update && \
    apt-get install -yqq \
    # Other dependencies already listed in the Dockerfile go here
    mediainfo libmediainfo-dev \
    && rm -rf /var/lib/apt/lists/*
After adding the above and rebuilding the image, FITS ran correctly.
Trying It Out
Because I wanted to test a file with Japanese characters in its filename, I used “A Very Understandable Guide to Copyright and Classes.pdf” (すごくわかる著作権と授業.pdf) from the Hiroshima University Information Media Education Research Center, which is published online under a CC BY license.
https://www.media.hiroshima-u.ac.jp/wp-content/uploads/2023/05/すごくわかる著作権と授業.pdf
Then, I executed the following.
docker run --rm -v `pwd`:/work fits -i すごくわかる著作権と授業.pdf
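For scripting the same run, a minimal Python sketch might look like the following. It assumes the image was tagged fits as in the command above and that Docker is available on the PATH. The function names build_fits_command and run_fits are my own, chosen for illustration. Passing the arguments as a list avoids shell quoting issues with non-ASCII filenames.

```python
import os
import subprocess


def build_fits_command(workdir, filename, image="fits"):
    """Return an argv list equivalent to the docker run command above."""
    return [
        "docker", "run", "--rm",
        "-v", f"{workdir}:/work",  # mount the host directory at /work
        image,
        "-i", filename,            # input file, as passed to FITS in the command above
    ]


def run_fits(filename, workdir=None):
    """Run FITS in Docker and return whatever it writes to standard output.

    Requires Docker and a locally built image; raises CalledProcessError
    if the container exits with a non-zero status.
    """
    workdir = workdir or os.getcwd()
    result = subprocess.run(
        build_fits_command(workdir, filename),
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Only prints the command; call run_fits() to actually start the container.
    print(build_fits_command(os.getcwd(), "すごくわかる著作権と授業.pdf"))
```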
The docker run command produced the following output.
<?xml version="1.0" encoding="UTF-8"?>
<fits xmlns="http://hul.harvard.edu/ois/xml/ns/fits/fits_output" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://hul.harvard.edu/ois/xml/ns/fits/fits_output http://hul.harvard.edu/ois/xml/xsd/fits/fits_output.xsd" version="1.6.0" timestamp="1/26/24, 12:49 PM">
<identification>
<identity format="PDF/X" mimetype="application/pdf" toolname="FITS" toolversion="1.6.0">
<tool toolname="Droid" toolversion="6.5.2" />
<tool toolname="Exiftool" toolversion="12.50" />
<tool toolname="Tika" toolversion="2.6.0" />
<version toolname="Tika" toolversion="2.6.0">PDF/X-4</version>
<externalIdentifier toolname="Droid" toolversion="6.5.2" type="puid">fmt/488</externalIdentifier>
</identity>
</identification>
<fileinfo>
<size toolname="Jhove" toolversion="1.26.1">13845166</size>
<creatingApplicationName toolname="Jhove" toolversion="1.26.1">Adobe PDF Library 17.0/Adobe InDesign 18.1 (Macintosh)</creatingApplicationName>
<lastmodified toolname="Exiftool" toolversion="12.50">2023-01-14T06:28:17Z</lastmodified>
<created toolname="Exiftool" toolversion="12.50">2023-01-14T05:31:26Z</created>
<filepath toolname="OIS File Information" toolversion="1.0" status="SINGLE_RESULT">/work/すごくわかる著作権と授業.pdf</filepath>
<filename toolname="OIS File Information" toolversion="1.0" status="SINGLE_RESULT">すごくわかる著作権と授業.pdf</filename>
<md5checksum toolname="OIS File Information" toolversion="1.0" status="SINGLE_RESULT">1f6a11a1b23607a0e29e10efdc153584</md5checksum>
<fslastmodified toolname="OIS File Information" toolversion="1.0" status="SINGLE_RESULT">1684976952000</fslastmodified>
</fileinfo>
<filestatus>
<well-formed toolname="Jhove" toolversion="1.26.1" status="SINGLE_RESULT">true</well-formed>
<valid toolname="Jhove" toolversion="1.26.1" status="SINGLE_RESULT">true</valid>
</filestatus>
<metadata>
<document>
<title toolname="Jhove" toolversion="1.26.1">すごくわかる著作権と授業_PDF版.indd</title>
<author toolname="Exiftool" toolversion="12.50" status="SINGLE_RESULT">Adobe InDesign 18.1 (Macintosh)</author>
<language toolname="Jhove" toolversion="1.26.1">ja-JP</language>
<pageCount toolname="Jhove" toolversion="1.26.1">64</pageCount>
<hasOutline toolname="Jhove" toolversion="1.26.1" status="SINGLE_RESULT">no</hasOutline>
<hasAnnotations toolname="Jhove" toolversion="1.26.1" status="SINGLE_RESULT">yes</hasAnnotations>
<graphicsCount toolname="Jhove" toolversion="1.26.1" status="SINGLE_RESULT">7</graphicsCount>
<font toolname="Jhove" toolversion="1.26.1" status="SINGLE_RESULT">
<fontName>DINNextLTPro-Medium</fontName>
</font>
<font toolname="Jhove" toolversion="1.26.1" status="SINGLE_RESULT">
<fontName>DINNextRoundedLTPro-Regular</fontName>
</font>
<font toolname="Jhove" toolversion="1.26.1" status="SINGLE_RESULT">
<fontName>GaramondPremrPro</fontName>
</font>
<font toolname="Jhove" toolversion="1.26.1" status="SINGLE_RESULT">
<fontName>GaramondPremrPro-Smbd</fontName>
</font>
<font toolname="Jhove" toolversion="1.26.1" status="SINGLE_RESULT">
<fontName>GothicMB101Pro-DeBold</fontName>
</font>
<font toolname="Jhove" toolversion="1.26.1" status="SINGLE_RESULT">
<fontName>HiraginoUDSansFStdN-W3</fontName>
</font>
<font toolname="Jhove" toolversion="1.26.1" status="SINGLE_RESULT">
<fontName>HiraginoUDSansFStdN-W4</fontName>
</font>
<font toolname="Jhove" toolversion="1.26.1" status="SINGLE_RESULT">
<fontName>HiraginoUDSansFStdN-W5</fontName>
</font>
<font toolname="Jhove" toolversion="1.26.1" status="SINGLE_RESULT">
<fontName>HiraginoUDSansFStdN-W6</fontName>
</font>
<font toolname="Jhove" toolversion="1.26.1" status="SINGLE_RESULT">
<fontName>HiraginoUDSansRStdN-W4</fontName>
</font>
<font toolname="Jhove" toolversion="1.26.1" status="SINGLE_RESULT">
<fontName>HiraginoUDSansRStdN-W5</fontName>
</font>
<font toolname="Jhove" toolversion="1.26.1" status="SINGLE_RESULT">
<fontName>HiraginoUDSansRStdN-W6</fontName>
</font>
<font toolname="Jhove" toolversion="1.26.1" status="SINGLE_RESULT">
<fontName>HiraginoUDSansStd-W3</fontName>
</font>
<font toolname="Jhove" toolversion="1.26.1" status="SINGLE_RESULT">
<fontName>ReimPro-ExBold</fontName>
</font>
<font toolname="Jhove" toolversion="1.26.1" status="SINGLE_RESULT">
<fontName>RyuminPro-ExBold</fontName>
</font>
<font toolname="Jhove" toolversion="1.26.1" status="SINGLE_RESULT">
<fontName>RyuminPro-Medium</fontName>
</font>
<font toolname="Jhove" toolversion="1.26.1" status="SINGLE_RESULT">
<fontName>ShueiMGoStd-B</fontName>
</font>
</document>
</metadata>
<statistics fitsExecutionTime="2256">
<tool toolname="MediaInfo" toolversion="22.09" status="did not run" />
<tool toolname="OIS Audio Information" toolversion="0.1" status="did not run" />
<tool toolname="ADL Tool" toolversion="0.1" status="did not run" />
<tool toolname="VTT Tool" toolversion="0.1" status="did not run" />
<tool toolname="Droid" toolversion="6.5.2" executionTime="299" />
<tool toolname="jpylyzer" toolversion="2.1.0" status="did not run" />
<tool toolname="Jhove" toolversion="1.26.1" executionTime="1201" />
<tool toolname="embARC" toolversion="0.2" status="did not run" />
<tool toolname="file utility" toolversion="5.43" executionTime="963" />
<tool toolname="Exiftool" toolversion="12.50" executionTime="983" />
<tool toolname="NLNZ Metadata Extractor" toolversion="3.6GA" status="did not run" />
<tool toolname="OIS File Information" toolversion="1.0" executionTime="231" />
<tool toolname="OIS XML Metadata" toolversion="0.2" status="did not run" />
<tool toolname="ffident" toolversion="0.2" executionTime="738" />
<tool toolname="Tika" toolversion="2.6.0" executionTime="2137" />
</statistics>
</fits>
The following is an explanation by ChatGPT 4.
The provided XML is an example of output generated by FITS (File Information Tool Set). FITS is a tool used to identify and report characteristics and metadata of digital files. This XML output shows the analysis results of a specific PDF file. Each section is explained below:
<identification> Section
- Contains information about the file format, MIME type (in this case application/pdf), and the tools used to identify it (Droid, Exiftool, Tika).
- The version tag indicates the PDF file version (PDF/X-4).
- The externalIdentifier tag shows the PUID (PRONOM Persistent Unique Identifier) of the file format identified by the Droid tool.
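As a rough illustration of reading this section programmatically, here is a minimal Python sketch using the standard library. The sample is a trimmed copy of the output above, and summarize_identity is a hypothetical helper name, not part of FITS. The namespace URI is taken from the xmlns attribute of the fits root element.

```python
import xml.etree.ElementTree as ET

# Namespace declared on the <fits> root element in the output above.
FITS_NS = {"fits": "http://hul.harvard.edu/ois/xml/ns/fits/fits_output"}

# Trimmed excerpt of the FITS output shown earlier.
SAMPLE = """\
<fits xmlns="http://hul.harvard.edu/ois/xml/ns/fits/fits_output" version="1.6.0">
  <identification>
    <identity format="PDF/X" mimetype="application/pdf" toolname="FITS" toolversion="1.6.0">
      <tool toolname="Droid" toolversion="6.5.2" />
      <tool toolname="Exiftool" toolversion="12.50" />
      <tool toolname="Tika" toolversion="2.6.0" />
      <version toolname="Tika" toolversion="2.6.0">PDF/X-4</version>
      <externalIdentifier toolname="Droid" toolversion="6.5.2" type="puid">fmt/488</externalIdentifier>
    </identity>
  </identification>
</fits>
"""


def summarize_identity(xml_text):
    """Return format, MIME type, version, PUID and agreeing tools of the first identity."""
    root = ET.fromstring(xml_text)
    identity = root.find("fits:identification/fits:identity", FITS_NS)
    if identity is None:
        raise ValueError("no <identity> element found")
    version = identity.find("fits:version", FITS_NS)
    puid = identity.find("fits:externalIdentifier[@type='puid']", FITS_NS)
    return {
        "format": identity.get("format"),
        "mimetype": identity.get("mimetype"),
        "version": version.text if version is not None else None,
        "puid": puid.text if puid is not None else None,
        "tools": [t.get("toolname") for t in identity.findall("fits:tool", FITS_NS)],
    }


if __name__ == "__main__":
    print(summarize_identity(SAMPLE))
```

Every element in the FITS output lives in the default namespace, so the lookups need the fits: prefix mapping. Without it, find returns None.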
<fileinfo> Section
- Contains basic information about the file, such as file size, creating application, last modification date, creation date, file path, filename, MD5 checksum, and filesystem last modification date.
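Two of these values are easy to cross-check by hand. The sketch below assumes that fslastmodified is a Unix timestamp in milliseconds (the value above looks like one) and that md5checksum is the plain MD5 hex digest of the file contents. The helper names are hypothetical.

```python
import hashlib
from datetime import datetime, timezone


def epoch_millis_to_utc(ms_text):
    """Convert an epoch-milliseconds string (assumed unit) to an aware UTC datetime."""
    return datetime.fromtimestamp(int(ms_text) / 1000, tz=timezone.utc)


def md5_of_chunks(chunks):
    """Return the MD5 hex digest of an iterable of byte chunks."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a file, read in chunks to limit memory use."""
    with open(path, "rb") as fh:
        return md5_of_chunks(iter(lambda: fh.read(chunk_size), b""))


if __name__ == "__main__":
    # fslastmodified value from the output above.
    print(epoch_millis_to_utc("1684976952000").isoformat())  # 2023-05-25T01:09:12+00:00
```

Calling md5_of_file("すごくわかる著作権と授業.pdf") on the downloaded file and comparing the result with the md5checksum value in the output is a quick way to confirm that FITS saw the same bytes.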
<filestatus> Section
- Indicates whether the file is well-formed and valid, that is, whether it conforms to the format's specification.
<metadata> Section
- Contains metadata about the document, such as title, author, language, page count, presence of outlines, presence of annotations, number of graphics, and fonts used.
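The document metadata can be pulled out the same way. This is a minimal sketch over a shortened excerpt of the output above (only three of the font entries are kept). document_summary is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

FITS_NS = {"fits": "http://hul.harvard.edu/ois/xml/ns/fits/fits_output"}

# Shortened excerpt of the <metadata> section shown earlier.
SAMPLE = """\
<fits xmlns="http://hul.harvard.edu/ois/xml/ns/fits/fits_output" version="1.6.0">
  <metadata>
    <document>
      <title toolname="Jhove" toolversion="1.26.1">すごくわかる著作権と授業_PDF版.indd</title>
      <language toolname="Jhove" toolversion="1.26.1">ja-JP</language>
      <pageCount toolname="Jhove" toolversion="1.26.1">64</pageCount>
      <font toolname="Jhove" toolversion="1.26.1" status="SINGLE_RESULT">
        <fontName>DINNextLTPro-Medium</fontName>
      </font>
      <font toolname="Jhove" toolversion="1.26.1" status="SINGLE_RESULT">
        <fontName>HiraginoUDSansStd-W3</fontName>
      </font>
      <font toolname="Jhove" toolversion="1.26.1" status="SINGLE_RESULT">
        <fontName>RyuminPro-Medium</fontName>
      </font>
    </document>
  </metadata>
</fits>
"""


def document_summary(xml_text):
    """Return title, language, page count and font names from <metadata>/<document>."""
    root = ET.fromstring(xml_text)
    doc = root.find("fits:metadata/fits:document", FITS_NS)
    if doc is None:
        raise ValueError("no <document> metadata found")
    pages = doc.findtext("fits:pageCount", namespaces=FITS_NS)
    return {
        "title": doc.findtext("fits:title", namespaces=FITS_NS),
        "language": doc.findtext("fits:language", namespaces=FITS_NS),
        "page_count": int(pages) if pages is not None else None,
        "fonts": [f.text for f in doc.findall("fits:font/fits:fontName", FITS_NS)],
    }


if __name__ == "__main__":
    summary = document_summary(SAMPLE)
    print(summary["title"], summary["page_count"], len(summary["fonts"]))
```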
<statistics> Section
- Shows, for each tool, either its execution time or a status of "did not run", which reveals which tools analyzed the file and how long each one took.
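To see at a glance which tools ran, the statistics section can be split into two groups. The following is a minimal sketch over a trimmed excerpt of the output above. tool_report is a hypothetical helper name, and I am assuming executionTime is in milliseconds, which the output itself does not state.

```python
import xml.etree.ElementTree as ET

FITS_NS = {"fits": "http://hul.harvard.edu/ois/xml/ns/fits/fits_output"}

# Trimmed excerpt of the <statistics> section shown earlier.
SAMPLE = """\
<fits xmlns="http://hul.harvard.edu/ois/xml/ns/fits/fits_output" version="1.6.0">
  <statistics fitsExecutionTime="2256">
    <tool toolname="MediaInfo" toolversion="22.09" status="did not run" />
    <tool toolname="Droid" toolversion="6.5.2" executionTime="299" />
    <tool toolname="jpylyzer" toolversion="2.1.0" status="did not run" />
    <tool toolname="Jhove" toolversion="1.26.1" executionTime="1201" />
    <tool toolname="Tika" toolversion="2.6.0" executionTime="2137" />
  </statistics>
</fits>
"""


def tool_report(xml_text):
    """Return ({tool name: execution time}, [names of tools that did not run])."""
    root = ET.fromstring(xml_text)
    ran, skipped = {}, []
    for tool in root.findall("fits:statistics/fits:tool", FITS_NS):
        name = tool.get("toolname")
        elapsed = tool.get("executionTime")
        if elapsed is None:
            # Entries without a time carry status="did not run" in the output above.
            skipped.append(name)
        else:
            ran[name] = int(elapsed)
    return ran, skipped


if __name__ == "__main__":
    ran, skipped = tool_report(SAMPLE)
    slowest = max(ran, key=ran.get)
    print("ran:", ran)
    print("did not run:", skipped)
    print("slowest:", slowest)
```

In the full output above, the per-tool times add up to 6552, which is more than the fitsExecutionTime of 2256. That would be consistent with the tools running concurrently, though I have not confirmed this.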
This output presents a detailed analysis of a PDF file and provides important information for the fields of archiving, libraries, and digital preservation. FITS is used to integrate the results of various tools to provide a comprehensive view of file characteristics.
Summary
With FITS, I was able to extract a wide range of information about a digital file, including its format identification, validity, and embedded metadata. I plan to keep investigating how FITS can be put to further use.