Fondo Real de Cholula book & metadata splitter

Background

The workflows for the Fondo Real de Cholula digitization project are focused on making the relatively complex process of tethered camera digitization as straightforward and fast as possible. The digitization process groups all images from a given box together, using the following filename template:

frc_vol001_0001_a.tif

frc_vol001_0001_b.tif

frc_vol001_0002_a.tif

frc_vol001_0002_b.tif

In this template, frc refers to the collection name, vol001 refers to the box number assigned during digitization, the four-digit sequence number refers to the photo capture number (each capture originally includes two facing pages), and _a and _b refer to the verso and recto pages in the split TIFF images exported by the team in Puebla.
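To make the template concrete, here is a minimal sketch of how a filename could be broken into its parts. It is not taken from the project script: the regular expression, the PageFile tuple, and the parse_filename helper are hypothetical names, and the sketch assumes box numbers always have three digits and capture numbers four.

```python
import re
from typing import NamedTuple

# Assumed pattern for names such as "frc_vol001_0001_a.tif".
FILENAME_PATTERN = re.compile(
    r"^(?P<collection>[a-z]+)_vol(?P<box>\d{3})_(?P<capture>\d{4})_(?P<side>[ab])\.tif$"
)


class PageFile(NamedTuple):
    collection: str  # e.g. "frc"
    box: int         # box number assigned during digitization
    capture: int     # photo capture (sequence) number
    side: str        # "a" or "b", one half of the split capture


def parse_filename(filename: str) -> PageFile:
    """Split a filename into the four elements of the template."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"Filename does not follow the template: {filename}")
    return PageFile(
        match["collection"],
        int(match["box"]),
        int(match["capture"]),
        match["side"],
    )
```

For example, parse_filename("frc_vol001_0002_b.tif") yields the collection "frc", box 1, capture 2, and side "b".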

Example directory of Fondo Real de Cholula files, from box 001

While the filenaming template is relatively straightforward, the physical collection is more complex. Each box is made up of several bound objects called expedientes. Each expediente is a bundle of pages and documents, generally related to one another in some way. An expediente might contain all the documents produced by a single court case, or all the documents created by a notary in a given year.

Crucially, the digitization team describes the collection expediente by expediente, rather than box by box. The metadata spreadsheets produced by the team in Puebla contain one line per expediente. The file-naming template and file structure, meanwhile, ignore expedientes altogether. When the first phase of the project ended in 2019, the team had digitized approximately 42 thousand images, divided simultaneously into 37 boxes (according to the file structure) and 393 expedientes (according to the metadata).

Fondo Real de Cholula digital collection broken down by box (directories) and expediente (metadata). Each box contains several expedientes.


In addition to descriptive information, each row of metadata identifies the first page of the expediente. In order to ingest the collection into our public Islandora repository, the files must be properly sorted into directories with accompanying metadata for each expediente. The first page identifier in each metadata row makes this possible.

Fondo Real de Cholula metadata sheet showing identifier values for each expediente. Expedientes from the same box have the same vol number.

Before the following script was run, the collection had been fully processed, the metadata had been split into separate spreadsheets (one per expediente), and PNG derivatives of all images had been created. Our repository displays these PNGs using the OpenSeadragon viewer. In this case, all subdirectories had already been created, and the individual metadata spreadsheets had already been distributed into their corresponding directories using command line processes.

The script

The Python script is written to split one box directory at a time. Given the size of the collection (37 directories, 42 thousand images), this makes the process more manageable, allowing it to be restarted in the event of an error, and freeing up the computer for other processes as needed. When the script is run, the user is prompted to type the name of the box to be processed (for example, 'frc_vol001').

copyFunction

The script is broken into two functions. The first (“copyFunction”) copies files from the source directory, where they are simply sorted according to box, into a destination directory, where they will be sorted into individual expediente subdirectories. It creates a CSV list of the files in the directory with their corresponding identifiers (i.e. the filename minus the file extension).
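As a rough illustration of the file-listing step, the sketch below writes a CSV with one row per TIFF. It is not the original code: the write_file_list name, the column headers, and the choice to skip anything that is not a .tif file are all assumptions made for the example.

```python
import csv
from pathlib import Path


def write_file_list(source_dir, list_path):
    """Write a CSV of TIFF filenames and identifiers (filename minus extension).

    Column names here are illustrative, not those used by the project.
    """
    filenames = sorted(
        p.name for p in Path(source_dir).iterdir() if p.suffix.lower() == ".tif"
    )
    with open(list_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["filename", "identifier"])
        for name in filenames:
            # Path(name).stem drops the ".tif" extension.
            writer.writerow([name, Path(name).stem])
    return filenames
```

Sorting the filenames at this point matters, because the later assignment step depends on files appearing in capture order.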


Next, the script reads the master metadata file for the collection and populates a list of each expediente's first page.
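A minimal sketch of this step follows, assuming the master metadata has been saved as a CSV with a column named "identifier" that holds each first-page value. The function name, the column name, and the filtering by box prefix are assumptions for illustration, not details confirmed by the original script.

```python
import csv


def read_first_pages(metadata_path, box_name, column="identifier"):
    """Return the sorted first-page identifiers that belong to one box.

    Assumes one metadata row per expediente and a first-page value
    such as "frc_vol001_0001_a" in the given column.
    """
    first_pages = []
    with open(metadata_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            identifier = row[column].strip()
            # Keep only expedientes from the requested box, e.g. "frc_vol001".
            if identifier.startswith(box_name + "_"):
                first_pages.append(identifier)
    return sorted(first_pages)
```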


The script opens the list of files in the source directory and assigns each file a destination folder. It does this by checking each filename against the list of first pages from the previous step: when a filename matches the name of a first page, that file is assigned to the directory matching the first page name. All subsequent files are assigned to the same directory until another first page file is encountered. Since all filenames are unique and all lists are sorted alphabetically, this sequential process works reliably.
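The assignment logic can be sketched as a single pass over the sorted identifiers. This is an illustration of the idea described above rather than the script's actual code; assign_destinations is a hypothetical name, and raising an error when files appear before any first page is a choice made for the example.

```python
def assign_destinations(identifiers, first_pages):
    """Pair each identifier with the expediente directory it belongs to.

    Walks the identifiers in sorted order and remembers the most recent
    first page seen; every file is assigned to that expediente.
    """
    first_page_set = set(first_pages)
    current = None
    assignments = []
    for identifier in sorted(identifiers):
        if identifier in first_page_set:
            # A new expediente starts at this file.
            current = identifier
        if current is None:
            raise ValueError(f"{identifier} comes before any known first page")
        assignments.append((identifier, current))
    return assignments
```

With first pages frc_vol001_0001_a and frc_vol001_0002_b, the files 0001_a, 0001_b, and 0002_a land in the first expediente and 0002_b opens the second.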


Finally, the script reads each line of the copy list file it just created and uses the value of each column to copy each file from its source to its previously-determined destination directory. It then removes the temporary files created during this process.
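The copy step might look something like the sketch below. It assumes a copy list CSV with "filename" and "destination" columns, which are illustrative names rather than the ones the script uses, and it relies on shutil.copy2 so that file timestamps are preserved where the platform allows.

```python
import csv
import shutil
from pathlib import Path


def copy_from_list(copy_list_path, source_dir, dest_root):
    """Copy each listed file into its expediente directory, then delete the list.

    Expects a CSV with "filename" and "destination" columns (assumed names).
    """
    copied = 0
    with open(copy_list_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            target_dir = Path(dest_root) / row["destination"]
            # Harmless if the directory already exists, as it did in this project.
            target_dir.mkdir(parents=True, exist_ok=True)
            shutil.copy2(Path(source_dir) / row["filename"], target_dir / row["filename"])
            copied += 1
    # Remove the temporary copy list only after every file has been copied.
    Path(copy_list_path).unlink()
    return copied
```

Because the files are copied rather than moved, the source box directory is left untouched, so a failed run can simply be restarted.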


frc_manifest

The second function in the script is the manifest function, which adds a line to each metadata spreadsheet for each image that was just copied into the directory. These additional lines are required by our repository in order to associate each image in a directory with the book-level object described in the existing metadata line.

First, the script creates a list of subdirectories pertaining to the box submitted by the user, by reading the vol### element of each subdirectory.
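One way to select the relevant subdirectories is to compare the second underscore-separated element of each directory name with the vol### element of the box name. The sketch below assumes directories are named after their first page (for example, frc_vol001_0001_a); the function name is hypothetical.

```python
from pathlib import Path


def expediente_dirs_for_box(dest_root, box_name):
    """List expediente subdirectories whose vol### element matches the box.

    box_name is expected to look like "frc_vol001".
    """
    vol = box_name.split("_")[1]  # e.g. "vol001"
    matches = []
    for path in Path(dest_root).iterdir():
        parts = path.name.split("_")
        if path.is_dir() and len(parts) > 1 and parts[1] == vol:
            matches.append(path)
    return sorted(matches)
```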


The script then opens each relevant manifest and populates a line for each image file in the directory, containing each image's filename, identifier, and page number. The page numbers are determined using a simple counter that resets after each manifest has been populated. Because each sequence number used in the filenames occurs twice (once with an _a suffix and once with _b), the page numbers could not be lifted directly from the identifiers in this case.
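The page counter can be sketched with enumerate, which starts again from 1 every time the function is called for a new manifest. This is an assumption-laden illustration: the function name, the three-column row layout, and the default .tif extension are not taken from the original script, and the sketch assumes the existing manifest ends with a newline.

```python
import csv
from pathlib import Path


def append_manifest_rows(manifest_path, image_dir, extension=".tif"):
    """Append one row per image: filename, identifier, page number.

    Page numbers come from a counter, not from the filename, because each
    sequence number appears twice (once with _a and once with _b).
    """
    images = sorted(
        p for p in Path(image_dir).iterdir() if p.suffix.lower() == extension
    )
    with open(manifest_path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        # enumerate restarts at 1 on every call, so each manifest gets its own count.
        for page_number, image in enumerate(images, start=1):
            writer.writerow([image.name, image.stem, page_number])
    return len(images)
```

An expediente that begins at frc_vol001_0005_b would therefore record that file as page 1, frc_vol001_0006_a as page 2, and so on.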


Once this step finishes for the last expediente in the list, the script is complete. Each resulting subdirectory, containing images and a complete metadata file for the expediente, is ready for ingest into our Islandora repository. A full copy of the script, with comments explaining each step in more detail, is available at the following GitHub repository:

https://github.com/DavidABliss/frc_metadata_splitter/blob/master/frc_sorter.py

Final thoughts

At the time that I wrote it in October 2019, this was the most complex script I had created by far. Looking back at the script even just a few months later, I can see several bits that could be handled far more efficiently. More recent scripts have handled similar tasks in far fewer lines of code, without the need to repeatedly write, read, and remove temporary CSV files. Still, this script saved me hours of work time by allowing me to copy files in bulk overnight, without asking me to cross-reference between the metadata sheet and file lists. This allowed me to focus my attention on other detailed tasks in preparation for our repository launch.