The workflows for the Fondo Real de Cholula digitization project are focused on making the relatively complex process of tethered camera digitization as straightforward and fast as possible. The digitization process groups all images from a given box together, using the following filename template:
In this template, frc refers to the collection name, vol001 refers to the box number assigned during digitization, the four-digit sequence number refers to the photo capture number (each capture originally includes two facing pages), and a and b refer to the verso and recto pages in the split TIFF images exported by the team in Puebla.
While the filenaming template is relatively straightforward, the physical collection is more complex. Each box is made up of several bound objects called expedientes. Each expediente is a bundle of pages and documents, generally related to one another in some way. An expediente might contain all the documents produced by a single court case, or all the documents created by a notary in a given year.
Crucially, the digitization team describes the collection expediente by expediente, rather than box by box. The metadata spreadsheets produced by the team in Puebla contain one line per expediente, while the filenaming template and file structure ignore expedientes altogether. By the end of the first phase of the project in 2019, the team had digitized approximately 42,000 images, divided simultaneously into 37 boxes (according to the file structure) and 393 expedientes (according to the metadata).
In addition to descriptive information, each row of metadata identifies the first page of the expediente. In order to ingest the collection into our public Islandora repository, the files must be properly sorted into directories with accompanying metadata for each expediente. The first page identifier in each metadata row makes this possible.
Before the script described here could run, the collection had been fully processed, the metadata had been split into separate spreadsheets (one per expediente), and PNG derivatives of all images had been created. Our repository displays these PNGs using the OpenSeadragon viewer. All expediente subdirectories had also been created in advance, and the individual metadata spreadsheets had been distributed into their corresponding directories using command line processes.
The Python script is written to split one box directory at a time. Given the size of the collection (37 directories, 42,000 images), this makes the process more manageable: it can be restarted in the event of an error, and the computer can be freed up for other processes as needed. When the script is run, the user is prompted to type the name of the box to be processed (for example, ‘frc_vol001’).
The script is broken into two functions. The first (“copyFunction”) copies files from the source directory, where they are sorted only by box, into a destination directory, where they are sorted into individual expediente subdirectories. It starts by creating a CSV list of the files in the source directory with their corresponding identifiers (i.e., the filename minus the file extension).
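The file-listing step might look something like the sketch below. This is not the original script: the function name write_file_list, the column headers, and the choice to sort with pathlib are all assumptions made for illustration.

```python
import csv
from pathlib import Path


def write_file_list(source_dir, list_path):
    """Write one CSV row per file: filename, identifier (filename minus extension).

    Hypothetical helper; the column names are assumptions, not the
    original script's layout.
    """
    # Sort so that later steps can rely on alphabetical order.
    files = sorted(p for p in Path(source_dir).iterdir() if p.is_file())
    with open(list_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["filename", "identifier"])
        for path in files:
            # Path.stem drops only the final extension.
            writer.writerow([path.name, path.stem])
    return [p.name for p in files]
```

Keeping the list in a file, as the original script does, makes it easy to inspect between steps; an in-memory list would work just as well for a single run.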
Next, the script reads the master metadata file for the collection and populates a list of each expediente’s first page.
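A minimal sketch of this step, assuming the master metadata is a CSV file and that the first-page identifier lives in a column called first_page. Both the function name and the column name are hypothetical; the real spreadsheet may use a different header.

```python
import csv


def read_first_pages(metadata_path, column="first_page"):
    """Return a sorted list of first-page identifiers, one per expediente.

    The column name is an assumption made for this example.
    """
    with open(metadata_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        # Skip rows where the first-page cell is empty or missing.
        pages = [row[column].strip() for row in reader if row.get(column)]
    return sorted(pages)
```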
The script opens the list of files in the source directory and assigns each file a destination folder. It does this by checking each filename against the list of first pages from the previous step: when a filename matches the name of a first page, that file is assigned to the directory matching the first page name. All subsequent files are assigned to the same directory until another first page file is encountered. Since all filenames are unique and all lists are sorted alphabetically, this sequential process works reliably.
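The assignment logic can be sketched as a single pass over the sorted identifiers. The function name assign_destinations and the example identifiers (which assume a pattern like frc_vol001_0001_a) are illustrative, not taken from the original script.

```python
def assign_destinations(identifiers, first_pages):
    """Pair each identifier with the first page of the expediente it belongs to.

    Walks the identifiers in alphabetical order; every time a first page
    is seen, it becomes the destination for that file and all that follow.
    """
    first_pages = set(first_pages)
    assignments = []
    current = None
    for identifier in sorted(identifiers):
        if identifier in first_pages:
            current = identifier
        if current is None:
            # A file sorted before any known first page has nowhere to go.
            raise ValueError(f"No expediente found for {identifier}")
        assignments.append((identifier, current))
    return assignments


pages = ["frc_vol001_0001_a", "frc_vol001_0001_b",
         "frc_vol001_0002_a", "frc_vol001_0002_b"]
firsts = ["frc_vol001_0001_a", "frc_vol001_0002_b"]
for identifier, destination in assign_destinations(pages, firsts):
    print(identifier, "->", destination)
```

In this example the first three files land in the frc_vol001_0001_a directory and the last one starts a new expediente, frc_vol001_0002_b. The approach depends entirely on sort order, so an error is raised if a file appears before any first page.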
Finally, the script reads each line of the copy list file it just created and uses the value of each column to copy each file from its source to its previously-determined destination directory. It then removes the temporary files created during this process.
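A rough sketch of the copy step, assuming the copy list is a CSV with filename and destination columns. Those column names and the function name copy_from_list are assumptions for illustration.

```python
import csv
import shutil
from pathlib import Path


def copy_from_list(copy_list_path, source_dir, dest_root):
    """Copy each listed file into its destination subdirectory, then
    delete the temporary copy list. Returns the number of files copied.

    Assumes columns named 'filename' and 'destination' (hypothetical).
    """
    copied = 0
    with open(copy_list_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            target_dir = Path(dest_root) / row["destination"]
            # Harmless if the subdirectory already exists, as it did here.
            target_dir.mkdir(parents=True, exist_ok=True)
            shutil.copy2(Path(source_dir) / row["filename"],
                         target_dir / row["filename"])
            copied += 1
    # Remove the temporary list only after every copy has succeeded.
    Path(copy_list_path).unlink()
    return copied
```

Because the list is deleted only at the end, a run that fails partway leaves the list in place, which makes it easier to see what was supposed to happen.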
The second function in the script is the manifest function, which adds a line to each metadata spreadsheet for each image that was just copied into the directory. These additional lines are required by our repository in order to associate each image in a directory with the book-level object described in the existing metadata line.
First, the script creates a list of subdirectories pertaining to the box submitted by the user, by reading the vol### element of each subdirectory.
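One way to pick out the relevant subdirectories is to compare the vol### element of each name against the box name typed by the user. The sketch below assumes subdirectories are named after their first page (for example, frc_vol001_0001_a); the function name and regular expression are illustrative.

```python
import re
from pathlib import Path

# Matches the vol### element, e.g. 'vol001'.
VOLUME_PATTERN = re.compile(r"vol\d{3}")


def subdirs_for_box(dest_root, box_name):
    """Return the sorted names of subdirectories belonging to one box."""
    match = VOLUME_PATTERN.search(box_name)
    if match is None:
        raise ValueError(f"No vol### element in {box_name!r}")
    volume = match.group(0)
    names = []
    for path in Path(dest_root).iterdir():
        found = VOLUME_PATTERN.search(path.name)
        # Keep directories only, and only those from the requested box.
        if path.is_dir() and found and found.group(0) == volume:
            names.append(path.name)
    return sorted(names)
```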
The script then opens each relevant manifest and populates a line for each image file in the directory, containing each image’s filename, identifier, and page number. The page numbers are determined using a simple counter that resets after each manifest has been populated. Because each sequence number used in the filenames occurs twice (once with an _a suffix and once with _b), the page numbers could not be lifted directly from the identifiers in this case.
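The page counter can be sketched with enumerate, which restarts at 1 on every call and so resets for each expediente. The manifest filename, the three-column row layout, and the restriction to PNG files are all assumptions here; the real spreadsheets likely have more columns.

```python
import csv
from pathlib import Path


def append_page_rows(expediente_dir, manifest_name="metadata.csv",
                     image_suffix=".png"):
    """Append one row per image (filename, identifier, page number)
    to the expediente's existing manifest. Returns the page count.

    Assumes the manifest already ends with a newline; the manifest name
    and row layout are hypothetical.
    """
    directory = Path(expediente_dir)
    images = sorted(p for p in directory.iterdir()
                    if p.suffix.lower() == image_suffix)
    with open(directory / manifest_name, "a", newline="",
              encoding="utf-8") as handle:
        writer = csv.writer(handle)
        # Page numbers come from position in the sorted list, not from
        # the sequence number, since each sequence number appears twice.
        for page_number, image in enumerate(images, start=1):
            writer.writerow([image.name, image.stem, page_number])
    return len(images)
```

Calling this once per subdirectory gives each expediente its own page sequence starting at 1, regardless of where its captures fall within the box.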
When this finishes for the last expediente in the list, the script has finished. Each resulting subdirectory, containing images and a complete metadata file for the expediente, is ready for ingest into our Islandora repository. A full copy of the script, with comments explaining each step in more detail, is available at the following GitHub repository:
At the time that I wrote it in October 2019, this was the most complex script I had created by far. Looking back at the script even just a few months later, I can see several bits that could be handled far more efficiently. More recent scripts have handled similar tasks in far fewer lines of code, without the need to repeatedly write, read, and remove temporary CSV files. Still, this script saved me hours of work time by allowing me to copy files in bulk overnight, without asking me to cross-reference between the metadata sheet and file lists. This allowed me to focus my attention on other detailed tasks in preparation for our repository launch.