Bagging scripts

The following scripts emerged out of a need to improve and streamline bagging practices at UT Libraries. As UTL’s Systems and Digital Archivist, my duties include building digital preservation capacity across the Libraries. One area for growth I identified early on was training archivists working within special collections units to bag digital collections materials according to the Library of Congress’ BagIt specification.

UTL has used BagIt to package special collections material bound for its tape archive for nearly a decade, and has developed a local specification for bag-info.txt contents that ensures bagged materials are tracked in local records, well-described internally, and (in the case of digitized special collections materials) traceable to originating physical collections.

Until very recently, however, the work of creating and validating bags at UTL was performed exclusively by members of the Digital Stewardship unit. This meant that archivists and collections managers delivering materials to the tape archive were not directly responsible for describing those materials, and it prevented staff outside the unit from learning about bagging and digital preservation more generally. It was also rather inefficient: a batch of files would regularly require a series of back-and-forth messages between Digital Stewardship staff and collections managers to determine the best descriptive information to include in the bag-info.txt file.

More recently, we have begun to train collections managers at other Libraries units in creating and validating bags themselves. This ensures the bags we receive in Digital Stewardship are described by the most knowledgeable staff possible, and allows us to move files to tape more quickly. Depending on their interface preferences and the number of bags they need to create in a batch, UTL staff can use bagger (great for creating a single bag), bagit-java (allows batch bag creation, but newer versions have dropped command line support), or bagit-python (allows batch bag creation and is easy to install, but lacks some of the flexibility offered by older versions of bagit-java) to prepare materials for the tape archive.

The following scripts address some of the limitations of bagit-python, namely its inability to accept a user-supplied bag-info.txt file as older versions of bagit-java can. They are also an attempt to remain flexible in our bagging practices at UTL: while some collections managers may be comfortable working on the command line, others are not, and these scripts make it easier to gather good descriptive metadata bound for bag-info.txt files, using approachable tools like spreadsheets.


batch_bagger

This script creates many bags at once, using a bag-info template and a CSV or XLSX spreadsheet to substitute bag-level text as needed. It is useful when creating many bags for a single collection, when each bag-info.txt file in the batch differs from the others in only slight ways.

I wrote this script in December 2021, when I was faced with the need to bag approximately 80 folders of linguistic material from the Archive of the Indigenous Languages of Latin America (AILLA). Each folder I needed to bag contained files corresponding to a different language, donated by a different researcher (or group of researchers), and each might contain one or several different filetypes. Aside from these differences, the bags could be described more or less identically.

This script allowed me to create a bag-info.txt template for the full set of files, and to gather folder-level descriptions of each bag’s language, donor, and filetypes in a spreadsheet that could be easily filled out by AILLA Language Data Curator Ryan Sullivant. Ryan was proficient in bagging, but I didn’t want him to have to create 80 different bag-info.txt files.

The script works by searching the bag-info template file for keywords enclosed in double square brackets, for example “External-Description: Linguistic research data provided by [[donor]].” For each bag, any keyword that matches a column header in the user-supplied spreadsheet (CSV or XLSX) is replaced with the value from that column in the bag's row. The script then converts the contents of the bag-info template file (with the keyword substitutions made) into a dictionary, as required by bagit-python, writes the bag in place, and moves on to the next bag listed in the spreadsheet. It also writes a report spreadsheet listing the UUID assigned to each bag as its External-Identifier, to be used in UTL’s records.
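The substitution step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the batch_bagger source: the function names, the "folder" column, and the sample donor and language values are hypothetical, and the parser ignores indented continuation lines that a real bag-info.txt file may contain.

```python
import csv
import io
import re
import uuid

# Matches placeholders such as [[donor]] and captures the keyword inside.
PLACEHOLDER = re.compile(r"\[\[(.+?)\]\]")


def fill_template(template_text, row):
    """Replace each [[keyword]] with row[keyword]; unknown keywords are left as-is."""
    def substitute(match):
        key = match.group(1).strip()
        return row.get(key, match.group(0))
    return PLACEHOLDER.sub(substitute, template_text)


def parse_bag_info(text):
    """Turn 'Label: value' lines into a dictionary (no continuation-line support)."""
    info = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        label, _, value = line.partition(":")
        info[label.strip()] = value.strip()
    return info


# Hypothetical template and spreadsheet contents, inlined so the sketch runs as-is.
template = (
    "Source-Organization: UT Libraries\n"
    "External-Description: Linguistic research data provided by [[donor]].\n"
    "Internal-Sender-Description: [[language]] materials ([[filetypes]])\n"
)
spreadsheet = io.StringIO(
    "folder,donor,language,filetypes\n"
    "bag_001,Jane Doe,Zapotec,wav and pdf\n"
)

bag_infos = {}
for row in csv.DictReader(spreadsheet):
    info = parse_bag_info(fill_template(template, row))
    info["External-Identifier"] = str(uuid.uuid4())  # recorded in the report
    bag_infos[row["folder"]] = info
    # With bagit-python installed, the bag could then be written in place:
    # bagit.make_bag(row["folder"], info, checksums=["sha256"])
```

Leaving unmatched keywords untouched, rather than replacing them with an empty string, makes a misspelled column header easy to spot in the resulting bag-info.txt file.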

This script will be especially useful going forward when I need to request descriptive information from collections managers about the folders of materials they send me for preservation. Many staff are not comfortable writing bags on the command line themselves, but are comfortable working in spreadsheets. Collaborating with other staff on a bag-info template will also demystify the bagging process somewhat, by showing them what information we collect for each bag for long-term preservation.


pybagger

This is a simple script that aims to recreate the core command-line functionality of bagit-java. After I wrote batch_bagger.py, I presented it to a colleague who told me that the need to provide a spreadsheet limited its applicability. What my colleague needed was a simple workflow that would allow her to write her own bag-info.txt file and provide it at the point of bagging, which is how she had used bagit-java.

Like batch_bagger, pybagger loads the contents of a user-provided bag-info.txt file into a Python dictionary, which then supplies the bag-info output when the user-supplied directory is bagged. The command syntax for running pybagger is very similar to that of bagit-java, and users can run it as part of a for loop to bag a list of directories in one sitting, providing a bag-info.txt file for each one. Unlike with bagit-java, the user does not need to specify the manifest or tagmanifest algorithms: the UTL-standard sha256 algorithm will be used.
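The approach described above might look roughly like the following. It is a sketch rather than the actual pybagger code: the --bag-info flag, the function names, and the handling of indented continuation lines are assumptions made for illustration. The one external call, bagit.make_bag with a checksums list, comes from bagit-python.

```python
import argparse
from pathlib import Path


def load_bag_info(path):
    """Read a bag-info.txt file into a dict, folding indented continuation lines.

    If a label is repeated, the last value wins in this simplified version.
    """
    info = {}
    label = None
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        if line[0] in " \t" and label is not None:
            # An indented line continues the value of the previous label.
            info[label] += " " + line.strip()
            continue
        label, _, value = line.partition(":")
        label = label.strip()
        info[label] = value.strip()
    return info


def build_parser():
    # Hypothetical command-line interface, not pybagger's actual options.
    parser = argparse.ArgumentParser(description="Bag a directory in place (sketch).")
    parser.add_argument("--bag-info", required=True, help="path to a bag-info.txt file")
    parser.add_argument("directory", help="directory to bag in place")
    return parser


def main(argv=None):
    import bagit  # third-party dependency: pip install bagit

    args = build_parser().parse_args(argv)
    bag_info = load_bag_info(args.bag_info)
    # sha256 is fixed here, so the user never chooses an algorithm.
    bagit.make_bag(args.directory, bag_info, checksums=["sha256"])
```

Calling main(["--bag-info", "bag-info.txt", "some_folder"]) would bag some_folder in place; wrapping that call in a shell for loop gives the one-sitting batch workflow, with a different bag-info.txt file for each directory.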

Both pybagger and batch_bagger have an “unpack” option that moves the contents of a bag’s data directory up into the bag’s top-level folder, then deletes the empty data directory along with the bag-info.txt, bagit.txt, manifest, and tagmanifest files (named manifest-sha256.txt and tagmanifest-sha256.txt when sha256 is used). The unpack option effectively undoes the bagging process in a given directory, and is useful when a bag creator notices an error in a bag-info.txt file that will require the bags to be recreated.
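For readers curious what undoing a bag involves, here is a minimal sketch that uses only the Python standard library. It is an illustration, not the scripts' own implementation: the function name and the collision check are assumptions, and it treats any top-level file matching manifest-*.txt or tagmanifest-*.txt as a tag file.

```python
import shutil
from pathlib import Path

# Tag files that bagging adds to the top level of a bag.
TAG_PATTERNS = ("bagit.txt", "bag-info.txt", "manifest-*.txt", "tagmanifest-*.txt")


def unpack(bag_dir):
    """Move payload files out of data/ and delete the BagIt tag files."""
    bag = Path(bag_dir)
    data = bag / "data"
    if not data.is_dir():
        raise ValueError(f"{bag} has no data directory; is it a bag?")

    tag_files = [path for pattern in TAG_PATTERNS for path in bag.glob(pattern)]
    items = list(data.iterdir())

    # Refuse to start if a payload item would overwrite something other than
    # a tag file, so a failed unpack never leaves the folder half-moved.
    clashes = [
        item.name
        for item in items
        if (bag / item.name).exists() and (bag / item.name) not in tag_files
    ]
    if clashes:
        raise FileExistsError(f"Cannot unpack {bag}: name clashes {clashes}")

    for tag_file in tag_files:
        tag_file.unlink()
    for item in items:
        shutil.move(str(item), str(bag / item.name))
    data.rmdir()  # only succeeds once the data directory is empty
```

A payload that itself contains a top-level folder named data would trip the collision check here; that case would need renaming by hand.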