The following scripts emerged out of a need to improve and streamline bagging practices at UT Libraries. As UTL’s Systems and Digital Archivist, my duties include building digital preservation capacity across the Libraries. One area for growth I identified early on was training archivists working within special collections units to bag digital collections materials according to the Library of Congress’ BagIt specification.
UTL has used BagIt to package special collections material bound for its tape archive for nearly a decade, and has developed a local specification for
bag-info.txt contents that ensures bagged materials are tracked in local records, well-described internally, and (in the case of digitized special collections materials) traceable to originating physical collections.
Until very recently, however, the work of creating and validating bags at UTL was performed exclusively by members of the Digital Stewardship unit. This meant archivists and collections managers delivering materials to be added to the tape archive were not directly responsible for describing those materials, and prevented staff outside the unit from learning about bagging and digital preservation more generally. It was also rather inefficient, as a batch of files would regularly require a series of back-and-forth messages between Digital Stewardship staff and collections managers to determine the best descriptive information to be included in the bag-info.txt file.
More recently, we have begun to train collections managers at other Libraries units in creating and validating bags themselves. This ensures the bags we receive in Digital Stewardship are described by the most knowledgeable staff possible, and allows us to more quickly move files to tape. Depending on their interface preferences and the number of bags they need to create in a batch, UTL staff can use
bagger (great for creating a single bag),
bagit-java (allows batch bag creation, but newer versions have dropped command line support), or
bagit-python (allows batch bag creation and is easy to install, but lacks some of the flexibility offered by older versions of
bagit-java) to prepare materials for the tape archive.
The following scripts address some of the limitations of
bagit-python, namely its inability to accept a user-supplied
bag-info.txt file as older versions of
bagit-java can. They are also an attempt to remain flexible in our bagging practices at UTL: while some collections managers may be comfortable working on the command line, others are not, and these scripts make it easier to gather good descriptive metadata bound for
bag-info.txt files, using approachable tools like spreadsheets.
The first script, batch_bagger.py, creates many bags at once, using a
bag-info template and a CSV or XLSX spreadsheet to substitute bag-level text as needed. It is useful when creating many bags for a single collection, when each
bag-info.txt file in the batch differs from the others in only slight ways.
I wrote this script in December 2021, when I was faced with the need to bag approximately 80 folders of linguistic material from the Archive of the Indigenous Languages of Latin America (AILLA). Each folder I needed to bag contained files corresponding to a different language, donated by a different researcher (or group of researchers), and each might contain one or several different filetypes. Aside from these differences, the bags could be described more or less identically.
This script allowed me to create a
bag-info.txt template for the full set of files, and to gather folder-level descriptions of each bag’s language, donor, and filetypes in a spreadsheet that could be easily filled out by AILLA Language Data Curator Ryan Sullivant. Ryan was proficient in bagging, but I didn’t want him to have to create 80 different bag-info.txt files.
The script works by searching the bag-info template file for keywords enclosed in double square brackets, for example “
External-Description: Linguistic research data provided by [[donor]].” Any keyword that matches a column header in the user-supplied spreadsheet (CSV or XLSX) is replaced, for each bag, with the corresponding spreadsheet value. The script converts the contents of the
bag-info template file (with proper keyword substitutions made) into a dictionary as required by
bagit-python. The script then writes the bag in place and moves on to the next bag listed in the spreadsheet. It also writes a report spreadsheet listing the UUIDs assigned as an
External-Identifier to each bag, to be used in UTL’s records.
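
The keyword substitution described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the script itself: the function name fill_template, the column names, and the sample values are all hypothetical, and an in-memory CSV stands in for the user-supplied spreadsheet.

```python
import csv
import io
import re
import uuid

# Matches placeholders such as [[donor]] and captures the keyword inside.
PLACEHOLDER = re.compile(r"\[\[(.+?)\]\]")


def fill_template(template, row):
    """Replace each [[keyword]] that matches a column header in `row`."""
    def substitute(match):
        key = match.group(1).strip()
        # Placeholders with no matching column are left untouched.
        return row[key] if key in row else match.group(0)

    return PLACEHOLDER.sub(substitute, template)


# Hypothetical template text; real templates follow the local specification.
template = (
    "External-Description: Linguistic research data provided by [[donor]].\n"
    "Internal-Sender-Description: [[language]] materials ([[filetypes]])\n"
)

# Stand-in for the spreadsheet; column names and values are made up.
spreadsheet = io.StringIO(
    "folder,donor,language,filetypes\n"
    "bag_001,Jane Doe,Language A,WAV and PDF\n"
    "bag_002,John Roe,Language B,MP4\n"
)

filled = {}   # folder name -> bag-info text with substitutions made
report = []   # one row per bag, recording the UUID assigned to it
for row in csv.DictReader(spreadsheet):
    identifier = str(uuid.uuid4())
    text = fill_template(template, row)
    text += f"External-Identifier: {identifier}\n"
    filled[row["folder"]] = text
    report.append({"folder": row["folder"], "External-Identifier": identifier})
```

Each entry in filled would then be converted to a dictionary and handed to bagit-python, and report could be written out with csv.DictWriter.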
This script will be especially useful going forward when I need to request descriptive information from collections managers about the folders of materials they send me for preservation. Many staff are not comfortable writing bags on the command line themselves, but are comfortable working in spreadsheets. Collaborating with other staff on a bag-info template will also demystify the bagging process somewhat, by showing them what information we collect for each bag for long term preservation.
The second script, pybagger, is a simple one that aims to recreate the core command-line functionality of
bagit-java. After I wrote
batch_bagger.py, I presented it to a colleague who told me that the need to provide a spreadsheet limited its applicability. What my colleague needed was a simple workflow that would allow her to write her own
bag-info.txt file and provide it at the point of bagging, which is how she had used bagit-java.
pybagger loads the contents of a user-provided
bag-info.txt file into a Python dictionary, which is then used to write the resulting bag's bag-info metadata when the user-supplied directory is bagged. The command syntax for running
pybagger is very similar to
bagit-java, and users can run it as part of a
for loop to bag a list of directories in one sitting, providing a
bag-info.txt file for each one. Unlike with bagit-java, the user does not need to specify the
tagmanifest algorithms: the UTL-standard
sha256 algorithm will be used.
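
As a rough sketch of this approach (not the actual pybagger source), the example below parses bag-info text into a dictionary and passes it to bagit-python's make_bag with sha256 checksums. The function names parse_bag_info and bag_with_info are hypothetical, and the handling of indented continuation lines and repeated labels is an assumption about how one might read such a file.

```python
from pathlib import Path


def parse_bag_info(text):
    """Turn 'Label: value' lines into a dict; repeated labels become lists."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0] in " \t" and entries:
            # An indented line continues the previous value.
            entries[-1][1] += " " + line.strip()
        else:
            label, _, value = line.partition(":")
            entries.append([label.strip(), value.strip()])

    info = {}
    for label, value in entries:
        if label not in info:
            info[label] = value
        elif isinstance(info[label], list):
            info[label].append(value)
        else:
            info[label] = [info[label], value]
    return info


def bag_with_info(directory, bag_info_path):
    """Bag `directory` in place using a user-written bag-info.txt and sha256."""
    import bagit  # third-party package: pip install bagit

    info = parse_bag_info(Path(bag_info_path).read_text(encoding="utf-8"))
    return bagit.make_bag(str(directory), info, checksums=["sha256"])
```

Calling bag_with_info once per directory inside a loop would mirror the one-sitting workflow described above.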
Both pybagger and batch_bagger have an “
unpack” option that will move the contents of a bag’s data directory out of that folder before deleting it alongside the
tagmanifest.txt files. The unpack option effectively undoes the bagging process in a given directory, and is useful when a bag creator notices an error in a
bag-info.txt file that will require the bags to be recreated.
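
The unpack behavior could look something like the sketch below. It is an illustration only: the function name unpack_bag is hypothetical, and the list of tag files to delete (bagit.txt, bag-info.txt, and the manifest and tagmanifest files) is my assumption based on the standard BagIt layout rather than a description of the scripts' exact behavior.

```python
import shutil
from pathlib import Path

# Assumed set of top-level tag files that bagging adds to a directory.
TAG_FILE_PATTERNS = ("bagit.txt", "bag-info.txt", "manifest-*.txt", "tagmanifest-*.txt")


def unpack_bag(bag_dir):
    """Move payload files out of data/ and delete the bag's tag files."""
    bag_dir = Path(bag_dir)
    data_dir = bag_dir / "data"
    if not data_dir.is_dir():
        raise ValueError(f"{bag_dir} has no data directory; is it a bag?")

    # Collect tag files before moving anything, so a payload file that
    # happens to match a pattern is never deleted by mistake.
    tag_files = [p for pattern in TAG_FILE_PATTERNS for p in bag_dir.glob(pattern)]
    payload = list(data_dir.iterdir())

    # Refuse to overwrite anything at the top level other than tag files.
    collisions = [
        item.name for item in payload
        if (bag_dir / item.name).exists() and (bag_dir / item.name) not in tag_files
    ]
    if collisions:
        raise FileExistsError(f"Cannot unpack; name clash: {collisions}")

    for tag_file in tag_files:
        tag_file.unlink()
    for item in payload:
        shutil.move(str(item), str(bag_dir / item.name))
    data_dir.rmdir()
```

After unpacking, the directory is back to its pre-bagging layout and can be bagged again with a corrected bag-info.txt file.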