The Limits of Efficiency: Daily Digital Archival Practice at LLILAS Benson

I gave the following presentation on May 27th, 2021, at the Latin American Studies Association’s 2021 virtual congress, “Crisis Global: Desigualdades y Centralidad de la Vida”. This presentation was part of a panel entitled “Historia digital: trabajando con archivos nacidos digitales”, alongside Nicolás F. Quiroga (CONICET/Universidad Nacional de Mar del Plata), Elvia Arroyo-Ramirez (UC Irvine), and Denise Frigo (Universidade Federal de Santa Maria). The panel was organized by Nicolás F. Quiroga, and Eden Medina (MIT) was the discussant.

Introduction and post-custodialism

My name is David Bliss, and I am the Systems and Digital Archivist at the University of Texas Libraries. From July 2017 until April of this year, I was the Digital Processing Archivist at the Benson Latin American Collection at UT. Today I’ll be speaking primarily about my work in my previous role, although many of my reflections about digital archival labor also apply to my new position and, I suspect, to other digital archivists as well.

At the Benson, my work was primarily focused on developing and supporting post-custodial archival projects in Latin America. Post-custodialism refers to an archival methodology that aims to preserve and provide access to archival collections without physically relocating them from their contexts of creation. The term was coined in 1981 by archivist F. Gerald Ham, but its contemporary use by archival practitioners follows its usage by archival scholar Jeannette Bastian, who wrote about it extensively beginning in the late 1990s.[1]

Slide with title "Post-custodial archival theory".

Preserving and publishing records without taking physical custody

Digitization and online access

F. Gerald Ham, "Archival Strategies for the Post-Custodial Era" (1981)

Jeannette Bastian, "A Question of Custody: The Historical Records of the U.S. Virgin Islands" (1999)

Photo of digitization equipment and computer being used to scan a colonial document at the Archivo Judicial del Estado de Puebla

Post-custodial theory proposes that digitization tools and internet access hold the potential to fundamentally transform the relationship between those who create or hold historical records and archivists who seek to preserve these records and make them available. Traditional archival collecting models regard archivists as custodians of social memory, uniquely equipped to conserve and provide access to historical materials for future generations by taking direct control of documents in state, university, or private repositories.

For institutions collecting internationally, this has often meant relocating archival collections far away from their original contexts, where those records may have the most profound and immediate social value. In the aggregate, this has resulted in the extraction of large swaths of documentary cultural heritage from the Global South and its accumulation at well-financed institutions in North America and Europe. This pattern mirrors colonial archaeological extraction, not to mention the ongoing extraction of natural resources by economic forces concentrated in these same regions.

The post-custodial model is an attempt to intervene and disrupt this dynamic. The post-custodial team at the Benson partners with organizations in Latin America to digitize and describe their records on-site. We devise digitization and metadata workflows that will meet the structure and needs of their collections, then deliver equipment and train our partners in its use before returning to Austin, while our partners undertake the work locally. Once a collection has been digitized, a copy is sent to the Benson, where it undergoes image and metadata processing, digital preservation, and finally publication on our Latin American Digital Initiatives online repository.

Diagram with title "Post-custodial projects at LLILAS Benson", showing typical post-custodial project elements, broken down chronologically and divided between those pieces contributed by the project partner, those contributed by LLILAS Benson, and the points of collaboration.

Pre-project partner elements are Physical custody of records. Pre-project LLILAS Benson elements are Digital Infrastructure.

Project planning collaborative elements are Selection of Materials and Metadata templates. Project planning LLILAS Benson elements are equipment selection and workflow design.

Project launch partner elements are Digitization and description; Editing and derivatives; QC; and fixity check.

Processing and publication partner elements are Additional derivatives and publication and reuse.

Processing and publication LLILAS Benson elements are ingest, fixity check, QC, processing, preservation, and publication and reuse.

Broadly speaking, we’re leveraging the digital infrastructure and expertise of the University of Texas together with our partners’ rich contextual knowledge to provide broader access to their collections. Our partners retain physical and intellectual control over the collections at all times, they are free to reuse and publish the digital collections as they see fit, and they keep the digitization equipment used for each project for their own purposes following the conclusion of the project. Our hope is that these collaborations will help to develop our partners’ archival capacity, and that wider use of their collections by researchers and the public will bolster their work more broadly.

Post-custodial archival projects at the Benson date back more than a decade and encompass a variety of initiatives of different scopes and durations. Today I’d like to elaborate and reflect on my work for two such projects in order to shed light on how digital collections are constructed and what daily archival labor looks like. My hope is that this reflection will help researchers understand some of the practical limitations and implications of working with digital collections, work which is often even less publicly visible than traditional archival practice.

How digital collections are constructed

In August 2018, I traveled to Puebla, Mexico to visit the Archivo Judicial del Estado de Puebla, which houses the Fondo Real de Cholula. The Fondo Real is a large collection of judicial records from nearby Cholula, spanning from 1571 to the early 19th century. During the colonial period, Cholula was designated a Ciudad de Indios by the Spanish crown, granting its residents access to special legal structures and a degree of local political autonomy. The judicial proceedings recorded in the Fondo Real show how Cholultecas navigated the contemporary world through contracts, lawsuits, criminal complaints, and wills. It is believed to be the only such collection from New Spain that survived the Mexican Revolution, and it may be the most complete one in all of Latin America.

Slide with title "Fondo Real de Cholula at the Archivo Judicial del Estado de Puebla". Photo of several hundred archival boxes on metal shelves in an archive. Photo of man and woman examining a document on a table. Screenshot of a list of questions about the archive and collection.

The purpose of my initial visit was to assess the collection’s intellectual organization, the structure of its component documents, and the physical state of the building in which it was housed. These questions would shape our equipment selection and the metadata templates used to describe the collection. While our goal was to create a complete digital surrogate of the Fondo Real for online access, digitization itself is a highly material process, and material considerations determine whether a project succeeds or fails.

Over the six months that followed, we purchased equipment, devised digitization workflows and metadata templates, and documented all the steps involved in creating the digital collection. The digitization workflow we devised emphasizes speed and efficiency, capturing two facing pages of an expediente at once, then cropping and color correcting several hundred images at a time.

Slide with title "Fondo Real de Cholula: Project Planning". Photo of camera and tripod mounted on a table. Screenshot of digitization workflow guide. Diagram showing digitization process, with recto and verso pages photographed and named together, then edited, cropped, and exported & renamed separately, then reintegrated

This work, along with the metadata gathering, would ultimately be done by a team of historians in Puebla, who have access to the collection and are well equipped to read and understand the documents. Although we tend to defer to our partners’ preferred terminology for describing the documents, in setting a metadata template, we exert a great deal of influence over what types of information our partners can and can’t gather.

Designing a metadata template for this collection meant deciding which fields and values would be required (that is, which information the team would need to record about each and every expediente) and which would be optional. Optional fields can be thought of as nice to have and potentially useful to researchers, but they may not be present in every expediente, and they would not be a top priority should the team come up against the time constraints imposed by our grant.

Slide with title "Fondo Real de Cholula: Metadata template". 

List of required fields:
Número de volumen
Cantidad de fojas
Tipo de caso/documento
Nombre del escribano
Identificador digital

List of optional fields:
Nombre del solicitante
Nombre del acusado
Ubicación del solicitante y/o acusado
Nombre del juez/autoridad
Calidad del juez/autoridad
Calidad del escribano
Nombre de testigo(s)
Otros nombres/entidades
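
As a rough illustration of what the required/optional split means in practice, the template on this slide could be expressed as a simple check in Python. This is a sketch under stated assumptions, not the tooling used on the project: the field names come from the slide, while the function `check_record`, the dictionary-per-expediente layout, and the example values are hypothetical.

```python
# Field names are taken from the slide; everything else here is illustrative.
REQUIRED_FIELDS = [
    "Número de volumen",
    "Cantidad de fojas",
    "Tipo de caso/documento",
    "Nombre del escribano",
    "Identificador digital",
]

OPTIONAL_FIELDS = [
    "Nombre del solicitante",
    "Nombre del acusado",
    "Ubicación del solicitante y/o acusado",
    "Nombre del juez/autoridad",
    "Calidad del juez/autoridad",
    "Calidad del escribano",
    "Nombre de testigo(s)",
    "Otros nombres/entidades",
]


def check_record(record):
    """Return (missing_required, unknown_fields) for one expediente record.

    A required field counts as missing when it is absent, None, or blank.
    Optional fields may be absent or blank without being reported.
    """
    allowed = set(REQUIRED_FIELDS) | set(OPTIONAL_FIELDS)
    missing = [
        field
        for field in REQUIRED_FIELDS
        if record.get(field) is None or not str(record[field]).strip()
    ]
    unknown = sorted(key for key in record if key not in allowed)
    return missing, unknown


# Hypothetical record, invented purely for illustration.
example = {
    "Número de volumen": "12",
    "Cantidad de fojas": "8",
    "Tipo de caso/documento": "Testamento",
    "Identificador digital": "ejemplo_0001",
    "Nombre de testigo(s)": "",  # optional, so a blank value is acceptable
}

missing, unknown = check_record(example)
print(missing)  # ['Nombre del escribano']
print(unknown)  # []
```

Every field moved from the optional list to the required list turns a blank cell from something acceptable into an error the team must resolve for each expediente, which is where the cost of a richer template accumulates.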

Because there is no limit to the types of metadata fields we could conceivably ask the team to collect, setting a template provides our best opportunity to creatively shape the digital collection from the outset. However, the constraints our own team faced limited the scope of what we could creatively achieve. The expectation that we would digitize and describe as large a portion of the collection as possible on a limited grant timeline pushed us toward a template with a relatively small number of straightforward required fields, which the team could record quickly for each expediente before moving on.

Similarly, when we received the digital collection at the Benson and processed it, our focus was on breaking each folder up into discrete expedientes and refining the metadata to ensure its compatibility with our online repository. We did not drastically alter scans, identify highlights, or add to the metadata our partners sent us, in part because we could not spare the time needed to do so.

Slide with URL

Screenshot of scanned document.

Screenshot of metadata display, showing a large number of metadata fields and values in Spanish, describing the document.

The result, which you can see today, is a digital collection that largely replicates the colonial ordering logic of the original documents. The more extensive fields we might have asked the team to gather could have made so-called “against the grain” readings of the collection easier, for instance by tracking the identities of minor actors named throughout the collection. What information we did gather will still allow researchers to recover the voices of everyday Cholultecas, and I want to stress that this was by all other metrics a very successful project. The team in Mexico digitized and described approximately 37 thousand pages in just 9 months, and our Metadata Librarian Itza Carbajal did a tremendous job of taking my initial metadata template and turning it into something realistic and workable, not to mention refining and processing the metadata for ingest into our repository. Of all the projects I’ve been a part of at the Benson, this is the one I’m most proud of, but it’s no coincidence that in the face of project constraints, the default metadata approach we were forced to fall back on largely defers to the notarios themselves, and primarily records the same elite voices who display the most agency in the documents.

This, I think, is key for scholars attempting to understand archival labor: We can be proficient in the subject matter of a collection and conversant in modern research methods used by scholars, but our efforts to make collections available in new and dynamic ways are often limited by project timelines and our responsibility, traditionally understood, to faithfully preserve collections as they come to us. Unfortunately, it’s much easier to digitally recreate the archival grain than to digitally disrupt it.

The labor of digital preservation

I’d like to pivot now to talking about a different collection and the labor involved in digital preservation.

In 2005, investigators uncovered the archives of the Policía Nacional de Guatemala in an abandoned warehouse in Guatemala City. The records in the warehouse, totaling approximately 80 million documents, cover the full span of the PN, from the mid-19th century to its disbanding in 1996 as part of the country’s peace accords. Most notably, the records contain direct evidence of the PN’s role in abductions, extrajudicial killings, and forced disappearances during the Guatemalan Civil War, atrocities which had previously been primarily attributed to the military. The discovery led to the formation of the Archivo Histórico de la Policía Nacional, under the direction of the Human Rights Ombudsman’s Office, in order to preserve and process the records in the warehouse.

Slide with title "Archivo Histórico de la Policía Nacional de Guatemala (AHPN)"

Image of documents stacked high in a dark room, labeled "Secuestros 88"

Image of two digitization technicians in lab coats with facemasks and hair coverings, scanning a document

Recognizing the significance and precarity of the records, the AHPN undertook a large-scale digitization and metadata project with the goal of reconstructing the collection digitally, making it searchable using a massive database of metadata gathered at the point of scanning. The team also partnered with the Benson to move a copy of the digital collection to Austin, in order to build an online access portal as well as provide a safeguard against tampering by government or police officials implicated by the records. Like the Fondo Real de Cholula digital archive, and like other human rights archives such as the Khmer Rouge Archives at the Documentation Center of Cambodia, the AHPN digital archive faithfully reconstructs much of the oppressive ordering logic of the original records in order to support subversive uses of the materials.[2]

My work with the AHPN digital archive began in spring 2018. At that time, we had a public copy of the digital archive available online and secure copies of the collection on hard drives, but we had never properly undertaken the work needed to preserve it long term. Given the sensitive nature of the collection and ongoing interference in the AHPN’s mission on the part of the Guatemalan government, the preservation of our digital copy was identified as an urgent priority.

Slide with title "Archivo Digital del AHPN"

Screenshot of AHPN digital archive website, showing a scanned document

As of spring 2018:
20 million scans
7 terabytes
Available online, but not properly preserved

Digital preservation at UTL:
Extracting technical metadata using FITS
BagIt packaging
Storage on LTO tape

Digital preservation at UT Libraries involves extracting technical metadata for every file in a collection using a tool called the File Information Tool Set (FITS), packaging collections according to the Library of Congress’ BagIt standard, and writing copies in duplicate to LTO tape. These processes protect the files from catastrophic data loss and ordinary bit rot, and ensure that any files retrieved from tape can be understood and read many years in the future. As of 2018, the digital collection stood at approximately 20 million images totaling more than 7 terabytes, and proper preservation of that copy of the collection involved approximately 6 months of continuous ingest and around-the-clock processing on a powerful dedicated workstation.
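
To give a sense of what BagIt packaging involves, here is a minimal sketch in Python using only the standard library. It is not the UT Libraries workflow, which is not described at this level of detail here, and a production process would more likely rely on an established BagIt implementation. The function names `make_bag` and `verify_bag` and the choice of SHA-256 are assumptions made for illustration. The sketch copies files into a `data/` payload directory, writes the `bagit.txt` declaration, and records a checksum for every payload file in a manifest that can later be used for a fixity check.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large scans never need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_bag(source_dir, bag_dir):
    """Copy source_dir into bag_dir/data and write a minimal BagIt bag.

    bag_dir/data must not exist yet; shutil.copytree creates it.
    """
    source_dir, bag_dir = Path(source_dir), Path(bag_dir)
    payload = bag_dir / "data"
    shutil.copytree(source_dir, payload)

    # The bag declaration required by the BagIt specification.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )

    # One manifest line per payload file: checksum, then path from the bag root.
    lines = []
    for path in sorted(p for p in payload.rglob("*") if p.is_file()):
        relative = path.relative_to(bag_dir).as_posix()
        lines.append(f"{sha256_of(path)}  {relative}\n")
    (bag_dir / "manifest-sha256.txt").write_text("".join(lines), encoding="utf-8")


def verify_bag(bag_dir):
    """Return payload paths that are missing or no longer match the manifest."""
    bag_dir = Path(bag_dir)
    failures = []
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        expected, relative = line.split(maxsplit=1)
        target = bag_dir / relative
        if not target.is_file() or sha256_of(target) != expected:
            failures.append(relative)
    return failures
```

Calling `make_bag("scans", "scans_bag")` on a hypothetical folder of scans and later `verify_bag("scans_bag")` returns an empty list while every file is intact. Any path it does return points to a file that has gone missing or changed since packaging, which is the kind of silent damage the preservation process is meant to catch.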

The structure of the collection complicates the preservation process even further. Because the original physical collection in Guatemala is so large, the digitization process cannot be a simple matter of scanning a record, giving it a recognizable name, and placing the file within the proper folder. Instead, when a document is scanned, it is assigned a random unique filename and placed in a random subfolder within one of 53 top-level directories. Metadata, including a document’s title, date, and origin location, are all collected in a database that links this information to the scanned image. The database connects images to one another and effectively recreates the PN’s original complex hierarchical recordkeeping structure. The database grants us some search functionality and protects the collection against tampering by significantly complicating the process of removing incriminating scans or inserting exonerating digital forgeries. The database structure also means that users looking for specific records may have to familiarize themselves with the organizational structure and recordkeeping practices of the PN. For many users these are the very same institutions that have hurt or killed members of their family.
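
The general shape of this arrangement, with opaque storage paths on disk and all meaning held in a database, can be sketched in Python with `sqlite3`. This is only an illustration of the idea and not the AHPN's actual system: the table layout, the number of subfolders, the `.tif` extension, the random choice of top-level directory, and the functions `create_catalog`, `assign_location`, `register_scan`, and `children_of` are all assumptions. Only the figure of 53 top-level directories comes from the description above.

```python
import random
import sqlite3
import uuid

TOP_LEVEL_DIRS = 53       # figure given in the talk
SUBFOLDERS_PER_DIR = 100  # assumption; the real number is not stated


def create_catalog():
    """Create an in-memory catalog; the schema is hypothetical."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE documents (
               id INTEGER PRIMARY KEY,
               title TEXT,
               date TEXT,
               origin TEXT,
               parent_id INTEGER REFERENCES documents(id),
               path TEXT UNIQUE NOT NULL
           )"""
    )
    return conn


def assign_location(suffix=".tif"):
    """Build a storage path that reveals nothing about the document itself."""
    top = random.randrange(TOP_LEVEL_DIRS)
    sub = random.randrange(SUBFOLDERS_PER_DIR)
    return f"{top:02d}/{sub:03d}/{uuid.uuid4().hex}{suffix}"


def register_scan(conn, title, date, origin, parent_id=None):
    """Record a scan's metadata and opaque path; return its database id."""
    cursor = conn.execute(
        "INSERT INTO documents (title, date, origin, parent_id, path) "
        "VALUES (?, ?, ?, ?, ?)",
        (title, date, origin, parent_id, assign_location()),
    )
    return cursor.lastrowid


def children_of(conn, parent_id):
    """List (title, path) for records filed directly under parent_id."""
    rows = conn.execute(
        "SELECT title, path FROM documents WHERE parent_id = ? ORDER BY id",
        (parent_id,),
    )
    return rows.fetchall()
```

In a layout like this, the hierarchy exists only in the `parent_id` links. Someone browsing the folders sees nothing but random names, so finding, removing, or replacing a particular record means working through the database rather than the file system.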

In fall 2018, as initial preservation work on the digital collection was nearing an end, we received three hard drives from the AHPN, containing an updated copy of the collection, with approximately 2 million new scans and database entries created over the course of one year. The scanning process does not separate scans into discrete batches of files, but instead produces a continuously growing cohesive body of files and accompanying database. As we were not yet finished writing the previous year’s copy to tape, we recognized the unsustainability of processing and preserving successively larger copies of the collection year over year. I set to work developing a method of safely disaggregating two versions of the collection, in order to ingest only those files which had not previously been preserved.

Slide with title "AHPN disaggregation"

Screenshot of text file listing a large number of hash values and filepaths

Large diagram showing AHPN disaggregation process

This is accomplished by calculating a list of filepaths and hash values for each copy of the collection, comparing the two lists, then producing an actionable list of files using OpenRefine. This cuts the ingest size and processing time for successive copies of the collection by approximately 90 percent. The two disaggregated copies of the collection can then be merged locally to recreate the complete, up-to-date copy.
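
The comparison at the heart of this process can be sketched in a few lines of Python. To be clear, this stands in for the OpenRefine step rather than reproducing it, and it rests on assumptions: the hash algorithm (SHA-256 here) is not specified above, and the function names `build_manifest` and `disaggregate` are hypothetical.

```python
import hashlib
from pathlib import Path


def build_manifest(root, chunk_size=1024 * 1024):
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    manifest = {}
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest = hashlib.sha256()
        with open(path, "rb") as handle:
            for chunk in iter(lambda: handle.read(chunk_size), b""):
                digest.update(chunk)
        manifest[path.relative_to(root).as_posix()] = digest.hexdigest()
    return manifest


def disaggregate(old, new):
    """Compare two manifests and return (to_ingest, changed, missing).

    to_ingest: paths present only in the newer copy
    changed:   paths in both copies whose hashes differ
    missing:   paths in the older copy that are absent from the newer one
    """
    to_ingest = sorted(path for path in new if path not in old)
    changed = sorted(path for path in new if path in old and new[path] != old[path])
    missing = sorted(path for path in old if path not in new)
    return to_ingest, changed, missing
```

Only the first list needs to go through full ingest. In a collection where scans are not expected to be edited or deleted after creation, the other two lists would ideally be empty, so anything that appears in them is worth reviewing by hand before the copies are merged.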

All of this work is highly involved and time-consuming, but it gives me very little opportunity to engage with the content of the collection. The work prioritizes collection integrity checks, redundancy measures, and extracting underlying technical metadata for the images, all of which is done entirely through semi-automated batch processing. Traditional physical conservation involves some degree of direct interface with an archival collection, which grants archivists an opportunity to familiarize themselves with the documents; that familiarity is useful in reference work and in collaborating directly with researchers. With digital collections, meanwhile, preservation work can be done without even glancing at the content of a collection.

This significantly widens the gap between archival work and research, and limits our ability to promote our collections or collaborate with subject matter experts. Put simply, I can often speak at length about a digital collection’s structure or my work in processing and preserving it, but even when I am the archivist working most closely with a collection I am rarely equipped to provide reference support.

The number of simultaneous projects we take on exacerbates this problem: while we were processing and preserving the AHPN digital collection, our attention was otherwise focused on planning for the Fondo Real de Cholula project and shrinking our digital processing backlog. Many archivists, digital and otherwise, are forced to limit their engagement with collections as a result of being pulled in several directions at once. In digital contexts, however, the effects are compounded by years or even decades of collections backlogs and distant processing tools. In the case of the AHPN, the cold violence and bewildering scale of the collection sharpened the insensitivity of the dry, distant work needed to preserve it.


Undergirding these issues in digital archival labor is a fundamental tension between the expensive digital infrastructure required to support digital collections and the neoliberal austerity which progressively reduces cultural heritage staff lines, even as collection volume increases. Career advancement in digital archives frequently privileges technical skill, scripting, and bulk processing over contextual knowledge, not necessarily because contextual knowledge is undervalued, but because distant processing is often the only way for an archivist to successfully meet grant and institutional deadlines. A 2020 survey of digital stewardship practitioners in the US found that digital archivists are generally confident in their ability to undertake deep, comprehensive work with collections, but lack the staff and institutional latitude to do so.[3] I would like to suggest that the pressure to work at scale at the expense of engagement with our collections similarly inhibits the quality of our work by limiting the kinds of tasks we can perform.

Slide with title "Conclusions"

Work at scale is often necessary to meet deadlines, but comes at the expense of engagement with collections

Scholarly collaboration and promotion are important elements of archival work

Without the ability to engage with collections, we cannot provide reference support or build exhibitions

Without the freedom to set time aside each week to simply read and appreciate our collections, digital archival labor all too often reduces rich and complex collections to simple data points, and reduces archivists themselves to digital functionaries focused entirely on maintaining digital infrastructure and compliance with preservation standards. Precisely because processing, manipulating, and providing access to large digital collections requires a special set of skills, there is fertile ground for digital archivists and researchers to collaborate closely, but the material circumstances surrounding our work currently drive us apart and deny us the knowledge needed to work with scholars. I would like digital archivists to have the freedom to take on fewer or smaller projects in order to work more slowly, more creatively, and more collaboratively.

