Hijuelas digitization workflow

Developing a low-cost, fast book digitization workflow for use in Michoacán, Mexico.

In November 2016, I traveled to Morelia, Michoacán, Mexico, to lead a digitization workshop for a team of local historians. Over the following two years, the team would digitize a set of 192 land deed books called libros de hijuelas, roughly 150,000 pages in all. The books document the privatization of indigenous land, called the reparto de tierras, which took place in the late 19th and early 20th centuries. Although these repartos took place across Mexico at this time, the process in Michoacán is especially significant because of the state’s large indigenous population, and these libros de hijuelas matter because they form a more complete set than can be found anywhere else in Mexico.

Hijuelas condition check with full collection in the background

Project background

The project is funded by a grant from the British Library’s Endangered Archives Programme, which supports efforts to rescue or preserve vulnerable archival materials. The hijuelas books, which are housed at the Archivo General e Histórico del Poder Ejecutivo de Michoacán (AGHPEM, the state archive of Michoacán), are in various states of decay, owing in part to a lack of resources for conservation and adequate housing.

The loss of these books to decay is especially troubling because indigenous activists dispute Mexican government claims over land to this day: in April 2017, Mexican police killed four indigenous activists (warning: graphic photos) in Arantepacua. Preserving these documents and making them accessible will help researchers who may find it difficult to do research in Morelia, and it will also help vulnerable communities in Michoacán assert and defend their rights against the state and large private interests.

The grant was secured by Dr. Matthew Butler, Associate Professor of History at UT-Austin. Dr. Butler studies Mexican religion in the post-revolutionary period, and his research into the Cristero Rebellion has taken him to Morelia many times. The Endangered Archives Programme is an excellent fit for the Benson Latin American Collection, where post-custodial partnerships have been supported since 2009.

As with other post-custodial projects, the hijuelas digitization will proceed locally, with the Benson providing digitization equipment and assistance as needed. The Benson (and the British Library) will receive digital copies of the materials in exchange, and the libros de hijuelas will be preserved and made accessible without being removed from Michoacán, where they belong.


Designing the workflow: Needs of the project, equipment, and software

I was given from August to November 2016 to develop a digitization workflow that would allow the books to be digitized in under two years while meeting the standards of the British Library grant. My supervisor, Theresa Polk, had settled on an equipment setup using a Canon EOS 6D DSLR and a Beseler copy stand, but with only a few months to go before the work was set to begin, exactly how the digitization would proceed with this equipment was still unknown.

Over the course of the fall, I developed and refined a digitization workflow using Adobe Lightroom. Past post-custodial collections we have received at the Benson were digitized with flatbed scanners, which produce high-resolution images but take a long time. Producing 150,000 scans in just two years would be all but impossible with a flatbed scanner. In addition, from my own experience running quality control on these collections, I knew firsthand that the manual file naming and management that flatbed scans require introduces frequent errors, which in turn demand lengthy (and frankly dreadful) close review.

Thankfully for us, Lightroom is an excellent tool for producing scans quickly and naming files automatically. I developed a workflow using the tethered capture feature and the built-in (customizable) naming templates. These features allow the digitizer to review image quality as soon as a photo is taken and to automatically apply the correct naming convention and file structure to all files in a sequence.

This was quite literally the only way the project could ever be completed on time, because it allows the digitization team to focus on taking photos and turning pages without bogging them down with the need to type in file names. It also (very selfishly) greatly reduces the amount of time I need to spend reviewing the materials for filename typos or misnaming once they are delivered.

Professor Butler and Theresa testing the workflow at the Benson

I drafted two instructional documents, both available here. The first is focused on the equipment and basic principles of photography, while the other details the daily step-by-step digitization workflow we would be asking the team to follow. To make sure these documents were clearly written and practical, Professor Butler and Theresa tested the workflow themselves in the weeks leading up to the November workshop. Theresa had an idea of what the process would look like, but Dr. Butler had no idea what to expect. They worked as a team of two, just like the team in Mexico would, and left confident that the team would be able to understand and implement it well.


The workflow itself

The workflow splits each book into six sections, each photographed as its own “tethered session” in Lightroom and merged later. The six sections are:

  • Cover (portada) recto and verso;
  • Back cover (contraportada) recto and verso;
  • Main text recto and verso.
The six sections of each book that are photographed. These sections are captured as CR2 (raw) files, then exported to TIFF.

Photographing the pages front-to-back in one sequence might seem more straightforward, but this would require the team to re-orient the book for each photo and take much more time. Photographing the same “kind” of page all at once allows the team to very quickly turn pages and take photos. The reliable auto-focus on the camera ensures all photos are in focus, and it’s much easier to turn pages quickly when the book is held in place and not moved constantly. In my own experiments with the setup, I was able to digitize upwards of 200 pages per hour.
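To put that rate in perspective, here is a rough back-of-the-envelope sketch. It assumes the roughly 150,000 pages mentioned earlier and the 200 pages per hour I reached in my own tests; the four hours of photography per working day is a purely hypothetical figure, and the estimate ignores setup, editing, and export time.

```python
import math

TOTAL_PAGES = 150_000        # approximate size of the collection
PAGES_PER_HOUR = 200         # rate from my own tests, not a guaranteed team rate
CAPTURE_HOURS_PER_DAY = 4    # hypothetical: hours of actual photography per working day


def capture_hours(total_pages, pages_per_hour):
    """Hours spent at the camera if the rate is sustained throughout."""
    return total_pages / pages_per_hour


def working_days(hours, hours_per_day):
    """Working days needed, rounded up to whole days."""
    return math.ceil(hours / hours_per_day)


hours = capture_hours(TOTAL_PAGES, PAGES_PER_HOUR)
days = working_days(hours, CAPTURE_HOURS_PER_DAY)
print(f"About {hours:.0f} hours at the camera, or {days} working days")
```

Under these assumptions the photography itself comes to about 750 hours, which leaves a comfortable margin inside a two-year window for condition checks, edits, and description.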

Once all six sections of a book are photographed, basic edits like cropping, rotation, and color correction are applied in batch and the files are exported to a single TIFF folder. Because the filename structure ensures the verso image of a page has the same number as its corresponding recto image, exporting to a single folder results in an ordered directory which can be flipped through just like reading a book.

Files from all six “shoots” are exported to a single TIFF folder, and the pages are automatically interleaved in the correct order
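The interleaving can be illustrated with a small sketch. The naming pattern below (prefix, book number, page number, side) is a hypothetical reconstruction based on the filename apa_02_002_v that appears alongside one of the images in this post; it is not the exact Lightroom template, and it leaves out the cover sections, whose names are not shown here.

```python
import re

# Hypothetical pattern: <prefix>_<book>_<page>_<side>, side "r" (recto) or "v" (verso).
NAME_PATTERN = re.compile(
    r"^(?P<prefix>[a-z]+)_(?P<book>\d+)_(?P<page>\d+)_(?P<side>[rv])$"
)


def make_name(prefix, book, page, side):
    """Build a zero-padded filename stem, e.g. apa_02_002_v."""
    return f"{prefix}_{book:02d}_{page:03d}_{side}"


def sort_key(name):
    """Order by book, then page number, then recto before verso."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"Unexpected filename: {name}")
    return (match["prefix"], int(match["book"]), int(match["page"]), match["side"])


# Two separate tethered sessions: all rectos first, then all versos.
recto_session = [make_name("apa", 2, page, "r") for page in range(1, 4)]
verso_session = [make_name("apa", 2, page, "v") for page in range(1, 4)]

# Exporting both sessions to one folder and sorting interleaves the pages.
merged = sorted(recto_session + verso_session, key=sort_key)
for name in merged:
    print(name)
```

Because the page numbers are zero-padded and "r" sorts before "v", a plain alphabetical sort in any file browser gives the same reading order, so no renaming step is needed after export.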

The photos are captured as CR2 (raw) files and exported as 8-bit, 300 dpi TIFFs per the British Library’s requirements. We preserve the CR2 files as archival backups, so when the images are ready to be ingested into an online platform, we will be able to re-export them as TIFFs or JPEGs at a higher dpi or bit depth if we want.


Future of the project and lessons learned

My biggest concern with the project was that the workflow would be too technically complex, and that the hijuelas team would struggle to implement it. Many of the decisions I made are based on principles I learned in digital archiving coursework. Although the team are accomplished historians, we did not expect they would necessarily be as comfortable as I am managing files and understanding each step’s role in producing a clean, organized directory of images.

Teaching the team to use Adobe Lightroom. I was worried about my Spanish, but the workshop was a huge success

Thankfully, the team has so far shown an excellent understanding of the workflow. In the first few months of work, the team digitized roughly 30 of the 192 books and delivered the images on a hard drive. At this pace, the team should be able to complete the digitization of the books well before the two-year deadline, freeing them up to focus on describing the materials. Most impressively, the team has demonstrated the ability to self-correct issues encountered during digitization, which tells me they fully grasp the workflow and are looking attentively at image quality at the point of digitization. This makes QC on our end much, much easier.

Although the team has moved quickly, there have been some hiccups in the process that forced us to intervene. While the autofocus of the DSLR lens reliably captures pages with clear writing, blank and faded pages are difficult to photograph. To address this, the team initially placed a snake weight on the center of the page for the lens to focus on, but this resulted in pages (albeit blank ones, usually) awkwardly covered by what looks like a shoestring. To tackle the problem head-on, we tested and delivered a new workflow using remote shutters with shutter lock functionality. These allow the team to place a token object on the page, focus the camera, and take the photo after removing the object. Pages are thus captured in focus without anything blocking them.

Snake weight covering a blank page (image apa_02_002_v). The weight looks awkward and casts a small shadow

Other issues the team has encountered are outside anyone’s control and could not have been anticipated. In late March, the team reported that a large construction project in Morelia was sending a great deal of dust into the air throughout the city center.

Dust settling from construction in Morelia

The team reported that they were forced to wipe off the photography surface every few minutes, significantly slowing down the photography process and raising preservation concerns. We knew the color check cards would need to be replaced regularly, but this dust will force us to replace them much more often than we expected.

I will need to adapt this workflow in the future to meet the needs of the material to be digitized. The process of photographing recto and verso as separate sections and merging them later works great for books, but it might not be ideal for loose documents, which can be thrown out of order easily, or which might not logically follow any numbering sequence.

I have also learned the limitations of the equipment we used for the project. The DSLR camera we used has only a 20.2 megapixel sensor, a serious limitation for photographing large objects. The stand itself is also too short to position the camera at the height needed for large objects. Some of the larger books barely fit on the base and may need to be photographed specially, taking multiple photos of each page and stitching these together in Photoshop. If possible, future projects will use a camera with a higher-resolution sensor (ideally 50 megapixels) to capture finer detail.
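A quick calculation shows why sensor resolution matters for large objects. This is only a sketch under stated assumptions: it treats 300 dpi as 300 pixels per inch of the original object, assumes a 3:2 sensor aspect ratio, and assumes the object fills the frame exactly with no margin for color cards or rulers. The function name and parameters are my own illustration.

```python
import math


def max_object_size_inches(megapixels, ppi=300, aspect=(3, 2)):
    """Approximate largest object (width, height in inches) that one
    frame can cover while still delivering `ppi` pixels per inch.

    Assumes the object fills the frame and the sensor has the given
    aspect ratio; real pixel dimensions vary slightly by camera model.
    """
    pixels = megapixels * 1_000_000
    ratio = aspect[0] / aspect[1]
    width_px = math.sqrt(pixels * ratio)
    height_px = math.sqrt(pixels / ratio)
    return width_px / ppi, height_px / ppi


for mp in (20.2, 50):
    width_in, height_in = max_object_size_inches(mp)
    print(f"{mp} MP: about {width_in:.1f} x {height_in:.1f} inches at 300 ppi")
```

By this estimate, a 20.2 megapixel frame covers roughly 18 by 12 inches at 300 pixels per inch, while a 50 megapixel frame covers roughly 29 by 19 inches, which is why the largest books would otherwise need multiple stitched exposures.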

Thankfully, I can come into any future digitization projects with a much clearer understanding of digital photography and the functions of Adobe Lightroom. This will allow me to adapt the workflow quickly without needing to learn these (crucial) skills for the first time, as I did here. I can also approach future workshops more confident in my Spanish and more aware of the challenges and opportunities presented by post-custodial archiving.