Austin, Texas – LLILAS Benson Latin American Studies and Collections hosted a day-long digital scholarship symposium on May 30, 2017, titled “Reading the First Books: Colonial Documents in the Digital Age.” Hashtag: #firstbooksdh
The symposium celebrated the culmination of “Reading the First Books: Multilingual, Early-Modern OCR for Primeros Libros,” a two-year effort to develop tools for the automatic transcription of early modern, multilingual printed books. The project was a collaboration among students, faculty, and staff at The University of Texas at Austin and Texas A&M University, and was funded by the National Endowment for the Humanities (NEH) Office of Digital Humanities.
The May 30 symposium brought together invited scholars, librarians, software developers, and students for a day-long conversation on the themes of digital scholarship, colonial and early modern history, and Latin American studies. Two keynote speakers addressed the symposium: Brook Danielle Lillehaugen, Department of Linguistics, Haverford College, and Taylor Berg-Kirkpatrick, Language Technologies Institute, Carnegie Mellon University.
“Reading the First Books” has extended Ocular, a tool for the automatic transcription of early modern printed books, to work across multiple languages and orthographies and to automatically generate both diplomatic and normalized transcriptions. It has applied Ocular to the Primeros Libros de las Américas, a multilingual collection of books printed in the Americas prior to 1601, to produce a new corpus of machine-readable text in Huastec, Latin, Mixtec, Nahuatl, Otomi, Spanish, Tarascan (Purépecha), and Zapotec. The University of Texas Libraries and the Benson Latin American Collection are partners in the multinational Primeros Libros project along with over a dozen other institutions.
In this Year of Open, are the publications available online with open access?
Are videos or presentations from the symposium available online?
While this project’s funding concludes this year, are there follow-up efforts and work?
About the project
Reading the First Books: Multilingual, Early-Modern OCR for Primeros Libros is a two-year, multi-university effort to develop tools for the automatic transcription of early modern printed books. It is a collaboration between students, faculty, and staff at the University of Texas at Austin and Texas A&M University.
The Reading the First Books project will:
- Develop tools for the automatic transcription of books printed in multiple languages, using variable orthographies, during the first centuries of the printing press.
- Make those tools accessible for institutions and individuals by incorporating them into the Early Modern OCR Project (eMOP) at Texas A&M University, an open-source OCR workflow.
- Produce automatic transcriptions of the Primeros Libros de las Américas collection of books printed before 1601 in the Americas, written in Spanish, Latin, Nahuatl, Huastec, Mixtec, Otomi, Tarascan, and Zapotec.
This website provides information about the project along with news, updates, and reports about our progress.
Reading the First Books is funded by a National Endowment for the Humanities Digital Implementation Grant. Any views, findings, conclusions, or recommendations expressed in this web site do not necessarily represent those of the National Endowment for the Humanities.
Grant number: HK-230965-15
University of Texas, Austin (Austin, TX 78712-0100)
Sergio Romero (Project Director: 02/18/2015 to present)
Laura Mandell (Co Project Director: 07/16/2015 to present)
Reading the First Books: Multilingual, Early-Modern OCR for Primeros Libros
Enhancement of optical character recognition (OCR) technologies to improve researchers’ ability to discover and search early modern, multilingual printed texts. During this phase, the project team would focus on books printed in the Americas before 1601.
Digital facsimile collections of early modern printed books (books printed on hand presses in the 15th to 17th centuries) greatly improve access to these cultural heritage materials for scholars, students, and the general public. The utility and accessibility of these digital collections, however, have been limited by the challenges of transcribing early modern printed books: their linguistic complexity, unstable orthography (spelling and punctuation), and uneven typesetting and inking make these books difficult to read for humans and machines alike. The goal of this project is to develop and implement groundbreaking methods in the automatic transcription of early modern printed books. This will increase access to books that are not just a vital record of historical thought during this exciting period in European, colonial, and indigenous American history, but also reflect the development of a new, transformative technology – the printing press.
Project fields: International Studies; Latin American History; Latin American Languages
Program: Digital Humanities Implementation Grants
Ocular is a state-of-the-art historical OCR system.
Its primary features are:
- Unsupervised learning of unknown fonts: requires only document images and a corpus of text.
- Ability to handle noisy documents: inconsistent inking, spacing, vertical alignment, etc.
- Support for multilingual documents, including those that have considerable word-level code-switching.
- Unsupervised learning of orthographic variation patterns including archaic spellings and printer shorthand.
- Simultaneous, joint transcription into both diplomatic (literal) and normalized forms.
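The difference between a diplomatic and a normalized transcription can be illustrated with a small sketch. This is not Ocular's method: Ocular learns orthographic variation from data without supervision, whereas the sketch below applies a few hand-written rules for early modern Spanish printing conventions. The function name `normalize` and the rules themselves are assumptions chosen only for illustration.

```python
import re
import unicodedata

COMBINING_TILDE = "\u0303"  # tilde used by printers to mark an omitted n or m
LONG_S = "\u017f"           # the long s character

def normalize(diplomatic: str) -> str:
    """Map a diplomatic (literal) transcription to a normalized spelling.

    Hypothetical hand-written rules for illustration only; a system like
    Ocular would learn such patterns rather than use a fixed table.
    """
    # Decompose so a tilde over a vowel becomes vowel + combining tilde.
    text = unicodedata.normalize("NFD", diplomatic)
    # Long s becomes a modern s.
    text = text.replace(LONG_S, "s")
    # A tilde over a vowel before b or p stands for an omitted m.
    text = re.sub(f"([aeiou]){COMBINING_TILDE}(?=[bp])", r"\1m", text)
    # Any other tilde over a vowel stands for an omitted n.
    text = re.sub(f"([aeiou]){COMBINING_TILDE}", r"\1n", text)
    # Recompose so letters such as n-with-tilde are single characters again.
    text = unicodedata.normalize("NFC", text)
    # Word-initial v before a consonant is read as u (vn -> un).
    text = re.sub(r"\bv(?=[^aeiou\W])", "u", text)
    return text

diplomatic = "en vn tiẽpo cõ vna perſona"
print(diplomatic)             # literal transcription of the printed forms
print(normalize(diplomatic))  # en un tiempo con una persona
```

A fixed rule table like this breaks down quickly on real books, where spelling varies by printer, language, and page. That variability is the reason a learned model of orthographic variation is useful.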