Focus On: In Codice Ratio

Focus On is our exclusive series of articles describing cutting-edge projects that exploit machine learning in Italy and around the world. Our first installment is devoted to In Codice Ratio, an interdisciplinary research project aimed at analyzing, and discovering knowledge in, historical documents from the collections of the Vatican Secret Archives. Our guest writer is Donatella Firmani, assistant professor at Roma Tre and one of the main contributors to the project, together with Paolo Merialdo (Professor) and Elena Nieddu (Ph.D. student).

The In Codice Ratio (ICR) Project

Historical handwritten documents are an essential source of knowledge concerning past cultures and societies. Automatic text processing methods promise to empower scholars with a quantitative, data-driven tool to study culture and society, but their power has been limited by the small amount of digitally transcribed sources. Due to the many challenges involved in fully automatic handwriting transcription (such as irregularities in writing, ligatures and abbreviations), many researchers in recent years have focused on solving easier problems, most notably keyword spotting. However, as more and more libraries worldwide digitize their collections, greater effort is being put into the creation of full-fledged transcription systems.

In Codice Ratio is an interdisciplinary project for the automatic transcription of the Vatican Registers, a corpus held in the Vatican Secret Archives. Our workflow consists of a character recognition phase, featuring a deep convolutional neural network, and a transcription phase proper, relying on statistical language models. The Vatican Registers corpus consists of more than 18,000 pages of official correspondence of the Roman Curia in the 13th century, including letters and opinions on legal questions, addressed to and from kings and sovereigns, as well as to many political and religious institutions throughout Europe. Never having been transcribed in the past, these documents are of unprecedented historical relevance.

The main contribution of the In Codice Ratio project so far is an end-to-end transcription pipeline based on fine-grained segmentation of text elements into characters and symbols. Our pipeline first partitions sentences and words into text segments. Most segments contain actual characters, but there are also segments with spurious ink strokes. (Perfect segmentation cannot be achieved without transcription. This result is known as Sayre’s Paradox.) Then, the pipeline submits all the segments to a deep convolutional neural network (CNN) for optical character recognition (OCR), and reassembles the resulting noisy labels into words and sentences using language statistics.
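The reassembly step can be illustrated with a toy sketch. This is not the project's implementation: the `best_word` function, the `#` label for spurious strokes, the OCR scores and the tiny word-frequency lexicon are all hypothetical, chosen only to show how language statistics can override noisy per-segment labels.

```python
from itertools import product

NON_CHAR = "#"  # hypothetical label standing in for a spurious ink stroke


def best_word(segment_probs, lexicon):
    """Pick the word that best explains the OCR scores and the word statistics.

    segment_probs: list of dicts, one per segment, mapping label -> OCR probability.
    lexicon: dict mapping known words to a relative frequency (toy language model).
    """
    best, best_score = None, 0.0
    label_sets = [list(probs.items()) for probs in segment_probs]
    # Enumerate every combination of one label per segment.
    for combo in product(*label_sets):
        ocr_score = 1.0
        chars = []
        for label, prob in combo:
            ocr_score *= prob
            if label != NON_CHAR:  # spurious strokes contribute no character
                chars.append(label)
        word = "".join(chars)
        # Words outside the lexicon get zero weight in this toy model.
        score = ocr_score * lexicon.get(word, 0.0)
        if score > best_score:
            best, best_score = word, score
    return best, best_score


# Made-up OCR output for four segments of one word image.
segments = [
    {"p": 0.9, "b": 0.1},
    {"a": 0.6, "o": 0.4},
    {NON_CHAR: 0.5, "i": 0.5},  # ambiguous segment, possibly a stray stroke
    {"x": 0.7, "r": 0.3},
]
lexicon = {"pax": 0.02, "par": 0.005, "bos": 0.001}

word, score = best_word(segments, lexicon)
print(word)
```

Exhaustive enumeration is only practical for short words and few candidate labels; a real system would need a more efficient search and a richer language model.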

Samples of the crowdsourcing platform
Sample screenshots of our crowdsourcing platform. In the right-hand figure, the segments forming the target character ‘a’ have been selected by a student.

Our OCR network has a total of 23 classes (including minuscule characters of the Latin alphabet) and is designed following recent progress in deep learning, especially recent neural network models for character-level classification. Its most notable feature is a special "non-character" class, handling spurious stroke combinations from the segmentation step. Other features include 56 x 56 single-channel input images and 8 adaptable layers: 3 convolutional layers, each applying 2 x 2 stride 2 max-pooling, and 2 feed-forward layers.
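As a quick check of the geometry described above, the following sketch computes how the spatial size of a 56 x 56 input shrinks through three 2 x 2, stride 2 max-pooling steps. It assumes that the convolutions themselves preserve the spatial size (for example through padding), which the description does not state; the function names are illustrative only.

```python
def pooled_size(size, kernel=2, stride=2):
    """Output side length of a max-pooling layer without padding."""
    return (size - kernel) // stride + 1


def feature_map_sizes(input_size=56, num_pool_layers=3):
    """Side length of the feature maps after each pooling step.

    Assumption: each convolution keeps the spatial size unchanged,
    so only the pooling layers shrink the feature maps.
    """
    sizes = [input_size]
    for _ in range(num_pool_layers):
        sizes.append(pooled_size(sizes[-1]))
    return sizes


sizes = feature_map_sizes()
print(sizes)  # [56, 28, 14, 7]
```

Under this assumption, the feed-forward layers would receive 7 x 7 feature maps per channel.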

We trained the network using a custom crowdsourcing procedure. Specifically, we implemented a dedicated crowdsourcing platform and enlisted more than a hundred high-school students to manually label the dataset. Each student was required to select, as in a jigsaw puzzle, all the pieces in a word image that visually match a given character symbol, with as few extra strokes as possible. Above we show a screenshot of a sample labeling task. To overcome the complexity of reading ancient scripts, we provided positive examples of each symbol (in green), and students were told to rely on visual patterns rather than trying to read. After a data augmentation process, the result is a high-quality dataset of 23,000 characters, which is publicly available online.

Our deep CNN trained on this dataset achieves an overall accuracy of 96%, which is one of the highest results reported in the literature so far. We observed that while humans can easily distinguish character symbols from stroke combinations that happen to resemble writing patterns, this turns out to be a hard task for an automatic classifier. Indeed, with respect to the non-character class, our classifier achieves 95% precision but only 74% recall. For this reason, our end-to-end pipeline leverages language statistics to tolerate a certain number of "false characters" from the OCR step. Our end-to-end system was able to produce a good transcription for almost 80% of the examined words, providing paleographers with a solid basis to speed up the transcription process at a large scale.
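To make the gap between precision and recall on the non-character class more concrete, the two figures can be combined into an F1 score (their harmonic mean). The F1 value is not reported in this article; it is derived here from the 95% and 74% figures purely as an illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures reported above for the "non-character" class.
precision, recall = 0.95, 0.74
f1 = f1_score(precision, recall)
print(round(f1, 3))  # 0.832
```

A recall of 74% means that roughly one spurious segment in four is mistaken for a real character, which is why the later language-model step has to tolerate "false characters".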



