U-DIADS-Bib (Uniud - Document Image Analysis DataSet - Bible version) is an image dataset for pixel-precise document layout analysis. Layout analysis is a critical part of Document Image Analysis, particularly for ancient manuscripts. It is a foundational step that streamlines subsequent tasks such as optical character recognition and automated transcription.
Description
U-DIADS-Bib is a proprietary dataset developed through the collaboration of computer scientists and humanists at the University of Udine. It is composed of 200 images, 50 for each of its 4 manuscripts. These handwritten books were selected in collaboration with humanist partners, considering both the complexity of their layout and the presence of significant and semantically distinguishable elements. The images of the four manuscripts were collected from the digital library Gallica. All manuscripts are Latin and Syriac Bibles dating from between the 6th and 12th centuries A.D.:
- Paris, Bibliothèque nationale de France, the Second Bible of Charles the Bald, Latin 2
- Paris, Bibliothèque nationale de France, the Book of Ezra to the Book of Revelation, Latin 14396
- Paris, Bibliothèque nationale de France, the New Testament, Latin 16746
- Paris, Bibliothèque nationale de France, the Old Testament in the Syriac Peshitta version, Syriaque 341
U-DIADS-Bib consists of 50 unique color page images for each manuscript, saved in JPEG format with a resolution of 1344×2016 pixels. Each page is paired with its respective Ground Truth (GT) data, stored in a PNG image of identical dimensions to the original.
The Ground Truths encompass six distinct, non-overlapping annotated classes (background, paratext, decoration, main text, title, and chapter headings), each encoded by an RGB color as follows:
- Background, Black
- Paratext, Yellow
- Decoration, Cyan
- Main Text, Magenta
- Title, Red
- Chapter Headings, Lime
The Syriaque 341 manuscript contains only five of these semantic classes: it has no Chapter Headings.
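The color encoding listed above can be turned into per-pixel class indices with a small lookup table. The following is a minimal sketch, not the authors' tooling: the source gives only color names, so the pure RGB triplets, the class index order, and the helper names `decode_pixels` and `class_histogram` are all assumptions that should be checked against the actual GT files.

```python
from collections import Counter

# ASSUMPTION: each color name maps to its pure RGB value (e.g. Lime = 0,255,0).
# The class indices 0..5 simply follow the order of the list above.
CLASS_COLORS = {
    (0, 0, 0): 0,        # Background, Black
    (255, 255, 0): 1,    # Paratext, Yellow
    (0, 255, 255): 2,    # Decoration, Cyan
    (255, 0, 255): 3,    # Main Text, Magenta
    (255, 0, 0): 4,      # Title, Red
    (0, 255, 0): 5,      # Chapter Headings, Lime
}
CLASS_NAMES = [
    "Background", "Paratext", "Decoration",
    "Main Text", "Title", "Chapter Headings",
]


def decode_pixels(rgb_pixels):
    """Map an iterable of (R, G, B) tuples to a list of class indices.

    Raises ValueError on any color outside the assumed palette, which
    helps catch a wrong palette assumption early.
    """
    labels = []
    for position, pixel in enumerate(rgb_pixels):
        key = tuple(pixel[:3])  # ignore an alpha channel if present
        if key not in CLASS_COLORS:
            raise ValueError(f"unexpected color {key} at pixel {position}")
        labels.append(CLASS_COLORS[key])
    return labels


def class_histogram(labels):
    """Count how many pixels fall into each class, keyed by class name."""
    counts = Counter(labels)
    return {name: counts.get(i, 0) for i, name in enumerate(CLASS_NAMES)}


if __name__ == "__main__":
    # Three hand-written pixels standing in for a flattened GT image.
    sample = [(0, 0, 0), (255, 0, 255), (255, 0, 255)]
    print(class_histogram(decode_pixels(sample)))
```

Under these assumptions, a GT page from Syriaque 341 would be expected to show a zero count for Chapter Headings. Failing loudly on unknown colors is a deliberate choice here, since a silent fallback to Background would hide a mismatched palette.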
For each of the manuscripts included in U-DIADS-Bib, 10 images have been selected for the training set, 10 for the validation set and the remaining 30 for the test set.
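The split sizes above can be cross-checked against the stated total of 200 images. This short sketch only encodes the counts given in the text. The variable names are illustrative, and it makes no claim about which specific pages belong to which split.

```python
# Manuscript shelfmarks and per-manuscript split sizes as stated in the text.
MANUSCRIPTS = ["Latin 2", "Latin 14396", "Latin 16746", "Syriaque 341"]
SPLIT_SIZES = {"training": 10, "validation": 10, "test": 30}

# Images per manuscript and across the whole dataset.
images_per_manuscript = sum(SPLIT_SIZES.values())
total_images = images_per_manuscript * len(MANUSCRIPTS)

# Size of each split summed over all manuscripts.
totals_per_split = {
    name: size * len(MANUSCRIPTS) for name, size in SPLIT_SIZES.items()
}

print(images_per_manuscript)  # 50
print(total_images)           # 200
print(totals_per_split)       # {'training': 40, 'validation': 40, 'test': 120}
```

The test set is therefore three times the size of the training set, so only 20% of the annotated pages are available for training.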
Moreover, because we believe that layout analysis techniques will be broadly adopted in real-world use only if the amount of manually labeled training data they require is reduced, we also provide a standardized few-shot version of our dataset called U-DIADS-BibFS, which we hope will encourage further efforts towards this goal.
Sample pages from the four manuscripts (figure): Latin 2, Latin 14396, Latin 16746, Syriaque 341.
Ground Truth visualizations of the six annotation classes for sample pages from the four manuscripts (figure): Latin 2, Latin 14396, Latin 16746, Syriaque 341.
A full description of the dataset is given in: Zottin, S., De Nardin, A., Colombi, E. et al. U-DIADS-Bib: a full and few-shot pixel-precise dataset for document layout analysis of ancient manuscripts. Neural Comput & Applic (2024). https://doi.org/10.1007/s00521-023-09356-5 (also available on arXiv).
Citation
For any scientific publication using this data, the following paper should be cited:
@Article{Zottin2024,
  author={Zottin, Silvia and De Nardin, Axel and Colombi, Emanuela and Piciarelli, Claudio and Pavan, Filippo and Foresti, Gian Luca},
  title={U-DIADS-Bib: a full and few-shot pixel-precise dataset for document layout analysis of ancient manuscripts},
  journal={Neural Computing and Applications},
  year={2024},
  month={Jan},
  day={16},
  issn={1433-3058},
  doi={10.1007/s00521-023-09356-5},
  url={https://doi.org/10.1007/s00521-023-09356-5}
}