Description

Text line segmentation is a fundamental phase of Document Image Analysis and the subject of significant ongoing research. Its goal is to identify the logical structure of a document. In the context of ancient and handwritten documents, text line segmentation becomes even more challenging due to factors such as irregular handwriting styles, faded ink, degraded paper, and inconsistent formatting. These documents often exhibit complex layouts, including overlapping lines, non-linear text flow, and variations in line spacing, which make accurate automated segmentation difficult. Traditional segmentation approaches typically require large annotated datasets to train machine learning models effectively; in historical and handwritten document analysis, however, obtaining such datasets is often impractical due to the scarcity and uniqueness of the source material.

To this end, we challenge researchers in the ICDAR community to develop a framework capable of accurately identifying text lines in ancient manuscript pages with pixel-level precision. To make the challenge more interesting, we built a novel dataset of multi-language, multi-column manuscripts with highly heterogeneous layouts, mimicking a real-world application scenario as closely as possible. Furthermore, participants must perform text line segmentation on the proposed dataset in a few-shot setting: for each manuscript in the dataset, only three images, with their corresponding ground truths, are provided as the training set.

Dataset

The dataset used for the competition is the Uniud - Document Image Analysis DataSet - Text Line version (U-DIADS-TL), a proprietary dataset developed through a collaboration between computer scientists and humanities scholars at the University of Udine. It is a collection of 105 images, 35 for each of three distinct ancient manuscripts (Latin 2, Latin 14396, and Syriaque 341), which differ in layout structure, degradation level due to preservation and aging, and the alphabet in which they are written. These characteristics make the competition all the more intriguing and challenging.

U-DIADS-TL consists of 35 unique color page images for each manuscript, saved in JPEG format at a resolution of 1344×2016 pixels. Each page is paired with its Ground Truth (GT) annotation, stored as a PNG image of the same size as the original. The GTs comprise two distinct, non-overlapping annotated classes: background and text lines.
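Since the GTs contain only two classes, they can be turned into binary masks for training. The sketch below shows one way to do this with Pillow and NumPy; it assumes background pixels are encoded as pure black and everything else belongs to the text-line class, which may need adjusting to the dataset's actual colour coding.

```python
import numpy as np
from PIL import Image

def load_gt_mask(gt_path):
    """Load a U-DIADS-TL ground-truth PNG as a binary mask.

    Assumption (not confirmed by the dataset description): background
    pixels are pure black, and any non-black pixel is a text line.
    Returns a uint8 array where 1 = text line, 0 = background.
    """
    gt = np.array(Image.open(gt_path).convert("RGB"))
    mask = np.any(gt != 0, axis=-1)  # True wherever any channel is non-zero
    return mask.astype(np.uint8)
```

The resulting mask has the same 2016×1344 (rows × columns) shape as the page image and can be fed directly to standard segmentation losses.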

For the competition, participants are provided with 13 images per manuscript together with the corresponding ground truths: 3 constitute the training set, while the remaining 10 can be used for validation. The 15 test images will be released approximately 10 days before the competition deadline, allowing participants to create and submit their segmentation masks.
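Because submissions are judged at the pixel level, it is worth sanity-checking predicted masks against the validation GTs before the deadline. The official evaluation metric is not restated here; the snippet below sketches intersection-over-union, a common pixel-level segmentation measure, as a stand-in for local validation.

```python
import numpy as np

def pixel_iou(pred, gt):
    """Intersection-over-union between two binary masks (1 = text line).

    A simple local validation measure; the competition's official metric
    may differ.
    """
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: perfect agreement
    inter = np.logical_and(pred, gt).sum()
    return inter / union
```

Running this over the 10 validation pages per manuscript gives a quick estimate of how well a few-shot model generalises beyond its 3 training images.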

Download dataset: U-DIADS-TL.zip