Evaluation
The evaluation phase involves testing participants' systems on 15 private and unpublished images per manuscript once the competition is finished.
Both Pixel IU and Line IU are based on the Intersection over Union (IU) as the main metric, defined as:
IU = TP / (TP + FP + FN)
where TP denotes the True Positives, FP the False Positives and FN the False Negatives.
Pixel IU measures IU at the pixel level, assessing how accurately the lines have been detected. Here, TP refers to correctly detected pixels,
FP to extra pixels, and FN to missed pixels.
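As an illustration, Pixel IU can be computed from binary masks along the following lines (a minimal sketch in NumPy; the function name and mask convention are assumptions, not the official evaluation script):

import numpy as np

def pixel_iu(pred: np.ndarray, gt: np.ndarray) -> float:
    """Compute Pixel IU between two binary masks of the same shape.

    Foreground (text-line) pixels are assumed to be non-zero.
    """
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()    # correctly detected pixels
    fp = np.logical_and(pred, ~gt).sum()   # extra pixels
    fn = np.logical_and(~pred, gt).sum()   # missed pixels
    denom = tp + fp + fn
    return tp / denom if denom > 0 else 1.0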
Line IU refers to IU at the line level, evaluating how many lines have been correctly detected. In this case, TP represents the number of correctly
detected lines, FP represents extra lines, and FN represents missed lines.
To evaluate Line IU, a threshold is used to determine the matching between the predicted connected components and the ground-truth connected components.
Two components are considered a match, and the line is deemed correctly detected, if both the pixel precision and the pixel recall of the pair exceed a threshold of 75%.
If the pixel precision is below the threshold, the predicted line is counted as a FP; if the recall is below the threshold, the ground-truth line is counted as a FN.
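The following is a simplified sketch of this matching criterion, assuming labelled connected-component maps in which 0 is background and each positive integer identifies one line (the function name and labelling convention are assumptions, not the official evaluation tool; here any unmatched prediction counts as a FP and any unmatched ground-truth line as a FN):

import numpy as np

def match_lines(pred_labels: np.ndarray, gt_labels: np.ndarray, thr: float = 0.75):
    """Count TP/FP/FN lines by matching predicted and ground-truth components."""
    tp = fp = 0
    matched_gt = set()
    for p in np.unique(pred_labels):
        if p == 0:
            continue
        pred_mask = pred_labels == p
        # Ground-truth labels overlapping this predicted component.
        overlap = gt_labels[pred_mask]
        overlap = overlap[overlap > 0]
        if overlap.size == 0:
            fp += 1
            continue
        g = np.bincount(overlap).argmax()      # most overlapped ground-truth line
        gt_mask = gt_labels == g
        inter = np.logical_and(pred_mask, gt_mask).sum()
        precision = inter / pred_mask.sum()
        recall = inter / gt_mask.sum()
        if precision >= thr and recall >= thr:
            tp += 1
            matched_gt.add(int(g))
        else:
            fp += 1
    # Ground-truth lines that were never matched are missed lines.
    fn = sum(1 for g in np.unique(gt_labels) if g != 0 and int(g) not in matched_gt)
    return tp, fp, fn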
The matching between a detected region and a ground-truth region is quantified by the MatchScore:
MatchScore(i, j) = T(G_j ∩ R_i) / T(G_j ∪ R_i)
where T(s) counts the number of points in the set s, G_j is the set of points of the j-th ground-truth line, and R_i is the set of points of the i-th detected line.
A region pair (i, j) is considered a one-to-one match if MatchScore(i, j) is greater than or equal to the threshold T_a. In this competition, T_a = 75%. Let N_1 and N_2 be the number of ground-truth and detected lines, respectively, and let M be the number of one-to-one matches. The evaluation tool calculates the detection rate (DR), recognition accuracy (RA), and F-measure (FM) as follows:
DR = M / N_1
RA = M / N_2
FM = (2 * DR * RA) / (DR + RA)
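As an illustration, a minimal sketch of these metrics in Python, assuming each line is represented as a set of (row, col) pixel coordinates (the helper names are illustrative and not taken from the official evaluation script):

def match_score(g: set, r: set) -> float:
    """MatchScore = T(G ∩ R) / T(G ∪ R), with T(s) = |s|."""
    union = g | r
    return len(g & r) / len(union) if union else 0.0

def dr_ra_fm(gt_lines, pred_lines, t_a: float = 0.75):
    """Compute DR, RA and FM from lists of pixel-coordinate sets."""
    n1, n2 = len(gt_lines), len(pred_lines)
    used = set()   # each predicted line may take part in at most one match
    m = 0
    for g in gt_lines:
        for i, r in enumerate(pred_lines):
            if i not in used and match_score(g, r) >= t_a:
                used.add(i)
                m += 1
                break
    dr = m / n1 if n1 else 0.0
    ra = m / n2 if n2 else 0.0
    fm = 2 * dr * ra / (dr + ra) if (dr + ra) else 0.0
    return dr, ra, fm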
These evaluation metrics are calculated individually for each manuscript.
The final score, which defines the leaderboard, is obtained by averaging the Line IU metric over the three handwritten manuscripts.
We provide the Python script with the evaluation code that we will use for the final test. Download: coming soon!
Data augmentation and additional datasets
The use of any data augmentation is allowed for this challenge, as is the use of external, public data for pre-training purposes; however, the use of the latter must be clearly stated by the authors at the time of submission.
Altering the train-validation-test splits of the dataset provided for the competition is, however, strictly forbidden under penalty of exclusion from the competition. In particular, it is strictly forbidden to use the validation images as part of the training set!
Submission
By the submission date, participants must send the following via email to the organizers:
- The Google Colab notebook containing executable training and evaluation code (including libraries). Note: the code should be set up so that, when the notebook is run, the dataset and the folders required by the evaluation code are read automatically (see the sketch after this list).
- The segmentation maps predicted by the model for the test set data.
- The checkpoint of the final, pre-trained version of the proposed model.
- A short report (maximum 3 pages) describing the method and approach, the results for each manuscript, and the average over all manuscripts for the metrics previously outlined.
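As an illustration of the automatic folder handling mentioned above, a hypothetical notebook cell could look like the following (the directory names are placeholders, not prescribed by the organizers):

from pathlib import Path

# Placeholder paths: adapt to wherever the dataset is mounted in your Colab runtime.
DATASET_ROOT = Path("/content/dataset")          # hypothetical dataset location
PREDICTIONS_DIR = Path("/content/predictions")   # where predicted segmentation maps are written
PREDICTIONS_DIR.mkdir(parents=True, exist_ok=True)

# Collect the test images of every manuscript without manual intervention.
test_images = sorted(DATASET_ROOT.glob("*/test/*.png"))
print(f"Found {len(test_images)} test images")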
In case of suspicious results, we reserve the right to retrain the model provided by the participants to validate the reported results. For this reason, we invite all participants to write their code with an eye to reproducibility (e.g. explicitly seeding RNG components), so as to avoid inconsistencies with the submitted results.
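For example, a minimal seeding block, assuming a PyTorch-based pipeline (adapt to your framework of choice):

import random

import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    """Seed all RNG components so that training runs are reproducible."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Prefer deterministic kernels where available (may slow training down).
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)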