Multimedia Computing and Computer Vision Lab

An Annotated Dataset of Shredded Documents

From Multimedia Computing Lab - University of Augsburg


We have created two novel annotated datasets of hand-torn document pages, which are intended for testing of automatic reassembly algorithms.

  • Our first dataset (called bdw082010) consists of 48 double-sided sheets taken from a scientific magazine (Bild der Wissenschaft, issue 08/2010). Each of these sheets has been torn into 8, 16, 24, and 32 fragments. Since each sheet has a front and a back side, the complete dataset contains 96 pages. It features a variety of contents, including text, illustrations, and layout elements such as tables and diagrams. The dataset contains all digitized pieces (organized in subfolders according to their sheet number). For digitization we used an off-the-shelf scanner equipped with a unicolor (green) foil to facilitate postprocessing.
  • For the second dataset (called booklet) we used an information brochure which was printed on thicker paper than the bdw082010 sheets. Thus, fragments of this dataset feature slightly different physical characteristics along their tearing boundaries. The booklet dataset consists of a total of 48 pages and is meant only for evaluation purposes. The digitization procedure was the same as for the first dataset.

Figure: Pieces of one page that are positioned and oriented randomly

Annotation (Ground Truth)

  • Pixel-level annotations (binary masks): After each fragment had been scanned from both sides, we separated the foreground (i.e., the content of the piece) from the unicolor background. In addition, the observed contour of each piece has been approximated by a polygon, which is represented by its support points (a subset of the contour points). Both binary masks and support points are part of each dataset.
  • Manually reconstructed pages: We created an annotation tool which allows a human user to manually reconstruct a page. Provided with all pieces of a single page, the user had to arrange the digitized pieces correctly by translating and rotating them individually. Once the user has finished the manual reconstruction, the annotation tool automatically determines positive correspondences between piece-pairs (i.e., adjacent support points across each pair of pieces) and stores the result.
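
The relationship between the rigid transformations and the point correspondences can be illustrated with a minimal Python sketch. It is not the annotation tool itself: the function names, the example coordinates, and the distance threshold `max_dist` are assumptions made for illustration, and the actual file format of the ground truth is not reproduced here. Each piece is rotated and translated into page coordinates, and support points of two pieces that end up close to each other are reported as correspondences.

```python
import math

def apply_rigid(points, angle_deg, tx, ty):
    """Rotate points about the origin by angle_deg, then translate by (tx, ty)."""
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]

def find_correspondences(points_a, points_b, max_dist=2.0):
    """Return index pairs (i, j) of support points that lie within max_dist.

    max_dist is a hypothetical threshold in pixels, chosen for this sketch.
    """
    pairs = []
    for i, (xa, ya) in enumerate(points_a):
        for j, (xb, yb) in enumerate(points_b):
            if math.hypot(xa - xb, ya - yb) <= max_dist:
                pairs.append((i, j))
    return pairs

# Two hypothetical pieces, each given by its support points in scan coordinates.
piece_a = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
piece_b = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]

# Hypothetical rigid transformations that place both pieces side by side.
placed_a = apply_rigid(piece_a, 0.0, 0.0, 0.0)
placed_b = apply_rigid(piece_b, 90.0, 20.0, 0.0)

# Support points along the shared edge are reported as adjacent.
matches = find_correspondences(placed_a, placed_b)
print(matches)
```

In this toy example the two squares share one edge after placement, so only the support points on that edge are paired.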

An example of two pages (front and back side of a single sheet) of the bdw082010 dataset is given below. The manually reconstructed pages are shown on the left; the scanned pieces are depicted on the right:

Datasplits

Each page has been categorized as featuring a picture, text, or a combination of both. Afterwards, we distributed all pages across three disjoint sets: {train}, {val}, and {test}. We took care to ensure that each partition is representative of the whole dataset. The bdw082010 and the booklet dataset were preprocessed separately.
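
One common way to obtain disjoint partitions that each reflect the content categories is a stratified split. The Python sketch below shows the general idea only. The split fractions, the random seed, the category labels assigned to the example pages, and the function name are all assumptions; the actual partition sizes are those provided with the dataset.

```python
import random
from collections import defaultdict

def stratified_split(pages, fractions=(0.5, 0.25, 0.25), seed=0):
    """Split (page_id, category) pairs into disjoint train/val/test lists.

    Each category is shuffled and divided separately, so every partition
    receives roughly the same category mix. The fractions are hypothetical.
    """
    by_category = defaultdict(list)
    for page_id, category in pages:
        by_category[category].append(page_id)

    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for category in sorted(by_category):
        ids = by_category[category]
        rng.shuffle(ids)
        n_train = round(fractions[0] * len(ids))
        n_val = round(fractions[1] * len(ids))
        splits["train"] += ids[:n_train]
        splits["val"] += ids[n_train:n_train + n_val]
        splits["test"] += ids[n_train + n_val:]
    return splits

# 96 pages as in bdw082010; the category assignment here is made up.
categories = ("picture", "text", "mixed")
pages = [(i, categories[i % 3]) for i in range(96)]
splits = stratified_split(pages)
print({name: len(ids) for name, ids in splits.items()})
```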

The following table provides an overview of all available datasplits:

Table: Datasplits of the bdw082010 and booklet datasets

Evaluation

Details about our two different quantitative performance measures can be found in the related papers [1,5] listed below. In summary, based on our ground truth, we proposed two measures to assess the quality of reconstruction results:

  • mean Adjustment Cost (mAC): This performance measure quantifies the degree of misalignment between pieces that were adjacent in the original document. The evaluation in [1] is based on the idea that reassembled document pages should entail low costs whenever the pieces' relative position is accurate.
  • mean Average Precision (mAP): Average Precision is a very common measure for the evaluation of object detection and image retrieval systems. In [5] we explained how it can be adapted for the evaluation of a document reconstruction system.
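
For readers unfamiliar with Average Precision, the following Python sketch computes the standard retrieval form of the measure. It does not reproduce the specific adaptation described in [5]. As an assumption for illustration, each "query" is a ranked list of candidate matches, where `True` marks a candidate that agrees with the ground truth; the function names and example rankings are hypothetical.

```python
def average_precision(ranked_labels):
    """Average Precision of one ranked list (True = correct candidate).

    Precision is evaluated at the rank of every correct candidate and
    averaged over the number of correct candidates in the list.
    """
    num_positives = sum(ranked_labels)
    if num_positives == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, is_correct in enumerate(ranked_labels, start=1):
        if is_correct:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / num_positives

def mean_average_precision(rankings):
    """Mean of the Average Precision values over all ranked lists."""
    return sum(average_precision(r) for r in rankings) / len(rankings)

# Two hypothetical ranked candidate lists, ordered by decreasing score.
query_1 = [True, False, True, False]  # AP = (1/1 + 2/3) / 2
query_2 = [False, True]               # AP = (1/2) / 1
m_ap = mean_average_precision([query_1, query_2])
print(round(m_ap, 4))
```

A ranking that places all correct candidates first yields an Average Precision of 1, and the value drops as incorrect candidates are ranked above correct ones.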

MATLAB code

To facilitate the understanding of our annotation, we wrote a simple MATLAB script which reads in the ground truth and visualizes all point-correspondences between piece-pairs. A second script reads the rigid transformations to reposition all pieces as in the manually reconstructed page. These scripts are part of the download, which can be requested by e-mail (see below).

Acknowledgements

We thank the editorial staff of Bild der Wissenschaft and the publisher Konradin Medien GmbH for their permission to use the magazine and to make the dataset publicly available for research.

Citation

If you use this dataset in your work, please cite the following paper:

[1] Fabian Richter, Christian X. Ries, Nicolas Cebron, Rainer Lienhart.
Learning to Reassemble Shredded Documents,
IEEE Transactions on Multimedia, 2012. DOI: 10.1109/TMM.2012.2235415

Related Papers

[2] Fabian Richter, Christian Eggert, Rainer Lienhart.
Fisher Vector Encoding of Micro Color Features for (Real World) Jigsaw Puzzles,
International Conference on Document Analysis and Recognition (ICDAR), 2015 (to appear)

[3] Fabian Richter, Christian X. Ries, Rainer Lienhart.
Evaluation of Discriminative Models for the Reconstruction of Hand-Torn Documents,
Asian Conference on Computer Vision (ACCV), Singapore, November 2014. Published by Springer; the final publication is available at link.springer.com.

[4] Fabian Richter, Christian X. Ries, Stefan Romberg, Rainer Lienhart.
Partial Contour Matching for Document Pieces with Content-Based Prior,
IEEE International Conference on Multimedia and Expo 2014 (ICME), Chengdu, July 2014.

[5] Fabian Richter, Christian X. Ries, Rainer Lienhart.
A Graph Algorithmic Framework for the Assembly of Shredded Documents,
IEEE International Conference on Multimedia and Expo 2011 (ICME11), Barcelona, July 2011.
Also Technical Report 2011-05, University of Augsburg, Institute of Computer Science, March 2011.

Download

If you wish to obtain the dataset, please send an email to Christian Eggert.