Semi-automatic Segmentation & Alignment of Handwritten




Data & Software


The main data set used in this project is Labour’s Memory, the people’s movement archive for Uppsala County, also known as the Labour’s movement archive. It consists of digitised and annotated documents from the period 1892 to 1985. The data set contains 1836 .jpg images divided into 31 folders, each corresponding to the department or area that the images relate to. Each document image has three additional files associated with it. The first is a .txt file containing the raw digitised transcription with correct line breaks. The second is an .xml file containing metadata and information about the image, together with coordinates for the text region and each text line and a digitised transcription. The third file records when and how the document was processed for OCR and also contains coordinates for the page, text lines, baselines, words, and spaces.
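The file layout described above can be walked with a short script. The following is a minimal sketch, not the project's own loader: it assumes that the companion files sit next to each image and share its base name (for example page1.jpg, page1.txt, page1.xml), which the text does not state. The function name `collect_documents` and the dictionary keys are hypothetical, and because the name of the third file is not given, it is simply collected under "other".

```python
from pathlib import Path


def collect_documents(root):
    """Group every .jpg page under `root` with its companion files.

    Assumption (not stated in the text): companion files are stored in the
    same folder as the image and are named "<image stem>.<something>".
    """
    documents = []
    for image in sorted(Path(root).rglob("*.jpg")):
        prefix = image.stem + "."
        # Siblings that share the image's base name, excluding the image itself.
        companions = sorted(
            p for p in image.parent.iterdir()
            if p.is_file() and p.name.startswith(prefix) and p != image
        )
        transcript = next((p for p in companions if p.name == prefix + "txt"), None)
        layout = next((p for p in companions if p.name == prefix + "xml"), None)
        documents.append({
            "folder": image.parent.name,  # department or area folder
            "image": image,
            "transcript": transcript,     # raw transcription (.txt)
            "layout": layout,             # metadata and coordinates (.xml)
            # Anything else, e.g. the OCR processing file (name unknown here).
            "other": [p for p in companions if p not in (transcript, layout)],
        })
    return documents
```

Calling `collect_documents` on the data set root would return one entry per page image. Entries whose "transcript" or "layout" value is None point to pages with missing ground truth, which is worth checking before any evaluation run.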


The data sets accessed during the project are listed in Table 1. Labour’s Memory and Demokrati 100 are written in Swedish. In contrast, the IAM data set is written in English and is publicly available for non-commercial research purposes only. It is provided by the Research Group on Computer Vision and Artificial Intelligence INF, University of Bern (Marti & Bunke 2002). Labour’s Memory and IAM were used for evaluation and testing during development, while Demokrati 100 was only used for internal testing of the algorithm during development.
Table 1: Information about the data sets used in the project.

Data set           Pages (no.)   Ground truth                    Format
Labour’s Memory    1836          transcript & word/line boxes    JPG
IAM                1539          transcript & word/line boxes    PNG
Demokrati 100      4487          transcript & word/line boxes    JPG

    1. Complications with the data

One crucial factor to consider is the complexity of the data in the Labour’s Memory data set. Two images from this data set can look very different depending on the period in which they were created. The overall quality of the documents is very heterogeneous: although most images contain some form of noise, the amount varies considerably (see Figure 3). Some documents are handwritten, while others are typewritten. Handwritten documents are often very challenging to preprocess because of the variation found in handwriting. Some documents have holes from a hole punch; if these holes are not removed during preprocessing, they will most likely affect the positioning of the bounding boxes further down the pipeline. Additionally, in some documents characters overlap between lines. The characters ’f’ and ’g’ commonly extend into a neighbouring text line, which makes line and word separation more complex. Figure 3 shows two example images from the Labour’s Memory data set with varying amounts of noise and in different styles.

Figure 3: Two example images taken from the Labour’s Memory data set which differ in style. (a) Year 1967 (b) Year 1899
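The overlap between neighbouring text lines can be checked directly on the ground-truth line boxes. The sketch below is illustrative only and is not the method used in this project. It assumes each line box is given as (x, y, width, height) with the origin in the top-left corner of the page, which may differ from the coordinate format stored in the data set; the function name `overlapping_lines` is hypothetical.

```python
def overlapping_lines(boxes):
    """Return (upper, lower, overlap) for vertically adjacent boxes that overlap.

    Each box is (x, y, width, height) with the origin at the top-left corner.
    This box layout is an assumption, not the data set's documented format.
    """
    # Visit the lines from top to bottom while keeping their original indices.
    order = sorted(range(len(boxes)), key=lambda i: boxes[i][1])
    pairs = []
    for upper, lower in zip(order, order[1:]):
        ux, uy, uw, uh = boxes[upper]
        lx, ly, lw, lh = boxes[lower]
        # Positive when the bottom of the upper box lies below the top of the lower box.
        vertical = (uy + uh) - ly
        # Positive when the two boxes share some horizontal extent.
        horizontal = min(ux + uw, lx + lw) - max(ux, lx)
        if vertical > 0 and horizontal > 0:
            pairs.append((upper, lower, vertical))
    return pairs
```

For example, `overlapping_lines([(10, 0, 200, 40), (10, 35, 200, 40), (10, 90, 200, 40)])` reports that the first two lines overlap by 5 pixels, while the third line is clear of the second. Pages with many such pairs are the ones where descenders are likely to complicate line and word separation.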


