
Donut: OCR-free Document Understanding Transformer

Background/Motivation

I use a Rocketbook to take handwritten class notes and an app to take pictures of these notes for OCR processing. The OCR software scans, preprocesses, identifies text regions, matches to character patterns, and corrects errors, yielding digital, searchable text. Beyond academic notes, OCR is vital in business for automating the understanding and processing of documents like invoices and receipts, streamlining data entry, archiving, and content management.

Common Tasks in Visual Document Understanding (VDU)

  • Document classification
  • Information extraction
  • Visual question answering

Problems with Traditional OCR for VDU

The traditional OCR-based approach to VDU has two stages:

  1. Stage 1: Capturing text using OCR.
  2. Stage 2: Modeling the document's holistic understanding.

(Figure: the conventional two-stage OCR-based VDU pipeline.)

Issues with OCR in VDU:

  • High Cost: OCR as pre-processing is computationally expensive.
  • Inflexibility: Struggles with language and document-type variations, leading to poor generalization ability.
  • Error Propagation: OCR mistakes affect the entire VDU process, especially with complex characters like Korean or Chinese.

How Donut Addresses these Problems

  • OCR-Free Mapping: Directly translates raw images into outputs, bypassing traditional OCR.
  • Transformer Architecture: Utilizes end-to-end Transformer-based models for reading and understanding documents.
  • Pre-training and Fine-Tuning: Employs a two-step training process involving pre-training on images and annotations and fine-tuning on specific tasks.
  • Language and Domain Flexibility: Achieves versatility across languages and domains using synthetic data during pre-training.
  • Performance: Demonstrates superior speed and accuracy in VDU tasks across various benchmarks and datasets.

Donut’s Approach

Donut uses a visual encoder to interpret the document image and a textual decoder to produce structured output in JSON, without relying on OCR. Both components are Transformer-based, so the whole model can be trained end-to-end.
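A minimal sketch of this encoder-decoder flow, assuming the Hugging Face transformers port of Donut and the publicly released naver-clova-ix/donut-base-finetuned-cord-v2 receipt-parsing checkpoint (the task prompt token differs per checkpoint, and receipt.jpg is a placeholder file name):

```python
import re
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

# The processor bundles image preprocessing and the tokenizer; the model is the
# visual-encoder / text-decoder pair trained end-to-end.
processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")

image = Image.open("receipt.jpg").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values  # (1, 3, H, W)

# The decoder is primed with a task-specific start token.
task_prompt = "<s_cord-v2>"
decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids

outputs = model.generate(
    pixel_values,
    decoder_input_ids=decoder_input_ids,
    max_length=model.decoder.config.max_position_embeddings,
    pad_token_id=processor.tokenizer.pad_token_id,
    eos_token_id=processor.tokenizer.eos_token_id,
    use_cache=True,
)

# Decode the generated tokens and convert the tagged sequence into JSON.
sequence = processor.batch_decode(outputs)[0]
sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()  # drop the task prompt token
print(processor.token2json(sequence))
```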

Gradio Demos
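The demos were served with Gradio. A minimal, hypothetical wrapper (the function name is illustrative, and it reuses the processor, model, and decoder_input_ids objects from the snippet above) might look like:

```python
import gradio as gr

def parse_document(image):
    """Run Donut on an uploaded document image and return the predicted JSON."""
    pixel_values = processor(image.convert("RGB"), return_tensors="pt").pixel_values
    outputs = model.generate(
        pixel_values,
        decoder_input_ids=decoder_input_ids,
        max_length=model.decoder.config.max_position_embeddings,
        pad_token_id=processor.tokenizer.pad_token_id,
        eos_token_id=processor.tokenizer.eos_token_id,
    )
    sequence = processor.batch_decode(outputs)[0]
    sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
    return processor.token2json(sequence)

demo = gr.Interface(fn=parse_document, inputs=gr.Image(type="pil"), outputs="json")
demo.launch()
```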

Brief Critical Analysis on Results

While Donut demonstrates the fastest inference time and superior performance across all tested datasets, the paper offers little error analysis; understanding where and why the model fails would be important both for further improvements and for trusting it in practice.

Donut Pseudocode

(Figure: Donut pseudocode.)

Question 1: What is the input to the encoder?

Answer: The document image is split into fixed-size patches; each 2D patch is flattened into a 1D vector and combined with an encoding of its position within the image. This lets the Transformer's attention mechanism focus on the most relevant patches.
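A minimal sketch of this patch-embedding step, assuming square 16x16 patches and a learned position embedding (Donut's actual encoder is a Swin Transformer; this simplified ViT-style module only illustrates the flatten-and-position-encode idea):

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into patches, project each flattened patch, and add a position embedding."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=256):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided conv is equivalent to flattening each patch and applying a linear projection.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, x):                    # x: (B, 3, 224, 224)
        x = self.proj(x)                     # (B, embed_dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)     # (B, 196, embed_dim): one token per patch
        return x + self.pos_embed            # position information added to every patch token

tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 256])
```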

Question 2: Why use patches and why not flatten the entire image into one long 1D array?

Hint 1: Consider how the number of comparisons in a Transformer's attention mechanism changes with the length of the input sequence.
Hint 2: Imagine moving an object in an image by a few pixels. How might this affect the input if it's represented pixel by pixel versus in patches?
Answer: Patches preserve local structure, directly capturing features like the top of a building within one cohesive region. Processing pixel by pixel is highly sensitive to shifts, causing translation variance where small changes have a disproportionate impact on the model, and it is also computationally intensive. Patches, on the other hand, give a much shorter input sequence, which means far fewer pairwise attention comparisons and a more efficient model overall (see the quick calculation below).
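A quick back-of-the-envelope check of the sequence lengths involved (attention cost grows with the square of sequence length), assuming a 224x224 input and 16x16 patches:

```python
# Self-attention compares every token with every other token, so cost grows quadratically.
img_h = img_w = 224
patch = 16

pixel_tokens = img_h * img_w                        # 50,176 tokens if each pixel is a token
patch_tokens = (img_h // patch) * (img_w // patch)  # 196 tokens with 16x16 patches

print(f"pixels:  {pixel_tokens:>7,} tokens -> ~{pixel_tokens**2:,} pairwise comparisons per layer")
print(f"patches: {patch_tokens:>7,} tokens -> ~{patch_tokens**2:,} pairwise comparisons per layer")
print(f"reduction: ~{pixel_tokens**2 // patch_tokens**2:,}x")
```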


Question 3: In line 6 of the pseudocode we encounter the multi-head attention mechanism, where 'Y' denotes the sequence of tokens generated up to this point. Considering how visual information is integrated, what is the role of 'z' in this context?

Answer: 'z' is the collection of embeddings produced by the encoder. Each embedding in 'z' corresponds to a distinct region of the input image, condensing both its visual content and its spatial layout into a form the Transformer's decoder can attend to and interpret (see the sketch below).
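A minimal sketch of that cross-attention step using PyTorch's nn.MultiheadAttention (dimensions are illustrative): queries come from the decoder's generated tokens Y, while keys and values come from the encoder's patch embeddings z.

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 256, 8
cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

y = torch.randn(1, 10, embed_dim)    # decoder states for the 10 tokens generated so far (Y)
z = torch.randn(1, 196, embed_dim)   # encoder output: one embedding per image patch (z)

# Queries come from the decoder, keys/values from the encoder, so each generated token
# can attend to the regions of the document image that are relevant to it.
attended, attn_weights = cross_attn(query=y, key=z, value=z)
print(attended.shape, attn_weights.shape)  # (1, 10, 256) and (1, 10, 196)
```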

Resource Links

Paper Citation

G. Kim, T. Hong, M. Yim, J. Nam, J. Yim, W. Hwang, S. Yun, D. Han, and S. Park, "OCR-free Document Understanding Transformer," arXiv:2111.15664 [cs.LG], Nov. 2021. Accessed: Mar. 6, 2024. [Online]. Available: https://arxiv.org/abs/2111.15664
