
Donut: OCR-free Document Understanding Transformer

Background/Motivation

I use a Rocketbook to take handwritten class notes and an app to take pictures of these notes for OCR processing. The OCR software scans, preprocesses, identifies text regions, matches to character patterns, and corrects errors, yielding digital, searchable text. Beyond academic notes, OCR is vital in business for automating the understanding and processing of documents like invoices and receipts, streamlining data entry, archiving, and content management.

Common Tasks in Visual Document Understanding (VDU)

  • Document classification
  • Information extraction
  • Visual question answering

Problems with Traditional OCR for VDU

The traditional OCR-based approach to VDU has two stages:

  1. Stage 1: Capturing text using OCR.
  2. Stage 2: Modeling the document's holistic understanding.

(Figure: the conventional two-stage OCR-based VDU pipeline.)

Issues with OCR in VDU:

  • High Cost: OCR as pre-processing is computationally expensive.
  • Inflexibility: Struggles with language and document-type variations, leading to poor generalization ability.
  • Error Propagation: OCR mistakes affect the entire VDU process, especially with complex characters like Korean or Chinese.

How Donut Addresses these Problems

  • OCR-Free Mapping: Directly translates raw images into outputs, bypassing traditional OCR.
  • Transformer Architecture: Utilizes end-to-end Transformer-based models for reading and understanding documents.
  • Pre-training and Fine-Tuning: Employs a two-step training process involving pre-training on images and annotations and fine-tuning on specific tasks.
  • Language and Domain Flexibility: Achieves versatility across languages and domains using synthetic data during pre-training.
  • Performance: Demonstrates superior speed and accuracy in VDU tasks across various benchmarks and datasets.

Donut’s Approach

Donut uses a visual encoder to interpret the document image and a textual decoder to produce structured output in JSON, without relying on OCR. Both components are Transformer-based, so the whole model can be trained end-to-end.
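A minimal sketch of this encoder-decoder flow, assuming the Hugging Face transformers port of Donut and the publicly released naver-clova-ix/donut-base-finetuned-cord-v2 receipt-parsing checkpoint (the task prompt token differs per checkpoint, and receipt.jpg is a placeholder file name):

```python
import re
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

# The processor bundles image preprocessing and the tokenizer; the model is the
# visual-encoder / text-decoder pair trained end-to-end.
processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")

image = Image.open("receipt.jpg").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values  # (1, 3, H, W)

# The decoder is primed with a task-specific start token.
task_prompt = "<s_cord-v2>"
decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids

outputs = model.generate(
    pixel_values,
    decoder_input_ids=decoder_input_ids,
    max_length=model.decoder.config.max_position_embeddings,
    pad_token_id=processor.tokenizer.pad_token_id,
    eos_token_id=processor.tokenizer.eos_token_id,
    use_cache=True,
)

# Decode the generated tokens and convert the tagged sequence into JSON.
sequence = processor.batch_decode(outputs)[0]
sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()  # drop the task prompt token
print(processor.token2json(sequence))
```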

Gradio Demos
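The demos were served with Gradio. A minimal, hypothetical wrapper (the function name is illustrative, and it reuses the processor, model, and decoder_input_ids objects from the snippet above) might look like:

```python
import gradio as gr

def parse_document(image):
    """Run Donut on an uploaded document image and return the predicted JSON."""
    pixel_values = processor(image.convert("RGB"), return_tensors="pt").pixel_values
    outputs = model.generate(
        pixel_values,
        decoder_input_ids=decoder_input_ids,
        max_length=model.decoder.config.max_position_embeddings,
        pad_token_id=processor.tokenizer.pad_token_id,
        eos_token_id=processor.tokenizer.eos_token_id,
    )
    sequence = processor.batch_decode(outputs)[0]
    sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
    return processor.token2json(sequence)

demo = gr.Interface(fn=parse_document, inputs=gr.Image(type="pil"), outputs="json")
demo.launch()
```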

Brief Critical Analysis on Results

While Donut demonstrates the fastest inference time and superior performance across all tested datasets, the paper offers little error analysis; understanding where and why the model fails would be important both for further improvements and for trusting it in practice.

Donut Pseudocode

(Figure: Donut pseudocode.)

Question 1: What is the input to the encoder?

Answer: The document image is split into fixed-size patches; each 2D patch is flattened into a 1D vector and combined with an encoding of its position within the image. This lets the Transformer's attention mechanism focus on the most relevant patches.
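A minimal sketch of this patch-embedding step, assuming square 16x16 patches and a learned position embedding (Donut's actual encoder is a Swin Transformer; this simplified ViT-style module only illustrates the flatten-and-position-encode idea):

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into patches, project each flattened patch, and add a position embedding."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=256):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided conv is equivalent to flattening each patch and applying a linear projection.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, x):                    # x: (B, 3, 224, 224)
        x = self.proj(x)                     # (B, embed_dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)     # (B, 196, embed_dim): one token per patch
        return x + self.pos_embed            # position information added to every patch token

tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 256])
```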

Question 2: Why use patches and why not flatten the entire image into one long 1D array?

Hint 1: Consider how the number of comparisons in a Transformer's attention mechanism changes with the length of the input sequence.
Hint 2: Imagine moving an object in an image by a few pixels. How might this affect the input if it's represented pixel by pixel versus in patches?
Answer: Patches preserve local structure, directly capturing features like the top of a building within one cohesive region. Processing pixel by pixel is highly sensitive to shifts, causing translation variance where small changes have a disproportionate impact on the model, and it is also computationally intensive. Patches, on the other hand, give a much shorter input sequence, which means far fewer pairwise attention comparisons and a more efficient model overall (see the quick calculation below).
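A quick back-of-the-envelope check of the sequence lengths involved (attention cost grows with the square of sequence length), assuming a 224x224 input and 16x16 patches:

```python
# Self-attention compares every token with every other token, so cost grows quadratically.
img_h = img_w = 224
patch = 16

pixel_tokens = img_h * img_w                        # 50,176 tokens if each pixel is a token
patch_tokens = (img_h // patch) * (img_w // patch)  # 196 tokens with 16x16 patches

print(f"pixels:  {pixel_tokens:>7,} tokens -> ~{pixel_tokens**2:,} pairwise comparisons per layer")
print(f"patches: {patch_tokens:>7,} tokens -> ~{patch_tokens**2:,} pairwise comparisons per layer")
print(f"reduction: ~{pixel_tokens**2 // patch_tokens**2:,}x")
```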


Question 3: In line 6 of the pseudocode we encounter the multi-head attention mechanism, where 'Y' denotes the sequence of tokens generated up to this point. Considering how visual information is integrated, what is the role of 'z' in this context?

Answer: 'z' is the collection of embeddings produced by the encoder. Each embedding in 'z' corresponds to a distinct region of the input image, condensing both its visual content and its spatial layout into a form the Transformer's decoder can attend to and interpret (see the sketch below).
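A minimal sketch of that cross-attention step using PyTorch's nn.MultiheadAttention (dimensions are illustrative): queries come from the decoder's generated tokens Y, while keys and values come from the encoder's patch embeddings z.

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 256, 8
cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

y = torch.randn(1, 10, embed_dim)    # decoder states for the 10 tokens generated so far (Y)
z = torch.randn(1, 196, embed_dim)   # encoder output: one embedding per image patch (z)

# Queries come from the decoder, keys/values from the encoder, so each generated token
# can attend to the regions of the document image that are relevant to it.
attended, attn_weights = cross_attn(query=y, key=z, value=z)
print(attended.shape, attn_weights.shape)  # (1, 10, 256) and (1, 10, 196)
```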

Resource Links

Paper Citation

G. Kim, T. Hong, M. Yim, J. Nam, J. Yim, W. Hwang, S. Yun, D. Han, and S. Park, "OCR-free Document Understanding Transformer," arXiv:2111.15664 [cs.LG], Nov. 2021. Accessed: Mar. 6, 2024. [Online]. Available: https://arxiv.org/abs/2111.15664
