At the Knowledge Discovery in Databases (KDD) 2023 conference in Long Beach, CA, Google's Athena team presented the Visually Rich Document Understanding (VRDU) dataset. The dataset supports building systems that automatically extract structured data from documents such as receipts, insurance quotes, financial statements, and more.
While large models like PaLM 2 achieve impressive levels of accuracy, their real-world usefulness depends on the quality of the datasets used to train and evaluate them. VRDU aims to bridge the gap between these models and complex real-world applications. To do this, the Athena team came up with five benchmarking requirements:
Rich Schema: In practice, we see a wide variety of rich schemas for structured extraction. Entities have different data types (numeric, strings, dates, etc.) that may be required, optional, or repeated in a single document or may even be nested. Extraction tasks over simple flat schemas like (header, question, answer) do not reflect typical problems encountered in practice.
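To make the "rich schema" idea concrete, the sketch below models a nested extraction schema as plain Python dictionaries. The field names and structure are invented for illustration and are not taken from VRDU itself; the point is that fields can be typed, required, optional, or repeated, and entities can nest.

```python
# Hypothetical nested extraction schema (field names are illustrative,
# not the actual VRDU schema). Each leaf field has a data type and an
# occurrence constraint; a nested entity carries its own sub-fields.
registration_schema = {
    "file_date": {"type": "date", "occurrence": "required"},
    "registrant_name": {"type": "string", "occurrence": "required"},
    "foreign_principals": {              # a repeated, nested entity
        "occurrence": "repeated",
        "fields": {
            "principal_name": {"type": "string", "occurrence": "required"},
            "country": {"type": "string", "occurrence": "optional"},
        },
    },
}

def count_leaf_fields(schema):
    """Count leaf fields, recursing into nested entities."""
    total = 0
    for spec in schema.values():
        if "fields" in spec:
            total += count_leaf_fields(spec["fields"])
        else:
            total += 1
    return total
```

A flat (header, question, answer) schema would collapse this to a single level; the recursion in `count_leaf_fields` only matters because nesting is allowed.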
Layout-Rich Documents: The documents should have complex layout elements. Challenges in practical settings come from the fact that documents may contain tables and key-value pairs, switch between single-column and double-column layouts, use varying font sizes for different sections, and include pictures with captions and even footnotes. Contrast this with datasets where most documents are organized in sentences, paragraphs, and chapters with section headers — the kinds of documents that are typically the focus of classic natural language processing literature on long inputs.
Diverse Templates: A benchmark should include different structural layouts or templates. It is trivial for a high-capacity model to extract from a particular template by memorizing the structure. However, in practice, one needs to be able to generalize to new templates/layouts, an ability that the train-test split in a benchmark should measure.
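One way to implement the train-test split this requirement describes is to partition documents by template ID, so that test templates never appear in training. The function below is a generic sketch of that idea, not the benchmark's actual split code; the `template_id` key is an assumed document attribute.

```python
import random

def split_by_template(docs, test_fraction=0.3, seed=0):
    """Split documents so no template appears in both train and test.

    docs: list of dicts, each with an assumed 'template_id' key.
    Generic sketch of an unseen-template split, not VRDU's own code.
    """
    templates = sorted({d["template_id"] for d in docs})
    rng = random.Random(seed)
    rng.shuffle(templates)
    n_test = max(1, int(len(templates) * test_fraction))
    test_templates = set(templates[:n_test])
    train = [d for d in docs if d["template_id"] not in test_templates]
    test = [d for d in docs if d["template_id"] in test_templates]
    return train, test
```

Because the split is over templates rather than individual documents, a model cannot score well on the test set merely by memorizing layouts it saw during training.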
High-Quality OCR: Documents should have high-quality Optical Character Recognition (OCR) results. Our aim with this benchmark is to focus on the VRDU task itself and to exclude the variability brought on by the choice of OCR engine.
Token-Level Annotation: Documents should contain ground-truth annotations that can be mapped back to corresponding input text, so that each token can be annotated as part of the corresponding entity. This is in contrast with simply providing the text of the value to be extracted for the entity. This is key to generating clean training data where we do not have to worry about incidental matches to the given value. For instance, in some receipts, the ‘total-before-tax’ field may have the same value as the ‘total’ field if the tax amount is zero. Having token-level annotations prevents us from generating training data where both instances of the matching value are marked as ground-truth for the ‘total’ field, thus producing noisy examples.
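The receipt example can be sketched in code. Below, annotations are character spans rather than value strings, so the "9.99" next to "Subtotal" is never mislabeled even though it matches the total's value exactly. This is a simplified BIO-labeling sketch; VRDU's actual annotation format may differ.

```python
def label_tokens(tokens, annotations):
    """Assign BIO labels to tokens from character-span annotations.

    tokens: list of (text, start, end) character offsets.
    annotations: list of (field, start, end) ground-truth spans.
    Span-based annotation means an identical value elsewhere in the
    document (e.g. subtotal == total when tax is zero) stays unlabeled.
    """
    labels = ["O"] * len(tokens)
    for field, a_start, a_end in annotations:
        first = True
        for i, (_, t_start, t_end) in enumerate(tokens):
            if t_start >= a_start and t_end <= a_end:
                labels[i] = ("B-" if first else "I-") + field
                first = False
    return labels

# Receipt text: "Subtotal 9.99 Tax 0.00 Total 9.99"
tokens = [("Subtotal", 0, 8), ("9.99", 9, 13), ("Tax", 14, 17),
          ("0.00", 18, 22), ("Total", 23, 28), ("9.99", 29, 33)]
# Only the second "9.99" (characters 29-33) is the true total.
annotations = [("total", 29, 33)]
```

Matching by value string instead would have tagged both occurrences of "9.99", producing exactly the noisy example the requirement warns against.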
VRDU combines two publicly available corpora: Registration Forms and Ad-Buy Forms. The benchmark defines three tasks — Single Template Learning, Mixed Template Learning, and Unseen Template Learning. It can be used to identify and categorize types of information in structured and unstructured documents, and to help researchers track progress on document understanding tasks. Results on these tasks are reported in the accompanying paper.
Source: Google Research