At the Knowledge Discovery in Databases (KDD) 2023 conference in Long Beach, CA, Google's Athena team presented the Visually Rich Document Understanding (VRDU) dataset. The dataset supports building systems that automatically extract structured data from documents such as receipts, insurance quotes, financial statements, and more.
While large models like PaLM 2 achieve impressive levels of accuracy, their real-world usefulness depends on the quality of the datasets used to train and evaluate them. VRDU aims to bridge the gap between these models and complex real-world applications. To do this, the Athena team came up with five benchmarking requirements:
Rich Schema: In practice, we see a wide variety of rich schemas for structured extraction. Entities have different data types (numeric, strings, dates, etc.) that may be required, optional, or repeated in a single document or may even be nested. Extraction tasks over simple flat schemas like (header, question, answer) do not reflect typical problems encountered in practice.
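To make the "rich schema" idea concrete, the sketch below models a nested extraction schema as plain Python dictionaries. The field names and structure are invented for illustration and are not taken from VRDU itself; the point is that fields can be typed, required, optional, or repeated, and entities can nest.

```python
# Hypothetical nested extraction schema (field names are illustrative,
# not the actual VRDU schema). Each leaf field has a data type and an
# occurrence constraint; a nested entity carries its own sub-fields.
registration_schema = {
    "file_date": {"type": "date", "occurrence": "required"},
    "registrant_name": {"type": "string", "occurrence": "required"},
    "foreign_principals": {              # a repeated, nested entity
        "occurrence": "repeated",
        "fields": {
            "principal_name": {"type": "string", "occurrence": "required"},
            "country": {"type": "string", "occurrence": "optional"},
        },
    },
}

def count_leaf_fields(schema):
    """Count leaf fields, recursing into nested entities."""
    total = 0
    for spec in schema.values():
        if "fields" in spec:
            total += count_leaf_fields(spec["fields"])
        else:
            total += 1
    return total
```

A flat (header, question, answer) schema would collapse this to a single level; the recursion in `count_leaf_fields` only matters because nesting is allowed.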
Layout-Rich Documents: The documents should have complex layout elements. Challenges in practical settings come from the fact that documents may contain tables and key-value pairs, switch between single-column and double-column layouts, use varying font sizes for different sections, and include pictures with captions and even footnotes. Contrast this with datasets where most documents are organized in sentences, paragraphs, and chapters with section headers — the kinds of documents that are typically the focus of classic natural language processing literature on long inputs.
Diverse Templates: A benchmark should include different structural layouts or templates. It is trivial for a high-capacity model to extract from a particular template by memorizing the structure. However, in practice, one needs to be able to generalize to new templates/layouts, an ability that the train-test split in a benchmark should measure.
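One way to implement the train-test split this requirement describes is to partition documents by template ID, so that test templates never appear in training. The function below is a generic sketch of that idea, not the benchmark's actual split code; the `template_id` key is an assumed document attribute.

```python
import random

def split_by_template(docs, test_fraction=0.3, seed=0):
    """Split documents so no template appears in both train and test.

    docs: list of dicts, each with an assumed 'template_id' key.
    Generic sketch of an unseen-template split, not VRDU's own code.
    """
    templates = sorted({d["template_id"] for d in docs})
    rng = random.Random(seed)
    rng.shuffle(templates)
    n_test = max(1, int(len(templates) * test_fraction))
    test_templates = set(templates[:n_test])
    train = [d for d in docs if d["template_id"] not in test_templates]
    test = [d for d in docs if d["template_id"] in test_templates]
    return train, test
```

Because the split is over templates rather than individual documents, a model cannot score well on the test set merely by memorizing layouts it saw during training.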
High-Quality OCR: Documents should have high-quality Optical Character Recognition (OCR) results. Our aim with this benchmark is to focus on the VRDU task itself and to exclude the variability brought on by the choice of OCR engine.
Token-Level Annotation: Documents should contain ground-truth annotations that can be mapped back to corresponding input text, so that each token can be annotated as part of the corresponding entity. This is in contrast with simply providing the text of the value to be extracted for the entity. This is key to generating clean training data where we do not have to worry about incidental matches to the given value. For instance, in some receipts, the ‘total-before-tax’ field may have the same value as the ‘total’ field if the tax amount is zero. Having token-level annotations prevents us from generating training data where both instances of the matching value are marked as ground-truth for the ‘total’ field, thus producing noisy examples.
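The receipt example can be sketched in code. Below, annotations are character spans rather than value strings, so the "9.99" next to "Subtotal" is never mislabeled even though it matches the total's value exactly. This is a simplified BIO-labeling sketch; VRDU's actual annotation format may differ.

```python
def label_tokens(tokens, annotations):
    """Assign BIO labels to tokens from character-span annotations.

    tokens: list of (text, start, end) character offsets.
    annotations: list of (field, start, end) ground-truth spans.
    Span-based annotation means an identical value elsewhere in the
    document (e.g. subtotal == total when tax is zero) stays unlabeled.
    """
    labels = ["O"] * len(tokens)
    for field, a_start, a_end in annotations:
        first = True
        for i, (_, t_start, t_end) in enumerate(tokens):
            if t_start >= a_start and t_end <= a_end:
                labels[i] = ("B-" if first else "I-") + field
                first = False
    return labels

# Receipt text: "Subtotal 9.99 Tax 0.00 Total 9.99"
tokens = [("Subtotal", 0, 8), ("9.99", 9, 13), ("Tax", 14, 17),
          ("0.00", 18, 22), ("Total", 23, 28), ("9.99", 29, 33)]
# Only the second "9.99" (characters 29-33) is the true total.
annotations = [("total", 29, 33)]
```

Matching by value string instead would have tagged both occurrences of "9.99", producing exactly the noisy example the requirement warns against.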
VRDU combines two publicly available corpora: Registration Forms and Ad-Buy Forms. The benchmark defines three tasks — Single Template Learning, Mixed Template Learning, and Unseen Template Learning. It can be used to identify and categorize types of information in structured and unstructured documents, and to help researchers track progress on document understanding tasks. Results on these tasks are reported in the accompanying paper.
Source: Google Research