Google AI researchers propose a structure-aware sequence model, called FormNet, to mitigate suboptimal form serialization for extracting information from documents

This Article Is Based On The Research Paper 'FormNet: Structural Encoding beyond Sequential Modeling in Form Document Information Extraction' and the accompanying Google article. All Credit For This Research Goes To The Researchers Of This Paper 👏👏👏

Sequence modeling has delivered state-of-the-art performance on natural language and document understanding tasks. Sequence models are machine learning models that take sequences of data as input or produce them as output; many recent ones rely on self-attention. To analyze form-like documents, the common practice is to serialize them first (usually left-to-right, top-to-bottom) and then apply a state-of-the-art sequence model to the result. However, standard serialization struggles with the variety and complexity of form layouts, which frequently include tables, columns, boxes, and other elements. These challenges, which are specific to form-based document understanding, have been largely overlooked despite their practical importance.
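To see why left-to-right, top-to-bottom serialization can go wrong, here is a minimal Python sketch. It is only an illustration: the function name `serialize`, the `line_tolerance` parameter, and the toy two-column form are assumptions made for this example and do not come from the paper.

```python
from typing import List, Tuple

# (text, x, y): x and y are the word's top-left corner, with y growing downward.
Word = Tuple[str, float, float]


def serialize(words: List[Word], line_tolerance: float = 5.0) -> List[str]:
    """Order words top-to-bottom, then left-to-right within each text line."""
    # Sort by vertical position first so words can be grouped into lines.
    by_y = sorted(words, key=lambda w: w[2])
    lines: List[List[Word]] = []
    for word in by_y:
        # A word joins the current line if it is vertically close to the
        # line's first word; otherwise it starts a new line.
        if lines and abs(word[2] - lines[-1][0][2]) <= line_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Within each line, read left to right.
    return [w[0] for line in lines for w in sorted(line, key=lambda w: w[1])]


# Hypothetical two-column form: each column holds a key with its value below it.
form = [
    ("Name:", 10, 10), ("Date:", 200, 10),
    ("Alice", 10, 30), ("2022-05-01", 200, 30),
]

print(serialize(form))
# ['Name:', 'Date:', 'Alice', '2022-05-01']
```

In the serialized output, the key "Name:" is no longer adjacent to its value "Alice", even though the two sit directly above and below each other on the page. This is the kind of layout information that a purely sequential model has to recover on its own.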

To make progress in this area, a team from Google Research and Google Cloud AI authored the research paper “FormNet: Structural Encoding Beyond Sequential Modeling in Form Document Information Extraction”, presented at ACL 2022. FormNet is a structure-aware sequence model that bridges the gap between plain sequence models and 2D convolutional models in order to mitigate suboptimal form serialization. The architecture introduces a “Rich Attention” mechanism, which uses the 2D spatial relationships between tokens to calculate more structurally meaningful attention scores. In addition, a graph convolutional network (GCN) builds a “Super-Token” for each word by aggregating the embeddings of its neighboring tokens. The graph on which the GCN operates captures how tokens are spatially related to one another in the form. The Super-Tokens are then fed into a transformer, which performs sequential tagging to extract entities.
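The Super-Token idea can be sketched in a few lines of Python. This is a simplified illustration, not the paper's method: it assumes a k-nearest-neighbor graph over token coordinates and a plain mean over neighbor embeddings, whereas FormNet uses its own graph construction and a GCN with learned weights. The names `build_knn_graph` and `super_tokens` are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def build_knn_graph(coords: List[Tuple[float, float]], k: int = 1) -> Dict[int, List[int]]:
    """Connect each token to its k nearest tokens in 2D (an assumed graph rule)."""
    graph: Dict[int, List[int]] = {}
    for i, (xi, yi) in enumerate(coords):
        # Distance from token i to every other token, nearest first.
        dists = sorted(
            (math.hypot(xi - xj, yi - yj), j)
            for j, (xj, yj) in enumerate(coords)
            if j != i
        )
        graph[i] = [j for _, j in dists[:k]]
    return graph


def super_tokens(embeddings: List[List[float]], graph: Dict[int, List[int]]) -> List[List[float]]:
    """One round of message passing: average each token with its graph neighbors."""
    out = []
    for i, emb in enumerate(embeddings):
        group = [emb] + [embeddings[j] for j in graph[i]]
        # Element-wise mean over the token itself and its neighbors.
        out.append([sum(vals) / len(group) for vals in zip(*group)])
    return out


# Hypothetical layout: tokens 0 and 1 are stacked in a left column,
# tokens 2 and 3 in a right column.
coords = [(10, 10), (10, 30), (200, 10), (200, 30)]
embeddings = [[1.0, 0.0], [0.0, 1.0], [4.0, 0.0], [0.0, 4.0]]

graph = build_knn_graph(coords, k=1)
print(graph)                            # {0: [1], 1: [0], 2: [3], 3: [2]}
print(super_tokens(embeddings, graph))  # [[0.5, 0.5], [0.5, 0.5], [2.0, 2.0], [2.0, 2.0]]
```

Each token's new representation mixes in information from the token that is spatially closest to it, regardless of where the two end up in the serialized sequence.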

The team also conducted a series of experiments using FormNet to extract information from documents. Given a form document, the researchers first use an optical character recognition (OCR) engine and the BERT-multilingual vocabulary to detect and tokenize words. The tokens and their 2D coordinates are then fed into a GCN for graph construction and message passing. To further process the GCN-encoded, structure-aware tokens for semantic entity extraction, they use Extended Transformer Construction (ETC) layers with the proposed Rich Attention mechanism. Finally, they decode the output with the Viterbi algorithm, which finds the tag sequence that maximizes the posterior probability, to obtain the final entities.
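The final decoding step can be illustrated with a small, generic Viterbi implementation in Python. This is a sketch under stated assumptions rather than the authors' code: the tag set, the log-scores, and the transition penalties below are invented for the example, and the function name `viterbi` and its signature are hypothetical.

```python
from typing import List


def viterbi(emissions: List[List[float]],
            transitions: List[List[float]],
            initial: List[float]) -> List[int]:
    """Return the highest-scoring tag sequence, with all scores in log space."""
    n_tags = len(initial)
    # Best score of any path ending in each tag after the first token.
    score = [initial[t] + emissions[0][t] for t in range(n_tags)]
    backpointers: List[List[int]] = []
    for step in emissions[1:]:
        new_score, pointers = [], []
        for t in range(n_tags):
            # Best previous tag to transition from when the current tag is t.
            best_prev = max(range(n_tags), key=lambda p: score[p] + transitions[p][t])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + transitions[best_prev][t] + step[t])
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final tag to recover the full path.
    path = [max(range(n_tags), key=lambda t: score[t])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Assumed tag set for the example: 0 = O, 1 = B-TOTAL, 2 = I-TOTAL.
tags = ["O", "B-TOTAL", "I-TOTAL"]
emissions = [
    [-0.1, -3.0, -3.0],  # token 0 looks like O
    [-1.0, -1.2, -0.9],  # token 1 is ambiguous; I-TOTAL scores slightly best
    [-2.0, -3.0, -0.2],  # token 2 looks like I-TOTAL
]
# transitions[p][t]: score of moving from tag p to tag t.
# O -> I-TOTAL is heavily penalized because an entity must start with B-TOTAL.
transitions = [
    [0.0, 0.0, -100.0],
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
initial = [0.0, 0.0, -100.0]  # a sequence should not start inside an entity

best = viterbi(emissions, transitions, initial)
print([tags[t] for t in best])  # ['O', 'B-TOTAL', 'I-TOTAL']
```

Picking the best tag for each token independently would give O, I-TOTAL, I-TOTAL, which is not a valid tagging. Viterbi scores whole sequences, so it selects B-TOTAL for the second token even though that tag is not the top choice in isolation.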

Through a series of experiments, the researchers concluded that FormNet outperforms previous methods while using smaller models and less pre-training data, and without relying on visual features. It also achieves state-of-the-art performance on the CORD, FUNSD, and Payment benchmarks. So, even when the serialization is suboptimal, the ETC transformer handles form layouts well thanks to the new Rich Attention mechanism and the Super-Token components proposed by the team.




James G. Williams