Stanford AI researchers propose “LinkBERT”: a new pre-training method that improves language models by using document links

This article was written as a summary by Marktechpost Staff based on the research paper 'LinkBERT: Pretraining Language Models with Document Links'. All credit for this research goes to the researchers of this project. Check out the paper, GitHub and blog post.

Language models (LMs) are the cornerstone of modern natural language processing (NLP) systems, primarily because of their remarkable ability to acquire knowledge from text. These models are part of daily life through their use in tools such as search engines and voice assistants. Models such as BERT and the GPT series stand out because they can be pre-trained on large amounts of unannotated text using self-supervised objectives such as masked language modeling and causal language modeling. The pre-trained models can then be adapted to a wide range of new tasks, such as question answering, without much task-specific adjustment. The main disadvantage of these popular pre-training approaches is that they model one document at a time and do not capture dependencies or knowledge that span multiple documents. Because documents often depend on one another, treating each document independently has limitations. This is best illustrated by text from the web or the scientific literature, which is frequently used in LM training. Such text usually contains links between documents, such as hyperlinks and citation links. These document links matter because knowledge is often spread across several documents rather than contained in just one.

Document links are essential for LMs to learn new information and make discoveries, just as they are for humans. A text corpus is more than a list of documents; it is a graph of documents with links connecting them. Models trained without these dependencies may be unable to capture facts scattered across multiple documents, which is necessary for applications such as question answering and knowledge discovery. To take a step toward solving this challenge, a group of researchers at Stanford’s AI lab created LinkBERT, a new pre-training approach that incorporates information about document links during training.

The LinkBERT approach proceeds in three steps. The first step is to build a document graph from the text corpus using hyperlinks and citation links. Each document is treated as a node, and a directed edge is added from one document to another whenever the first links to the second. The second step is to use the graph to create link-aware training instances by placing linked documents together. Each document is split into segments, and pairs of segments are concatenated into a single input based on the links in the document graph. The second segment of a pair can be chosen in three ways: contiguous (the next segment of the same document), random (a segment from a randomly chosen document), or linked (a segment from a document linked to the first). Training on all three kinds of pairs lets the model learn to recognize the relationships between pieces of text.
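The first two steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy corpus, the hyperlink list, and the helper names `build_graph` and `make_instance` are all hypothetical, and choosing uniformly among the available pair types is an assumption made for simplicity.

```python
import random
from collections import defaultdict

# Hypothetical toy corpus: document id -> list of text segments.
corpus = {
    "A": ["a0", "a1", "a2"],
    "B": ["b0", "b1"],
    "C": ["c0", "c1"],
}
# Hypothetical hyperlinks as (source, target) pairs.
hyperlinks = [("A", "B"), ("B", "C")]


def build_graph(links):
    """Step 1: documents are nodes, each link is a directed edge."""
    graph = defaultdict(list)
    for src, dst in links:
        graph[src].append(dst)
    return graph


def make_instance(corpus, graph, doc_id, seg_idx, rng):
    """Step 2: pair an anchor segment with a second segment.

    Returns (anchor, other, label), where label is one of
    "contiguous", "random", or "linked".
    """
    anchor = corpus[doc_id][seg_idx]
    linked_docs = graph.get(doc_id, [])

    # Only offer the pair types that are possible for this anchor.
    options = ["random"]
    if seg_idx + 1 < len(corpus[doc_id]):
        options.append("contiguous")
    if linked_docs:
        options.append("linked")
    label = rng.choice(options)  # assumption: uniform choice

    if label == "contiguous":
        other = corpus[doc_id][seg_idx + 1]
    elif label == "linked":
        other = rng.choice(corpus[rng.choice(linked_docs)])
    else:
        # Prefer documents that are neither the anchor's nor linked to it,
        # falling back to any other document if none remain.
        pool = [d for d in corpus if d != doc_id and d not in linked_docs]
        pool = pool or [d for d in corpus if d != doc_id]
        other = rng.choice(corpus[rng.choice(pool)])
    return anchor, other, label


graph = build_graph(hyperlinks)
rng = random.Random(0)
print(make_instance(corpus, graph, "A", 0, rng))
```

The label returned with each pair is what a relation-prediction objective would later be trained on, so instance creation and the pre-training task share the same three categories.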


The final step is to pre-train the LM with two self-supervised, link-aware tasks: masked language modeling (MLM) and document relation prediction (DRP). MLM masks some tokens in the input text and trains the model to predict them from the surrounding tokens. Because linked documents are placed in the same input, this encourages the LM to learn multi-hop knowledge about concepts that are connected through document links. DRP trains the model to classify the relationship between two segments as contiguous, random, or linked. This task helps the LM learn about relevance and dependencies between documents. LinkBERT models have been evaluated on a range of downstream tasks in several domains. Across tasks and domains, they consistently outperform baseline language models such as BERT and PubMedBERT, which were not pre-trained with document links. The improvements were especially notable in the biomedical domain, where scientific publications are closely connected through citation links. LinkBERT also shows strong results on multi-hop reasoning.
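To make the two pre-training tasks concrete, here is a simplified Python sketch of how one training example could carry both an MLM target and a DRP label. Several details are assumptions rather than the paper's exact recipe: whitespace tokenization, a 15% masking rate (BERT's usual default), plain replacement with `[MASK]` only, and the arbitrary integer ids in `DRP_LABELS`. The function names are hypothetical.

```python
import random

MASK = "[MASK]"
SPECIAL = ("[CLS]", "[SEP]")
# Assumed label ids; the ordering is arbitrary.
DRP_LABELS = {"contiguous": 0, "random": 1, "linked": 2}


def mask_tokens(tokens, rng, mask_prob=0.15):
    """Replace some non-special tokens with [MASK].

    Returns the masked sequence and a parallel list of targets, where a
    target is the original token at masked positions and None elsewhere.
    """
    masked, targets = [], []
    for tok in tokens:
        if tok not in SPECIAL and rng.random() < mask_prob:
            masked.append(MASK)
            targets.append(tok)
        else:
            masked.append(tok)
            targets.append(None)
    return masked, targets


def build_example(seg_a, seg_b, relation, rng):
    """Concatenate two segments and attach MLM targets plus a DRP label."""
    tokens = ["[CLS]"] + seg_a.split() + ["[SEP]"] + seg_b.split() + ["[SEP]"]
    masked, targets = mask_tokens(tokens, rng)
    return {
        "input": masked,                     # what the model sees
        "mlm_targets": targets,              # tokens to recover (MLM)
        "drp_label": DRP_LABELS[relation],   # segment relation (DRP)
    }


rng = random.Random(0)
example = build_example(
    "the tidal basin hosts a festival",
    "the festival celebrates cherry trees",
    "linked",
    rng,
)
print(example)
```

In an actual training loop the model would be optimized on both objectives together, predicting the masked tokens and the relation label from the same concatenated input.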

LinkBERT can be used as a drop-in replacement for BERT. According to the authors' experiments, LinkBERT not only improves performance on general language understanding tasks, but also captures relationships between concepts and is effective for cross-document understanding. The model also internalizes more world knowledge and is useful for knowledge-intensive tasks such as question answering. The LinkBERT models were released in the hope that they would pave the way for future research. Possible directions include extending the approach to sequence-to-sequence LMs for document link-aware text generation. The research was also published at ACL 2022, the annual meeting of the Association for Computational Linguistics.
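As a rough illustration of the drop-in idea, the sketch below loads a checkpoint through the Hugging Face `transformers` Auto classes, so that switching from BERT to LinkBERT only changes the model identifier. The identifier `michiyasunaga/LinkBERT-base` is an assumption here and should be checked against the project's GitHub page; the helper names `load_encoder` and `embed` are hypothetical, and running `embed` requires `transformers`, PyTorch, and network access to download the weights.

```python
# Assumed Hub identifiers; verify the LinkBERT one against the project's README.
LINKBERT_ID = "michiyasunaga/LinkBERT-base"
BERT_ID = "bert-base-uncased"


def load_encoder(model_id):
    """Load a tokenizer and encoder for the given checkpoint identifier."""
    # Imported lazily so this module can be imported without transformers.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    return tokenizer, model


def embed(text, model_id=LINKBERT_ID):
    """Return the last-layer hidden states for a piece of text."""
    tokenizer, model = load_encoder(model_id)
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model(**inputs)
    return outputs.last_hidden_state


if __name__ == "__main__":
    # Swapping BERT_ID for LINKBERT_ID is the only change needed.
    for model_id in (BERT_ID, LINKBERT_ID):
        states = embed("Document links carry knowledge.", model_id)
        print(model_id, tuple(states.shape))
```

Because the calling code is identical for both identifiers, an existing BERT fine-tuning pipeline should in principle need no structural changes to try LinkBERT.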

James G. Williams