Retrieval-augmented visual-language pre-training – Google AI Blog

Large-scale models, such as T5, GPT-3, PaLM, Flamingo and PaLI, have demonstrated the ability to store substantial amounts of knowledge when scaled to tens of billions of parameters and trained on large text and image datasets. These models achieve state-of-the-art results on downstream tasks, such as image captioning, visual question answering and open vocabulary recognition. Despite such achievements, these models require a massive volume of data for training and end up with a tremendous number of parameters (billions in many cases), resulting in significant computational requirements. Moreover, the data used to train these models can become outdated, requiring re-training every time the world’s knowledge is updated. For example, a model trained just two years ago might yield outdated information about the current president of the United States.

In the fields of natural language processing (RETRO, REALM) and computer vision (KAT), researchers have attempted to address these challenges using retrieval-augmented models. Typically, these models use a backbone that can process a single modality at a time, e.g., only text or only images, to encode and retrieve information from a knowledge corpus. However, these retrieval-augmented models are unable to leverage all available modalities in a query and knowledge corpora, and may not find the information that is most helpful for generating the model’s output.

To address these issues, in “REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory”, to appear at CVPR 2023, we introduce a visual-language model that learns to utilize a multi-source multi-modal “memory” to answer knowledge-intensive queries. REVEAL employs neural representation learning to encode and convert diverse knowledge sources into a memory structure consisting of key-value pairs. The keys serve as indices for the memory items, while the corresponding values store pertinent information about those items. During training, REVEAL learns the key embeddings, value tokens, and the ability to retrieve information from this memory to address knowledge-intensive queries. This approach allows the model parameters to focus on reasoning about the query, rather than being dedicated to memorization.

We augment a visual-language model with the ability to retrieve multiple knowledge entries from a diverse set of knowledge sources, which helps generation.

Memory construction from multimodal knowledge corpora

Our approach is similar to REALM in that we precompute key and value embeddings of knowledge items from different sources and index them in a unified knowledge memory, where each knowledge item is encoded into a key-value pair. Each key is a d-dimensional embedding vector, while each value is a sequence of token embeddings representing the knowledge item in more detail. In contrast to previous work, REVEAL leverages a diverse set of multimodal knowledge corpora, including the WikiData knowledge graph, Wikipedia passages and images, web image-text pairs and visual question answering data. Each knowledge item could be text, an image, a combination of both (e.g., pages in Wikipedia) or a relationship or attribute from a knowledge graph (e.g., Barack Obama is 6’ 2” tall). During training, we continuously re-compute the memory key and value embeddings as the model parameters get updated. We update the memory asynchronously every thousand training steps.
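As a rough illustration of the data structure described above, the unified memory can be sketched as a list of key-value entries built from several corpora. This is a minimal sketch and not the REVEAL implementation: the names `MemoryEntry`, `build_memory`, `should_refresh` and `toy_encode` are hypothetical, and the toy encoder merely stands in for the learned visual-language encoder.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]

@dataclass
class MemoryEntry:
    source: str          # which corpus the item came from, e.g. "wikidata"
    key: Vector          # d-dimensional embedding used to index the entry
    value: List[Vector]  # sequence of token embeddings describing the item

# Hypothetical encoder signature: raw item -> (key, value tokens).
Encoder = Callable[[str], Tuple[Vector, List[Vector]]]

def build_memory(corpora: Dict[str, Sequence[str]], encode: Encoder) -> List[MemoryEntry]:
    """Encode every item of every corpus into one unified memory."""
    memory = []
    for source, items in corpora.items():
        for item in items:
            key, value = encode(item)
            memory.append(MemoryEntry(source, key, value))
    return memory

REFRESH_EVERY = 1000  # the post states the memory is updated every thousand steps

def should_refresh(step: int) -> bool:
    """True on the training steps at which keys and values would be re-computed."""
    return step > 0 and step % REFRESH_EVERY == 0

def toy_encode(item: str) -> Tuple[Vector, List[Vector]]:
    """Stand-in encoder (assumption): three fake 4-d tokens, mean-pooled key."""
    d = 4
    tokens = [[float(ord(c) % 7)] * d for c in item[:3]]
    key = [sum(t[i] for t in tokens) / len(tokens) for i in range(d)]
    return key, tokens

memory = build_memory(
    {"wikidata": ["Barack Obama height"], "wikipedia": ["Obama page"]},
    toy_encode,
)
```

In a real training loop the refresh would run in the background with the latest model parameters; here `should_refresh` only marks the steps at which that would happen.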

Scaling memory using compression

A naïve solution for encoding a memory value is to keep the whole sequence of tokens for each knowledge item. Then, the model could fuse the input query and the top-k retrieved memory values by concatenating all their tokens together and feeding them into a transformer encoder-decoder pipeline. This approach has two issues: (1) storing hundreds of millions of knowledge items in memory is impractical if each memory value consists of hundreds of tokens and (2) the self-attention in the transformer encoder has quadratic complexity with respect to the total number of tokens, which grows linearly with k. Therefore, we propose to use the Perceiver architecture to encode and compress knowledge items. The Perceiver model uses a transformer decoder to compress the full token sequence into an arbitrary length. This lets us retrieve top-k memory entries for k as large as 100.
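The two scaling issues can be made concrete with a small back-of-the-envelope calculation. All numbers below (64 query tokens, 512 tokens per uncompressed value, 32 tokens per compressed value, 100 million items) are illustrative assumptions, not figures from the paper, and the function names are hypothetical.

```python
def attention_pairs(query_tokens: int, k: int, tokens_per_value: int) -> int:
    """Number of token pairs scored by self-attention over the fused sequence."""
    total_tokens = query_tokens + k * tokens_per_value
    return total_tokens * total_tokens

def stored_token_embeddings(num_items: int, tokens_per_value: int) -> int:
    """Number of token embeddings the memory has to hold."""
    return num_items * tokens_per_value

# Assumed sizes, for illustration only.
QUERY_TOKENS = 64
K = 100

# Fused sequence of 64 + 100 * 512 = 51264 tokens.
naive = attention_pairs(QUERY_TOKENS, K, tokens_per_value=512)
# Fused sequence of 64 + 100 * 32 = 3264 tokens.
compressed = attention_pairs(QUERY_TOKENS, K, tokens_per_value=32)

print("attention cost ratio:", naive / compressed)
print("storage ratio:",
      stored_token_embeddings(100_000_000, 512)
      / stored_token_embeddings(100_000_000, 32))
```

Storage shrinks linearly with the compression factor, while attention cost shrinks roughly quadratically, which is what makes large k affordable.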

The following figure illustrates the procedure of constructing the memory key-value pairs. Each knowledge item is processed through a multi-modal visual-language encoder, resulting in a sequence of image and text tokens. The key head then transforms these tokens into a compact embedding vector. The value head (perceiver) condenses these tokens into fewer ones, retaining the pertinent information about the knowledge item within them.
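The two heads can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' architecture: the key head is approximated by mean pooling, and the value head by a single Perceiver-style cross-attention step in which a small set of latent query vectors attends over the full token sequence. Learned projections, multiple heads and layers are omitted, and the names `key_head`, `value_head` and `latents` are hypothetical.

```python
import math
from typing import List

Vector = List[float]

def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))

def softmax(scores: List[float]) -> List[float]:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def key_head(tokens: List[Vector]) -> Vector:
    """Assumed key head: mean-pool the token embeddings into one vector."""
    d = len(tokens[0])
    return [sum(t[i] for t in tokens) / len(tokens) for i in range(d)]

def value_head(tokens: List[Vector], latents: List[Vector]) -> List[Vector]:
    """Cross-attention: each latent query produces one compressed token."""
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    compressed = []
    for q in latents:
        weights = softmax([dot(q, t) * scale for t in tokens])
        compressed.append(
            [sum(w * t[i] for w, t in zip(weights, tokens)) for i in range(d)]
        )
    return compressed

# Six 2-d tokens compressed into two value tokens.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.2, 0.8], [0.9, 0.1]]
latents = [[1.0, 0.0], [0.0, 1.0]]  # would be learned parameters in practice
key = key_head(tokens)
value = value_head(tokens, latents)
```

The output length is set by the number of latent queries rather than by the input length, which is why the compressed value can have an arbitrary, fixed size.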

We encode the knowledge entries from different corpora into unified key and value embedding pairs, where the keys are used to index the memory and values contain information about the entries.

Large-scale pre-training on image-text pairs

To train the REVEAL model, we begin with the large-scale corpus, collected from the public Web with three billion image alt-text caption pairs, introduced in LiT. Since the dataset is noisy, we add a filter to remove data points with captions shorter than 50 characters, which yields roughly 1.3 billion image caption pairs. We then take these pairs, combined with the text generation objective used in SimVLM, to train REVEAL. Given an image-text example, we randomly sample a prefix containing the first few tokens of the text. We feed the text prefix and image to the model as input with the objective of generating the rest of the text as output. The training goal is to condition on the prefix and autoregressively generate the remaining text sequence.
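The caption filter and the prefix split can be sketched as follows. This is a minimal sketch, assuming whitespace tokenization and a split that leaves at least one token on each side; neither detail is specified in the post, and the function names are hypothetical.

```python
import random
from typing import List, Tuple

MIN_CAPTION_CHARS = 50  # captions shorter than this are removed

def keep_pair(caption: str) -> bool:
    """Filter rule: drop image-caption pairs whose caption is too short."""
    return len(caption) >= MIN_CAPTION_CHARS

def prefix_lm_example(tokens: List[str], rng: random.Random) -> Tuple[List[str], List[str]]:
    """Split a caption into (prefix fed with the image, target to generate).

    Assumes len(tokens) >= 2 so that both parts are non-empty.
    """
    cut = rng.randint(1, len(tokens) - 1)  # randint is inclusive on both ends
    return tokens[:cut], tokens[cut:]

caption = "A golden retriever catching a frisbee in a sunny park near the lake"
rng = random.Random(0)
if keep_pair(caption):
    prefix, target = prefix_lm_example(caption.split(), rng)
    print("input prefix:", prefix)
    print("target text: ", target)
```

During training the model would receive the image together with `prefix` and be trained with cross-entropy to generate `target` one token at a time.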

To train all components of the REVEAL model end-to-end, we need to warm start the model to a good state (setting initial values to model parameters). Otherwise, if we were to start with random weights (cold-start), the retriever would often return irrelevant memory items that would never generate useful training signals. To avoid this cold-start problem, we construct an initial retrieval dataset with pseudo-ground-truth knowledge to give the pre-training a reasonable head start.

We create a modified version of the WIT dataset for this purpose. Each image-caption pair in WIT also comes with a corresponding Wikipedia passage (words surrounding the text). We put together the surrounding passage with the query image and use it as the pseudo-ground-truth knowledge that corresponds to the input query. The passage provides rich information about the image and caption, which is useful for initializing the model.

To prevent the model from relying on low-level image features for retrieval, we apply random data augmentation to the input query image. Given this modified dataset that contains pseudo-retrieval ground-truth, we train the query and memory key embeddings to warm start the model.
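The post does not state which loss is used for this warm start. One common choice for aligning query and key embeddings when each query has a known matching passage is an in-batch contrastive (softmax cross-entropy) loss over dot-product scores; the sketch below illustrates that choice purely as an assumption, with a hypothetical function name.

```python
import math
from typing import List

Vector = List[float]

def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))

def in_batch_contrastive_loss(queries: List[Vector], keys: List[Vector]) -> float:
    """Assumed warm-start loss: keys[i] is the pseudo-ground-truth for queries[i];
    the other keys in the batch act as negatives."""
    total = 0.0
    for i, q in enumerate(queries):
        scores = [dot(q, k) for k in keys]
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[i]  # negative log-probability of the true key
    return total / len(queries)

queries = [[4.0, 0.0], [0.0, 4.0]]
aligned_keys = [[4.0, 0.0], [0.0, 4.0]]     # each query matches its own key
mismatched_keys = [[0.0, 4.0], [4.0, 0.0]]  # pairs swapped

loss_aligned = in_batch_contrastive_loss(queries, aligned_keys)
loss_mismatched = in_batch_contrastive_loss(queries, mismatched_keys)
print(loss_aligned, loss_mismatched)
```

Minimizing such a loss pushes each query embedding toward the key of its paired passage, so that the retriever returns sensible items before end-to-end training begins.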

REVEAL workflow

The overall workflow of REVEAL consists of four primary steps. First, REVEAL encodes a multimodal input into a sequence of token embeddings along with a condensed query embedding. Then, the model translates each multi-source knowledge entry into unified pairs of key and value embeddings, with the key being utilized for memory indexing and the value encompassing the entire information about the entry. Next, REVEAL retrieves the top-k most related knowledge pieces from multiple knowledge sources, returns the pre-processed value embeddings stored in memory, and re-encodes the values. Finally, REVEAL fuses the top-k knowledge pieces through an attentive knowledge fusion layer by injecting the retrieval score (dot product between query and key embeddings) as a prior during attention calculation. This structure is instrumental in enabling the memory, encoder, retriever and the generator to be concurrently trained in an end-to-end fashion.
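The retrieval and fusion steps can be sketched as follows. This is a simplified illustration, not the REVEAL fusion layer: retrieval is an exhaustive dot-product search, and the retrieval score is injected by adding it to the attention logits of every token belonging to that entry. Adding the score in logit space is an assumption about how the prior enters the attention, and the function names are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]

def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))

def softmax(scores: List[float]) -> List[float]:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def retrieve_top_k(query: Vector, keys: List[Vector], k: int) -> List[Tuple[int, float]]:
    """Return (index, score) of the k keys with the largest dot product."""
    scored = [(i, dot(query, key)) for i, key in enumerate(keys)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

def fuse(query: Vector, values: List[List[Vector]], scores: List[float]) -> Vector:
    """Attend over all retrieved value tokens, biased by each entry's score."""
    d = len(query)
    logits, tokens = [], []
    for value, score in zip(values, scores):
        for tok in value:
            # content match plus the retrieval score acting as a prior
            logits.append(dot(query, tok) / math.sqrt(d) + score)
            tokens.append(tok)
    weights = softmax(logits)
    return [sum(w * t[j] for w, t in zip(weights, tokens)) for j in range(d)]

keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
memory_values = [[[1.0, 0.0]], [[0.0, 1.0]], [[0.8, 0.2]]]
query = [1.0, 0.0]

top = retrieve_top_k(query, keys, k=2)
fused = fuse(query, [memory_values[i] for i, _ in top], [s for _, s in top])
```

Because the retrieval score takes part in the attention computation, the gradient of the generation loss also reaches the query and key embeddings, which is what allows the retriever to be trained jointly with the generator.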

Overall workflow of REVEAL.


We evaluate REVEAL on knowledge-based visual question answering tasks using the OK-VQA and A-OKVQA datasets. We fine-tune our pre-trained model on the VQA tasks using the same generative objective, where the model takes in an image-question pair as input and generates the text answer as output. We demonstrate that REVEAL achieves better results on the A-OKVQA dataset than earlier attempts that incorporate fixed knowledge or works that utilize large language models (e.g., GPT-3) as an implicit source of knowledge.

Visual question answering results on A-OKVQA. REVEAL achieves higher accuracy in comparison to previous works including ViLBERT, LXMERT, ClipCap, KRISP and GPV-2.

We also evaluate REVEAL on the image captioning benchmarks using the MSCOCO and NoCaps datasets. We directly fine-tune REVEAL on the MSCOCO training split via the cross-entropy generative objective. We measure our performance on the MSCOCO test split and NoCaps evaluation set using the CIDEr metric, which is based on the idea that good captions should be similar to reference captions in terms of word choice, grammar, meaning, and content. Our results on the MSCOCO caption and NoCaps datasets are shown below.

Image captioning results on MSCOCO and NoCaps using the CIDEr metric. REVEAL achieves a higher score in comparison to Flamingo, VinVL, SimVLM and CoCa.

Below we show a couple of qualitative examples of how REVEAL retrieves relevant documents to answer visual questions.

REVEAL can use knowledge from different sources to correctly answer the question.


We present an end-to-end retrieval-augmented visual language (REVEAL) model, which contains a knowledge retriever that learns to utilize a diverse set of knowledge sources with different modalities. We train REVEAL on a massive image-text corpus with four diverse knowledge corpora, and achieve state-of-the-art results on knowledge-intensive visual question answering and image caption tasks. In the future we would like to explore the ability of this model for attribution, and apply it to a broader class of multimodal tasks.


This research was conducted by Ziniu Hu, Ahmet Iscen, Chen Sun, Zirui Wang, Kai-Wei Chang, Yizhou Sun, Cordelia Schmid, David A. Ross and Alireza Fathi.
