Sparse video tubes for joint video and image vision transformers – Google AI Blog

Video understanding is a challenging problem that requires reasoning about both spatial information (e.g., for objects in a scene, including their locations and relations) and temporal information for activities or events shown in a video. There are many video understanding applications and tasks, such as understanding the semantic content of web videos and robot perception. However, current works, such as ViViT and TimeSFormer, densely process the video and require significant compute, especially as model size, video length, and resolution increase.

In “Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning”, to be presented at CVPR 2023, we introduce a simple technique that turns a Vision Transformer (ViT) image encoder into an efficient video backbone using sparse video tubes (learnable visual representations of samples from the video) to reduce the model’s compute needs. This approach can seamlessly process both images and videos, which allows it to leverage both image and video data sources during training. This training further enables our sparse tubes ViT model to coalesce image and video backbones to serve a dual role as either an image or video backbone (or both), depending on the input. We demonstrate that this model is scalable, can be adapted to large pre-trained ViTs without requiring full fine-tuning, and achieves state-of-the-art results across many video classification benchmarks.

Using sparse video tubes to sample a video, combined with a standard ViT encoder, leads to an efficient visual representation that can be seamlessly shared with image inputs.

Building a joint image-video backbone

Our sparse tube ViT uses a standard ViT backbone, consisting of a stack of Transformer layers, that processes video information. Previous methods, such as ViViT, densely tokenize the video and then apply factorized attention, i.e., the attention weights for each token are computed separately for the temporal and spatial dimensions. In the standard ViT architecture, self-attention is computed over the whole token sequence. When using videos as input, token sequences become quite long, which can make this computation slow. Instead, in the method we propose, the video is sparsely sampled using video tubes, which are 3D learnable visual representations of various shapes and sizes (described in more detail below) from the video. These tubes are used to sparsely sample the video using a large temporal stride, i.e., a tube kernel is applied to only a few locations in the video, rather than at every pixel.
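To make the difference in sequence length concrete, the following is a minimal sketch that counts tokens for a dense tokenizer and for a single sparse tube. The video size, kernel shapes, and strides are illustrative assumptions, not the configuration used in the paper, and the helper names `positions` and `num_tokens` are hypothetical.

```python
def positions(size, kernel, stride):
    """Number of kernel placements along one axis (no padding)."""
    if size < kernel:
        return 0
    return (size - kernel) // stride + 1


def num_tokens(video_shape, kernel, stride):
    """Tokens produced by a (t, h, w) kernel applied with a (t, h, w) stride."""
    count = 1
    for size, k, s in zip(video_shape, kernel, stride):
        count *= positions(size, k, s)
    return count


# Assumed input: 32 frames at 224x224 (illustrative only).
video = (32, 224, 224)

# Dense tokenization: non-overlapping 2x16x16 tubelets that cover every pixel.
dense = num_tokens(video, kernel=(2, 16, 16), stride=(2, 16, 16))

# Sparse tube: an 8x8x8 kernel applied only every 16 frames and every 32 pixels.
sparse = num_tokens(video, kernel=(8, 8, 8), stride=(16, 32, 32))

print(dense, sparse)  # 3136 98
```

Under these assumed settings the sparse tube yields roughly 30 times fewer tokens than the dense tokenizer, and the gap widens as the video gets longer or larger.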

By sparsely sampling the video tubes, we can use the same global self-attention module, rather than factorized attention like ViViT. We experimentally show that the addition of factorized attention layers can harm performance because of their uninitialized weights. This single stack of Transformer layers in the ViT backbone also enables better sharing of the weights and improves performance. Sparse video tube sampling is done by using a large spatial and temporal stride that selects tokens on a fixed grid. The large stride reduces the number of tokens in the full network, while still capturing both spatial and temporal information and enabling the efficient processing of all tokens.
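As a rough illustration of why a short token sequence matters, here is a single-head global self-attention in NumPy. It is a generic sketch, not the model's actual implementation: the embedding size, the token counts, and the random weights are all assumptions. The point is that one set of attention weights handles image tokens and video tokens alike, while the attention matrix grows quadratically with the number of tokens.

```python
import numpy as np


def global_self_attention(tokens, w_q, w_k, w_v):
    """Single-head self-attention over the whole token sequence.

    tokens: (N, D) array; w_q, w_k, w_v: (D, D) projection matrices.
    Returns the (N, D) outputs and the (N, N) attention weights.
    """
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
dim = 16  # assumed embedding size
w_q, w_k, w_v = (0.1 * rng.normal(size=(dim, dim)) for _ in range(3))

# Assumed token counts: 14x14 patches for an image, a few hundred for a video.
image_tokens = rng.normal(size=(196, dim))
video_tokens = rng.normal(size=(294, dim))

# The same projection weights process both inputs.
out_img, attn_img = global_self_attention(image_tokens, w_q, w_k, w_v)
out_vid, attn_vid = global_self_attention(video_tokens, w_q, w_k, w_v)

print(attn_img.shape, attn_vid.shape)  # (196, 196) (294, 294)
```

With a dense tokenizer producing thousands of tokens, the same attention matrix would hold millions of entries per head and layer, which is the cost that sparse sampling avoids.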

Sparse video tubes

Video tubes are 3D grid-based cuboids that can have different shapes or categories and capture different information with strides and starting locations that can overlap. In the model, we use three distinct tube shapes that capture: (1) only spatial information (resulting in a set of 2D image patches), (2) long temporal information (over a small spatial area), and (3) both spatial and temporal information equally. Tubes that capture only spatial information can be applied to both image and video inputs. Tubes that capture long temporal information or both temporal and spatial information equally are only applied to video inputs. Depending on the input video size, the three tube shapes are applied to the model multiple times to generate tokens.
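The three tube categories can be sketched as strided 3D kernels applied on a sparse grid. The code below is a simplified illustration under stated assumptions: the tube shapes, strides, offsets, input sizes, and embedding size are made up for the example, the kernels are random rather than learned, and `apply_tube` and `tokenize` are hypothetical helper names.

```python
import numpy as np


def apply_tube(video, kernel, stride, offset=(0, 0, 0)):
    """Apply one tube kernel on a sparse grid, like a strided 3D convolution.

    video:  (T, H, W, C) array.
    kernel: (kt, kh, kw, C, D) array (learned in a real model, random here).
    Returns an (N, D) array with one token per tube placement.
    """
    kt, kh, kw = kernel.shape[:3]
    T, H, W = video.shape[:3]
    tokens = []
    for t in range(offset[0], T - kt + 1, stride[0]):
        for y in range(offset[1], H - kh + 1, stride[1]):
            for x in range(offset[2], W - kw + 1, stride[2]):
                patch = video[t:t + kt, y:y + kh, x:x + kw]
                # Contract (kt, kh, kw, C) against the kernel to get D features.
                tokens.append(np.tensordot(patch, kernel, axes=4))
    return np.array(tokens)


# Illustrative tube definitions (assumed, not the paper's values).
TUBES = {
    "spatial": {"shape": (1, 16, 16), "stride": (16, 16, 16)},
    "long_temporal": {"shape": (16, 4, 4), "stride": (16, 32, 32)},
    "spatio_temporal": {"shape": (8, 8, 8), "stride": (16, 32, 32),
                        "offset": (4, 8, 8)},
}


def tokenize(video, kernels):
    """Concatenate tokens from every tube that fits the input."""
    feats = []
    for name, spec in TUBES.items():
        if spec["shape"][0] > video.shape[0]:
            continue  # video-only tube: the input has too few frames (an image)
        feats.append(apply_tube(video, kernels[name], spec["stride"],
                                spec.get("offset", (0, 0, 0))))
    return np.concatenate(feats)


rng = np.random.default_rng(0)
channels, dim = 3, 8
kernels = {name: rng.normal(size=(*spec["shape"], channels, dim))
           for name, spec in TUBES.items()}

video = rng.normal(size=(32, 64, 64, channels))  # 32 frames
image = rng.normal(size=(1, 64, 64, channels))   # an image is a 1-frame video

print(tokenize(video, kernels).shape)  # (48, 8)
print(tokenize(image, kernels).shape)  # (16, 8)
```

In this sketch an image is treated as a one-frame video, so only the spatial tube produces tokens for it, while a video receives tokens from all three tubes in a single sequence.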

A fixed position embedding, which captures the global location of each tube (including any strides, offsets, etc.) relative to all the other tubes, is applied to the video tubes. Unlike the previous learned position embeddings, this fixed one better enables sparse, overlapping sampling. Capturing the global location of the tube helps the model know where each tube came from, which is especially helpful when tubes overlap or are sampled from distant video locations. Next, the tube features are concatenated to form a set of N tokens. These tokens are processed by a standard ViT encoder. Finally, we apply attention pooling to compress all the tokens into a single representation and feed it to a fully connected (FC) layer to make the classification (e.g., playing soccer, swimming, etc.).
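One possible way to realize such a fixed embedding is a sine/cosine encoding of each tube's global center coordinates, followed by a simple attention pooling and a linear classifier. The sketch below assumes this particular encoding, a random pooling query, and random classifier weights; all function names and sizes are hypothetical and the details may differ from the paper.

```python
import numpy as np


def sincos(positions, dim, max_period=10000.0):
    """Sine/cosine encoding of 1D positions into `dim` channels (dim even)."""
    freqs = 1.0 / max_period ** (np.arange(dim // 2) / (dim // 2))
    angles = positions[:, None] * freqs[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)


def tube_position_embedding(starts, tube_shape, dim):
    """Fixed embedding of each tube's global center (t, y, x).

    starts: (N, 3) start coordinates, which already reflect stride and offset.
    dim must be divisible by 6 (three axes, sine and cosine halves).
    """
    centers = np.asarray(starts, dtype=float) + (np.asarray(tube_shape) - 1) / 2.0
    return np.concatenate(
        [sincos(centers[:, axis], dim // 3) for axis in range(3)], axis=-1)


def attention_pool(tokens, query):
    """Compress (N, D) tokens into one (D,) vector with softmax weights."""
    scores = tokens @ query / np.sqrt(tokens.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ tokens


dim, num_classes = 12, 5  # assumed sizes
rng = np.random.default_rng(0)

# Two tubes of different shapes whose centers coincide at (7.5, 11.5, 11.5)
# receive the same embedding: it depends on global location, not token index.
emb_a = tube_position_embedding([(4, 8, 8)], (8, 8, 8), dim)
emb_b = tube_position_embedding([(0, 10, 10)], (16, 4, 4), dim)
print(np.allclose(emb_a, emb_b))  # True

# Add embeddings to (random stand-in) tube features, pool, and classify.
starts = [(0, 0, 0), (0, 0, 32), (16, 32, 0)]
features = rng.normal(size=(len(starts), dim))
tokens = features + tube_position_embedding(starts, (8, 8, 8), dim)
pooled = attention_pool(tokens, query=rng.normal(size=dim))
logits = pooled @ rng.normal(size=(dim, num_classes))  # FC layer, bias omitted
print(logits.shape)  # (5,)
```

Because the embedding is computed from coordinates rather than looked up by index, it stays well defined when tubes overlap or when the number of tubes changes with the input size.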

Our video ViT model works by sampling sparse video tubes from the video (shown at the bottom) so that image inputs, video inputs, or both can be seamlessly processed. These tubes have different shapes and capture different video features. Tube 1 (yellow) only captures spatial information, resulting in a set of 2D patches that can be applied to image inputs. Tube 2 (red) captures temporal information and some spatial information, and tube 3 (green) equally captures both temporal and spatial information (i.e., the spatial dimensions of the tube, x and y, are the same as the number of frames t). Tubes 2 and 3 can only be applied to video inputs. The position embedding is added to all the tube features.

Scaling video ViTs

The process of building video backbones is computationally intensive, but our sparse tube ViT model enables computationally efficient scaling of video models, leveraging previously trained image backbones. Since image backbones can be adapted to a video backbone, large image backbones can be turned into large video backbones. More specifically, one can transfer the learned video feature representations from a small tube ViT to a large pre-trained image ViT and train the resulting model with video data for only a few steps, as opposed to a full training from scratch.

Our approach enables scaling a sparse tube ViT in a more efficient way. Specifically, the video features from a small video ViT (top network) can be transferred to a large, pre-trained image ViT (bottom network), and further fine-tuned. This requires fewer training steps to achieve strong performance with the large model. This is beneficial because large video models might be prohibitively expensive to train from scratch.


Results

We evaluate our sparse tube ViT approach using the Kinetics-400 (shown below), Kinetics-600 and Kinetics-700 datasets and compare its performance to a long list of prior methods. We find that our approach outperforms all prior methods. Importantly, it outperforms all state-of-the-art methods trained jointly on image+video datasets.

Performance compared to several prior works on the popular Kinetics-400 video dataset. Our sparse tube ViT outperforms state-of-the-art methods.

Furthermore, we test our sparse tube ViT model on the Something-Something V2 dataset, which is commonly used to evaluate more dynamic activities, and also report that it outperforms all prior state-of-the-art approaches.

Performance on the Something-Something V2 video dataset.

Visualizing some learned kernels

It is interesting to understand what kind of rudimentary features are being learned by the proposed model. We visualize them below, showing both the 2D patches, which are shared for both images and videos, and the video tubes. These visualizations show the 2D or 3D information being captured by the projection layer. For example, in the 2D patches, various common features, like edges and colors, are detected, while the 3D tubes capture basic shapes and how they may change over time.

Visualizations of patches and tubes learned by the sparse tube ViT model. The top row shows the 2D patches and the remaining two rows are snapshots from the learned video tubes. The tubes show each patch for the 8 or 4 frames to which they are applied.


Conclusion

We have presented a new sparse tube ViT, which can turn a ViT encoder into an efficient video model and can seamlessly work with both image and video inputs. We also showed that large video encoders can be bootstrapped from small video encoders and image-only ViTs. Our approach outperforms prior methods across several popular video understanding benchmarks. We believe that this simple representation can facilitate much more efficient learning with input videos, seamlessly incorporate either image or video inputs, and effectively eliminate the bifurcation of image and video models for future multimodal understanding.


Acknowledgements

This work was conducted by AJ Piergiovanni, Weicheng Kuo and Anelia Angelova, who are now at Google DeepMind. We thank Abhijit Ogale, Luowei Zhou, Claire Cui and our colleagues in Google Research for their helpful discussions, comments, and support.
