On-device acceleration of large diffusion models via GPU-aware optimizations – Google AI Blog

The proliferation of large diffusion models for image generation has led to a significant increase in model size and inference workloads. On-device ML inference in mobile environments requires meticulous performance optimization and consideration of trade-offs due to resource constraints. Running inference of large diffusion models (LDMs) on-device, driven by the need for cost efficiency and user privacy, presents even greater challenges due to the substantial memory requirements and computational demands of these models.

We address this challenge in our work titled “Speed Is All You Need: On-Device Acceleration of Large Diffusion Models via GPU-Aware Optimizations” (to be presented at the CVPR 2023 workshop for Efficient Deep Learning for Computer Vision), which focuses on the optimized execution of a foundational LDM model on a mobile GPU. In this blog post, we summarize the core techniques we employed to successfully execute large diffusion models like Stable Diffusion at full resolution (512×512 pixels) and 20 iterations on modern smartphones, running the original model without distillation in under 12 seconds. As discussed in our previous blog post, GPU-accelerated ML inference is often limited by memory performance, and execution of LDMs is no exception. Therefore, the central theme of our optimization is efficient memory input/output (I/O), even if it means choosing memory-efficient algorithms over those that prioritize arithmetic logic unit efficiency. Ultimately, our primary objective is to reduce the overall latency of the ML inference.

A sample output of an LDM on Mobile GPU with the prompt text: “a photo realistic and high resolution image of a cute puppy with surrounding flowers”.

Enhanced attention module for memory efficiency

An ML inference engine typically provides a variety of optimized ML operations. Despite this, achieving optimal performance can still be challenging, as there is a certain amount of overhead for executing individual neural net operators on a GPU. To mitigate this overhead, ML inference engines incorporate extensive operator fusion rules that consolidate multiple operators into a single operator, thereby reducing the number of iterations across tensor elements while maximizing compute per iteration. For instance, TensorFlow Lite utilizes operator fusion to combine computationally expensive operations, like convolutions, with subsequent activation functions, like rectified linear units, into one.

A clear opportunity for optimization is the heavily used attention block adopted in the denoiser model in the LDM. The attention blocks allow the model to focus on specific parts of the input by assigning higher weights to important regions. There are multiple ways one can optimize the attention modules, and we selectively employ one of the two optimizations explained below, depending on which one performs better.

The first optimization, which we call partially fused softmax, removes the need for extensive memory writes and reads between the softmax and the matrix multiplication in the attention module. Let the attention block be just a simple matrix multiplication of the form Y = softmax(X) * W where X and W are 2D matrices of shape a×b and b×c, respectively (shown below in the top half).

For numerical stability, T = softmax(X) is typically calculated in three passes:

  1. Determine the maximum value in the list, i.e., for each row in matrix X
  2. Sum up the exponentials of the differences between each list item and the maximum value (from pass 1)
  3. Divide the exponential of each item minus the maximum value by the sum from pass 2
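
The three passes above can be sketched in plain Python for a single row of X. This is only a minimal illustration of the arithmetic, not the GPU code used in the inference engine; the helper name `softmax_three_pass` is made up for this example.

```python
import math

def softmax_three_pass(row):
    """Numerically stable softmax over one row (hypothetical helper)."""
    # Pass 1: maximum value of the row.
    m = max(row)
    # Pass 2: sum of exp(x - m) over the row.
    s = sum(math.exp(x - m) for x in row)
    # Pass 3: normalize each shifted exponential by the sum.
    return [math.exp(x - m) / s for x in row]

# Inputs this large would overflow math.exp without the max subtraction.
probs = softmax_three_pass([1000.0, 1001.0, 1002.0])
```

Subtracting the row maximum keeps every exponent at or below zero, which is why the computation stays finite even for large inputs.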

Carrying out these passes naïvely would result in a huge memory write for the temporary intermediate tensor T holding the output of the entire softmax function. We bypass this large memory write if we only store the results of passes 1 and 2, labeled m and s, respectively, which are small vectors with a elements each, compared to T, which has a·b elements. With this technique, we are able to reduce tens or even hundreds of megabytes of memory consumption by multiple orders of magnitude (shown below in the bottom half).

Attention modules. Top: A naïve attention block, composed of a SOFTMAX (with all three passes) and a MATMUL, requires a large memory write for the big intermediate tensor T. Bottom: Our memory-efficient attention block with partially fused softmax in MATMUL only needs to store two small intermediate tensors for m and s.
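
To make the idea concrete, here is a small pure-Python sketch under simplifying assumptions: matrices are lists of lists, and the function names (`row_stats`, `fused_softmax_matmul`, `naive_softmax_matmul`) are invented for illustration. It stores only m and s, then recomputes each softmax entry inside the matrix multiplication instead of writing T out. The real implementation is a GPU shader; this only mirrors the arithmetic.

```python
import math

def row_stats(X):
    """Passes 1 and 2: per-row max m and per-row sum s (a elements each)."""
    m = [max(row) for row in X]
    s = [sum(math.exp(x - mi) for x in row) for row, mi in zip(X, m)]
    return m, s

def fused_softmax_matmul(X, W, m, s):
    """Pass 3 folded into the matmul: Y = softmax(X) * W without storing T."""
    a, b, c = len(X), len(W), len(W[0])
    Y = [[0.0] * c for _ in range(a)]
    for i in range(a):
        for k in range(b):
            # T[i][k] is recomputed on the fly and never written to memory.
            t = math.exp(X[i][k] - m[i]) / s[i]
            for j in range(c):
                Y[i][j] += t * W[k][j]
    return Y

def naive_softmax_matmul(X, W):
    """Reference: materialize the full a x b tensor T, then multiply."""
    T = []
    for row in X:
        mx = max(row)
        e = [math.exp(x - mx) for x in row]
        total = sum(e)
        T.append([v / total for v in e])
    b, c = len(W), len(W[0])
    return [[sum(T[i][k] * W[k][j] for k in range(b)) for j in range(c)]
            for i in range(len(X))]

X = [[0.5, 1.5, -1.0], [2.0, 0.0, 1.0]]   # a x b = 2 x 3
W = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]  # b x c = 3 x 2
m, s = row_stats(X)
Y = fused_softmax_matmul(X, W, m, s)
Y_ref = naive_softmax_matmul(X, W)
```

Both paths produce the same Y; the difference is that the fused path keeps only 2·a numbers between the passes rather than a·b.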

The other optimization involves employing FlashAttention, which is an I/O-aware, exact attention algorithm. This algorithm reduces the number of GPU high-bandwidth memory accesses, making it a good fit for our memory bandwidth–limited use case. However, we found this technique to only work for SRAM with certain sizes and to require a large number of registers. Therefore, we only leverage this technique for attention matrices with a certain size on a select set of GPUs.
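
FlashAttention stays exact while working on small tiles by using a running ("online") softmax: it keeps a running maximum and running sum and rescales earlier partial results whenever the maximum changes. The sketch below shows only that idea for a single query row with scalar values. The function names and the `block` parameter are assumptions made for illustration; real FlashAttention operates on matrix tiles held in on-chip SRAM.

```python
import math

def online_softmax_weighted_sum(scores, values, block=2):
    """Exact sum_i softmax(scores)_i * values_i, one block at a time."""
    m = float("-inf")  # running maximum
    s = 0.0            # running sum of exp(score - m)
    acc = 0.0          # running sum of exp(score - m) * value
    for start in range(0, len(scores), block):
        blk_scores = scores[start:start + block]
        blk_values = values[start:start + block]
        m_new = max(m, max(blk_scores))
        # Rescale what was accumulated under the previous maximum.
        scale = math.exp(m - m_new)
        s *= scale
        acc *= scale
        for x, v in zip(blk_scores, blk_values):
            w = math.exp(x - m_new)
            s += w
            acc += w * v
        m = m_new
    return acc / s

def reference_weighted_sum(scores, values):
    """Straightforward version that needs the whole row at once."""
    mx = max(scores)
    e = [math.exp(x - mx) for x in scores]
    return sum(w * v for w, v in zip(e, values)) / sum(e)

scores = [0.1, 2.0, -1.0, 3.5, 0.7]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
tiled = online_softmax_weighted_sum(scores, values, block=2)
full = reference_weighted_sum(scores, values)
```

Because each block is consumed once and only three scalars are carried between blocks, the full row of softmax weights never has to be stored.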

Winograd fast convolution for 3×3 convolution layers

The backbone of common LDMs heavily relies on 3×3 convolution layers (convolutions with filter size 3×3), comprising over 90% of the layers in the decoder. Despite increased memory consumption and numerical errors, we found Winograd fast convolution to be effective at speeding up the convolutions. Distinct from the filter size 3×3 used in convolutions, tile size refers to the size of a sub-region of the input tensor that is processed at a time. Increasing the tile size enhances the efficiency of the convolution in terms of arithmetic logic unit (ALU) usage. However, this improvement comes at the expense of increased memory consumption. Our tests indicate that a tile size of 4×4 achieves the optimal trade-off between computational efficiency and memory usage.
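
For readers unfamiliar with Winograd convolution, the smallest one-dimensional case, usually written F(2,3), shows where the savings come from: two outputs of a 3-tap filter are produced with 4 multiplications instead of 6. The sketch below is the standard textbook construction, not the shader code from this work, and the function names are illustrative. The 2D tiles discussed in this section apply the same transform along both axes.

```python
def direct_conv_1d(d, g):
    """Two outputs of a 3-tap filter over 4 inputs: 6 multiplications."""
    return [d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
            d[1] * g[0] + d[2] * g[1] + d[3] * g[2]]

def winograd_f23(d, g):
    """Same two outputs using 4 data-by-filter multiplications."""
    # Filter transform; in practice this is precomputed once per filter,
    # which is why the transformed weights take more memory.
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]
    # Input transform followed by four element-wise multiplications.
    m1 = (d[0] - d[2]) * u0
    m2 = (d[1] + d[2]) * u1
    m3 = (d[2] - d[1]) * u2
    m4 = (d[1] - d[3]) * u3
    # Output transform.
    return [m1 + m2 + m3, m2 - m3 - m4]

d = [1.0, 2.0, 3.0, 4.0]
g = [0.5, -1.0, 2.0]
y_direct = direct_conv_1d(d, g)
y_winograd = winograd_f23(d, g)
```

The extra additions and the divisions by 2 in the transforms are also where the additional numerical error mentioned above comes from.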

Tile size    FLOPS savings    Memory usage (intermediate tensors)    Memory usage (weights)
2×2          2.25×            4.00×                                  1.77×
4×4          4.00×            2.25×                                  4.00×
6×6          5.06×            1.80×                                  7.12×
8×8          5.76×            1.56×                                  11.1×

Impact of Winograd with varying tile sizes for 3×3 convolutions.
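
The figures in the table are consistent with a simple counting model, sketched below. The model is an assumption on our part rather than something stated in the post: a direct 3×3 convolution needs 9 multiplications per output, while Winograd with an m×m output tile works on an (m+2)×(m+2) patch and needs one multiplication per patch element. This reproduces the FLOPS savings column exactly and the two memory columns to within a few hundredths. The function name `winograd_ratios` is hypothetical.

```python
def winograd_ratios(tile):
    """Ratios for a tile x tile output tile with a 3 x 3 filter (assumed model)."""
    patch = (tile + 2) ** 2                  # elements in the transformed patch
    flops_savings = 9 * tile ** 2 / patch    # direct multiplies / Winograd multiplies
    intermediate_memory = patch / tile ** 2  # transformed patch vs. output tile
    weight_memory = patch / 9                # transformed filter vs. 3 x 3 filter
    return flops_savings, intermediate_memory, weight_memory

# One (tile, flops, intermediates, weights) row per tile size in the table.
rows = [(tile,) + winograd_ratios(tile) for tile in (2, 4, 6, 8)]
```

Under this model, FLOPS savings grow with tile size but flatten out, while the transformed weights grow quadratically, which matches the trade-off described above.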

Specialized operator fusion for memory efficiency

We discovered that performantly inferring LDMs on a mobile GPU requires significantly larger fusion windows for commonly employed layers and units in LDMs than current off-the-shelf on-device GPU-accelerated ML inference engines provide. Consequently, we developed specialized implementations that could execute a larger range of neural operators than typical fusion rules would permit. Specifically, we focused on two specializations: the Gaussian Error Linear Unit (GELU) and the group normalization layer.

An approximation of GELU with the hyperbolic tangent function requires writing to and reading from seven auxiliary intermediate tensors (shown as light orange rounded rectangles in the figure below), reading from the input tensor x three times, and writing to the output tensor y once, across eight GPU programs that each implement one of the labeled operations (light blue rectangles). A custom GELU implementation that performs the eight operations in a single shader (shown at the bottom of the figure) can bypass all the memory I/O for the intermediate tensors.

GELU implementations. Top: A naïve implementation with built-in operations would require 8 memory writes and 10 reads. Bottom: Our custom GELU only requires 1 memory read (for x) and 1 write (for y).
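
The difference between the two implementations can be sketched in plain Python, with lists standing in for GPU tensors. The tanh approximation used here is the widely used form 0.5·x·(1 + tanh(√(2/π)·(x + 0.044715·x³))). The exact order of the eight operations is an assumption, chosen so that it matches the counts given above (seven intermediates, x read three times); the function names are illustrative.

```python
import math

SQRT_2_OVER_PI = math.sqrt(2.0 / math.pi)

def gelu_unfused(x):
    """Eight separate element-wise passes, each producing a full tensor."""
    t1 = [v ** 3 for v in x]                 # reads x
    t2 = [0.044715 * v for v in t1]
    t3 = [a + b for a, b in zip(x, t2)]      # reads x
    t4 = [SQRT_2_OVER_PI * v for v in t3]
    t5 = [math.tanh(v) for v in t4]
    t6 = [1.0 + v for v in t5]
    t7 = [a * b for a, b in zip(x, t6)]      # reads x
    return [0.5 * v for v in t7]             # writes y

def gelu_fused(x):
    """One pass: read x once, write y once, intermediates stay in locals."""
    y = []
    for v in x:
        inner = SQRT_2_OVER_PI * (v + 0.044715 * v ** 3)
        y.append(0.5 * v * (1.0 + math.tanh(inner)))
    return y

x = [-2.0, -0.5, 0.0, 1.0, 3.0]
y_unfused = gelu_unfused(x)
y_fused = gelu_fused(x)
```

In the unfused version, `t1` through `t7` correspond to the seven intermediate tensors that must be written and read back; in the fused version they exist only as per-element temporaries, analogous to registers in a single shader.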


After applying all of these optimizations, we conducted tests of Stable Diffusion 1.5 (image resolution 512×512, 20 iterations) on high-end mobile devices. Running Stable Diffusion with our GPU-accelerated ML inference model uses 2,093MB for the weights and 84MB for the intermediate tensors. On the latest high-end smartphones, Stable Diffusion can be run in under 12 seconds.

Stable Diffusion runs on modern smartphones in under 12 seconds. Note that running the decoder after each iteration for displaying the intermediate output in this animated GIF results in a ~2× slowdown.


Performing on-device ML inference of large models has proven to be a substantial challenge, encompassing limitations in model file size, extensive runtime memory requirements, and protracted inference latency. By recognizing memory bandwidth usage as the primary bottleneck, we directed our efforts towards optimizing memory bandwidth utilization and striking a delicate balance between ALU efficiency and memory efficiency. As a result, we achieved state-of-the-art inference latency for large diffusion models. You can learn more about this work in the paper.


We'd like to thank Yu-Hui Chen, Jiuqiang Tang, Frank Barchard, Yang Zhao, Joe Zou, Khanh LeViet, Chuo-Ling Chang, Andrei Kulik, Lu Wang, and Matthias Grundmann.
