
Researchers accelerate sparse inference in XNNPACK and TensorFlow Lite for real-time apps


As the Universal Approximation Theorem shows, neural networks can approximate virtually any function. This allows us to capture hidden patterns in data and build more accurate and robust models for a wide variety of tasks. A big caveat, however, is that neural networks tend to grow quickly as the complexity of the task at hand increases, and these large networks in turn require substantial computational power.

To this end, researchers have been working towards optimizing large neural networks so that they can run on smaller devices like smartphones and less powerful computers. Inference frameworks like TensorFlow Lite, paired with the XNNPACK ML acceleration library, specialize in this task: they optimize machine learning models to run on a variety of devices by finding a sweet spot between model size, inference speed, and accuracy. Building on this, Google today released new features for the XNNPACK acceleration library and TensorFlow Lite that enable efficient inference of sparse networks.
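
For context, converting a trained model to the TensorFlow Lite format is the usual entry point to this stack; on CPU, the converted model runs through XNNPACK-accelerated kernels. The snippet below is a minimal sketch of that workflow, not Google's exact pipeline, and the tiny Keras model is a hypothetical stand-in for a real network.

```python
# Minimal sketch: export a Keras model to TensorFlow Lite and load it for
# on-device-style inference. The model below is a placeholder, not a real network.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(224, 224, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Convert to the .tflite flatbuffer format; Optimize.DEFAULT enables
# post-training optimizations that trade a little accuracy for size and speed.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)

# The TensorFlow Lite Interpreter executes the model on device,
# using XNNPACK-backed CPU kernels and multiple threads.
interpreter = tf.lite.Interpreter(model_content=tflite_model, num_threads=4)
interpreter.allocate_tensors()
```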

One of the most noteworthy updates is that XNNPACK can now detect whether a model is sparse. Sparse networks, as the name suggests, are neural networks in which a sizeable share of the weights and biases are set to zero. Not only does this shrink the model's memory footprint, it also cuts the number of multiplication and addition operations needed during forward and backward propagation, making the entire process faster while minimizing the hit on accuracy. As a result, sparse networks have come to the forefront as a promising model architecture for machine learning across a range of devices.
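
To make the arithmetic concrete, here is a toy illustration (not XNNPACK code) of the saving: if 80% of a layer's weights are zeroed out by magnitude pruning, only the remaining 20% of entries contribute multiply-add work during inference.

```python
# Toy illustration of why sparsity reduces compute: count the multiply-adds
# a dense weight matrix needs versus an 80%-sparse version of the same matrix.
import numpy as np

rng = np.random.default_rng(0)
weights = rng.standard_normal((256, 256))

# Magnitude pruning: zero out the 80% of weights with the smallest magnitude.
threshold = np.quantile(np.abs(weights), 0.8)
sparse_weights = np.where(np.abs(weights) >= threshold, weights, 0.0)

dense_macs = weights.size                       # every weight needs a multiply-add
sparse_macs = np.count_nonzero(sparse_weights)  # only non-zero weights do real work
print(f"dense multiply-adds:  {dense_macs}")
print(f"sparse multiply-adds: {sparse_macs} ({sparse_macs / dense_macs:.0%} of dense)")
```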

After the latest update, if XNNPACK detects sparsity in a network, it switches to a sparse inference mode that accelerates 1x1 convolution kernels. By processing multiple pixels in parallel across several threads, this mode delivers speedups of around 1.8x to 2.3x when at least 80% of the weights in the network are zero, Google wrote. Pictorially, this is represented in the animation below.

Sparsification makes processing faster, improving the frames per second
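
Reaching that roughly 80% sparsity level is typically done by pruning a model during training. The sketch below shows one common route, magnitude pruning with the TensorFlow Model Optimization toolkit; it is an assumed workflow rather than Google's exact recipe, and the model, training data, and step counts are placeholders.

```python
# Hedged sketch: prune a Keras model towards ~80% sparsity with the
# TensorFlow Model Optimization toolkit, then strip the pruning wrappers
# before converting to TensorFlow Lite. All model/data details are placeholders.
import tensorflow as tf
import tensorflow_model_optimization as tfmot

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Ramp sparsity from 0% to 80% over the first 1,000 training steps.
pruning_schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0,
    final_sparsity=0.8,
    begin_step=0,
    end_step=1000,
)
pruned = tfmot.sparsity.keras.prune_low_magnitude(
    model, pruning_schedule=pruning_schedule)

pruned.compile(optimizer="adam",
               loss="sparse_categorical_crossentropy",
               metrics=["accuracy"])

# Fine-tune with the pruning callback so the sparsity mask is updated each step;
# (x_train, y_train) stand in for a real training set.
# pruned.fit(x_train, y_train, epochs=2,
#            callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# Remove the pruning wrappers so the exported model contains plain, sparse weights.
final_model = tfmot.sparsity.keras.strip_pruning(pruned)
tflite_model = tf.lite.TFLiteConverter.from_keras_model(final_model).convert()
```

Whether a pruned network actually lands in the 1.8x to 2.3x range will depend on its architecture, since the accelerated sparse path announced here targets models built around 1x1 convolutions.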

Concretely, researchers at Google demonstrated that it is possible to sparsify the models behind tasks like background blur and gesture detection. In the case of Google Meet, for example, sparsification sped up the model by 30%, which gave more users access to higher-quality models.

We were able to speed up the [Meet background blur] model by 30% by applying a 70% sparsification, while preserving the quality of the foreground mask. We examined the predictions of the sparse and dense models on images from 17 geographic subregions, finding no significant difference, and released the details in the associated model card.

[...] Compared with the dense model the [MediaPipe Hands] sparse model improved the inference by a factor of two, achieving the identical landmark quality as the distilled model. In a sense, sparsification can be thought of as an automatic approach to unstructured model distillation, which can improve model performance without extensive manual effort.

Speedups of this nature improve the real-time performance of applications on portable devices that lack substantial computational power. Moving forward, the researchers will continue to extend XNNPACK and TensorFlow Lite with support for more operations and optimizations. In the meantime, if you are interested in further details, you can check out Google's recommendations on sparsification in this blog post. Or, if you fancy a little exploration, be sure to check out the dense and sparse solutions of MediaPipe Hands.
