Microsoft envisions efficient 4D tracking using deep learning methods

Some of Microsoft's recently published patents include an automated quick task system for Cortana, an AI device with a fisheye camera, and a story-creation system with auto-generated content. Yesterday, we also reported on a rather out-of-the-box idea coming from the tech giant - a liquid-powered hinge for foldable devices.

Now, Microsoft has been granted a patent for a 4D tracking system that uses depth data sensed by 3D cameras. The system then applies deep learning models to implement 4D dynamic solid modeling, which helps identify real-time actions in cluttered environments more accurately.

Microsoft considers the added dimension of time a key factor in this process. As the description above suggests, the motivation behind the idea is to improve object and action recognition in large-scale, crowded areas - an office space, for example, like the one pictured above.

In a bit more detail, here is one described way in which such an environment can be analyzed. First, depth data sensed by multiple 3D cameras over time is received, and the volumetric pixel (voxel) occupancy is determined from it. This occupancy is used to create a 3D solid volume representation, from which a subject is selected. The selected subject can then be recognized using subject classifiers, such as those modeled on convolutional neural networks (CNNs).
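To make the voxel occupancy step more concrete, here is a minimal Python sketch. It is not taken from the patent: the function names `voxelize` and `fuse_cameras`, the `voxel_size` parameter, and the assumption that every camera's depth data has already been converted to 3D points in one shared world coordinate frame are all hypothetical choices made for illustration.

```python
import math
from typing import Iterable, List, Set, Tuple

Point = Tuple[float, float, float]   # a 3D point in a shared world frame (assumed)
Voxel = Tuple[int, int, int]         # integer index of a cube in the voxel grid


def voxelize(points: Iterable[Point], voxel_size: float) -> Set[Voxel]:
    """Return the set of voxel indices that contain at least one point."""
    occupied: Set[Voxel] = set()
    for x, y, z in points:
        # floor() keeps negative coordinates in the correct voxel.
        occupied.add((
            math.floor(x / voxel_size),
            math.floor(y / voxel_size),
            math.floor(z / voxel_size),
        ))
    return occupied


def fuse_cameras(per_camera_points: List[List[Point]], voxel_size: float) -> Set[Voxel]:
    """Merge the occupancy seen by several cameras into one occupied-voxel set."""
    occupied: Set[Voxel] = set()
    for points in per_camera_points:
        occupied |= voxelize(points, voxel_size)
    return occupied


# Two hypothetical cameras observing the same scene from different positions.
camera_a = [(0.2, 0.2, 0.2), (1.2, 0.3, 0.0)]
camera_b = [(0.3, 0.1, 0.4), (0.1, 1.4, 0.0)]
scene = fuse_cameras([camera_a, camera_b], voxel_size=0.5)
```

In this sketch, a voxel counts as occupied as soon as any camera sees a point inside it, which is one simple way to combine views so that a surface hidden from one camera can still be filled in by another. A real system would also have to handle sensor noise and camera calibration, which are left out here.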

At this point, tracking of the selected subject using depth data begins. Because this tracking happens over time, it is another way in which the fourth dimension comes into play. Finally, the 3D solid volume representation is used to recognize the action, which is then output, completing the 4D tracking procedure.
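The idea of following a subject from frame to frame can be illustrated with a short Python sketch. The patent's actual tracking method is not described here, so this example swaps in a much simpler technique, greedy nearest-centroid matching. The names `centroid` and `track`, the `max_jump` threshold, and the input format are all hypothetical.

```python
import math
from typing import Dict, Iterable, List, Tuple

Voxel = Tuple[int, int, int]
Centroid = Tuple[float, float, float]


def centroid(voxels: Iterable[Voxel]) -> Centroid:
    """Average position of a subject's occupied voxels (in voxel units)."""
    voxels = list(voxels)
    n = len(voxels)
    return (
        sum(v[0] for v in voxels) / n,
        sum(v[1] for v in voxels) / n,
        sum(v[2] for v in voxels) / n,
    )


def track(frames: List[List[Centroid]], max_jump: float) -> Dict[int, List[Centroid]]:
    """Greedily link per-frame subject centroids into tracks over time.

    frames[t] holds the centroids detected at time step t. A detection joins
    an existing track if it is that track's nearest remaining detection and
    lies within max_jump of the track's last position; otherwise it starts
    a new track.
    """
    tracks: Dict[int, List[Centroid]] = {}
    next_id = 0
    for detections in frames:
        unmatched = list(detections)
        for path in tracks.values():
            if not unmatched:
                break
            nearest = min(unmatched, key=lambda c: math.dist(c, path[-1]))
            if math.dist(nearest, path[-1]) <= max_jump:
                path.append(nearest)
                unmatched.remove(nearest)
        # Detections that matched no track become new tracks.
        for c in unmatched:
            tracks[next_id] = [c]
            next_id += 1
    return tracks


# Two hypothetical subjects moving slowly along the x axis over three frames.
frames = [
    [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)],
    [(0.2, 0.0, 0.0), (5.1, 0.0, 0.0)],
    [(0.4, 0.0, 0.0), (5.3, 0.0, 0.0)],
]
paths = track(frames, max_jump=1.0)
```

Each resulting track is a sequence of 3D positions indexed by time, which is the sense in which the data becomes four-dimensional. An action recognizer would then operate on such sequences, or on the underlying volumes, rather than on any single frame.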

Notably, this can help the system develop a comprehensive understanding of the interactions between the various people and objects in the environment. From Microsoft's perspective, this enables the computer vision model to mimic the much more nuanced human vision.

*Example 4D dynamic solid modeling system*

Current systems of this kind require deep learning methods that are based on multiple data streams. To overcome issues such as objects partially blocking each other's actions, or cameras' viewing angles differing from their training angles, they require a wide variation of camera settings, object appearances, and more. Some approaches even need special hardware for complex calibration methods and high-quality stereo imaging.

In comparison, Microsoft's proposed system would provide 4D information in real time without being held back by onerous equipment or taxing processing requirements. In fact, in some cases, the described 4D dynamic solid modeling technique can be carried out using only one to two percent of a generic GPU's resources. Additionally, with a single GTX 1080 Ti, 10 people could be tracked, with their actions inferred, at 15 frames per second, which is pretty fast in the context of this discussion.

The concept does seem quite interesting, with a variety of use cases being cited by the Redmond firm. However, it should be kept in mind that this is - like most patents - just an idea for now. As such, there is no guarantee that Microsoft will ever pursue it.