Researchers have developed a technique that enables artificial intelligence (AI) programs to map three-dimensional spaces more accurately by combining two-dimensional images from multiple cameras. Because the approach works well with limited computing resources, it has the potential to improve navigation performance in autonomous vehicles.
“Most autonomous vehicles use powerful AI programs called vision transformers to take 2D images from multiple cameras and create a representation of the 3D space around the vehicle,” says Tianfu Wu, the paper’s corresponding author and an associate professor of electrical and computer engineering at North Carolina State University. “However, while each of these AI algorithms takes a unique approach, there is still plenty of opportunity for development.
“Our technique, called Multi-View Attentive Contextualization (MvACon), is a plug-and-play supplement that can be used in conjunction with these existing vision transformer AIs to improve their ability to map 3D spaces,” Wu explains. “The vision transformers aren’t getting any additional data from their cameras; they’re just able to make better use of it.”
MvACon works by adapting a method known as Patch-to-Cluster Attention (PaCa), which Wu and his collaborators published last year. PaCa enables transformer AIs to identify objects in images more efficiently and effectively.
“The key advance here is applying what we demonstrated with PaCa to the challenge of mapping 3D space using multiple cameras,” Wu explains.
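To give a rough sense of the mechanism, the sketch below illustrates a patch-to-cluster attention layer in the spirit of PaCa: rather than every image patch attending to every other patch, the patches are first summarized into a small set of learned cluster tokens, and each patch then attends only to those clusters. This is an illustrative simplification written in PyTorch, not the authors’ released code; the class name, the soft-assignment design, the number of clusters, and the toy six-camera usage at the end are all assumptions made for demonstration.

```python
import torch
import torch.nn as nn


class PatchToClusterAttention(nn.Module):
    """Illustrative patch-to-cluster attention in the spirit of PaCa.

    Patches are first summarized into a small number of cluster tokens,
    then each patch attends to the clusters instead of to every other
    patch, reducing attention cost from O(N^2) to O(N * M) with M << N.
    """

    def __init__(self, dim: int, num_clusters: int = 16):
        super().__init__()
        self.num_clusters = num_clusters
        # Soft assignment of patches to clusters (hypothetical design choice).
        self.to_cluster_logits = nn.Linear(dim, num_clusters)
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, num_patches, dim), e.g. flattened features
        # from one of the vehicle's camera views.
        B, N, D = patches.shape

        # 1) Summarize patches into M cluster tokens via soft assignment
        #    (normalized across patches for each cluster).
        assign = self.to_cluster_logits(patches).softmax(dim=1)   # (B, N, M)
        clusters = torch.einsum("bnm,bnd->bmd", assign, patches)  # (B, M, D)

        # 2) Each patch attends to the clusters, not to all other patches.
        q = self.q(patches)                                       # (B, N, D)
        k = self.k(clusters)                                      # (B, M, D)
        v = self.v(clusters)                                      # (B, M, D)
        attn = torch.einsum("bnd,bmd->bnm", q, k) / D ** 0.5
        attn = attn.softmax(dim=-1)
        out = torch.einsum("bnm,bmd->bnd", attn, v)               # (B, N, D)
        return self.proj(out)


# Toy usage: six camera views, each producing a grid of patch features.
if __name__ == "__main__":
    layer = PatchToClusterAttention(dim=256, num_clusters=16)
    views = torch.randn(6, 40 * 40, 256)  # 6 cameras, 40x40 patches each
    enriched = layer(views)
    print(enriched.shape)  # torch.Size([6, 1600, 256])
```

The practical appeal of this pattern, and presumably part of why the added computing demand stays small, is that the attention cost scales with the number of clusters rather than with the square of the number of patches.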
To evaluate MvACon’s performance, the researchers combined it with three leading vision transformers: BEVFormer, its DFA3D variant, and PETR. In each case, the vision transformer collected two-dimensional images from six separate cameras, and in all three cases MvACon significantly improved performance.
“Performance was particularly improved when it came to locating objects, as well as the speed and orientation of those objects,” Wu explains. “And the increase in computing demand from adding MvACon to the vision transformers was virtually negligible.
“Our next steps will be to test MvACon against more benchmark datasets and actual video input from autonomous vehicles. If MvACon continues to outperform existing vision transformers, we believe it will be widely adopted.
The study, “Multi-View Attentive Contextualization for Multi-View 3D Object Detection,” will be presented on June 20 at the IEEE/CVF Conference on Computer Vision and Pattern Recognition in Seattle, Washington. The paper’s first author, Xianpeng Liu, is a recent Ph.D. graduate from NC State. The paper was co-written by Ce Zheng and Chen Chen of the University of Central Florida, Ming Qian and Nan Xue of the Ant Group, and Zhebin Zhang and Chen Li of the OPPO U.S. Research Center.
The work was supported by the National Science Foundation (projects 1909644, 2024688, and 2013451), the United States Army Research Office (grants W911NF1810295 and W911NF2210010), and an Innopeak Technology, Inc. research gift fund.
Reference: New technique improves AI ability to map 3D space with 2D cameras