We are on a mission to advance and democratize fundamental AI research and technology to serve humanity, with current focuses on Accurate & Efficient Vision, Creative AI, and Responsible AI.
Accurate & Efficient Vision
Pushing the envelope of cutting-edge AI algorithms and systems to the next level
Neighborhood Attention (NA) is the first efficient and scalable sliding-window attention mechanism for vision. NA is a pixel-wise operation that localizes self-attention (SA) to each pixel's nearest neighbors, and therefore enjoys linear time and space complexity. The sliding-window pattern allows NA's receptive field to grow without extra pixel shifts, and preserves translational equivariance.
Dilated Neighborhood Attention (DiNA) extends NA's sliding-window local attention to sparse global attention at no additional cost. Combinations of NA/DiNA are capable of preserving locality, maintaining translational equivariance, expanding the receptive field exponentially, and capturing longer-range inter-dependencies, leading to significant performance boosts in downstream vision tasks.
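The core operation is easy to sketch: each query pixel attends only to a k×k (possibly dilated) window of keys and values around it, so dilation=1 gives NA and dilation>1 gives DiNA. Below is a minimal single-head PyTorch sketch written for clarity rather than speed; the function name is ours, and it uses zero-padding at image borders, whereas NA proper shifts the window to stay inside the image (the official NATTEN kernels implement the exact, fused version).

```python
import torch
import torch.nn.functional as F

def neighborhood_attention(q, k, v, kernel_size=7, dilation=1):
    # q, k, v: (B, H, W, C), single head for clarity.
    B, H, W, C = q.shape
    pad = dilation * (kernel_size // 2)
    # Gather each pixel's (kernel_size x kernel_size) neighborhood of keys/values.
    # Note: zero-padding is a simplification; NA shifts border windows instead.
    k_pad = F.pad(k.permute(0, 3, 1, 2), (pad, pad, pad, pad))  # (B, C, H+2p, W+2p)
    v_pad = F.pad(v.permute(0, 3, 1, 2), (pad, pad, pad, pad))
    k_nb = F.unfold(k_pad, kernel_size, dilation=dilation)      # (B, C*k*k, H*W)
    v_nb = F.unfold(v_pad, kernel_size, dilation=dilation)
    k_nb = k_nb.view(B, C, kernel_size * kernel_size, H * W)
    v_nb = v_nb.view(B, C, kernel_size * kernel_size, H * W)
    q_flat = q.reshape(B, H * W, C)
    # Attention weights over each pixel's local neighborhood only -> linear in H*W.
    attn = torch.einsum("bnc,bckn->bnk", q_flat, k_nb) * C ** -0.5
    attn = attn.softmax(dim=-1)
    out = torch.einsum("bnk,bckn->bnc", attn, v_nb)
    return out.view(B, H, W, C)
```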
OneFormer is the first multi-task universal image segmentation framework. It achieves state-of-the-art performance across semantic, instance, and panoptic segmentation with a single task-dynamic model that is jointly trained across tasks. OneFormer reduces the underlying resource requirements of segmentation and makes image segmentation more universal and accessible.
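As a usage sketch, the snippet below assumes the OneFormer integration in Hugging Face transformers and the shi-labs/oneformer_ade20k_swin_tiny checkpoint; it illustrates how one set of weights serves all three tasks by only switching the task token.

```python
from PIL import Image
from transformers import OneFormerProcessor, OneFormerForUniversalSegmentation

ckpt = "shi-labs/oneformer_ade20k_swin_tiny"
processor = OneFormerProcessor.from_pretrained(ckpt)
model = OneFormerForUniversalSegmentation.from_pretrained(ckpt)

image = Image.open("scene.jpg")  # any RGB image

# The same weights handle all three tasks; only the task token changes.
for task in ["semantic", "instance", "panoptic"]:
    inputs = processor(images=image, task_inputs=[task], return_tensors="pt")
    outputs = model(**inputs)
    if task == "semantic":
        pred = processor.post_process_semantic_segmentation(
            outputs, target_sizes=[image.size[::-1]])[0]
    elif task == "instance":
        pred = processor.post_process_instance_segmentation(
            outputs, target_sizes=[image.size[::-1]])[0]
    else:
        pred = processor.post_process_panoptic_segmentation(
            outputs, target_sizes=[image.size[::-1]])[0]
```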
Creative AI
Empowering the next generation of creative communication
We built Versatile Diffusion (VD), the first unified multi-flow multimodal diffusion framework, as a step towards Universal Generative AI. Versatile Diffusion can natively support image-to-text, image-variation, text-to-image, and text-variation, and can be further extended to other applications such as semantic-style disentanglement, image-text dual-guided generation, latent image-to-text-to-image editing, and more. Future versions will support more modalities such as speech, music, video and 3D.
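As a usage sketch (assuming the Versatile Diffusion pipelines shipped with recent versions of Hugging Face diffusers and the shi-labs/versatile-diffusion weights), two of the flows can be driven from the same checkpoint:

```python
import torch
from diffusers import (
    VersatileDiffusionTextToImagePipeline,
    VersatileDiffusionImageVariationPipeline,
)

# Text-to-image flow.
t2i = VersatileDiffusionTextToImagePipeline.from_pretrained(
    "shi-labs/versatile-diffusion", torch_dtype=torch.float16
).to("cuda")
image = t2i("an astronaut riding a horse on the beach").images[0]

# Image-variation flow reuses the same unified multi-flow backbone.
var = VersatileDiffusionImageVariationPipeline.from_pretrained(
    "shi-labs/versatile-diffusion", torch_dtype=torch.float16
).to("cuda")
variation = var(image).images[0]
```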
StyleNAT is a flexible and efficient state-of-the-art image generation framework. It is a Style-based GAN that exploits Neighborhood Attention to extend the power of localized attention heads, capturing long-range features and maximizing information gain within the generative process. The flexibility of the system allows it to be adapted to various environments and datasets. StyleNAT attains a new SOTA FID score of 2.046 on FFHQ-256, beating prior art including convolutional models such as StyleGAN-XL and transformers such as HIT and StyleSwin. These results represent a 6.4% improvement on FFHQ-256 over StyleGAN-XL, with a 28% reduction in the number of parameters and a 56% improvement in sampling throughput.
Layer-wise Image Vectorization (LIVE) is a new state-of-the-art image vectorization method that progressively generates an SVG fitting the raster image in a layer-wise fashion. Given an arbitrary input image, LIVE recursively learns the visual concepts by adding new optimizable closed Bézier paths and jointly optimizing all of these paths.
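The layer-wise loop can be sketched in a few lines; the helpers below (differentiable_render, init_path_from_residual, reconstruction_loss) are illustrative placeholders rather than the released API:

```python
import torch

def vectorize(raster, num_layers=8, steps_per_layer=500):
    paths = []  # optimizable closed Bézier paths, added one layer at a time
    for _ in range(num_layers):
        # Initialize a new path where the current rendering differs most from the target.
        residual = raster - differentiable_render(paths, raster.shape)
        paths.append(init_path_from_residual(residual))
        params = [p for path in paths for p in path.parameters()]
        opt = torch.optim.Adam(params, lr=1e-2)
        # Jointly re-optimize all paths added so far.
        for _ in range(steps_per_layer):
            opt.zero_grad()
            render = differentiable_render(paths, raster.shape)
            loss = reconstruction_loss(render, raster)
            loss.backward()
            opt.step()
    return paths  # export as SVG layers, back-to-front
```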
Responsible AI
Solving important real-world problems at scale
Agriculture-Vision is the first major agriculture effort in the CVPR community. The CVPR 2020 paper introduced the first large-scale, high-quality aerial image dataset for agricultural pattern analysis, covering over 1 million acres of farmland in the US and curated over a period of 2 years. The resulting algorithms and their improved versions are being used in production to help farmers with actionable insights to monitor crops and improve yield, contributing to addressing the global food security issue. Alongside these efforts, we have hosted 3 international Agriculture-Vision workshops at CVPR since 2020, with prize challenges, academic-industrial panels, and workshop paper programs that attracted wide participation.
The standard petrography test method for measuring air voids in concrete (ASTM C457) requires a meticulous and lengthy examination of sample phase composition under a stereomicroscope. The expertise and specialized equipment required discourage the use of this test for routine concrete quality control. Though the task can be alleviated with the aid of color-based image segmentation, additional surface color treatment is required. In this work, we investigated the feasibility of using a CNN to segment concrete without color treatment. The CNN demonstrated strong potential to process a wide range of concretes, including those not involved in model training.
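A generic fine-tuning sketch of the idea is below; the DeepLabV3 backbone and the three-phase class set are illustrative assumptions, not the exact setup from the study:

```python
import torch
import torchvision

NUM_CLASSES = 3  # e.g., paste, aggregate, air void (assumed class set)
model = torchvision.models.segmentation.deeplabv3_resnet50(
    weights=None, num_classes=NUM_CLASSES
)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(images, masks):
    # images: (B, 3, H, W) plain RGB scans, no surface color treatment
    # masks:  (B, H, W) integer phase labels
    optimizer.zero_grad()
    logits = model(images)["out"]  # (B, NUM_CLASSES, H, W)
    loss = criterion(logits, masks)
    loss.backward()
    optimizer.step()
    return loss.item()
```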
Cardiac motion estimation plays a key role in MRI cardiac feature tracking and function assessment such as myocardial strain. Our Motion Pyramid Networks is a novel deep-learning-based approach for accurate and efficient cardiac motion estimation, and new evaluation metrics are proposed to represent errors in a clinically meaningful manner. Our Fast Online Adaptive Learning (FOAL) framework is an online gradient-descent-based optimizer that is itself optimized by a meta-learner. The meta-learner enables the online optimizer to perform fast and robust adaptation, preventing dramatic performance drops due to mismatched distributions between the training and testing datasets.
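A minimal sketch of a FOAL-style test-time step is below; motion_net, self_supervised_loss, and the meta-learned step size are illustrative placeholders rather than the released implementation:

```python
import copy
import torch

def adapt_and_estimate(motion_net, frame_pair, meta_lr, steps=2):
    # Clone the trained estimator so adaptation stays local to this test subject.
    net = copy.deepcopy(motion_net)
    # The step size is produced by the meta-learner rather than hand-tuned.
    optimizer = torch.optim.SGD(net.parameters(), lr=meta_lr)
    for _ in range(steps):
        optimizer.zero_grad()
        flow = net(frame_pair)
        # Self-supervised objective (e.g., warping/photometric consistency) on the test pair.
        loss = self_supervised_loss(flow, frame_pair)
        loss.backward()
        optimizer.step()
    with torch.no_grad():
        return net(frame_pair)  # motion estimate after fast online adaptation
```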