SHI Labs

Research Highlights

Accurate & Efficient Vision

Pushing the envelope of cutting-edge AI algorithms and systems to the next level

Neighborhood Attention (NA) is the first efficient and scalable sliding-window attention mechanism for vision. NA is a pixel-wise operation that localizes self-attention (SA) to each pixel's nearest neighbors, so for a fixed neighborhood size its time and space complexity is linear in the number of pixels. The sliding-window pattern lets NA's receptive field grow without extra pixel shifts and preserves translational equivariance.
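To make the mechanism above concrete, here is a minimal NumPy sketch of single-head neighborhood attention on a 2D feature map. It is an illustration under stated assumptions, not the authors' implementation: the function name `neighborhood_attention_2d` is hypothetical, there is no relative position bias, dilation, or multi-head logic, and windows at the border are assumed to be clamped so that every pixel attends to exactly `kernel_size * kernel_size` neighbors.

```python
import numpy as np


def neighborhood_attention_2d(q, k, v, kernel_size=3):
    """Naive single-head neighborhood attention (illustrative sketch only).

    q, k, v: arrays of shape (H, W, D). Each pixel attends to the
    kernel_size x kernel_size window of pixels nearest to it.
    """
    H, W, D = q.shape
    assert kernel_size % 2 == 1, "kernel_size is assumed to be odd"
    assert kernel_size <= min(H, W), "window must fit inside the feature map"
    r = kernel_size // 2
    scale = D ** -0.5
    out = np.zeros(v.shape, dtype=np.float64)

    for i in range(H):
        # Clamp the window so it stays inside the map (assumed border handling).
        i0 = min(max(i - r, 0), H - kernel_size)
        for j in range(W):
            j0 = min(max(j - r, 0), W - kernel_size)
            keys = k[i0:i0 + kernel_size, j0:j0 + kernel_size].reshape(-1, D)
            vals = v[i0:i0 + kernel_size, j0:j0 + kernel_size].reshape(-1, v.shape[-1])

            # Scaled dot-product attention restricted to the neighborhood.
            logits = (keys @ q[i, j]) * scale
            weights = np.exp(logits - logits.max())
            weights /= weights.sum()
            out[i, j] = weights @ vals
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(8, 8, 16))
    y = neighborhood_attention_2d(x, x, x, kernel_size=3)
    print(y.shape)
```

Each of the H × W pixels does work proportional to `kernel_size**2 * D`, which is where the linear scaling in the number of pixels comes from. When `kernel_size` equals both H and W, every window covers the whole map and the sketch reduces to ordinary self-attention.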

Paper Code 🤗 Transformers

Paper Code 🤗 Transformers 🤗 Demo

Creative AI

Empowering the next generation of creative communication

Paper Code 🤗 Diffusers 🤗 Demo

We built Versatile Diffusion (VD), the first unified multi-flow multimodal diffusion framework, as a step towards Universal Generative AI. Versatile Diffusion can natively support image-to-text, image-variation, text-to-image, and text-variation, and can be further extended to other applications such as semantic-style disentanglement, image-text dual-guided generation, latent image-to-text-to-image editing, and more. Future versions will support more modalities such as speech, music, video and 3D.

Paper Code

StyleNAT is a flexible and efficient state-of-the-art image generation framework. It is a style-based GAN that exploits Neighborhood Attention to extend the power of localized attention heads to capture long-range features and maximize information gain within the generative process. The flexibility of the system allows it to be adapted to various environments and datasets. StyleNAT attains a new SOTA FID score of 2.046 on FFHQ-256, beating prior work based on convolutional models such as StyleGAN-XL and transformers such as HIT and StyleSwin. Compared to StyleGAN-XL, this is a 6.4% improvement in FFHQ-256 score, with a 28% reduction in the number of parameters and a 56% improvement in sampling throughput.

Paper Code 🤗 Demo

Responsible AI

Solving important real-world problems at scale

Paper Dataset CVPR Workshops

Agriculture-Vision is the first major agriculture effort in the CVPR community. The CVPR 2020 paper introduced the first large-scale, high-quality aerial image dataset for agricultural pattern analysis, covering over 1 million acres of farmland in the US and curated over a period of 2 years. The resulting algorithms and their improved versions are being used in production to give farmers actionable insights for monitoring crops and improving yield, contributing to addressing the global food security issue. Alongside these efforts, we have hosted 3 international Agriculture-Vision workshops at CVPR since 2020, with prize challenges, academic-industrial panels, and workshop paper programs that attracted wide participation.

Paper @Cement and Concrete Research

The standard petrography test method for measuring air voids in concrete (ASTM C457) requires a long, meticulous examination of sample phase composition under a stereomicroscope. The high level of expertise and the specialized equipment it demands discourage its use for routine concrete quality control. Color-based image segmentation can ease the task, but it requires additional surface color treatment. In this work, we investigated the feasibility of using a convolutional neural network (CNN) to segment concrete without color treatment. The CNN demonstrated strong potential to process a wide range of concretes, including those not involved in model training.

Paper @MICCAI Paper @CVPR

Cardiac motion estimation plays a key role in MRI cardiac feature tracking and function assessment, such as myocardium strain. Motion Pyramid Networks is our novel deep learning-based approach for accurate and efficient cardiac motion estimation. We also propose new evaluation metrics that represent errors in a clinically meaningful manner. Our Fast Online Adaptive Learning (FOAL) framework is an online gradient descent-based optimizer that is itself optimized by a meta-learner. The meta-learner enables the online optimizer to adapt quickly and robustly, preventing dramatic performance drops caused by mismatched distributions between training and testing datasets.