NATTEN is an open-source project dedicated to providing infrastructure for Neighborhood Attention (NA), a sliding window self-attention mechanism, and its extensions (dilated NA, causal NA, strided NA). Specifically, we provide Fused Multi-Headed Attention (FMHA) and Fused Neighborhood Attention (FNA) training and inference kernels for all NVIDIA architectures since Maxwell (SM50), as well as fast inference kernels for the Hopper (SM90) and Blackwell (SM100) architectures.

Neighborhood Attention introduces locality and sparsity into self-attention in a manner similar to convolution. This means that for any self-attention problem, you can specify a kernel_size, stride, and dilation. And because it's attention, you can also toggle causal masking.

NATTEN is built for multi-dimensional token layouts (i.e. 2-D and 3-D feature maps). Users are free to explore the massive parameter space that NATTEN offers, in which the attention span along each dimension/axis of the input is controlled by its own kernel_size, stride, dilation, and is_causal parameters.

2D neighborhood attention with kernel size 6x6.
na2d(
    q, k, v,
    kernel_size=(6, 6),
)
2D dilated neighborhood attention with kernel size 6x6, dilation 2x2.
na2d(
    q, k, v,
    kernel_size=(6, 6),
    dilation=(2, 2),
)
2D causal neighborhood attention with kernel size 6x6.
na2d(
    q, k, v,
    kernel_size=(6, 6),
    is_causal=(True, True),
)
2D strided neighborhood attention with kernel size 6x6, stride 2x2.
na2d(
    q, k, v,
    kernel_size=(6, 6),
    stride=(2, 2),
)
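
As a minimal end-to-end sketch of the calls above (assuming the heads-last tensor layout (batch, H, W, heads, head_dim) and the top-level na2d import; the exact signature and layout may differ in your installed version, so check the API reference):

import torch
from natten import na2d  # import path assumed; may live elsewhere in some versions

# Hypothetical sizes: batch 1, a 32x32 feature map, 4 heads of dimension 64.
q = torch.randn(1, 32, 32, 4, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# 6x6 windows, dilated by 2 along both axes, causal along the second axis only,
# illustrating that every parameter is set per axis.
out = na2d(
    q, k, v,
    kernel_size=(6, 6),
    dilation=(2, 2),
    is_causal=(False, True),
)
print(out.shape)  # same layout as q: (1, 32, 32, 4, 64)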

Getting started

NATTEN supports PyTorch >= 2.7 and Python >= 3.9 (everything PyTorch supports). Please refer to the install instructions for details on how to install NATTEN.
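
Once installed, a quick sanity check (assuming natten exposes __version__, as most Python packages do):

import natten
import torch

print(natten.__version__)         # e.g. 0.20.0
print(torch.cuda.is_available())  # NATTEN's fused kernels run on CUDA devices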

🔥 Release 0.20.0

NATTEN has undergone major changes since the last release (0.17.5), so we strongly recommend reading the updated documentation on this site before upgrading.

The new release adds our Hopper FNA and Blackwell FNA kernels, bringing massive speedups on modern data-center-class NVIDIA GPUs such as the H100 and B200. It also speeds up inference in our existing Ampere FNA kernels by up to 1.47X in fully block-sparse cases, provides much cleaner error reporting, ships with our profiling toolkit, and much more!

License

NATTEN is released under the MIT License.

Citation

If you found NATTEN or neighborhood attention useful in your work, please consider citing the appropriate papers:

Original neighborhood attention paper

First work proposing neighborhood attention and introducing NATTEN.

@inproceedings{hassani2023neighborhood,
  title        = {Neighborhood Attention Transformer},
  author       = {Ali Hassani and Steven Walton and Jiachen Li and Shen Li and Humphrey Shi},
  year         = 2023,
  booktitle    = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}
}

Dilated neighborhood attention

Introduced dilation for capturing sparse global context.

@article{hassani2022dilated,
  title        = {Dilated Neighborhood Attention Transformer},
  author       = {Ali Hassani and Humphrey Shi},
  year         = 2022,
  journal      = {arXiv preprint arXiv:2209.15001}
}

GEMM-based and fused neighborhood attention

Introduced the first multi-dimensional attention kernels: GEMM-based and fused neighborhood attention (FNA).

Also introduced causal neighborhood attention, and extended the implementation to support varying parameters across different dimensions.

@inproceedings{hassani2024faster,
  title        = {Faster Neighborhood Attention: Reducing the O(n^2) Cost of Self Attention at the Threadblock Level},
  author       = {Ali Hassani and Wen-Mei Hwu and Humphrey Shi},
  year         = 2024,
  booktitle    = {Advances in Neural Information Processing Systems}
}

Generalized neighborhood attention: towards speed-of-light performance

Introduced even-sized windows, strided neighborhood attention, block-sparse forms of neighborhood attention, the NATTEN Simulator, and our new Hopper and Blackwell FNA kernels, implemented with out-of-kernel token permutation.

@article{hassani2025generalized,
  title        = {Generalized Neighborhood Attention: Multi-dimensional Sparse Attention at the Speed of Light},
  author       = {Hassani, Ali and Zhou, Fengzhe and Kane, Aditya and Huang, Jiannan and Chen, Chieh-Yun and Shi, Min and Walton, Steven and Hoehnerbach, Markus and Thakkar, Vijay and Isaev, Michael and others},
  year         = 2025,
  journal      = {arXiv preprint arXiv:2504.16922}
}

Acknowledgements

We thank NVIDIA and the CUTLASS project, without which this project would not have been possible.

We also thank Meta and the xFormers team for their FMHA kernel, and the PyTorch project and team.