Stanford artificial intelligence (AI) researchers propose S4ND, a new deep learning layer based on S4 that extends the ability of SSMs to model continuous signals to multidimensional data such as images and videos

Modeling visual data, such as images and videos, is a canonical problem in deep learning. Many deep learning backbones with strong performance on benchmarks like ImageNet have been proposed in recent years. These backbones are diverse and include 1D sequence models like the Vision Transformer (ViT), which treats an image as a sequence of patches, and 2D and 3D models (ConvNets) that apply local convolutions to images and videos. Images and videos, however, are discrete samples of continuous signals over the dimensions of space and time. The researchers would like techniques that recognize the difference between data and signal and directly model the underlying continuous signals. This would allow a model to adapt to data sampled at varying resolutions.

Deep state space models (SSMs), in particular S4, have achieved state-of-the-art (SotA) results in modeling sequence data derived from continuous signals such as audio. Parameterizing and learning continuous convolutional kernels, which can then be sampled differently for data at different resolutions, is a natural way to build such models. However, a significant drawback of SSMs is that they were designed for 1D signals and cannot be applied directly to visual data derived from multidimensional (“ND”) signals. Given that 1D SSMs outperform alternative continuous modeling methods on sequence data and have had preliminary success in image and video classification, the researchers hypothesize that SSMs may be well suited to modeling visual data once they are properly adapted to multidimensional signals.
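The idea of one continuous kernel sampled at several resolutions can be sketched in a few lines of NumPy. This is a minimal illustration, not the S4 implementation: it assumes a diagonal state matrix, samples the continuous kernel K(t) = Re(sum_n C_n exp(t A_n) B_n) directly, and omits the discretization step that S4 itself uses. The function name `ssm_kernel` and all parameter values are hypothetical.

```python
import numpy as np

def ssm_kernel(A, B, C, length, step):
    """Sample K(t) = Re(sum_n C_n * exp(t * A_n) * B_n) at t = 0, step, 2*step, ..."""
    t = np.arange(length) * step        # sample times, shape (length,)
    modes = np.exp(np.outer(t, A))      # one damped oscillation per state, shape (length, N)
    return (modes * (B * C)).sum(axis=1).real

# Illustrative parameters (assumed, not taken from the paper).
rng = np.random.default_rng(0)
N = 8
A = -0.5 + 1j * np.pi * np.arange(N)    # stable diagonal state matrix (negative real part)
B = np.ones(N, dtype=complex)
C = rng.standard_normal(N) + 1j * rng.standard_normal(N)

# The same continuous kernel, sampled over the same time span at two resolutions.
coarse = ssm_kernel(A, B, C, length=16, step=1 / 16)
fine = ssm_kernel(A, B, C, length=32, step=1 / 32)

# Every second fine sample falls on a coarse sample time, so the values agree.
print(np.allclose(fine[::2], coarse))
```

Because the parameters describe a function of continuous time, changing the resolution only changes `step` and `length`; no kernel weights need to be relearned.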

S4ND, a new deep learning layer that extends S4 to multidimensional signals, is their main contribution. The central idea is to turn a standard SSM (a 1D ODE) into a multidimensional PDE governed by an independent SSM for each dimension. They show that, with additional structure, this ND SSM is equivalent to a continuous ND convolution whose kernel factorizes into a separate 1D SSM convolution kernel per dimension. As a result, the model is efficient and simple to implement, with the standard 1D S4 layer serving as a black box. Additionally, because it is parameterized by S4, it can model both long-range dependencies and finite windows with a learnable window size, which generalizes typical local convolutions.
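The factorization described above can be checked numerically. The sketch below assumes a 2D kernel formed as the outer product of two 1D kernels and uses circular FFT convolution for brevity; random vectors stand in for the kernels that a 1D S4 layer would produce, and the helper names are hypothetical. It illustrates the separability property only and is not the authors' implementation.

```python
import numpy as np

def conv1d_fft(x, k, axis):
    """Circular 1D convolution of x with kernel k along one axis, via FFT."""
    n = x.shape[axis]
    shape = [1] * x.ndim
    shape[axis] = n                      # broadcast the kernel spectrum along `axis`
    k_f = np.fft.fft(k, n=n).reshape(shape)
    return np.fft.ifft(np.fft.fft(x, axis=axis) * k_f, axis=axis).real

def conv2d_fft(x, k2d):
    """Circular 2D convolution of x with a full 2D kernel, via FFT."""
    return np.fft.ifft2(np.fft.fft2(x) * np.fft.fft2(k2d, s=x.shape)).real

rng = np.random.default_rng(0)
H, W = 12, 10
image = rng.standard_normal((H, W))
k_h = rng.standard_normal(H)             # stand-in for a 1D kernel along height
k_w = rng.standard_normal(W)             # stand-in for a 1D kernel along width

# Global 2D kernel built as an outer product of the two 1D kernels.
full = conv2d_fft(image, np.outer(k_h, k_w))

# Same result from two 1D convolutions, one per dimension.
separable = conv1d_fft(conv1d_fft(image, k_h, axis=0), k_w, axis=1)

print(np.allclose(full, separable))
```

The separable path never materializes more than one 1D kernel per dimension, which is why a 1D layer can be reused as a black box.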

They show that S4ND can be used as a drop-in replacement in existing high-performance vision architectures, matching or improving performance in 1D, 2D, and 3D. With minor changes to the training procedure, replacing the self-attention in ViT with S4-1D improves top-1 accuracy by 1.5%, while replacing the convolution layers in a 2D ConvNeXt backbone with S4-2D preserves its performance on ImageNet-1k. Simply inflating the pre-trained S4-2D-ConvNeXt backbone temporally to 3D improves HMDB-51 video activity classification results by 4 points over the pre-trained ConvNeXt baseline. Notably, they use S4ND as global kernels that span the entire input shape, giving the model global context (both spatially and temporally) at every network layer.

They also propose a low-pass bandlimiting modification to S4 that encourages smoothness in the learned convolutional kernels. Although S4ND can be applied at any resolution, performance degrades when moving between resolutions because of aliasing artifacts in the kernel, a problem also noted in previous work on continuous models. While S4 can transfer between resolutions on audio data, visual data poses a harder problem because of the scale-invariant properties of images in space and time: sampled images with more distant objects are more likely to contain power at frequencies above the Nyquist cutoff frequency.

To address this aliasing, they propose a simple criterion that masks frequencies in the S4ND kernel above the Nyquist cutoff frequency. S4ND’s continuous-signal modeling capabilities also enable new training recipes, such as training and testing at different resolutions. S4ND degrades by as little as 1.3% when upsampling from low- to high-resolution data (e.g. 128×128 to 160×160) on the standard CIFAR-10 and Celeb-A datasets, and it can be used for progressive resizing, which speeds up training by 22% with about a 1% drop in final accuracy compared with training at high resolution only. They also show that the bandlimiting modification is critical to these capabilities, with ablations showing absolute performance drops of up to 20% or more without it.
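A rough sketch of Nyquist masking follows. It assumes a diagonal SSM kernel written as a sum of damped complex exponentials, where the imaginary part of each A_n sets that mode's frequency, and it zeroes every mode whose frequency exceeds `alpha` times the Nyquist frequency of the sampling grid. The function name, the `alpha` parameter, and the exact cutoff rule are assumptions made for illustration and may differ from the criterion in the paper.

```python
import numpy as np

def bandlimited_kernel(A, B, C, length, step, alpha=1.0):
    """Sample a diagonal SSM kernel, dropping modes above alpha * Nyquist."""
    mode_freqs = np.abs(A.imag) / (2 * np.pi)   # cycles per unit time, one per mode
    nyquist = 0.5 / step                        # highest frequency the grid can represent
    keep = mode_freqs <= alpha * nyquist        # boolean mask over modes
    t = np.arange(length) * step
    modes = np.exp(np.outer(t, A))              # shape (length, N)
    kernel = (modes * (B * C * keep)).sum(axis=1).real
    return kernel, keep

# Illustrative parameters (assumed): mode frequencies 0.5, 1.5, ..., 15.5.
N = 16
A = -0.5 + 2j * np.pi * (np.arange(N) + 0.5)
B = np.ones(N, dtype=complex)
C = np.ones(N, dtype=complex)

# Coarse grid: Nyquist is 8, so only the 8 lowest-frequency modes survive.
coarse_kernel, coarse_keep = bandlimited_kernel(A, B, C, length=16, step=1 / 16)
# Fine grid: Nyquist is 16, so all 16 modes survive.
fine_kernel, fine_keep = bandlimited_kernel(A, B, C, length=32, step=1 / 32)

print(int(coarse_keep.sum()), int(fine_keep.sum()))  # prints "8 16"
```

Modes that the coarse grid cannot represent are removed instead of being folded back into lower frequencies, which is the aliasing effect the masking is meant to avoid.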



Aneesh Tickoo is an intern consultant at MarktechPost. He is currently pursuing his undergraduate studies in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing, and he is passionate about building solutions around it. He enjoys connecting with people and collaborating on interesting projects.

James G. Williams