AI researchers at Rutgers University propose a slot-based auto-encoder architecture called SLot Attention TransformEr (SLATE)


DALL·E showed an impressive ability for composition-based systematic generalization in image generation, but it requires a dataset of text-image pairs and takes its compositional cues from the text. The Slot Attention model, by contrast, can learn composable representations without text prompts, but it lacks the generative capacity that DALL·E has for zero-shot generation.

Researchers at Rutgers University propose a slot-based auto-encoder architecture called SLot Attention TransformEr (SLATE). The SLATE model combines the best of DALL·E and object-centric representation learning. Unlike previous models, it significantly improves composition-based systematic generalization in image generation.

The research team showed that they could create an illiterate DALL·E (I-DALLE) model, which learns object representations from images alone instead of taking text prompts. To do so, the team first analyzed the pixel-mixture decoders currently used to train object-centric representations and found that they suffer from two limitations: the slot-decoding dilemma and pixel independence. Drawing on DALL·E, the team then hypothesized that overcoming these limitations requires not only composable slots but also an expressive decoder.
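To make the "independent pixels" limitation concrete, here is a minimal NumPy sketch of how a pixel-mixture (pixel mixing) decoder typically combines per-slot outputs. It is an illustration under assumptions, not the authors' code: the function name `mixture_decode`, the array shapes, and the random inputs are all hypothetical, and a real decoder would produce `rgb` and `mask_logits` with a neural network.

```python
import numpy as np

def mixture_decode(rgb, mask_logits):
    """Combine per-slot reconstructions into one image (illustrative only).

    rgb:         (num_slots, H, W, 3) per-slot RGB predictions
    mask_logits: (num_slots, H, W)    per-slot unnormalized mask scores
    """
    # Softmax over the slot axis so the masks sum to 1 at every pixel.
    shifted = mask_logits - mask_logits.max(axis=0, keepdims=True)
    masks = np.exp(shifted)
    masks /= masks.sum(axis=0, keepdims=True)
    # Each output pixel is a weighted sum over slots, computed
    # independently of every other pixel.
    image = (masks[..., None] * rgb).sum(axis=0)
    return image, masks

# Hypothetical inputs standing in for a decoder network's outputs.
rng = np.random.default_rng(0)
num_slots, H, W = 4, 8, 8
rgb = rng.random((num_slots, H, W, 3))
mask_logits = rng.normal(size=(num_slots, H, W))
image, masks = mixture_decode(rgb, mask_logits)
```

Because the final image is a per-pixel weighted sum, nothing in this combination step models dependencies between pixels, which is the kind of restriction an autoregressive decoder avoids.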

The SLATE architecture follows from this analysis. SLATE is a simple yet innovative slot-based auto-encoder architecture that uses a slot-conditioned Image GPT decoder. The research group also devised a method for building a library of visual concepts from the learned slots. These concepts play a role similar to text prompts in DALL·E: they allow the researchers to program image "sentences" made up of slots.
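For readers unfamiliar with slot-based encoders, the following is a minimal NumPy sketch of the core idea behind one slot-attention update: slots compete for input features through a softmax taken over the slot axis, and each slot is then updated to the weighted mean of the features it won. This is a deliberate simplification and an assumption-laden illustration, not the SLATE implementation: it omits the learned query/key/value projections, layer normalization, and the recurrent update used in practice, and the name `slot_attention_step` is hypothetical.

```python
import numpy as np

def slot_attention_step(slots, inputs, eps=1e-8):
    """One simplified slot-attention update (no learned parameters).

    slots:  (num_slots, dim)  current slot vectors
    inputs: (num_inputs, dim) encoded image features
    """
    dim = slots.shape[1]
    # Dot-product similarity between every input and every slot.
    logits = inputs @ slots.T / np.sqrt(dim)   # (num_inputs, num_slots)
    logits -= logits.max(axis=1, keepdims=True)
    attn = np.exp(logits)
    # Softmax over slots: slots compete to explain each input feature.
    attn /= attn.sum(axis=1, keepdims=True)
    # Normalize over inputs so each slot takes a weighted mean.
    weights = attn / (attn.sum(axis=0, keepdims=True) + eps)
    new_slots = weights.T @ inputs             # (num_slots, dim)
    return new_slots, attn

# Hypothetical sizes and random data for illustration.
rng = np.random.default_rng(0)
num_slots, num_inputs, dim = 3, 16, 8
slots = rng.normal(size=(num_slots, dim))
inputs = rng.normal(size=(num_inputs, dim))
for _ in range(3):  # a few refinement iterations
    slots, attn = slot_attention_step(slots, inputs)
```

In a slot-based auto-encoder, the resulting slot vectors are what the decoder is conditioned on when reconstructing the image.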


The main contributions of the research work include:

  • The first DALL·E model that works without text.
  • The first object-centric representation learning model based on a transformer.
  • A demonstration that the proposed model significantly improves the systematic generalization of object-centric representation models.
  • Better performance than previous approaches with a much simpler model.




James G. Williams