In a recent paper titled The Evolution of Multimodal Model Architectures [5], the authors present a taxonomy of multimodal vision-language models, categorized by the fusion stage (early or deep) and the fusion method: standard cross-attention layers, custom cross-attention layers, specialized tokenizers, or modality-specific encoders. Below, I provide a brief overview of the taxonomy groups developed by the authors [5], along with examples of models in each category.
Deep Fusion
Type-A (multimodal inputs are directed to the internal LLM layers using cross-attention)
In this architecture, the multimodal inputs (image/video/audio) are passed through a multimodal encoder, resampled to a fixed length, and then fed into the internal layers of the LLM via cross-attention.
The fusion (via cross-attention) can be done either before or after the self-attention layer in the LLM, which gives two possible sub-architectures.
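To make this concrete, here is a minimal PyTorch sketch (my own illustration, not code from any of the papers) of a Type-A decoder block: a flag chooses whether the cross-attention fusion happens before or after self-attention. The dimensions, pre-norm layout, and omitted causal mask are simplifications.

```python
import torch
import torch.nn as nn

class TypeADecoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, fuse_before_self_attn=True):
        super().__init__()
        self.fuse_before_self_attn = fuse_before_self_attn
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_cross = nn.LayerNorm(d_model)
        self.norm_self = nn.LayerNorm(d_model)
        self.norm_mlp = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, text_h, mm_feats):
        # mm_feats: resampled multimodal features of fixed length (B, K, d_model)
        def fuse(x):  # cross-attention fusion: text queries attend to multimodal keys/values
            return x + self.cross_attn(self.norm_cross(x), mm_feats, mm_feats)[0]

        if self.fuse_before_self_attn:
            x = fuse(text_h)
            y = self.norm_self(x)
            x = x + self.self_attn(y, y, y)[0]  # causal mask omitted for brevity
        else:
            y = self.norm_self(text_h)
            x = text_h + self.self_attn(y, y, y)[0]
            x = fuse(x)
        return x + self.mlp(self.norm_mlp(x))

# Usage: fuse 64 resampled visual tokens into a 16-token text sequence.
block = TypeADecoderBlock(fuse_before_self_attn=True)
out = block(torch.randn(2, 16, 512), torch.randn(2, 64, 512))
print(out.shape)  # torch.Size([2, 16, 512])
```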
For one such model, see Dolphins below.
Type-B (multimodal inputs are directed to the internal LLM layers using custom cross-attention layers)
The authors group models that pass the multimodal inputs to the internal LLM layers via custom cross-attention layers into Type-B architectures. They observe that, in these models, deep fusion typically occurs via add/concatenation operations after the self-attention layers in the LLM.
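Below is a minimal sketch of the Type-B pattern as I read it: a custom, gated cross-attention layer whose output is added to the hidden states right after self-attention. The zero-initialized gate is my own choice (so fusion starts as a no-op), not a detail taken from the survey or from CogVLM.

```python
import torch
import torch.nn as nn

class TypeBFusionLayer(nn.Module):
    """Custom cross-attention whose output is added to the hidden states
    after a self-attention layer (deep fusion via an add operation)."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        # Zero-initialized learnable gate: fusion starts as a no-op, so the
        # pre-trained LLM's behaviour is preserved early in training
        # (an illustrative choice, not taken from any specific model).
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, hidden_after_self_attn, mm_feats):
        fused, _ = self.cross_attn(self.norm(hidden_after_self_attn), mm_feats, mm_feats)
        return hidden_after_self_attn + torch.tanh(self.gate) * fused

hidden = torch.randn(2, 16, 512)    # hidden states right after a self-attention layer
mm_feats = torch.randn(2, 64, 512)  # encoded image/video/audio features
print(TypeBFusionLayer()(hidden, mm_feats).shape)  # torch.Size([2, 16, 512])
```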
For one such model, see CogVLM below.
Early Fusion
Type-C (multimodal inputs are optionally embedded or passed as-is to the LLM)
- Use pre-trained LLM as decoder
- Input: encoder output + text
- Encoder can also be pre-trained
- Incorporate off-the-shelf LLMs and encoders
- Training & data:
  - Pre-training + alignment tuning: train projection layers (an MLP, etc.) for vision-text alignment (see the projection sketch after this list)
  - Instruction + alignment tuning: train projection layer + LLM
- For one such model, see Qwen-VL below.
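As mentioned in the training bullet above, the Type-C recipe mostly trains a small projection module. Here is a minimal sketch, with illustrative dimensions and names of my own, of an MLP projector that maps frozen vision-encoder features into the LLM embedding space and concatenates them with the text embeddings at the input layer.

```python
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    """Trainable MLP that maps vision-encoder features into the LLM embedding space."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, vision_feats):
        # vision_feats: (B, num_patches, vision_dim) from a (typically frozen) encoder
        return self.mlp(vision_feats)

projector = VisionToLLMProjector()
vision_feats = torch.randn(1, 256, 1024)   # output of an off-the-shelf vision encoder
text_embeds = torch.randn(1, 32, 4096)     # LLM embeddings of the text tokens
# Early fusion at the input layer: projected visual tokens + text tokens.
llm_inputs = torch.cat([projector(vision_feats), text_embeds], dim=1)
print(llm_inputs.shape)  # torch.Size([1, 288, 4096]) -> fed to the LLM as one sequence
```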
Type-D (multimodal inputs are tokenized before being passed to the LLM)
- One tokenizer for all modalities (see the sketch after this list)
- Disadvantages:
  - Adding a new modality requires re-training the tokenizer (to learn how to tokenize the new modality).
  - Training was observed to take longer than with the other methods, because the LLM (or encoder-decoder model) only sees the modality at the input stage and receives no guidance at intermediate, deeper layers.
- For one such model, see CM3Leon below.
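A tiny sketch of the single-tokenizer idea. The image "tokenizer" here is a stand-in for a learned quantizer, and the vocabulary sizes are illustrative rather than taken from any specific model.

```python
import torch

TEXT_VOCAB_SIZE = 57_344              # illustrative
IMAGE_VOCAB_SIZE = 8_192              # illustrative number of discrete image codes
IMAGE_TOKEN_OFFSET = TEXT_VOCAB_SIZE  # image codes live after the text ids

def tokenize_image_stub(image: torch.Tensor, tokens_per_image: int = 1024) -> torch.Tensor:
    # Stand-in for a learned quantizer (e.g., a VQ autoencoder) that would map
    # the image to discrete codebook indices.
    codes = torch.randint(0, IMAGE_VOCAB_SIZE, (tokens_per_image,))
    return codes + IMAGE_TOKEN_OFFSET

def build_sequence(text_ids: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
    # One flat token sequence over the shared vocabulary. Adding a new modality
    # means extending this vocabulary and re-training the tokenizer/codebook.
    return torch.cat([tokenize_image_stub(image), text_ids])

text_ids = torch.randint(0, TEXT_VOCAB_SIZE, (32,))
seq = build_sequence(text_ids, torch.rand(3, 512, 512))
print(seq.shape)  # torch.Size([1056]) -- 1024 image tokens + 32 text tokens
```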
Qwen-VL (2023)
Paper: https://arxiv.org/pdf/2308.12966
Model architecture
- Visual Encoder (e.g., a ViT)
- Position-aware Vision-Language Adapter (see the code sketch after this list)
  - A cross-attention layer with:
    - Inputs:
      - visual embedding sequence from the Visual Encoder, as keys
      - trainable vector embeddings, as queries
    - Outputs:
      - a compressed, fixed-length visual embedding sequence (e.g., 256 tokens)
- Large Language Model (e.g., Qwen-7B)
  - Inputs:
    - compressed visual embedding sequence (surrounded by two special img tokens to distinguish it from the text input) + text input sequence
  - Outputs:
    - predicted next text token
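Putting the adapter description into code, here is a minimal sketch (a paraphrase of the paper's description, not the released Qwen-VL code) of a cross-attention layer in which 256 trainable query embeddings compress a variable-length visual sequence to a fixed length; the 2D positional information added to the keys is omitted, and the dimensions are illustrative.

```python
import torch
import torch.nn as nn

class PositionAwareAdapter(nn.Module):
    """Single cross-attention layer: trainable queries compress visual features
    to a fixed length (2D positional information on the keys is omitted here)."""
    def __init__(self, d_model=512, n_heads=8, num_queries=256):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, d_model) * 0.02)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, visual_feats):
        # visual_feats: (B, N, d_model) embeddings from the visual encoder (keys/values)
        q = self.queries.unsqueeze(0).repeat(visual_feats.size(0), 1, 1)
        compressed, _ = self.cross_attn(q, visual_feats, visual_feats)
        return compressed  # (B, 256, d_model), fixed length regardless of N

adapter = PositionAwareAdapter()
print(adapter(torch.randn(2, 1024, 512)).shape)  # torch.Size([2, 256, 512])
```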
Training
Training is done in three phases:
- Pre-training on low-resolution image and text pairs:
- LLM is frozen. Only adapter and visual encoder are trained to minimize cross-entropy on LLM output text.
- Multi-task pre-training on high-resolution image and text pairs, and interleaved image-text data:
- LLM, adapter, encoder are all trained.
- Fine-tuning on interleaved image-text data:
- Encoder is frozen, only LLM and adapter are trained.
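A short sketch of this three-phase freezing schedule, assuming the model exposes visual_encoder, adapter, and llm sub-modules (the attribute names are mine, not from the Qwen-VL codebase).

```python
def set_trainable(module, trainable: bool):
    for p in module.parameters():
        p.requires_grad = trainable

def configure_phase(model, phase: int):
    # `model.visual_encoder`, `model.adapter`, `model.llm` are assumed attribute
    # names for this sketch, not names from the Qwen-VL codebase.
    if phase == 1:    # pre-training: LLM frozen, encoder + adapter trained
        set_trainable(model.llm, False)
        set_trainable(model.visual_encoder, True)
        set_trainable(model.adapter, True)
    elif phase == 2:  # multi-task pre-training: everything trained
        for m in (model.llm, model.visual_encoder, model.adapter):
            set_trainable(m, True)
    else:             # fine-tuning: encoder frozen, LLM + adapter trained
        set_trainable(model.visual_encoder, False)
        set_trainable(model.llm, True)
        set_trainable(model.adapter, True)
```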
CM3Leon (2024)
Paper: https://arxiv.org/pdf/2405.09818
CM3Leon (pronounced “chameleon”) is a multimodal early-fusion model pre-trained on a large dataset of pure text, text-image pairs, and interleaved text-image documents. Pre-training is done in two stages. The first stage accounts for most of the training: the model is trained on ~2.9T text-only tokens, ~1.5T text-image tokens, and ~400B interleaved text-image tokens. The second stage uses a similar mix of data, but is much smaller and of higher quality.
Tokenization: At the core of CM3Leon’s architecture is a tokenization scheme that quantizes both images and text into discrete tokens, so that the same transformer-based module can be applied to the combined token sequence. Images (of size 512x512) are quantized into 1024 tokens drawn from a codebook of 8192 image tokens. Text is tokenized with a BPE tokenizer whose vocabulary of ~65k tokens includes those 8192 image tokens, so image and text tokens share a single vocabulary.
Architecture: The authors propose architectural changes to stabilize training. In particular, the attention blocks of the 7B-parameter model use QK-Norm (along with dropout after the attention and MLP blocks), while the larger 34B-parameter model uses the normalization strategy of the Swin Transformer in its attention blocks. In addition, to stabilize the final softmax over the logits, the authors use z-loss regularization (see Sec. 3.1.2 of arXiv:2309.14322 [6]).
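Minimal sketches of these two stabilization tricks, written from their general descriptions rather than from the released code: QK-Norm normalizes queries and keys per head before the attention dot product, and the z-loss penalizes the softmax's log-normalizer. Dimensions and the loss coefficient are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    """Self-attention with QK-Norm: queries and keys are layer-normalized per head
    before the dot product, which bounds the attention logits."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        self.q_norm = nn.LayerNorm(self.d_head)
        self.k_norm = nn.LayerNorm(self.d_head)

    def forward(self, x):
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        shape = (B, T, self.n_heads, self.d_head)
        q = self.q_norm(q.view(shape)).transpose(1, 2)   # QK-Norm on queries
        k = self.k_norm(k.view(shape)).transpose(1, 2)   # QK-Norm on keys
        v = v.view(shape).transpose(1, 2)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(attn.transpose(1, 2).reshape(B, T, -1))

def z_loss(logits, coeff=1e-4):
    # Penalize log Z = logsumexp(logits) so the softmax normalizer stays bounded
    # (coefficient is illustrative).
    return coeff * torch.logsumexp(logits, dim=-1).pow(2).mean()

hidden = QKNormAttention()(torch.randn(2, 16, 512))   # (2, 16, 512)
logits = torch.randn(2, 16, 65_536)                   # final LM logits
labels = torch.randint(0, 65_536, (2, 16))
loss = F.cross_entropy(logits.flatten(0, 1), labels.flatten()) + z_loss(logits)
```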
CogVLM (2023)
Paper: https://arxiv.org/pdf/2311.03079
Dolphins (2023)
Paper: https://arxiv.org/pdf/2312.00438
References & Footnotes
1. Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., … & Zhou, J. (2023). Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966.
2. Chameleon Team (2024). Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818.
3. Wang, W., Lv, Q., Yu, W., Hong, W., Qi, J., Wang, Y., … & Tang, J. (2023). CogVLM: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079.
4. Ma, Y., Cao, Y., Sun, J., Pavone, M., & Xiao, C. (2023). Dolphins: Multimodal language model for driving. arXiv preprint arXiv:2312.00438.
5. Wadekar, S. N., Chaurasia, A., Chadha, A., & Culurciello, E. (2024). The evolution of multimodal model architectures. arXiv preprint arXiv:2405.17927.
6. Wortsman, M., Liu, P. J., Xiao, L., Everett, K., Alemi, A., Adlam, B., … & Kornblith, S. (2023). Small-scale proxies for large-scale transformer training instabilities. arXiv preprint arXiv:2309.14322.