What is a Multimodal LLM?

Introduction

Multimodality is the ability of a model to process several different types of inputs. For example, ChatGPT can accept text and images together as one combined input. A model could also take in .pdf files, .md files, and plain text as inputs (yes, these count as different modalities).

This lets users present an idea in its rawest form, such as an image, instead of describing it. A description can run into hurdles: it may be hard to explain the idea properly, and hard to verify that the model interpreted the description correctly.

How do Multimodal LLMs work? 

Typically, these modalities are "fused" together. Here's an architecture diagram:

[Architecture diagram of a multimodal LLM — image from NVIDIA]

In this case, we have an encoder for each specific modality (e.g. a video encoder, image encoder, audio encoder). Each encoder turns its input into embeddings the LLM can understand and use. After encoding each modality, we fuse the embeddings together. The LLM's output is then projected and passed to whichever modality generator is needed to produce the desired output (text, image, etc.). These generators could be a diffusion/image-generation model, a text decoder, and so on.
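The encode → fuse → project flow above can be sketched end to end. This is a minimal toy, not any specific architecture: the "encoders" are stand-in random linear maps, and the dimensions (2048 for image features, 512 for audio, 64 for the shared embedding, 128 for the LLM input) are made-up examples.

```python
import numpy as np

rng = np.random.default_rng(0)

EMBED_DIM = 64  # assumed shared embedding size (toy value)

def make_encoder(input_dim, embed_dim=EMBED_DIM):
    # Stand-in for a pretrained encoder: a fixed random linear map.
    # In practice this would be a trained network (image/audio/video encoder).
    W = rng.normal(size=(input_dim, embed_dim)) / np.sqrt(input_dim)
    return lambda x: x @ W

image_encoder = make_encoder(2048)  # e.g. pooled CNN features (assumed dim)
audio_encoder = make_encoder(512)   # e.g. pooled audio features (assumed dim)

# Encode each modality into the shared embedding space
img_emb = image_encoder(rng.normal(size=(1, 2048)))  # shape (1, 64)
aud_emb = audio_encoder(rng.normal(size=(1, 512)))   # shape (1, 64)

# "Fuse": here simply stacked along the sequence axis, one token per modality
fused = np.concatenate([img_emb, aud_emb], axis=0)   # shape (2, 64)

# Project into the (assumed) LLM input dimension before handing off to
# whichever modality generator produces the final output
project = make_encoder(EMBED_DIM, 128)
llm_input = project(fused)                           # shape (2, 128)
```

Real systems replace each random map with a trained network and a learned projection, but the data flow is the same.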

What is "Fusing"?

Fusing describes how we combine the modalities. For example, you can map the features from the different modalities into one shared latent space that the LLM understands: take pretrained encoders (e.g. ResNet-50 for images) and map their outputs through a linear layer, attention layer, etc. into one embedding space. This space not only encapsulates the data from each encoder; it is also trained so that vectors are embedded in a way the LLM can understand (Huang et al. 2024).
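As a concrete sketch of the linear-projection version of fusing: ResNet-50's pooled features are 2048-dimensional, and here we assume a hypothetical 768-dimensional LLM embedding space (this varies by model). The projection weights are random here; in a real system they would be learned so the LLM can interpret the projected vectors.

```python
import numpy as np

rng = np.random.default_rng(42)

IMG_DIM = 2048  # ResNet-50 pooled feature size
LLM_DIM = 768   # assumed LLM token-embedding size (model-dependent)

# Learned linear projection in practice; random placeholder here
W_img = rng.normal(size=(IMG_DIM, LLM_DIM)) / np.sqrt(IMG_DIM)

image_features = rng.normal(size=(IMG_DIM,))  # stand-in for frozen ResNet-50 output
text_embedding = rng.normal(size=(LLM_DIM,))  # stand-in for an LLM token embedding

# Map the image features into the LLM's latent space...
img_token = image_features @ W_img            # shape (768,)

# ...then fuse, e.g. by stacking both modalities into one token sequence
sequence = np.stack([img_token, text_embedding])  # shape (2, 768)
```

Swapping the linear map for an attention-based adapter changes how the mapping is computed, but the goal is the same: every modality ends up as tokens in the one space the LLM was trained on.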

Use Cases

1. Image-Text Questions: Some ideas can only be shown through images. Subjective judgments in particular (does this person look mad? confused?) are hard to express in words alone. In this scenario, we can use a large multimodal model.

2. Radiology Imaging: Radiologists can input an X-ray, an MRI, a text query, and more. The model can then be prompted with specific questions and perform specific tasks over the reports, saving radiologists time (Haq et al. 2025).
