What is a Multimodal LLM?
Introduction
Multimodality is the ability of a model to process several different types of input. For example, ChatGPT can accept text and images together as a single combined input. A model could also take .pdf files, .md files, and plain text as inputs (yes, these count as different modalities).
This lets users present an idea in its rawest possible form, such as an image, instead of describing it. A description can run into hurdles: you may struggle to explain the idea properly, and you still have to make sure the model interpreted the description correctly.
How do Multimodal LLMs work?
Typically, these modalities are "fused" together. Here's an architecture diagram:

[Architecture diagram: image from NVIDIA]
What is "Fusing"?
Fusing is how we combine the modalities. One common approach maps the features from each modality into a single shared latent space that the LLM understands. For instance, we could take pre-trained encoders (e.g. ResNet-50 for images) and map their outputs through a linear layer, attention layer, etc. into one embedding space. This space not only encapsulates the information from all the encoders, but is also trained to embed vectors in a way the LLM can understand (Huang et al. 2024).
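As a minimal sketch of this idea: below, a hypothetical vision encoder's features are mapped through a linear projection into the LLM's embedding dimension, then concatenated with the text token embeddings into one sequence. The dimensions, the random weights, and the `fuse` function are all illustrative assumptions (in practice the projection would be learned during training, not random).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: a vision encoder (e.g. ResNet-50) emits 2048-d
# features; the LLM's token embedding space is 768-d.
VISION_DIM, TEXT_DIM, LLM_DIM = 2048, 768, 768

# Projection matrices into the shared space (random here; learned in practice).
W_vision = rng.standard_normal((VISION_DIM, LLM_DIM)) * 0.02
W_text = rng.standard_normal((TEXT_DIM, LLM_DIM)) * 0.02

def fuse(image_feats, text_embeds):
    """Project image features into the LLM's embedding space and
    prepend them to the text token embeddings (early fusion)."""
    image_tokens = image_feats @ W_vision   # (n_img, LLM_DIM)
    text_tokens = text_embeds @ W_text      # (n_txt, LLM_DIM)
    return np.concatenate([image_tokens, text_tokens], axis=0)

# One "image" summarized as 4 patch features, plus 10 text tokens.
image_feats = rng.standard_normal((4, VISION_DIM))
text_embeds = rng.standard_normal((10, TEXT_DIM))
fused = fuse(image_feats, text_embeds)
print(fused.shape)  # (14, 768): one sequence the LLM can attend over
```

Once fused, the LLM sees a single token sequence, so its attention layers can relate image tokens and text tokens freely.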
Use Cases
1. Image-Text Questions: Some ideas can only be shown through images, and some are hard to pin down in words at all (does this person look mad? confused?). In these scenarios, a large multimodal model can take the image directly alongside the question.
2. Radiology Imaging: Radiologists can input an X-ray, an MRI, a text query, and more. The model can then be prompted with specific questions and tasked with reviewing reports, saving radiologists time (Haq et al. 2025).