What is a Multimodal LLM?
Introduction

Multimodality is the ability of a model to process several different types of input. For example, ChatGPT can accept text input as well as image input in one combined prompt. You could also have a model take in .pdf, .md, and plain-text files as inputs (yes, these count as different modalities). This lets users present an idea in its rawest possible form, such as an image, instead of describing it. A description can hit hurdles: you may struggle to explain the idea properly, and you have to make sure the model interpreted the description correctly.

How do Multimodal LLMs work?

Typically, these modalities are "fused" together. Here's an architecture diagram:

[Architecture diagram — image from NVIDIA]

In this case, we have an encoder for each specific modality (e.g. a video encoder, an image encoder, an audio encoder). Each encoder turns its input into embeddings the LLM can understand and use. After encoding each modality, we fuse the embeddings together. Then, the output is projected and ...
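The encode-then-fuse flow can be sketched in a few lines. This is a toy illustration, not a real model: the random embedding tables, the patch size, the shared width `D_MODEL`, and simple concatenation as the fusion step are all assumptions made for demonstration; production systems use trained encoders (e.g. a ViT for images) and more sophisticated fusion.

```python
import numpy as np

rng = np.random.default_rng(0)

D_MODEL = 64  # shared embedding width the LLM expects (hypothetical value)

def encode_text(tokens, vocab_size=1000):
    """Toy text encoder: embedding-table lookup (randomly initialized)."""
    table = rng.normal(size=(vocab_size, D_MODEL))
    return table[np.array(tokens)]            # shape: (seq_len, D_MODEL)

def encode_image(pixels, patch=8):
    """Toy image encoder: split into patches, linearly project each patch."""
    h, w = pixels.shape
    patches = pixels.reshape(h // patch, patch, w // patch, patch)
    patches = patches.transpose(0, 2, 1, 3).reshape(-1, patch * patch)
    proj = rng.normal(size=(patch * patch, D_MODEL))
    return patches @ proj                     # shape: (num_patches, D_MODEL)

def fuse(*modalities):
    """Early fusion: concatenate modality embeddings along the sequence axis."""
    return np.concatenate(modalities, axis=0)

# 3 text tokens + a 32x32 image split into 16 patches of 8x8
text_emb = encode_text([5, 42, 7])
image_emb = encode_image(rng.normal(size=(32, 32)))
fused = fuse(image_emb, text_emb)
print(fused.shape)  # (19, 64): 16 image patches + 3 text tokens, one shared width
```

Once every modality lives in the same `D_MODEL`-wide space, the LLM can attend over the fused sequence exactly as it would over plain text tokens.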