Thoughts on Multimodality and an Image-Text-to-Text Direction for EXAONE

#6
by lesj0610

I would like to share some thoughts and questions regarding the current direction of EXAONE, particularly around multimodality and vision encoder integration.

Do you view the integration of multimodal components, specifically a vision encoder, as a drawback from your perspective? From my viewpoint, image and audio inputs are not optional extensions but a clear part of the future baseline for large models. Even if development starts now, it may already be challenging to catch up with the leading models in this space. While full "any-to-any" capability may be ambitious, at minimum an Image-Text-to-Text direction seems necessary.

At present, I am building a custom Image-Text-to-Text model based on EXAONE-4.0.1-32B. This involves integrating siglip-so400m-patch14-384 as the vision encoder, serving the result through vLLM and exllamav3 + TabbyAPI, and LoRA training to improve the VLM's recognition performance; a rough sketch of this wiring follows below. The work is being carried out on two RTX 3090 GPUs, which is feasible but highly constrained. With even a single H100, or realistically just one A100, the same work could be done far more efficiently and at much higher density.
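For readers who want a concrete picture of this kind of setup, here is a minimal sketch, assuming a SigLIP vision tower from `transformers`, a simple MLP projector, and LoRA adapters applied to the language model via `peft`. The EXAONE repository id, the projector shape, the LoRA hyperparameters, and the target module names are illustrative assumptions, not the exact configuration used in my experiments.

```python
# Minimal wiring sketch: SigLIP vision tower + MLP projector + EXAONE backbone
# with LoRA adapters. Model ids, projector shape, LoRA ranks, and target module
# names are illustrative assumptions, not the exact configuration from the post.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, SiglipVisionModel
from peft import LoraConfig, get_peft_model

VISION_ID = "google/siglip-so400m-patch14-384"
LLM_ID = "LGAI-EXAONE/EXAONE-4.0.1-32B"  # assumed repository id

vision = SiglipVisionModel.from_pretrained(VISION_ID, torch_dtype=torch.bfloat16)
llm = AutoModelForCausalLM.from_pretrained(
    LLM_ID, torch_dtype=torch.bfloat16, trust_remote_code=True
)

# Two-layer MLP that maps SigLIP patch features (hidden size 1152 for so400m)
# into the language model's embedding space.
projector = nn.Sequential(
    nn.Linear(vision.config.hidden_size, llm.config.hidden_size),
    nn.GELU(),
    nn.Linear(llm.config.hidden_size, llm.config.hidden_size),
).to(torch.bfloat16)

# LoRA on the attention projections of the language model; the module names
# are a guess and should be checked against the actual EXAONE architecture.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
llm = get_peft_model(llm, lora_cfg)
```

A common recipe with this kind of architecture is to keep the vision tower and the base language model frozen while training only the projector in a first alignment stage, and to enable the LoRA adapters only in a second instruction-tuning stage.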

From a technical standpoint, the core components are already in place. The projector alignment phase has been completed, basic image recognition is functional, and the EXAONE-4.0.1-32B–based custom VLM has been successfully quantized and served using vLLM and EXL3. Text inference and image-embedding input pipelines have also been tested and verified.
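As a rough illustration of the image-embedding input path mentioned above, the sketch below encodes an image with SigLIP, projects the patch features into the LLM embedding space, and prepends them to the text embeddings via `inputs_embeds`. It reuses the `vision`, `projector`, and `llm` objects from the previous sketch; the prompt, file name, and token layout are placeholders, not EXAONE's actual chat template.

```python
# Sketch of the image-embedding input path: encode an image with SigLIP,
# project the patch features, and prepend them to the text embeddings.
# Reuses `vision`, `projector`, `llm`, VISION_ID, and LLM_ID from the sketch
# above; the prompt, file name, and token layout are placeholders only.
import torch
from PIL import Image
from transformers import AutoTokenizer, SiglipImageProcessor

tokenizer = AutoTokenizer.from_pretrained(LLM_ID, trust_remote_code=True)
processor = SiglipImageProcessor.from_pretrained(VISION_ID)

image = Image.open("example.jpg").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values.to(torch.bfloat16)

with torch.no_grad():
    # (1, num_patches, 1152): patch features from the frozen vision tower.
    patch_feats = vision(pixel_values).last_hidden_state
    # (1, num_patches, llm_hidden): projected into the LLM embedding space.
    image_embeds = projector(patch_feats)

    # Embed the text prompt and splice the image tokens in front of it.
    text_ids = tokenizer("Describe this image.", return_tensors="pt").input_ids
    text_embeds = llm.get_input_embeddings()(text_ids)
    inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)
    attention_mask = torch.ones(inputs_embeds.shape[:2], dtype=torch.long)

    out = llm.generate(
        inputs_embeds=inputs_embeds,
        attention_mask=attention_mask,
        max_new_tokens=64,
    )

print(tokenizer.decode(out[0], skip_special_tokens=True))
```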

These results indicate that a practical and usable VLM is achievable even at the individual research level, without access to extreme infrastructure. For this reason, I sincerely hope that an official Image-Text-to-Text VLM release based on EXAONE, designed to be realistically usable by individual researchers and advanced users, can be considered.

Thank you for your time and for your continued work on EXAONE.

LG AI Research org

We very much agree with your thoughts, and we will share your feedback on VLMs (Vision-Language Models) with our development team as we work toward supporting multimodality.

Thank you for your suggestion.
