arXiv:2601.11522

UniX: Unifying Autoregression and Diffusion for Chest X-Ray Understanding and Generation

Published on Jan 16 · Submitted by ruihengzhang on Jan 21

Abstract

AI-generated summary: UniX is a unified medical foundation model that decouples visual understanding and generation into a distinct autoregressive branch and a diffusion branch, with cross-modal attention guiding generation using understanding features.

Despite recent progress, medical foundation models still struggle to unify visual understanding and generation, as these tasks have inherently conflicting goals: semantic abstraction versus pixel-level reconstruction. Existing approaches, typically based on parameter-shared autoregressive architectures, frequently lead to compromised performance in one or both tasks. To address this, we present UniX, a next-generation unified medical foundation model for chest X-ray understanding and generation. UniX decouples the two tasks into an autoregressive branch for understanding and a diffusion branch for high-fidelity generation. Crucially, a cross-modal self-attention mechanism is introduced to dynamically guide the generation process with understanding features. Coupled with a rigorous data cleaning pipeline and a multi-stage training strategy, this architecture enables synergistic collaboration between tasks while leveraging the strengths of diffusion models for superior generation. On two representative benchmarks, UniX achieves a 46.1% improvement in understanding performance (Micro-F1) and a 24.2% gain in generation quality (FD-RadDino), using only a quarter of the parameters of LLM-CXR. By achieving performance on par with task-specific models, our work establishes a scalable paradigm for synergistic medical image understanding and generation. Code and models are available at https://github.com/ZrH42/UniX.
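
To make the cross-modal guidance concrete, here is a minimal PyTorch sketch of one way diffusion-branch tokens could jointly self-attend over understanding features, as the abstract describes. The class and tensor names (`CrossModalSelfAttention`, `gen_tokens`, `und_feats`) and all shapes are illustrative assumptions, not the authors' implementation; see the official repository for the real architecture.

```python
# Illustrative sketch only: diffusion-branch tokens attend jointly over
# themselves and autoregressive (understanding) features, so generation is
# conditioned on understanding. Names and shapes are assumptions.
import torch
import torch.nn as nn


class CrossModalSelfAttention(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, gen_tokens: torch.Tensor, und_feats: torch.Tensor) -> torch.Tensor:
        # gen_tokens: (B, N_gen, dim) diffusion-branch latent tokens
        # und_feats:  (B, N_und, dim) features from the understanding (AR) branch
        joint = self.norm(torch.cat([gen_tokens, und_feats], dim=1))
        attended, _ = self.attn(joint, joint, joint)  # self-attention over both modalities
        # Keep only the generation tokens, now conditioned on understanding features.
        return gen_tokens + attended[:, : gen_tokens.size(1)]


if __name__ == "__main__":
    block = CrossModalSelfAttention()
    x = torch.randn(2, 256, 768)   # dummy diffusion tokens
    c = torch.randn(2, 64, 768)    # dummy understanding features
    print(block(x, c).shape)       # torch.Size([2, 256, 768])
```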

Community

Paper author and submitter

We introduce UniX, a unified foundation model for Chest X-Ray that combines Autoregression (for understanding) and Diffusion (for generation) within a decoupled dual-branch architecture! 🏥✨

Why UniX? Current unified models often face a conflict between semantic abstraction and pixel-level reconstruction. UniX solves this via structural decoupling and Cross-Modal Self-Attention.
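
As a rough illustration of that structural decoupling (complementing the attention sketch under the abstract), the snippet below routes understanding through an autoregressive branch and generation through a diffusion branch guided by AR features. All module names here are placeholders, not the released UniX code.

```python
# Schematic dual-branch routing (illustrative placeholders only; the actual
# UniX branches, tokenizers, and denoising loop live in the official repo).
import torch
import torch.nn as nn


class DualBranchSketch(nn.Module):
    def __init__(self, ar_branch: nn.Module, diffusion_branch: nn.Module, cross_attn):
        super().__init__()
        self.ar_branch = ar_branch                # stand-in for the autoregressive branch
        self.diffusion_branch = diffusion_branch  # stand-in for the diffusion branch
        self.cross_attn = cross_attn              # cross-modal self-attention (see earlier sketch)

    def understand(self, image_tokens: torch.Tensor) -> torch.Tensor:
        # Understanding (e.g. report/label prediction) stays entirely in the AR branch.
        return self.ar_branch(image_tokens)

    def generate_step(self, noisy_latents: torch.Tensor, prompt_tokens: torch.Tensor) -> torch.Tensor:
        # One denoising step: AR features guide the diffusion branch via cross-modal attention.
        und_feats = self.ar_branch(prompt_tokens)
        guided = self.cross_attn(noisy_latents, und_feats)
        return self.diffusion_branch(guided)


if __name__ == "__main__":
    dim = 768
    model = DualBranchSketch(
        ar_branch=nn.Linear(dim, dim),         # toy stand-in for a transformer LM
        diffusion_branch=nn.Linear(dim, dim),  # toy stand-in for a diffusion denoiser
        cross_attn=lambda x, c: x,             # toy stand-in for CrossModalSelfAttention
    )
    latents = torch.randn(1, 256, dim)
    prompt = torch.randn(1, 32, dim)
    print(model.generate_step(latents, prompt).shape)  # torch.Size([1, 256, 768])
```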

🔥 Key Results: Compared to previous work such as LLM-CXR, UniX achieves:
📈 +46.1% improvement in understanding (Micro-F1).
🎨 +24.2% improvement in generation quality (FD-RadDino).
⚡ Only 25% of the parameters!

Resources:
Code: https://github.com/ZrH42/UniX
Weights: https://huggingface.co/ZrH42/UniX
Paper: arXiv:2601.11522
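
If you just want the released checkpoint files locally, a standard huggingface_hub download of the ZrH42/UniX repo (the only identifier assumed here, taken from the links above) looks like the sketch below; how the weights are then loaded is defined by the code in the GitHub repository and is not assumed here.

```python
# Download the UniX checkpoint files from the Hugging Face Hub.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="ZrH42/UniX")
print(f"UniX files downloaded to: {local_dir}")
```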
