
Research Agenda (2026–2030)

I view these three tracks as one connected agenda: stronger perception improves controllability, and controllable models accelerate translational biomedical applications.

Translational Biomedical AI

I focus on translational AI systems that bridge methodological advances and deployable clinical value, including medical imaging foundation models, EHR intelligence, and scalable biomedical data understanding, extending to emerging modalities such as single-cell data.

A Weakly Supervised Transformer for Rare Disease Diagnosis and Subphenotyping from EHRs with Pulmonary Case Studies

npj Digital Medicine 2026

Kimberly F. Greco*, Zongxin Yang*, Mengyan Li, Han Tong, Sara Morini Sweet, Alon Geva, Kenneth D. Mandl, Benjamin A. Raby, Tianxi Cai

Translational Biomedical AI · Co-first Author

Beyond Independent Genes: Learning Module-Inductive Representations for Gene Perturbation Prediction

arXiv 2026

Jiafa Ruan, Ruijie Quan, Zongxin Yang✉, Liyang Xu, Yi Yang

Translational Biomedical AI · Corresponding Author

CLINES: Clinical LLM-based Information Extraction and Structuring Agent

Preprint 2025

Zongxin Yang*, Hongyi Yuan*, Raheel Sayeed*, Amelia Li Min Tan, Enci Cai, Mohammed Moro, Xiudi Li, Huaiyuan Ying, Nicholas Brown, Griffin Weber, and others

Translational Biomedical AI · Co-first Author

MedSAM2: Segment Anything in 3D Medical Images and Videos

Preprint 2025

Jun Ma*, Zongxin Yang*, Sumin Kim, Bihui Chen, Mohammed Baharoon, Adibvafa Fallahpour, Reza Asakereh, Hongwei Lyu, Bo Wang

Translational Biomedical AI · Co-first Author

Controllable Multimodal Generation

I develop controllable multimodal generation methods for image/video/3D creation, with emphasis on compositionality, attribute-level control, and robust behavior under realistic user constraints.

RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details

arXiv 2026

Dewei Zhou, You Li, Zongxin Yang, Yi Yang

Controllable Multimodal Generation

Are Image-to-Video Models Good Zero-Shot Image Editors?

CVPR 2026

Zechuan Zhang, Zhenyuan Chen, Zongxin Yang, Yi Yang

Controllable Multimodal Generation

BideDPO: Conditional Image Generation with Simultaneous Text and Condition Alignment

ICLR 2026

Dewei Zhou, Mingwei Li, Zongxin Yang, Yu Lu, Yunqiu Xu, Zhizhong Wang, Zeyi Huang, Yi Yang

Controllable Multimodal Generation

Stroke3D: Lifting 2D strokes into rigged 3D model via latent diffusion models

ICLR 2026

Ruisi Zhao, Haoren Zheng, Zongxin Yang, Hehe Fan, Yi Yang

Controllable Multimodal Generation

Insert Anything: Image Insertion via In-Context Editing in DiT

AAAI 2026 (Oral)

Wensong Song, Hong Jiang, Zongxin Yang, Ruijie Quan, Yi Yang

Controllable Multimodal Generation · Oral

Multimodal Perception and Understanding

I study scene understanding in dynamic environments through segmentation, tracking, and multimodal reasoning, aiming to improve robustness, temporal consistency, and transferability.

Efficient training of large vision models via advanced automated progressive learning

TPAMI 2026

Changlin Li, Jiawei Zhang, Sihao Lin, Zongxin Yang, Junwei Liang, Xiaodan Liang, Xiaojun Chang

Multimodal Perception and Understanding

SELongVLM: Empowering Long Video Language Models with Self-Corrective Clip Selection

TPAMI 2026

Kecheng Zhang, Zongxin Yang, Mingfei Han, Yunzhi Zhuge, Haihong Hao, Changlin Li, Zhihui Li, Xiaojun Chang

Multimodal Perception and Understanding

Progressive Online Video Understanding with Evidence-Aligned Timing and Transparent Decisions

ICLR 2026

Kecheng Zhang, Zongxin Yang, Mingfei Han, Haihong Hao, Yunzhi Zhuge, Changlin Li, Junhan Zhao, Zhihui Li, Xiaojun Chang

Multimodal Perception and Understanding

Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge

ICLR 2025

Haomiao Xiong*, Zongxin Yang*, Jiazuo Yu, Yunzhi Zhuge, Lu Zhang, Jiawen Zhu, Huchuan Lu

Multimodal Perception and Understanding · Co-first Author

DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent)

ICML 2024

Zongxin Yang, Guikun Chen, Xiaodi Li, Wenguan Wang, Yi Yang

Multimodal Perception and Understanding · First Author

Collaboration

I welcome collaborations on clinically grounded AI, controllable multimodal generation, and robust scene understanding. If your interests align, feel free to reach out by email.
