Zongxin Yang

I am a Research Fellow in the Department of Biomedical Informatics (DBMI) at Harvard Medical School, Harvard University, working with Prof. Tianxi Cai. Previously, I was a postdoctoral researcher at CCAI, College of Computer Science and Technology, Zhejiang University (2021–2024), advised by Prof. Yi Yang. My research builds reliable and controllable multimodal learning and generation methods, with growing emphasis on translational biomedical applications.

My work is organized around three connected research directions:

1) Translational Biomedical AI

Building biology-informed and clinically grounded AI systems for medical imaging, electronic health record (EHR) intelligence, and other translational biomedical settings.

2) Controllable Multimodal Generation

Developing controllable multimodal generation methods for image, video, and 3D content, with emphasis on compositionality, reliability, and practical usability.

3) Multimodal Perception and Understanding

Advancing multimodal perception and understanding for dynamic environments through segmentation, tracking, and reasoning with robust temporal consistency.


Recent Updates

2026-02: One paper accepted to npj Digital Medicine (Nature Partner Journal).
2026-01: Three papers accepted to ICLR 2026.
2025-12: Invited as Area Chair for ECCV 2026.
2025-09: Two papers accepted to NeurIPS 2025, one as Spotlight.
2025-09: Listed in the Top 2% for Artificial Intelligence & Image Processing in Elsevier's standardized citation indicators (single recent year, 2024). (source)

Selected Publications

CATR: Combinatorial-Dependence Audio-Queried Transformer for Audio-Visual Video Segmentation

ACM MM 2023 • 2023

Kexin Li, Zongxin Yang✉, Lei Chen, Yi Yang, Jun Xiao

Multimodal Perception and Understanding • Corresponding Author • Best Paper

Paper Code

A Weakly Supervised Transformer for Rare Disease Diagnosis and Subphenotyping from EHRs with Pulmonary Case Studies

npj Digital Medicine • 2026

Kimberly F. Greco*, Zongxin Yang*, Mengyan Li, Han Tong, Sara Morini Sweet, Alon Geva, Kenneth D. Mandl, Benjamin A. Raby, Tianxi Cai

Translational Biomedical AI • Co-first Author

Paper

Are Image-to-Video Models Good Zero-Shot Image Editors?

CVPR 2026 • 2026

Zechuan Zhang, Zhenyuan Chen, Zongxin Yang, Yi Yang

Controllable Multimodal Generation

Paper

Progressive Online Video Understanding with Evidence-Aligned Timing and Transparent Decisions

ICLR 2026 • 2026

Kecheng Zhang, Zongxin Yang, Mingfei Han, Haihong Hao, Yunzhi Zhuge, Changlin Li, Junhan Zhao, Zhihui Li, Xiaojun Chang

Multimodal Perception and Understanding

Paper

BideDPO: Conditional Image Generation with Simultaneous Text and Condition Alignment

ICLR 2026 • 2026

Dewei Zhou, Mingwei Li, Zongxin Yang, Yu Lu, Yunqiu Xu, Zhizhong Wang, Zeyi Huang, Yi Yang

Controllable Multimodal Generation

Paper

Stroke3D: Lifting 2D strokes into rigged 3D model via latent diffusion models

ICLR 2026 • 2026

Ruisi Zhao, Haoren Zheng, Zongxin Yang, Hehe Fan, Yi Yang

Controllable Multimodal Generation

Paper

Beyond Independent Genes: Learning Module-Inductive Representations for Gene Perturbation Prediction

arXiv 2026 • 2026

Jiafa Ruan, Ruijie Quan, Zongxin Yang✉, Liyang Xu, Yi Yang

Translational Biomedical AI • Corresponding Author

Paper

Insert Anything: Image Insertion via In-Context Editing in DiT

AAAI 2026 (Oral) • 2026

Wensong Song, Hong Jiang, Zongxin Yang, Ruijie Quan, Yi Yang

Controllable Multimodal Generation • Oral

Paper Code Project

CLINES: Clinical LLM-based Information Extraction and Structuring Agent

Preprint • 2025

Zongxin Yang*, Hongyi Yuan*, Raheel Sayeed*, Amelia Li Min Tan, Enci Cai, Mohammed Moro, Xiudi Li, Huaiyuan Ying, Nicholas Brown, Griffin Weber, and others

Translational Biomedical AI • Co-first Author

Paper

MedSAM2: Segment Anything in 3D Medical Images and Videos

Preprint • 2025

Jun Ma*, Zongxin Yang*, Sumin Kim, Bihui Chen, Mohammed Baharoon, Adibvafa Fallahpour, Reza Asakereh, Hongwei Lyu, Bo Wang

Translational Biomedical AI • Co-first Author

Paper Code

X-Field: A Physically Grounded Representation for 3D X-ray Reconstruction

NeurIPS 2025 (Spotlight) • 2025

Feiran Wang, Jiachen Tao, Junyi Wu, Haoxuan Wang, Bin Duan, Kai Wang, Zongxin Yang, Yan Yan

Translational Biomedical AI • Spotlight

Paper Code

In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer

NeurIPS 2025 • 2025

Zechuan Zhang, Ji Xie, Yu Lu, Zongxin Yang, Yi Yang

Controllable Multimodal Generation

Paper Code Project


Selected Awards

Best Paper Award. ACM MM 2023. [Paper] [News]

1st in the VOTS 2023 challenge. ICCV 2023. [Report]

1st in Semi-Supervised Video Object Segmentation of EPIC-Kitchens Dataset Challenges. CVPR 2023. [Report]

1st in TREK-150 Object Tracking of EPIC-Kitchens Dataset Challenges. CVPR 2023. [Report]

1st in the VOT 2022 real-time segmentation tracking challenge. ECCV 2022. [Report]

1st in the VOT 2022 short-term segmentation tracking challenge. ECCV 2022. [Report]

1st in eBay eProduct Visual Search Challenge. CVPR 2022. [Report]

1st (Track 1) in the 3rd Large-scale Video Object Segmentation Challenge. CVPR 2021. [Report]

1st (Track 3) in the 3rd Large-scale Video Object Segmentation Challenge. CVPR 2021. [Report]

Guo Moruo Scholarship. USTC, 2018.
