I am a Research Fellow in the Department of Biomedical Informatics (DBMI) at Harvard Medical School, Harvard University, working with Prof. Tianxi Cai. Previously, I was a postdoctoral researcher at CCAI, College of Computer Science and Technology, Zhejiang University (2021–2024), advised by Prof. Yi Yang. My research builds reliable and controllable multimodal learning and generation methods, with growing emphasis on translational biomedical applications.
My work is organized around three connected research directions:
1) Translational Biomedical AI
Building biology-informed and clinically grounded AI systems across medical imaging, EHR intelligence, and translational biomedical settings.
2) Controllable Multimodal Generation
Developing controllable multimodal generation methods for image, video, and 3D content, with emphasis on compositionality, reliability, and practical usability.
3) Multimodal Perception and Understanding
Advancing multimodal perception and understanding for dynamic environments through segmentation, tracking, and reasoning with robust temporal consistency.
Recent Updates
- 2026-01: Three papers accepted to ICLR 2026.
- 2025-12: Invited as Area Chair for ECCV 2026.
- 2025-09: Two papers accepted to NeurIPS 2025, one as Spotlight.
- 2025-09: Listed in the Top 2% for Artificial Intelligence & Image Processing in Elsevier's standardized citation indicators (single recent year, 2024).
- 2025-09: Invited as Area Chair for CVPR 2026.
Selected Publications
CATR: Combinatorial-Dependence Audio-Queried Transformer for Audio-Visual Video Segmentation
Progressive Online Video Understanding with Evidence-Aligned Timing and Transparent Decisions
BideDPO: Conditional Image Generation with Simultaneous Text and Condition Alignment
Stroke3D: Lifting 2D strokes into rigged 3D model via latent diffusion models
Beyond Independent Genes: Learning Module-Inductive Representations for Gene Perturbation Prediction
A Weakly Supervised Transformer for Rare Disease Diagnosis and Subphenotyping from EHRs with Pulmonary Case Studies
Insert Anything: Image Insertion via In-Context Editing in DiT
CLINES: Clinical LLM-based Information Extraction and Structuring Agent
MedSAM2: Segment Anything in 3D Medical Images and Videos
X-Field: A Physically Grounded Representation for 3D X-ray Reconstruction
In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer
Selected Awards
1st in the VOTS 2023 challenge. ICCV 2023. [Report]
1st in Semi-Supervised Video Object Segmentation of EPIC-Kitchens Dataset Challenges. CVPR 2023. [Report]
1st in TREK-150 Object Tracking of EPIC-Kitchens Dataset Challenges. CVPR 2023. [Report]
1st in the VOT 2022 real-time segmentation tracking challenge. ECCV 2022. [Report]
1st in the VOT 2022 short-term segmentation tracking challenge. ECCV 2022. [Report]
1st in eBay eProduct Visual Search Challenge. CVPR 2022. [Report]
1st (Track 1) in the 3rd Large-scale Video Object Segmentation Challenge. CVPR 2021. [Report]
1st (Track 3) in the 3rd Large-scale Video Object Segmentation Challenge. CVPR 2021. [Report]
Guo Moruo Scholarship. USTC, 2018.
