Research
Research Agenda (2026–2030)
- Build clinically reliable AI systems that integrate imaging, EHR, and emerging biomedical modalities for decision support and translational impact.
- Develop controllable multimodal generative models as compositional tools for simulation, synthesis, and hypothesis-driven data generation.
- Advance multimodal perception and understanding in dynamic environments, with emphasis on temporal robustness and scalable deployment.
I view these three tracks as one connected agenda: stronger perception improves controllability, and controllable models accelerate translational biomedical applications.
Translational Biomedical AI
I focus on translational AI systems that bridge methodological advances and deployable clinical value, spanning medical imaging foundation models, EHR intelligence, and scalable biomedical data understanding (including emerging modalities such as single-cell data).
Beyond Independent Genes: Learning Module-Inductive Representations for Gene Perturbation Prediction
A Weakly Supervised Transformer for Rare Disease Diagnosis and Subphenotyping from EHRs with Pulmonary Case Studies
CLINES: Clinical LLM-based Information Extraction and Structuring Agent
Controllable Multimodal Generation
I develop controllable multimodal generation methods for image, video, and 3D creation, with emphasis on compositionality, attribute-level control, and robust behavior under realistic user constraints.
BideDPO: Conditional Image Generation with Simultaneous Text and Condition Alignment
Stroke3D: Lifting 2D strokes into rigged 3D model via latent diffusion models
Insert Anything: Image Insertion via In-Context Editing in DiT
In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer
Multimodal Perception and Understanding
I study scene understanding in dynamic environments through segmentation, tracking, and multimodal reasoning, aiming to improve robustness, temporal consistency, and transferability.
Progressive Online Video Understanding with Evidence-Aligned Timing and Transparent Decisions
Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge
DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as a Video Agent)
Scalable Video Object Segmentation with Identification Mechanism
Collaboration
I welcome collaborations on clinically grounded AI, controllable multimodal generation, and robust scene understanding. If your interests align, feel free to reach out by email.
