Controllable Generation with Diffusion Models and Applications in Medical Imaging

Shentu, Junjie (2026) Controllable Generation with Diffusion Models and Applications in Medical Imaging. Doctoral thesis, Durham University.

Deep generative modeling has undergone rapid advancement in recent years, with diffusion models (DMs) emerging as a dominant framework for high-fidelity and diverse image synthesis. A core strength of DMs is their inherent controllability: generation can be steered through various conditions. Despite these advantages, achieving reliable controllability remains challenging, as issues such as concept entanglement, multimodal misalignment, and limited adaptability to specialized domains continue to restrict the practical deployment of DMs. These challenges become even more pronounced in medical imaging, where controllability is essential for tasks such as targeted data augmentation, yet research on controllable generation with DMs in this domain remains limited. In this thesis, I investigate controllable generation with DMs from both theoretical and application-oriented perspectives. First, I advance the understanding of controllability in general-purpose text-to-image DMs by introducing an attention-driven framework for disentangling and customizing multiple visual concepts from a single image. This framework effectively mitigates feature fusion and asynchronous learning across concepts, two issues that degrade customization quality, and thereby improves fidelity and control in customized generation. Building on these insights into attention-based guidance, I then explore controllable generation in medical imaging, focusing primarily on dermoscopic and chest X-ray modalities. To address the scarcity and imbalance of dermoscopic data, I propose a text-guided diffusion-based synthesis framework, incorporating dynamic prompt construction and region-aware fine-tuning to strengthen visual-textual alignment and enable controllable generation of lesion-mask pairs.
Furthermore, a dual-branch DM is developed to tackle low-contrast bias in skin lesion segmentation by jointly controlling lesion layout and style, enabling the creation of targeted synthetic data that substantially improves segmentation performance on challenging cases while preserving overall accuracy. Finally, I extend controllable generation to multimodal medical content by proposing an integrated vision-language model capable of synthesizing clinically coherent chest X-ray images and their accompanying radiology reports. Through a novel prompt formulation and a self-supervised report generation module, the model enhances both the visual realism and clinical validity of synthetic image-report pairs. These contributions demonstrate how DMs can be endowed with fine-grained, reliable controllability, and how such controllability can be leveraged to address domain-specific challenges in medical imaging. The presented methods provide effective frameworks for the development of controllable DMs with both general and clinical utility.


Junjie_PhD_Thesis_final.pdf
