Received 15 February 2024, accepted 23 February 2024, date of publication 1 March 2024, date of current version 15 March 2024.
Digital Object Identifier 10.1109/ACCESS.2024.3372863

DiffusionDCI: A Novel Diffusion-Based Unified Framework for Dynamic Full-Field OCT Image Generation and Segmentation

BIN YANG1, JIANQIANG LI1, (Senior Member, IEEE), JINGYI WANG1, RUIQI LI1, KE GU1, (Senior Member, IEEE), AND BO LIU2, (Senior Member, IEEE)
1Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China
2School of Mathematical and Computational Sciences, Massey University, Auckland 0632, New Zealand

Corresponding author: Bo Liu (b.liu@massey.ac.nz)

This work was supported by the National Natural Science Foundation of China under Grant 62076015.

The associate editor coordinating the review of this manuscript and approving it for publication was Carmelo Militello.

ABSTRACT Rapid and accurate identification of cancerous areas during surgery is crucial for guiding surgical procedures and reducing postoperative recurrence rates. Dynamic Cell Imaging (DCI) has emerged as a promising alternative to traditional frozen section pathology, offering high-resolution displays of tissue structures and cellular characteristics. However, challenges persist in segmenting DCI images with deep learning methods, such as color variations and artifacts between patches in whole slide DCI images and the difficulty of obtaining precisely annotated data. In this paper, we introduce a novel two-stage framework for DCI image generation and segmentation. First, the Dual Semantic Diffusion Model (DSDM) is specifically designed to generate high-quality and semantically relevant DCI images. These images not only serve as an effective means of data augmentation for downstream segmentation tasks but also reduce the reliance on expensive and hard-to-obtain large annotated medical image datasets. Furthermore, we reuse the pretrained DSDM to extract diffusion features, which are then infused into the segmentation network via a cross-attention alignment module. This approach enables our network to capture and utilize the characteristics of DCI images more effectively, thereby significantly enhancing segmentation results. Our method was validated on the DCI dataset and compared with other methods for image generation and segmentation. Experimental results demonstrate that our method achieves superior performance in both tasks, proving the effectiveness of the proposed model.

INDEX TERMS Semantic diffusion model, image synthesis, image segmentation, dynamic cell imaging.

I. INTRODUCTION
Accurate intraoperative diagnosis of breast cancer is crucial for guiding surgical procedures and enhancing patient outcomes. Traditional methods like intraoperative frozen tissue section diagnosis are hampered by challenges such as imprecise lesion localization, ambiguous margin assessment, and lengthy diagnostic processes, increasing the likelihood of additional surgeries [1], [2]. Dynamic Full-Field Optical Coherence Tomography (D-FFOCT) [3], also known as DCI, is a breakthrough non-destructive imaging method that uses light scattering and interference measurements to reveal microscopic details within tissues. While D-FFOCT shows significant potential in replacing traditional pathological imaging methods by rapidly providing histology-like images, it also presents its own set of challenges.
For instance, due to the unique characteristics of cellular activity imaging, whole slide DCI images may exhibit obvious color variations and artifacts (as described in Section IV-B), complicating image analysis and accurate diagnosis [4]. Additionally, the sheer volume of images generated by this technology burdens manual interpretation with time-consuming and error-prone tasks, further complicating intraoperative diagnoses. Automated medical image analysis based on deep learning provides an effective solution to these challenges. Advanced models such as Convolutional Neural Networks (CNNs) have significantly improved the speed and accuracy of medical image analysis, particularly in segmentation. However, these models largely depend on extensive annotated training data, which can be costly and limited by available medical resources and expertise. Therefore, new technologies are urgently needed to address these issues.

Denoising Diffusion Probabilistic Models (DDPMs) have shown impressive success in generating various medical images, such as MRIs [5], [6], CTs [7], ultrasounds [8], [9], and histopathology [10]. However, few studies have explored their application in semantically guided medical image synthesis, especially in scenarios involving limited datasets with semantic labels. In this paper, we introduce the Dual Semantic Diffusion Model (DSDM), specifically tailored to generate DCI images with high quality and semantic relevance. The model conditions on semantic masks and reference images to generate images conforming to specific semantic layouts. Semantic masks are integrated via multi-layer Spatially-Adaptive Normalization (SPADE) [11], aligning the denoising process with the semantic guidance and yielding synthetic images that closely resemble real medical images both visually and semantically. Reference images, through an attention alignment module, introduce style information, ensuring feature alignment at corresponding spatial positions, thus enhancing global feature correlation and maintaining similarity with the reference images. Our approach, as an effective data augmentation method, provides additional training data for downstream segmentation tasks, enhancing the model's learning and understanding of various image features and thereby improving segmentation accuracy and robustness. Moreover, it decreases reliance on large, costly, and hard-to-obtain annotated medical image datasets.

Diffusion models have recently demonstrated significant potential in various image segmentation tasks [12], [13], including medical images [14], [15], [16]. They mainly operate in three ways. The first uses ground truth (semantic labels) as input and employs the diffusion model's random sampling process to generate implicit segmentation masks, with each step in training and sampling conditioned on the image as a prior. However, this approach can introduce high-frequency noise into the generated masks, affecting prediction accuracy. While post-processing methods [17] such as averaging, median blurring, and fully connected CRFs mitigate these issues, they are hard to avoid completely.
An alternative strategy involves first pretraining a U-Net-based diffusion model on tasks like image generation, followed by fine-tuning to boost segmentation performance. Additionally, some methods [18], [19] extract fixed image features from diffusion models and then generate segmentation maps using simple classifiers. To maximize the use of synthetic images and pretrained diffusion models for segmentation, we propose a unified segmentation framework. The first stage focuses on generating high-quality, semantically relevant DCI images, while the second stage is dedicated to enhancing image segmentation accuracy and efficiency. Specifically, we input noisy images (synthetic or real) into DSDM, extracting multi-level features from various noise levels across different blocks. These diffusion features are then infused into the segmentation network through a cross-attention alignment module (CAA). Notably, the segmentation model in the second stage reuses the pretrained diffusion U-Net backbone structure from the first stage and is initialized with its pretrained weights.

Our extensive experiments on the DCI dataset demonstrate the framework's superiority. Both quantitative and qualitative results confirm that our generative model produces high-fidelity, diverse DCI images. The segmentation model identifies and segments cancerous areas more accurately, providing essential support for surgeons and enhancing surgical success rates and patient quality of life. In summary, the contributions of this paper are manifold:

1) A generation network, the Dual Semantic Diffusion Model (DSDM), has been developed. This model uniquely combines semantic masks and reference images to create synthetic images that closely resemble real DCI images, both visually and semantically.
2) A segmentation model is also proposed, leveraging diffusion features derived from the pretrained diffusion model, thereby significantly enhancing the network's ability to segment images.
3) This work integrates image generation and segmentation into a cohesive framework, termed DiffusionDCI. In this framework, images generated by the diffusion model are directly utilized for training the segmentation network, enriching the quality and diversity of task data. An advanced attention alignment module has been incorporated, effectively transmitting knowledge of diffusion probability distributions from the image generation domain to the segmentation network.
4) Comprehensive experiments conducted on the DCI dataset demonstrate the superior performance of this approach in both image generation and segmentation tasks.

II. RELATED WORK
A. MEDICAL IMAGE GENERATION
Previous studies have predominantly utilized Generative Adversarial Networks (GANs) [20] for creating realistic natural and medical images [21], [22], [23]. Calimeri et al. [23] introduced a versatile framework for image-to-image translation based on conditional GANs [24]. Wang et al. [25] employed a novel adversarial loss and a multi-scale architecture (Pix2pixHD) to synthesize high-resolution, lifelike images from semantic label maps. DermGAN, based on the Pix2Pix architecture [26], was developed to generate high-fidelity clinical skin images with pathology for specific medical conditions. Additionally, Wang et al. [27] utilized spatially conditioned normalization for photorealistic image synthesis from semantic layouts, emphasizing semantic coding and stylization.
Zhang et al. [28] proposed a novel noise adaptation GAN, incorporating adversarial, style, and content losses for noise style transfer in OCT and ultrasound images. However, GAN-based methods face significant challenges, including mode collapse, training non-convergence, and instability. In contrast, the Denoising Diffusion Probabilistic Model (DDPM) represents a new class of generative models that use a Markov chain process to transform a simple distribution, such as Gaussian noise, into complex data distributions, thereby achieving high-quality image generation. Compared to GANs, diffusion models offer better interpretability, stability, and controllability, and can produce more realistic images [29]. They have also been explored for conditional generation tasks, such as class-conditional, image-guided, and semantic-conditional generation [30]. Wolleb et al. [31] proposed classifier-guided diffusion models for image-to-image translation, focusing only on altering anomalous regions to healthy outcomes. Wang et al. [32] introduced the Semantic Diffusion Model (SDM), a unified framework allowing for either linguistic or image guidance, or both. However, semantic-guided synthesis in medical imaging, especially for DCI images, has been less explored. In parallel, NASDM [33] leverages SDM to generate realistic histological images of colon tissue from semantic masks composed of various nuclei types. The main difference between these methods and our DSDM is that we introduce not only semantic guidance but also a new reference condition, coupled with an attention alignment module, to enhance image quality. Furthermore, based on a pretrained DSDM, we explore the use of diffusion features to assist segmentation tasks.

B. MEDICAL IMAGE SEGMENTATION
Medical image segmentation involves separating areas of interest (such as organs, tissues, cells, and lesions) from the background or other regions in medical images, aiding doctors in disease diagnosis, treatment planning, and assessment [34]. Diffusion models have recently demonstrated significant potential in medical image segmentation. SegDiff [12] first used a diffusion probabilistic approach to design an end-to-end medical image segmentation model, injecting information into the denoising process to iteratively refine the segmentation map. In a different approach, pixel-wise uncertainty maps of the segmentation were computed, allowing an implicit ensemble of segmentations that enhances segmentation performance [16]. The parallel work MedSegDiff [17] introduced dynamic conditional encoding and an FF-Parser to improve DDPM-based segmentation models. The upgraded MedSegDiff-v2 [35] introduced a new transformer-based conditional U-Net framework and a novel Spectrum-Space Transformer (SSFormer), enhancing the interaction between noise and semantic features and further improving segmentation performance. However, these methods do not explicitly apply diffusion features to segmentation. Inspired by [36], which independently uses diffusion features to aid segmentation, we infuse multi-scale diffusion features from the generative model into the segmentation network via CAA to enhance segmentation performance.

III. PRELIMINARIES: DENOISING DIFFUSION PROBABILISTIC MODELS
The denoising diffusion probabilistic model (DDPM) is a generative model inspired by non-equilibrium thermodynamics that defines a Markov chain of diffusion steps.
The purpose is to slowly add stochastic noise to the original data and then learn to reverse this process, constructing the desired samples from noise. The core of DDPM involves two processes: a forward process and a backward (reverse) process. Given a data point sampled from a real data distribution x_0 ~ q(x), we define a forward diffusion process that introduces stochastic noise (usually parameterized with a Gaussian kernel) over steps t ∈ {0, 1, 2, ..., T}, producing a sequence of noisy samples x_1, ..., x_T. The step sizes are controlled by a variance schedule {\beta_t \in (0,1)\}_{t=1}^{T}. Eventually x_T becomes an isotropic Gaussian distribution. The forward process is described by:

q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}),   (1)

q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t; \sqrt{1-\beta_t}\, x_{t-1}, \beta_t I_{n \times n}\big),   (2)

where I_{n×n} is the identity matrix of size n; T = 1000–5000 is a typical choice in most works. The resulting distribution of x_t given x_0 is then:

q(x_t \mid x_0) = \mathcal{N}\big(x_t; \sqrt{\bar{\alpha}_t}\, x_0, (1-\bar{\alpha}_t) I_{n \times n}\big).   (3)

The forward process has the convenient property that x_t can be sampled at an arbitrary time step t in closed form via the reparameterization trick. Let \bar{\alpha}_t = \prod_{i=1}^{t} (1 - \beta_i); then x_t can be sampled by:

x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I_{n \times n}).   (4)

If β_t is sufficiently small, the reverse process also follows a Gaussian distribution. Thus the reverse (generative) process is likewise defined as a Markov chain parameterized by θ, gradually denoising Gaussian noise x_T ~ N(0, I) into a clean data sample over steps t ∈ {T, T−1, ..., 0}. Since q(x_{t-1} | x_t) depends on the entire data distribution, a neural network p_θ is trained to approximate the reverse denoising process:

p_\theta(x_{0:T-1} \mid x_T) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t),   (5)

p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)\big),   (6)

where \Sigma_\theta(x_t, t) can be either a fixed covariance or a learned parameter; learning it has been shown to improve model quality [37]. In practice, rather than directly parameterizing \mu_\theta(x_t, t) as a neural network, a noise predictor network \epsilon_\theta(x_t, t) is trained to estimate the noise component at step t. With \alpha_t = 1 - \beta_t, the mean \mu_\theta(x_t, t) can then be written as a function of the predicted noise:

\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \Big( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \Big).   (7)

The learning objective for the neural network, which measures the distance between the real noise \epsilon and the noise estimate \epsilon_\theta(x_t, t), is derived from the variational lower bound (VLB) and simplified by Ho et al. [38]:

L_{simple} = \mathbb{E}_{x_0, \epsilon, t} \big[ \| \epsilon - \epsilon_\theta(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon, t) \|^2 \big],   (8)

where x_0, \epsilon, and t are sampled from q(x_0), \mathcal{N}(0, I_{n \times n}), and the discrete uniform distribution \mathcal{U}(0, T), respectively. For inference, the reverse diffusion steps are performed starting from a random sample x_T ~ \mathcal{N}(0, I). For each timestep t ∈ {T, T−1, ..., 1}, x_t is iteratively denoised to x_{t-1} by reparameterizing Eq. (6):

x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \Big( x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \Big) + \sigma_t z,   (9)

where \sigma_t denotes the variance scheme, which can be learned by the model as proposed in [37].
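To make these preliminaries concrete, the following PyTorch sketch implements Eqs. (4), (8), and (9) for a toy noise predictor. It is a minimal illustration under our own assumptions, not the paper's implementation: the tiny convolutional network stands in for the U-Net ε_θ, and a linear β schedule is used for brevity where the paper later adopts a cosine schedule.

```python
# Minimal DDPM sketch: forward noising (Eq. 4), the simplified loss (Eq. 8),
# and one reverse sampling step (Eq. 9). Illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)         # linear schedule for brevity
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)     # \bar{alpha}_t = prod(1 - beta_i)

class TinyEps(nn.Module):
    """Stand-in for the U-Net noise predictor eps_theta(x_t, t)."""
    def __init__(self, ch=3):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(ch + 1, 32, 3, padding=1), nn.SiLU(),
                                 nn.Conv2d(32, ch, 3, padding=1))
    def forward(self, x_t, t):
        # Broadcast the normalized timestep as an extra input channel.
        t_map = (t.float() / T).view(-1, 1, 1, 1).expand(-1, 1, *x_t.shape[2:])
        return self.net(torch.cat([x_t, t_map], dim=1))

def q_sample(x0, t, eps):
    """Eq. (4): x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) eps."""
    ab = alpha_bars[t].view(-1, 1, 1, 1)
    return ab.sqrt() * x0 + (1 - ab).sqrt() * eps

def loss_simple(model, x0):
    """Eq. (8): predict the noise injected at a random timestep."""
    t = torch.randint(0, T, (x0.size(0),))
    eps = torch.randn_like(x0)
    return F.mse_loss(model(q_sample(x0, t, eps), t), eps)

@torch.no_grad()
def p_sample_step(model, x_t, t):
    """Eq. (9): one denoising step x_t -> x_{t-1}, with fixed sigma_t^2 = beta_t."""
    coef = (1 - alphas[t]) / (1 - alpha_bars[t]).sqrt()
    t_batch = torch.full((x_t.size(0),), t)
    mean = (x_t - coef * model(x_t, t_batch)) / alphas[t].sqrt()
    z = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
    return mean + betas[t].sqrt() * z

model = TinyEps()
print(loss_simple(model, torch.randn(2, 3, 32, 32)))  # one training-loss value
```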
FIGURE 1. Overview of the proposed unified segmentation framework.

IV. METHOD
A. OVERVIEW
DSDM is implemented on a diffusion-based U-Net framework, as depicted in Figure 1, which consists of two parts: semantic image synthesis and segmentation. In the first stage, two conditioning manners for semantic guidance steer the U-Net-based diffusion network: a reference image condition and a semantic mask condition. During training, a compact encoder (ResNet-18) extracts coarse features F1, including appearance and structure features, from a noise-free reference DCI image I. This conditioning imposes an initial high-level semantic reference on the diffusion model, aiding in the reduction of diffusion variances. The high-level features F1 are then processed through the multi-scale CAA modules, which transform F1 into various resolutions and progressively align it with the middle image features in the SPADE ResBlocks (SRES). Simultaneously, the semantic mask is injected into the SPADE ResBlocks as a multi-layer spatially adaptive semantic condition. In the sampling process, given a semantic mask and a reference image, a semantically relevant and realistic synthetic DCI image is recovered from an initial Gaussian distribution.

In the segmentation stage, the network's architecture, shown at the bottom of Figure 1, reuses the denoising U-Net's backbone, substituting the SPADE layers with group normalization. To leverage the information learned in the image synthesis stage, the network is initialized with the pretrained weights of the first-stage diffusion network. Given a noisy synthetic image Xt, it is passed through the first-stage diffusion network, with frozen parameters, to extract internal diffusion features from various ResBlocks. These features contain visual-linguistic semantic information and have proven effective in semantic segmentation tasks [39]. These features are then injected into the segmentation network via a cross-attention mechanism at the corresponding ResBlocks, in order to make maximal use of the semantic and diverse visual-linguistic information.

B. SYNTHETIC DCI IMAGE GENERATION BASED ON DUAL SEMANTIC GUIDANCE
1) COLOR NORMALIZATION
To construct a whole slide DCI image, axial (10 nanometers) and transverse (1.24 mm × 1.24 mm patch) scans are performed on each biopsy specimen using a Light-CT scanner, producing 512 regions of interest (ROIs). For each pixel, the power spectral density (PSD) is computed and subjected to saturation normalization, thereby generating an RGB image for each ROI patch. Subsequently, several ROI patches are combined along the horizontal and vertical axes through continuous scanning to form a complete DCI image. Because the imaging principle attributes a color to each pixel from the characteristic time period or frequency of the cell dynamic signal, the shorter the imaging duration, the better the preservation of tissue freshness and the better the quality of the generated DCI signal. However, the process of continuously scanning and generating multiple ROI patches may result in color discrepancies between the ROI patches of the complete DCI image, potentially reducing the effectiveness of training for the generation and segmentation models. Therefore, in this work, we employ a stain normalization method modified from Reinhard [26] to transform all ROI patches of a complete DCI image to match the average staining distribution of the whole slide DCI image.
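As an illustration of this step, the sketch below applies classic Reinhard-style statistics matching in LAB color space. The paper's modified variant is not specified in detail, so mapping each ROI patch to the slide-level average LAB statistics is our assumption; skimage is assumed to be available.

```python
# Reinhard-style color normalization sketch: shift/scale each LAB channel of a
# patch to target statistics, here the average over all patches of one slide.
import numpy as np
from skimage.color import rgb2lab, lab2rgb

def lab_stats(img_rgb):
    """Per-channel mean and std in LAB space for one RGB patch in [0, 1]."""
    lab = rgb2lab(img_rgb)
    return lab.mean(axis=(0, 1)), lab.std(axis=(0, 1))

def reinhard_normalize(patch_rgb, target_mean, target_std, eps=1e-6):
    """Match a patch's LAB statistics to the slide-level target statistics."""
    lab = rgb2lab(patch_rgb)
    mean, std = lab.mean(axis=(0, 1)), lab.std(axis=(0, 1))
    lab = (lab - mean) / (std + eps) * target_std + target_mean
    return np.clip(lab2rgb(lab), 0.0, 1.0)

# Slide-level target statistics: average the per-patch LAB stats.
patches = [np.random.rand(256, 256, 3) for _ in range(4)]  # placeholder ROIs
means, stds = zip(*(lab_stats(p) for p in patches))
target_mean, target_std = np.mean(means, axis=0), np.mean(stds, axis=0)
normalized = [reinhard_normalize(p, target_mean, target_std) for p in patches]
```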
2) SEMANTIC DIFFUSION MODEL
a: NETWORK
The DSDM backbone, a U-Net-based conditional diffusion model described in Wang et al. [27], maintains a structure that allows injecting a semantic mask via SPADE. The DSDM (shown in Figure 1) includes four components: a) a diffusion encoder that converts noise samples into latent representations; b) a diffusion decoder that reconstructs or generates images conditioned on the two types of guidance; c) a reference image encoder that encodes the reference image into embeddings; and d) a multi-scale cross-attention alignment module that aligns conditional reference image features with the middle image diffusion features.

Diffusion encoder: we use residual blocks [40] with self-attention [41] as encoder resblocks, which consist of convolutions, GroupNorm, SiLU [42], and skip connections.

Diffusion decoder: unlike in the encoder resblocks, the group normalization layer is replaced by spatially-adaptive normalization (SPADE) [11], which injects the semantic mask into the diffusion network in a multi-layer spatially adaptive manner (as shown in Figure 1).

Reference image encoder: considering that the same cancer may present different types, reference image guidance can generate images with the same or a similar layout. We therefore use a refined ResNet-18 for the reference image condition, which better captures DCI image features such as texture and color. Specifically, we modify the standard ResNet-18 architecture by stripping away the final classification layer and appending a fully connected layer that produces vectors of dimension 512. We use the contrastive learning strategy proposed in [43] to train the reference image encoder on the DCI dataset.

Multi-scale Cross-Attention Alignment Module (MCAAM): Rombach et al. [44] exploited the spatial transformer [41] as a flexible and powerful conditioning mechanism for combining conditional features with the diffusion U-Net features. Inspired by this, we propose the MCAAM, consisting of an embedding transform layer and Cross-Attention Alignment (CAA), to inject the reference image condition. The embedding transform layer uses a fully connected layer \Psi_{FC} to project the lower-dimensional encoded image features into a higher-dimensional feature, reshaping it to align with the multi-scale resolutions of the middle features. Given the middle features F_m \in \mathbb{R}^{b \times h \times w} and the reference image embedding F_r \in \mathbb{R}^{b \times n}, with K, V \in \mathbb{R}^{b \times h \times w} being outputs of the embedding transform layer, CAA is defined as:

Q = W_Q F_m, \quad K = W_K F_r, \quad V = W_V \Psi_{FC}(F_r),   (10)

\mathrm{CAA}(Q, K, V) = \mathrm{Softmax}\Big(\frac{Q K^T}{\sqrt{C}}\Big) V,   (11)

where b is the batch size, h and w are the resolution of the middle features, and n is the length of the reference image embedding. Furthermore, a residual connection is applied between the input and output of the CAA modules, ensuring the integration of the original features throughout the process.
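A minimal PyTorch sketch of the CAA idea in Eqs. (10) and (11) follows. The single attention head, the token count used to reshape the reference embedding, and the exact placement of Ψ_FC are simplifications of ours, not the authors' exact module.

```python
# Cross-Attention Alignment (CAA) sketch: a 512-d reference embedding F_r
# conditions the middle diffusion features F_m, with a residual connection.
import math
import torch
import torch.nn as nn

class CAA(nn.Module):
    def __init__(self, feat_ch, embed_dim=512, tokens=16):
        super().__init__()
        self.tokens = tokens
        # Embedding transform layer (Psi_FC): project F_r and reshape it into
        # `tokens` key/value entries so it can align with any feature resolution.
        self.fc = nn.Linear(embed_dim, tokens * feat_ch)
        self.to_q = nn.Linear(feat_ch, feat_ch)
        self.to_k = nn.Linear(feat_ch, feat_ch)
        self.to_v = nn.Linear(feat_ch, feat_ch)

    def forward(self, f_m, f_r):
        # f_m: (B, C, H, W) middle features; f_r: (B, 512) reference embedding.
        b, c, h, w = f_m.shape
        q = self.to_q(f_m.flatten(2).transpose(1, 2))       # (B, H*W, C)
        ctx = self.fc(f_r).view(b, self.tokens, c)          # transformed F_r
        k, v = self.to_k(ctx), self.to_v(ctx)               # (B, tokens, C)
        attn = torch.softmax(q @ k.transpose(1, 2) / math.sqrt(c), dim=-1)
        out = (attn @ v).transpose(1, 2).view(b, c, h, w)
        return f_m + out                                    # residual connection

caa = CAA(feat_ch=64)
y = caa(torch.randn(2, 64, 32, 32), torch.randn(2, 512))   # -> (2, 64, 32, 32)
```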
b: LOSS
In the preliminaries section, we reviewed the theory of the unconditional DDPM; for convenience, we follow the symbol definitions of Section III. If the data is sampled from a conditional data distribution x_0 ~ q(x_0 | y), where y is the condition (e.g., a semantic mask or image embedding), the sampling distribution becomes:

p_\theta(x_{0:T-1} \mid x_T, y) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t, y),   (12)

p_\theta(x_{t-1} \mid x_t, y) = \mathcal{N}\big(x_{t-1}; \mu_\theta(x_t, y, t), \Sigma_\theta(x_t, y, t)\big).   (13)

We can train DSDM using a hybrid loss function following previous work [37]:

L_{hybrid} = L_{simple} + \lambda L_{vlb},   (14)

L_{simple} = \mathbb{E}_{x_0, \epsilon, t}\big[ \| \epsilon - \epsilon_\theta(x_t, y, t) \|^2 \big],   (15)

L_{vlb} = D_{KL}\big( q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t, y) \big),   (16)

where λ is a trade-off parameter that prevents L_{vlb} from overwhelming L_{simple}.

c: CLASSIFIER-FREE GUIDANCE
To enhance the correlation between the sampled images and the semantic label maps, we adopt classifier-free guidance [45]. This approach eliminates the need to train a separate classifier model and has been widely applied to diffusion models conditioned on a variety of modalities, including text, images, and class labels [45], [46], [47]. During training, we replace the condition (reference embedding and semantic mask) with a null label with a certain probability, thus unifying the training of the conditional and unconditional models. During inference, a guidance scale s controls the contribution of the guidance condition to the noise estimation, enabling a balance between sample quality and diversity. The process can be formulated as follows:

\hat{\epsilon}_\theta(x_t \mid y) = \epsilon_\theta(x_t \mid y) + s \cdot \big(\epsilon_\theta(x_t \mid y) - \epsilon_\theta(x_t \mid \varnothing)\big).   (17)

Since we empirically observe the best results for s = 3, we fix this choice for all generation experiments.

FIGURE 2. Comparison of color normalization results. (A) shows original images. (B) shows transformed images with the modified Reinhard method.

C. SEGMENTATION
1) EXTRACTING DIFFUSION REPRESENTATIONS
To identify the features that contribute most to the model's segmentation performance, we analyze the diffusion features produced by the different blocks within the U-Net of DSDM at every 100th diffusion timestep in the range [100, 1000]. A small subset of DCI images with masks from the DCI dataset is used for training and validation. A straightforward decoder is trained on these features to predict the associated masks. Specifically, for a given image I at a diffusion timestep t, we first extract internal features with the dual semantic condition input to the DSDM at that timestep. Subsequently, these features are used to train the corresponding decoder until convergence. The model's performance is then evaluated using the mean Intersection over Union (mIoU) metric on a select subset of the validation dataset. Figure 3 illustrates the segmentation results obtained with intermediate features from various resblocks and timesteps. The features produced by the middle blocks {6, 8, 10} of the U-Net decoder appear to be the most informative. In contrast to the findings of previous work [39], the most useful information for segmentation appears over a wider range of timesteps, from 200 to 800.

FIGURE 3. Performance of DSDM pixel-wise representations on the DCI dataset.

2) ARCHITECTURE
The structure of our segmentation model is shown at the bottom of Figure 1 and is identical to the first-stage denoising U-Net structure; it can therefore be initialized with the pretrained weights of the first-stage model. However, a domain gap exists between the diffusion embedding and the segmentation semantic embedding when predicting redundant noise from the noisy image. To address this issue, we reuse the Cross-Attention Alignment (CAA) module (shown in Figure 4), which allows the model to learn the interaction between noise features and semantic features, resulting in stronger feature representations. Following the analysis of diffusion features above, we incorporate the intermediate features from blocks {6-10} of the DSDM into the corresponding blocks of the segmentation model.

FIGURE 4. The cross-attention mechanism integrates the diffusion features obtained from DSDM into the corresponding feature space of the segmentation model.
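The following sketch illustrates one way to realize this reuse with forward hooks on a frozen DSDM. The `dsdm` model, its `decoder_blocks` list, and the segmentation call are hypothetical placeholders of ours, and `q_sample` refers to the DDPM sketch in Section III.

```python
# Sketch: extract frozen intermediate decoder features from DSDM at one
# timestep, then feed them to the segmentation network's CAA blocks.
import torch

@torch.no_grad()
def extract_diffusion_features(dsdm, x0, t, block_ids=(6, 7, 8, 9, 10)):
    """Noise x0 to step t (Eq. (4), q_sample from the Section III sketch),
    run the frozen DSDM once, and collect the chosen decoder activations."""
    feats, hooks = {}, []
    for i in block_ids:
        # dsdm.decoder_blocks is a hypothetical handle on the U-Net decoder.
        h = dsdm.decoder_blocks[i].register_forward_hook(
            lambda mod, inp, out, i=i: feats.__setitem__(i, out.detach()))
        hooks.append(h)
    t_batch = torch.full((x0.size(0),), t, dtype=torch.long)
    dsdm(q_sample(x0, t_batch, torch.randn_like(x0)), t_batch)
    for h in hooks:
        h.remove()
    return feats

# In the segmentation forward pass, each block with a matching resolution
# cross-attends to feats[i] through a CAA module before continuing, e.g.:
#   logits = seg_net(x_t, t_batch, diffusion_feats=feats)
```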
V. EXPERIMENT
In this section, we detail the design and execution of our experiments. All models are implemented with the PyTorch framework and trained on a single NVIDIA RTX 3090.

A. EXPERIMENTAL SETUP
1) DATASET
The DCI dataset contains 300 breast images from 141 patients in total, covering 235 malignant cases and 74 benign lesions or normal cases, with their corresponding semantic masks and the pathological diagnoses obtained through standard H&E staining. Each DCI image is assigned a full segmentation annotation for malignant and non-malignant types. For the annotation of the DCI dataset, three experienced pathologists used QuPath, a powerful open-source software for digital pathology. The process involves the pathologists meticulously reviewing each slide and delineating regions of interest based on the corresponding histological diagnosis, thereby generating semantic masks. The semantic mask in the dataset consists of three class labels: 0 represents the background, 1 corresponds to benign tissue, and 2 indicates malignant tissue. The DCI image resolution ranges from 2592 × 2592 pixels to 21024 × 21024 pixels. Regions of interest (ROIs) of 3450 × 3450 pixels were therefore generated from the original images using a sliding-window algorithm with steps of 3,450 pixels (horizontal and vertical). We extract a total of 7000 patches, including 5000 training patches and 2000 testing patches, which are resized to 256 × 256 pixels for training and testing in the generation stage. In addition, the dataset used for the segmentation task contains the original DCI images and the synthetic images generated using the DSDM. We use all semantic maps in the training dataset as input to the generative model and generate synthetic images with the corresponding semantic masks. Note that samples displaying normal breast tissue and various cancerous changes are evenly distributed across both low-quality and high-quality image subsets. Accordingly, two experienced clinicians screened the generated images for both the quantitative and qualitative evaluations. Section V-C2 explains how we apply the synthetic images for data augmentation.

2) MODEL SETTINGS
For the generation experiments, the architecture of our DSDM is described in Section IV-B2. The denoising U-Net is trained from scratch, while the image encoder ResNet-18 is pretrained with the contrastive learning strategy. Following learning-rate settings that have proven effective in similar image generation tasks [33], we use the Adam [48] optimizer with a learning rate of 1e-4 and a batch size of 10 to train the whole model for 50,000 iterations. Additionally, we choose a cosine noise schedule for all diffusion models. For the segmentation experiments, we optimize the parameters of the segmentation network with the cross-entropy (CE) loss using the Adam optimizer with a learning rate of 0.001, a widely accepted starting point for training segmentation models [35], while keeping the parameters of the pretrained DSDM frozen. We train the model with a batch size of 12 for 200 epochs. The primary image augmentation techniques are: horizontally flipping the image with a 50% probability; randomly adjusting the image's brightness, saturation, contrast, and hue with an 80% probability; and normalizing the tensor using the mean and variance of the dataset, adjusting its range from (0, 255) to (-1, 1).
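A sketch of this augmentation pipeline with torchvision is shown below. The jitter strengths and the dataset mean/std values are placeholders of ours, since the paper specifies only the probabilities and the target range.

```python
# Augmentation pipeline sketch: flip (p=0.5), color jitter (p=0.8), and
# normalization from (0, 255) to roughly (-1, 1).
from torchvision import transforms

DATASET_MEAN = [0.5, 0.5, 0.5]   # placeholder: computed over the DCI dataset
DATASET_STD = [0.5, 0.5, 0.5]    # placeholder: computed over the DCI dataset

train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomApply(
        [transforms.ColorJitter(brightness=0.2, contrast=0.2,
                                saturation=0.2, hue=0.05)], p=0.8),
    transforms.ToTensor(),                            # (0, 255) -> (0, 1)
    transforms.Normalize(DATASET_MEAN, DATASET_STD),  # (0, 1) -> (-1, 1)
])
```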
TABLE 1. Ablation study for different configurations of our model.

3) METRICS
To evaluate generation performance, we use two objective indicators: Fréchet Inception Distance (FID) [49] and Learned Perceptual Image Patch Similarity (LPIPS) [50]. FID measures the fidelity of the generated images. LPIPS measures the distance between multimodal generated images, which reflects the diversity of the generated images. FID is evaluated on 2000 synthesized images against the 2000 images of the test dataset, while LPIPS is computed on sets of 10 sampled images for each of the 2000 test images. To evaluate the segmentation performance of the various neural networks, we focus on five of the most widely used evaluation metrics: mean Dice coefficient (mDice, a.k.a. F1 score), mean Intersection over Union (mIoU), precision, recall, and accuracy.
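Both metrics can be computed with standard packages. The sketch below, assuming the torchmetrics (with torch-fidelity) and lpips libraries are installed, mirrors the protocol described above, with small random tensors standing in for the real and synthetic patches.

```python
# Evaluation sketch: FID over real vs. synthetic patches, and LPIPS diversity
# as the mean pairwise distance within the samples generated per test image.
import itertools
import torch
import lpips
from torchmetrics.image.fid import FrechetInceptionDistance

# FID: the paper compares 2000 synthetic patches against the 2000-image test
# set; 64 placeholder images are used here to keep the sketch light.
fid = FrechetInceptionDistance(feature=2048, normalize=True)  # floats in [0, 1]
real = torch.rand(64, 3, 256, 256)    # placeholder for the real test patches
fake = torch.rand(64, 3, 256, 256)    # placeholder for the synthetic patches
fid.update(real, real=True)
fid.update(fake, real=False)
print("FID:", fid.compute().item())

# LPIPS diversity over the 10 samples generated for one test image;
# LPIPS inputs are expected in [-1, 1].
lpips_fn = lpips.LPIPS(net="alex")
samples = torch.rand(10, 3, 256, 256) * 2 - 1
with torch.no_grad():
    dists = [lpips_fn(samples[i:i + 1], samples[j:j + 1])
             for i, j in itertools.combinations(range(10), 2)]
print("LPIPS diversity:", torch.stack(dists).mean().item())
```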
B. GENERATING RESULTS
1) ABLATION STUDY
The ablation study assesses the impact of different components on the U-Net-based denoising model for image generation. To fully evaluate the performance of the conditional modules of the network, we retain the U-Net backbone of DSDM and replace the SPADE layers in the decoder with GroupNorm, obtaining an initial unconditional denoising U-Net that we train from scratch as the baseline. In Table 1, we present both quantitative and qualitative results obtained by conditioning the model with either a reference image (RI), semantic masks (SM), or a combination of both. When SM and RI are used individually, we observe improvements in the FID scores and reductions in the LPIPS scores. Notably, the use of SM leads to a greater improvement than RI alone. This could be attributed to the explicit semantic constraints provided by SM, which more effectively guide the learning and understanding of the data, thereby enhancing the quality of the generated images. When both SM and RI are utilized simultaneously, the FID score achieves the best result, and the LPIPS score shows only a slight decrease, indicating a well-maintained balance between quality and diversity.

Furthermore, we also explore the influence of color normalization and the classifier-free guidance strategy. As shown in Table 2, the introduction of color normalization does not significantly impact diversity but slightly improves FID. Comparing the third and fourth rows of the table, it is evident that the classifier-free guidance strategy significantly enhances the FID metric, albeit with a marginal increase in the LPIPS scores. This finding suggests that classifier-free guidance effectively boosts image quality, contributing to a notable improvement in the FID metric.

TABLE 2. Ablation study for color normalization and the classifier-free guidance strategy on denoising U-Net baselines.

2) COMPARISON WITH OTHER GENERATION METHODS
We report FID and LPIPS results compared with previous works. The comparison methods chosen for this paper are as follows. CycleGAN is a method for unpaired image-to-image translation, meaning it can transform images from one domain to another without requiring corresponding images in the source and target domains; it is widely used in medical image generation. AttributeGAN is designed to manipulate specific attributes of an image, such as facial features or expressions, allowing controlled modifications while maintaining the identity and main features of the original image. MorphDiffusion introduced prioritized morphology weighting and color normalization into the diffusion model to synthesize high-quality histopathological images of brain cancer. NASDM is a nuclei-aware semantic tissue generation framework based on SDM, which achieves accurate nuclear localization in the generated samples through semantic instance masks of different nuclei types.

To ensure a fair and comprehensive comparison, we evaluate all models under identical conditions using the same datasets and metrics. For the parameter settings of each model, we adhere as closely as possible to the common configurations specified in their original papers. By retraining and testing them on our DCI dataset, we ensure a direct comparison of each model's capabilities in a consistent environment.

TABLE 3. Quantitative comparison with different methods.

As shown in Table 3, CycleGAN has the highest FID score, indicating that its generated images are less similar to real images than those of the other methods. AttributeGAN shows improvement over CycleGAN with a lower FID score, suggesting better realism of its generated images. The improved quality of the images generated by NASDM is primarily attributed to the incorporation of semantic label guidance into the diffusion process, effectively utilizing the information in the semantic masks to direct the diffusion. Furthermore, our method achieves superior sample quality, surpassing the SDM-based methods by 2.92 in FID. However, our method achieves a lower LPIPS than NASDM, mainly due to the introduction of RI guidance, which provides more constraints on morphology and layout in the diffusion process and consequently reduces diversity.

3) EXPERT ASSESSMENT
The conventional norm for assessing the visual quality of generated medical images relies on evaluations by pathology experts [8], [22]. To assess the histological plausibility of our generated images (i.e., the realism of lesion appearance and their applicability in classification), we conducted a survey with five pathologists who have an average of five years of work experience. Their task involved evaluating 100 randomly selected image patches from both the original DCI dataset and the synthetic dataset, with a dual focus on identifying synthetic images and detecting the presence of cancerous tissues. Moreover, the five experts rated the images on a 5-point Likert scale (very poor, poor, medium, good, very good) to quantify the similarity confidence between real and synthetic images. The findings are summarized in Table 4. In the cancer classification task, the experts demonstrate comparable and balanced performance on real images (average accuracy of 74.6%) and synthetic images (average accuracy of 75.8%). An interesting observation is that the experts generally achieve slightly higher accuracy on synthetic images than on real images, but the difference is not significant.
These results indicate that there is no strong bias towards real or synthetic images and that there is high visual similarity between the two. For the task of discerning image authenticity, the pathologists relied on various microscopic and histological features. These included abnormalities in cellular nuclei and intercellular structures, overly uniform morphology of cancer cells, exaggerated tissue structures, unnatural color and texture distribution, and a lack of complexity in surrounding details, particularly evident in tumor regions. As shown in Table 4, the authenticity-classification accuracy of the experts ranges from 62.5% to 71.5%, with an average accuracy of 67.3%, indicating a certain degree of difficulty in distinguishing between real and synthetic images. Confidence scores vary from 2.4 to 3.1, averaging 2.64 on the 5-point Likert scale. This implies that while accuracy rates were better than random guessing, the experts exhibited relatively low confidence in their judgments. These findings underscore the challenges experienced experts face in differentiating between real and synthetic medical images. The moderate accuracy rates, coupled with the less robust confidence levels, can be attributed to the high quality of the synthetic imaging technology and its close resemblance to real images, further accentuating the complexity of discerning real from synthetic images in medical image analysis.

FIGURE 5. Comparison of sampling results generated from the dual semantic condition at different magnifications. Samples start from random Gaussian noise and are denoised iteratively until a high-quality output is produced. "Denoising 50%" represents an intermediate state in the generation process, followed by a progressive enhancement of details and a gradual reduction of noise. Yellow is malignant tissue, green is benign tissue, and blue is background.

TABLE 4. Results of the expert assessment.

4) SAMPLING ANALYSIS
In this experiment, we train two models on images with ROI sizes of 1150 × 1150 and 3450 × 3450, respectively, and compare their sampling results. Some examples are shown in Figure 5. We observe that both models can generate high-quality results. Specifically, the generated images on the right display more detail and have a better visual effect, but with lower semantic relevance. In contrast, the samples on the left may not exhibit as fine detail as those on the right, but they are smoother and more coherent overall; they also capture and reproduce the overall structure and main features more effectively. The results reveal the diffusion model's limitations in handling higher resolutions, such as potential distortion of details and increased noise. Furthermore, although color normalization has been applied, achieving uniform color across the various regions of the same image remains challenging, and this tends to be more noticeable in high-resolution images.

C. SEGMENTATION RESULTS
1) ABLATION STUDY
We conduct an ablation study to validate the effectiveness of infusing DSDM's diffusion features into the segmentation model via CAA, using the end-to-end supervised training results of the diffusion U-Net as our baseline. We use Dice scores and mIoU to assess performance on the segmentation task. As shown in Table 5, the segmentation results on DCI images are significantly improved by infusing the diffusion features.
Specifically, the skip-connection method improves the Dice score by approximately 14.42% and mIoU by about 21.88% over the baseline denoising U-Net. CAA further enhances the Dice and mIoU scores by 17.47% and 26.90%, respectively. Moreover, comparing CAA to the skip-connection method, we observe an improvement of around 2.68% in Dice and 4.12% in mIoU. This suggests that CAA's integration notably bolsters performance, likely owing to its ability to learn the interplay between noise and semantic features and to effectively leverage the context and correlations across the entire image.

TABLE 5. Ablation study for different configurations of our model.

FIGURE 6. Dice coefficient and mIoU results under different ratios of synthetic DCI images.

2) SYNTHESIS DATASET
We designed a set of experiments to evaluate the role of synthetic data in the segmentation task. In all experiments, DSDM is used to generate images based on the DCI training dataset. Guided by expert insights, we selected a dataset of 2,000 high-quality synthetic samples. We randomly sample a certain proportion of the generated synthetic images and merge them into the original training dataset to fine-tune the segmentation model. Note that all experiments are evaluated on real data. As shown in Figure 6, both Dice and mIoU scores increase significantly when a portion of synthetic data is added to the real data. The highest Dice score (0.8367) and mIoU score (0.7202) are achieved with 50% synthetic data, suggesting an optimal balance between real and synthetic data at this proportion. However, at 75% and 100% synthetic data, the standard deviations increase again, which might indicate variability in the quality or relevance of the synthetic data at higher proportions. The gains likely stem from the increased diversity of the dataset without the introduction of too much noise; as the amount of synthetic data grows, however, the model picks up irrelevant features arising from over-reliance on synthetic data. Synthetic images can therefore be used effectively as a data augmentation method to improve segmentation results.

3) COMPARISON WITH OTHER METHODS
Table 6 presents a comprehensive comparison of our method against existing approaches across multiple metrics: Dice, mIoU, precision, recall, and accuracy. Compared to non-diffusion-based methods such as U-Net, PSP, and nnU-Net, both MedSegDiff and our method demonstrate substantial and comprehensive improvements. In particular, our method, which incorporates diffusion features into the segmentation model, achieves state-of-the-art results across all metrics. Compared to the second-best result, MedSegDiff, our method improves Dice by 3.62% and mIoU by 5.25%. This significant enhancement suggests that our model is more adept at accurately identifying cancerous regions in DCI images, which could be attributed to the advanced techniques in diffusion models and the more effective integration of diffusion features. Figure 7 shows qualitative results for examples from the DCI dataset, with the predicted semantic maps produced by all compared methods. From left to right, the columns represent the original images, the ground truth (GT), and the segmentation results of the various methods.
In Figure 7, we observe that U-Net manifests notable deficiencies, particularly in segmenting intricate regions, with numerous omissions indicating missed segmentation opportunities. For the five sub-images of U-Net, the malignant proportions (the bright yellow areas) are 97.9%, 89.2%, 75.1%, and 49.4%, compared to the GT's 76.2%, 82.5%, 55.6%, 5.9%, and 59.9%. This suggests notable deficiencies of U-Net in segmenting intricate regions, with a tendency to over-mark malignant areas. In contrast, the PSP method, while showing marginal improvements, still leaves considerable areas unsegmented, especially the malignant areas of complex images. For instance, in the fourth and fifth sub-images of PSP, the cyan proportions are 20.9% and 74.8% against the GT's 45.7% and 30.0%, respectively, while its dark purple proportions are 56.1% and 9.2% compared to the GT's 48.3% and 10.0%. nnU-Net shows a significant enhancement: it successfully identifies the malignant region of the first image, which the first two methods miss. The label-efficient DM's performance appears to regress relative to nnU-Net, especially in boundary delineation, as shown in the red boxes of the figure. The MedSegDiff method excels in boundary definition and segmentation coherence, as evidenced by the color uniformity indicating high fidelity to the ground truth. Our method demonstrates an even closer alignment with the GT in terms of cyan and dark purple proportions than the other methods; it particularly stands out in preserving boundaries and detail, suggesting the advantage of integrating diffusion features.

TABLE 6. The performance of different medical image segmentation methods on the DCI dataset.

FIGURE 7. Visual comparison of segmentation results of different methods on the DCI dataset. Yellow indicates malignant, cyan indicates benign, and dark purple indicates background.

D. LIMITATIONS
We unify DCI image generation and segmentation based on diffusion models, achieving high-quality synthetic images and accurate segmentation results. However, some limitations remain.

1) LIMITED DETAIL
Our approach generates images at a resolution of 256 × 256, which may limit the amount of displayed detail. To address this, we can consider a multi-scale generative strategy that first generates a rough image outline and style at low resolution and then gradually refines the level of detail.

2) CONSISTENCY
Our semantic masks are not at a cellular level, so sampled results may not align closely, as visualized in Section V-B. Our future work will therefore focus on refining the semantic masks and optimizing the network structure to enhance semantic consistency.

3) SAMPLING CONTROLLABILITY AND STABILITY
The use of two semantic-guided conditions introduces some redundancy in semantic information and additional noise, adding complexity to the model and compromising the controllability and stability of the results. Finally, compared with standard end-to-end image segmentation methods, we need extra time to extract and integrate the diffusion features to obtain satisfactory segmentation performance. We therefore need to explore more efficient sampling methods for diffusion models to balance sampling time and accuracy.

VI. CONCLUSION
In this paper, we propose DiffusionDCI, a two-stage framework for DCI image generation and segmentation.
In the first stage, the generative model encompasses two types of semantic guidance: a reference image and a semantic mask. This dual guidance effectively controls content, structure, and semantic information, leading to the generation of high-quality and semantically relevant DCI images. The second stage focuses on further enhancing image segmentation precision and efficiency by reusing the semantic diffusion features. Experiments on the DCI dataset demonstrate that our method notably improves upon previous techniques in generating high-fidelity, diverse DCI images. The segmentation model accurately identifies and segments cancerous areas, providing crucial support for surgical procedures and improving patient outcomes. In future work, we aim to enhance the resolution and detail of DCI image generation by employing a multi-scale generative approach, refine the semantic masks for improved consistency at the cellular level, and introduce the latent diffusion model for better control and stability in sampling.

REFERENCES
[1] O. Thouvenin, J. Scholler, D. Mandache, M. C. Mathieu, A. B. Lakhdar, M. Darche, T. Monfort, C. Boccara, J.-C. Olivo-Marin, K. Grieve, V. M. Yedid, and E. B. a la Guillaume, "Automatic diagnosis and biopsy classification with dynamic full-field OCT and machine learning," Res. Square, vol. 10, no. 3, 2021, Art. no. 034504.
[2] D. R. McCready, "Keeping abreast of marginal controversies," Ann. Surgical Oncol., vol. 11, no. 10, pp. 885-887, Oct. 2004.
[3] C. Apelian, F. Harms, O. Thouvenin, and A. C. Boccara, "Dynamic full field optical coherence tomography: Subcellular metabolic contrast revealed in tissues by interferometric signals temporal analysis," Biomed. Opt. Express, vol. 7, no. 4, pp. 1511-1524, 2016.
[4] H. Yang, S. Zhang, P. Liu, L. Cheng, F. Tong, H. Liu, S. Wang, M. Liu, C. Wang, Y. Peng, F. Xie, B. Zhou, Y. Cao, J. Guo, Y. Zhang, Y. Ma, D. Shen, P. Xi, and S. Wang, "Use of high-resolution full-field optical coherence tomography and dynamic cell imaging for rapid intraoperative diagnosis during breast cancer surgery," Cancer, vol. 126, no. S16, pp. 3847-3856, Aug. 2020.
[5] Z. Chen, J. Qing, T. Xiang, W. L. Yue, and J. H. Zhou, "Seeing beyond the brain: Conditional diffusion model with sparse masked modeling for vision decoding," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 22710-22720.
[6] Z. Dorjsembe, S. Odonchimed, and F. Xiao, "Three-dimensional medical image synthesis with denoising diffusion probabilistic models," in Proc. Med. Imag. Deep Learn., Switzerland, Sep. 2022, pp. 18-22.
[7] Q. Li, C. Li, C. Yan, X. Li, H. Li, T. Zhang, H. Song, R. Schaffert, W. Yu, Y. Fan, J. Ye, and H. Chen, "Ultra-low dose CT image denoising based on conditional denoising diffusion probabilistic model," in Proc. Int. Conf. Cyber-Enabled Distrib. Comput. Knowl. Discovery (CyberC), Oct. 2022, pp. 198-205.
[8] D. Stojanovski, U. Hermida, P. Lamata, A. Beqiri, and A. Gomez, "Echo from noise: Synthetic ultrasound image generation using diffusion models for real image segmentation," 2023, arXiv:2305.05424.
[9] F. Tang, J. Ding, L. Wang, M. Xian, and C. Ning, "Multi-level global context cross consistency model for semi-supervised ultrasound image segmentation with diffusion model," 2023, arXiv:2305.09447.
Bashashati, ‘‘A morphology focused diffusion proba- bilistic model for synthesis of histopathology images,’’ in Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis. (WACV), Jan. 2023, pp. 1999–2008. [11] T. Park, M. Y. Liu, T. C. Wang, and J. Y. Zhu, ‘‘Semantic image synthesis with spatially-adaptive normalization,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 2332–2341. [12] T. Amit, T. Shaharbany, E. Nachmani, and L. Wolf, ‘‘SegDiff: Image seg- mentation with diffusion probabilistic models,’’ 2021, arXiv:2112.00390. [13] A. Alshenoudy, B. Sabrowsky-Hirsch, S. Thumfart, M. Giretzlehner, and E. Kobler, ‘‘Semi-supervised brain tumor segmentation using diffusion models,’’ in Proc. IFIP Adv. Inf. Commun. Technol., 2023, pp. 314–325. [14] X. Guo, Y. Yang, C. Ye, S. Lu, B. Peng, H. Huang, Y. Xiang, and T. Ma, ‘‘Accelerating diffusion models via pre-segmentation diffusion sampling for medical image segmentation,’’ inProc. Int. Symp. Biomed. Imag., 2023, pp. 1–5. [15] B. Kim, Y. Oh, and J. C. Ye, ‘‘Diffusion adversarial representation learn- ing for self-supervised vessel segmentation,’’ 2023, arXiv:2209.14566. [Online]. Available: https://arxiv.org/abs/2209.14566 [16] J. Wolleb, R. Sandk, and P. C. Cattin, ‘‘Diffusion models for implicit image segmentation ensembles,’’ in Proc. Int. Conf. Med. Imag. Deep Learn., 2022, pp. 1336–1348. [17] J. Wu, R. Fu, H. Fang, Y. Zhang, Y. Yang, H. Xiong, H. Liu, and Y. Xu, ‘‘MedSegDiff: Medical image segmentation with diffusion probabilistic model,’’ 2022, arXiv:2211.00611. [18] E. B. Asiedu, S. Kornblith, T. Chen, N. Parmar, M. Minderer, and M. Norouzi, ‘‘Decoder denoising pretraining for semantic segmentation,’’ 2022, arXiv:2205.11423. [19] E. A. Brempong, S. Kornblith, T. Chen, N. Parmar, M. Minderer, and M. Norouzi, ‘‘Denoising pretraining for semantic segmentation,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2022, pp. 4174–4185. [20] C. Li, K. Xu, J. Zhu, and B. Zhang, ‘‘Triple generative adversarial nets,’’ in Proc. NIPS, 2017, pp. 4088–4098. [21] M. J. M. Chuquicusma, S. Hussein, J. Burt, and U. Bagci, ‘‘How to fool radiologists with generative adversarial networks? A visual Turing test for lung cancer diagnosis,’’ in Proc. IEEE 15th Int. Symp. Biomed. Imag. (ISBI), Apr. 2018, pp. 240–244. [22] C. Baur, S. Albarqouni, and N. Navab, ‘‘MelanoGANs: High resolution skin lesion synthesis with GANs,’’ 2018, arXiv:1804.04338. [23] F. Calimeri, A. Marzullo, C. Stamile, and G. Terracina, ‘‘Biomedical data augmentation using generative adversarial neural networks,’’ in Proc. Int. Conf. Artif. Neural Netw.Cham, Switzerland: Springer, 2017, pp. 626–634. [24] M. Mirza and S. Osindero, ‘‘Conditional generative adversarial nets,’’ 2014, arXiv:1411.1784. [25] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro, ‘‘High-resolution image synthesis and semantic manipulation with condi- tional GANs,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 8798–8807. [26] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, ‘‘Image-to-image translation with conditional adversarial networks,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 5967–5976. [27] Y. Wang, L. Qi, Y.-C. Chen, X. Zhang, and J. Jia, ‘‘Image synthesis via semantic composition,’’ in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 13729–13738. [28] T. Zhang, J. Cheng, H. Fu, Z. Gu, Y. Xiao, K. Zhou, S. Gao, R. Zheng, and J. 
Liu, ‘‘Noise adaptation generative adversarial network for medical image analysis,’’ IEEE Trans. Med. Imag., vol. 39, no. 4, pp. 1149–1159, Apr. 2020. [29] P. Dhariwal and A. Nichol, ‘‘Diffusion models beat GANs on image synthesis,’’ in Proc. Adv. Neural Inf. Process. Syst., vol. 34, 2021, pp. 87–8780. [30] P. Esser, R. Rombach, A. Blattmann, and B. Ommer, ‘‘ImageBART: Bidirectional context with multinomial diffusion for autoregressive image synthesis,’’ in Proc. Adv. Neural Inf. Process. Syst., vol. 34, 2021, pp. 3518–3532. [31] J.Wolleb, F. Bieder, R. Sandkühler, and P. C. Cattin, ‘‘Diffusion models for medical anomaly detection,’’ inProc.Med. Image Comput. Comput. Assist. Intervent. (MICCAI). Cham, Switzerland: Springer, 2022, pp. 35–45. [32] W.Wang, J. Bao,W. Zhou, D. Chen, D. Chen, L. Yuan, and H. Li, ‘‘Seman- tic image synthesis via diffusion models,’’ 2022, arXiv:2207.00050. [33] A. Shrivastava and P. T. Fletcher, ‘‘NASDM: Nuclei-aware semantic histopathology image generation using diffusion models,’’ in Medical Image Computing and Computer Assisted Intervention—MICCAI 2023, vol. 14225. Cham, Switzerland: Springer, 2023, pp. 786–796. [34] A. Kazerouni, E. K. Aghdam, M. Heidari, R. Azad, M. Fayyaz, I. Hacihaliloglu, and D. Merhof, ‘‘Diffusion models in medical imaging: A comprehensive survey,’’ Med. Image Anal., vol. 88, Aug. 2023, Art. no. 102846. [35] J. Wu, W. Ji, H. Fu, M. Xu, Y. Jin, and Y. Xu, ‘‘MedSegDiff-v2: Diffusion based medical image segmentation with transformer,’’ 2023, arXiv:2301.11798. [36] K. Pnvr, B. Singh, P. Ghosh, B. Siddiquie, and D. Jacobs, ‘‘LD-ZNet: A latent diffusion approach for text-based image segmentation,’’ 2023, arXiv:2303.12343. [37] A. Nichol and P. Dhariwal, ‘‘Improved denoising diffusion probabilistic models,’’ in Proc. Mach. Learn. Res., vol. 139, 2021, pp. 8162–8171. [38] J. Ho, A. Jain, and P. Abbeel, ‘‘Denoising diffusion probabilistic models,’’ in Proc. Adv. Neural Inf. Process. Syst., Dec. 2020, pp. 1–12. [39] D. Baranchuk, I. Rubachev, A. Voynov, V. Khrulkov, and A. Babenko, ‘‘Label-efficient semantic segmentation with diffusion models,’’ 2021, arXiv:2112.03126. [40] K. He, X. Zhang, S. Ren, and J. Sun, ‘‘Deep residual learning for image recognition,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778. VOLUME 12, 2024 37713 B. Yang et al.: DiffusionDCI: A Novel Diffusion-Based Unified Framework [41] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, ‘‘Attention is all you need,’’ in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 5998–6008. [42] P. Ramachandran, B. Zoph, and Q. V. Le, ‘‘Swish: A self-gated activation function,’’ inProc. 6th Int. Conf. Learn. Represent. (ICLR), 2017, pp. 1–12. [43] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, ‘‘A simple framework for contrastive learning of visual representations,’’ in Proc. 37th Int. Conf. Mach. Learn. (ICML), 2020, pp. 1–12. [44] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, ‘‘High-resolution image synthesis with latent diffusion models,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 10674–10685. [45] A. Q. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. McGrew, I. Sutskever, and M. Chen, ‘‘GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models,’’ in Proc. Mach. Learn. Res., vol. 162, 2022, pp. 16784–16804. [46] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. Denton, S. K. S. Ghasemipour, B. K. Ayan, S. S. 
[46] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. Denton, S. K. S. Ghasemipour, B. K. Ayan, S. S. Mahdavi, R. G. Lopes, T. Salimans, J. Ho, D. J. Fleet, and M. Norouzi, "Photorealistic text-to-image diffusion models with deep language understanding," in Proc. Adv. Neural Inf. Process. Syst., vol. 35, 2022, pp. 36479-36494.
[47] X. Liu, D. H. Park, S. Azadi, G. Zhang, A. Chopikyan, Y. Hu, H. Shi, A. Rohrbach, and T. Darrell, "More control for free! Image synthesis with semantic diffusion guidance," in Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis. (WACV), Jan. 2023, pp. 289-299.
[48] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. Int. Conf. Learn. Represent., 2015, pp. 1-15.
[49] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, "GANs trained by a two time-scale update rule converge to a local Nash equilibrium," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 1-12.
[50] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, "The unreasonable effectiveness of deep features as a perceptual metric," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 586-595.
[51] A. Gilbert, M. Marciniak, C. Rodero, P. Lamata, E. Samset, and K. McLeod, "Generating synthetic labeled data from existing anatomical models: An example with echocardiography segmentation," IEEE Trans. Med. Imag., vol. 40, no. 10, pp. 2783-2794, Oct. 2021, doi: 10.1109/TMI.2021.3051806.
[52] J. Ye, Y. Xue, P. Liu, R. Zaino, K. C. Cheng, and X. Huang, "A multi-attribute controllable generative model for histopathology image synthesis," in Medical Image Computing and Computer Assisted Intervention—MICCAI 2021, vol. 12908. Strasbourg, France, 2021, pp. 613-623.
[53] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015: 18th International Conference, Munich, Germany, vol. 9351, 2015, pp. 234-241.
[54] R. J. Song, F. Zhang, and K. H. Park, "Semantic segmentation based on improved pyramid scene parsing network," J. Netw. Intell., vol. 6, no. 4, pp. 797-806, 2021.
[55] F. Isensee, J. Petersen, A. Klein, and D. Zimmerer, "nnU-Net: Self-adapting framework for U-Net-based medical image segmentation," in Bildverarbeitung für die Medizin 2019. Wiesbaden, Germany: Springer, 2019, p. 22.

BIN YANG received the M.S. degree in software engineering from Beijing University of Technology, Beijing, China, in 2020, where he is currently pursuing the Ph.D. degree in software engineering. His research interests include deep learning, computer vision, and medical image analysis.

JIANQIANG LI (Senior Member, IEEE) received the B.S. degree from Beijing Institute of Technology, in 1996, and the Ph.D. degree from Tsinghua University, in 2004. He was a Researcher with the National University of Ireland, Galway, from 2004 to 2005. He was with NEC Laboratories China, as a Researcher, from 2005 to 2013. He joined Beijing University of Technology, in 2013, as a Beijing Distinguished Professor. His research interests include Petri nets, data mining, information retrieval, and big data.

JINGYI WANG received the B.S. degree in computer science and technology from Shandong University of Finance and Economics, Jinan, China, in 2021. She is currently pursuing the M.S. degree in software engineering with Beijing University of Technology. Her research interests include deep learning, weakly supervised learning, and medical image analysis.
RUIQI LI received the B.S. degree in computer science and technology from Hebei University, Baoding, China, in 2021. He is currently pursuing the M.S. degree in software engineering with Beijing University of Technology. His research interests include deep learning, semi-supervised learning, and medical image analysis.

KE GU (Senior Member, IEEE) received the B.S. and Ph.D. degrees from Shanghai Jiao Tong University, Shanghai, China, in 2009 and 2015, respectively. He is currently a Professor with Beijing University of Technology, Beijing, China. His research interests include industrial vision, environmental perception, image processing, and machine learning.

BO LIU (Senior Member, IEEE) received the B.S. degree from the Department of Automation, Beijing Institute of Technology, Beijing, China, and the M.S. and Ph.D. degrees from the Department of Automation, System Integration Institute, Tsinghua University, Beijing, in 2003 and 2008, respectively. After graduation, she was with NEC Laboratories China, The University of Chicago, Argonne National Laboratory, Beijing University of Technology, and Massey University. Her current research interests include big data, data mining, machine learning, cloud computing, scientific workflow, semantic web, and ontology reasoning.