Generative models trained with Differential Privacy (DP) are increasingly
used to produce and share synthetic data in a privacy-friendly manner. In this
paper, we set out to analyze the impact of DP on these models vis-à-vis
underrepresented classes and subgroups of data. We do so from two angles: 1)
the size of classes and subgroups in the synthetic data, and 2) classification
accuracy on them. We also evaluate the effect of varying levels of imbalance
and privacy budgets.

Our experiments, conducted using three state-of-the-art DP models (PrivBayes,
DP-WGAN, and PATE-GAN), show that DP can shift the size distributions in the
generated synthetic data in opposite directions. More precisely, it affects the
gap between the majority and minority classes and subgroups, either reducing it
(a “Robin Hood” effect) or widening it (a “Matthew” effect). However, both of
these size shifts lead to similar disparate impacts on a classifier’s accuracy,
disproportionately affecting the underrepresented subparts of the data. As a
result, we call for caution when analyzing or training a model on synthetic
data; otherwise, one risks treating different subpopulations unevenly, which
may also lead to unreliable conclusions.
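The “Robin Hood” versus “Matthew” distinction can be made concrete by comparing the gap between majority and minority class proportions in the real and synthetic data. The sketch below is illustrative only; the class labels and proportions are hypothetical, not the paper's actual experimental values:

```python
from collections import Counter

def size_gap(labels):
    """Gap between the largest and smallest class proportions."""
    counts = Counter(labels)
    total = sum(counts.values())
    props = [c / total for c in counts.values()]
    return max(props) - min(props)

# Hypothetical real data: 80/20 class split, so the gap is 0.6.
real = ["a"] * 80 + ["b"] * 20
# A "Robin Hood" effect shrinks the gap in the synthetic data...
robin_hood = ["a"] * 65 + ["b"] * 35   # gap = 0.3
# ...while a "Matthew" effect widens it.
matthew = ["a"] * 95 + ["b"] * 5       # gap = 0.9

print(size_gap(real), size_gap(robin_hood), size_gap(matthew))
```

Under this toy measure, a DP generative model exhibits a Robin Hood effect when the synthetic gap falls below the real one, and a Matthew effect when it exceeds it.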
