Empirical Investigations of Synthetic Data - Privacy, Utility, and Fidelity

πŸ“Œ Project Overview

Faculty Sponsors: Dr. Xiaojun (CSU East Bay) & Dr. Xunfei Jiang (CSU Northridge)
Institution: California State University, Northridge
Dates of Research: April 2025 - Present

This research investigates the privacy and utility tradeoffs of synthetic data, focusing on how generative models preserve or expose sensitive information. It addresses the critical question of whether synthetic datasets can be both useful for downstream ML tasks and provably private. An upcoming extension of this work explores the fidelity of synthetic data using information-theoretic criteria, aiming to quantify how well synthetic data captures the true structure of the original data.


🎯 Objectives

  • Quantitatively evaluate privacy leakage in synthetic tabular data using membership inference attacks.
  • Assess downstream ML utility of synthetic datasets generated by statistical and deep learning-based models.
  • Develop and extend an information-theoretic framework for fidelity evaluation of synthetic data.

🧠 Big Research Questions

  • Can synthetic tabular data protect individual privacy while maintaining downstream task performance?
  • How can privacy risk be measured in a model-agnostic and interpretable way?
  • What does high-fidelity synthetic data look like, and how can its resemblance to real data be evaluated without overfitting?

πŸ› οΈ Methods & Tools

  • Data Sources: UCI Adult Census and Bank Marketing datasets
  • Algorithms / Models: CTGAN, TVAE, Gaussian Copula, Logistic Regression, Random Forest
  • Software/Environments: Python, scikit-learn, SDV, NumPy, Matplotlib, Jupyter
  • Model Evaluation/Analysis:
    • Privacy: ROC AUC from average-case and worst-case Membership Inference Attacks (MIAs)
    • Utility: Accuracy, F1 Score, Precision, Recall, and AUC of classifiers trained on synthetic data
    • Fidelity (ongoing): Mutual Information, KL Divergence, and statistical dependency measures
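The utility evaluation above follows the common train-on-synthetic, test-on-real pattern: a classifier is fit on the synthetic table and scored against held-out real records. The sketch below illustrates that pattern with scikit-learn; the random stand-in data, model choice, and split sizes are illustrative placeholders, not the study's actual UCI Adult / Bank Marketing pipeline.

```python
# Minimal train-on-synthetic, test-on-real (TSTR) utility sketch.
# make_classification stands in for a real dataset; in the actual study the
# "training" side would be a synthetic table from CTGAN, TVAE, or a copula.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# Treat one half as the (stand-in) synthetic training set, the other half
# as the real holdout used for scoring.
X_syn, X_holdout, y_syn, y_holdout = train_test_split(
    X, y, test_size=0.5, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_syn, y_syn)

pred = clf.predict(X_holdout)
proba = clf.predict_proba(X_holdout)[:, 1]
print(f"accuracy={accuracy_score(y_holdout, pred):.3f} "
      f"f1={f1_score(y_holdout, pred):.3f} "
      f"auc={roc_auc_score(y_holdout, proba):.3f}")
```

Comparing these scores against a classifier trained on the real data gives the utility gap attributed to the generative model.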

πŸ“¦ Deliverables

  • First-author paper submitted to IEEE FMLDS 2025
  • Fidelity evaluation extension in progress through independent study (targeting Fall 2025 submission)

πŸ“ˆ Outcomes

  • Identified that CTGAN achieves strong downstream utility while exposing slightly higher privacy risk than TVAE or Gaussian Copula.
  • Demonstrated that naive distance-based attacks perform poorly, while worst-case attacks reveal more leakage.
  • Found that Gaussian Copula models often retain class proportions but fail to support predictive structure on certain tasks.
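A distance-based membership inference attack of the kind discussed above scores each candidate record by its proximity to the synthetic sample: training records of a leaky generator tend to sit closer to the synthetic points than unseen records do. The toy Gaussians, noise scale, and nearest-neighbor scoring below are illustrative assumptions, not the study's actual attack configuration.

```python
# Sketch of a distance-based membership inference attack (MIA).
# Score = negative nearest-neighbor distance to the synthetic data;
# higher score means "more likely a training member".
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
members = rng.normal(size=(500, 5))       # stand-in training records
non_members = rng.normal(size=(500, 5))   # stand-in holdout records

# Stand-in "synthetic" data: noisy copies of the members, i.e. a
# deliberately leaky generator so the attack has something to find.
synthetic = members + rng.normal(scale=0.1, size=members.shape)

nn = NearestNeighbors(n_neighbors=1).fit(synthetic)
d_mem, _ = nn.kneighbors(members)
d_non, _ = nn.kneighbors(non_members)

scores = -np.concatenate([d_mem.ravel(), d_non.ravel()])
labels = np.concatenate([np.ones(500), np.zeros(500)])
print(f"attack AUC = {roc_auc_score(labels, scores):.3f}")
```

An AUC near 0.5 indicates the attack cannot distinguish members from non-members; values well above 0.5 quantify leakage, which is why ROC AUC serves as the privacy metric in the evaluation above.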

πŸ” Ongoing / Future Work

  • Ongoing development of an information-theoretic framework to evaluate synthetic data fidelity beyond visual or task-based methods.
  • Expansion to high-dimensional and time-series datasets for broader generalization.
  • Future integration of newer generative architectures (e.g., Diffusion Models, Tabular Transformers) into the utility evaluation pipeline.
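One building block of the fidelity work above is comparing real and synthetic marginal distributions with divergence measures. The sketch below estimates KL divergence between histogram approximations of a single column; the toy Gaussian columns, bin count, and smoothing constant are illustrative assumptions, not the framework's settings.

```python
# Marginal-fidelity sketch: KL divergence between histogram estimates of a
# real column and a synthetic column. Toy Gaussians stand in for dataset
# columns; the ongoing work extends this toward joint structure via
# mutual information.
import numpy as np

rng = np.random.default_rng(0)
real = rng.normal(loc=0.0, scale=1.0, size=5000)    # stand-in real column
synth = rng.normal(loc=0.2, scale=1.1, size=5000)   # slightly-off generator

# Shared bin edges so the two histograms are directly comparable.
edges = np.histogram_bin_edges(np.concatenate([real, synth]), bins=30)
p, _ = np.histogram(real, bins=edges)
q, _ = np.histogram(synth, bins=edges)

# Laplace-smooth to avoid log(0), then normalize to probabilities.
p = (p + 1.0) / (p + 1.0).sum()
q = (q + 1.0) / (q + 1.0).sum()

kl = float(np.sum(p * np.log(p / q)))   # 0 would mean identical marginals
print(f"KL(real || synth) = {kl:.4f}")
```

Marginal divergences alone cannot detect broken dependencies between columns, which motivates the mutual-information side of the framework.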

πŸ“š References

  1. SDV: Synthetic Data Vault
  2. Xu et al. (2019) – Modeling Tabular Data using Conditional GAN
  3. Xu et al. (2018) – Synthesizing Tabular Data using Generative Adversarial Networks

🧠 What I Learned

This project deepened my understanding of synthetic data evaluation from both a privacy and utility perspective. I learned how subtle modeling choices can influence both risk and value, and how visual similarity can be deceptive without rigorous metrics. It also inspired my interest in bridging statistical theory and machine learning through fidelity-based analysis.