Empirical Investigations of Synthetic Data - Privacy, Utility, and Fidelity

πŸ“Œ Project Overview

Faculty Sponsors: Dr. Xiaojun (CSU East Bay) & Dr. Xunfei Jiang (CSU Northridge)
Institution: California State University, Northridge
Dates of Research: April 2025 - Present

This research investigates the privacy and utility tradeoffs of synthetic data, focusing on how generative models preserve or expose sensitive information. It addresses the critical question of whether synthetic datasets can be both useful for downstream ML tasks and provably private. An upcoming extension of this work explores the fidelity of synthetic data using information-theoretic criteria, aiming to quantify how well synthetic data captures the true structure of the original data.


🎯 Objectives

  • Quantitatively evaluate privacy leakage in synthetic tabular data using membership inference attacks.
  • Assess downstream ML utility of synthetic datasets generated by statistical and deep learning-based models.
  • Develop and extend an information-theoretic framework for fidelity evaluation of synthetic data.

🧠 Big Research Questions

  • Can synthetic tabular data protect individual privacy while maintaining downstream task performance?
  • How can privacy risk be measured in a model-agnostic and interpretable way?
  • What does high-fidelity synthetic data look like, and how can its resemblance to real data be evaluated without overfitting?

πŸ› οΈ Methods & Tools

  • Data Sources: UCI Adult Census and Bank Marketing datasets
  • Algorithms / Models: CTGAN, TVAE, Gaussian Copula, Logistic Regression, Random Forest
  • Software/Environments: Python, scikit-learn, SDV, NumPy, Matplotlib, Jupyter
  • Model Evaluation/Analysis:
    • Privacy: ROC AUC from average-case and worst-case Membership Inference Attacks (MIAs)
    • Utility: Accuracy, F1 Score, Precision, Recall, and AUC of classifiers trained on synthetic data
    • Fidelity (ongoing): Mutual Information, KL Divergence, and statistical dependency measures
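The utility evaluation above follows the common train-on-synthetic, test-on-real pattern: a classifier is fit on the synthetic table and scored against held-out real records. The sketch below illustrates that pattern with scikit-learn; the random stand-in data, model choice, and split sizes are illustrative placeholders, not the study's actual UCI Adult / Bank Marketing pipeline.

```python
# Minimal train-on-synthetic, test-on-real (TSTR) utility sketch.
# make_classification stands in for a real dataset; in the actual study the
# "training" side would be a synthetic table from CTGAN, TVAE, or a copula.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# Treat one half as the (stand-in) synthetic training set, the other half
# as the real holdout used for scoring.
X_syn, X_holdout, y_syn, y_holdout = train_test_split(
    X, y, test_size=0.5, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_syn, y_syn)

pred = clf.predict(X_holdout)
proba = clf.predict_proba(X_holdout)[:, 1]
print(f"accuracy={accuracy_score(y_holdout, pred):.3f} "
      f"f1={f1_score(y_holdout, pred):.3f} "
      f"auc={roc_auc_score(y_holdout, proba):.3f}")
```

Comparing these scores against a classifier trained on the real data gives the utility gap attributed to the generative model.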

πŸ“¦ Deliverables

  • First-author paper submitted to IEEE FMLDS 2025
  • Fidelity evaluation extension in progress through independent study (targeting Fall 2025 submission)

πŸ“ˆ Outcomes

  • Identified that CTGAN achieves strong downstream utility while exposing slightly higher privacy risk than TVAE or Gaussian Copula.
  • Demonstrated that naive distance-based attacks perform poorly, while worst-case attacks reveal more leakage.
  • Found that Gaussian Copula models often retain class proportions but fail to support predictive structure on certain tasks.
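A distance-based membership inference attack of the kind discussed above scores each candidate record by its proximity to the synthetic sample: training records of a leaky generator tend to sit closer to the synthetic points than unseen records do. The toy Gaussians, noise scale, and nearest-neighbor scoring below are illustrative assumptions, not the study's actual attack configuration.

```python
# Sketch of a distance-based membership inference attack (MIA).
# Score = negative nearest-neighbor distance to the synthetic data;
# higher score means "more likely a training member".
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
members = rng.normal(size=(500, 5))       # stand-in training records
non_members = rng.normal(size=(500, 5))   # stand-in holdout records

# Stand-in "synthetic" data: noisy copies of the members, i.e. a
# deliberately leaky generator so the attack has something to find.
synthetic = members + rng.normal(scale=0.1, size=members.shape)

nn = NearestNeighbors(n_neighbors=1).fit(synthetic)
d_mem, _ = nn.kneighbors(members)
d_non, _ = nn.kneighbors(non_members)

scores = -np.concatenate([d_mem.ravel(), d_non.ravel()])
labels = np.concatenate([np.ones(500), np.zeros(500)])
print(f"attack AUC = {roc_auc_score(labels, scores):.3f}")
```

An AUC near 0.5 indicates the attack cannot distinguish members from non-members; values well above 0.5 quantify leakage, which is why ROC AUC serves as the privacy metric in the evaluation above.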

πŸ” Ongoing / Future Work

  • Ongoing development of an information-theoretic framework to evaluate synthetic data fidelity beyond visual or task-based methods.
  • Expansion to high-dimensional and time-series datasets for broader generalization.
  • Future integration of newer generative architectures (e.g., Diffusion Models, Tabular Transformers) into the utility evaluation pipeline.
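One building block of the fidelity work above is comparing real and synthetic marginal distributions with divergence measures. The sketch below estimates KL divergence between histogram approximations of a single column; the toy Gaussian columns, bin count, and smoothing constant are illustrative assumptions, not the framework's settings.

```python
# Marginal-fidelity sketch: KL divergence between histogram estimates of a
# real column and a synthetic column. Toy Gaussians stand in for dataset
# columns; the ongoing work extends this toward joint structure via
# mutual information.
import numpy as np

rng = np.random.default_rng(0)
real = rng.normal(loc=0.0, scale=1.0, size=5000)    # stand-in real column
synth = rng.normal(loc=0.2, scale=1.1, size=5000)   # slightly-off generator

# Shared bin edges so the two histograms are directly comparable.
edges = np.histogram_bin_edges(np.concatenate([real, synth]), bins=30)
p, _ = np.histogram(real, bins=edges)
q, _ = np.histogram(synth, bins=edges)

# Laplace-smooth to avoid log(0), then normalize to probabilities.
p = (p + 1.0) / (p + 1.0).sum()
q = (q + 1.0) / (q + 1.0).sum()

kl = float(np.sum(p * np.log(p / q)))   # 0 would mean identical marginals
print(f"KL(real || synth) = {kl:.4f}")
```

Marginal divergences alone cannot detect broken dependencies between columns, which motivates the mutual-information side of the framework.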

πŸ“š References

  1. SDV: Synthetic Data Vault
  2. Xu et al. (2019) – Modeling Tabular Data using Conditional GAN
  3. Xu et al. (2018) – Synthesizing Tabular Data using Generative Adversarial Networks

🧠 What I Learned

This project deepened my understanding of synthetic data evaluation from both a privacy and utility perspective. I learned how subtle modeling choices can influence both risk and value, and how visual similarity can be deceptive without rigorous metrics. It also inspired my interest in bridging statistical theory and machine learning through fidelity-based analysis.