Evaluating Privacy and Utility of Synthetic Tabular Data with Membership Inference Attacks
Submitted to 2025 IEEE International Conference on Future Machine Learning and Data Science (FMLDS), 2025.
Abstract
Synthetic data is widely used as a privacy-preserving alternative to real data in machine learning, especially in domains with strict data protection rules such as healthcare and finance. Synthetic tabular data is among its most common forms, particularly for the structured datasets used in decision-making systems. However, the extent to which synthetic tabular data protects individual privacy while preserving utility remains unclear. In this paper, we introduce an evaluation framework for assessing the privacy risks and utility trade-offs of different synthetic data generators. The framework assesses privacy risk by combining distance-based and model-based Membership Inference Attacks (MIAs), and it evaluates utility by training classifiers on synthetic data and testing them on real holdout sets. We evaluate the framework with three commonly used synthetic data generators (CTGAN, TVAE, and GaussianCopula) on the UCI Adult and Bank Marketing datasets. CTGAN achieves the highest utility among the evaluated generators but also exposes the most significant privacy vulnerabilities. Model-based membership inference attacks yield AUC scores averaging 0.52 and reaching as high as 0.56 in the most vulnerable cases. TVAE provides a balanced trade-off between privacy and utility, while GaussianCopula offers the lowest privacy risk. These findings emphasize the importance of task-specific, context-aware evaluation when selecting synthetic data generators for privacy-sensitive applications.
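To make the distance-based attack mentioned above concrete, the following is a minimal sketch, not the implementation used in this paper. It assumes that all features are already numerically encoded and scaled, that Euclidean distance to the nearest synthetic record serves as the membership score, and that the attacker's success is summarized by AUC. The function names (`nearest_synthetic_distance`, `auc_from_scores`, `distance_mia_auc`) are hypothetical and chosen only for illustration.

```python
import numpy as np


def nearest_synthetic_distance(candidates, synthetic):
    """Euclidean distance from each candidate record to its closest synthetic record."""
    # Pairwise differences, shape (n_candidates, n_synthetic, n_features).
    diffs = candidates[:, None, :] - synthetic[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    return dists.min(axis=1)


def auc_from_scores(scores, labels):
    """AUC as the probability that a member outscores a non-member (ties count 0.5)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos = scores[labels]    # scores of true members
    neg = scores[~labels]   # scores of non-members
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))


def distance_mia_auc(members, non_members, synthetic):
    """Score each record by closeness to the synthetic data and report attack AUC.

    Assumption: a record lying closer to some synthetic record is guessed
    to be more likely a member of the generator's training set.
    """
    candidates = np.vstack([members, non_members])
    labels = np.concatenate([np.ones(len(members)), np.zeros(len(non_members))])
    scores = -nearest_synthetic_distance(candidates, synthetic)  # closer => higher score
    return auc_from_scores(scores, labels)


# Toy example: a "generator" that copies its training records verbatim.
members = np.array([[0.0, 0.0], [1.0, 1.0]])
non_members = np.array([[5.0, 5.0], [6.0, 6.0]])
synthetic = members.copy()
toy_auc = distance_mia_auc(members, non_members, synthetic)  # perfect separation here: 1.0
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect identification of training records, which is the scale on which the reported AUC values should be read. The toy generator above memorizes its training data, so it sits at the extreme end of that scale.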