Analyzing Adversarial Robustness through the Geometry of Neural Representations
Date:
Abstract
Adversarial training is a common technique for making neural networks more robust to small, intentionally crafted input perturbations. While robustness is often measured by how much accuracy drops on perturbed inputs, this may not fully reflect how the network processes data internally. In this work, we examine whether accuracy is a reliable measure of robustness by comparing it to the network's internal geometry when clean and perturbed data are passed through.
We trained neural networks against different types of adversarial attacks, including the Fast Gradient Sign Method (FGSM) and L2- and L∞-bounded attacks, and evaluated them across a range of perturbation strengths. To investigate internal changes, we used M-PHATE, a data visualization tool, to visualize penultimate-layer embeddings, and we measured geometric shifts using maximum mean discrepancy (MMD), k-nearest-neighbor recovery, and Euclidean distances. Across these metrics, we found that some models with the smallest drops in accuracy still showed large shifts in geometry. This suggests that accuracy alone may hide meaningful changes in how the network represents clean and perturbed data.
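To make the geometric comparison concrete, the following is a minimal sketch of how a maximum mean discrepancy between clean and perturbed embeddings could be estimated. It assumes an RBF kernel with a fixed bandwidth (gamma=1.0) and uses random arrays as stand-ins for the actual penultimate-layer embeddings; it is an illustration of the metric, not our experimental pipeline.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    # Pairwise RBF kernel values between rows of a and rows of b.
    sq_dists = (np.sum(a**2, axis=1)[:, None]
                + np.sum(b**2, axis=1)[None, :]
                - 2.0 * a @ b.T)
    return np.exp(-gamma * sq_dists)

def mmd2(x, y, gamma=1.0):
    # Biased estimate of squared MMD between samples x and y:
    # mean k(x, x') + mean k(y, y') - 2 * mean k(x, y).
    return (rbf_kernel(x, x, gamma).mean()
            + rbf_kernel(y, y, gamma).mean()
            - 2.0 * rbf_kernel(x, y, gamma).mean())

# Random stand-ins for clean vs. perturbed embeddings (hypothetical data).
rng = np.random.default_rng(0)
clean = rng.normal(0.0, 1.0, size=(200, 8))
perturbed = rng.normal(0.5, 1.0, size=(200, 8))

print(mmd2(clean, perturbed))
```

A near-zero value indicates the two embedding clouds are geometrically similar; larger values indicate a shift, even when accuracy on the perturbed inputs barely changes.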
Our results show that internal geometry can reveal vulnerabilities that are not captured by traditional accuracy metrics, and that combining both views gives a more complete picture of what it means for a network to be robust.
