Adversarial Robustness and the Evolution of Latent Geometries in Neural Networks

Abstract

Adversarial training is a common technique for making neural networks more robust to small, intentionally crafted input perturbations. However, the internal structures these networks develop, especially how they represent and organize data, are still not well understood. Exploring these internal representations is crucial, as it helps explain how robust networks generalize differently and how their decision-making boundaries are affected.
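As a concrete illustration of the "small, intentionally crafted input perturbations" mentioned above, the sketch below implements the fast gradient sign method (FGSM), one standard way to generate such perturbations, for a tiny binary logistic-regression model. This is not the attack or model used in the study; all weights, inputs, and names here are illustrative.

```python
import numpy as np

def fgsm_perturb(x, w, b, y, eps):
    """Craft an FGSM adversarial example for binary logistic regression.

    Moves x a distance eps (per coordinate) in the direction of the sign
    of the cross-entropy loss gradient with respect to the input -- a
    small perturbation chosen specifically to increase the loss.
    """
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # predicted P(y = 1)
    grad_x = (p - y) * w                    # d(cross-entropy)/dx
    return x + eps * np.sign(grad_x)

# Illustrative weights and a clean input with true label y = 1
w = np.array([2.0, -1.0])
b = 0.0
x = np.array([0.5, 0.5])
x_adv = fgsm_perturb(x, w, b, y=1.0, eps=0.1)
```

Even though `x_adv` differs from `x` by only 0.1 in each coordinate, the model's confidence in the true class drops, which is exactly the behavior adversarial training is meant to suppress.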

In this study, we use M-PHATE (Multislice PHATE), a visualization method designed for evolving neural-network representations, to examine how the internal geometry of a network changes over the course of training, with particular attention to behavior under adversarial attack. We compare networks trained under standard conditions to those trained with adversarial robustness techniques, analyzing how their learned representational structures differ when processing clean versus perturbed inputs. By embedding network activations into low-dimensional spaces throughout training, M-PHATE reveals clear geometric differences between robust and non-robust models. These visualizations, supported by quantitative analyses, show that adversarial training produces structural changes in how networks organize their internal representations.
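The pipeline described above, recording hidden-unit activations at each training checkpoint and embedding the whole trajectory in a low-dimensional space, can be sketched as follows. For brevity this sketch substitutes a PCA projection (via numpy's SVD) for M-PHATE itself, and the activation data is synthetic; shapes and names are illustrative, not those used in the study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-in for recorded activations: 50 hidden units
# observed at 20 training checkpoints, each unit summarized by its
# response to 10 fixed probe inputs.
n_epochs, n_units, n_probes = 20, 50, 10
activations = rng.normal(size=(n_epochs, n_units, n_probes))
# Add a simple drift over training so the trajectory has structure.
activations += np.linspace(0.0, 3.0, n_epochs)[:, None, None]

# Stack every (epoch, unit) pair into one trajectory matrix and center it.
X = activations.reshape(n_epochs * n_units, n_probes)
X = X - X.mean(axis=0)

# 2-D embedding via SVD (i.e., PCA); M-PHATE would instead build a
# multislice kernel over the same (epoch, unit) trajectory.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
embedding = (U[:, :2] * S[:2]).reshape(n_epochs, n_units, 2)
```

Plotting `embedding` colored by epoch shows how the population of hidden units moves through representation space during training; comparing such plots for standard and adversarially trained networks is the kind of analysis the abstract describes.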

This work advances our understanding of how adversarial robustness reshapes the representational geometry of neural networks, offering a new perspective on the internal dynamics that distinguish robust models from standard ones. These insights matter for developing interpretable robustness methods and for informing the design of networks that are accurate, resilient, and able to generalize well.