Predictive Energy Modeling for GPU Workloads

📌 Project Overview

Faculty Advisor: Dr. Xunfei Jiang, Department of Computer Science
Institution: California State University, Northridge
Dates of Research: June 2024 – August 2024
Funding: SECURE For Student Success (SfS²)

This project focused on developing machine learning models to forecast the energy consumption of GPU workloads in data centers, using real-world and synthetic cluster traces. The aim was to build predictive tools that enable energy-efficient workload scheduling, a critical need for sustainable high-performance computing and datacenters.

🎯 Objectives

  • Predict GPU power consumption (in watts) from real-time workload and utilization features
  • Support energy-efficient workload scheduling by modeling GPU behavior
  • Explore limitations of trace-driven simulations for real-time inference

🧠 Big Research Questions

  • Can we reliably predict GPU energy usage based on workload and utilization data?
  • How can machine learning improve energy efficiency in large-scale GPU clusters?
  • What are the constraints of using real-world trace data for predictive modeling?

🛠️ Methods & Tools

  • Data Sources: Alibaba v2020 Cluster Trace [1], SenseTime Helios Cluster Trace [2]
  • Algorithms / Models: XGBoost, LightGBM, CatBoost, LSTM (a minimal training sketch follows this list)
  • Software/Environments: Python, C, Bash, Linux
  • Model Evaluation/Analysis: RMSE, statistical analysis of cluster traces
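
Below is a minimal sketch of the kind of pipeline this workflow implies: train a gradient-boosted regressor on per-timestep trace features and score it with RMSE. The file name and column names (`gpu_trace_features.csv`, `gpu_util`, `gram_util`, `gpu_temp`, etc.) are hypothetical placeholders, not the actual trace schema or the project's exact configuration.

```python
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Hypothetical per-timestep features derived from the cluster traces;
# real column names depend on the preprocessed trace schema.
df = pd.read_csv("gpu_trace_features.csv")
features = ["gpu_util", "gram_util", "gpu_temp", "sm_clock", "num_tasks"]
X, y = df[features], df["power_watts"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Gradient-boosted regression trees (XGBoost, the project's best performer).
model = xgb.XGBRegressor(
    n_estimators=500, max_depth=6, learning_rate=0.05,
    objective="reg:squarederror",
)
model.fit(X_train, y_train)

# Evaluate with RMSE, the metric used in the project.
rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(f"RMSE: {rmse:.2f} W")
```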

🖼️ Poster

📦 Deliverables

📈 Outcomes

  • XGBoost emerged as the best-performing model, with promising predictive accuracy
  • Identified GRAM utilization and GPU temperature as the most influential features for prediction (see the sketch after this list)
  • Outperformed the GPU energy model from the previous team's research, as measured by a lower RMSE
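
As a hedged illustration of how such feature-importance findings can be read off a trained model, continuing the hypothetical `model` from the sketch above:

```python
# Gain-based importance shows which inputs drive the power prediction;
# in the project's results, GRAM utilization and GPU temperature ranked highest.
importance = model.get_booster().get_score(importance_type="gain")
for name, score in sorted(importance.items(), key=lambda kv: -kv[1]):
    print(f"{name:>12s}: {score:.1f}")
```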

🔁 Future Work

  • Expanding the models to predict outlet temperature and cooling Coefficient of Performance (COP); the standard COP definition is sketched after this list
  • Integrating the predictor with task schedulers for real-time, energy-aware decisions
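
For reference, the Coefficient of Performance mentioned above is the standard ratio of heat removed to cooling energy consumed (a textbook definition, not a project-specific formula):

\[
\textrm{COP} = \frac{Q_{\textrm{removed}}}{E_{\textrm{cooling}}}
\]

A higher COP means less cooling energy is spent per unit of heat extracted, which is why pairing outlet-temperature prediction with COP modeling could let a scheduler estimate the total (compute plus cooling) energy cost of a placement decision.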

[1] Q. Weng et al., “MLaaS in the Wild: Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters,” in Proceedings of the 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22), 2022, pp. 945–960. Available: https://www.usenix.org/conference/nsdi22/presentation/weng
[2] Q. Hu, P. Sun, S. Yan, Y. Wen, and T. Zhang, “Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '21), Nov. 2021, pp. 1–15. doi: 10.1145/3458817.3476223

🧠 What I Learned

I learned how to integrate real-world trace data with synthetic benchmark results to build effective machine learning models for systems-level forecasting. This project challenged me to balance statistical rigor with system constraints and sparked my interest in energy-efficient and parallel computing.


🙏 Acknowledgments

Thanks to Dr. Xunfei Jiang and Matthew Smith for their support and collaboration, and to the SECURE For Student Success (SfS²) Program, funded by the United States Department of Education FY 2023 Title V, Part A.
Thanks to the CSUN Office of Undergraduate Research and the CSUN AS STAR Fund for their funding and support.