Predictive Energy Modeling for GPU Workloads

📌 Project Overview

Faculty Advisor: Dr. Xunfei Jiang, Department of Computer Science
Institution: California State University, Northridge
Dates of Research: June 2024 – August 2024
Funding: SECURE For Student Success (SfS²)

This project focused on developing machine learning models to forecast the energy consumption of GPU workloads in data centers, using real-world and synthetic cluster traces. The aim was to build predictive tools that enable energy-efficient workload scheduling, a critical need for sustainable high-performance computing and datacenters.

🎯 Objectives

  • Predict GPU power consumption (in watts) from real-time workload and utilization features
  • Support energy-efficient workload scheduling by modeling GPU behavior
  • Explore limitations of trace-driven simulations for real-time inference

🧠 Big Research Questions

  • Can we reliably predict GPU energy usage based on workload and utilization data?
  • How can machine learning improve energy efficiency in large-scale GPU clusters?
  • What are the constraints of using real-world trace data for predictive modeling?

🛠️ Methods & Tools

  • Data Sources: Alibaba v2020 Cluster Trace [1], SenseTime Helios Cluster Trace [2]
  • Algorithms / Models: XGBoost, LightGBM, CatBoost, LSTM (a minimal training sketch follows this list)
  • Software/Environments: Python, C, Bash, Linux
  • Model Evaluation/Analysis: RMSE, statistical analysis of cluster traces
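
Below is a minimal sketch of the kind of pipeline this workflow implies: train a gradient-boosted regressor on per-timestep trace features and score it with RMSE. The file name and column names (`gpu_trace_features.csv`, `gpu_util`, `gram_util`, `gpu_temp`, etc.) are hypothetical placeholders, not the actual trace schema or the project's exact configuration.

```python
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Hypothetical per-timestep features derived from the cluster traces;
# real column names depend on the preprocessed trace schema.
df = pd.read_csv("gpu_trace_features.csv")
features = ["gpu_util", "gram_util", "gpu_temp", "sm_clock", "num_tasks"]
X, y = df[features], df["power_watts"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Gradient-boosted regression trees (XGBoost, the project's best performer).
model = xgb.XGBRegressor(
    n_estimators=500, max_depth=6, learning_rate=0.05,
    objective="reg:squarederror",
)
model.fit(X_train, y_train)

# Evaluate with RMSE, the metric used in the project.
rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(f"RMSE: {rmse:.2f} W")
```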

🖼️ Poster

📦 Deliverables

📈 Outcomes

  • XGBoost emerged as the best-performing model, with promising predictive accuracy
  • Identified GRAM utilization and GPU temperature as the most influential features for prediction (see the sketch after this list)
  • Outperformed the GPU energy model from the previous team's research, as measured by a lower RMSE
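
As a hedged illustration of how such feature-importance findings can be read off a trained model, continuing the hypothetical `model` from the sketch above:

```python
# Gain-based importance shows which inputs drive the power prediction;
# in the project's results, GRAM utilization and GPU temperature ranked highest.
importance = model.get_booster().get_score(importance_type="gain")
for name, score in sorted(importance.items(), key=lambda kv: -kv[1]):
    print(f"{name:>12s}: {score:.1f}")
```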

🔁 Future Work

  • Expanding the models to predict outlet temperature and cooling Coefficient of Performance (COP); the standard COP definition is sketched after this list
  • Integrating the predictor with task schedulers for real-time, energy-aware decisions
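
For reference, the Coefficient of Performance mentioned above is the standard ratio of heat removed to cooling energy consumed (a textbook definition, not a project-specific formula):

\[
\textrm{COP} = \frac{Q_{\textrm{removed}}}{E_{\textrm{cooling}}}
\]

A higher COP means less cooling energy is spent per unit of heat extracted, which is why pairing outlet-temperature prediction with COP modeling could let a scheduler estimate the total (compute plus cooling) energy cost of a placement decision.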

[1] Q. Weng et al., “MLaaS in the Wild: Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters,” in Proceedings of the 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22), 2022, pp. 945–960. Available: https://www.usenix.org/conference/nsdi22/presentation/weng
[2] Q. Hu, P. Sun, S. Yan, Y. Wen, and T. Zhang, “Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '21), Nov. 2021, pp. 1–15. doi: 10.1145/3458817.3476223

🧠 What I Learned

I learned how to integrate real-world trace data with synthetic benchmark results to build effective machine learning models for systems-level forecasting. This project challenged me to balance statistical rigor with system constraints and sparked my interest in energy-efficient and parallel computing.


🙏 Acknowledgments

Thanks to Dr. Xunfei Jiang and Matthew Smith for their support and collaboration, and to the SECURE For Student Success (SfS²) Program, funded by the United States Department of Education FY 2023 Title V, Part A.
Thanks to the CSUN Office of Undergraduate Research and the CSUN AS STAR Fund for their funding and support.