
To address the increasingly severe thermal challenges in multi-core NoC systems, the Ceres Lab research team at the Institute of Electronics, National Yang Ming Chiao Tung University (NYCU), led by Professor Kun-Chih (Jimmy) Chen, has proposed a low-cost online learning mechanism for temperature prediction in Network-on-Chip (NoC) systems. The team combines this mechanism with adaptive reinforcement learning for proactive thermal management, significantly improving the efficiency and stability of the system’s thermal management. The research, conducted by master’s students Yuan-Hao Liao, Cheng-Ting Chen, and Lei-Chi Wang, is detailed in the paper “Adaptive Machine Learning-based Proactive Thermal Management for NoC Systems,” published in the IEEE Transactions on Very Large Scale Integration Systems (IEEE TVLSI), one of the most prestigious international journals in the field of semiconductors and integrated circuits. The paper was selected for the 2024 IEEE TVLSI Best Paper Award. This award is given annually to only one paper among all those published in the journal over the past three years and is presented at the IEEE International Symposium on Circuits and Systems (ISCAS), the flagship international conference of the IEEE Circuits and Systems Society (CASS).

In recent years, the number of processor cores on a single chip has increased to enhance system performance through highly parallel processing, which has also made on-chip communication more challenging. To mitigate the interconnection problem in multi-core systems, designers use the Network-on-Chip (NoC) interconnection, which has been proven to be an efficient way to connect the processing elements of a multi-core system. However, due to high power density and diverse workload distribution, NoC systems often encounter severe thermal problems, leading to various negative impacts.
To tackle this problem, the system temperature must be monitored in real time. Among the different Dynamic Thermal Management (DTM) approaches, dynamic voltage and frequency scaling (DVFS) is the most popular method for controlling system temperature. Conventional reactive dynamic thermal management (RDTM) is triggered when the temperature of an NoC node reaches the triggering temperature. Although RDTM can quickly cool down NoC nodes in thermal emergency, it causes a significant performance impact. Unlike RDTM, proactive dynamic thermal management (PDTM) controls system temperature in advance based on temperature prediction information, which mitigates the performance impact during temperature control. However, the efficiency of PDTM depends on the precision of the temperature prediction and on the throttling ratio selection. In addition, the temperature behavior of NoC systems varies with the workload distribution, which makes it harder to capture physical parameters accurately at runtime and makes it challenging to use PDTM for system temperature control.
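The difference between the two trigger styles can be illustrated with a minimal Python sketch. It is not the team's implementation: the function names, the 5-degree margin, and the linear ramp are hypothetical choices made only to contrast a reactive trigger (based on the current temperature) with a proactive one (based on a predicted temperature).

```python
def reactive_throttle(current_temp, trigger_temp):
    """Reactive DTM (illustrative): throttle fully only once the
    current temperature has reached the triggering temperature."""
    return 1.0 if current_temp >= trigger_temp else 0.0


def proactive_throttle(predicted_temp, trigger_temp, margin=5.0):
    """Proactive DTM (illustrative): start throttling early, based on a
    predicted temperature, and ramp the throttling ratio up linearly
    inside an assumed safety margin below the trigger."""
    start = trigger_temp - margin
    if predicted_temp <= start:
        return 0.0
    return min(1.0, (predicted_temp - start) / margin)


if __name__ == "__main__":
    trigger = 85.0  # assumed triggering temperature in degrees Celsius
    for temp in (78.0, 82.5, 86.0):
        print(temp, reactive_throttle(temp, trigger), proactive_throttle(temp, trigger))
```

In this sketch the reactive rule does nothing until the trigger is hit and then throttles fully, while the proactive rule applies a partial throttling ratio earlier. Its usefulness therefore depends on how accurate the prediction is and on how the ratio is chosen, which are the two dependencies noted above.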

To mitigate the thermal problem of NoC systems, the research team proposes a low-cost online learning mechanism for temperature prediction in NoC systems. The adaptive Machine Learning (ML)-based PDTM first adopts an adaptive single-layer perceptron (ASLP)-based temperature prediction method built on least mean square (LMS) adaptive filter theory. The LMS adaptive filter uses the temperature prediction error to find proper weights for the ASLP-based predictor so that it fits the hyperplane of the temperature behavior. Furthermore, to overcome the limitation of conventional deterministic PDTM methods (i.e., a fixed thermal control strategy), the research team proposes an adaptive Reinforcement Learning (RL) method to assist with proper throttling ratio selection for further temperature control. Unlike the conventional feedback mechanism (i.e., the throttling level is determined according to the current and predicted temperature), the RL method considers the current rewards (i.e., information about the current temperature, the predicted temperature, and the throughput) to adjust the throttling ratio dynamically. In this way, the proposed adaptive ML-based PDTM ensures thermal safety and system throughput simultaneously. Compared with related works, the proposed approach reduces the average temperature prediction error by 0.2% to 78.0% and improves system performance by 2.4% to 43.0% with smaller hardware overhead.
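The general idea of an LMS-updated single-layer predictor can be sketched in a few lines of Python. This is a textbook LMS filter under stated assumptions, not the ASLP design from the paper: the number of taps, the step size mu, the normalized temperature trace, and all function names are hypothetical.

```python
def predict(weights, window):
    """Single-layer (linear) prediction: weighted sum of recent samples."""
    return sum(w * x for w, x in zip(weights, window))


def lms_update(weights, window, error, mu):
    """Standard LMS rule: move each weight along error * input."""
    return [w + mu * error * x for w, x in zip(weights, window)]


def run_lms(temps, taps=3, mu=0.1):
    """Predict each sample from the previous `taps` samples and adapt
    the weights online from the prediction error.
    `temps` is assumed to be normalized (for example, to the 0..1 range)
    so that the fixed step size stays stable."""
    weights = [0.0] * taps
    abs_errors = []
    for t in range(taps, len(temps)):
        window = temps[t - taps:t]
        error = temps[t] - predict(weights, window)
        weights = lms_update(weights, window, error, mu)
        abs_errors.append(abs(error))
    return weights, abs_errors


if __name__ == "__main__":
    trace = [0.5] * 200  # assumed steady normalized temperature
    weights, abs_errors = run_lms(trace)
    print("first error:", abs_errors[0], "last error:", abs_errors[-1])
```

On a steady trace the prediction error shrinks with every update, which is the behavior that lets such a predictor track the temperature online without an offline training phase. The RL-based throttling ratio selection is a separate step and is not covered by this sketch.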
“This IEEE TVLSI Best Paper Award is not only the first time a research team from Taiwan has won the award in its 30-year history, but also the first time a research team from East Asia has won the award. This award is a testament to the excellence of the Ceres Lab research team and demonstrates NYCU’s distinguished research contributions and advanced technique developments in electronics engineering,” Professor Chen said.