A multi-setpoint cooling control approach for air-cooled data centers using the deep Q-network algorithm

Cooling systems provide a safe thermal environment for the reliable operation of IT equipment in data centers (DCs) but account for a significant share of DC energy consumption. Therefore, to achieve energy savings in cooling system control under dynamic thermal distributions in DCs, this paper proposes a multi-setpoint cooling control approach based on deep reinforcement learning (DRL). First, a thermal model based on the XGBoost algorithm is constructed to accurately evaluate the thermal distribution in the rack room and guide real-time cooling control. Second, a multi-setpoint cooling control approach based on the deep Q-network algorithm (DQN-MSP) is designed to finely regulate the supply air temperature of each air conditioner by capturing thermal fluctuations, ensuring a dynamic balance between cooling supply and demand. Finally, we adopt an extended CloudSimPy simulation tool and real workload traces from the PlanetLab system to evaluate the effectiveness and performance of the proposed approach. The simulation results show that the proposed control solution reduces cooling energy consumption by over 2.4% by raising the average supply air temperature of the air conditioners while satisfying the thermal constraints.


Introduction
With the widespread use of emerging applications such as natural language processing (NLP), image recognition, and 5G communications, the computing power demand of information systems in various industries keeps increasing. 1 DCs provide resource services such as data computing, storage, and networking for information systems and have become one of the critical infrastructures of the Internet era. 2 However, the severe energy consumption and carbon emission problems brought about by the large-scale construction of data centers have attracted widespread attention from the community. As early as 2020, the journal Science reported that the annual total energy consumption of global cloud data centers is about 205 terawatt-hours, accounting for about 1% of the world's entire power generation, and will maintain steady growth in the next few years. 3 In addition, with the proposal of the strategic goals of Carbon Peak and Carbon Neutrality, promoting the construction of green DCs plays an essential role in realizing the dual-carbon goal. 4 DCs generally include core components such as IT, cooling, power supply, distribution, and lighting systems, whose energy consumption is distributed as shown in Figure 1. The cooling system is one of the indispensable core components for the stable operation of data centers, providing a thermal environment for the safe operation of IT equipment in the rack room. Still, its energy consumption accounts for up to 30%-40%. 5 Therefore, operators must adopt more effective energy-saving technologies to reduce cooling system energy consumption and enhance the overall energy efficiency of DCs.
One of the keys to efficient cooling management in DCs is the rapid and accurate prediction of the thermal distribution in the rack room, which in turn maintains a dynamic balance between heat dissipation demand and cooling supply. 6 However, the complex infrastructure layout, dynamically changing IT thermal load, and airflow in the rack room pose a significant challenge to constructing temperature models. Traditional approaches for thermal modeling of data centers include computational fluid dynamics (CFD) models and simplified physical models. 7 CFD-based temperature models can accurately simulate and evaluate the thermal distribution in the rack room. However, this method has high computational overhead and a complex modeling process, which makes it unsuitable for real-time thermal management. Additionally, the simplified physics-based model considers heat transfer fluids and thermodynamic principles and can complete temperature prediction quickly but performs poorly in terms of accuracy. 8 With the development of machine learning (ML) and Internet of Things (IoT) technologies, data center thermal modeling methods have gradually evolved from traditional CFD simulation and simplified physical modeling to data-driven thermal modeling methods. 9-12 The work 13 developed a variety of thermal prediction models based on machine learning, including artificial neural network (ANN), Gaussian process regression (GPR), and linear regression (LR) models. The authors used these thermal models to guide task placement and scheduling on a heterogeneous system, saving 17% in cooling power consumption while maintaining the quality of service (QoS). In summary, data-driven thermal models provide a better trade-off between modeling complexity, accuracy, and computational expense. They are better suited to guide cooling control in the complex thermal environments of data centers.
DRL combines the strengths of deep learning and reinforcement learning algorithms to solve complex decision-making tasks. Since DeepMind applied the DQN algorithm to Atari games and beat human players, 14 improved and novel DRL algorithms have continued to be developed. For example, the deep deterministic policy gradient (DDPG) 15 and proximal policy optimization (PPO) 16 algorithms, based on the actor-critic structure, have been proposed to make DRL algorithms more stable and converge faster. Furthermore, the multi-agent deep deterministic policy gradient (MADDPG) was designed to solve multi-agent collaboration and competition problems. 17 These research advances have led to a wide range of applications of DRL in various fields, including gaming, robot control, autonomous driving, and financial trading. As DRL performs well in solving control problems of complex systems, it has gradually been adopted to solve DC cooling control problems in recent years. 18,19 For example, the work 18 uses a reinforcement learning algorithm to control the fan speed and cooling water velocity of the air handling unit (AHU) in the rack room. Furthermore, the work 19 uses a large amount of monitoring data (IT load and weather information) from the data center to train a DRL-based agent to learn the control strategy of the water chilling unit. All these efforts have achieved good energy savings, but most existing cooling control schemes use a coarse-grained control strategy that sets global cooling parameters. This strategy usually sets the cooling supply based on the peak temperature to avoid hot spots, thus leading to some racks being over-cooled. In addition, the thermal impact of each air conditioner on different racks depends on the relative position, blast airspeed, and floor ventilation rate. 19 Therefore, this coarse-grained control strategy creates a mismatch between local cooling supply and demand, making it challenging to achieve fine-grained thermal management and resulting in low cooling energy efficiency.
To address this issue, this work proposes a DRL-based cooling control method for the dynamic control of cooling systems in complex thermal environments. The main contributions are as follows. First, a thermal model based on the XGBoost algorithm is constructed to quickly and accurately evaluate the thermal distribution in the rack room. Second, a multi-setpoint cooling control approach based on the DQN algorithm (DQN-MSP) is designed to regulate the supply air temperature of each air conditioner in a fine-grained manner to improve the cooling efficiency. Finally, the proposed approach is evaluated with an extended CloudSimPy simulator and real workload traces from the PlanetLab system.
The remainder of this paper is structured as follows. The first section describes related work on thermal modeling and cooling system control. The second section introduces the system model and optimization objectives. The third section presents the design of the state space, action space, and reward function of the DQN-based cooling control algorithm. The fourth section presents the simulation experiments and result analysis. Finally, the last section gives the conclusions and outlook.

Data center thermal modeling
The temperature distribution in data centers exhibits complex and variable dynamic characteristics due to the dynamic load of IT equipment, airflow circulation, and building layout. Existing thermal modeling of data centers can be classified into simplified physical models, 20 computational fluid dynamics (CFD) based models, 21 and data-driven models. 8 The most classical simplified physical model is the abstract thermal recirculation model for air-cooled data centers proposed in work. 20 This model constructs a thermal disturbance matrix to represent the weights of the interactions between multiple nodes in the rack room. However, this thermal model can only evaluate steady-state temperature profiles and ignores the time-varying nature of temperature. Subsequent work 22 improved the thermal recirculation model by considering the variation of temperature with time and constructing a transient temperature model to predict the temperature distribution after the next time step. In addition, CFD-based simulation modeling approaches have been extensively applied to the thermal modeling of data centers. The work 23 provides an extensive survey and analysis of studies related to CFD/HT-based thermal modeling methods. CFD-based thermal models can generally offer a complete and accurate thermal field. Still, the complex modeling process and huge computational overhead make them challenging to use for real-time temperature assessment and cooling control in data centers. Therefore, data-driven models have been developed as a prospective thermal modeling method and are broadly adopted for the thermal management of data centers. The work 24 constructs a thermal model based on proper orthogonal decomposition (POD) to fit the complex nonlinear relationship between the thermal load of nodes and the inlet temperature. Furthermore, work 12 developed an ANN-based thermal model for data centers to predict temperature and airflow distribution in rack rooms. The prediction results of this model are in high agreement with the CFD simulation results, which further validates that the neural network-based thermal model has acceptable performance and is suitable for real-time management of cooling systems. Moreover, the latest research work 10 demonstrates the applicability of data-driven thermal models for guiding the thermal management of data centers through numerous simulation experiments.

Cooling control system
Most existing works on data center cooling control optimize the cooling parameters at the holistic control level. For example, the work 18 proposed a model-based reinforcement learning algorithm to regulate data center cooling parameters. More specifically, the temperature and airflow rate in the rack room are regulated by controlling the air handling unit's fan speed and chilled water flow rate. Moreover, the work 19 formulated the cooling control strategy as an energy minimization problem with thermal constraints. Subsequently, an energy-aware cooling control algorithm (CCA) based on the actor-critic framework was proposed to solve it. Specifically, CCA is an end-to-end offline algorithm that can utilize historical monitoring data from data centers to train DRL-based agents to learn and improve chiller control policies. Furthermore, the work 25 constructs a DRL agent with constraints to learn the control strategy for the supply air temperature and air speed of air conditioners, considering the thermal constraints of temperature and relative humidity (RH) of the data center. This strategy adaptively controls the amount of return hot air mixed with fresh outside air to minimize cooling energy consumption. Although these global cooling control methods can reduce cooling energy consumption while satisfying thermal constraints, they have some limitations. First, this coarse-grained control method requires large-scale adjustment of equipment parameters, which may increase the complexity of the cooling control problem. Second, the overall control method focuses more on the global temperature, which makes it difficult to accurately regulate the temperature in local areas. Therefore, this work proposes a DQN-based cooling control method to achieve fine-grained thermal management and improve cooling energy efficiency in DCs.

System modeling and optimization goals
The cooling control model proposed in this work (Figure 2) mainly includes a cloud environment and a DRL agent. The core components of an air-cooled data center include IT equipment that provides computing services and a cooling system that protects the thermal environment of the computer room. We model the key components of the data center, including the power model of the IT equipment, the cooling system, and the temperature model of the rack room, to build a high-fidelity simulation environment. Assume that the IT system of a data center consists of N racks with servers, which can be represented as Racks = {Rack_1, ..., Rack_N}, where 1 ≤ n ≤ N. Moreover, the cooling system consists of M CRACs, which can be expressed as CRACs = {CRAC_1, ..., CRAC_M}, where 1 ≤ m ≤ M. 28

Data-driven thermal models
The data-driven model analyzes system data to capture complex correlations between input and output variables from massive amounts of data. The modeling approach takes data as the central basis without focusing too much on the physical meaning behind the relationship. 9 A flowchart of the data-driven thermal modeling approach is shown in Figure 3. First, multiple kinds of sensor data (temperature, humidity, pressure, flow rate) under natural operating conditions, or simulation data from CFD models, are collected to construct the dataset. Subsequently, the dataset is used to train the data-driven thermal model until it meets the prediction accuracy requirements. Finally, real-time temperature and airflow prediction is performed with the trained thermal model.
CFD model. This work uses the 6SigmaDCX commercial simulation software 29 to construct a CFD model of a rack room (Figure 4), which is parameterized according to the IT equipment parameters and building layout of a small data center. The rack room is equipped with an air conditioning system that supplies cold air through the raised floor and returns hot air through the room space. Four rows of racks are symmetrically arranged in the rack room, with 20 racks available for servers. An open cold aisle is arranged in the middle of each row of racks, and four precision air conditioners provide the cooling capacity. The power density of servers in the rack room is set to 1.5 kW/m² according to the recommendations of ASHRAE. 30 Moreover, three sensors were deployed at different heights at each rack inlet to collect temperature data, as shown in Figure 5. The specific hardware configuration and building parameters are shown in Table 1.
Data-driven thermal modeling. The thermal distribution in the data center depends on the cooling supply and the heat generation of the IT equipment. The experiments in this work use the supply air temperature T_sup and the blower fan speed FS_crac of the CRACs and the running power P_rack of the racks as independent variables, and the rack inlet temperature T_Rack_inlet as the dependent variable. Thus, the data-driven thermal model can be expressed as

T_Rack_inlet = f(T_sup, FS_crac, P_rack),

where f represents the nonlinear relationship between the input and output variables and is commonly fitted by regression models. Furthermore, to achieve efficient sampling, the Latin hypercube sampling (LHS) method 31 is adopted to ensure that the independent variables are uniformly distributed in the multidimensional parameter space. Based on this method, 1000 sets of parameters are randomly generated, and each set includes 28 parameters, covering the supply air temperature T_sup (18°C-30°C), fan speed FS_crac (20%-100%), and operating power consumption P_rack (6-10 kW). Finally, the CFD model was used to carry out the numerical calculations, export the simulation data, and perform data pre-processing to form the dataset. The dataset consists of 1000 samples denoted as (T_sup, FS_crac, P_rack, T_Rack_inlet) and is split into training and test sets in a ratio of 8:2.
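To make the sampling step concrete, the following sketch draws the 1000 parameter sets with SciPy's Latin hypercube sampler. The 4-CRAC/20-rack split of the 28 parameters follows the rack-room layout described above; the seed and library choice are illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch of the LHS-based dataset generation described above,
# assuming 4 CRACs and 20 racks (28 parameters per sample, as in the paper).
import numpy as np
from scipy.stats import qmc

N_CRAC, N_RACK, N_SAMPLES = 4, 20, 1000

# Per-dimension bounds: supply temperature (C), fan speed (%), rack power (kW).
l_bounds = [18.0] * N_CRAC + [20.0] * N_CRAC + [6.0] * N_RACK
u_bounds = [30.0] * N_CRAC + [100.0] * N_CRAC + [10.0] * N_RACK

sampler = qmc.LatinHypercube(d=N_CRAC * 2 + N_RACK, seed=42)
unit_samples = sampler.random(n=N_SAMPLES)           # uniform in [0, 1)^28
params = qmc.scale(unit_samples, l_bounds, u_bounds)

# Each row is one CFD boundary-condition set (T_sup, FS_crac, P_rack);
# the rack inlet temperatures would be obtained from the CFD solver.
print(params.shape)  # (1000, 28)
```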
Model training and performance. To validate and compare the performance of various existing data-driven models for data center temperature prediction, six widely used ML models were selected for the experiments, including the lasso regression model, the random forest regression model, the support vector regression (SVR) model, and the XGBoost model. The prediction error is measured with the mean absolute percentage error (MAPE) and the root mean square error (RMSE), which take their standard forms

MAPE = (100%/n) · Σ_{i=1}^{n} |T̂_i − T_i| / T_i,
RMSE = √( (1/n) · Σ_{i=1}^{n} (T̂_i − T_i)² ),

where T̂_i is the predicted value, T_i is the observed value, and n denotes the sample size. Besides, the 10-fold cross-validation method is used in the training process to avoid model over-fitting. Table 2 shows that the MAPE and RMSE values of the XGBoost model are 2.73% and 1.0°C, respectively, which are significantly smaller than those of the other models and can meet the demand for prediction accuracy. Moreover, its R² equals 0.87, closest to 1.0, indicating a high degree of model fit. Therefore, this work uses the XGBoost-based temperature prediction model to guide the cooling control of DCs. Note that the hyper-parameters of the ML models are shown in Appendix Table A1.
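A hedged sketch of this training and evaluation step is shown below: an XGBoost regressor is fitted on the dataset and scored with the MAPE, RMSE, and R² metrics defined above. The placeholder data and hyper-parameter values are assumptions for illustration; the paper's actual settings are in Appendix Table A1.

```python
# Sketch of the model-training step: fit an XGBoost regressor on the CFD
# dataset and report MAPE, RMSE, and R^2 as in Table 2. The data below is a
# random placeholder standing in for the 1000 CFD samples.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import (mean_absolute_percentage_error,
                             mean_squared_error, r2_score)
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(1000, 28))                      # (T_sup, FS_crac, P_rack)
y = 20 + 10 * X[:, :4].mean(axis=1) + rng.normal(0, 0.5, 1000)  # T_Rack_inlet

# 8:2 train/test split, as in the paper.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = XGBRegressor(n_estimators=200, max_depth=6, learning_rate=0.1)
model.fit(X_train, y_train)

pred = model.predict(X_test)
print("MAPE (%):", 100 * mean_absolute_percentage_error(y_test, pred))
print("RMSE (C):", np.sqrt(mean_squared_error(y_test, pred)))
print("R2:      ", r2_score(y_test, pred))
```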

Power model
Computing system. Numerous servers with different hardware configurations are deployed in the rack room. The power consumption P_i(t) of server i at time t can be divided into static power consumption P_static(t) and dynamic power consumption P_i_dynamic(t). The static power consumption P_static(t) is the base power consumption of the server under no load, which is usually a constant. Moreover, there is a complex relationship between the dynamic power consumption P_dynamic(t) and the computational resource utilization U(t) of the server. The work 33 states that there exists an optimal computational resource utilization U_opt (close to 70%) for most servers. When U(t) ≤ U_opt, the dynamic power consumption P_dynamic(t) grows linearly with the computational resource utilization U(t). Conversely, when U(t) > U_opt, P_dynamic(t) grows nonlinearly and rapidly with U(t). Thus, the dynamic power consumption P_dynamic(t) can be expressed as a piecewise function of U(t), where the constant coefficients a and b are set to 0.5 and 10, respectively, and the optimal utilization U_opt is set to 0.7. The total power consumption of the IT system at time t is the sum of the power consumption of all servers in the rack room, denoted as

P_IT(t) = Σ_i P_i(t).

Cooling system. The computer room air conditioner (CRAC) is the primary energy-consuming device of the cooling system. It accounts for most of the cooling overhead, so this work considers it the main optimization target for energy saving. The cooling efficiency of a CRAC can be measured by the ratio between the power consumption of the IT system and that of the cooling system, called the coefficient of performance (CoP); the higher the CoP value, the higher the cooling efficiency. It can be expressed as

CoP = P_IT / P_cooling,

where P_IT and P_cooling denote the total power consumption of the IT system and the cooling system, respectively. Additionally, the study 34 showed that CoP is positively correlated with the cold air supply temperature T_sup.
Here, the CoP model measured by HP Labs is

CoP(T_sup) = 0.0068 · T_sup² + 0.0008 · T_sup + 0.458.    (6)

From equation (6), it can be seen that increasing the cooling supply temperature T_sup of the CRAC can improve the cooling system's efficiency.
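The following sketch ties the power and CoP models together. The exact piecewise form of P_dynamic is not given in the source, so the quadratic term above U_opt is an assumption consistent with the stated behavior (linear below U_opt, rapidly nonlinear above); the CoP polynomial is the HP Labs model of equation (6).

```python
# Illustrative implementations of the power and CoP models above. The
# piecewise form of p_dynamic beyond u_opt is an assumption, not the
# paper's exact equation.
def p_dynamic(u, a=0.5, b=10.0, u_opt=0.7):
    """Dynamic server power as a function of utilization u in [0, 1]."""
    if u <= u_opt:
        return a * u                          # linear regime below U_opt
    return a * u_opt + b * (u - u_opt) ** 2   # assumed nonlinear regime

def cop(t_sup):
    """HP Labs CoP model as a function of supply air temperature (C)."""
    return 0.0068 * t_sup ** 2 + 0.0008 * t_sup + 0.458

def p_cooling(p_it, t_sup):
    """Cooling power implied by IT power and CoP: P_cooling = P_IT / CoP."""
    return p_it / cop(t_sup)

# Example: cooling power for 100 kW of IT load at a 24 C supply setpoint.
print(p_cooling(p_it=100.0, t_sup=24.0))  # ~22.8 kW
```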

Optimization objective
The optimization objective of the cooling management problem studied in this paper is to minimize the total energy consumption of the data center while satisfying the thermal constraints. Therefore, the optimization problem can be written as

min Σ_t [P_IT(t) + P_cooling(t)]
s.t. U_i(t) ≤ U_max,                        (10)
     T_Rack_inlet_i(t) ≤ 32°C,              (11)
     T_sup_min ≤ T_sup_j(t) ≤ T_sup_max,    (12)

where constraint (10) indicates that the computational resource utilization of a server cannot exceed the maximum utilization; constraint (11) guarantees that the inlet temperature of each server stays below the red-line temperature (32°C) 30; and constraint (12) gives the range of values for the supply air temperature of the air conditioners.
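As a small illustration, constraints (10)-(12) can be expressed as a feasibility check; the bound values below follow the red-line temperature and the 18°C-30°C supply air range used elsewhere in this paper, and are assumptions insofar as the source does not state them in this form.

```python
# A helper expressing constraints (10)-(12) as a single feasibility check.
def feasible(utilizations, inlet_temps, supply_temps,
             u_max=1.0, t_redline=32.0, t_sup_min=18.0, t_sup_max=30.0):
    return (all(u <= u_max for u in utilizations)                       # (10)
            and all(t <= t_redline for t in inlet_temps)                # (11)
            and all(t_sup_min <= t <= t_sup_max for t in supply_temps)) # (12)
```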

DQN-based cooling control model
The thermal profile of a data center results from multiple uncertainties coupling the load on IT equipment and the organization of the cooling airflow. The cooling system needs to evaluate the complex thermal gradients in the rack room in real time to make control responses. In this process, the decision made by the controller at each moment is only related to the system's current state and is independent of the previous historical states. Thus, this sequential decision process has the Markov property. Therefore, this work models the data center cooling control optimization problem as a Markov decision process (MDP), denoted as (S, A, R, P). The state space S denotes the set of environmental states and features, and s_t ∈ S denotes the state of the agent at time t; A is the action space, and a_t ∈ A is the action taken at time t; R is the reward function, which depends on the optimization objective, and r_t = R(s_t, a_t) denotes the immediate reward that the agent receives for executing action a_t in state s_t; P is the state transition function, and p(s_{t+1} | s_t, a_t) denotes the probability that the agent taking action a_t in state s_t transitions to the next state s_{t+1}. The goal of the reinforcement learning agent is to explore the optimal policy π that maximizes the expected cumulative discounted reward over t ∈ T,

G_t = E[ Σ_k γ^k · r_{t+k} ],

where the discount factor γ ∈ [0, 1] weighs the effect of future rewards on the cumulative reward. Subsequently, the classical deep reinforcement learning model DQN 14 is adopted to solve this MDP problem. A schematic diagram of the DQN model is shown in Figure 6. The DQN agent learns the global control optimization strategy by continuously interacting with the cloud environment and obtaining the corresponding rewards. 19 More specifically, the DQN model uses a deep neural network to extract features of the complex state space, followed by a Q-learning algorithm to evaluate and select action decisions. The DQN model contains two homogeneous neural networks: the target network and the online network. Q(s, a; θ) denotes the output of the online network, which evaluates the value function of the current state-action pair; max_{a'} Q(s', a'; θ⁻) denotes the maximum Q value output by the target network. Thus, the target Q value can be calculated as

y = r + γ · max_{a'} Q(s', a'; θ⁻).

Subsequently, the mean square error between the current Q value and the target Q value defines the loss function

L(θ) = E[(y − Q(s, a; θ))²].

The gradient of the loss function L(θ) with respect to the parameters θ is

∇_θ L(θ) = E[(y − Q(s, a; θ)) · ∇_θ Q(s, a; θ)].

Based on this gradient, stochastic gradient descent (SGD) is used to update the parameters θ of the online network. To enhance the stability of the algorithm, the target network adopts a delayed update method, which copies the parameters of the online network to the target network every C steps. The state space, action space, and reward function of the DQN-based cooling control model are given below.
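A minimal PyTorch sketch of this target-online structure and the loss L(θ) is given below. The network width and discount factor are placeholders, and the state dimension assumes the 20-rack, 4-CRAC room described earlier (20 inlet temperatures + 20 rack powers + 4 supply temperatures + 4 fan speeds = 48).

```python
# A minimal sketch of the DQN target and loss defined above, assuming a
# 48-dimensional state and the 17-action space of the next subsections.
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, GAMMA = 48, 17, 0.99  # GAMMA is an assumed value

def make_qnet():
    return nn.Sequential(nn.Linear(STATE_DIM, 128), nn.ReLU(),
                         nn.Linear(128, N_ACTIONS))

online, target = make_qnet(), make_qnet()
target.load_state_dict(online.state_dict())  # delayed copy every C steps

def dqn_loss(s, a, r, s_next):
    q = online(s).gather(1, a.unsqueeze(1)).squeeze(1)    # Q(s, a; theta)
    with torch.no_grad():
        y = r + GAMMA * target(s_next).max(dim=1).values  # target Q value
    return nn.functional.mse_loss(q, y)                   # L(theta)

# Example with one synthetic batch of 32 transitions.
s = torch.randn(32, STATE_DIM); a = torch.randint(0, N_ACTIONS, (32,))
r = torch.randn(32); s2 = torch.randn(32, STATE_DIM)
loss = dqn_loss(s, a, r, s2)
```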

State space
In this work, the inlet temperature T_inlet^i of each rack, the rack operating power consumption P_rack^i, the supply air temperature T_sup^j of each air conditioner, and the fan speed FS_crac^j are used as the state space of the rack room environment, which is expressed as S_t = {T_inlet^i, P_rack^i, T_sup^j, FS_crac^j}.

Action space
The cooling controller is responsible for adjusting the supply air temperature of the cooling system in real time based on the heat distribution in the rack room. Assume that there are four precision air conditioners with initial temperature T_initial in the rack room. The available operation for each air conditioner is denoted as a = {+, −}, where "+" and "−" mean increasing or decreasing the supply air temperature by 0.5°C, respectively. Therefore, for a set of four air conditioners, a joint action can be expressed as A = {a_1, a_2, a_3, a_4}, giving a total of 2^4 = 16 action combinations. In addition, an empty action A_17 is added to represent the operation of maintaining the original parameter settings, so the action space is represented as Action = {A_1, A_2, ..., A_17}.
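The 17-element action space can be enumerated directly, as the short sketch below illustrates; the tuple encoding is an assumption made for illustration, not the paper's code.

```python
# Enumerating the action space described above: every +/-0.5 C combination
# for the four CRACs, plus one "hold" action that keeps the current setpoints.
from itertools import product

DELTA = 0.5
ACTIONS = [tuple(d * DELTA for d in combo)
           for combo in product((+1, -1), repeat=4)]  # A_1 .. A_16
ACTIONS.append((0.0, 0.0, 0.0, 0.0))                  # A_17: keep settings

assert len(ACTIONS) == 17
# Applying action k: new_setpoint[j] = old_setpoint[j] + ACTIONS[k][j]
```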

Reward function
A properly designed reward function helps the agent learn the desired strategy faster and better toward the optimization goal. Therefore, for the optimization objective of minimizing the cooling energy consumption under the thermal constraints, the reward function can be designed as

r_t = Σ_{n=1}^{M} T_sup^n(t) − C · HS(t),

where T_sup^n(t) and HS(t) denote the supply air temperature of the n-th CRAC and the number of hot spots at time t, respectively, and C denotes a constant. Specifically, the higher the supply temperature of the CRACs, the lower the cooling power, so the sum of the supply temperatures of all CRACs is used as the reward at time t. Conversely, to enforce the thermal constraints, the number of hot spots is used as a penalty. This reward function aims to reduce the cooling energy consumption by increasing the supply air temperature of the air conditioners as much as possible while imposing a corresponding penalty for violating the thermal constraints.
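A direct reading of this reward as code might look as follows; the penalty weight C = 10 is an illustrative assumption, since the paper only states that C is a constant.

```python
# Sketch of the reward above: the sum of CRAC supply temperatures acts as
# the reward and the hot-spot count as a penalty.
def reward(supply_temps, inlet_temps, t_redline=32.0, C=10.0):
    hot_spots = sum(1 for t in inlet_temps if t > t_redline)  # HS(t)
    return sum(supply_temps) - C * hot_spots
```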

Agent training process
Here, it is assumed that the data center cooling system triggers a cooling control operation every 10 min, with 24 h forming one episode. The DQN-agent training process in each episode is as follows.

DQN-agent training process
1. Initialize the online network Q(θ) and the target network Q'(θ⁻)
2. Initialize the cloud environment Env and obtain the initial state s_t
3. For each episode:
4.   For t = 1 to T do:
5.     The agent adopts the ε-greedy strategy to select the action a_t
6.     Env takes action a_t, returns the new state s_{t+1}, and calculates the reward r
7.     The agent stores the sample (s, a, r, s') into the memory pool
8.   End For
9.   If the number of samples exceeds the threshold:
10.    Randomly select a mini-batch of samples to train the online network Q(θ)
11.    Define the loss function:
12.    L(θ) = E_{(s,a,r,s')~D(M)} [(r + γ max_{a'} Q(s', a'; θ⁻) − Q(s, a; θ))²]
13.    Use stochastic gradient descent to update the online network Q(θ):
14.    ∇_θ L(θ) = E_{(s,a,r,s')~D(M)} [(r + γ max_{a'} Q(s', a'; θ⁻) − Q(s, a; θ)) ∇_θ Q(s, a; θ)]
15.    Update the target network every C steps: θ⁻ ← θ
16. End For
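A compact Python skeleton of this loop is sketched below. It reuses the online/target networks and dqn_loss from the earlier sketch; Env is a hypothetical wrapper around the extended CloudSimPy simulator, and the ε, learning-rate, and replay-pool values are illustrative assumptions (T = 144 corresponds to 24 h of 10-min control steps).

```python
# Skeleton of the training listing above; `env`, EPS, and lr are assumptions.
import random
from collections import deque
import torch

memory = deque(maxlen=10_000)                        # experience replay pool
optimizer = torch.optim.SGD(online.parameters(), lr=1e-3)
EPS, C_STEPS, BATCH, T = 0.1, 100, 32, 144           # T = 24 h of 10-min steps

for episode in range(200):
    s = env.reset()                                  # hypothetical simulator API
    for t in range(T):
        if random.random() < EPS:                    # epsilon-greedy exploration
            a = random.randrange(N_ACTIONS)
        else:
            with torch.no_grad():
                a = online(torch.as_tensor(s, dtype=torch.float32)
                           .unsqueeze(0)).argmax().item()
        s_next, r = env.step(a)                      # apply setpoint changes
        memory.append((s, a, r, s_next))
        s = s_next
        if len(memory) >= BATCH:                     # train once the pool is warm
            s_b, a_b, r_b, sn_b = map(list, zip(*random.sample(memory, BATCH)))
            loss = dqn_loss(torch.tensor(s_b, dtype=torch.float32),
                            torch.tensor(a_b),
                            torch.tensor(r_b, dtype=torch.float32),
                            torch.tensor(sn_b, dtype=torch.float32))
            optimizer.zero_grad(); loss.backward(); optimizer.step()
        if (t + 1) % C_STEPS == 0:                   # delayed target update
            target.load_state_dict(online.state_dict())
```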

Experiment settings
The 6SigmaDCX simulation software 29 was adopted to build a CFD simulation model of a small data center (Figure 4), which serves as a platform for conducting repeatable validation experiments. Additionally, the simulation experiments use real workload traces from the PlanetLab system 35 to simulate the variation of IT load in the data center. Moreover, we extend the open-source cloud simulation tool CloudSimPy 36 to add the temperature model and the power consumption models of the computing and cooling systems. All source code of the experiments is written in Python and runs on a laptop with a Core i7-5700HQ CPU, 3.5 GHz, and 12 GB RAM.
The experiment assumes that there are two cooling setpoint modes for the CRACs in the machine room: single setpoint (SSP) and multiple setpoint (MSP) modes. Specifically, the SSP mode means that all CRACs share the same cooling setpoint, while the MSP mode is a fine-grained cooling control mode that sets a different cooling setpoint for each CRAC. In addition, two well-performing DRL algorithms, proximal policy optimization (PPO) 16 and deep deterministic policy gradient (DDPG), 15 are chosen as benchmark control algorithms. Distinguished from the target-online network structure of DQN, the PPO and DDPG algorithms are policy gradient algorithms based on the actor-critic structure, where the actor is the policy neural network that selects the action and the critic is the neural network that evaluates the value of the action. Besides, the agent of PPO adopts the same policy π for action selection during training and testing, while the agents of DQN and DDPG usually adopt the ε-greedy action selection strategy during training and the greedy strategy a* = argmax_a Q*(s, a) during testing or practical use. Therefore, to validate the effectiveness and performance of the proposed DQN-MSP cooling control method, we set up four cooling control benchmarks in MSP and SSP modes: DQN-MSP, PPO-MSP, DDPG-MSP, and DQN-SSP. Considering that DRL algorithms are sensitive to hyper-parameters, the key hyper-parameters of the models used in the experiments are described as follows. The learning rates of the actor and critic networks for PPO and DDPG are 0.01 and 0.02, respectively. The hyper-parameter ε of PPO determines the clip range of the action selection probability ratio and is set to 0.2. Moreover, the hyper-parameters of the DQN model are listed in Table 3.

Experimental results and analysis
Figure 7 shows the variation curves of the total reward and total energy consumption of DQN-MSP during the training process. To facilitate the observation of the relationship between the two curves, the total reward and the total energy consumption are normalized. The total reward first increases rapidly and then converges gradually, while the total energy consumption curve decreases rapidly and then converges gradually, with obvious synchronization between the two. It can be inferred that the designed reward function is well suited to drive the agent to learn a control strategy toward the optimization goal of reducing energy consumption.
Figure 8 shows that as the training episodes increase, the total rewards of the four DRL-based cooling control methods show an increasing trend and eventually converge. Note that DQN-SSP outperforms the other methods in convergence speed but has the smallest convergence value. The reason is that DQN-SSP uses a coarse-grained control strategy, so its action space is smaller than that of the MSP methods; with a smaller exploration space, it converges faster but obtains less reward. Moreover, after 200 training episodes, the convergence value of the proposed DQN-MSP outperforms all benchmarks.
Figure 9 compares the total reward and energy consumption of the different control methods. As can be observed, the proposed DQN-MSP algorithm learns better control strategies and achieves lower energy consumption than the other benchmarks. Compared to DQN-SSP, PPO-MSP, and DDPG-MSP, it saves 5.7%, 2.4%, and 4.2% of energy consumption, respectively.
Figure 10 shows the trace of the total IT power over 24 h in the data center and the average supply air temperature for the various cooling control methods. The DRL-based cooling control methods can capture the variations in the power consumption trace to regulate the CRAC supply temperature. In general, the supply temperature of the cooling control methods in MSP mode (DQN-MSP, PPO-MSP, DDPG-MSP) is slightly higher than that in SSP mode. The specific reason is that SSP mode requires a lower global cooling setpoint to meet the thermal constraints of every zone, especially the high heat-generating zones. In contrast, MSP mode allows each CRAC to regulate its supply air temperature according to the thermal variations in its respective coverage zone. As a result, the supply air temperature of the associated CRAC can be appropriately increased for areas and periods with low heat generation, thus reducing cooling energy consumption. In addition, Figure 11 shows the supply temperature distribution of the four cooling control methods. The supply temperature of the proposed DQN-MSP is higher than that of the other benchmarks, which also means lower cooling energy consumption.
Comparing the average rack inlet temperature distributions under the two control approaches, DQN-SSP (Figure 12(a)) and DQN-MSP (Figure 12(b)), it can be seen that the temperature gradient under the fine-grained control strategy (DQN-MSP) is more uniform and closer to the red-line temperature (32°C). In conclusion, the fine-grained cooling control strategy can differentially regulate the cooling parameters of multiple air conditioners according to the temperature changes in each region, which more effectively reduces the temperature gradient and the cooling energy consumption.

Conclusion
To address the energy-saving problem of cooling control in DCs, this work first constructs an XGBoost-based temperature prediction model to quickly and accurately evaluate the temperature distribution in rack rooms. Then, guided by this thermal model, a DRL-based cooling control method is proposed, which can reduce cooling energy consumption by 2.4%-5.7% by increasing the average supply air temperature of the air conditioners without violating thermal constraints. The proposed DQN-based cooling control model can capture the thermal load changes of each region in the rack room and adjust the supply air temperature of each air conditioner in a fine-grained manner to improve the cooling efficiency. However, this work also has some limitations. For example, as the number of cooling units increases, the action space of the single-agent controller grows exponentially, which can prevent model training from converging or degrade the optimization results. Therefore, future work will consider a multi-agent cooperative control framework 37 to achieve asynchronous control of multiple cooling units. This approach ensures the global cooling supply-demand balance while accounting for the thermal fluctuations of local areas.

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Figure 1. Energy consumption distribution of data centers.

Figure 2. Data center cooling control model.

Figure 4. CFD model of the data center.

Figure 7. Normalized total reward and total energy.

Figure 8. Normalized reward curves for control methods.

Figure 9. Total reward and energy for control methods.

Figure 10. Total IT power consumption and average supply air temperature.

Figure 11. Supply temperature distribution for four control methods.

Table 1. Parameters of the CFD model.

Table 2. Prediction error metrics of different models.

Table 3. The hyper-parameters of the DQN model.