- Specific Hardware: Understanding Chip Temperature and Facility Water Temperature
Requesting a quote for cooling a specific set of hardware is essential to optimizing your total cost of ownership (TCO). Lately, we've seen numerous liquid cooling quote requests that specify only a vague kW per rack or ask for a standalone CDU quote. Without details of the servers or CPUs/GPUs involved, the power a CDU can cool varies significantly. Standard procedures will simplify this process in time, but for now, ASHRAE and OCP are still working them out. Request a quote tailored to your identified hardware to avoid inaccurate estimates. Keep in mind that cooling CPUs and GPUs below their maximum temperature prolongs their life, improves efficiency, and occasionally enhances performance. Bargain shopping for a CDU, fluid connectors, manifolds, and cold plates and then connecting them yourself won't necessarily minimize your TCO. We recommend caution, because such makeshift systems can introduce complications.
- Cooling and Power: Similar Terms, Different Meanings
Providing an accurate quote in response to incomplete RFPs is challenging as the required cooling water flow rate varies according to maximum chip temperature, chip power, and facility water temperature. This is akin to requesting a 10 kW PDU without specifying the voltage.
Technical Explanation: In liquid cooling, the flow rate is analogous to the current in power estimates, while the delta T between the CPU/GPU's maximum case temperature and the facility water temperature is analogous to the voltage. CDUs, which consist of pumps and heat exchangers, can cool a certain number of servers at a particular flow rate. A higher delta T allows more servers to be cooled, just as a PDU supplying a fixed current can power more servers at a higher voltage. The heat exchanger transfers heat at the cost of some of the available delta T, much like an electrical cable run where a higher current causes a larger voltage drop, limiting the voltage available to the servers.
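As a rough illustration of the analogy, the heat a water loop carries is the product of mass flow, specific heat, and the coolant's temperature rise, much as electrical power is the product of current and voltage. The Python sketch below uses approximate water properties and hypothetical function names; none of it comes from the Chilldyne calculator. The coolant's temperature rise is only part of the delta T between case and facility water, since the cold plates and heat exchanger consume the rest, so treat this as an energy balance and not a sizing tool.

```python
# Approximate properties of water near 30 C (assumed values, not from the article).
WATER_DENSITY_KG_PER_L = 0.996
WATER_SPECIFIC_HEAT_J_PER_KG_K = 4180.0


def heat_carried_w(flow_l_per_min: float, coolant_rise_c: float) -> float:
    """Heat (W) carried away by water at a given flow and temperature rise."""
    mass_flow_kg_per_s = flow_l_per_min * WATER_DENSITY_KG_PER_L / 60.0
    return mass_flow_kg_per_s * WATER_SPECIFIC_HEAT_J_PER_KG_K * coolant_rise_c


def flow_needed_l_per_min(heat_w: float, coolant_rise_c: float) -> float:
    """Flow (L/min) needed to carry heat_w at a given coolant temperature rise."""
    if coolant_rise_c <= 0:
        raise ValueError("coolant temperature rise must be positive")
    watts_per_l_per_min = heat_carried_w(1.0, coolant_rise_c)
    return heat_w / watts_per_l_per_min


if __name__ == "__main__":
    # Halving the allowed temperature rise doubles the flow needed for the
    # same heat load, like halving the voltage doubles the current.
    for rise_c in (5.0, 10.0, 20.0):
        flow = flow_needed_l_per_min(700.0, rise_c)
        print(f"700 W at a {rise_c:.0f} C coolant rise needs about {flow:.2f} L/min")
```

With these assumed properties, roughly 1 L/min of water warming by 10 C carries about 700 W, which is why a smaller allowable temperature rise translates directly into more flow per chip.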
- Practical Redundancy: Address Real-world Failure Modes
Redundancy requirements are essential, but they should tackle real-world problems. For example, including two centrifugal pumps in a CDU may not significantly increase uptime because these pumps are generally reliable and can last five years or more. More frequent issues like leaks, contamination, and corrosion need attention. Our systems offer N+1 CDU redundancy to maximize data center uptime, considering the significant financial implications of downtime in a multi-million-dollar data center.
- How to Analyze Your System Requirements
Determining the liquid cooling requirements of a system hinges on two key factors: 1) power consumption and 2) maximum case temperature tolerance. Higher power consumption and lower case temperature thresholds demand a greater volume of coolant per server. Historically, the upper limit for case temperatures hovered between 80 and 95 C. However, some current models have maximum case temperatures as low as 57 C. This variation resembles the variability of voltage requirements across different servers.
Detailed Examination of Specific Systems
To illustrate this, let's consider a scenario in which the liquid cooling system primarily absorbs heat from the CPUs/GPUs, with the cooling water maintained at 30 C. The rack power stated here covers CPU/GPU power only; overall rack power will be 20-50% higher. Assume we are assembling a cluster of H100 GPUs, each drawing 700 W and featuring a maximum case temperature of 85 C. Using the Chilldyne calculator available at www.Chilldyne.com/calculator, we can determine the number of GPUs we can cool. Inputting the following specifications:
- Facility water temperature: 30 C
- Node maximum power: 700 W
- Target node temperature: 85 C
yields a cooling capacity of 572 GPUs. (A margin of error should be incorporated in real-world scenarios.) These GPUs can be housed in Dell XE9680 6U servers, each drawing up to 6 kW. With seven servers (56 GPUs) per rack, this gives cooling capacity for ten racks drawing 42 kW each, for a total of 420 kW. (This example assumes that the servers' CPUs are under low load and contribute minimally to the overall power consumption.)
If the facility water temperature is increased to 40 C, the cooling capacity drops to 334 GPUs, restricting us to six racks (250 kW). In other words, a 10 C increase in the facility water temperature reduces cooling capacity by roughly 40%.
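The rack arithmetic for this example can be reproduced with a short Python sketch. The GPU counts, the 6 kW per server, and the seven servers per rack are the figures quoted above; the function names are hypothetical, and the sketch reports fractional racks, whereas the text rounds to whole racks, so the totals differ slightly.

```python
# Figures quoted in the H100 example above.
GPUS_PER_SERVER = 8       # 56 GPUs across seven servers
SERVERS_PER_RACK = 7
SERVER_KW = 6.0


def cluster_summary(gpus_cooled: int) -> dict:
    """Convert a cooled-GPU count into servers, racks, and total kW."""
    servers = gpus_cooled / GPUS_PER_SERVER
    return {
        "servers": servers,
        "racks": servers / SERVERS_PER_RACK,
        "total_kw": servers * SERVER_KW,
    }


def capacity_loss(gpus_before: int, gpus_after: int) -> float:
    """Fraction of cooling capacity lost between two operating points."""
    return 1.0 - gpus_after / gpus_before


if __name__ == "__main__":
    for water_c, gpus in ((30, 572), (40, 334)):
        s = cluster_summary(gpus)
        print(f"{water_c} C water: {gpus} GPUs, "
              f"{s['racks']:.1f} racks, {s['total_kw']:.0f} kW")
    print(f"Capacity lost: {capacity_loss(572, 334):.1%}")
```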
Alternate Scenario: Intel 8460Q Sapphire Rapids (Xeon 4th Gen)
Alternatively, consider a situation where we are using 30 C water to cool Intel 8460Q Sapphire Rapids (Xeon 4th Gen) servers like the Dell c6620, equipped with eight CPUs per 2U server. Each CPU consumes 350 W and is rated for a maximum case temperature of 57 C. The smaller temperature differential between the CPU and coolant necessitates a substantially higher flow rate per watt. Employing the Chilldyne calculator once again, we can cool 477 CPUs. This equates to cooling 60 2U servers, or three racks, each drawing 59 kW (180 kW total). However, if the water temperature increases to 40 C, we might not have an adequate cooling margin.
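To see why warmer water hurts the low-temperature CPU more, it helps to compare the temperature budget (maximum case temperature minus facility water temperature) for the two chips discussed so far. The Python sketch below uses only the temperatures quoted above; the function names are hypothetical, and the budget is a simple difference, not the model behind the Chilldyne calculator.

```python
# Maximum case temperatures quoted in the text, in C.
MAX_CASE_C = {
    "H100 GPU": 85.0,
    "Xeon 8460Q CPU": 57.0,
}


def temperature_budget_c(max_case_c: float, facility_water_c: float) -> float:
    """Temperature difference available between the chip case and facility water."""
    return max_case_c - facility_water_c


def budget_lost_fraction(max_case_c: float, cool_c: float, warm_c: float) -> float:
    """Fraction of the budget lost when facility water warms from cool_c to warm_c."""
    cool_budget = temperature_budget_c(max_case_c, cool_c)
    warm_budget = temperature_budget_c(max_case_c, warm_c)
    return 1.0 - warm_budget / cool_budget


if __name__ == "__main__":
    for chip, case_c in MAX_CASE_C.items():
        at_30 = temperature_budget_c(case_c, 30.0)
        at_40 = temperature_budget_c(case_c, 40.0)
        lost = budget_lost_fraction(case_c, 30.0, 40.0)
        print(f"{chip}: {at_30:.0f} C budget at 30 C water, "
              f"{at_40:.0f} C at 40 C water ({lost:.0%} lost)")
```

The same 10 C rise in water temperature removes a much larger share of the budget for the 57 C part than for the 85 C part, which is consistent with the thin margin noted above.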
Alternate Scenario: Supermicro AS -1125HS-TNR AMD Genoa
Lastly, let's evaluate the cooling capability for Supermicro AS -1125HS-TNR AMD Genoa servers. The Genoa configuration has six memory lanes and does not support a half-width form factor. Each 1U server accommodates two 400 W CPU sockets, rated for a maximum case temperature of 70 C. According to the calculator, we can cool 877 servers at 17 kW per rack, for a total of 42 racks (714 kW). However, if the facility water temperature increases to 40 C, we are limited to cooling 459 servers across 22 racks (374 kW).
Therefore, the power per rack and the total cooling load vary widely with the hardware, with the total ranging from 180 kW to 714 kW in these examples. To design a future-proof liquid-cooled data center, ensure an ample supply of facility cooling water, and add or upgrade CDUs as needed. Direct-to-chip cooling systems offer flexibility; you can switch from one brand to another during a server refresh with relative ease. In some cases, you can initially install a positive pressure system, and if it leaks, you can retrofit it with Chilldyne negative pressure CDUs and use the existing server hardware.
Here are some requests we have received and our reasons for caution:
- We've been asked to design a liquid cooling system for 3,500 W per node when the current server uses 500 W. This leads to unnecessary expense and inefficiency. It's wiser to design for present requirements, scaling up only if necessary.
- We've been asked for a system designed for 40+ psi DeltaP, which exceeds typical requirements. This increases costs and the chance of failure; Chilldyne's systems have operated efficiently at less than 6 psi DeltaP for years.
- We've been asked for 90% heat capture. Aiming for a 90% heat capture ratio might seem like an efficiency drive, but it can lead to unnecessary costs. It's often more cost-effective to achieve a lower heat capture ratio, like the 80% we achieve with Sandia Manzano, and air cool the remainder.
Next time, we'll cover the top 3 variables to consider in RFPs to make smarter decisions on your future data center. Don't miss it!