Over the last year, University IT staff have been working with faculty, staff, and vendors to design the next community cluster. Goals for this successor to the Mills and Farber clusters are to pack more computing power into less physical space, use power more efficiently, and leverage reusable infrastructure for a longer overall lifespan. Several leading HPC vendors provided supporting information in late 2016 which, together with input from University faculty and staff, helped to produce a finalized design proposal that became a formal request for pricing (see the Design and Policies document for more details). The project reached a milestone this week with the announcement at the HPC Symposium of Penguin Computing as the vendor for the next community cluster. Penguin’s Tundra Extreme Scale (ES) design met all of the design goals and was priced to maximize the value of the University’s investment in this critical resource.
The Tundra ES follows the specifications of the Open Compute Project, an initiative to standardize the construction of compute hardware and the racks that hold that hardware. Each compute node is little more than a metal box containing a motherboard and the processors, memory, and other components directly attached to it. When today’s nodes are no longer useful, they can be removed and replaced with newer hardware without additional rack infrastructure cost. Maximizing what counts as rack infrastructure is therefore important: the nodes do not possess their own power supplies and instead draw from a central power converter. The single power converter is more efficient and has the capacity and redundancy necessary to keep the cluster running with current and future generations of processors.
Like Farber before it, the next community cluster will use Intel processors. The first generation of nodes will feature two 18-core Intel Xeon E5-2695 v4 processors, nearly double the number of cores in Farber’s nodes. The minimum amount of memory present in each node has been doubled to 128 GB. Nodes with dual NVIDIA P100 GPUs, additional memory, or local SSD storage are also present in the design. While all component hardware is significantly advanced from Farber’s design, the entry-level buy-in for stakeholder investment in the next community cluster is anticipated to be similar, as are the value-added benefits: opportunistic access to idle compute resources, use of the large-scale shared scratch storage (Lustre), an allocation of resilient shared storage (NFS), and support provided by University IT HPC staff. Questions about the new cluster can be sent to the team by emailing firstname.lastname@example.org.
Retirement of Mills Cluster
The purchase and installation of the next community cluster in the University’s data center provides motivation to move Mills into the next stage of its end-of-life plan. The Mills cluster, purchased in 2012, exited its official support period at the beginning of 2017. The cluster has remained functional in a production capacity since that time, with only 13 nodes (on the order of 6%) removed from service following irreparable failure.
Later this year, in preparation for the installation of the next community cluster, a number of racks of compute hardware will be removed from the Mills cluster. Mills will remain functional in a diminished capacity for a period thereafter. Sometime in 2018, the remainder of Mills will be consolidated to a single rack, and the login node will be rebuilt to function solely as a means for users to access the Lustre and home directory storage of that cluster for a limited time.