The University of Delaware IT Research Computing team wishes you a festive and happy holiday season. The year 2018 has seen growth in HPC support, with the hiring of a Scientific Applications Consultant and the purchase of the Caviness cluster. We’re hopeful that 2019 will see our services continue to expand.
Three major initiatives are underway that require the attention of all stakeholders and their research groups beginning January 2019:
- Expansion of the Caviness cluster: the next generation is in the design phase; stakeholder input is needed on node types and quantities, with acquisition expected in the first quarter of 2019
- Mills compute nodes to be shut down and removed: early 2019, to make room for the Caviness next generation
- Scheduled Farber maintenance: Monday, January 28, 2019
Additional Capacity in Caviness Cluster
Part of this expansion will be the procurement of the second generation of hardware for the Caviness cluster. Since the cluster was designed with growth in mind, adding capacity is a matter of designing, purchasing, and connecting new racks. We are currently in the design phase and have been working with Penguin on the basics:
- The first-generation compute nodes used Intel’s Broadwell processor; some stakeholders have voiced their desire for a newer Intel processor. Intel Skylake processors demonstrate overall higher performance for general HPC workloads, though current pricing from Intel places Skylake nodes at a higher price point than the Broadwell nodes.
- The popularity among stakeholders of large and extra-large memory nodes in the first generation of Caviness seems to indicate that larger memory sizes are preferred, so for the second generation we’ve increased the baseline from 128 GiB to 384 GiB. The large and extra-large sizes would be 768 GiB and 1 TiB, respectively.
- The first-generation GPU nodes used NVIDIA’s Pascal; some stakeholders have requested a newer GPU model. The NVIDIA Volta features 42% more CUDA cores plus tensor cores (not present in the Pascal) that can greatly accelerate AI/ML workloads. The Volta is priced higher and would also increase the per-node price.
It is critical that we arrive at a solid idea of the kinds and quantities of nodes you, the stakeholders, would like to see in this addition to Caviness. We will soon share a Google form with you to gather this information. From there, we will propose and distribute a complete design specification, including pricing, for your review. We are hopeful that this process will be quick enough for the addition to be procured within the first quarter of 2019.
Decommissioning of Mills Compute Nodes
Adding to Caviness will require additional floor space in the data center. Though Mills officially reached its end of support at the start of 2017, it has remained online since then and continues to process user workloads (though in a diminished capacity). To make room for the expansion of Caviness, the remaining Mills compute nodes will be shut down and removed in early 2019.
- The cluster’s head node will be taken offline and upgraded to a newer OS to address security concerns stemming from the age of the OS currently in use. The head node will then be brought back online for a limited period to serve as a file-access gateway to /archive, /lustre/work, and user home directories. Please note that /lustre/scratch will not remain online during this additional access period, so any important data there must be moved prior to the shutdown.
- Workgroups are strongly encouraged to begin planning for the migration of their important data off Mills storage if they have not done so already. Feel free to contact the IT HPC team if you need assistance with that process.
Scheduled Farber Maintenance
In previous years, the month of July has included scheduled downtime of Mills and Farber for the application of OS patches and general cleanup of the systems. In 2018, the construction and deployment of Caviness indefinitely delayed this annual event. Annual patching helps keep these large-scale shared systems properly secured, so beginning on Monday, January 28, Farber will be taken offline for annual maintenance. Users will not have access to the cluster and should expect all active jobs to be killed, since the compute nodes will require reboots. More details on the scope and length of the maintenance period will be made available as we continue to plan for it.