[COMPLETE] DARWIN Capacity and Power Stress Testing
[PENDING] Future Downtime for PDU Replacement
Testing of DARWIN at full capacity was completed at 08:00 this morning, June 10, 2021. All compute nodes were run at 100% with a job that had previously caused circuit breakers in the rack PDUs to trip, and all GPUs were run at 100% with a graphics-rendering workflow. These usage levels were sustained for approximately 36 of the 48 hours of maintenance time.
The previously problematic job demonstrated that the rearranged power distribution in rack 1 (made after earlier overload events in that rack) is sufficient to prevent an overload at full capacity. The PDUs in rack 2 suffered an overload event shortly after the capacity tests commenced, and a rearrangement of the power distribution in that rack was performed on the afternoon of June 8, 2021. One PDU’s circuit breaker was reset and load was returned to it; no further overload events occurred. The other PDU in rack 2 was found to have a broken circuit breaker switch, which requires a full replacement of that PDU. Once the replacement arrives, the PDU will be swapped out, which will require a full outage of half of the equipment in rack 2. IT-RCI staff will communicate additional details as the schedule develops.