https://doi.org/10.1051/epjconf/202024507062
An automatic solution to make HTCondor more stable and easier
Computing Center, Institute of High Energy Physics, Chinese Academy of Science
Published online: 16 November 2020
HTCondor has been widely adopted by HEP clusters to provide high-level scheduling performance. Unlike other schedulers, HTCondor provides loose management of the worker nodes. We developed a maintenance automation tool called “HTCondor MAT” that focuses on dynamic resource management and automatic error handling. A central database records all worker node information, which is sent to the worker node for the startd configuration. If an error happens for the worker node, the node information stored in the database is updated and the worker node is reconfigured with the new node information. The new configuration stops the startd from accepting error-related jobs until the worker node recovers. The MAT has been deployed in the IHEP HTC cluster to provide a central way to manage the worker nodes and remove the impacts of errors on the worker nodes automatically.
© The Authors, published by EDP Sciences, 2020
This is an Open Access article distributed under the terms of the Creative Commons Attribution License 4.0, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.