A Feasibility Study on workload integration between HT-Condor and Slurm Clusters

R. Du; J. Shi; J. Zou; X. Jiang; Z. Sun; G. Chen

doi:10.1051/epjconf/201921408004

EPJ Web of Conferences 214, 08004 (2019)
https://doi.org/10.1051/epjconf/201921408004

R. Du^*, J. Shi^**, J. Zou^***, X. Jiang^****, Z. Sun^† and G. Chen^‡

Institute of High Energy Physics, Chinese Academy of Sciences, Beijing, China, 100049

^* e-mail: duran@ihep.ac.cn
^** e-mail: shijy@ihep.ac.cn
^*** e-mail: zoujh@ihep.ac.cn
^**** e-mail: jiangxw@ihep.ac.cn
^† e-mail: sunzy@ihep.ac.cn
^‡ e-mail: gang.chen@ihep.ac.cn

Published online: 17 September 2019

Abstract

Two production clusters coexist at the Institute of High Energy Physics (IHEP). One is a High Throughput Computing (HTC) cluster with HTCondor as the workload manager; the other is a High Performance Computing (HPC) cluster with Slurm as the workload manager. The resources of the HTCondor cluster are funded by multiple experiments, and resource utilization has reached more than 90% through a dynamic resource-sharing mechanism. Nevertheless, a bottleneck arises when multiple experiments request additional resources at the same moment. On the other hand, parallel jobs running on the Slurm cluster have specific attributes, such as a high degree of parallelism, a low job count, and long wall time. These attributes tend to leave free resource slots that are suitable for jobs from the HTCondor cluster. As a result, a mechanism that transparently schedules jobs from the HTCondor cluster to the Slurm cluster would improve the resource utilization of the Slurm cluster and reduce job queue time for the HTCondor cluster. In these proceedings, we present three methods to migrate HTCondor jobs to the Slurm cluster and conclude that HTCondor-C is the preferred one. Furthermore, because the design philosophies and application scenarios of HTCondor and Slurm differ, some issues and possible solutions related to job scheduling are presented.

© The Authors, published by EDP Sciences, 2019

This is an Open Access article distributed under the terms of the Creative Commons Attribution License 4.0, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.