Automatic Monitoring of Large-Scale Computing Infrastructure

Bockjoo Kim; Dimitri Bourilkov

doi:10.1051/epjconf/202429507007

EPJ Web of Conferences 295, 07007 (2024)
https://doi.org/10.1051/epjconf/202429507007

Bockjoo Kim^* and Dimitri Bourilkov

Department of Physics, University of Florida, Gainesville, FL 32611, U.S.A.

^* Corresponding author: bockjoo@phys.ufl.edu

Published online: 6 May 2024

Abstract

Modern distributed computing systems produce large amounts of monitoring data. For these systems to operate smoothly, underperforming or failing components must be identified quickly, and preferably automatically, so that system managers can react accordingly. In this contribution, we analyze job and data-transfer records collected during the operation of the LHC computing infrastructure. The monitoring data are harvested from the Elasticsearch database and converted to formats suitable for further processing. Using various machine learning and deep learning techniques, we develop automatic tools for continuous monitoring of the health of the underlying systems. Our initial implementation is based on publicly available deep learning packages, PyTorch or TensorFlow, running on state-of-the-art GPU systems.
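As a rough illustration of the harvest-and-convert step, the sketch below flattens an Elasticsearch-style search response into plain records and flags sites with an unusual job failure rate. It is not the authors' implementation. The field names `site` and `status`, the site labels, and the sample data are hypothetical, and a simple robust z-score (median absolute deviation) stands in for the machine and deep learning models described in the abstract. Only the nesting `hits.hits[]._source` follows the standard Elasticsearch search response layout.

```python
from statistics import median


def flatten_hits(response):
    """Extract the document bodies from an Elasticsearch-style search response."""
    return [hit["_source"] for hit in response["hits"]["hits"]]


def failure_rates(rows):
    """Return the fraction of failed jobs per site (field names are hypothetical)."""
    totals, failed = {}, {}
    for row in rows:
        site = row["site"]
        totals[site] = totals.get(site, 0) + 1
        if row["status"] != "ok":
            failed[site] = failed.get(site, 0) + 1
    return {site: failed.get(site, 0) / n for site, n in totals.items()}


def flag_outliers(rates, threshold=3.5):
    """Flag sites whose rate has a robust z-score (MAD-based) above the threshold."""
    values = list(rates.values())
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # No spread in the data, so no site can be scored as an outlier.
        return []
    return sorted(
        site for site, v in rates.items()
        if 0.6745 * abs(v - med) / mad > threshold
    )


# Hypothetical sample: five sites, ten jobs each, with 0, 1, 2, 3 and 9 failures.
failures_per_site = {"site_A": 0, "site_B": 1, "site_C": 2, "site_D": 3, "site_E": 9}
response = {
    "hits": {
        "hits": [
            {"_source": {"site": site, "status": "failed" if i < n_failed else "ok"}}
            for site, n_failed in failures_per_site.items()
            for i in range(10)
        ]
    }
}

rows = flatten_hits(response)
rates = failure_rates(rows)
flagged = flag_outliers(rates)
print(flagged)
```

In a real deployment the flat records would more likely feed a trained model than a fixed threshold; the threshold of 3.5 here is a conventional choice for MAD-based scores, not a value taken from the paper.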

This is an Open Access article distributed under the terms of the Creative Commons Attribution License 4.0, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
