Towards the integrated ALICE Online-Offline (O2) monitoring subsystem
1 European Organization for Nuclear Research (CERN)
2 IRI, Goethe University Frankfurt, Frankfurt, Germany
3 National Institute for Nuclear Physics (INFN), Bari, Italy
* Corresponding author: firstname.lastname@example.org
Published online: 17 September 2019
ALICE (A Large Ion Collider Experiment) is preparing for a major upgrade of its detector, readout and computing systems for LHC Run 3. A new facility called O2 (Online-Offline) will play a major role in data compression and event processing. To operate the experiment efficiently, we are designing a monitoring subsystem that will provide a complete overview of the overall health of O2 and detect performance degradation and component failures. The monitoring subsystem will receive and collect performance metrics at a rate of up to 600 kHz. It consists of a custom monitoring library and distributed server-side software covering five main functional tasks: parameter collection, processing, storage, visualisation and alarms. To select the most appropriate tools for these tasks, we evaluated three options: “Modular Stack”, Zabbix and MonALISA, the tool currently used for ALICE Grid monitoring. The first option, the Modular Stack, is a toolkit comprising collectd, Apache Flume, Apache Spark, InfluxDB, Grafana and Riemann. This paper describes the functional architecture of the monitoring subsystem. It presents a complete evaluation of the three options, the selection process, the risk assessment and the justification for the final decision. The in-depth comparison covers functional features and throughput measurements, which verify that each option can deliver the required processing and storage performance.
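As an illustration of what a single performance metric might look like on its way from a monitoring library to a time-series store, the following minimal Python sketch serialises one metric into the InfluxDB line protocol, the text format that InfluxDB accepts for writes. It is not the O2 monitoring library: the function name `to_line_protocol`, the metric name `cpu load` and the tag value `flp-01` are hypothetical and chosen only for the example.

```python
def _escape(value: str, chars: str) -> str:
    """Backslash-escape each character of `chars` that occurs in `value`."""
    for ch in chars:
        value = value.replace(ch, "\\" + ch)
    return value


def _format_field(value) -> str:
    """Render a field value using the line protocol type conventions."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integers carry an 'i' suffix
    if isinstance(value, float):
        return repr(value)
    # Strings are double-quoted; backslashes and quotes are escaped.
    text = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{text}"'


def to_line_protocol(measurement: str, tags: dict, fields: dict,
                     timestamp_ns: int) -> str:
    """Build one line: measurement[,tag=value...] field=value[,...] timestamp."""
    if not fields:
        raise ValueError("at least one field is required")
    head = _escape(measurement, ", ")
    for key in sorted(tags):  # sorted tag keys keep the output deterministic
        head += "," + _escape(key, ",= ") + "=" + _escape(str(tags[key]), ",= ")
    body = ",".join(
        _escape(key, ",= ") + "=" + _format_field(val)
        for key, val in fields.items()
    )
    return f"{head} {body} {timestamp_ns}"


# Hypothetical metric: names and values are assumptions for illustration.
line = to_line_protocol(
    "cpu load",
    {"host": "flp-01"},
    {"value": 0.5, "count": 3},
    1568678400000000000,
)
print(line)
```

At the quoted peak rate of 600 kHz, a collector would have to handle 600 000 such lines per second, which is why throughput is part of the comparison alongside functional features.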
© The Authors, published by EDP Sciences, 2019
This is an Open Access article distributed under the terms of the Creative Commons Attribution License 4.0, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.