Loading…

Analysis of CERN computing infrastructure and monitoring data

Optimizing a computing infrastructure on the scale of LHC requires a quantitative understanding of a complex network of many different resources and services. For this purpose the CERN IT department and the LHC experiments are collecting a large multitude of logs and performance probes, which are al...

Full description

Saved in:
Bibliographic Details
Published in:Journal of physics. Conference series 2015-01, Vol.664 (5), p.52029
Main Authors: Nieke, C, Lassnig, M, Menichetti, L, Motesnitsalis, E, Duellmann, D
Format: Article
Language:English
Subjects:
Citations: Items that this one cites
Online Access:Get full text
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:Optimizing a computing infrastructure on the scale of LHC requires a quantitative understanding of a complex network of many different resources and services. For this purpose the CERN IT department and the LHC experiments are collecting a large multitude of logs and performance probes, which are already successfully used for short-term analysis (e.g. operational dashboards) within each group. The IT analytics working group has been created with the goal to bring data sources from different services and on different abstraction levels together and to implement a suitable infrastructure for mid- to long-term statistical analysis. It further provides a forum for joint optimization across single service boundaries and the exchange of analysis methods and tools. To simplify access to the collected data, we implemented an automated repository for cleaned and aggregated data sources based on the Hadoop ecosystem. This contribution describes some of the challenges encountered, such as dealing with heterogeneous data formats, selecting an efficient storage format for map reduce and external access, and will describe the repository user interface. Using this infrastructure we were able to quantitatively analyze the relationship between CPU/wall fraction, latency/throughput constraints of network and disk and the effective job throughput. In this contribution we will first describe the design of the shared analysis infrastructure and then present a summary of first analysis results from the combined data sources.
ISSN:1742-6588
1742-6596
DOI:10.1088/1742-6596/664/5/052029