Fast Columnar Physics Analyses of Terabyte-Scale LHC Data on a Cache-Aware Dask Cluster

The development of an LHC physics analysis involves numerous investigations that require the repeated processing of terabytes of data. Thus, a rapid completion of each of these analysis cycles is central to mastering the science project. We present a solution to efficiently handle and accelerate phy...

Bibliographic Details
Published in: Computing and software for big science 2023-12, Vol.7 (1), Article 3
Main Authors: Eich, Niclas; Erdmann, Martin; Fackeldey, Peter; Fischer, Benjamin; Noll, Dennis; Rath, Yannik
Format: Article
Language: English
cited_by
cites cdi_FETCH-LOGICAL-c2019-b371ae2b5b03fe49d57ff2d7f8359b264c70031010fe126affe231c3a9b50d373
container_end_page
container_issue 1
container_start_page
container_title Computing and software for big science
container_volume 7
creator Eich, Niclas
Erdmann, Martin
Fackeldey, Peter
Fischer, Benjamin
Noll, Dennis
Rath, Yannik
description The development of an LHC physics analysis involves numerous investigations that require the repeated processing of terabytes of data. Thus, a rapid completion of each of these analysis cycles is central to mastering the science project. We present a solution to efficiently handle and accelerate physics analyses on small-size institute clusters. Our solution uses three key concepts: vectorized processing of collision events, the “MapReduce” paradigm for scaling out on computing clusters, and efficiently utilized SSD caching to reduce latencies in IO operations. This work focuses on the latter key concept, its underlying mechanism, and its implementation. Using simulations from a Higgs pair production physics analysis as an example, we achieve an improvement factor of 6.3 in the runtime for reading all input data after one cycle and even an overall speedup of a factor of 14.9 after 10 cycles, reducing the runtime from hours to minutes.
doi_str_mv 10.1007/s41781-023-00095-9
format article
fullrecord <record><control><sourceid>crossref_sprin</sourceid><recordid>TN_cdi_crossref_primary_10_1007_s41781_023_00095_9</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>10_1007_s41781_023_00095_9</sourcerecordid><originalsourceid>FETCH-LOGICAL-c2019-b371ae2b5b03fe49d57ff2d7f8359b264c70031010fe126affe231c3a9b50d373</originalsourceid><addsrcrecordid>eNp9kN1KAzEQRoMoWGpfwKu8QHTy13Qvy2pVKChY8TLMpolt3e5KZovs27ta8dKr-Ri-MzCHsUsJVxLAXZORbiYFKC0AoLCiOGEjZSUIBcac_mU9PWcTot1QUtKABTNirwukjpdtfdg3mPnTpqdtID5vsO4pEm8TX8WMVd9F8Rywjnx5X_Ib7JC3DUdeYthEMf_EHIctvfOyPlAX8wU7S1hTnPzOMXtZ3K7Ke7F8vHso50sRFMhCVNpJjKqyFegUTbG2LiW1dmmmbVGpqQkOQEuQkKJUU0wpKi2DxqKysNZOj5k63g25Jcox-Y-83WPuvQT_bccf7fjBjv-x44sB0keIhnLzFrPftYc8vEz_UV_EP2Xt</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>Fast Columnar Physics Analyses of Terabyte-Scale LHC Data on a Cache-Aware Dask Cluster</title><source>Springer Nature - SpringerLink Journals - Fully Open Access </source><creator>Eich, Niclas ; Erdmann, Martin ; Fackeldey, Peter ; Fischer, Benjamin ; Noll, Dennis ; Rath, Yannik</creator><creatorcontrib>Eich, Niclas ; Erdmann, Martin ; Fackeldey, Peter ; Fischer, Benjamin ; Noll, Dennis ; Rath, Yannik</creatorcontrib><description>The development of an LHC physics analysis involves numerous investigations that require the repeated processing of terabytes of data. Thus, a rapid completion of each of these analysis cycles is central to mastering the science project. We present a solution to efficiently handle and accelerate physics analyses on small-size institute clusters. Our solution uses three key concepts: vectorized processing of collision events, the “MapReduce” paradigm for scaling out on computing clusters, and efficiently utilized SSD caching to reduce latencies in IO operations. 
This work focuses on the latter key concept, its underlying mechanism, and its implementation. Using simulations from a Higgs pair production physics analysis as an example, we achieve an improvement factor of 6.3 in the runtime for reading all input data after one cycle and even an overall speedup of a factor of 14.9 after 10 cycles, reducing the runtime from hours to minutes.</description><identifier>ISSN: 2510-2036</identifier><identifier>EISSN: 2510-2044</identifier><identifier>DOI: 10.1007/s41781-023-00095-9</identifier><language>eng</language><publisher>Cham: Springer International Publishing</publisher><subject>Original Article ; Particle and Nuclear Physics ; Physics ; Physics and Astronomy</subject><ispartof>Computing and software for big science, 2023-12, Vol.7 (1), Article 3</ispartof><rights>The Author(s) 2023</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><cites>FETCH-LOGICAL-c2019-b371ae2b5b03fe49d57ff2d7f8359b264c70031010fe126affe231c3a9b50d373</cites><orcidid>0000-0003-4932-7162</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>314,777,781,27905,27906</link.rule.ids></links><search><creatorcontrib>Eich, Niclas</creatorcontrib><creatorcontrib>Erdmann, Martin</creatorcontrib><creatorcontrib>Fackeldey, Peter</creatorcontrib><creatorcontrib>Fischer, Benjamin</creatorcontrib><creatorcontrib>Noll, Dennis</creatorcontrib><creatorcontrib>Rath, Yannik</creatorcontrib><title>Fast Columnar Physics Analyses of Terabyte-Scale LHC Data on a Cache-Aware Dask Cluster</title><title>Computing and software for big science</title><addtitle>Comput Softw Big Sci</addtitle><description>The development of an LHC physics analysis involves numerous investigations that require the repeated processing of terabytes of data. 
Thus, a rapid completion of each of these analysis cycles is central to mastering the science project. We present a solution to efficiently handle and accelerate physics analyses on small-size institute clusters. Our solution uses three key concepts: vectorized processing of collision events, the “MapReduce” paradigm for scaling out on computing clusters, and efficiently utilized SSD caching to reduce latencies in IO operations. This work focuses on the latter key concept, its underlying mechanism, and its implementation. Using simulations from a Higgs pair production physics analysis as an example, we achieve an improvement factor of 6.3 in the runtime for reading all input data after one cycle and even an overall speedup of a factor of 14.9 after 10 cycles, reducing the runtime from hours to minutes.</description><subject>Original Article</subject><subject>Particle and Nuclear Physics</subject><subject>Physics</subject><subject>Physics and Astronomy</subject><issn>2510-2036</issn><issn>2510-2044</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2023</creationdate><recordtype>article</recordtype><recordid>eNp9kN1KAzEQRoMoWGpfwKu8QHTy13Qvy2pVKChY8TLMpolt3e5KZovs27ta8dKr-Ri-MzCHsUsJVxLAXZORbiYFKC0AoLCiOGEjZSUIBcac_mU9PWcTot1QUtKABTNirwukjpdtfdg3mPnTpqdtID5vsO4pEm8TX8WMVd9F8Rywjnx5X_Ib7JC3DUdeYthEMf_EHIctvfOyPlAX8wU7S1hTnPzOMXtZ3K7Ke7F8vHso50sRFMhCVNpJjKqyFegUTbG2LiW1dmmmbVGpqQkOQEuQkKJUU0wpKi2DxqKysNZOj5k63g25Jcox-Y-83WPuvQT_bccf7fjBjv-x44sB0keIhnLzFrPftYc8vEz_UV_EP2Xt</recordid><startdate>20231201</startdate><enddate>20231201</enddate><creator>Eich, Niclas</creator><creator>Erdmann, Martin</creator><creator>Fackeldey, Peter</creator><creator>Fischer, Benjamin</creator><creator>Noll, Dennis</creator><creator>Rath, Yannik</creator><general>Springer International 
Publishing</general><scope>C6C</scope><scope>AAYXX</scope><scope>CITATION</scope><orcidid>https://orcid.org/0000-0003-4932-7162</orcidid></search><sort><creationdate>20231201</creationdate><title>Fast Columnar Physics Analyses of Terabyte-Scale LHC Data on a Cache-Aware Dask Cluster</title><author>Eich, Niclas ; Erdmann, Martin ; Fackeldey, Peter ; Fischer, Benjamin ; Noll, Dennis ; Rath, Yannik</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c2019-b371ae2b5b03fe49d57ff2d7f8359b264c70031010fe126affe231c3a9b50d373</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2023</creationdate><topic>Original Article</topic><topic>Particle and Nuclear Physics</topic><topic>Physics</topic><topic>Physics and Astronomy</topic><toplevel>online_resources</toplevel><creatorcontrib>Eich, Niclas</creatorcontrib><creatorcontrib>Erdmann, Martin</creatorcontrib><creatorcontrib>Fackeldey, Peter</creatorcontrib><creatorcontrib>Fischer, Benjamin</creatorcontrib><creatorcontrib>Noll, Dennis</creatorcontrib><creatorcontrib>Rath, Yannik</creatorcontrib><collection>SpringerOpen</collection><collection>CrossRef</collection><jtitle>Computing and software for big science</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Eich, Niclas</au><au>Erdmann, Martin</au><au>Fackeldey, Peter</au><au>Fischer, Benjamin</au><au>Noll, Dennis</au><au>Rath, Yannik</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Fast Columnar Physics Analyses of Terabyte-Scale LHC Data on a Cache-Aware Dask Cluster</atitle><jtitle>Computing and software for big science</jtitle><stitle>Comput Softw Big Sci</stitle><date>2023-12-01</date><risdate>2023</risdate><volume>7</volume><issue>1</issue><artnum>3</artnum><issn>2510-2036</issn><eissn>2510-2044</eissn><abstract>The development of an LHC physics analysis involves numerous 
investigations that require the repeated processing of terabytes of data. Thus, a rapid completion of each of these analysis cycles is central to mastering the science project. We present a solution to efficiently handle and accelerate physics analyses on small-size institute clusters. Our solution uses three key concepts: vectorized processing of collision events, the “MapReduce” paradigm for scaling out on computing clusters, and efficiently utilized SSD caching to reduce latencies in IO operations. This work focuses on the latter key concept, its underlying mechanism, and its implementation. Using simulations from a Higgs pair production physics analysis as an example, we achieve an improvement factor of 6.3 in the runtime for reading all input data after one cycle and even an overall speedup of a factor of 14.9 after 10 cycles, reducing the runtime from hours to minutes.</abstract><cop>Cham</cop><pub>Springer International Publishing</pub><doi>10.1007/s41781-023-00095-9</doi><orcidid>https://orcid.org/0000-0003-4932-7162</orcidid><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 2510-2036
ispartof Computing and software for big science, 2023-12, Vol.7 (1), Article 3
issn 2510-2036
2510-2044
language eng
recordid cdi_crossref_primary_10_1007_s41781_023_00095_9
source Springer Nature - SpringerLink Journals - Fully Open Access
subjects Original Article
Particle and Nuclear Physics
Physics
Physics and Astronomy
title Fast Columnar Physics Analyses of Terabyte-Scale LHC Data on a Cache-Aware Dask Cluster
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-18T22%3A27%3A54IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-crossref_sprin&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Fast%20Columnar%20Physics%20Analyses%20of%20Terabyte-Scale%20LHC%20Data%20on%20a%20Cache-Aware%20Dask%20Cluster&rft.jtitle=Computing%20and%20software%20for%20big%20science&rft.au=Eich,%20Niclas&rft.date=2023-12-01&rft.volume=7&rft.issue=1&rft.artnum=3&rft.issn=2510-2036&rft.eissn=2510-2044&rft_id=info:doi/10.1007/s41781-023-00095-9&rft_dat=%3Ccrossref_sprin%3E10_1007_s41781_023_00095_9%3C/crossref_sprin%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c2019-b371ae2b5b03fe49d57ff2d7f8359b264c70031010fe126affe231c3a9b50d373%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true