The CMS Big Data Project explores the applicability of open source data analytics toolkits to the HEP data analysis challenge

Experimental Particle Physics has been at the forefront of analyzing the world’s largest datasets for decades. The HEP community was amongst the first to develop suitable software and computing tools for this task.

In recent years, new open source toolkits and systems, collectively called “Big Data” technologies, have emerged to support the analysis of petabyte- and exabyte-scale datasets. While the principles of data analysis in HEP have not changed (filtering and transforming experiment-specific data formats), these technologies take different approaches, offer a fresh look at the analysis of very large datasets, and could reduce the time-to-physics through increased interactivity.

The CMS Big Data Project explores these new open source data analytics toolkits and has the following goals:

  • Reduce time-to-physics
  • Educate our graduate students and postdocs to use industry-based technologies
    • Improves chances on the job market outside academia
    • Increases the attractiveness of our field
  • Use tools developed in larger communities reaching outside of our field

The starting point is Apache Spark, and we are working on the following thrusts:

All thrusts are based on reproducing a CMS physics analysis with open source data analytics frameworks in various settings built around Apache Spark (a minimal, illustrative sketch of this kind of workload is given below). We plan to use the following metrics to characterize the performance of the solutions:
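For illustration only, the snippet below sketches the filter-and-transform pattern such an analysis involves in PySpark: selecting opposite-charge muon pairs from a flat ntuple and histogramming their invariant mass. The input path, column names, and selection cuts are hypothetical placeholders and do not correspond to the actual CMS analysis reproduced in this project.

```python
# Minimal, hypothetical PySpark sketch of a filter-and-transform HEP workload.
# Dataset path, column names, and cuts are placeholders for illustration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cms-bigdata-sketch").getOrCreate()

# Assume a Parquet ntuple with one row per event and per-muon kinematic columns.
events = spark.read.parquet("/path/to/muon_ntuple.parquet")

# Filter: two opposite-charge muons above a pT threshold.
selected = events.filter(
    (F.col("mu1_pt") > 20.0) & (F.col("mu2_pt") > 20.0) &
    (F.col("mu1_charge") * F.col("mu2_charge") < 0)
)

# Transform: dimuon invariant mass from pT, eta, phi (massless approximation).
dimuon = selected.withColumn(
    "mass",
    F.sqrt(
        2.0 * F.col("mu1_pt") * F.col("mu2_pt") *
        (F.cosh(F.col("mu1_eta") - F.col("mu2_eta")) -
         F.cos(F.col("mu1_phi") - F.col("mu2_phi")))
    )
)

# Aggregate: a coarse histogram of the mass spectrum in 2 GeV bins.
histogram = (dimuon
             .withColumn("mass_bin", F.floor(F.col("mass") / 2.0) * 2.0)
             .groupBy("mass_bin").count()
             .orderBy("mass_bin"))
histogram.show()
```

Expressing the selection and the derived quantity as DataFrame column operations lets Spark distribute the read-filter-transform-aggregate chain across a cluster; it is this end-to-end chain, run in the various settings above, that the performance metrics are meant to characterize.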

A collection of GitHub repositories can be found here: