The Journal of Brief Ideas

Analysing data from an experiment at the Large Hadron Collider at CERN requires large amounts of computing power.

In addition to the experimental data itself, each individual analysis requires large amounts of simulated data. Producing this simulated data is the single largest consumer of CPU resources on the LHC computing grid [Atlas Simulation, LHCb Computing Resource usage in 2014]. Producing enough simulated data is already challenging and will become a limiting factor in the future, with the size of experimental datasets poised to increase by a factor of O(10) to O(100).

In individual analyses, sophisticated features are computed from the basic quantities measured in the detectors. These features improve the sensitivity of an analysis but require a large amount of CPU time [Mtop matrix element, Measurement of the Single Top Quark Production Cross Section at CDF, Matrix element for Higgs, Missing mass calculator]. This CPU cost limits the number of cases in which these techniques are used, resulting in sub-optimal analyses.

I propose replacing these calculations with a regression model [Greedy Function Approximation: A Gradient Boosting Machine, Scikit-learn: Machine Learning in Python] that is cheap to evaluate. The full simulation or the computation of sophisticated features is performed only on a subset of events, which are then used to train the regression model. The trained model is then used to perform the computation for a large number of events. A proof of concept for the Missing mass calculator, using the dataset from the ATLAS Higgs Boson Machine Learning Challenge, is available: Learning expensive functions.
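
A minimal sketch of this workflow, assuming Python with NumPy and scikit-learn. The function `expensive_feature` is a hypothetical stand-in for a costly calculation (it is not the Missing mass calculator), the events are random toy data, and the sample sizes and hyperparameters are illustrative choices rather than values taken from the proof of concept.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor


def expensive_feature(events):
    """Hypothetical stand-in for a CPU-intensive per-event computation."""
    px, py, pz = events[:, 0], events[:, 1], events[:, 2]
    return np.sqrt(px**2 + py**2) * np.cosh(0.1 * pz)


# Toy "events": three basic quantities per event (assumption: random data).
rng = np.random.default_rng(0)
events = rng.normal(size=(6000, 3))

# Run the expensive computation on a small subset only.
n_train = 1500
train_events, other_events = events[:n_train], events[n_train:]
train_targets = expensive_feature(train_events)

# Train a gradient boosting regression model on that subset.
model = GradientBoostingRegressor(
    n_estimators=200, max_depth=3, learning_rate=0.1, random_state=0
)
model.fit(train_events, train_targets)

# Use the cheap model in place of the expensive computation elsewhere.
approx = model.predict(other_events)

# Validation only: in this toy setting the exact answer is affordable,
# so the approximation error can be measured directly.
exact = expensive_feature(other_events)
rel_rms = np.sqrt(np.mean((approx - exact) ** 2)) / np.std(exact)
print(f"RMS error relative to the spread of the feature: {rel_rms:.3f}")
```

In a real analysis the exact values would not be available for all events, so the approximation quality would have to be checked on a held-out part of the subset on which the full computation was run.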