By Kyle Cranmer, Tim Head, Jean-Roch Vlimant, Vladimir Gligorov, Maurizio Pierini, Gilles Louppe, Andrey Ustyuzhanin, Balázs Kégl, Peter Elmer, Juan Pavez, Amir Farbin, Sergei Gleyzer, Steven Schramm, Lukas Heinrich, Michael Williams, Christian Lorenz Müller, Daniel Whiteson, Peter Sadowski, Pierre Baldi

Discussions at recent workshops have made it clear that one of the key barriers to collaboration between high energy physics and the machine learning community is access to training data. Recent successes in data sharing through the HiggsML and Flavours of Physics Kaggle challenges have borne much fruit, but required significant effort to coordinate.

While static simulated datasets are useful for challenges, research into new machine learning techniques benefits from being able to generate training data on demand (e.g. Refs. 1, 2, 3).
We therefore recommend that efforts be made to produce the ingredients required to facilitate such collaboration:

  • Specific challenges for HEP experiments should be fully specified such that minimal domain-specific knowledge is required to attack them.
  • Stand-alone simulators should be made open source. They should be easy to use without domain-specific expertise, while still being representative of real experimental challenges. Such simulators would permit non-HEP researchers to generate realistic HEP datasets for training and testing. They could range from truth-level simulation of a hard scattering, to fast simulation such as Delphes, to full GEANT4 simulation of sensor arrays.
  • Performance metrics (objective functions) and operational constraints should be defined to evaluate proposed solutions.
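
As a rough illustration of how the three ingredients above could fit together, the sketch below pairs a toy on-demand event generator with a fixed performance metric. Everything in it is an assumption made for illustration: the names generate_events and roc_auc are hypothetical, the "physics" is a Gaussian peak on an exponentially falling spectrum with arbitrary parameter values, and ROC AUC is only one possible choice of metric. It is not a stand-in for Delphes or GEANT4.

```python
import random
from typing import Iterator, List, Sequence, Tuple

# (toy mass-like observable, label) with label 1 = signal, 0 = background
Event = Tuple[float, int]


def generate_events(n: int, signal_fraction: float = 0.5,
                    seed: int = 0) -> Iterator[Event]:
    """Yield n toy labelled events on demand (hypothetical simulator)."""
    rng = random.Random(seed)  # seeded so that datasets are reproducible
    for _ in range(n):
        if rng.random() < signal_fraction:
            # Toy signal: Gaussian peak; 125 and 5 are illustrative values.
            yield rng.gauss(125.0, 5.0), 1
        else:
            # Toy background: exponentially falling spectrum above 100.
            yield 100.0 + rng.expovariate(1.0 / 30.0), 0


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via the rank-sum (Mann-Whitney) formula."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks: List[float] = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Find the run of tied scores and give each the average rank.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one signal and one background event")
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def baseline_score(mass: float) -> float:
    """Trivial baseline: events closer to the toy peak score higher."""
    return -abs(mass - 125.0)


if __name__ == "__main__":
    events = list(generate_events(10_000, seed=42))
    scores = [baseline_score(m) for m, _ in events]
    labels = [y for _, y in events]
    print(f"baseline ROC AUC: {roc_auc(scores, labels):.3f}")
```

In this arrangement a non-HEP researcher would only need to call the generator for fresh training data and report the agreed metric; the domain knowledge stays inside the simulator and the challenge specification.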


Comment from Sebastien Binet, who is having technical problems:

C++ frameworks are notoriously difficult to compile, install and
distribute (and are a pain to set up and/or time-consuming because one
has to track hidden dependencies, find the right compiler, etc.).
Python frameworks are relatively easy to deploy (pip install foo,
conda install bar, etc.) but slow (and I don't think there is a
python(2|3) framework that does (fast) simulation, because python).

Kyle Cranmer · 31 Mar, 2016

Perhaps consider a Go-based (fast) simulation application (such as,
e.g., fads)?
Go packages are easy to install (go get) and Go
binaries are fast.

Also: what is the exchange data format in vogue in ML? ARFF? CSV? NPy? HDF5?

Kyle Cranmer · 31 Mar, 2016