## Archive for October 19, 2014

### Enhanced Higgs to tau+tau- Search with Deep Learning

“Enhanced Higgs to tau+tau- Search with Deep Learning” is the title of a new article posted to the arXiv this week by Daniel Whiteson and two collaborators from the Computer Science Department at UC Irvine (arXiv:1410.3469). While the title may be totally obscure to someone outside of collider physics, it caught my attention immediately because I am working on a similar project (to be released soon).

Briefly: the physics motivation comes from the need for a stronger signal for Higgs decays to τ⁺τ⁻, which are important for testing the Higgs couplings to fermions (specifically, leptons). The scalar particle with a mass of 125 GeV looks very much like the standard model Higgs boson, but tests of couplings, which are absolutely crucial, are not very precise yet. In fact, indirect constraints are stronger than direct ones at the present time. So boosting the sensitivity of the LHC data to Higgs decays to fermions is an important task.

The meat of the article concerns the comparison of *shallow* artificial neural networks, which contain only one or two hidden layers, and *deep* artificial neural networks, which have many. Deep networks are harder to work with than shallow ones, so the question is: does one really gain anything? The answer is: yes, it's like increasing your luminosity by 25%.
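A rough check of that 25% figure is sketched below. It rests on two assumptions that are mine and not stated in the article: that the expected significance grows as the square root of the integrated luminosity (a common rule of thumb), and that the comparison is between the shallow and deep networks trained on all variables, whose significances are quoted later in this post. The function name is made up for illustration.

```python
def equivalent_luminosity_gain(z_new: float, z_old: float) -> float:
    """Fractional luminosity increase that would lift z_old to z_new.

    Assumption: significance scales as sqrt(luminosity), so the
    luminosity ratio is the square of the significance ratio.
    """
    return (z_new / z_old) ** 2 - 1.0


# Expected significances quoted in this post (all input variables).
shallow_all = 3.02
deep_all = 3.37

gain = equivalent_luminosity_gain(deep_all, shallow_all)
print(f"equivalent luminosity gain: {gain:.1%}")
```

Under these assumptions the gain comes out just under 25%, in line with the number above.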

This case study considers final states with two oppositely charged leptons (e or μ) and missing transverse energy. The Higgs signal must be separated from the Drell-Yan production of τ pairs, especially Z→τ⁺τ⁻, on a statistical basis. It appears that no other backgrounds (such as W pair or top pair production) were considered, so this study is a purely technical one. Nonetheless, there is plenty to be learned from it.

Whiteson, Baldi and Sadowski make a distinction between *low-level variables*, which include the basic kinematic observables for the leptons and jets, and the *high-level variables*, which include derived kinematic quantities such as invariant masses, differences in angles and pseudorapidity, sphericity, etc. I think this distinction and the way they compare the impact of the two sets is interesting.
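To make the distinction concrete, here is a minimal sketch of how two typical high-level quantities (a dilepton invariant mass and an azimuthal angle difference) are built from low-level ones (transverse momentum, pseudorapidity, azimuth). The function names and the lepton values are invented for illustration; this is not the authors' code or their exact variable list.

```python
import math


def four_vector(pt, eta, phi, mass=0.0):
    """Return (E, px, py, pz) in GeV from collider coordinates."""
    px = pt * math.cos(phi)
    py = pt * math.sin(phi)
    pz = pt * math.sinh(eta)
    energy = math.sqrt(px**2 + py**2 + pz**2 + mass**2)
    return energy, px, py, pz


def invariant_mass(p1, p2):
    """Invariant mass of the two-particle system, m^2 = E^2 - |p|^2."""
    e, px, py, pz = (a + b for a, b in zip(p1, p2))
    m2 = e**2 - px**2 - py**2 - pz**2
    return math.sqrt(max(m2, 0.0))  # guard against tiny negative rounding


def delta_phi(phi1, phi2):
    """Azimuthal separation wrapped into [-pi, pi]."""
    d = phi1 - phi2
    return math.atan2(math.sin(d), math.cos(d))


# Hypothetical low-level inputs (pt [GeV], eta, phi) for two leptons.
lep1 = (45.0, 0.3, 0.1)
lep2 = (38.0, -0.7, 2.9)

m_ll = invariant_mass(four_vector(*lep1), four_vector(*lep2))
dphi = delta_phi(lep1[2], lep2[2])
print(f"m_ll = {m_ll:.1f} GeV, delta_phi = {dphi:.2f}")
```

Both outputs are deterministic functions of the six inputs, which is exactly why one might expect a sufficiently flexible network to find them on its own.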

The question is: if a sophisticated artificial neural network is able to develop complex functions of the low-level variables through training and optimization, isn’t it redundant to provide derived kinematic quantities as additional inputs? More sharply: does the neural network need “human assistance” to do its job?

The answer is clear: human assistance does help the performance of even a deep neural network with thousands of neurons and millions of events for training. Personally I am not surprised by this, because there is physics insight behind most if not all of the high-level variables — they are not just arbitrary functions of the low-level variables. So these specific functions carry physics meaning and fall somewhere between arbitrary functions of the input variables and brand new information (or features). I admit, though, that “physics meaning” is a nebulous concept and my statement is vague…

The authors applied state-of-the-art techniques for this study, including optimization with respect to *hyperparameters*, i.e., the parameters that concern the details of the training of the neural network (learning speed, “velocity” and network architecture). A lot of computer cycles were burnt to carry out these comparisons!

Deep neural networks might seem like an obvious way to go when trying to isolate rare signals. There are real, non-trivial stumbling blocks, however. An important one is the *vanishing gradient* problem. If the number of hidden nodes is large (imagine eight layers with 500 neurons each) then training by back-propagation fails because it cannot find a significantly non-zero gradient with respect to the weights and offsets of all the neurons. If the gradient vanishes, then the neural network cannot figure out which way to evolve so that it performs well. Imagine a vast flat space with a minimum that is localized and far away. How can you figure out which way to go to get there if the region where you are is nearly perfectly flat?
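A toy illustration of how the gradient shrinks with depth is sketched below. It is a deliberately crude stand-in, assuming a chain of one-neuron sigmoid layers with a shared weight and bias, which is not the architecture used in the article. The chain rule multiplies one factor of weight × sigmoid′(z) per layer, and since sigmoid′(z) never exceeds 0.25, the product collapses quickly. The same product appears as a factor in the back-propagated gradient of the first layer's weight.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def input_gradient(depth, weight=1.0, bias=0.0, x=0.5):
    """d(output)/d(input) for a chain of `depth` one-neuron sigmoid layers.

    Toy assumption: every layer shares the same weight and bias.
    """
    activation = x
    gradient = 1.0
    for _ in range(depth):
        z = weight * activation + bias
        activation = sigmoid(z)
        # Chain rule factor: weight * sigmoid'(z), with sigmoid'(z) = s(1 - s) <= 0.25
        gradient *= weight * activation * (1.0 - activation)
    return gradient


for depth in (1, 2, 4, 8):
    print(f"depth {depth}: gradient = {input_gradient(depth):.3e}")
```

With unit weights the gradient after eight layers is bounded by 0.25⁸, roughly 1.5 × 10⁻⁵ of its starting value, which is the flat landscape described above.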

The power of a neural network can be assessed on the basis of the receiver operating characteristic (ROC) curve by integrating the area beneath the curve. For particle physicists, however, the common coinage is the expected statistical significance of a hypothetical signal, so Whiteson & co translate the performance of their networks into a discovery significance defined by a number of standard deviations. Notionally, a shallow neural network working only with low-level variables would achieve a significance of 2.57σ, while adding in the high-level variables increases the significance to 3.02σ. In contrast, the deep neural networks achieve 3.16σ with low-level, and 3.37σ with all variables.
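For readers who have not met the area under the ROC curve before, here is a minimal sketch of one way to compute it. It uses the fact that the area equals the probability that a randomly chosen signal event receives a higher network output than a randomly chosen background event, with ties counted as one half. The function name and the scores are invented for illustration and are not taken from the article.

```python
def roc_auc(signal_scores, background_scores):
    """Area under the ROC curve via pairwise comparison of scores.

    Quadratic in the number of events, so only suitable for small samples.
    """
    wins = 0.0
    for s in signal_scores:
        for b in background_scores:
            if s > b:
                wins += 1.0
            elif s == b:
                wins += 0.5
    return wins / (len(signal_scores) * len(background_scores))


# Hypothetical classifier outputs in [0, 1].
signal = [0.9, 0.4]
background = [0.6, 0.1]
print(roc_auc(signal, background))
```

A value of 0.5 corresponds to random guessing and 1.0 to perfect separation. Turning such a curve into a discovery significance additionally requires the expected signal and background yields, which this sketch does not attempt.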

Some conclusions are obvious: deep is better than shallow. Also, adding in the high-level variables helps in both cases. (Whiteson et al. point out that the high-level variables incorporate the τ mass, which otherwise is unavailable to the neural networks.) The deep network with low-level variables is better than a shallow network with all variables, and the authors conclude that the deep artificial neural network is learning something that is not embodied in the human-inspired high-level variables. I am not convinced of this claim, since it is not clear to me that the improvement is not simply due to the inadequacy of the shallow network for the task. By way of an analogy, if we needed to approximate an exponential curve by a linear one, we would not succeed unless the range was very limited; we should not be surprised if a quadratic approximation is better.
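The analogy can be put into numbers with a small sketch. As a simplifying assumption it uses the Taylor polynomials of exp(x) around zero rather than best-fit curves, and it says nothing about neural networks themselves; the helper names are made up.

```python
import math


def max_error(approx, half_width, samples=1001):
    """Largest |exp(x) - approx(x)| on a grid over [-half_width, half_width]."""
    worst = 0.0
    for i in range(samples):
        x = -half_width + 2.0 * half_width * i / (samples - 1)
        worst = max(worst, abs(math.exp(x) - approx(x)))
    return worst


def linear(x):
    return 1.0 + x


def quadratic(x):
    return 1.0 + x + 0.5 * x * x


for half_width in (0.1, 1.0, 2.0):
    print(
        f"range +/-{half_width}: "
        f"linear error = {max_error(linear, half_width):.4f}, "
        f"quadratic error = {max_error(quadratic, half_width):.4f}"
    )
```

On a narrow range both approximations are fine; as the range widens the linear one fails first. The more flexible model wins, but that by itself does not show that it has discovered anything new about the exponential.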

In any case, since I am working on similar things, I find this article very interesting. It is clear that the field is moving in the direction of very advanced numerical techniques, and this is one fruitful direction to go in.