## Archive for October, 2014

### The CMS kinematic edge

Does CMS observe an excess that corresponds to a signal for Supersymmetry? Opinions differ.

This week a paper appeared with the statement “…CMS has reported an intriguing excess of events with respect to the ones expected in the SM” (Huang & Wagner, arXiv:1410.4998). And last month another phenomenology paper appeared with the title “Interpreting a CMS lljjpTmiss Excess With the Golden Cascade of the MSSM” (Allanach, Raklev and Kvellestad arXiv:1409.3532). Both studies are based on a preliminary CMS report (CMS PAS SUS-12-019, Aug. 24 2014), which ends with the statement *“We do not observe evidence for a statistically significant signal.”*

What is going on here?

The CMS search examines the di-lepton invariant mass distribution for a set of events which have, in addition to two energetic and isolated leptons, missing transverse momentum (pTmiss) and two jets. In cascade decays of SUSY particles, χ02 → χ01 l+l-, a kind of hard edge appears at the phase space limit for the two leptons l+l-, as pointed out many years ago (PRD 55 (1997) 5520). When distinguishing a signal from background, a sharp edge is almost as good as a peak, so this is a nice way to isolate a signal if one exists. An edge can also be observed in decays of sleptons. The CMS search is meant to be as general as possible.

In order to remove the bulk of SM events producing a pair of leptons, a significant missing energy is required, as expected from a pair of neutralinos χ01. Furthermore, other activity in the event is expected (recall that there will be *two* supersymmetric particles in the event) so the search demands at least two jets. Hence: ll+jj+pTmiss.

A crucial feature of the analysis is motivated by the phenomenology of SUSY cascade decays: for the signal, the two leptons will have the same flavor (ee or μμ), while most of the SM backgrounds will be flavor blind (so eμ also is expected). By comparing the Mll spectrum for same-flavor and for opposite-flavor leptons, an excess might be observed with little reliance on simulations. Only the Drell-Yan background does not appear in the opposite-flavor sample at the same level as in the same-flavor sample, but a hard cut on pTmiss (also called ETmiss) removes most of the DY background. (I am glossing over careful and robust measurements of the relative e and μ reconstruction and trigger efficiencies – see the note for those details, especially Section 4.)

The CMS analyzers make an important distinction between “central” leptons with |η|<1.4 and "forward" leptons 1.6<|η|<2.4 motivated by the idea that supersymmetric particles will be centrally produced due to their high mass, and an excess may be more pronounced when both leptons are central.

A search for a kinematic edge proceed just as you would expect – a series of fits is performed with the edge placed at different points across a wide range of invariant mass Mll. The model for the Mll spectrum has three components: the flavor-symmetric backgrounds, dominated by tt, the DY background and a hypothetical signal. Both the flavor-symmetric and the DY components are described by heuristic analytical functions with several free parameters. The signal is a triangle convolved with a Gaussian to represent the resolution on Mll. Most of the model parameters are determined in fits with enhanced DY contributions, and through the simultaneous fit to the opposite-flavor sample. For the search, only three parameters are free: the signal yield in the central and forward regions and the position of the kinematic edge.

The best fitted value for the edge is y = 78.7±1.4 GeV. At that point, an excess is observed with a local statistical significance of 2.4σ, in the central region. There is no excess in the forward region. Here is the plot:

The green triangle represents the fitted signal. The red peak is, of course, the Z resonance. Here is the distribution for the forward region:

Comparing the two distributions and ignoring the Z peak, there does indeed seem to be an excess of ee and μμ pairs for Mll < 80 GeV or so. One can understand why the theorists would find this intriguing…

CMS made a second treatment of their data by defining a mass region 20 < Mll < 70 GeV and simply counting the number of events, thereby avoiding any assumptions about the shape of a signal. For this treatment, one wants to compare the data to the prediction, with suitable corrections for efficiencies, etc. Here are the plots:

By eye one can notice a tendency of the real data (dots) to fall above the prediction (solid line histogram). This tendency is much stronger for the events with two central leptons compared to the events with at least one forward lepton. Counting, CMS reports 860 observed compared to 730±40 predicted (central) and 163 observed for 157±16 forward. The significance is 2.6σ for the central di-leptons.

CMS provides a kind of teaser plot, in which they simulate three signals from the production of sbottom squarks. As you can see here, two of the models describe the apparent excess well:

So why is this not a discovery?

First of all, the statistical significance is under 3σ so formally speaking, this not even “evidence.” More importantly, the “look-elsewhere effect” has not been taken into account, as stated clearly in the CMS note. In other words, the significance for the fit is 2.4σ *when you choose 78.7 GeV for the position of the edge.* If you allow for any position of the edge within some wide range of Mll, then the chance that you observe an excess *somewhere in that range* is much greater than 1%. Similarly, the counting excess is 2.6σ for the specific box 20 < Mll < 70 GeV, but if you consider many different boxes, the chance to observe such a single excess *somewhere* is not so small. For this reason, the CMS Collaboration states that there is no statistically significant excess.

That said, the agreement of the simulated sbottom signals with the data is striking, even granted that there are many free parameters here that one can tune to get a good description. The Allanach et al. paper reports a scan in MSSM parameter space to find which physical SUSY parameters are consistent with the CMS data. They impose a few theoretical constraints that are not crucial at this stage, and perform a scan with simulated signals to see which ranges of parameters reproduce the CMS data and also respect constraints from other ATLAS and CMS searches. An example model is shown here:

Wide ranges of sparticle masses are allowed, but there are strong constraints among them coming from the well-measured position of the edge. Constraints from (g-1)μ are respected and the relic density is good. Best of all, prospects for discovering one of these models at Run 2 are good — if such a signal really is present of course.

The Huang & Wagner paper focuses on the sbottom scenario mentioned in the CMS paper, and does a more detailed and refined analysis. They define two similar scenarios, here is the scheme for the first one:

They do not perform a scan of parameter space; rather, they select model parameters by hand to provide pertinent examples. They specifically focus on the relic density and (g-2)μ to make sure they their model can explain these facts. They explain their reasoning clearly in their paper. Their hand-tuned model does a nice job matching the CMS data. Of course, it also evades direct searches for sbottoms by both ATLAS and CMS.

What about the ATLAS 8 TeV data? For now, we must wait.

### Enhanced Higgs to tau+tau- Search with Deep Learning

“Enhanced Higgs to tau+tau- Search with Deep Learning” – that is the title of a new article posted to the archive this week by Daniel Whiteson and two collaborators from the Computer Science Department at UC Irvine (1410.3469). While the title may be totally obscure to someone outside of collider physics, it caught my immediate attention because I am working on a similar project (to be released soon).

Briefly: the physics motivation comes from the need for a stronger signal for Higgs decays to τ^{+}τ^{–}, which are important for testing the Higgs couplings to fermions (specifically, leptons). The scalar particle with a mass of 125 GeV looks very much like the standard model Higgs boson, but tests of couplings, which are absolutely crucial, are not very precise yet. In fact, indirect constraints are stronger than direct ones at the present time. So boosting the sensitivity of the LHC data to Higgs decays to fermions is an important task.

The meat of the article concerns the comparisons of *shallow* artificial neural networks, which contain only one or two hidden layers, and *deep* artificial neural networks, which have many. Deep networks are harder to work with than shallow ones, so the question is: does one really gain anything? The answer is: yes, its like increasing your luminosity by 25%.

This case study considers final states with two oppositely-charged leptons (e or μ) and missing transverse energy. The Higgs signal must be separated from the Drell-Yan production of τ pairs, especially Z→τ^{+}τ^{–}, on a statistical basis. It appears that no other backgrounds (such as W pair or top pair production) were considered, so this study is a purely technical one. Nonetheless, there is plenty to be learned from it.

Whiteson, Baldi and Sadowski make a distinction between *low-level variables*, which include the basic kinematic observables for the leptons and jets, and the *high-level variables*, which include derived kinematic quantities such as invariant masses, differences in angles and pseudorapidity, sphericity, etc. I think this distinction and the way they compare the impact of the two sets is interesting.

The question is: if a sophisticated artificial neural network is able to develop complex functions of the low-level variables through training and optimization, isn’t it redundant to provide derived kinematic quantities as additional inputs? More sharply: does the neural network need “human assistance” to do its job?

The answer is clear: human assistance does help the performance of even a deep neural network with thousands of neurons and millions of events for training. Personally I am not surprised by this, because there is physics insight behind most if not all of the high-level variables — they are not just arbitrary functions of the low-level variables. So these specific functions carry physics meaning and fall somewhere between arbitrary functions of the input variables and brand new information (or features). I admit, though, that “physics meaning” is a nebulous concept and my statement is vague…

The authors applied state of the art techniques for this study, including optimization with respect to *hyperparameters*, i.e., the parameters that concern the details of the training of the neural network (learning speed, `velocity’ and network architecture). A lot of computer cycles were burnt to carry out these comparisons!

Deep neural networks might seem like an obvious way to go when trying to isolate rare signals. There are real, non-trivial stumbling blocks, however. An important one is the *vanishing gradient* problem. If the number of hidden nodes is large (imagine eight layers with 500 neurons each) then training by back-propagation fails because it cannot find a significantly non-zero gradient with respect to the weights and offsets of the all the neurons. If the gradient vanishes, then the neural network cannot figure out which way to evolve so that it performs well. Imagine a vast flat space with a minimum that is localized and far away. How can you figure out which way to go to get there if the region where you are is nearly perfectly flat?

The power of a neural network can be assessed on the basis of the receiver operator curve (ROC) by integrating the area beneath the curve. For particle physicists, however, the common coinage is the expected statistical significance of an hypothetical signal, so Whiteson & co translate the performance of their networks into a discovery significance defined by a number of standard deviations. Notionally, a shallow neural network working only with low-level variables would achieve a significance of 2.57σ, while adding in the high-level variables increases the significance to 3.02σ. In contrast, the deep neural networks achieve 3.16σ with low-level, and 3.37σ with all variables.

Some conclusions are obvious: deep is better than shallow. Also, adding in the high-level variables helps in both cases. (Whiteson et al. point out that the high-level variables incorporate the τ mass, which otherwise is unavailable to the neural networks.) The deep network with low-level variables is better than a shallow network with all variables, and the authors conclude that the deep artificial neural network is learning something that is not embodied in the human-inspired high-level variables. I am not convinced of this claim since it is not clear to me that the improvement is not simply due to the inadequacy of the shallow network to the task. By way of an analogy, if we needed to approximate an exponential curve by a linear one, we would not succeed unless the range was very limited; we should not be surprised if a quadratic approximation is better.

In any case, since I am working on similar things, I find this article very interesting. It is clear that the field is moving in the direction of very advanced numerical techniques, and this is one fruitful direction to go in.

### Neural Networks for Triggering

As experiments push the high-energy frontier, and also the intensity frontier, they must contend with higher and higher instantaneous luminosities. This challenge drives experimenters to try new techniques for triggering that might have sounded outlandish or fanciful ten years ago.

The Belle II experiment posted a paper this week on using (artificial) neural networks at the first trigger level for their experiment (arXiv:1410.1395). To be explicit: they plan to implement an artificial neural network at the hardware-trigger level, L1, i.e., the one that deals with the most primitive information from the detector in real time. The L1 latency is 5 μs which allows only 1 μs for the trigger decision.

At issue is a major background coming from Touschek scattering. The coulomb interaction of the e- and e+ beams can transform a small transverse phase space into a long longitudinal phase space. (See a DESY report 98-179 for a discussion.) The beam is thereby spread out in the z direction leading to collisions taking place far from the center of the apparatus. This is a bad thing for analysis and for triggering since much of the event remains unreconstructed — such events are a waste of bandwidth. The artificial neural networks, once trained, are mechanistic and parallel in the way they do their calculations, therefore they are fast – just what is needed for this situation. The interesting point is that here, in the Belle application, decisions about the z position of the vertex will be made without reconstructing any tracks (because there is insufficient time to carry out the reconstruction).

The CDC has 56 axial and stereo layers grouped into nine superlayers. Track segments are found by the TSF based on superlayer information. The 2D trigger module finds crude tracks in the (r,φ) plane. The proposed neutral network trigger takes information from the axial and stereo TSF, and also from the 2D trigger module.

As usual, the artificial neural network is based on the multi-layer perceptron (MLP) with a hyperbolic tangent activation function. The network is trained by back-propagation. Interestingly, the authors use an ensemble of “expert” MLPs corresponding to small sectors in phase space. Each MLP is trained on a subset of tracks corresponding to that sector. Several incarnations of the network were investigated, which differ in the precise information used as input to the neural network. The drift times are scaled and the left/right/undecided information is represented by an integer. The azimuthal angle can be represented by a scaled wire ID or by an angle relative to the one produced by the 2D trigger. There is a linear relation between the arc length and the z coordinate, so the arc length (μ) can also be a useful input variable.

As a first test, one sector is trained for a sample of low-pT and another sample of high-pT tracks. The parameter range is very constrained, and the artificial neural networks do well, achieving a resolution of 1.1 – 1.8 cm.

In a second test, closer to the planned implementation, the output of the 2D trigger is represented by some smeared φ and pT values. The track parameters cover a wider range than in the first test, and the pT range is divided into nine pieces. The precision is 3 – 7cm in z, which is not yet good enough for the application (they are aiming for 2 cm or better). Nonetheless, this estimate is useful because it can be used to restrict the sector size for the next step.

Clearly this is a work in progress, and much remains to be done. Assuming that the Belle Collaboration succeeds, the fully pipelined neural network trigger will be realized on FPGA boards.

### LHCb searches for LFV in tau decays

Lepton flavor violation (LFV) occurs when neutrinos oscillate in flavor, but is not supposed to occur (at tree level) when charged leptons are involved. Beyond the standard model, however, speculative models predict observable levels of LFV and since flavor is so difficult to understand theoretically, searches for LFV are inherently worthwhile.

A striking signature for LFV would be the neutrinoless decay of a tau lepton to three muons (or, for that matter, the decay of a muon to three electrons). The Belle and BaBar experiments at the B-factories have searched for τ→3μ; Belle set the best limit in 2010: BF(τ→3μ) < 2.1×10^{-8}.

Tau leptons are produced copiously in high-energy pp collisions. They come from the semileptonic decays of b and c hadrons, which themselves are produced with huge cross sections. The LHCb experiment is designed to study b and c hadrons and is very successful at that. A key piece of the LHCb apparatus is a very nice muon spectrometer that provides triggers for the readout and a high-quality reconstruction of muon trajectories. This would seem to be an excellent place to search for τ→3μ decays – and it is, as reported this week (arXiv:1409.8548).

The selection begins, of course, with three muons that together form a displaced vertex (taking advantage of the tau and charm hadron lifetimes). The momentum vector of the three muons should be nearly collinear with a vector pointing from the primary vertex to the tri-muon vertex — there are no neutrinos in the signal, after all, and the tau lepton takes most of the energy of the charm hadron, and therefore closely follows the direction of the charm hadron. (Charm hadron decays produce most of the tau leptons, so those coming from b hadrons are lost, but this does not matter much.) Here is a depiction of a signal event, which comes from a talk given by Gerco Onderwater at NUFACT2014:

I like the way the analysis is designed: there is the all-important tri-muon invariant mass distribution, there is a classifier for “geometry” – i.e., topology, and a second one for muon identification. Clearly, this analysis is challenging.

The geometry classifier M(3body) incorporates information about the vertex and the pointing angle. The classifier itself is surprisingly sophisticated, involving two Fisher discriminants, four artificial neural networks, one function-discriminant analysis and one linear discriminant — all combined into a blended boosted decision tree (BDT)! Interestingly, the analyzers use one-half of their event sample to train the artificial neural networks, etc., and the other to train the BDT. The performance of the BDT is validated with a sample of Ds→φπ decays, with φ→2μ.

The muon ID classifier M(PID) uses detector information from the ring-imaging cherenkov detectors, calorimeters and muon detectors to provide a likelihood that each muon candidate is a muon. The smallest of the three likelihoods is used as the discriminating quantity. M(PID) employs an artificial neural network that is validated using J/ψ decays to muon pairs.

The LHCb physicists take advantage of their large sample of Ds→μμπ decays to model the tri-muon invariant mass distribution accurately. The line shape is parameterized by a pair of Gaussians that are then rescaled to the mass and width of the tau lepton.

Backgrounds are not large, and consist of one irreducible background and several reducible ones, which is where M(PID) plays a key role. The signal rate is normalized to the rate of Ds→φπ decays, which is relatively well known, and which also has a robust signal in LHCb.

The paper contains tables of yields in grids of M(PID) and M(3body), and there is no signs of a signal. The picture from their Fig. 3 is clear:

No signal. Taking relatively modest systematics into account, they use the usual CLs method to set an upper limit. The actual result is BF(τ→3μ) < 4.6×10^{-8} at 90% CL, slightly better than expected. This limit is not quite as constraining as the earlier Belle result, but points the way to stronger results when larger data samples have been collected. The mass window shown above is not heavily populated by background.

I think this is a nice analysis, done intelligently. I hope I can learn more about the advanced analysis techniques employed.