“Enhanced Higgs to tau+tau- Search with Deep Learning” – that is the title of a new article posted to the archive this week by Daniel Whiteson and two collaborators from the Computer Science Department at UC Irvine (1410.3469). While the title may be totally obscure to someone outside of collider physics, it caught my immediate attention because I am working on a similar project (to be released soon).
Briefly: the physics motivation comes from the need for a stronger signal for Higgs decays to τ+τ-, which are important for testing the Higgs couplings to fermions (specifically, leptons). The scalar particle with a mass of 125 GeV looks very much like the standard model Higgs boson, but tests of couplings, which are absolutely crucial, are not very precise yet. In fact, indirect constraints are stronger than direct ones at the present time. So boosting the sensitivity of the LHC data to Higgs decays to fermions is an important task.
The meat of the article concerns the comparisons of shallow artificial neural networks, which contain only one or two hidden layers, and deep artificial neural networks, which have many. Deep networks are harder to work with than shallow ones, so the question is: does one really gain anything? The answer is: yes, its like increasing your luminosity by 25%.
This case study considers final states with two oppositely-charged leptons (e or μ) and missing transverse energy. The Higgs signal must be separated from the Drell-Yan production of τ pairs, especially Z→τ+τ-, on a statistical basis. It appears that no other backgrounds (such as W pair or top pair production) were considered, so this study is a purely technical one. Nonetheless, there is plenty to be learned from it.
Whiteson, Baldi and Sadowski make a distinction between low-level variables, which include the basic kinematic observables for the leptons and jets, and the high-level variables, which include derived kinematic quantities such as invariant masses, differences in angles and pseudorapidity, sphericity, etc. I think this distinction and the way they compare the impact of the two sets is interesting.
The question is: if a sophisticated artificial neural network is able to develop complex functions of the low-level variables through training and optimization, isn’t it redundant to provide derived kinematic quantities as additional inputs? More sharply: does the neural network need “human assistance” to do its job?
The answer is clear: human assistance does help the performance of even a deep neural network with thousands of neurons and millions of events for training. Personally I am not surprised by this, because there is physics insight behind most if not all of the high-level variables — they are not just arbitrary functions of the low-level variables. So these specific functions carry physics meaning and fall somewhere between arbitrary functions of the input variables and brand new information (or features). I admit, though, that “physics meaning” is a nebulous concept and my statement is vague…
The authors applied state of the art techniques for this study, including optimization with respect to hyperparameters, i.e., the parameters that concern the details of the training of the neural network (learning speed, `velocity’ and network architecture). A lot of computer cycles were burnt to carry out these comparisons!
Deep neural networks might seem like an obvious way to go when trying to isolate rare signals. There are real, non-trivial stumbling blocks, however. An important one is the vanishing gradient problem. If the number of hidden nodes is large (imagine eight layers with 500 neurons each) then training by back-propagation fails because it cannot find a significantly non-zero gradient with respect to the weights and offsets of the all the neurons. If the gradient vanishes, then the neural network cannot figure out which way to evolve so that it performs well. Imagine a vast flat space with a minimum that is localized and far away. How can you figure out which way to go to get there if the region where you are is nearly perfectly flat?
The power of a neural network can be assessed on the basis of the receiver operator curve (ROC) by integrating the area beneath the curve. For particle physicists, however, the common coinage is the expected statistical significance of an hypothetical signal, so Whiteson & co translate the performance of their networks into a discovery significance defined by a number of standard deviations. Notionally, a shallow neural network working only with low-level variables would achieve a significance of 2.57σ, while adding in the high-level variables increases the significance to 3.02σ. In contrast, the deep neural networks achieve 3.16σ with low-level, and 3.37σ with all variables.
Some conclusions are obvious: deep is better than shallow. Also, adding in the high-level variables helps in both cases. (Whiteson et al. point out that the high-level variables incorporate the τ mass, which otherwise is unavailable to the neural networks.) The deep network with low-level variables is better than a shallow network with all variables, and the authors conclude that the deep artificial neural network is learning something that is not embodied in the human-inspired high-level variables. I am not convinced of this claim since it is not clear to me that the improvement is not simply due to the inadequacy of the shallow network to the task. By way of an analogy, if we needed to approximate an exponential curve by a linear one, we would not succeed unless the range was very limited; we should not be surprised if a quadratic approximation is better.
In any case, since I am working on similar things, I find this article very interesting. It is clear that the field is moving in the direction of very advanced numerical techniques, and this is one fruitful direction to go in.
As experiments push the high-energy frontier, and also the intensity frontier, they must contend with higher and higher instantaneous luminosities. This challenge drives experimenters to try new techniques for triggering that might have sounded outlandish or fanciful ten years ago.
The Belle II experiment posted a paper this week on using (artificial) neural networks at the first trigger level for their experiment (arXiv:1410.1395). To be explicit: they plan to implement an artificial neural network at the hardware-trigger level, L1, i.e., the one that deals with the most primitive information from the detector in real time. The L1 latency is 5 μs which allows only 1 μs for the trigger decision.
At issue is a major background coming from Touschek scattering. The coulomb interaction of the e- and e+ beams can transform a small transverse phase space into a long longitudinal phase space. (See a DESY report 98-179 for a discussion.) The beam is thereby spread out in the z direction leading to collisions taking place far from the center of the apparatus. This is a bad thing for analysis and for triggering since much of the event remains unreconstructed — such events are a waste of bandwidth. The artificial neural networks, once trained, are mechanistic and parallel in the way they do their calculations, therefore they are fast – just what is needed for this situation. The interesting point is that here, in the Belle application, decisions about the z position of the vertex will be made without reconstructing any tracks (because there is insufficient time to carry out the reconstruction).
The CDC has 56 axial and stereo layers grouped into nine superlayers. Track segments are found by the TSF based on superlayer information. The 2D trigger module finds crude tracks in the (r,φ) plane. The proposed neutral network trigger takes information from the axial and stereo TSF, and also from the 2D trigger module.
As usual, the artificial neural network is based on the multi-layer perceptron (MLP) with a hyperbolic tangent activation function. The network is trained by back-propagation. Interestingly, the authors use an ensemble of “expert” MLPs corresponding to small sectors in phase space. Each MLP is trained on a subset of tracks corresponding to that sector. Several incarnations of the network were investigated, which differ in the precise information used as input to the neural network. The drift times are scaled and the left/right/undecided information is represented by an integer. The azimuthal angle can be represented by a scaled wire ID or by an angle relative to the one produced by the 2D trigger. There is a linear relation between the arc length and the z coordinate, so the arc length (μ) can also be a useful input variable.
As a first test, one sector is trained for a sample of low-pT and another sample of high-pT tracks. The parameter range is very constrained, and the artificial neural networks do well, achieving a resolution of 1.1 – 1.8 cm.
In a second test, closer to the planned implementation, the output of the 2D trigger is represented by some smeared φ and pT values. The track parameters cover a wider range than in the first test, and the pT range is divided into nine pieces. The precision is 3 – 7cm in z, which is not yet good enough for the application (they are aiming for 2 cm or better). Nonetheless, this estimate is useful because it can be used to restrict the sector size for the next step.
Clearly this is a work in progress, and much remains to be done. Assuming that the Belle Collaboration succeeds, the fully pipelined neural network trigger will be realized on FPGA boards.
Lepton flavor violation (LFV) occurs when neutrinos oscillate in flavor, but is not supposed to occur (at tree level) when charged leptons are involved. Beyond the standard model, however, speculative models predict observable levels of LFV and since flavor is so difficult to understand theoretically, searches for LFV are inherently worthwhile.
A striking signature for LFV would be the neutrinoless decay of a tau lepton to three muons (or, for that matter, the decay of a muon to three electrons). The Belle and BaBar experiments at the B-factories have searched for τ→3μ; Belle set the best limit in 2010: BF(τ→3μ) < 2.1×10-8.
Tau leptons are produced copiously in high-energy pp collisions. They come from the semileptonic decays of b and c hadrons, which themselves are produced with huge cross sections. The LHCb experiment is designed to study b and c hadrons and is very successful at that. A key piece of the LHCb apparatus is a very nice muon spectrometer that provides triggers for the readout and a high-quality reconstruction of muon trajectories. This would seem to be an excellent place to search for τ→3μ decays – and it is, as reported this week (arXiv:1409.8548).
The selection begins, of course, with three muons that together form a displaced vertex (taking advantage of the tau and charm hadron lifetimes). The momentum vector of the three muons should be nearly collinear with a vector pointing from the primary vertex to the tri-muon vertex — there are no neutrinos in the signal, after all, and the tau lepton takes most of the energy of the charm hadron, and therefore closely follows the direction of the charm hadron. (Charm hadron decays produce most of the tau leptons, so those coming from b hadrons are lost, but this does not matter much.) Here is a depiction of a signal event, which comes from a talk given by Gerco Onderwater at NUFACT2014:
I like the way the analysis is designed: there is the all-important tri-muon invariant mass distribution, there is a classifier for “geometry” – i.e., topology, and a second one for muon identification. Clearly, this analysis is challenging.
The geometry classifier M(3body) incorporates information about the vertex and the pointing angle. The classifier itself is surprisingly sophisticated, involving two Fisher discriminants, four artificial neural networks, one function-discriminant analysis and one linear discriminant — all combined into a blended boosted decision tree (BDT)! Interestingly, the analyzers use one-half of their event sample to train the artificial neural networks, etc., and the other to train the BDT. The performance of the BDT is validated with a sample of Ds→φπ decays, with φ→2μ.
The muon ID classifier M(PID) uses detector information from the ring-imaging cherenkov detectors, calorimeters and muon detectors to provide a likelihood that each muon candidate is a muon. The smallest of the three likelihoods is used as the discriminating quantity. M(PID) employs an artificial neural network that is validated using J/ψ decays to muon pairs.
The LHCb physicists take advantage of their large sample of Ds→μμπ decays to model the tri-muon invariant mass distribution accurately. The line shape is parameterized by a pair of Gaussians that are then rescaled to the mass and width of the tau lepton.
Backgrounds are not large, and consist of one irreducible background and several reducible ones, which is where M(PID) plays a key role. The signal rate is normalized to the rate of Ds→φπ decays, which is relatively well known, and which also has a robust signal in LHCb.
The paper contains tables of yields in grids of M(PID) and M(3body), and there is no signs of a signal. The picture from their Fig. 3 is clear:
No signal. Taking relatively modest systematics into account, they use the usual CLs method to set an upper limit. The actual result is BF(τ→3μ) < 4.6×10-8 at 90% CL, slightly better than expected. This limit is not quite as constraining as the earlier Belle result, but points the way to stronger results when larger data samples have been collected. The mass window shown above is not heavily populated by background.
I think this is a nice analysis, done intelligently. I hope I can learn more about the advanced analysis techniques employed.
This week the CMS Collaboration released a paper reporting the measurement of the ratio of production cross sections for the χb2(1P) and the χb1(1P) heavy meson states (arXiv:1409.5761). The motivation stems from the theoretical difficulties in explaining how such states are formed, but for me as an experimenter the most striking feature of the analysis is the impressive separation of the χ states.
First, a little background. A bottom quark and an anti-bottom anti-quark can form a meson with a well-defined mass. These states bear some resemblance to positronium but the binding potential comes from the strong force, not electromagnetism. In the past, the spectrum of the masses of these states clarified important features of this potential, and led to the view that the potential increases with separation, rather than decreasing. As we all know, QCD is absolutely confining, and the first hints came from studies of charmonium and bottomonium. The masses of these and many other states have been precisely measured over the years, and now provide important tests of lattice calculations.
The mass of the χb2(1P) is 9912.21 MeV and the mass of the χb1(1P) is 9892.78 MeV; the mass difference is only 19.4 MeV. They sit together in a fairly complicated diagram of the states. Here is a nice version which comes from an annual review article by Patrignani, Pedlar and Rosner (arXiv:1212.6552) – I have circled the states under discussion here:
So, even on the scale of the bottomonium mesons, this separation of 19 MeV is quite small. Nonetheless, CMS manages to do a remarkably good job. Here is their plot:
Two peaks are clearly resolved: the χb2(1P) on the left (and represented by the green dashed line) and the χb1(1P) on the right (represented by the red dashed line). The two peaks are successfully differentiated, and the measurements of their relative rates can be carried out.
How do they do it? The χ stated decay to the Y(1S) by emitting a photon with a substantial branching fraction that is already known fairly well. The vector Y(1S) state is rather easily reconstructed through through its decays to a μ+μ- pair. The CMS spectrometer is excellent, as it the reconstruction of muons, so the Y(1S) state appears as a narrow peak. By detecting the photon and calculating the μμγ invariant mass, the χ states can be reconstructed.
Here is the interesting part: the photons are not reconstructed with the (rather exquisite) crystal electromagnetic calorimeter, because its energy resolution is not good enough. This may be surprising, since the Higgs decay to a pair of photons certainly is well reconstructed using the calorimeter. These photons, however, have a very low energy, and their energies are not so well measured. (Remember that electromagnetic calorimeter resolution goes roughly as 1/sqrt(E).) Instead, the CMS physicists took advantage of their tracking a second time, and reconstructed those photons that had cleanly converted into an e+e- pair. So the events of interest contained two muons, that together give the Y(1S) state, and an e+e- pair, which gives the photon emitted in the radiative decay of the χ state. The result is the narrow peaks displayed above; the yield is obtained simply by integrating the curves representing the two χ states.
This technique might conceivably be interesting when searching for peculiar signals of new physics.
It is difficult to ascertain the reconstruction efficiency of conversion pairs, since they tend to be asymmetric (either the electron or the positron gets most of the photon’s energy). By taking the ratio of yields, however, one obtains the ratio of cross sections times branching fractions. This ratio is experimentally clean, therefore, and robust. The mass spectrum was examined in four bins of the transverse momentum of the Y(1S); the plot above is the second such bin.
Here is the results of the measurement: four values of the ratio σ(χb2)/σ(χb1) plotted as a function of pT(Y):
LHCb have also made this measurement (arXiv:1202.1080), and their values are presented by the open circles; the CMS measurement agrees well with LHCb. The green horizontal band is simply an average of the CMS values, assuming no dependence on pT(Y). The orange curved band comes from a very recent theoretical calculation by Likhoded, Luchinsky and Poslavsky (arXiv:1409.0693). This calculation does not reproduce the data.
I find it remarkable that the CMS detector (and the other LHC detectors to varying degrees) can resolve such a small mass difference when examining the debris from an 8 TeV collision. These mass scales are different by a factor of two million. While there is no theoretical significance to this fact, it shows that experimenters must and can deal with such a huge range within one single apparatus. And they can.
Yesterday the AMS Collaboration released updated results on the positron excess. The press release is available at the CERN press release site. (Unfortunately, the AMS web site is down due to syntax error – I’m sure this will be fixed very soon.)
The Alpha Magnetic Spectrometer was installed three years ago at the International Space Station. As the name implies, it can measure the charge and momenta of charged particles. It can also identify them thanks to a suite of detectors providing redundant and robust information. The project was designed and developed by Prof. Sam Ting (MIT) and his team. An international team including scientists at CERN coordinate the analysis of data.
There are more electrons than positrons striking the earth’s atmosphere. Scientists can predict the expected rate of positrons relative to the rate of electrons in the absence of any new phenomena. It is well known that the observed positron rate does not agree with this prediction. This plot shows the deviation of the AMS positron fraction from the prediction. Already at an energy of a couple of GeV, the data have taken off.
The positron fraction unexpectedly increases starting around 8 GeV. At first it increases rapidly, with a slower increase above 10 GeV until 250 GeV or so. AMS reports the turn-over to a decrease to occur at 275 ± 32 GeV though it is difficult to see from the data:
This turnover, or edge, would correspond notionally to a Jacobian peak — i.e., it might indirectly indicate the mass of a decaying particle. The AMS press release mentions dark matter particles with a mass at the TeV scale. It also notes that no sharp structures are observed – the positron fraction may be anomalous but it is smooth with no peaks or shoulders. On the other hand, the observed excess is too high for most models of new physics, so one has to be skeptical of such a claim, and think carefully for an astrophysics origin of the “excess” positrons — see the nice discussion in Resonaances.
As an experimenter, it is a pleasure to see this nice event display for a positron with a measured energy of 369 GeV:
Finally, AMS reports that there is no preferred direction for the positron excess — the distribution is isotropic at the 3% level.
There is no preprint for this article. It was published two days ago in PRL 113 (2014) 121101″
A bit more than a year ago I was pleased to see a clear signal from CMS for the decay of Z bosons to four leptons. Of course there are literally millions of recorded Z decays to two leptons (e+ e- and μ+ μ-) used for standard model physics studies, lepton efficiency measurements, momentum/energy scale determinations and detector alignment. But Z→4L is cuter and of some intrinsic interest, being relatively rare.
It turned out the main interest of physicists who analyzed the signal was Higgs boson decays to four leptons. By now that Higgs signal is well established and plays an important role in the Higgs mass measurement, but at the time of the CMS publication (InSpire link, i.e., arXiv:1210.3844 October 2012), Z→4L provided the ideal benchmark for H→4L.
You might think that the rare decay Z→4L had been well studied at LEP. In fact, it was quite well studied because the ALEPH Collaboration had once reported an anomaly in the 4L final state when two of the leptons were tau leptons. (At the time, this observation hinted at a light supersymmetric Higgs boson signal.) The anomaly was not confirmed by the other LEP experiments. A perhaps definitive study was published by the L3 Collaboration in 1994 (InSpire link). Here are the plots of the two di-lepton masses:
Most of the events consist, in essence, of a virtual photon emitted by one of the primary leptons, with that virtual photon materializing as two more leptons – hence the peak at low masses for the Mmin distribution. Note there is no point in plotting the 4-lepton mass since the beam energies were tuned to the Z peak resonance – the total invariant mass will be, modulo initial-state radiation, a narrow peak at the center-of-mass-energy.
Here is the Z resonance from the CMS paper:
A rather clear and convincing peak is observed, in perfect agreement with the standard model prediction. This peak is based on the 5 fb-1 collected in 2011 at 7TeV.
ATLAS have released a study of this final state based on their entire 7 TeV and 8 TeV data set (ATLAS-CONF-2013-055, May 2013). Here is their preliminary 4-lepton mass peak:
Clearly the number of events is higher than in the CMS plot above, since five times the integrated luminosity was used. ATLAS also published the di-lepton sub-masses:
This calibration channel is not meant to be the place where new physics is discovered. Nonetheless, we have to compare the rate observed in the real data with the theoretical prediction – a discrepancy would be quite interesting since this decay is theoretically clean and the prediction should be solid.
Since the rate of pp→Z→2L is very well measured, and the branching ratio Z→2L already well known from LEP and SLD, we can extract branching fractions for Z decays to four leptons:
SM... BF(Z→4L) = (4.37 ± 0.03) × 10-6
CMS.. BF(Z→4L) = (4.2 ± 0.9 ± 0.2) × 10-6
ATLAS BF(Z→4L) = (4.2 ± 0.4 ) × 10^-6
So, as it turns out, the SM prediction matches the observed rate very well.
The inclusive W and Z production cross sections are benchmarks for any hadron collider. Excellent measurements were published by CDF and D0, but the superior detector capabilities of CMS and ATLAS allows for even better measurements. Fiducial cross sections are relatively free from theoretical uncertainties and can be used to constrain the parton distribution functions (PDFs), which are of central importance for nearly all measurements done at a hadron collider. In fact, ATLAS published an interesting constraint on the strange-quark density on the basis of inclusive cross section measurements. I’ll return to this result in a future post.
The first results were published back in 2010 and then updated in 2011 and 2012, based on 7 TeV data. Since W and Z bosons are produced copiously at the LHC, very small statistical uncertainties can be achieved with a rather small amount of integrated luminosity. (We have tens of millions of Z bosons detected in leptonic decay channels, for example, far more than the LEP experiments recorded. And we have roughly ten times the number of W bosons.) Remarkably, experimental systematic uncertainties are reduced to the 1% – 1.5% level, which is amazing considering the need to control lepton efficiencies and background estimates. (I am setting aside the luminosity uncertainty, which was about 3% – 4% for the early data.) The measurements done with only 35 pb-1 are nearly as precise as the theoretical predictions, whose errors are dominated by the PDF uncertainties. We knew, back in 2011, that a new era of electroweak physics had begun.
Experimenters know the power of ratios. We can often remove a systematic uncertainty by normalizing a measured quantity judiciously. For example, PDFs are a major source of uncertainty. These uncertainties are highly correlated, however, in the production of W and Z bosons. So we can extract the ratio (W rate)/(Z rate) with a relatively small error. Even better, we can plot the W cross section against the Z cross section, as ATLAS have done:
The elongated ellipses show that variations of the PDFs affect the W and Z cross sections is nearly the same way. The theoretical predictions are consistent with the data, and tend to lie all together. (The outlier, JR09, is no longer a favored PDF set.)
It is even more interesting to plot the W+ cross section against the W- cross section, because the asymmetry between W+ and W- production relates to the preponderance of up-quarks over down-quarks (don’t forget we are colliding two protons). Since the various PDF sets describe the d/u-ratio differently, there is a larger spread in theoretical predictions:
During the 8 TeV running in 2012, the instantaneous luminosity was much higher than in 2010, leading to high pile-up (overlapping interactions) which complicate the analysis. The LHC collaborations took a small amount of data (18 pb-1) in a low pile-up configuration in order to measure the W and Z cross sections at 8 TeV, and CMS have reported preliminary results. They produced ellipses plots similar to what ATLAS published:
You might notice that the CMS ellipse appear larger than the ATLAS ones. This is because the ATLAS results are based on fiducial cross sections – i.e., cross sections for particle produced within the detector acceptance. One has to apply an acceptance correction to convert a fiducial cross section to a total cross section. This acceptance correction is easily obtained from Monte Carlos simulations, but it comes with a systematic uncertainty coming mainly from the PDFs. (If a PDF favors a harder u-quark momentum distribution, then the vector bosons will have a slightly larger momentum component along the beam, and the leptons from the vector boson decay will be missed down the beam pipe more often. Such things matter at the percent level.) Since modern theoretical tools can calculate fiducial cross sections accurately, it is not necessary to apply an acceptance correction in order to compare to theory. Clearly it is wise to make the comparison at the level of fiducial cross sections, though total cross sections are also useful in other contexts. The CMS result is preliminary.
Back when I was electroweak physics co-convener in CMS, I produced a plot summarizing hadron collider measurements of W and Z production. My younger colleagues have updated that plot to include the new 8 TeV measurements:
This plot nicely summarizes the history of these measurements, and suggests that W and Z production processes are well understood.
As I learned a the WNL workshop, the collaborations are learning how to measure these cross section in the high pile-up data. We may see even more precise values, soon.