root/cvsroot/UserCode/Friis/TancNote/note/tanc_nn_training.tex
Revision: 1.7
Committed: Tue Apr 27 05:13:16 2010 UTC (15 years ago) by friis
Content type: application/x-tex
Branch: MAIN
Changes since 1.6: +13 -11 lines
Log Message:
Almost complete

The samples used to train the TaNC neural networks are typical of the signals
and backgrounds found in common physics analyses using taus. The signal--type
training sample is composed of reconstructed tau--candidates that are matched
to generator level hadronic tau decays coming from simulated $Z \rightarrow
\tau^{+}\tau^{-}$ events. The background training sample consists of
reconstructed tau--candidates in simulated QCD $2\rightarrow2$ hard scattering
events. The QCD $P_T$ spectrum is steeply falling, and to obtain sufficient
statistics across a broad range of $P_T$, the sample is split into different
$\hat P_{T}$ bins. Each QCD sub--sample imposes a generator level cut on the
transverse energy of the hard interaction. During evaluation of discrimination
performance the QCD sub--samples are weighted according to their respective
integrated luminosities to remove any effect of the binning.

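
The luminosity weighting of the sub--samples can be sketched as follows. This
is an illustrative Python sketch, not the actual analysis code; the bin names,
cross sections, and event counts are made-up placeholders.

```python
# Illustrative sketch of weighting QCD p_T-hat sub-samples to a common
# integrated luminosity. Cross sections, bin names, and event counts are
# made-up placeholders, not the values used in this note.

def lumi_weights(samples, target_lumi_pb):
    """Per-bin event weights equalizing the effective luminosity of each bin."""
    weights = {}
    for name, info in samples.items():
        # Effective luminosity of a generated sample: N_generated / sigma.
        eff_lumi_invpb = info["n_events"] / info["xsec_pb"]
        weights[name] = target_lumi_pb / eff_lumi_invpb
    return weights

# Hypothetical p_T-hat bins (placeholder numbers).
qcd_bins = {
    "pthat_30_50": {"xsec_pb": 1.6e8, "n_events": 5_000_000},
    "pthat_50_80": {"xsec_pb": 2.2e7, "n_events": 5_000_000},
}
```

Events from a steeply falling spectrum binned this way receive a weight
proportional to the bin cross section, so the binning drops out of any
luminosity-normalized distribution.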
The signal and background samples are split into five subsamples corresponding
to each reconstructed decay mode. An additional selection is applied to each
subsample by requiring a ``leading pion'': either a charged hadron or gamma
candidate with transverse momentum greater than 5 GeV$/c$. A large number of
QCD training events is required, as the leading pion selection and the
requirement that the decay mode match one of the dominant modes given in
table~\ref{tab:decay_modes} are both effective discriminants. For each
subsample, half of the signal and background tau--candidates are reserved to
be used internally by the TMVA software to test for over--training. The number
of signal and background entries used for each decay mode subsample is given
in table~\ref{tab:trainingEvents}.

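
A minimal sketch of the decay mode splitting and leading pion requirement
follows; the dictionary layout and mode labels are assumptions made for this
example, not the actual TaNC data structures.

```python
# Illustrative sketch (not the actual TaNC code) of splitting tau-candidates
# by reconstructed decay mode and applying the "leading pion" requirement.
# The dictionary layout and mode labels are assumptions for this example.

LEADING_PION_MIN_PT = 5.0  # GeV/c, from the selection described above

def passes_leading_pion(tau, min_pt=LEADING_PION_MIN_PT):
    """True if any charged hadron or gamma constituent has pt > min_pt."""
    pts = [c["pt"] for c in tau["constituents"]
           if c["type"] in ("charged_hadron", "gamma")]
    return bool(pts) and max(pts) > min_pt

def split_by_decay_mode(taus, modes):
    """Group candidates passing the leading pion cut by decay mode."""
    subsamples = {mode: [] for mode in modes}
    for tau in taus:
        if tau["decay_mode"] in subsamples and passes_leading_pion(tau):
            subsamples[tau["decay_mode"]].append(tau)
    return subsamples
```

Candidates failing either the mode match or the leading pion cut are discarded,
which is why the QCD sample must start out so much larger than the signal one.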
%Chained 100 signal files.
%Chained 208 background files.
%Total signal entries: 874266
%Total background entries: 9526176
%Pruning non-relevant entries.
%After pruning, 584895 signal and 644315 background entries remain.
%**********************************************************************************
%*********************************** Summary **************************************
%**********************************************************************************
%* NumEvents with weight > 0 (Total NumEvents) *
%*--------------------------------------------------------------------------------*
%*shrinkingConePFTauDecayModeProducer ThreeProngNoPiZero: Signal: 53257(53271) Background:155793(155841)
%*shrinkingConePFTauDecayModeProducer ThreeProngOnePiZero: Signal: 13340(13342) Background:135871(135942)
%*shrinkingConePFTauDecayModeProducer OneProngTwoPiZero: Signal: 34780(34799) Background:51181(51337)
%*shrinkingConePFTauDecayModeProducer OneProngOnePiZero: Signal: 136464(138171) Background:137739(139592)
%*shrinkingConePFTauDecayModeProducer OneProngNoPiZero: Signal: 300951(345312) Background:144204(161603)

\begin{table}
\centering
\begin{tabular}{lcc}
%\multirow{2}{*}{} & \multicolumn{2}{c}{Events} \\
& Signal & Background \\
\hline
Total number of tau--candidates & 874266 & 9526176 \\
Tau--candidates passing preselection & 584895 & 644315 \\
Tau--candidates with $W(P_T,\eta)>0$ & 538792 & 488917 \\
\hline
Decay Mode & \multicolumn{2}{c}{Training Events} \\
\hline
$\pi^{-}$ & 300951 & 144204 \\
$\pi^{-}\pi^0$ & 135464 & 137739 \\
$\pi^{-}\pi^0\pi^0$ & 34780 & 51181 \\
$\pi^{-}\pi^{-}\pi^{+}$ & 53247 & 155793 \\
$\pi^{-}\pi^{-}\pi^{+}\pi^0$ & 13340 & 135871 \\
\end{tabular}
\caption{Number of events used for neural network training in each
selected decay mode.}
\label{tab:trainingEvents}
\end{table}

In both signal and background samples, 20\% of the events are reserved as a
statistically independent sample to evaluate the performance of the neural
nets after the training is completed. The TaNC uses the ``MLP'' neural network
implementation provided by the TMVA software package, described
in~\cite{TMVA}. The ``MLP'' classifier is a feed-forward artificial neural
network with two layers of hidden nodes and a single node in the output layer.
The hyperbolic tangent is used as the neuron activation function. The number
of hidden nodes in each layer is chosen according to Kolmogorov's
theorem~\cite{kolmogorovsTheorem}: the first (second) hidden layer has $N+1$
($2N+1$) nodes, where $N$ is the number of input observables. The neural
network is trained for 500 epochs. At ten--epoch intervals, the neural network
error is computed to check for over--training (see
figure~\ref{fig:overTrainCheck}). The neural network error $E$ is
defined~\cite{TMVA} as
\begin{equation}
E = \frac{1}{2} \sum_{i=1}^N (y_{ANN,i} - \hat y_i)^2
\label{eq:NNerrorFunc}
%note - not right for weighted dists?
\end{equation}
where $N$ is the number of training events, $y_{ANN,i}$ is the neural network
output for the $i$th training event, and $\hat y_i$ is the desired output
($-1$ for background, $1$ for signal) for the $i$th event. No evidence of
over--training is observed.

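
The error definition above, and the factor-of-four remark in the caption of
figure~\ref{fig:overTrainCheck}, can be illustrated with a short sketch
(illustrative only, not TMVA code):

```python
import numpy as np

# Sketch of the network error E: half the summed squared difference between
# network outputs and desired targets (-1 for background, +1 for signal).
# Illustrative only, not TMVA code.

def nn_error(y_ann, y_target):
    y_ann = np.asarray(y_ann, dtype=float)
    y_target = np.asarray(y_target, dtype=float)
    return 0.5 * np.sum((y_ann - y_target) ** 2)
```

Mapping the same outputs and targets onto $(0, 1)$ halves every residual and
therefore shrinks the squared error by a factor of four, which is the factor
noted in the figure caption for the $(-1, 1)$ targets.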
\begin{figure}[t]
\setlength{\unitlength}{1mm}
\begin{center}
\begin{picture}(150, 195)(0,0)
\put(0.5, 130)
{\mbox{\includegraphics*[height=60mm]{figures/overtrainCheck_OneProngNoPiZero.pdf}}}
\put(65, 130)
{\mbox{\includegraphics*[height=60mm]{figures/overtrainCheck_OneProngOnePiZero.pdf}}}
\put(0.5, 65)
{\mbox{\includegraphics*[height=60mm]{figures/overtrainCheck_OneProngTwoPiZero.pdf}}}
\put(65, 65)
{\mbox{\includegraphics*[height=60mm]{figures/overtrainCheck_ThreeProngNoPiZero.pdf}}}
\put(33, 0)
{\mbox{\includegraphics*[height=60mm]{figures/overtrainCheck_ThreeProngOnePiZero.pdf}}}
\end{picture}
\caption{
Neural network classification error for the training (solid red) and testing
(dashed blue) samples at ten--epoch intervals over the 500 training epochs for
each decay mode neural network. The vertical axis shows the classification
error defined by equation~\ref{eq:NNerrorFunc}. Note that the choice of the
hyperbolic tangent as the neuron activation function makes the desired outputs
for signal and background $1$ and $-1$, respectively, so the computed neural
network error is larger by a factor of four than in the case where the desired
outputs are $(0, 1)$. Classifier over--training would be evidenced by a
divergence of the classification errors of the training and testing samples,
indicating that the neural net was optimizing on statistical fluctuations in
the training sample.
}
\label{fig:overTrainCheck}
\end{center}
\end{figure}


The neural nets use the transverse momentum and $\eta$ of the tau--candidates
as input variables. These variables are included because their correlations
with the other observables can increase the separation power of the ensemble
of observables. For example, the opening angle in $\Delta R$ for signal
tau--candidates is inversely related to the transverse momentum, while for
background events the correlation is very small~\cite{DavisTau}. In the
training signal and background samples, there is significant discrimination
power in the $P_T$ spectrum. However, it is desirable to eliminate any
systematic dependence of the neural network output on $P_T$ and $\eta$, as in
use the TaNC will be presented with tau--candidates whose $P_T$--$\eta$
spectrum is analysis dependent. The dependence on $P_T$ and $\eta$ is removed
by applying a $P_T$-- and $\eta$--dependent weight to the tau--candidates when
training the neural nets.

The weights are defined such that in any region of $P_T$--$\eta$ space where
the signal and background probability densities differ, the sample with the
higher probability density is down--weighted so that the two samples have
identical $P_T$--$\eta$ probability distributions. This removes regions of
$P_T$--$\eta$ space where the training sample is exclusively signal or
background. The weights are computed as
\begin{align*}
W(P_T, \eta) &= \min\left(p_{sig}(P_T, \eta), p_{bkg}(P_T, \eta)\right)\\
w_{sig}(P_T, \eta) &= W(P_T, \eta)/p_{sig}(P_T, \eta) \\
w_{bkg}(P_T, \eta) &= W(P_T, \eta)/p_{bkg}(P_T, \eta)
\end{align*}
where $p_{sig}(P_T,\eta)$ and $p_{bkg}(P_T,\eta)$ are the probability
densities of the signal and background samples after the ``leading pion'' and
decay mode selections. Figure~\ref{fig:nnTrainingWeights} shows the signal and
background training $P_T$ distributions before and after the weighting is
applied.
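
A binned version of this weighting can be sketched as follows, assuming the
probability densities are estimated with simple 2D histograms; the binning and
inputs are illustrative assumptions, not the actual TaNC procedure.

```python
import numpy as np

# Binned sketch of the pt-eta training weights: estimate the signal and
# background densities with a 2D histogram, take the bin-wise minimum W,
# and divide. Binning and inputs are illustrative assumptions.

def pt_eta_weights(sig, bkg, pt_bins, eta_bins):
    """Return per-bin weights (w_sig, w_bkg) for arrays with columns (pt, eta)."""
    p_sig, _, _ = np.histogram2d(sig[:, 0], sig[:, 1], bins=(pt_bins, eta_bins))
    p_bkg, _, _ = np.histogram2d(bkg[:, 0], bkg[:, 1], bins=(pt_bins, eta_bins))
    p_sig /= p_sig.sum()   # normalize to probability distributions
    p_bkg /= p_bkg.sum()
    w = np.minimum(p_sig, p_bkg)  # W(pt, eta)
    with np.errstate(divide="ignore", invalid="ignore"):
        w_sig = np.where(p_sig > 0, w / p_sig, 0.0)
        w_bkg = np.where(p_bkg > 0, w / p_bkg, 0.0)
    return w_sig, w_bkg
```

By construction, the weighted distributions $p_{sig} w_{sig}$ and
$p_{bkg} w_{bkg}$ are both equal to $W$ bin by bin, so the samples become
indistinguishable in $P_T$--$\eta$.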

\begin{figure}[t]
\setlength{\unitlength}{1mm}
\begin{center}
\begin{picture}(150,60)(0,0)
\put(10.5, 2){
\mbox{\includegraphics*[height=58mm]{figures/training_weights_unweighted.pdf}}}
\put(86.0, 2){
\mbox{\includegraphics*[height=58mm]{figures/training_weights_weighted.pdf}}}
\end{picture}
\caption{Transverse momentum spectrum of the signal and background
tau--candidates used in neural net training before (left) and after (right)
the application of the $P_T$--$\eta$ dependent weight function. Application of
the weights lowers the training significance of tau--candidates in regions of
$P_T$--$\eta$ phase space where either the signal or the background sample has
an excess of events.}
\label{fig:nnTrainingWeights}
\end{center}
\end{figure}