The samples used to train the TaNC neural networks are typical of the signals
and backgrounds found in common physics analyses using taus. The signal--type
training sample is composed of reconstructed tau--candidates that are matched
to generator level hadronic tau decays in simulated $Z \rightarrow
\tau^{+}\tau^{-}$ events. The background training sample consists of
reconstructed tau--candidates in simulated QCD $2\rightarrow2$ hard scattering
events. The QCD $P_T$ spectrum is steeply falling; to obtain sufficient
statistics across a broad range of $P_T$, the sample is split into different
$\hat P_{T}$ bins, each of which imposes a generator level cut on the
transverse momentum of the hard interaction. During the evaluation of
discrimination performance, the QCD samples are weighted according to their
respective integrated luminosities to remove any effect of the binning.
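
As an illustration, a per-sample weight can be derived from the equivalent
integrated luminosity of each $\hat P_{T}$ bin. The following sketch assumes
hypothetical cross-sections and event counts; none of the numbers are taken
from the actual samples.

\begin{verbatim}
# Sketch: weight binned QCD samples by equivalent integrated luminosity.
# Cross-sections and event counts are placeholders, not measured values.
samples = {
    # pt_hat bin: (cross-section [pb], generated events)
    "30-50":  (1.6e8, 1.0e6),
    "50-80":  (2.2e7, 1.0e6),
    "80-120": (3.0e6, 1.0e6),
}

target_lumi_pb = 100.0  # normalize every bin to the same luminosity

weights = {}
for name, (xsec_pb, n_events) in samples.items():
    sample_lumi_pb = n_events / xsec_pb   # equivalent luminosity of this bin
    weights[name] = target_lumi_pb / sample_lumi_pb  # per-event weight

for name, w in weights.items():
    print(name, w)
\end{verbatim}
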
The signal and background samples are split into five subsamples corresponding
to each reconstructed decay mode. An additional selection is applied to each
subsample by requiring a ``leading pion'': either a charged hadron or gamma
candidate with transverse momentum greater than 5 GeV$/c$. A large number of
QCD training events is required, as both the leading pion selection and the
requirement that the decay mode match one of the dominant modes given in
table~\ref{tab:decay_modes} are effective discriminants. For each subsample,
80\% of the signal and background tau--candidates are used by the TMVA
software to train the neural networks, with half of these (40\% of the total)
serving as a validation sample to ensure that the neural networks are not
over--trained. The number of signal and background entries used for training
and validation in each decay mode subsample is given in
table~\ref{tab:trainingEvents}.
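
A minimal sketch of this 40\%/40\%/20\% division is given below; the array of
candidate indices and the random seed are illustrative only.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
candidates = np.arange(100000)   # placeholder tau-candidate indices
rng.shuffle(candidates)

n = len(candidates)
train = candidates[:int(0.4 * n)]               # 40%: network training
valid = candidates[int(0.4 * n):int(0.8 * n)]   # 40%: over-training check
test  = candidates[int(0.8 * n):]               # 20%: final evaluation
\end{verbatim}
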
\begin{table}
\centering
\begin{tabular}{lcc}
 & Signal & Background \\
\hline
Total number of tau--candidates & 874266 & 9526176 \\
Tau--candidates passing preselection & 584895 & 644315 \\
Tau--candidates with $W(P_T,\eta)>0$ & 538792 & 488917 \\
\hline
Decay Mode & \multicolumn{2}{c}{Training Events} \\
\hline
$\pi^{-}$ & 300951 & 144204 \\
$\pi^{-}\pi^0$ & 136464 & 137739 \\
$\pi^{-}\pi^0\pi^0$ & 34780 & 51181 \\
$\pi^{-}\pi^{-}\pi^{+}$ & 53257 & 155793 \\
$\pi^{-}\pi^{-}\pi^{+}\pi^0$ & 13340 & 135871 \\
\end{tabular}
\caption{Number of events used for neural network training and validation for
each selected decay mode.}
\label{tab:trainingEvents}
\end{table}

The remaining 20\% of the signal and background samples are reserved as a
statistically independent sample to evaluate the performance of the neural
nets after the training is completed. The TaNC uses the ``MLP'' neural
network implementation provided by the TMVA software package, described
in~\cite{TMVA}. The ``MLP'' classifier is a feed-forward artificial neural
network with two layers of hidden nodes and a single node in the output
layer. The hyperbolic tangent function is used as the neuron activation
function. The number of hidden nodes in the first (second) layer is chosen
to be $N+1$ ($2N+1$), where $N$ is the number of input observables, a choice
motivated by Kolmogorov's theorem~\fixme{need to find cite}.
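
For concreteness, a minimal sketch of this architecture is given below. The
layer sizes follow the $N+1$ and $2N+1$ rule; the weights and the input are
random placeholders rather than trained TMVA parameters.

\begin{verbatim}
import numpy as np

N = 10                      # number of input observables (example value)
rng = np.random.default_rng(0)

# Layer sizes: N inputs -> N+1 hidden -> 2N+1 hidden -> 1 output node.
sizes = [N, N + 1, 2 * N + 1, 1]
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=m) for m in sizes[1:]]

def mlp(x):
    """Feed-forward pass with tanh activations on every layer."""
    for W, b in zip(weights, biases):
        x = np.tanh(W @ x + b)
    return x[0]             # single output node, bounded in [-1, 1]

print(mlp(rng.normal(size=N)))
\end{verbatim}
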
The neural network is trained for 500 epochs. At ten epoch intervals, the
neural network error is computed using the validation sample to check for
over--training (see figure~\ref{fig:overTrainCheck}). The neural network
error $E$ is defined~\cite{TMVA} as

\begin{equation}
E = \frac{1}{2} \sum_{i=1}^{N_{train}} (y_{ANN,i} - \hat y_i)^2
\label{eq:NNerrorFunc}
\end{equation}
where $N_{train}$ is the number of training events, $y_{ANN,i}$ is the neural
network output for the $i$th event, and $\hat y_i$ is the desired output
($-1$ for background, $1$ for signal) for the $i$th event. No evidence of
over--training is observed.
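
The error of equation~\ref{eq:NNerrorFunc} and the over--training comparison
can be sketched as follows; the network outputs here are simulated stand-ins
for the classifier values.

\begin{verbatim}
import numpy as np

def nn_error(y_ann, y_hat):
    """E = 1/2 * sum_i (y_ann_i - y_hat_i)^2, eq. (NNerrorFunc)."""
    return 0.5 * np.sum((y_ann - y_hat) ** 2)

rng = np.random.default_rng(0)
y_hat = rng.choice([-1.0, 1.0], size=1000)  # desired outputs
y_train = np.clip(y_hat + rng.normal(0, 0.4, 1000), -1, 1)  # mock outputs
y_valid = np.clip(y_hat + rng.normal(0, 0.4, 1000), -1, 1)

# Over-training would show up as the training error continuing to fall
# while the validation error flattens out or rises.
print(nn_error(y_train, y_hat), nn_error(y_valid, y_hat))
\end{verbatim}
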

\begin{figure}[thbp]
\setlength{\unitlength}{1mm}
\begin{center}
\begin{picture}(150, 195)(0,0)
\put(0.5, 130)
{\mbox{\includegraphics*[height=60mm]{figures/overtrainCheck_OneProngNoPiZero.pdf}}}
\put(65, 130)
{\mbox{\includegraphics*[height=60mm]{figures/overtrainCheck_OneProngOnePiZero.pdf}}}
\put(0.5, 65)
{\mbox{\includegraphics*[height=60mm]{figures/overtrainCheck_OneProngTwoPiZero.pdf}}}
\put(65, 65)
{\mbox{\includegraphics*[height=60mm]{figures/overtrainCheck_ThreeProngNoPiZero.pdf}}}
\put(33, 0)
{\mbox{\includegraphics*[height=60mm]{figures/overtrainCheck_ThreeProngOnePiZero.pdf}}}
\end{picture}
|
110 |
friis |
1.5 |
\caption{
|
111 |
friis |
1.6 |
Neural network classification error for training (solid red) and testing
|
112 |
|
|
(dashed blue) samples at ten epoch intervals over the 500 training epochs for each
|
113 |
friis |
1.5 |
decay mode neural network. The vertical axis represents the classification
|
114 |
|
|
error, defined by equation~\ref{eq:NNerrorFunc}. N.B. that the choice of
|
115 |
|
|
hyperbolic tangent for neuron activation functions results in the desired
|
116 |
friis |
1.10 |
outputs for signal and background to be 1 and -1, respectively. This results
|
117 |
friis |
1.5 |
in the computed neural network error being larger by a factor of four than
|
118 |
|
|
the case where the desired outputs are (0, 1). Classifier over--training
|
119 |
|
|
would be evidenced by divergence of the classification error of the training
|
120 |
|
|
and testing samples, indicating that the neural net was optimizing about
|
121 |
friis |
1.6 |
statistical fluctuations in the training sample.
|
122 |
friis |
1.4 |
}
|
123 |
|
|
\label{fig:overTrainCheck}
|
124 |
|
|
\end{center}
|
125 |
|
|
\end{figure}
|
126 |
|
|
|
The neural networks use the transverse momentum and $\eta$ of the
tau--candidates as input observables. These observables are included because
their correlations with other observables can increase the separation power
of the ensemble of observables. For example, the opening angle in $\Delta R$
for signal tau--candidates is inversely related to the transverse momentum,
while for background events the correlation is very small~\cite{DavisTau}.
In the training signal and background samples, there is significant
discrimination power in the $P_T$ spectrum. However, it is desirable to
eliminate any systematic dependence of the neural network output on $P_T$
and $\eta$, as in practice the TaNC will be presented with tau--candidates
whose $P_T$--$\eta$ spectrum is analysis dependent. The dependence on $P_T$
and $\eta$ is removed by applying a $P_T$ and $\eta$ dependent weight to the
tau--candidates when training the neural nets.

The weights are defined such that, in any region of the space spanned by
$P_T$ and $\eta$ where the signal and background probability density
functions differ, the sample with the higher probability density is
down-weighted so that the two samples have identical $P_T$--$\eta$
probability distributions. This removes regions of $P_T$--$\eta$ space where
the training sample is exclusively signal or background. The weights are
computed according to
\begin{align*}
W(P_T, \eta) &= \min\left(p_{sig}(P_T, \eta),\, p_{bkg}(P_T, \eta)\right)\\
w_{sig}(P_T, \eta) &= W(P_T, \eta)/p_{sig}(P_T, \eta) \\
w_{bkg}(P_T, \eta) &= W(P_T, \eta)/p_{bkg}(P_T, \eta)
\end{align*}
where $p_{sig}(P_T,\eta)$ and $p_{bkg}(P_T,\eta)$ are the probability
densities of the signal and background samples after the ``leading pion''
and dominant decay mode selections. Figure~\ref{fig:nnTrainingWeights} shows
the signal and background training $P_T$ distributions before and after the
weighting is applied.
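
A minimal sketch of this weighting scheme, using two-dimensional histograms
as the density estimates, is given below; the binning and the toy
$P_T$--$\eta$ distributions are illustrative assumptions.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

# Toy pt-eta samples standing in for signal and background tau-candidates.
sig = np.column_stack([rng.exponential(40, 50000),
                       rng.uniform(-2.5, 2.5, 50000)])
bkg = np.column_stack([rng.exponential(60, 50000),
                       rng.uniform(-2.5, 2.5, 50000)])

pt_bins = np.linspace(0, 200, 41)
eta_bins = np.linspace(-2.5, 2.5, 21)

# Normalized 2D histograms approximate p_sig(pt, eta) and p_bkg(pt, eta).
p_sig, _, _ = np.histogram2d(sig[:, 0], sig[:, 1],
                             bins=(pt_bins, eta_bins), density=True)
p_bkg, _, _ = np.histogram2d(bkg[:, 0], bkg[:, 1],
                             bins=(pt_bins, eta_bins), density=True)

# W = min(p_sig, p_bkg); each sample's weight divides out its own density.
W = np.minimum(p_sig, p_bkg)
w_sig = np.where(p_sig > 0, W / np.where(p_sig > 0, p_sig, 1.0), 0.0)
w_bkg = np.where(p_bkg > 0, W / np.where(p_bkg > 0, p_bkg, 1.0), 0.0)

def weights_for(sample, w):
    """Look up each candidate's weight from its pt-eta bin."""
    i = np.clip(np.digitize(sample[:, 0], pt_bins) - 1, 0, len(pt_bins) - 2)
    j = np.clip(np.digitize(sample[:, 1], eta_bins) - 1, 0, len(eta_bins) - 2)
    return w[i, j]

sig_weights = weights_for(sig, w_sig)
bkg_weights = weights_for(bkg, w_bkg)
\end{verbatim}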

\begin{figure}[thbp]
\setlength{\unitlength}{1mm}
\begin{center}
\begin{picture}(150,60)(0,0)
\put(10.5, 2){
\mbox{\includegraphics*[height=58mm]{figures/training_weights_unweighted.pdf}}}
\put(86.0, 2){
\mbox{\includegraphics*[height=58mm]{figures/training_weights_weighted.pdf}}}
\end{picture}
\caption{Transverse momentum spectrum of the signal and background
tau--candidates used in neural net training before (left) and after (right)
the application of the $P_T$--$\eta$ dependent weight function. Application
of the weights lowers the training significance of tau--candidates in
regions of $P_T$--$\eta$ phase space where either the signal or background
sample has an excess of events.}
\label{fig:nnTrainingWeights}
\end{center}
\end{figure}