
\section{Counting Experiments}
\label{sec:datadriven}

To look for possible BSM contributions, we define two signal regions that preserve about 
0.1\% of the dilepton $t\bar{t}$ events, by adding requirements of large \MET\ and \Ht:

\begin{itemize}
\item high \MET\ signal region: \MET\ $>$ 275~GeV, \Ht\ $>$ 300~GeV,
\item high \Ht\ signal region:  \MET\ $>$ 200~GeV, \Ht\ $>$ 600~GeV.
\end{itemize}

For the high \MET\ (high \Ht) signal region, the MC predicts 2.6 (2.5) SM events, 
dominated by dilepton $t\bar{t}$; the expected LM1 yield is 17 (14) and the
expected LM3 yield is 6.4 (6.7). The signal regions are indicated in Fig.~\ref{fig:met_ht}.
These signal regions are tighter than the one used in our published 2010 analysis because,
with the larger data sample, they give improved sensitivity to contributions from new physics.

We perform counting experiments in these signal regions, and use three independent data-driven methods to estimate the background in each signal region.
The first method is a novel technique based on the ABCD method, which we used in our 2010 analysis~\cite{ref:ospaper}, 
and exploits the fact that \HT\ and $y \equiv \MET/\sqrt{H_T}$ are nearly uncorrelated for the $t\bar{t}$ background;
this method is referred to as the ABCD' technique. First, we extract the $y$ and \Ht\ distributions 
$f(y)$ and $g(H_T)$ from data, using events from control regions which are dominated by background. 
Because $y$ and \Ht\ are weakly correlated, the distribution of events in the $y$ vs. \Ht\ plane is described, to a good approximation, by:

\begin{equation}
\label{eq:abcdprime}
\frac{\partial^2 N}{\partial y \partial H_T} = f(y)g(H_T),
\end{equation}

allowing us to deduce the number of events falling in any region of this plane. In particular,
we can deduce the number of events falling in our signal regions defined by requirements on \MET\ and \Ht.
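
As an illustration of how Eq.~\ref{eq:abcdprime} yields a prediction, consider a signal region defined by
\MET\ $> M_0$ and \Ht\ $> H_0$, where $M_0$ and $H_0$ are symbols introduced here only for this sketch.
Because the region is defined in terms of \MET\ rather than $y$, the lower limit of the $y$ integral depends on $H_T$.
Assuming $f(y)$ and $g(H_T)$ are normalized such that their product gives the event density of Eq.~\ref{eq:abcdprime},
the predicted yield is

\begin{equation}
N_{\rm pred} = \int_{H_0}^{\infty} dH_T \, g(H_T) \int_{M_0/\sqrt{H_T}}^{\infty} dy \, f(y).
\end{equation}

For the high \MET\ signal region ($M_0 = 275$~GeV, $H_0 = 300$~GeV) the lower edge in $y$ is about
15.9~GeV$^{1/2}$ at $H_T = 300$~GeV and decreases with increasing $H_T$; for the high \Ht\ signal region
($M_0 = 200$~GeV, $H_0 = 600$~GeV) it is about 8.2~GeV$^{1/2}$ at $H_T = 600$~GeV.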

We measure the $f(y)$ and $g(H_T)$ distributions using events in the regions indicated in Fig.~\ref{fig:abcdprimedata},
and predict the background yields in the signal regions using Eq.~\ref{eq:abcdprime}.
%Next, we randomly sample values of $y$ and \Ht\ from these distributions; each pair of $y$ and \Ht\ values is a pseudo-event.
%We generate a large ensemble of pseudo-events, and find the ratio $R_{S/C}$, the ratio of the
%number of pseudo-events falling in the signal region to the number of pseudo-events
%falling in a control region defined by the same requirements used to select events
%to measure $f(y)$ and $g(H_T)$. We then
%multiply this ratio by the number events which fall in the control region in data
%to get the predicted yield, ie. $N_{pred} = R_{S/C} \times N({\rm control})$. 
To estimate the statistical uncertainty in the predicted background, the bin contents
of $f(y)$ and $g(H_T)$ are smeared according to their Poisson uncertainties, the prediction is repeated 20 times
with these smeared distributions, and the RMS of the deviation from the nominal prediction is taken
as the statistical uncertainty. We have studied this technique using toy MC
event samples of similar size to the expected yield in data for 1 fb$^{-1}$.
Based on these studies we correct the predicted background yields by factors of 1.2 $\pm$ 0.5
(1.0 $\pm$ 0.5) for the high \MET\ (high \Ht) signal region.
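
In symbols, and with notation introduced here only for illustration, let $N_0$ be the nominal prediction and
$N_i$ ($i = 1, \ldots, 20$) the predictions obtained from the smeared distributions. A minimal sketch of the
procedure described above, assuming a $1/20$ normalization of the RMS (the text does not specify $1/20$ versus $1/19$), is

\begin{equation}
\sigma_{\rm stat} = \sqrt{\frac{1}{20}\sum_{i=1}^{20}\left(N_i - N_0\right)^2}, \qquad
N_{\rm corr} = k \, N_0,
\end{equation}

where $k = 1.2 \pm 0.5$ ($1.0 \pm 0.5$) is the correction factor for the high \MET\ (high \Ht) signal region.
The uncertainty in $k$ alone corresponds to a relative uncertainty of about 42\% (50\%) in the corrected prediction.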


The second  background estimate, henceforth referred to as the dilepton transverse momentum ($\pt(\ell\ell)$) method, 
is  based on the  idea~\cite{ref:victory} that  in dilepton  $t\bar{t}$  events the
\pt\  distributions of  the charged  leptons and  neutrinos  from $W$
decays are  related, because of the  common boosts from  the top  and $W$
decays.  This relation  is governed by the polarization  of the $W$'s,
which         is         well         understood        in         top
decays in the SM~\cite{Wpolarization,Wpolarization2}   and   can  therefore   be
reliably  accounted   for.   We then  use   the  observed
$\pt(\ell\ell)$ distribution to  model the $\pt(\nu\nu)$ distribution,
which is  identified with \MET.  Thus,  we use the  number of observed
events  with $\HT > 300\GeV$ and $\pt(\ell\ell)  > 275\GeV$ 
($\HT > 600\GeV$ and $\pt(\ell\ell)  > 200\GeV$)
to predict the  number of  background events  with 
$\HT >  300\GeV$ and  $\MET > 275\GeV$ ($\HT >  600\GeV$ and  $\MET > 200\GeV$).  
In  practice, we apply two corrections to this prediction, following the same procedure as in Ref.~\cite{ref:ospaper}.
The first correction is $K_{50}=1.5 \pm 0.3$ ($1.3 \pm 0.2$) for the high \MET\ (high \Ht) signal region.
The second correction factor is $K_C  = 1.5  \pm 0.5$ ($1.3 \pm 0.4$) for the
high \MET\ (high \Ht) signal region.
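
As a rough sketch of how these factors combine, we assume here (this is not stated explicitly above) that both
corrections enter multiplicatively and that their uncertainties are uncorrelated. Writing the observed
$\pt(\ell\ell)$ control yield as $N_{\ell\ell}$, a symbol introduced only for this illustration, the prediction would be

\begin{equation}
N_{\rm pred} = K_{50} \, K_C \, N_{\ell\ell}, \qquad
K_{50} K_C \approx 2.3 \pm 0.9 \;\; (1.7 \pm 0.6)
\end{equation}

for the high \MET\ (high \Ht) signal region. The quoted uncertainties follow from adding the relative uncertainties
in quadrature, e.g. $\sqrt{(0.3/1.5)^2 + (0.5/1.5)^2} \approx 0.39$ for the high \MET\ signal region.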

Our third background estimation method is based on the fact that many models of new physics
produce an excess of SF with respect to OF lepton pairs, while for the \ttbar\ background the
rates of SF and OF lepton pairs are the same. Hence we make use of the OF subtraction technique
discussed in Sec.~\ref{sec:fit} in which we performed a shape analysis of the dilepton mass distribution.
Here we perform a counting experiment, by quantifying the excess of SF vs. OF pairs using the
quantity

\begin{equation}
\label{eq:ofhighpt}
\Delta = R_{\mu e}N(ee) + \frac{1}{R_{\mu e}}N(\mu\mu) - N(e\mu).
\end{equation}

This quantity is predicted to be 0 for processes with 
uncorrelated lepton flavors. In order for this technique to work, the kinematic selection 
applied to events in all dilepton flavor channels must be the same, which is not the case 
for our default selection because the $Z$ mass veto is applied only to same-flavor channels.
Therefore when applying the OF subtraction technique we also apply the $Z$ mass veto
to the $e\mu$ channel. 
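
The cancellation in Eq.~\ref{eq:ofhighpt} can be made explicit with a simple sketch. We assume here, for illustration,
that $R_{\mu e} = \epsilon_\mu/\epsilon_e$ is the ratio of muon to electron selection efficiencies and that the
dilepton efficiency factorizes into single-lepton efficiencies; the symbols $n$, $\epsilon_e$ and $\epsilon_\mu$ are
introduced only for this example. For a process with uncorrelated lepton flavors, the $ee$, $\mu\mu$ and $e\mu$
final states are produced in the ratio 1:1:2, so that $N(ee) = \epsilon_e^2 n$, $N(\mu\mu) = \epsilon_\mu^2 n$ and
$N(e\mu) = 2\epsilon_e\epsilon_\mu n$. Then

\begin{equation}
\Delta = \frac{\epsilon_\mu}{\epsilon_e}\,\epsilon_e^2 n + \frac{\epsilon_e}{\epsilon_\mu}\,\epsilon_\mu^2 n
 - 2\epsilon_e\epsilon_\mu n = 0.
\end{equation}

Any selection applied to only some of the flavor channels, such as the $Z$ mass veto, would change the yields
unequally and spoil this cancellation.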

All background estimation methods based on data are in principle subject to signal contamination
in the control regions, which tends to increase the background prediction and thereby decrease
the significance of any signal that may be present in the data.
In general, it is difficult to quantify these effects because we 
do not know what signal may be present in the data.  Having three
independent methods (in addition to expectations from MC)
adds redundancy because signal contamination can have different effects
in the different control regions for the three methods.
For example, in the extreme case of a
BSM signal with identical distributions of $\pt(\ell \ell)$ and \MET, an excess of events might be seen 
in the ABCD' method but not in the $\pt(\ell \ell)$ method.

% Revision 1.6, committed Wed Jun 15 10:03:51 2011 UTC by benhoob (branch MAIN, CVS tag v2; log message: Minor updates)