
\section{Background Estimates from Data}
\label{sec:datadriven}

To look for possible BSM contributions, we define two signal regions that preserve about
0.1\% of the dilepton $t\bar{t}$ events, by adding requirements of large \MET\ and \Ht:

\begin{itemize}
\item high \MET\ signal region: \MET $>$ 275~GeV, \Ht $>$ 300~GeV,
\item high \Ht\ signal region:  \MET $>$ 200~GeV, \Ht $>$ 600~GeV.
\end{itemize}

For the high \MET\ (high \Ht) signal region, the MC predicts 2.6 (2.5) SM events, 
dominated by dilepton $t\bar{t}$; the expected LM1 yield is 17 (14) and the
expected LM3 yield is 4.3 (4.3). The signal regions are indicated in Fig.~\ref{fig:met_ht}.

We use three independent methods to estimate from data the background in the signal regions.
The first method is a novel technique based on the ABCD method, which we used in our 2010 analysis~\cite{ref:ospaper};
it exploits the fact that \HT\ and $y$ are nearly uncorrelated for the $t\bar{t}$ background
and is referred to as the ABCD' technique. First, we extract the $y$ and \Ht\ distributions
$f(y)$ and $g(H_T)$ from data, using events from control regions that are dominated by background.
Because $y$ and \Ht\ are only weakly correlated, the distribution of events in the $y$ vs.\ \Ht\ plane is approximately described by

\begin{equation}
\frac{\partial^2 N}{\partial y \partial H_T} = f(y)g(H_T),
\end{equation}

allowing us to deduce the number of events falling in any region of this plane. In particular,
we can deduce the number of events falling in our signal regions defined by requirements on \MET\ and \Ht.
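
As an illustration of how the factorized form is used, the expected yield in a region $\mathcal{R}$ of the plane follows by integration.
The sketch below assumes $y \equiv \MET/\sqrt{H_T}$, a definition that is not restated in this section, so that a requirement
$\MET > E_0$ corresponds to $y > E_0/\sqrt{H_T}$; the symbols $\mathcal{R}$, $E_0$ and $H_0$ are introduced here for illustration only:

\begin{equation}
N(\mathcal{R}) = \int\!\!\int_{\mathcal{R}} f(y)\,g(H_T)\,dy\,dH_T ,
\end{equation}

\begin{equation}
N(\MET > E_0,\ H_T > H_0) = \int_{H_0}^{\infty} g(H_T) \left[ \int_{E_0/\sqrt{H_T}}^{\infty} f(y)\,dy \right] dH_T .
\end{equation}

Under this assumption the lower bound on $y$ depends on $H_T$, so the signal regions are not rectangles in the $y$ vs.\ \Ht\ plane,
and the integral is most conveniently evaluated numerically.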

We measure the $f(y)$ and $g(H_T)$ distributions using events in the regions indicated in Fig.~\ref{fig:abcdprimedata}.
Next, we randomly sample values of $y$ and \Ht\ from these distributions; each pair of $y$ and \Ht\ values is a pseudo-event.
We generate a large ensemble of pseudo-events and compute $R_{S/C}$, the ratio of the
number of pseudo-events falling in the signal region to the number of pseudo-events
falling in a control region defined by the same requirements used to select events
to measure $f(y)$ and $g(H_T)$. We then
multiply this ratio by the number of events that fall in the control region in data
to get the predicted yield, i.e., $N_{\rm pred} = R_{S/C} \times N({\rm control})$.
To estimate the statistical uncertainty in the predicted background, we smear the bin contents
of $f(y)$ and $g(H_T)$ according to their uncertainties. We repeat the prediction 20 times
with these smeared distributions and take the RMS of the deviations from the nominal prediction
as the statistical uncertainty. We have studied this technique using toy MC studies based on
event samples of similar size to the expected yield in data for 1~fb$^{-1}$.
Based on these studies, we correct the predicted background yields by factors of 1.2 $\pm$ 0.5
(1.0 $\pm$ 0.5) for the high \MET\ (high \Ht) signal region.
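
Schematically, the procedure can be summarized as follows. The notation is introduced here only for illustration:
$N_{\rm pseudo}$ counts pseudo-events, $N_{\rm pred}^{(i)}$ is the prediction obtained from the $i$th pair of smeared distributions,
$N_{\rm pred}^{\rm nom}$ is the nominal prediction, and $k$ is the correction factor from the toy MC studies.
The exact normalization of the RMS and the point at which $k$ is applied are assumptions of this sketch:

\begin{eqnarray}
R_{S/C} & = & \frac{N_{\rm pseudo}({\rm signal})}{N_{\rm pseudo}({\rm control})} , \\
N_{\rm pred} & = & R_{S/C} \times N({\rm control}) , \\
\sigma_{\rm stat} & = & \sqrt{ \frac{1}{20} \sum_{i=1}^{20} \left( N_{\rm pred}^{(i)} - N_{\rm pred}^{\rm nom} \right)^2 } , \\
N_{\rm pred}^{\rm corr} & = & k \times N_{\rm pred} , \qquad k = 1.2 \pm 0.5 \ (1.0 \pm 0.5) .
\end{eqnarray}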


The second background estimate, henceforth referred to as the dilepton transverse momentum ($\pt(\ell\ell)$) method,
is based on the idea~\cite{ref:victory} that in dilepton $t\bar{t}$ events the
\pt\ distributions of the charged leptons and neutrinos from $W$
decays are related, because of the common boosts from the top and $W$
decays. This relation is governed by the polarization of the $W$ bosons,
which is well understood in top
decays in the SM~\cite{Wpolarization,Wpolarization2} and can therefore be
reliably accounted for. We then use the observed
$\pt(\ell\ell)$ distribution to model the $\pt(\nu\nu)$ distribution,
which is identified with \MET. Thus, we use the number of observed
events with $\HT > 300\GeV$ and $\pt(\ell\ell) > 275\GeV$
($\HT > 600\GeV$ and $\pt(\ell\ell) > 200\GeV$)
to predict the number of background events with
$\HT > 300\GeV$ and $\MET > 275\GeV$ ($\HT > 600\GeV$ and $\MET > 200\GeV$).
In practice, two corrections must be applied to this prediction, as described below.

%
% Now describe the corrections
%
The first correction ($K_{50}$) accounts for the $\MET > 50\GeV$ requirement in the
preselection, which is needed to reduce the DY background. We
rescale the prediction by a factor equal to the inverse of the
fraction of events passing the preselection that also satisfy the
requirement $\pt(\ell\ell) > 50\GeV$.
For the \Ht\ $>$ 300~GeV requirement corresponding to the high \MET\ signal region,
we determine this correction from data and find $K_{50}=1.5 \pm 0.3$.
For the \Ht\ $>$ 600~GeV requirement corresponding to the high \Ht\ signal region,
we do not have enough events in data to determine this correction with sufficient statistical
precision, so we instead extract it from MC and find $K_{50}=1.3 \pm 0.2$.
The second correction ($K_C$) is associated with the known polarization of the $W$, which
introduces a difference between the $\pt(\ell\ell)$ and $\pt(\nu\nu)$
distributions. The correction $K_C$ also takes into account detector effects such as the hadronic energy
scale and resolution, which affect the \MET\ but not $\pt(\ell\ell)$.
The total correction factor is $K_{50} \times K_C = 2.2 \pm 0.9$ ($1.7 \pm 0.6$) for the
high \MET\ (high \Ht) signal region, where the uncertainty includes the MC statistical uncertainty
in the extraction of $K_C$ and the 5\% uncertainty in the hadronic energy scale~\cite{ref:jes}.
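
In formula form, the prediction for the high \MET\ signal region can be sketched as below; the labels $N_{\rm obs}$, $N_{\rm pred}$ and
$N_{\rm presel}$ are introduced here for illustration and do not appear elsewhere in this section:

\begin{eqnarray}
N_{\rm pred}(\HT > 300\GeV,\ \MET > 275\GeV) & = & K_{50} \times K_C \times N_{\rm obs}(\HT > 300\GeV,\ \pt(\ell\ell) > 275\GeV) , \\
K_{50} & = & \frac{N_{\rm presel}}{N_{\rm presel}(\pt(\ell\ell) > 50\GeV)} .
\end{eqnarray}

The high \Ht\ signal region follows by replacing the thresholds with $\HT > 600\GeV$ and $200\GeV$.
Dividing the central value of the total correction by that of $K_{50}$ suggests $K_C \approx 2.2/1.5 \approx 1.5$
($1.7/1.3 \approx 1.3$); this is simple arithmetic on the rounded values quoted above, not an independently quoted result.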

Our third background estimation method is based on the fact that many models of new physics
produce an excess of same-flavor (SF) with respect to opposite-flavor (OF) lepton pairs. In SUSY, such an excess may be produced
in the decay $\chi_2^0 \to \chi_1^0 \ell^+\ell^-$ or in the decay of $Z$ bosons produced in
the cascade decays of heavy, colored objects. In contrast, for the \ttbar\ background the
rates of SF and OF lepton pairs are the same, as is also the case for other SM backgrounds
such as $W^+W^-$ or DY$\to\tau^+\tau^-$. We quantify the excess of SF vs. OF pairs using the
quantity

\begin{equation}
\label{eq:ofhighpt}
\Delta = R_{\mu e}N(ee) + \frac{1}{R_{\mu e}}N(\mu\mu) - N(e\mu),
\end{equation}

where $R_{\mu e} = 1.13 \pm 0.05$ is the ratio of muon to electron selection efficiencies,
evaluated by taking the square root of the ratio of the number of
$Z \to \mu^+\mu^-$ to $Z \to e^+e^-$ events in data, in the mass range 76--106~GeV with no jet or
\MET\ requirements. The quantity $\Delta$ is predicted to be 0 for processes with
uncorrelated lepton flavors. For this technique to work, the kinematic selection
applied to events in all dilepton flavor channels must be the same, which is not the case
for our default selection because the $Z$ mass veto is applied only to the same-flavor channels.
Therefore, when applying the OF subtraction technique, we apply the $Z$ mass veto also
to the $e\mu$ channel.
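
The vanishing of $\Delta$ for flavor-uncorrelated processes can be checked with a short calculation.
As an illustrative assumption, suppose that the per-lepton selection efficiencies $\epsilon_e$ and $\epsilon_\mu$ factorize,
so that $R_{\mu e} = \epsilon_\mu/\epsilon_e$, and let $n$ denote the number of produced $ee$ pairs, so that $\mu\mu$ and $e\mu$ pairs
are produced at rates $n$ and $2n$, respectively. The symbols $\epsilon_e$, $\epsilon_\mu$ and $n$ are introduced here only for this sketch. Then

\begin{eqnarray}
N(ee) = \epsilon_e^2\, n , \qquad N(\mu\mu) & = & \epsilon_\mu^2\, n , \qquad N(e\mu) = 2\,\epsilon_e \epsilon_\mu\, n , \\
\Delta & = & \frac{\epsilon_\mu}{\epsilon_e}\,\epsilon_e^2\, n + \frac{\epsilon_e}{\epsilon_\mu}\,\epsilon_\mu^2\, n - 2\,\epsilon_e \epsilon_\mu\, n = 0 .
\end{eqnarray}

Under the same assumption, $N(Z \to \mu^+\mu^-)/N(Z \to e^+e^-) = \epsilon_\mu^2/\epsilon_e^2$, which is why the square root of this ratio
is taken to obtain $R_{\mu e}$.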

All background estimation methods based on data are in principle subject to signal contamination
in the control regions, which tends to decrease the significance of a signal
that may be present in the data by increasing the background prediction.
In general, it is difficult to quantify these effects because we
do not know what signal may be present in the data. Having several
independent methods (in addition to expectations from MC)
adds redundancy because signal contamination can have different effects
in the different control regions of the different methods.
For example, in the extreme case of a
BSM signal with identical distributions of $\pt(\ell \ell)$ and \MET, an excess of events might be seen
in the ABCD' method but not in the $\pt(\ell \ell)$ method.

% Revision 1.3, committed Mon Jun 13 18:08:56 2011 UTC by benhoob (branch MAIN, CVS tag v1); log message: Lots of updates