\section{The Tier-0}

The Tier-0 had a pivotal role in CSA06. If it failed, nothing else could
succeed, for lack of data. Robust file-by-file error handling and recovery
were therefore less important than continuous smooth operation of the
total system. Failure of any given processing step resulted in the output
being discarded, with the rest of the system supplying enough data to
ensure a steady flow. Unreliable reconstruction algorithms were therefore
excluded, after initial testing, to ensure that the reconstruction
software used in the Tier-0 was performant and robust.

The fundamental properties of the dataflow and workflow had already been
verified at scales beyond those of physics startup using the ``July
prototype'', an emulation that allows exploration of the behavioural
phase-space without relying on real events or CMSSW software. The basic
design was thus known to be sound and workable, and the Tier-0 effort for
CSA06 therefore concentrated on exploring the operational aspects.

The Tier-0 workflow for CSA06 encompassed most of the complexity required
for first physics: Prompt Reconstruction, creation of AOD and AlcaReco
streams, merging of small output files into larger files for archive and
export, insertion of files into DBS, injection into PhEDEx, and retrieval
of calibration and alignment constants from Frontier. Only the
communication with the Storage Manager and the repacking step were
missing, since the input Monte Carlo data was not delivered in a format
that was appropriate for such an exercise.

The Tier-0 ran for 4 weeks with 100\% uptime, with no intrinsic scaling or
behavioural problems. Operator interventions were required to introduce
new CMSSW versions, modify the workflow as more features became available
in CMSSW, adjust the rates of the different physics channels, and deal
with minor problems caused by trivial bugs in the Tier-0 framework. The
Tier-0 achieved all of its target metrics for CSA06.

\subsection{System Architecture}

The Tier-0 uses a Perl-based message-passing framework. Components in the
workflow send messages to a central dispatcher/logger, which then forwards
them to other components. The forwarding is based on subscriptions:
components declare what they are interested in receiving. This gives a
modular, pluggable system; the workflow is determined solely by the
interaction of the components through their message contents, and
components are not directly coupled. The workflow can be changed on the
fly to accommodate new requirements that may appear, and components can be
stopped and restarted with no impact on the overall system behaviour.
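
As a hedged illustration of this subscription mechanism (the real Tier-0
framework is written in Perl; the component, message and function names
below are hypothetical), a minimal dispatcher could look like this:

\begin{verbatim}
# Minimal sketch of a subscription-based dispatcher (illustrative only;
# the actual Tier-0 framework is Perl-based and uses different names).
from collections import defaultdict

class Dispatcher:
    """Central dispatcher/logger: forwards each message to its subscribers."""
    def __init__(self):
        self.subscribers = defaultdict(list)    # message type -> callbacks

    def subscribe(self, msg_type, callback):
        # A component declares which message types it wants to receive.
        self.subscribers[msg_type].append(callback)

    def publish(self, msg_type, payload):
        # Log the message, then forward it to every interested component.
        print("LOG:", msg_type, payload)
        for callback in self.subscribers[msg_type]:
            callback(payload)

# Hypothetical usage: a merge manager listens for completed AlcaReco jobs.
dispatcher = Dispatcher()
dispatcher.subscribe("AlcaRecoReady", lambda f: print("queue for merge:", f))
dispatcher.publish("AlcaRecoReady", {"file": "alcareco_001.root"})
\end{verbatim}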

The component structure used for CSA06 was to have one Manager component
and zero or more Worker components for each step of the workflow (Prompt
Reconstruction, AOD, AlcaReco, Fast Merge, DBS Update, and Exporter). The
managers subscribe to messages indicating that input files are ready for
them, and build queues of payloads for the workers. Workers ask the
managers for work when they become idle. All components can report
information to MonaLisa, and through it to the dashboard. Workers report
per-task statistics (e.g.\ reporting every 50 events), while the managers
report per-step aggregates (e.g.\ reporting the total number of active
Prompt Reconstruction processes).
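
The manager/worker pull model can be sketched as follows (again
illustrative Python with hypothetical names; the real components are Perl
and communicate through the dispatcher described above):

\begin{verbatim}
# Sketch of the manager/worker pull model (hypothetical names, illustrative only).
import queue

class StepManager:
    """One manager per workflow step; it builds a queue of payloads for workers."""
    def __init__(self, step_name):
        self.step_name = step_name
        self.work = queue.Queue()

    def on_input_ready(self, payload):
        # Invoked when a subscribed message signals that an input file is ready.
        self.work.put(payload)

    def next_payload(self):
        # Idle workers call this to pull their next piece of work.
        return self.work.get()

def worker_loop(manager, run_job):
    # A worker asks its manager for work whenever it becomes idle.
    while True:
        run_job(manager.next_payload())
\end{verbatim}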

The entire system is lightweight, flexible and robust, and has very low
latency. This makes it very well suited to the operational environment
expected for the Tier-0 during real data-taking.

The processing rate of the Tier-0 was determined by the rate at which
files were injected into the Prompt Reconstruction. For a given input
dataset, files were injected at intervals corresponding to a given rate in
MB/sec. This rate was adjusted to correspond to the desired event rate in
Hz, using the average event size per dataset.
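
For illustration (the actual per-dataset rates are not quoted here), the
conversion is simply
\begin{displaymath}
  r_{\mathrm{inject}}\;[\mathrm{MB/s}] \;=\;
  f_{\mathrm{target}}\;[\mathrm{Hz}] \,\times\,
  \langle s_{\mathrm{event}} \rangle\;[\mathrm{MB}],
\end{displaymath}
so injecting minbias data (average input event size of 0.5~MB,
cf.\ Table~\ref{tab:PR106}) at the nominal 40~Hz target of
Fig.~\ref{fig:ProcessRate} would correspond to an injection rate of about
20~MB/s.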

\subsection{Prompt Reconstruction}

Prompt Reconstruction began at noon on October 2nd with CMSSW 1\_0\_2.
Only minbias data was used in the first few days, following the plan set
out in the CSA06 wiki page. The EWKSoup sample (used as a pseudo
express-line stream) was added on October 5th, in order to increase
data-transfer rates. Job success rates were over 99.7\% in all channels.

Successive versions of CMSSW were used over the following weeks, as the
code matured for reconstruction of the signal channels and for the other
activities needed at the Tier-0 (AOD and AlcaReco production). At first,
new versions were tested standalone, in parallel with the main running of
the Tier-0. Later versions were deployed live in the Tier-0, without
separate testing, but at a low enough level that any failures would not
harm the smooth running of the total system. Once a new version was seen
to be stable, its event rate was increased, and the older version was
gracefully retired.

As each new version of CMSSW was deployed, reconstruction was restarted
from the beginning of the input data. If a version of CMSSW was retired
before the input data was completely processed, that input channel was
left incomplete for that version of the software. With CMSSW\_1\_0\_6, all
input channels were run to completion, to provide a complete and coherent
dataset for all subsequent activities. Essentially, the Tier-0 part of
CSA06 was repeated from scratch in the last week, with CMSSW\_1\_0\_6.

CMSSW\_1\_0\_3 was deployed for the second and third weeks of running,
being stable enough for the reconstruction of the signal channels.
CMSSW\_1\_0\_5 was used from the 19th to the 24th of October, and included
the first AlcaReco streams, from minbias data. CMSSW\_1\_0\_6 was used
from the 22nd to the 30th, when the Tier-0 participation in CSA06 ended.
This final version had all the AlcaReco streams, the AODs, and
Frontier access to conditions data.

The output of the Prompt Reconstruction contained the original input event
as well as the reconstructed data, because of limitations in the CMSSW
framework. This made the RECO output larger than the original input, so
merging of RECO files was not useful. Prompt Reconstruction was therefore
a one-file-in/one-file-out process. Event sizes and reconstruction times
for CMSSW\_1\_0\_6 are shown in Table~\ref{tab:PR106}.

\begin{table}[htb]
\centering
\caption{Prompt Reconstruction with CMSSW\_1\_0\_6}
\label{tab:PR106}
\begin{tabular}{|l|c|c|c|}
\hline
Channel & Reconstruction & Input Event & Output Event \\
        & Time (CPU sec) & Size (MB)   & Size (MB)    \\
\hline
EWKSoup     & 6.7  & 1.1 & 1.7 \\
ExoticSoup  & 18.5 & 1.8 & 2.8 \\
HLTElectron & 8.6  & --  & 1.8 \\
HLTGamma    & 37.4 & --  & 3.5 \\
HLTJet      & 42.0 & --  & 3.7 \\
HLTMuon     & 8.4  & --  & 1.8 \\
Jets        & 22.8 & 1.6 & 2.6 \\
minbias     & 2.9  & 0.5 & 0.8 \\
SoftMuon    & 8.0  & 1.2 & 1.9 \\
TTbar       & 19.3 & 2.0 & 3.4 \\
Wenu        & 8.0  & 1.2 & 1.8 \\
ZMuMu       & 8.4  & 1.2 & 2.0 \\
\hline
\end{tabular}
\end{table}

\subsection{AlcaReco Production}

AlcaReco streams are produced according to the map shown in
%Table~\ref{tab:ARStreams},
Table~\ref{tab:alcareco} in Section~\ref{sec:offlineswalca},
which relates input datasets to output AlcaReco streams. AlcaReco streams
were first produced with CMSSW\_1\_0\_3, running on minbias RECO data
produced with CMSSW\_1\_0\_2 and for just the AlcastreamElectron stream;
not until CMSSW\_1\_0\_6 were all input/output stream combinations
available and useful. As with Prompt Reconstruction, all channels were run
to completion with CMSSW\_1\_0\_6.

%
%\begin{table}[htb]
%\centering
%\caption{AlcaReco input/output stream map}
%\label{tab:ARStreams}
%\begin{tabular}{|l|c|c|c|c|l|}
%\hline
%Input Dataset & ZMuMu & minbias & Jets & Wenu & \\
%\hline
%AlcaReco stream & & & & & Purpose\\
%\hline
%CSA06ZMuMu & X & - & - & - & Tracker Alignment \\
%CSA06MinBias & - & X & - & - & Tracker Alignment \\
%\hline
%AlcastreamElectron & - & - & - & X & ECAL Calibration \\
%AlcastreamEcalPhiSym & - & X & - & - & ECAL Calibration \\
%\hline
%AlcastreamHcalDijets & - & - & X & - & HCAL Calibration \\
%AlcastreamHcalIsotrk & - & X & X & - & HCAL Calibration \\
%AlcastreamHcalMinbias & - & X & - & - & HCAL Calibration \\
%\hline
%CSA06ZMuMu\_muon & X & - & - & - & Muon Alignment \\
%\hline
%\end{tabular}
%\end{table}

The AlcaReco files were mostly small, by definition, so merging was
required before they could be written to tape. Without this step, simply
writing the files to tape and reading them back would have been
effectively impossible, to say nothing of analysing so many files once
they were on disk.

AlcaReco production was essentially error-free, with no problems from the
CMSSW application for any channel.

\subsection{AOD Production}

AOD production was run only with CMSSW\_1\_0\_6, since no suitable
configuration file was available earlier. As with AlcaReco, all channels
were run to completion, and the output was merged for efficient tape and
analysis access.

AOD production was also trouble-free, with a negligible failure rate.

\subsection{Fast Merge}

Merging is an integral part of the Tier-0. To minimize the load on the
storage system and on our own data management, we have certain
requirements on the minimum size of the files we keep. But for various
reasons (mostly workflow optimization) some jobs write output files that
are significantly smaller than these requirements. In CSA06 the AOD and
especially the AlcaReco output files were smaller than optimal. Due to the
RAW+RECO output of the Prompt Reconstruction in CSA06, the Prompt
Reconstruction output files were actually larger than the RAW input files
and did not need to be merged.

The Merge component consisted of two parts: a manager and a number of
workers. The manager subscribed to notifications of AlcaReco and AOD job
completions and queued them internally. There were multiple queues,
separated by DataType (AlcaReco or AOD), dataset, CMSSW version and the
configuration (PSet hash) used in the AOD or AlcaReco job. Once a new
entry was added to a queue, the content of that queue was checked against
three thresholds:

\begin{itemize}
\item Number of input files (FileThreshold)
\item Number of events in input files (EventThreshold)
\item Combined size of input files (SizeThreshold)
\end{itemize}

The FileThreshold was set to 32. This limit was introduced because of the
next step in the workflow, the registration of the merged file into DBS.
The registration was done with a shell command, and all the parent files
(i.e.\ the input files of all the AlcaReco or AOD jobs whose output was
being merged) were passed as command-line arguments. Because we were
concerned about shell command-length limits, we restricted the number of
input files to 32. In standalone long-term testing (without DBS
registration) we had already successfully explored scenarios with up to
150 input files.

The EventThreshold was set to 100000, since this is a useful number of
events for AlcaReco studies.

The SizeThreshold was set to 3.9~GB to prevent overly large files.
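
As a hedged sketch of this merge-trigger logic (illustrative Python; the
names and data structures are hypothetical, while the thresholds are the
CSA06 values quoted above):

\begin{verbatim}
# Illustrative sketch of the merge-trigger check, not the actual Perl component.
FILE_THRESHOLD  = 32              # limited by DBS registration via shell command
EVENT_THRESHOLD = 100000          # useful number of events for AlcaReco studies
SIZE_THRESHOLD  = 3.9 * 1024**3   # 3.9 GB, to prevent overly large files

# (DataType, dataset, CMSSW version, PSet hash) -> queued output-file records
queues = {}

def add_output_file(key, file_record):
    """Queue a completed AlcaReco/AOD output file and check the three thresholds."""
    q = queues.setdefault(key, [])
    q.append(file_record)
    if (len(q) >= FILE_THRESHOLD
            or sum(f["events"] for f in q) >= EVENT_THRESHOLD
            or sum(f["size"] for f in q) >= SIZE_THRESHOLD):
        submit_merge(key, q)      # hand the queued files to a merge worker
        queues[key] = []

def submit_merge(key, files):
    print("merge job:", key, "with", len(files), "input files")
\end{verbatim}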

The actual merge operation on the workers was performed by the
EdmFastMerge application. The input files were not accessed directly
through Castor; instead, we staged all the input files to local disk with
rfcp, ran the merge locally with local input and output, and then staged
the merged output back to Castor. For our operational requirements
(merging of many small files) this was shown in earlier tests to be much
faster than merging directly from Castor.
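
The pattern is sketched below (illustrative Python; the exact EdmFastMerge
command-line options are an assumption and not taken from the CSA06
configuration):

\begin{verbatim}
# Sketch of the stage-in / local-merge / stage-out pattern used by the workers.
import os, subprocess, tempfile

def run(cmd):
    # Run a command and raise on a non-zero exit code.
    subprocess.run(cmd, check=True)

def merge_locally(castor_inputs, castor_output):
    workdir = tempfile.mkdtemp()
    local_inputs = []
    for src in castor_inputs:                         # stage in with rfcp
        dst = os.path.join(workdir, os.path.basename(src))
        run(["rfcp", src, dst])
        local_inputs.append(dst)
    local_output = os.path.join(workdir, "merged.root")
    # Hypothetical invocation; the merge itself was done by EdmFastMerge.
    run(["EdmFastMerge", "-o", local_output] + local_inputs)
    run(["rfcp", local_output, castor_output])        # stage out back to Castor
\end{verbatim}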

Since the CMSSW 1\_0\_6 processing cycle went through all input data
within the last week of CSA06, only performance numbers for that period
are quoted here. The performance numbers for merges run with CMSSW 1\_0\_5
are similar. Only for the CMSSW 1\_0\_3 cycle of AlcaReco merges did we
see a significant number of errors: about a third of these merge jobs
failed with Castor stage-in errors.

In the CMSSW 1\_0\_6 processing cycle a total of 5263 merge jobs were
submitted, 2436 for AOD and 2827 for AlcaReco. Of these, 33 jobs failed
due to Castor stage-in errors, and 31 of the 33 were successfully rerun.
The stage-in error observed (the same one that caused much more havoc in
the CMSSW 1\_0\_3 cycle) is due to a bug in Castor that can be triggered
under certain conditions. The Castor team is aware of the problem.
%Hopefully they will provide a fix soon.

The remaining two jobs were rerun twice but failed both times. Further
analysis showed the merge input files to be corrupted, i.e.\ even an
interactive rfcp of the files would always fail. This kind of problem is
surprising, since the job that created these corrupted files checked the
rfcp exit code when it staged them out to Castor.

No failures were observed during the merge itself or during stage-out.

Figs.~\ref{fig:MergeAlcaRecoFiles} and \ref{fig:MergeAODFiles} show the
distribution of the number of input files for AlcaReco and AOD merges. The
AlcaReco plot shows clearly that many merge jobs are triggered at the
threshold of 32 input files.

\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.7\linewidth]{figs/T0MergeAlcaRecoFiles}
\end{center}
\caption{Number of input files for all CMSSW 1\_0\_6 AlcaReco merges at Tier-0.}
\label{fig:MergeAlcaRecoFiles}
\end{figure}

\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.7\linewidth]{figs/T0MergeAODFiles}
\end{center}
\caption{Number of input files for all CMSSW 1\_0\_6 AOD merges at Tier-0.}
\label{fig:MergeAODFiles}
\end{figure}

Figs.~\ref{fig:MergeAlcaRecoSize} and \ref{fig:MergeAODSize} show the
distribution of output file size for AlcaReco and AOD merges. One can see
that most AOD merges are triggered at the 3.9~GB file-size threshold.
AlcaReco merges are more diverse: some are quite large (almost reaching
the 3.9~GB size threshold), while many have quite small output file sizes.

\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.7\linewidth]{figs/T0MergeAlcaRecoSize}
\end{center}
\caption{Output file size for all CMSSW 1\_0\_6 AlcaReco merges at Tier-0.}
\label{fig:MergeAlcaRecoSize}
\end{figure}

\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.7\linewidth]{figs/T0MergeAODSize}
\end{center}
\caption{Output file size for all CMSSW 1\_0\_6 AOD merges at Tier-0.}
\label{fig:MergeAODSize}
\end{figure}

Finally, Figs.~\ref{fig:MergeAlcaRecoEvents} and \ref{fig:MergeAODEvents}
show the distribution of the number of events for AlcaReco and AOD merges.

\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.7\linewidth]{figs/T0MergeAlcaRecoEvents}
\end{center}
\caption{Number of events for all CMSSW 1\_0\_6 AlcaReco merges at Tier-0.}
\label{fig:MergeAlcaRecoEvents}
\end{figure}

\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.7\linewidth]{figs/T0MergeAODEvents}
\end{center}
\caption{Number of events for all CMSSW 1\_0\_6 AOD merges at Tier-0.}
\label{fig:MergeAODEvents}
\end{figure}

\subsection{Data Registration}

Figs.~\ref{fig:LatencyRecoReadyDrop} and \ref{fig:LatencyAverageRecoReadyDrop}
show the latency between the RecoReady notification (i.e.\ a Prompt
Reconstruction job has finished) and the completion of the PhEDEx drop for
the RECO file. Only RECO jobs run with CMSSW 1\_0\_6 software were
considered. In Figure~\ref{fig:LatencyRecoReadyDrop} there is a long tail
up to about 2000 seconds, but it is a flat tail without structure or
spikes. Figure~\ref{fig:LatencyAverageRecoReadyDrop} shows the average
latency per day in October, including the statistical errors.

\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.7\linewidth]{figs/T0LatencyRecoReadyDrop}
\end{center}
\caption{Latency between RecoReady and PhEDEx drop for all CMSSW 1\_0\_6 jobs at Tier-0.}
\label{fig:LatencyRecoReadyDrop}
\end{figure}

\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.7\linewidth]{figs/T0LatencyAverageRecoReadyDrop}
\end{center}
\caption{Average latency by day between RecoReady and PhEDEx drop for all CMSSW 1\_0\_6 jobs at Tier-0.}
\label{fig:LatencyAverageRecoReadyDrop}
\end{figure}

Figs.~\ref{fig:LatencyRegisterMergedDrop} and
\ref{fig:LatencyAverageRegisterMergedDrop} show the latency between the
RegisterMerged notification (i.e.\ the completion of a merge job) and the
completion of the PhEDEx drop for the merged file. Only merge jobs run
with CMSSW 1\_0\_6 software and with input files produced with CMSSW
1\_0\_6 software were considered. In
Figure~\ref{fig:LatencyRegisterMergedDrop} there is a long tail up to
almost 3000 seconds, but it is a flat tail without structure or spikes.
Figure~\ref{fig:LatencyAverageRegisterMergedDrop} shows the average
latency per day in October, including the statistical errors. No value is
shown for day 24 on this plot, since it is off-scale with a large error:
206 with an error of 1200. These plots contain a mix of AOD and AlcaReco
merges with a variety of merge scenarios (number of files, size, etc.).
None of these parameters should affect the DBS registration, except
perhaps the number of input files, because of the registration of
parentage in DBS. If there were large differences depending on the number
of input files, one would expect them to show up as a large statistical
error on the average. For the days that show low latencies, no such
effect is seen.

\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.7\linewidth]{figs/T0LatencyRegisterMergedDrop}
\end{center}
\caption{Latency between RegisterMerged and PhEDEx drop for all CMSSW 1\_0\_6 jobs at Tier-0.}
\label{fig:LatencyRegisterMergedDrop}
\end{figure}

\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.7\linewidth]{figs/T0LatencyAverageRegisterMergedDrop}
\end{center}
\caption{Average latency by day between RegisterMerged and PhEDEx drop for all CMSSW 1\_0\_6 jobs at Tier-0.}
\label{fig:LatencyAverageRegisterMergedDrop}
\end{figure}


\subsection{PhEDEx Data Injection}

Each PhEDEx drop prepared by the Tier-0 workflow needs to be parsed and
its metadata made available to PhEDEx. This task is performed by a
dedicated PhEDEx process, which analyzes the drop, extracts all the
information and feeds it into the Transfer Management Database (TMDB).
This whole process is called ``data injection''.

Since this process takes a non-negligible amount of time, the latency
caused by this workflow step was analyzed.
Fig.~\ref{fig:LatencyPhEDExDataInjection} shows the average injection time
per day and the corresponding statistical error for this measurement. Up
to day 25 only one injection agent was used, while from day 26 onwards
five parallel injectors were started in order to improve performance.
\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.7\linewidth]{figs/PhEDEx_injection_perf.pdf}
\end{center}
\caption{Latency between reception of a PhEDEx drop and finalization of data registration in PhEDEx.
Up to day 25 only one injection process was used, while from day 26 onward five parallel injectors were running.}
\label{fig:LatencyPhEDExDataInjection}
\end{figure}
During the first few days of CSA06 the average time per injected file was
at a constant level of about 3~seconds, with negligible statistical
errors. However, a slow increase of the average injection time is visible
up to day 25. This effect is most likely correlated with the growing
volume of data registered in TMDB, which slows down the injection
performance. Starting with day 26, the latency was reduced to the level
observed during the first days, since from that date until the end of
CSA06 five injection processes were run in parallel.

During the whole period of CSA06 a PhEDEx drop contained information for
only one file. The overhead of reading one XML drop per file could be
reduced by grouping multiple files of the same block into one drop. A
brief test was conducted using about 50 files of the same block per drop,
and a speed-up of the injection operation to about 0.5~s per file was
observed, corresponding to a performance gain of a factor of six. It is
recommended to implement such a feature in the drop-creation process of
the final system in order to further optimize the workflow performance.
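
Such a grouping could be as simple as the following sketch (illustrative
Python; the batch size of 50 matches the test described above, while the
function names and the drop-writing step are hypothetical):

\begin{verbatim}
# Group the files of one block into batches of ~50, one injection drop per batch.
from itertools import islice

def batches(files, size=50):
    """Yield lists of up to `size` files, each to be described by a single drop."""
    it = iter(files)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# for batch in batches(files_of_block):
#     write_drop(block_name, batch)   # one drop describes many files of the block
\end{verbatim}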


\subsection{Performance}

The event processing rate during CSA06 is shown in
Figs.~\ref{fig:ProcessRate} and \ref{fig:ProcessRatePeak}, where the
latter plot shows the peak of the processing during the last day. The
cumulative volume of produced events is shown in
Fig.~\ref{fig:ProcessVolume}, and totals 207M events at the end of the
challenge.

The processing rate in events/sec is not in fact a particularly meaningful
metric for the Tier-0 in CSA06. The event mix was varied considerably to
accommodate external requirements, which gave a wide variation in mean
reconstruction times throughout the challenge. The set of reconstruction
algorithms was explicitly pruned to achieve the stability needed from
CMSSW, and the algorithms thereby excluded tended to be the most complex
and time-consuming ones. Finally, pile-up was not included in the
simulated events. This made the tracking, in particular, rather fast
compared to realistic events.

\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.7\linewidth]{figs/T0ProcessRate}
\end{center}
\caption{The processing rate at Tier-0. The target rate of 40 Hz is
illustrated.
\label{fig:ProcessRate}
}
\end{figure}

\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.7\linewidth]{figs/T0ProcessRatePeak}
\end{center}
\caption{The peak processing rate at Tier-0 in the last day of
Tier-0 operations.
\label{fig:ProcessRatePeak}
}
\end{figure}

\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.7\linewidth]{figs/T0ProcessVolume}
\end{center}
\caption{The total number of produced events at Tier-0.
\label{fig:ProcessVolume}
}
\end{figure}