\section{The Tier-0}

The Tier-0 had a pivotal role in CSA06. If it failed, nothing else could
succeed, for lack of data. Robust file-by-file error handling and recovery
were therefore less important than continuous smooth operation of the
total system. Failure of any given processing step resulted in the output
being discarded, with the rest of the system supplying enough data to
ensure a steady flow. Unreliable reconstruction algorithms were therefore
excluded, after initial testing, to ensure that the reconstruction
software used in the Tier-0 was performant and robust.

The fundamental properties of the dataflow and workflow had already been
verified at scales beyond those of physics startup using the ``July
prototype'', an emulation that allows exploration of the behavioural
phase-space without relying on real events or CMSSW software. The basic
design was thus known to be sound and workable, and the Tier-0 effort for
CSA06 therefore concentrated on exploring the operational aspects.

The Tier-0 workflow for CSA06 encompassed most of the complexity required
for first physics: Prompt Reconstruction, creation of AOD and AlcaReco
streams, merging of small output files into larger files for archive and
export, insertion of files into DBS, injection into PhEDEx, and retrieval
of calibration and alignment constants from Frontier. Only the
communication with the Storage Manager and the repacking step were
missing, since the input Monte Carlo data was not delivered in a format
that was appropriate for such an exercise.

The Tier-0 ran for 4 weeks with 100\% uptime, with no intrinsic scaling or
behavioural problems. Operator interventions were required to introduce
new CMSSW versions, modify the workflow as more features became available
in CMSSW, adjust the rates of the different physics channels, and deal
with minor problems caused by trivial bugs in the Tier-0 framework. The
Tier-0 achieved all of its target metrics for CSA06.

\subsection{System Architecture}

The Tier-0 uses a Perl-based message-passing framework. Components in the
workflow send messages to a central dispatcher/logger, which then forwards
them to other components. The forwarding is based on subscriptions:
components declare what they are interested in receiving. This gives a
modular, pluggable system; the workflow is determined solely by the
interaction of the components through their message contents, and
components are not directly coupled. The workflow can be changed on the
fly to accommodate new requirements that may appear, and components can be
stopped and restarted with no impact on the overall system behaviour.
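
As a hedged illustration of this subscription mechanism (the real Tier-0
framework is written in Perl; the component, message and function names
below are hypothetical), a minimal dispatcher could look like this:

\begin{verbatim}
# Minimal sketch of a subscription-based dispatcher (illustrative only;
# the actual Tier-0 framework is Perl-based and uses different names).
from collections import defaultdict

class Dispatcher:
    """Central dispatcher/logger: forwards each message to its subscribers."""
    def __init__(self):
        self.subscribers = defaultdict(list)    # message type -> callbacks

    def subscribe(self, msg_type, callback):
        # A component declares which message types it wants to receive.
        self.subscribers[msg_type].append(callback)

    def publish(self, msg_type, payload):
        # Log the message, then forward it to every interested component.
        print("LOG:", msg_type, payload)
        for callback in self.subscribers[msg_type]:
            callback(payload)

# Hypothetical usage: a merge manager listens for completed AlcaReco jobs.
dispatcher = Dispatcher()
dispatcher.subscribe("AlcaRecoReady", lambda f: print("queue for merge:", f))
dispatcher.publish("AlcaRecoReady", {"file": "alcareco_001.root"})
\end{verbatim}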

The component structure used for CSA06 was to have one Manager component
and zero or more Worker components for each step of the workflow (Prompt
Reconstruction, AOD, AlcaReco, Fast Merge, DBS Update, and Exporter). The
managers subscribe to messages indicating that input files are ready for
them, and build queues of payloads for the workers. Workers ask the
managers for work when they become idle. All components can report
information to MonaLisa, and through it to the dashboard. Workers report
per-task statistics (e.g.\ reporting every 50 events), while the managers
report per-step aggregates (e.g.\ reporting the total number of active
Prompt Reconstruction processes).
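
The manager/worker pull model can be sketched as follows (again
illustrative Python with hypothetical names; the real components are Perl
and communicate through the dispatcher described above):

\begin{verbatim}
# Sketch of the manager/worker pull model (hypothetical names, illustrative only).
import queue

class StepManager:
    """One manager per workflow step; it builds a queue of payloads for workers."""
    def __init__(self, step_name):
        self.step_name = step_name
        self.work = queue.Queue()

    def on_input_ready(self, payload):
        # Invoked when a subscribed message signals that an input file is ready.
        self.work.put(payload)

    def next_payload(self):
        # Idle workers call this to pull their next piece of work.
        return self.work.get()

def worker_loop(manager, run_job):
    # A worker asks its manager for work whenever it becomes idle.
    while True:
        run_job(manager.next_payload())
\end{verbatim}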

The entire system is lightweight, flexible and robust, and has very low
latency. This makes it very well suited to the operational environment
expected for the Tier-0 during real data-taking.

The processing rate of the Tier-0 was determined by the rate at which
files were injected into the Prompt Reconstruction. For a given input
dataset, files were injected at intervals corresponding to a given rate in
MB/sec. This rate was adjusted to correspond to the desired event rate in
Hz, using the average event size per dataset.
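
For illustration (the actual per-dataset rates are not quoted here), the
conversion is simply
\begin{displaymath}
  r_{\mathrm{inject}}\;[\mathrm{MB/s}] \;=\;
  f_{\mathrm{target}}\;[\mathrm{Hz}] \,\times\,
  \langle s_{\mathrm{event}} \rangle\;[\mathrm{MB}],
\end{displaymath}
so injecting minbias data (average input event size of 0.5~MB,
cf.\ Table~\ref{tab:PR106}) at the nominal 40~Hz target of
Fig.~\ref{fig:ProcessRate} would correspond to an injection rate of about
20~MB/s.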

\subsection{Prompt Reconstruction}

Prompt Reconstruction began at noon on October 2nd with CMSSW 1\_0\_2.
Only minbias data was used in the first few days, following the plan set
out in the CSA06 wiki page. The EWKSoup sample (used as a pseudo
express-line stream) was added on October 5th, in order to increase
data-transfer rates. Job success rates were over 99.7\% in all channels.

Successive versions of CMSSW were used over the following weeks, as the
code matured for reconstruction of the signal channels and for the other
activities needed at the Tier-0 (AOD and AlcaReco production). At first,
new versions were tested standalone, in parallel with the main running of
the Tier-0. Later versions were deployed live in the Tier-0, without
separate testing, but at a low enough level that any failures would not
harm the smooth running of the total system. Once a new version was seen
to be stable, its event rate was increased, and the older version was
gracefully retired.

As each new version of CMSSW was deployed, reconstruction was restarted
from the beginning of the input data. If a version of CMSSW was retired
before the input data was completely processed, that input channel was
left incomplete for that version of the software. With CMSSW\_1\_0\_6, all
input channels were run to completion, to provide a complete and coherent
dataset for all subsequent activities. Essentially, the Tier-0 part of
CSA06 was repeated from scratch in the last week, with CMSSW\_1\_0\_6.

CMSSW\_1\_0\_3 was deployed for the second and third weeks of running,
being stable enough for the reconstruction of the signal channels.
CMSSW\_1\_0\_5 was used from the 19th to the 24th of October, and included
the first AlcaReco streams, from minbias data. CMSSW\_1\_0\_6 was used
from the 22nd to the 30th, when the Tier-0 participation in CSA06 ended.
This final version had all the AlcaReco streams, the AODs, and
Frontier access to conditions data.

The output of the Prompt Reconstruction contained the original input event
as well as the reconstructed data, because of limitations in the CMSSW
framework. This made the RECO output larger than the original input, so
merging of RECO files was not useful. Prompt Reconstruction was therefore
a one-file-in/one-file-out process. Event sizes and reconstruction times
for CMSSW\_1\_0\_6 are shown in Table~\ref{tab:PR106}.

\begin{table}[htb]
\centering
\caption{Prompt Reconstruction with CMSSW\_1\_0\_6}
\label{tab:PR106}
\begin{tabular}{|l|c|c|c|}
\hline
Channel & Reconstruction & Input Event & Output Event \\
        & Time (CPU sec) & Size (MB)   & Size (MB)    \\
\hline
EWKSoup     & 6.7  & 1.1 & 1.7 \\
ExoticSoup  & 18.5 & 1.8 & 2.8 \\
HLTElectron & 8.6  & --  & 1.8 \\
HLTGamma    & 37.4 & --  & 3.5 \\
HLTJet      & 42.0 & --  & 3.7 \\
HLTMuon     & 8.4  & --  & 1.8 \\
Jets        & 22.8 & 1.6 & 2.6 \\
minbias     & 2.9  & 0.5 & 0.8 \\
SoftMuon    & 8.0  & 1.2 & 1.9 \\
TTbar       & 19.3 & 2.0 & 3.4 \\
Wenu        & 8.0  & 1.2 & 1.8 \\
ZMuMu       & 8.4  & 1.2 & 2.0 \\
\hline
\end{tabular}
\end{table}

\subsection{AlcaReco Production}

AlcaReco streams are produced according to the map shown in
%Table~\ref{tab:ARStreams},
Table~\ref{tab:alcareco} in Section~\ref{sec:offlineswalca},
which relates input datasets to output AlcaReco streams. AlcaReco streams
were first produced with CMSSW\_1\_0\_3, running on minbias RECO data
produced with CMSSW\_1\_0\_2 and for just the AlcastreamElectron stream;
not until CMSSW\_1\_0\_6 were all input/output stream combinations
available and useful. As with Prompt Reconstruction, all channels were run
to completion with CMSSW\_1\_0\_6.

%
%\begin{table}[htb]
%\centering
%\caption{AlcaReco input/output stream map}
%\label{tab:ARStreams}
%\begin{tabular}{|l|c|c|c|c|l|}
%\hline
%Input Dataset & ZMuMu & minbias & Jets & Wenu & \\
%\hline
%AlcaReco stream & & & & & Purpose\\
%\hline
%CSA06ZMuMu & X & - & - & - & Tracker Alignment \\
%CSA06MinBias & - & X & - & - & Tracker Alignment \\
%\hline
%AlcastreamElectron & - & - & - & X & ECAL Calibration \\
%AlcastreamEcalPhiSym & - & X & - & - & ECAL Calibration \\
%\hline
%AlcastreamHcalDijets & - & - & X & - & HCAL Calibration \\
%AlcastreamHcalIsotrk & - & X & X & - & HCAL Calibration \\
%AlcastreamHcalMinbias & - & X & - & - & HCAL Calibration \\
%\hline
%CSA06ZMuMu\_muon & X & - & - & - & Muon Alignment \\
%\hline
%\end{tabular}
%\end{table}

The AlcaReco files were mostly small, by definition, so merging was
required before they could be written to tape. Without this step, simply
writing the files to tape and reading them back would have been
effectively impossible, to say nothing of analysing so many files once
they were on disk.

AlcaReco production was essentially error-free, with no problems from the
CMSSW application for any channel.

\subsection{AOD Production}

AOD production was run only with CMSSW\_1\_0\_6, since no suitable
configuration file was available earlier. As with AlcaReco, all channels
were run to completion, and the output was merged for efficient tape and
analysis access.

AOD production was also trouble-free, with a negligible failure rate.

\subsection{Fast Merge}

Merging is an integral part of the Tier-0. To minimize the load on the
storage system and on our own data management, we have certain
requirements on the minimum size of the files we keep. But for various
reasons (mostly workflow optimization) some jobs write output files that
are significantly smaller than these requirements. In CSA06 the AOD and
especially the AlcaReco output files were smaller than optimal. Due to the
RAW+RECO output of the Prompt Reconstruction in CSA06, the Prompt
Reconstruction output files were actually larger than the RAW input files
and did not need to be merged.

The Merge component consisted of two parts: a manager and a number of
workers. The manager subscribed to notifications of AlcaReco and AOD job
completions and queued them internally. There were multiple queues,
separated by DataType (AlcaReco or AOD), dataset, CMSSW version and the
configuration (PSet hash) used in the AOD or AlcaReco job. Once a new
entry was added to a queue, the content of that queue was checked against
three thresholds:

\begin{itemize}
\item Number of input files (FileThreshold)
\item Number of events in input files (EventThreshold)
\item Combined size of input files (SizeThreshold)
\end{itemize}

The FileThreshold was set to 32. This limit was introduced because of the
next step in the workflow, the registration of the merged file into DBS.
The registration was done with a shell command, and all the parent files
(i.e.\ the input files of all the AlcaReco or AOD jobs whose output was
being merged) were passed as command-line arguments. Because we were
concerned about shell command-length limits, we restricted the number of
input files to 32. In standalone long-term testing (without DBS
registration) we had already successfully explored scenarios with up to
150 input files.

The EventThreshold was set to 100000, since this is a useful number of
events for AlcaReco studies.

The SizeThreshold was set to 3.9~GB to prevent overly large files.
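
As a hedged sketch of this merge-trigger logic (illustrative Python; the
names and data structures are hypothetical, while the thresholds are the
CSA06 values quoted above):

\begin{verbatim}
# Illustrative sketch of the merge-trigger check, not the actual Perl component.
FILE_THRESHOLD  = 32              # limited by DBS registration via shell command
EVENT_THRESHOLD = 100000          # useful number of events for AlcaReco studies
SIZE_THRESHOLD  = 3.9 * 1024**3   # 3.9 GB, to prevent overly large files

# (DataType, dataset, CMSSW version, PSet hash) -> queued output-file records
queues = {}

def add_output_file(key, file_record):
    """Queue a completed AlcaReco/AOD output file and check the three thresholds."""
    q = queues.setdefault(key, [])
    q.append(file_record)
    if (len(q) >= FILE_THRESHOLD
            or sum(f["events"] for f in q) >= EVENT_THRESHOLD
            or sum(f["size"] for f in q) >= SIZE_THRESHOLD):
        submit_merge(key, q)      # hand the queued files to a merge worker
        queues[key] = []

def submit_merge(key, files):
    print("merge job:", key, "with", len(files), "input files")
\end{verbatim}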

The actual merge operation on the workers was performed by the
EdmFastMerge application. The input files were not accessed directly
through Castor; instead, we staged all the input files to local disk with
rfcp, ran the merge locally with local input and output, and then staged
the merged output back to Castor. For our operational requirements
(merging of many small files) this was shown in earlier tests to be much
faster than merging directly from Castor.
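
The pattern is sketched below (illustrative Python; the exact EdmFastMerge
command-line options are an assumption and not taken from the CSA06
configuration):

\begin{verbatim}
# Sketch of the stage-in / local-merge / stage-out pattern used by the workers.
import os, subprocess, tempfile

def run(cmd):
    # Run a command and raise on a non-zero exit code.
    subprocess.run(cmd, check=True)

def merge_locally(castor_inputs, castor_output):
    workdir = tempfile.mkdtemp()
    local_inputs = []
    for src in castor_inputs:                         # stage in with rfcp
        dst = os.path.join(workdir, os.path.basename(src))
        run(["rfcp", src, dst])
        local_inputs.append(dst)
    local_output = os.path.join(workdir, "merged.root")
    # Hypothetical invocation; the merge itself was done by EdmFastMerge.
    run(["EdmFastMerge", "-o", local_output] + local_inputs)
    run(["rfcp", local_output, castor_output])        # stage out back to Castor
\end{verbatim}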

Since the CMSSW 1\_0\_6 processing cycle went through all input data
within the last week of CSA06, only performance numbers for that period
are quoted here. The performance numbers for merges run with CMSSW 1\_0\_5
are similar. Only for the CMSSW 1\_0\_3 cycle of AlcaReco merges did we
see a significant number of errors: about a third of these merge jobs
failed with Castor stage-in errors.

In the CMSSW 1\_0\_6 processing cycle a total of 5263 merge jobs were
submitted, 2436 for AOD and 2827 for AlcaReco. Of these, 33 jobs failed
due to Castor stage-in errors, and 31 of the 33 were successfully rerun.
The stage-in error observed (the same one that caused much more havoc in
the CMSSW 1\_0\_3 cycle) is due to a bug in Castor that can be triggered
under certain conditions. The Castor team is aware of the problem.
%Hopefully they will provide a fix soon.

The remaining two jobs were rerun twice but failed both times. Further
analysis showed the merge input files to be corrupted, i.e.\ even an
interactive rfcp of the files would always fail. This kind of problem is
surprising, since the job that created these corrupted files checked the
rfcp exit code when it staged them out to Castor.

No failures were observed during the merge itself or during stage-out.

Figs.~\ref{fig:MergeAlcaRecoFiles} and \ref{fig:MergeAODFiles} show the
distribution of the number of input files for AlcaReco and AOD merges. The
AlcaReco plot shows clearly that many merge jobs are triggered at the
threshold of 32 input files.

\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.7\linewidth]{figs/T0MergeAlcaRecoFiles}
\end{center}
\caption{Number of input files for all CMSSW 1\_0\_6 AlcaReco merges at Tier-0.}
\label{fig:MergeAlcaRecoFiles}
\end{figure}

\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.7\linewidth]{figs/T0MergeAODFiles}
\end{center}
\caption{Number of input files for all CMSSW 1\_0\_6 AOD merges at Tier-0.}
\label{fig:MergeAODFiles}
\end{figure}

Figs.~\ref{fig:MergeAlcaRecoSize} and \ref{fig:MergeAODSize} show the
distribution of output file size for AlcaReco and AOD merges. One can see
that most AOD merges are triggered at the 3.9~GB file-size threshold.
AlcaReco merges are more diverse: some are quite large (almost reaching
the 3.9~GB size threshold), while many have quite small output file sizes.

\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.7\linewidth]{figs/T0MergeAlcaRecoSize}
\end{center}
\caption{Output file size for all CMSSW 1\_0\_6 AlcaReco merges at Tier-0.}
\label{fig:MergeAlcaRecoSize}
\end{figure}

\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.7\linewidth]{figs/T0MergeAODSize}
\end{center}
\caption{Output file size for all CMSSW 1\_0\_6 AOD merges at Tier-0.}
\label{fig:MergeAODSize}
\end{figure}

Finally, Figs.~\ref{fig:MergeAlcaRecoEvents} and \ref{fig:MergeAODEvents}
show the distribution of the number of events for AlcaReco and AOD merges.

\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.7\linewidth]{figs/T0MergeAlcaRecoEvents}
\end{center}
\caption{Number of events for all CMSSW 1\_0\_6 AlcaReco merges at Tier-0.}
\label{fig:MergeAlcaRecoEvents}
\end{figure}

\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.7\linewidth]{figs/T0MergeAODEvents}
\end{center}
\caption{Number of events for all CMSSW 1\_0\_6 AOD merges at Tier-0.}
\label{fig:MergeAODEvents}
\end{figure}

\subsection{Data Registration}

Figs.~\ref{fig:LatencyRecoReadyDrop} and \ref{fig:LatencyAverageRecoReadyDrop}
show the latency between the RecoReady notification (i.e.\ a Prompt
Reconstruction job has finished) and the completion of the PhEDEx drop for
the RECO file. Only RECO jobs run with CMSSW 1\_0\_6 software were
considered. In Figure~\ref{fig:LatencyRecoReadyDrop} there is a long tail
up to about 2000 seconds, but it is a flat tail without structure or
spikes. Figure~\ref{fig:LatencyAverageRecoReadyDrop} shows the average
latency per day in October, including the statistical errors.

\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.7\linewidth]{figs/T0LatencyRecoReadyDrop}
\end{center}
\caption{Latency between RecoReady and PhEDEx drop for all CMSSW 1\_0\_6 jobs at Tier-0.}
\label{fig:LatencyRecoReadyDrop}
\end{figure}

\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.7\linewidth]{figs/T0LatencyAverageRecoReadyDrop}
\end{center}
\caption{Average latency by day between RecoReady and PhEDEx drop for all CMSSW 1\_0\_6 jobs at Tier-0.}
\label{fig:LatencyAverageRecoReadyDrop}
\end{figure}

Figs.~\ref{fig:LatencyRegisterMergedDrop} and
\ref{fig:LatencyAverageRegisterMergedDrop} show the latency between the
RegisterMerged notification (i.e.\ the completion of a merge job) and the
completion of the PhEDEx drop for the merged file. Only merge jobs run
with CMSSW 1\_0\_6 software and with input files produced with CMSSW
1\_0\_6 software were considered. In
Figure~\ref{fig:LatencyRegisterMergedDrop} there is a long tail up to
almost 3000 seconds, but it is a flat tail without structure or spikes.
Figure~\ref{fig:LatencyAverageRegisterMergedDrop} shows the average
latency per day in October, including the statistical errors. No value is
shown for day 24 on this plot, since it is off-scale with a large error:
206 with an error of 1200. These plots contain a mix of AOD and AlcaReco
merges with a variety of merge scenarios (number of files, size, etc.).
None of these parameters should affect the DBS registration, except
perhaps the number of input files, because of the registration of
parentage in DBS. If there were large differences depending on the number
of input files, one would expect them to show up as a large statistical
error on the average. For the days that show low latencies, no such
effect is seen.

\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.7\linewidth]{figs/T0LatencyRegisterMergedDrop}
\end{center}
\caption{Latency between RegisterMerged and PhEDEx drop for all CMSSW 1\_0\_6 jobs at Tier-0.}
\label{fig:LatencyRegisterMergedDrop}
\end{figure}

\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.7\linewidth]{figs/T0LatencyAverageRegisterMergedDrop}
\end{center}
\caption{Average latency by day between RegisterMerged and PhEDEx drop for all CMSSW 1\_0\_6 jobs at Tier-0.}
\label{fig:LatencyAverageRegisterMergedDrop}
\end{figure}


\subsection{PhEDEx Data Injection}

Each PhEDEx drop prepared by the Tier-0 workflow needs to be parsed and
its metadata made available to PhEDEx. This task is performed by a
dedicated PhEDEx process, which analyzes the drop, extracts all the
information and feeds it into the Transfer Management Database (TMDB).
This whole process is called ``data injection''.

Since this process takes a non-negligible amount of time, the latency
caused by this workflow step was analyzed.
Fig.~\ref{fig:LatencyPhEDExDataInjection} shows the average injection time
per day and the corresponding statistical error for this measurement. Up
to day 25 only one injection agent was used, while from day 26 onwards
five parallel injectors were started in order to improve performance.
\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.7\linewidth]{figs/PhEDEx_injection_perf.pdf}
\end{center}
\caption{Latency between reception of a PhEDEx drop and finalization of data registration in PhEDEx.
Up to day 25 only one injection process was used, while from day 26 onward five parallel injectors were running.}
\label{fig:LatencyPhEDExDataInjection}
\end{figure}
During the first few days of CSA06 the average time per injected file was
at a constant level of about 3~seconds, with negligible statistical
errors. However, a slow increase of the average injection time is visible
up to day 25. This effect is most likely correlated with the growing
volume of data registered in TMDB, which slows down the injection
performance. Starting with day 26, the latency was reduced to the level
observed during the first days, since from that date until the end of
CSA06 five injection processes were run in parallel.

During the whole period of CSA06 a PhEDEx drop contained information for
only one file. The overhead of reading one XML drop per file could be
reduced by grouping multiple files of the same block into one drop. A
brief test was conducted using about 50 files of the same block per drop,
and a speed-up of the injection operation to about 0.5~s per file was
observed, corresponding to a performance gain of a factor of six. It is
recommended to implement such a feature in the drop-creation process of
the final system in order to further optimize the workflow performance.
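
Such a grouping could be as simple as the following sketch (illustrative
Python; the batch size of 50 matches the test described above, while the
function names and the drop-writing step are hypothetical):

\begin{verbatim}
# Group the files of one block into batches of ~50, one injection drop per batch.
from itertools import islice

def batches(files, size=50):
    """Yield lists of up to `size` files, each to be described by a single drop."""
    it = iter(files)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# for batch in batches(files_of_block):
#     write_drop(block_name, batch)   # one drop describes many files of the block
\end{verbatim}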


\subsection{Performance}

The event processing rate during CSA06 is shown in
Figs.~\ref{fig:ProcessRate} and \ref{fig:ProcessRatePeak}, where the
latter plot shows the peak of the processing during the last day. The
cumulative volume of produced events is shown in
Fig.~\ref{fig:ProcessVolume}, and totals 207M events at the end of the
challenge.

The processing rate in events/sec is not in fact a particularly meaningful
metric for the Tier-0 in CSA06. The event mix was varied considerably to
accommodate external requirements, which gave a wide variation in mean
reconstruction times throughout the challenge. The set of reconstruction
algorithms was explicitly pruned to achieve the stability needed from
CMSSW, and the algorithms thereby excluded tended to be the most complex
and time-consuming ones. Finally, pile-up was not included in the
simulated events. This made the tracking, in particular, rather fast
compared to realistic events.

\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.7\linewidth]{figs/T0ProcessRate}
\end{center}
\caption{The processing rate at Tier-0. The target rate of 40 Hz is
illustrated.
\label{fig:ProcessRate}
}
\end{figure}

\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.7\linewidth]{figs/T0ProcessRatePeak}
\end{center}
\caption{The peak processing rate at Tier-0 in the last day of
Tier-0 operations.
\label{fig:ProcessRatePeak}
}
\end{figure}

\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.7\linewidth]{figs/T0ProcessVolume}
\end{center}
\caption{The total number of produced events at Tier-0.
\label{fig:ProcessVolume}
}
\end{figure}