\section{Tier-1 and Tier-2 Operations}

\subsection{Data Transfers}

The Tier-1 centers were expected to receive data from CERN at a rate
proportional to 25\% of the 2008 pledge rate and to serve the data to
the Tier-2 centers. The expected rate into the Tier-1 centers is shown in
Table~\ref{tab:tier01pledge}.

\begin{table}[htb]
\begin{tabular}{|l|l|l|}
\hline
Site & Goal Rate (MB/s) & Threshold Rate (MB/s) \\
\hline
ASGC & 15 & 7.5 \\
CNAF & 25 & 12.5 \\
FNAL & 50 & 25 \\
GridKa & 25 & 12.5 \\
IN2P3 & 25 & 12.5 \\
PIC & 10 & 5 \\
RAL & 10 & 5 \\
\hline
\end{tabular}
\caption{Expected transfer rates from CERN to Tier-1 centers based on the MOU pledges.}
\label{tab:tier01pledge}
\end{table}

In the computing model the Tier-2 centers are expected to transfer data
from the Tier-1 centers in bursts. The goal rate in CSA06 was 20MB/s,
with a threshold for success of 5MB/s. Achieving these metrics was
defined as sustaining the transfer rate for a 24 hour
period. At the beginning of CSA06 CMS concentrated primarily on
moving data from the ``associated'' Tier-1 centers to the Tier-2s. By
the end of the challenge most of the Tier-1 to Tier-2 permutations had
been attempted.

The total data transferred between sites in CSA06 is shown in
Figure~\ref{fig:totaltran}. This plot only includes wide area data
transfers; additionally, data was moved onto tape at the majority of
Tier-1 centers. Over the 45 days of the challenge CMS was able to
move more than 1 petabyte of data over the wide area.

\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.7\linewidth]{figs/CSA06_CumTran}
\caption{The cumulative data volume transferred during CSA06 in TB.}
\end{center}
\label{fig:totaltran}
\end{figure}

Timeline:
\begin{itemize}

\item October 2, 2006: The Tier-0 to Tier-1 transfers began on the
first day of the challenge. In the first few hours 6 of 7 Tier-1
centers successfully received data. During the first week only
minimum bias was reconstructed, and at 40Hz the total rate out of the
CERN site did not meet the 150MB/s target rate.

\item October 3, 2006: All 7 Tier-1 sites were able to successfully
receive data and 8 Tier-2 centers were subscribed to data samples,
including Belgium IIHE, UC San Diego, Wisconsin, Nebraska, DESY, Aachen, and
Estonia. There were successful transfers to 6 Tier-2 sites.

\item October 4, 2006: An additional 11 Tier-2 sites were subscribed
to data samples: Pisa, Purdue, CIEMAT, Caltech, Florida, Rome, Bari,
CSCS, IHEP, Belgium UCL, and Imperial College. Of the 19 registered
Tier-2 sites, 12 were able to receive data. Of those, 5 exceeded the
goal transfer rates for over an hour, and an additional 3 were over
the threshold rate.

\item October 5, 2006: Three additional Tier-2s were added, increasing
the number of participating sites above the goal of 20 Tier-2
centers. New hardware installed at IN2P3 for CSA06 began to exhibit
stability problems leading to poor transfer efficiency.

\item October 9, 2006: RAL transitioned from a dCache SE to a CASTOR2
SE. The signal samples began being reconstructed at the Tier-0.

\item October 10-12, 2006: The Tier-1 sites had stable operations
through the week at an aggregate rate of approximately 100MB/s from
CERN.

\item October 13, 2006: Multiple subscriptions of the minimum bias
samples were made to some of the Tier-1 centers to increase the total
rate of data transfer from CERN. The number of participating Tier-2
sites increased to 23.

\item October 18, 2006: The PhEDEx transfer system held a lock in the
Oracle database which blocked other agents from continuing with
transfers. This problem appeared more frequently in the latter half
of the challenge when the load was higher.

\item October 20, 2006: The reconstruction rate was increased at the
Tier-0 to improve the output from CERN and to better exercise the
prompt reconstruction farm. The data rate from CERN approximately
doubled. An average rate over an hour of 600MB/s from CERN was
achieved.

\item October 25, 2006: The transfer rate from CERN remained high, with
daily average rates of 250MB/s-300MB/s. The first transfer backlogs
began to appear.

\item October 30, 2006: Data reconstruction at the Tier-0 stopped.

\item October 31, 2006: PIC and ASGC finished transferring the assigned prompt reconstruction data from CERN.

\item November 2, 2006: FNAL and IN2P3 also completed the transfers.

\item November 3, 2006: RAL completed the transfers. The first of the
Tier-1 to any Tier-2 transfer validations began. The test involved
sending a small sample from a Tier-1 site to a validated Tier-2, in
this case DESY, and then sending a small sample to all Tier-2
sites.

\item November 5, 2006: CNAF completed the Tier-0 transfers.

\item November 6, 2006: The Tier-1 to Tier-2 transfer testing continued.

\item November 9, 2006: GridKa completed the Tier-0 transfers.

\end{itemize}

\subsubsection{Transfers to Tier-1 Centers}

During CSA06 the Tier-1 centers met the transfer rate goals. In the
first week of the challenge, using minimum bias events, the total volume
of data out of CERN did not amount to 150MB/s unless the datasets were
subscribed to multiple sites. After the reconstruction rate was
increased at the Tier-0 the transfer rate easily exceeded the 150MB/s
target. The 30 day and 15 day averages are shown in
Table~\ref{tab:tier01csa06}. For the thirty day average all sites
except one exceeded the goal rate, and for the final 15 days all sites
easily exceeded the goal. Several sites doubled or tripled the goal
rate during the final two weeks of high volume transfers.

The WLCG metric for availability this year is 90\% for the Tier-1
sites. If we apply this to the Tier-1 centers participating in CSA06
transfers, 6 of 7 Tier-1s reached the availability goal.

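Roughly speaking, and in our own notation rather than a formula quoted from the WLCG documents, the availability requirement for a site over the period can be read as
\[
\mathrm{availability} = \frac{T_{\mathrm{up}}}{T_{\mathrm{scheduled}}} \geq 0.90,
\]
where $T_{\mathrm{up}}$ is the time during which the site services were operating normally and $T_{\mathrm{scheduled}}$ is the total scheduled time in the period.
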
\begin{table}[htb]
\begin{tabular}{|l|r|r|r|r|c|}
\hline
Site & Anticipated Rate (MB/s) & Last 30 day average (MB/s) & Last 15 day average (MB/s) & Outage (Days) & MSS used \\
\hline
ASGC & 15 & 17 & 23 & 0 & (Yes) \\
CNAF & 25 & 26 & 37 & 0 & (Yes) \\
FNAL & 50 & 68 & 98 & 0 & Yes \\
GridKa & 25 & 23 & 28 & 3 & No \\
IN2P3 & 25 & 23 & 34 & 1 & Yes \\
PIC & 10 & 22 & 33 & 0 & No \\
RAL & 10 & 23 & 33 & 2 & Yes \\
\hline
\end{tabular}
\caption{Transfer rates during CSA06 between CERN and Tier-1 centers and the number of outage days during the active challenge activities. In the MSS column the parentheses indicate that the site either had scaling issues keeping up with the total rate to tape, or transferred only a portion of the data to tape.}
\label{tab:tier01csa06}
\end{table}

The rate of data transferred averaged over 24 hours and the volume of
data transferred in 24 hours are shown in Figures~\ref{fig:tier01rate}
and~\ref{fig:tier01vol}. The start of the transfers during the first
week is visible on the left side of the plot, as is the failure to reach
the target rate, shown as a horizontal red bar. The twin
peaks in excess of 300MB/s and 25TB of data moved correspond to the
over-subscription of data. The bottom of the graph has indicators of
the approximate Tier-0 reconstruction rate. Both the rate and the
volume figures clearly show the point when the Tier-0 trigger rate was
doubled to 100Hz. The daily average exceeded 350MB/s with more than
30TB moved. The hourly averages from CERN peaked at more than
650MB/s.

\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.7\linewidth]{figs/Tier01rate}
\caption{The rate of data transferred from the Tier-0 to the Tier-1 centers in MB per second.}
\end{center}
\label{fig:tier01rate}
\end{figure}


\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.7\linewidth]{figs/Tier01vol}
\caption{The total volume of data transferred from the Tier-0 to the Tier-1 centers in TB per day.}
\end{center}
\label{fig:tier01vol}
\end{figure}

The transferable volume plot shown in Figure~\ref{fig:tier01queue} is an
indicator of how well the sites kept up with the volume of data
from the Tier-0 reconstruction farm. During the first three weeks of
the challenge almost no backlog of files was accumulated by the Tier-1
centers. A hardware failure at IN2P3 resulted in a small
accumulation. The additional data subscriptions led to a spike in
data to transfer, but it was quickly cleared by the Tier-1 sites. The
most significant volumes of data waiting for transfer came at the end
of the challenge. During this time GridKa performed a dCache
storage upgrade that resulted in a large accumulation of data to
transfer. CNAF suffered a file server problem that reduced the amount
of available hardware. Additionally, RAL turned off the import system
for two days over a weekend to demonstrate the ability to recover from
a service interruption. These Tier-1 issues, combined with PhEDEx
database connection interruptions under the heavy load of the final
week of transfers, led to a backlog of approximately 50TB over
the final days of the heavy challenge transfers. During this time
CERN continued to serve data at 350MB/s on average.


\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.7\linewidth]{figs/Tier01queue}
\caption{The total volume of data waiting for transfer from the Tier-0 to the Tier-1 centers in TB per day.}
\end{center}
\label{fig:tier01queue}
\end{figure}

The CERN to Tier-1 transfer quality is shown in
Figure~\ref{fig:tier01qual}. In CMS the transfer quality is defined
by the number of times a transfer has to be attempted before it
successfully completes. A link between two sites with 100\%
transfer quality would have attempted each transfer only once, while a
10\% transfer quality would indicate that each transfer had to be attempted
ten times to successfully complete. Most transfers eventually
complete, but a low transfer quality uses the transfer resources
inefficiently and usually results in a low utilization of the network.
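
Expressed as a formula, and using our own notation rather than one taken from the monitoring tools, the quality of a link in a given time bin corresponds to
\[
\mathrm{quality} = \frac{N_{\mathrm{success}}}{N_{\mathrm{attempt}}},
\]
where $N_{\mathrm{attempt}}$ counts every transfer attempt on the link and $N_{\mathrm{success}}$ counts the attempts that completed successfully; a link on which every file succeeds at the first attempt therefore has 100\% quality, while a link on which each file must be attempted ten times has 10\% quality.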

\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.7\linewidth]{figs/qualt0t1}
\caption{Transfer quality between CERN and Tier-1 centers over 30 days.}
\end{center}
\label{fig:tier01qual}
\end{figure}


The transfer quality plot compares very favorably to equivalent plots
made during the spring. The CERN CASTOR2 storage element performed
very stably throughout the challenge. There were two small
configuration issues that were promptly addressed by the experts.
The Tier-1s also performed well throughout the challenge, with several
24 hour periods to specific Tier-1s with no transfer errors. The
stability of the RAL SE before the transition to CASTOR2 can be seen
at the left side of the plot, as well as the intentional downtime to
demonstrate recovery on the right side of the plot. The IN2P3
hardware problems are visible during the first week and the GridKa
dCache upgrade is clearly visible during the last week. Most of the
other periods are solidly green. Both FNAL and PIC are above 70\%
efficient for every day of the challenge activities.


Tier-1 to Tier-1 transfers were considered to be beyond the scope of
CSA06, though the dataflow exists in the CMS computing model. During
CSA06 we had an opportunity to test Tier-1 to Tier-1 transfers while
recovering from backlogs of data when the samples were subscribed to
multiple sites. PhEDEx is designed to take the data from whichever source
site it can be most efficiently transferred from. Figure~\ref{fig:t1t1}
shows the total Tier-1 to Tier-1 transfers during CSA06. With 7
Tier-1s there are $7 \times 6 = 42$ permutations of Tier-1 to Tier-1 transfers,
counting each direction separately. During CSA06 we successfully
exercised about half of them.

\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.7\linewidth]{figs/T1T1Rate}
\caption{Transfer rate between Tier-1 centers during CSA06.}
\end{center}
\label{fig:t1t1}
\end{figure}

\subsubsection{Transfers to Tier-2 Centers}
In the CMS computing model the Tier-2s are expected to be able to
receive data from any Tier-1 site. In order to simplify CSA06
operations we began by concentrating on transfers from the
``associated'' Tier-1 sites, and in the final two weeks of the
challenge began a concerted effort on transfers from any Tier-1. The
associated Tier-1 center is the center operating the File Transfer
Service (FTS) server and hosting the channels for Tier-2 transfers.

The Tier-2 transfer metrics involved both participation and
performance. For CSA06 CMS had 27 sites that signed up to participate
in the challenge. Participation was defined as having successful
transfers on 80\% of the days during the challenge. By this metric there
were 21 sites that succeeded in participating in the challenge, which
is above the goal of 20.

The Tier-2 transfer performance goal was 20MB/s and the threshold
was 5MB/s. In the CMS computing model the Tier-2 transfers are
expected to occur in bursts: data will be transferred to refresh a
Tier-2 cache, and then will be analyzed locally. The Tier-2 sites
were therefore not expected to hit the goal transfer rate continuously
throughout the challenge. There were 12 sites that successfully
averaged above the goal rate for at least one 24 hour period, and an
additional 8 sites that averaged above the threshold rate for at least
one 24 hour period.
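
As an illustration of how the participation and performance metrics can be evaluated, the sketch below applies them to per-site daily transfer summaries. The data layout and helper names are hypothetical; only the numbers (successful transfers on 80\% of the days, a 20MB/s goal and a 5MB/s threshold averaged over a 24 hour period) come from the definitions above.

\begin{verbatim}
# Hypothetical evaluation of the CSA06 Tier-2 transfer metrics.
# `daily` maps a site name to one record per challenge day, holding the
# number of successful transfers and the 24-hour average rate in MB/s.

GOAL_MBS = 20.0                # goal rate averaged over one 24 hour period
THRESHOLD_MBS = 5.0            # threshold rate over one 24 hour period
PARTICIPATION_FRACTION = 0.80  # fraction of days with successful transfers


def participated(days):
    """True if the site had successful transfers on at least 80% of days."""
    good_days = sum(1 for d in days if d["successful_transfers"] > 0)
    return good_days >= PARTICIPATION_FRACTION * len(days)


def performance(days):
    """'goal', 'threshold' or 'below', from the best 24-hour average rate."""
    best = max(d["avg_rate_mbs"] for d in days)
    if best >= GOAL_MBS:
        return "goal"
    if best >= THRESHOLD_MBS:
        return "threshold"
    return "below"


def summarize(daily):
    """Return {site: (participated, performance)} for all sites."""
    return {site: (participated(days), performance(days))
            for site, days in daily.items()}


if __name__ == "__main__":
    example = {"SiteA": [{"successful_transfers": 4, "avg_rate_mbs": 22.0},
                         {"successful_transfers": 0, "avg_rate_mbs": 0.0}]}
    print(summarize(example))
\end{verbatim}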

The transfer rate over the 30 most active transfer days is shown in
Figure~\ref{fig:tier12rate}. The aggregate rate from Tier-1 to
Tier-2 centers was not as high as the total rate from CERN, which is
not an accurate reflection of the transfers expected in the CMS
computing model. In the CMS computing model more data is
exported from the Tier-1s to the Tier-2s than total raw data coming
from CERN, because data is sent to multiple Tier-2s and the Tier-2s may
flush data from the cache and reload it at a later time. In CSA06 the
Tier-2 centers were subscribed to specific samples at the beginning
and then to specific skims when available.

\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.7\linewidth]{figs/tier12rate}
\caption{Transfer rate between Tier-1 and Tier-2 centers during the first 30 days of CSA06.}
\end{center}
\label{fig:tier12rate}
\end{figure}

The ability of the Tier-1 centers to export data was successfully
demonstrated during the challenge, but several sites reported
interference between receiving and exporting data. The quality of the
Tier-1 to Tier-2 data transfers is shown in Figure~\ref{fig:tier12qual}.
The quality is not nearly as consistently green as in the CERN to Tier-1
plots, but the variation has a number of causes. Not all of the
Tier-1 centers currently export data as efficiently as CERN,
especially in the presence of a high load of data ingests; in addition,
most of the Tier-2 sites do not have as much operational experience
receiving data as the Tier-1 sites do.

The Tier-1 to Tier-2 transfer quality looks very similar to the CERN
to Tier-1 transfer quality of 9-12 months ago. With a concerted
effort the Tier-1 to Tier-2 transfers should be able to reach the
quality of the current CERN to Tier-1 transfers before they are needed
to move large quantities of experiment data to users.

\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.7\linewidth]{figs/tier12qual}
\caption{Transfer quality between Tier-1 and Tier-2 centers during the first 30 days of CSA06.}
\end{center}
\label{fig:tier12qual}
\end{figure}

There are a number of very positive examples of Tier-1 to Tier-2
transfers. Figure~\ref{fig:picqual} shows the results of the Tier-1
to all Tier-2 tests when PIC was the source of the dataset. A small
skim sample was chosen and within 24 hours 20 sites had successfully
received the dataset. The transfer quality over the 24 hour period
remained high, with successful transfers to all four continents
participating in CMS.

\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.7\linewidth]{figs/PICQual}
\caption{Transfer quality between PIC and Tier-2 sites participating in the dedicated Tier-1 to Tier-2 transfer tests.}
\end{center}
\label{fig:picqual}
\end{figure}

Figure~\ref{fig:fnalrate} is an example of the very high export rates
the Tier-1 centers were able to achieve transferring data to Tier-2
centers. The peak rate on the plot is over 5Gb/s, which was
independently verified by the site network monitoring. This rate is
over 50\% of the Tier-1 data export rate anticipated in the
full sized system.

\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.7\linewidth]{figs/FNAL_Rate}
\caption{Transfer performance between FNAL and Tier-2 sites participating in the dedicated Tier-1 to Tier-2 transfer tests.}
\end{center}
\label{fig:fnalrate}
\end{figure}

Figure~\ref{fig:FZK_DESY} is an example of the very high rates achieved in CSA06 for both Tier-1 export and Tier-2 import. The plot shows both the hourly average and the instantaneous rate. DESY achieved an import rate to disk of more than 400MB/s.

\begin{figure}[ht]
\begin{center}
$\begin{array}{c@{\hspace{1in}}c}
\includegraphics[width=0.50\linewidth]{figs/FZK_DESY_1} &
\includegraphics[width=0.45\linewidth]{figs/FZK_DESY_2} \\ [-0.53cm]
\end{array}$
\end{center}
\caption{The plot on the left is the hourly average transfer rate between GridKa and DESY. The plot on the right is the instantaneous rate between the two sites measured with Ganglia.}
\label{fig:FZK_DESY}
\end{figure}

\subsection{Tier-1 Skim Job Production}
\subsection{Tier-1 Re-Reconstruction}
\subsubsection{Baseline Approach}
\subsubsection{Two-Step Approach}

\subsection{Job Execution at Tier-1 and Tier-2}
\subsubsection{Job Robot}
The processing metrics defined for CSA06 foresaw that sites
offering computing capacity to CMS and participating in CSA06 were expected
to complete an aggregate of 50k jobs per day. The goal was to exercise the
job submission infrastructure and to monitor the input/output rates.

\begin{itemize}
\item About 10k jobs per day were intended as skimming and reconstruction jobs
at the Tier-1 centers
\item About 40k jobs per day were expected to be a combination of user submitted
analysis jobs and robot submitted analysis-like jobs
\end{itemize}

The job robots are automated expert systems that simulate user analysis tasks
using the CMS Remote Analysis Builder (CRAB). They therefore provide a reasonable
method of generating load on the system by running analysis on all data samples
at all sites individually. They have a component/agent based
structure which enables parallel execution. Job distribution to CMS compute
resources is accomplished using Condor-G direct submission on the OSG sites
and gLite bulk submission on the EGEE sites.\\

The job workflow comprises four distinct steps:
\begin{itemize}
\item job creation
\begin{itemize}
\item data discovery using DBS/DLS
\item job splitting according to user requirements
\item preparation of job dependent files (including the JDL)
\end{itemize}
\item job submission
\begin{itemize}
\item check whether there are any compatible resources in the Grid Information System known to the submission system
\item submit the job to the Grid submission component (Resource Broker or Condor-G) through the CMS bookkeeping component (BOSS)
\end{itemize}
\item job status check
\item job output retrieval
\begin{itemize}
\item retrieve the job output from the sandbox located on the Resource Broker (EGEE sites) or the common filesystem (OSG sites)
\end{itemize}
\end{itemize}

The job robot executes all four steps of the workflow described above on a large scale, as sketched below.\\
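
A minimal sketch of one such cycle is given below. The classes and functions are illustrative stand-ins rather than the actual CRAB or BOSS interfaces; only the four-step structure (creation, submission, status check, output retrieval) is taken from the workflow above.

\begin{verbatim}
"""Hypothetical sketch of one job-robot cycle; the names below are
illustrative stand-ins, not the real CRAB/BOSS interfaces."""

from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    dataset: str
    block: str
    jdl: str = ""          # job description prepared in step 1
    status: str = "new"    # new -> submitted -> done
    output: str = ""


def create_jobs(dataset: str, blocks: List[str]) -> List[Job]:
    """Step 1: data discovery (DBS/DLS lookup would happen here), job
    splitting, and preparation of job dependent files including the JDL."""
    jobs = [Job(dataset=dataset, block=b) for b in blocks]
    for job in jobs:
        job.jdl = "Executable = cmsRun; InputData = " + job.block + ";"
    return jobs


def submit(jobs: List[Job], grid_flavour: str) -> None:
    """Step 2: hand the jobs to the Grid submission component (Condor-G on
    OSG, gLite bulk submission on EGEE) through the bookkeeping layer."""
    submitter = "Condor-G" if grid_flavour == "OSG" else "gLite-RB"
    for job in jobs:
        job.status = "submitted via " + submitter


def check_status(jobs: List[Job]) -> None:
    """Step 3: poll the bookkeeping system until the jobs have finished."""
    for job in jobs:
        job.status = "done"


def retrieve_output(jobs: List[Job]) -> List[str]:
    """Step 4: fetch the output sandbox (RB on EGEE, shared filesystem on OSG)."""
    for job in jobs:
        job.output = "result for " + job.block
    return [job.output for job in jobs]


if __name__ == "__main__":
    jobs = create_jobs("/ExampleDataset/CSA06/RECO", ["block1", "block2"])
    submit(jobs, grid_flavour="OSG")
    check_status(jobs)
    print(retrieve_output(jobs))
\end{verbatim}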

Apart from job submission, the monitoring of the job execution over the
entire chain of steps involved plays an important role. CMS has
chosen to use a product called the Dashboard, a development that is part
of the CMS Integration Program. It is a joint effort of the LCG
ARDA project and the MonALISA team in close collaboration with the CMS
developers working on job submission tools for production and analysis.
The objective of the Dashboard is to provide a complete view of the CMS
activity independently of the Grid flavour (i.e. OSG vs. EGEE). The
Dashboard maintains and displays the quantitative characteristics of the
usage pattern by including CMS-specific information, and it reports problems
of various kinds.\\

The monitoring information used in CSA06 is available via a web interface
and includes the following categories; a minimal sketch of how the first
category can be aggregated is given after the list.
\begin{itemize}
\item Quantities - how many jobs are running, pending, successfully
completed, or failed, per user, per site, per input data collection, and
the distribution of these quantities over time
\item Usage of the resources (CPU, memory consumption, I/O rates), and
their distribution over time with aggregation at different levels
\item Distribution of resources between different application areas
(i.e. analysis vs. production), different analysis groups and individual
users
\item Grid behaviour - success rate and failure reasons as a function of time,
site and data collection
\item CMS application behaviour
\item Distribution of data samples over sites and analysis groups
\end{itemize}
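
As an illustration of the first category, the sketch below aggregates a flat list of job records into per-site, per-status counts; the record layout and names are hypothetical and only indicate the kind of aggregation the Dashboard performs.

\begin{verbatim}
from collections import Counter

# Hypothetical flat job records of the kind a Dashboard-like monitor
# collects; the field names are illustrative, not the Dashboard schema.
jobs = [
    {"site": "T2_A", "user": "alice", "dataset": "/DatasetX", "status": "succeeded"},
    {"site": "T2_A", "user": "bob",   "dataset": "/DatasetY", "status": "running"},
    {"site": "T1_B", "user": "alice", "dataset": "/DatasetX", "status": "failed"},
]

# Jobs per (site, status): running, pending, completed, failed per site.
by_site_status = Counter((j["site"], j["status"]) for j in jobs)

# The same records can be aggregated per user or per input data collection.
by_user = Counter(j["user"] for j in jobs)
by_dataset = Counter(j["dataset"] for j in jobs)

for (site, status), count in sorted(by_site_status.items()):
    print(site, status, count)
\end{verbatim}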

Timeline:
\begin{itemize}
\item October 15, 2006: The job robots have started analysis submission. 10k
jobs were submitted by two robot instances, with 90\% of them going to OSG sites
using Condor-G direct submission and 10\% going through the traditional LCG
Resource Broker (RB) to EGEE sites. In preparation for moving to the gLite RB,
thereby improving the submission rate to EGEE sites, bulk submission was
integrated into CRAB and is currently being tested.

\item October 17, 2006: Job robot submissions continue at a larger scale. An
issue was found with the bulk submission feature used at EGEE sites which left
jobs hanging indefinitely. The cause was that the parsing
of file names in the RB output sandbox failed for file names of exactly 100
characters. The problem, located in the gLite User Interface (UI), was corrected by
the EGEE developers within a day and a new release of the UI was made available
to the job robot operations team.\\

A total of 20k jobs were submitted in the past 24 hours. A large number of jobs
seemed not to report all of the site information to the
Dashboard, which resulted in a major fraction being marked as ``unknown'' in the report.
The effect needs to be understood.\\
Apart from the jobs affected by the problem mentioned above, the efficiency
in terms of successfully completed jobs is very high.

\item October 19, 2006: Robotic job submission via both the Condor-G direct
submission and the gLite RB bulk submission is activated. The job completion efficiency
remains very high for some sites. Over the course of the past day nearly 2000
jobs were completed at Caltech with only 5 failures.

\item October 20, 2006: The number of ``unknown'' jobs is decreasing following
further investigations by the robot operations team. The job completion efficiency
remains high, though the total number of submissions is lower than in the previous
days. A large number of sites running the PBS batch system have taken their
resources off the Grid because of a critical security vulnerability. Sites
applied the respective patch at short notice and were back to normal operation
within a day or two.

\item October 23, 2006: Over the weekend significant scaling issues were
encountered in the robot. Those were mainly associated with the mySQL
server holding the BOSS DB. On the gLite submission side a problem was
found with projects comprising more than 2000 jobs. A limit was
introduced, with the consequence that the same data files are accessed
more often.

\item October 24, 2006: There were again scaling problems observed in the
job robots. Switching to a central mySQL database for both robots
led to the database developing a lock state. Though the locks
automatically clear within 10 to 30 minutes, the effect has an impact on
the overall job submission rate. To resolve the issue two databases
were created, one for each robot. While the Condor-G side performs well,
the gLite robot continues to develop locks. A memory leak leading to
robot crashes was observed in CRAB/BOSS submission through gLite. The
robot operations team is working with the BOSS developers on a solution.

\item October 25, 2006: The BOSS developers have analyzed the problem
reported yesterday as a ``scaling issue'' and found that an SQL statement
issued by CRAB was incomplete, leading to long table rows being accessed
and resulting in a heavy load on the database server. The CRAB developers
made a new release available the same day and the robot operations
team found that the robots have been running fine since.

\item October 26, 2006: Following the decision to move from the analysis
of data produced with CMSSW\_1\_0\_3 to more recent data
produced with CMSSW\_1\_0\_5, a number of sites were no longer selected,
and therefore not participating, since they were still lacking the respective
datasets.

\item November 1, 2006: The submission rate reached by the job robots
is currently about 25k jobs per day. To improve scaling up to the
desired rate, 11 robots were set up and are currently submitting to OSG
and EGEE sites.

\item November 2, 2006: The total number of jobs was of the order of
21k. With more sites having datasets published in DBS/DLS that were
created with CMSSW\_1\_0\_5, the number of participating sites has increased.
Both the total application efficiency and the Grid efficiency are over 99\%.

\item November 6, 2006: The number of submitted and completed jobs is still increasing.
30k jobs have successfully passed all steps in the past 24 hours. 24
Tier-2 sites are now publishing data and accepting jobs from the robot.
The efficiency remains high.

\item November 7, 2006: The combined job robot, production and analysis submissions reached
55k jobs. The activity breakdown is shown in Figure~\ref{fig:breakdown}.

\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.7\linewidth]{figs/jobs-breakdown-1102}
\caption{Job breakdown by activity.}
\end{center}
\label{fig:breakdown}
\end{figure}

The job robot submissions by site are shown in Figure~\ref{fig:jobs-per-site}. Six out of seven Tier-1
centers are included in the job robot. As expected, the Tier-2 centers
still dominate the submissions. The addition of the Tier-1 centers has
driven the job robot submission rates past the load that can be sustained
by a single mySQL job monitor.

\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.7\linewidth]{figs/jobs-per-site-1102}
\caption{Job breakdown by site.}
\end{center}
\label{fig:jobs-per-site}
\end{figure}
\end{itemize}