\section{Tier-1 and Tier-2 Operations}

\subsection{Data Transfers}

The Tier-1 centers were expected to receive data from CERN at a rate
corresponding to 25\% of their 2008 pledge rates and to serve the data
to the Tier-2 centers. The expected rates into the Tier-1 centers are
shown in Table~\ref{tab:tier01pledge}.

\begin{table}[htb]
\begin{tabular}{|l|l|l|}
\hline
Site & Goal Rate (MB/s) & Threshold Rate (MB/s) \\
\hline
ASGC & 15 & 7.5 \\
CNAF & 25 & 12.5 \\
FNAL & 50 & 25 \\
GridKa & 25 & 12.5 \\
IN2P3 & 25 & 12.5 \\
PIC & 10 & 5 \\
RAL & 10 & 5 \\
\hline
\end{tabular}
\caption{Expected transfer rates from CERN to Tier-1 centers based on the MOU pledges.}
\label{tab:tier01pledge}
\end{table}
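
The goal and threshold values in Table~\ref{tab:tier01pledge} follow a
simple scaling of the MOU numbers: the goal rate is 25\% of the rate
implied by a site's 2008 pledge, and the threshold for success is half
of the goal,
\[
R_{\mathrm{goal}} = 0.25 \, R_{\mathrm{pledge,2008}}, \qquad
R_{\mathrm{threshold}} = 0.5 \, R_{\mathrm{goal}} .
\]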

In the computing model the Tier-2 centers are expected to transfer data
from the Tier-1 centers in bursts. The goal rate in CSA06 was 20MB/s,
with a threshold for success of 5MB/s. Achieving these metrics was
defined as sustaining the transfer rate over a 24 hour period. At the
beginning of CSA06 CMS concentrated primarily on moving data from the
``associated'' Tier-1 centers to the Tier-2s. By the end of the
challenge most of the Tier-1 to Tier-2 permutations had been attempted.

The total data volume transferred between sites in CSA06 is shown in
Figure~\ref{fig:totaltran}. The plot includes only wide area data
transfers; in addition, data was moved onto tape at the majority of
Tier-1 centers. Over the 45 days of the challenge CMS moved more than
1 petabyte of data over the wide area.

\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.7\linewidth]{figs/CSA06_CumTran}
\caption{The cumulative data volume transferred during CSA06 in TB.}
\end{center}
\label{fig:totaltran}
\end{figure}

Timeline:
\begin{itemize}

\item October 2, 2006: The Tier-0 to Tier-1 transfers began on the
first day of the challenge. In the first few hours 6 of the 7 Tier-1
centers successfully received data. During the first week only
minimum bias events were reconstructed, and at 40Hz the total rate out
of CERN did not meet the 150MB/s target rate.

\item October 3, 2006: All 7 Tier-1 sites were able to successfully
receive data, and 8 Tier-2 centers were subscribed to data samples:
Belgium IIHE, UC San Diego, Wisconsin, Nebraska, DESY, Aachen, and
Estonia. There were successful transfers to 6 Tier-2 sites.

\item October 4, 2006: An additional 11 Tier-2 sites were subscribed
to data samples: Pisa, Purdue, CIEMAT, Caltech, Florida, Rome, Bari,
CSCS, IHEP, Belgium UCL, and Imperial College. Of the 19 registered
Tier-2 sites, 12 were able to receive data. Of those, 5 exceeded the
goal transfer rate for over an hour, and an additional 3 were over
the threshold rate.

\item October 5, 2006: Three additional Tier-2s were added, increasing
the number of participating sites above the goal of 20 Tier-2
centers. New hardware installed at IN2P3 for CSA06 began to exhibit
stability problems, leading to poor transfer efficiency.

\item October 9, 2006: RAL transitioned from a dCache SE to a CASTOR2
SE. The signal samples began being reconstructed at the Tier-0.

\item October 10-12, 2006: The Tier-1 sites had stable operations
through the week at an aggregate rate of approximately 100MB/s from
CERN.

\item October 13, 2006: Multiple subscriptions of the minimum bias
samples were made to some of the Tier-1 centers to increase the total
rate of data transfer from CERN. The number of participating Tier-2
sites increased to 23.

\item October 18, 2006: The PhEDEx transfer system held a lock in the
Oracle database which blocked other agents from continuing with
transfers. This problem appeared more frequently in the latter half
of the challenge when the load was higher.

\item October 20, 2006: The reconstruction rate was increased at the
Tier-0 to improve the output from CERN and to better exercise the
prompt reconstruction farm. The data rate from CERN approximately
doubled, and an average rate over an hour of 600MB/s from CERN was
achieved.

\item October 25, 2006: The transfer rate from CERN remained high,
with daily average rates of 250MB/s-300MB/s. The first transfer
backlogs began to appear.

\item October 30, 2006: Data reconstruction at the Tier-0 stopped.

\item October 31, 2006: PIC and ASGC finished transferring the assigned prompt reconstruction data from CERN.

\item November 2, 2006: FNAL and IN2P3 also completed the transfers.

\item November 3, 2006: RAL completed the transfers. The first of the
Tier-1 to any-Tier-2 transfer validation tests began. The test
involved sending a small sample from a Tier-1 site to a validated
Tier-2 (DESY in this case) and then sending a small sample to all
Tier-2 sites.

\item November 5, 2006: CNAF completed the Tier-0 transfers.

\item November 6, 2006: The Tier-1 to Tier-2 transfer testing continued.

\item November 9, 2006: GridKa completed the Tier-0 transfers.

\end{itemize}


\subsubsection{Transfers to Tier-1 Centers}

During CSA06 the Tier-1 centers met the transfer rate goals. In the
first week of the challenge, when only minimum bias events were being
reconstructed, the total rate out of CERN did not reach 150MB/s unless
the datasets were subscribed to multiple sites. After the
reconstruction rate was increased at the Tier-0 the transfer rate
easily exceeded the 150MB/s target. The 30 day and 15 day averages
are shown in Table~\ref{tab:tier01csa06}. For the thirty day average
all sites except one exceeded the goal rate, and for the final 15 days
all sites easily exceeded the goal. Several sites doubled or tripled
the goal rate during the final two weeks of high volume transfers.

The WLCG availability metric for Tier-1 sites this year is 90\%.
Applying this to the Tier-1 centers participating in CSA06 transfers,
6 of the 7 Tier-1s reached the availability goal.

\begin{table}[htb]
\begin{tabular}{|l|r|r|r|r|c|}
\hline
Site & Anticipated Rate (MB/s) & Last 30 Day Average (MB/s) & Last 15 Day Average (MB/s) & Outage (Days) & MSS Used \\
\hline
ASGC & 15 & 17 & 23 & 0 & (Yes) \\
CNAF & 25 & 26 & 37 & 0 & (Yes) \\
FNAL & 50 & 68 & 98 & 0 & Yes \\
GridKa & 25 & 23 & 28 & 3 & No \\
IN2P3 & 25 & 23 & 34 & 1 & Yes \\
PIC & 10 & 22 & 33 & 0 & No \\
RAL & 10 & 23 & 33 & 2 & Yes \\
\hline
\end{tabular}
\caption{Transfer rates during CSA06 between CERN and the Tier-1 centers and the number of outage days during the active challenge period. In the MSS column the parentheses indicate that the site either had scaling issues keeping up with the total rate to tape or transferred only a portion of the data to tape.}
\label{tab:tier01csa06}
\end{table}


The rate of data transferred averaged over 24 hours and the volume of
data transferred in 24 hours are shown in Figures~\ref{fig:tier01rate}
and~\ref{fig:tier01vol}. The start of the transfers during the first
week is visible on the left side of the plot, as is the failure to
reach the target rate, shown as a horizontal red bar. The twin peaks
in excess of 300MB/s and 25TB of data moved correspond to the
over-subscription of data. The bottom of the graph carries indicators
of the approximate Tier-0 reconstruction rate. Both the rate and the
volume figures clearly show the point when the Tier-0 trigger rate was
doubled to 100Hz. The daily average exceeded 350MB/s with more than
30TB moved. The hourly averages from CERN peaked at more than
650MB/s.
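
As a cross-check, the rate and volume figures are consistent with each
other: a sustained daily average of 350MB/s corresponds to
\[
350\ \mathrm{MB/s} \times 86400\ \mathrm{s/day} \approx 30\ \mathrm{TB/day},
\]
which matches the more than 30TB per day moved at the peak of the
challenge.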

\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.7\linewidth]{figs/Tier01rate}
\caption{The rate of data transferred from the Tier-0 to the Tier-1 centers in MB per second.}
\end{center}
\label{fig:tier01rate}
\end{figure}


\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.7\linewidth]{figs/Tier01vol}
\caption{The total volume of data transferred from the Tier-0 to the Tier-1 centers in TB per day.}
\end{center}
\label{fig:tier01vol}
\end{figure}

The transferable volume plot shown in Figure~\ref{fig:tier01queue} is
an indicator of how well the sites kept up with the volume of data
from the Tier-0 reconstruction farm. During the first three weeks of
the challenge almost no backlog of files was accumulated by the Tier-1
centers. A hardware failure at IN2P3 resulted in a small
accumulation. The additional data subscriptions led to a spike in
data to transfer, which was quickly cleared by the Tier-1 sites. The
most significant volumes of data waiting for transfer came at the end
of the challenge. During this time GridKa performed a dCache storage
upgrade that resulted in a large accumulation of data to transfer.
CNAF suffered a file server problem that reduced the amount of
available hardware. Additionally, RAL turned off the import system
for two days over a weekend to demonstrate the ability to recover from
a service interruption. These Tier-1 issues combined with PhEDEx
database connection interruptions under the heavy load of the final
week of transfers to produce a backlog of approximately 50TB over the
final days of the heavy challenge transfers. During this time CERN
continued to serve data at 350MB/s on average.

\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.7\linewidth]{figs/Tier01queue}
\caption{The total volume of data waiting for transfer from the Tier-0 to the Tier-1 centers in TB per day.}
\end{center}
\label{fig:tier01queue}
\end{figure}

The CERN to Tier-1 transfer quality is shown in
Figure~\ref{fig:tier01qual}. In CMS the transfer quality reflects the
number of times a transfer has to be attempted before it successfully
completes. A link between two sites with 100\% transfer quality
needed only one attempt per transfer, while a 10\% transfer quality
indicates that each transfer had to be attempted ten times to complete
successfully. Most transfers eventually complete, but low transfer
quality uses the transfer resources inefficiently and usually results
in low utilization of the network.
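
Written as a formula consistent with the examples above, the quality
$Q$ of a link over a given period is the fraction of transfer attempts
that succeed,
\[
Q = \frac{N_{\mathrm{successful}}}{N_{\mathrm{attempted}}},
\]
so that an average of ten attempts per completed transfer corresponds
to $Q = 10\%$.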

\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.7\linewidth]{figs/qualt0t1}
\caption{Transfer quality between CERN and Tier-1 centers over 30 days.}
\end{center}
\label{fig:tier01qual}
\end{figure}

The transfer quality plot compares very favorably to equivalent plots
made during the spring. The CERN CASTOR2 storage element performed
very stably throughout the challenge. There were two small
configuration issues that were promptly addressed by the experts. The
Tier-1s also performed well throughout the challenge, with several 24
hour periods of error-free transfers to specific Tier-1s. The
stability of the RAL SE before the transition to CASTOR2 can be seen
at the left side of the plot, as well as the intentional downtime to
demonstrate recovery on the right side of the plot. The IN2P3
hardware problems are visible during the first week and the GridKa
dCache upgrade is clearly visible during the last week. Most of the
other periods are solidly green. Both FNAL and PIC were above 70\%
efficiency for every day of the challenge activities.

Tier-1 to Tier-1 transfers were considered to be beyond the scope of
CSA06, though the dataflow exists in the CMS computing model. During
CSA06 we had an opportunity to test Tier-1 to Tier-1 transfers while
recovering from backlogs of data when samples were subscribed to
multiple sites. PhEDEx is designed to fetch data from whichever
source site it can be transferred from most efficiently.
Figure~\ref{fig:t1t1} shows the total Tier-1 to Tier-1 transfers
during CSA06. With 7 Tier-1s there are 42 possible Tier-1 to Tier-1
links, counting each direction separately. During CSA06 we
successfully exercised about half of them.
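
The number of directed links follows directly from the site count:
each of the 7 Tier-1 centers can send to any of the 6 others, giving
\[
7 \times 6 = 42
\]
distinct Tier-1 to Tier-1 combinations.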

\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.7\linewidth]{figs/T1T1Rate}
\caption{Transfer rate between Tier-1 centers during CSA06.}
\end{center}
\label{fig:t1t1}
\end{figure}

\subsubsection{Transfers to Tier-2 Centers}
In the CMS computing model the Tier-2s are expected to be able to
receive data from any Tier-1 site. In order to simplify CSA06
operations we began by concentrating on transfers from the
``associated'' Tier-1 sites, and in the final two weeks of the
challenge began a concerted effort on transfers from any Tier-1. The
associated Tier-1 center is the center operating the File Transfer
Service (FTS) server and hosting the channels for Tier-2 transfers.

The Tier-2 transfer metrics involved both participation and
performance. For CSA06, 27 sites signed up to participate in the
challenge. Participation was defined as having successful transfers
on 80\% of the days during the challenge. By this metric 21 sites
succeeded in participating in the challenge, which is above the goal
of 20.

The Tier-2 transfer performance goal was 20MB/s and the threshold was
5MB/s. In the CMS computing model the Tier-2 transfers are expected
to occur in bursts: data will be transferred to refresh a Tier-2
cache, and then will be analyzed locally. The Tier-2 sites were
therefore not expected to hit the goal transfer rates continuously
throughout the challenge. There were 12 sites that averaged above the
goal rate for at least one 24 hour period, and an additional 8 sites
that averaged above the threshold rate for at least one 24 hour
period.
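
The participation and performance metrics above lend themselves to a
simple per-site evaluation of the daily average transfer rates. The
following Python sketch is purely illustrative: the function and the
example input are hypothetical stand-ins, not the actual CSA06
accounting tools, while the 20MB/s and 5MB/s values are the CSA06 goal
and threshold.

\begin{verbatim}
# Illustrative sketch: evaluate the Tier-2 transfer metrics from a
# per-site list of daily average rates in MB/s.  A day with a non-zero
# rate is counted as a day with successful transfers.

GOAL_MBS = 20.0       # 24-hour average goal
THRESHOLD_MBS = 5.0   # 24-hour average threshold for success

def evaluate_site(daily_rates_mbs, challenge_days):
    """Return (participated, best_day_result) for one Tier-2 site."""
    days_with_transfers = sum(1 for r in daily_rates_mbs if r > 0)
    participated = days_with_transfers >= 0.8 * challenge_days

    best = max(daily_rates_mbs, default=0.0)
    if best >= GOAL_MBS:
        best_day_result = "goal met"
    elif best >= THRESHOLD_MBS:
        best_day_result = "threshold met"
    else:
        best_day_result = "below threshold"
    return participated, best_day_result

# Hypothetical site: active on 40 of 45 days, best 24-hour average 22 MB/s.
print(evaluate_site([0.0] * 5 + [8.0] * 39 + [22.0], challenge_days=45))
\end{verbatim}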

The transfer rate over the 30 most active transfer days is shown in
Figure~\ref{fig:tier12rate}. The aggregate rate from Tier-1 to Tier-2
centers was not as high as the total rate from CERN, which does not
accurately reflect the transfers expected in the CMS computing model.
In the computing model more data is exported from the Tier-1s to the
Tier-2s than total raw data coming from CERN, because data is sent to
multiple Tier-2s and the Tier-2s may flush data from the cache and
reload it at a later time. In CSA06 the Tier-2 centers were
subscribed to specific samples at the beginning and then to specific
skims when available.

\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.7\linewidth]{figs/tier12rate}
\caption{Transfer rate between Tier-1 and Tier-2 centers during the first 30 days of CSA06.}
\end{center}
\label{fig:tier12rate}
\end{figure}

The ability of the Tier-1 centers to export data was successfully
demonstrated during the challenge, but several sites reported
interference between receiving and exporting data. The quality of the
Tier-1 to Tier-2 data transfers is shown in
Figure~\ref{fig:tier12qual}. The quality is not nearly as
consistently green as in the CERN to Tier-1 plots, but the variation
has a number of causes: not all of the Tier-1 centers currently export
data as efficiently as CERN, especially in the presence of a high load
of data ingests, and most of the Tier-2 sites do not have as much
operational experience receiving data as the Tier-1 sites do.

The Tier-1 to Tier-2 transfer quality looks very similar to the CERN
to Tier-1 transfer quality of 9-12 months ago. With a concerted
effort the Tier-1 to Tier-2 transfers should be able to reach the
quality of the current CERN to Tier-1 transfers before they are needed
to move large quantities of experiment data to users.

\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.7\linewidth]{figs/tier12qual}
\caption{Transfer quality between Tier-1 and Tier-2 centers during the first 30 days of CSA06.}
\end{center}
\label{fig:tier12qual}
\end{figure}

There are a number of very positive examples of Tier-1 to Tier-2
transfers. Figure~\ref{fig:picqual} shows the results of the Tier-1
to all-Tier-2 tests when PIC was the source of the dataset. A small
skim sample was chosen, and within 24 hours 20 sites had successfully
received the dataset. The transfer quality over the 24 hour period
remained high, with successful transfers to all four continents
participating in CMS.

\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.7\linewidth]{figs/PICQual}
\caption{Transfer quality between PIC and Tier-2 sites participating in the dedicated Tier-1 to Tier-2 transfer tests.}
\end{center}
\label{fig:picqual}
\end{figure}

Figure~\ref{fig:fnalrate} is an example of the very high export rates
the Tier-1 centers were able to achieve transferring data to Tier-2
centers. The peak rate on the plot is over 5Gb/s, which was
independently verified by the site network monitoring. This rate is
over 50\% of the Tier-1 data export rate anticipated in the full-sized
system.
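
For reference, the 5Gb/s network peak corresponds to roughly
\[
\frac{5 \times 1000\ \mathrm{Mb/s}}{8\ \mathrm{b/B}} \approx 625\ \mathrm{MB/s},
\]
more than ten times the 50MB/s goal rate for FNAL imports from CERN in
Table~\ref{tab:tier01pledge}.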

\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.7\linewidth]{figs/FNAL_Rate}
\caption{Transfer performance between FNAL and Tier-2 sites participating in the dedicated Tier-1 to Tier-2 transfer tests.}
\end{center}
\label{fig:fnalrate}
\end{figure}

Figure~\ref{fig:FZK_DESY} is an example of the very high rates observed in CSA06 for both Tier-1 export and Tier-2 import. The plot shows both the hourly average and the instantaneous rate. DESY achieved an import rate to disk of more than 400MB/s.

\begin{figure}[ht]
\begin{center}
$\begin{array}{c@{\hspace{1in}}c}
\includegraphics[width=0.50\linewidth]{figs/FZK_DESY_1} &
\includegraphics[width=0.45\linewidth]{figs/FZK_DESY_2} \\ [-0.53cm]
\end{array}$
\end{center}
\caption{The plot on the left is the hourly average transfer rate between GridKa and DESY. The plot on the right is the instantaneous rate between the two sites measured with Ganglia.}
\label{fig:FZK_DESY}
\end{figure}

\subsection{Tier-1 Skim Job Production}
\subsection{Tier-1 Re-Reconstruction}
\subsubsection{Baseline Approach}
\subsubsection{Two-Step Approach}

\subsection{Job Execution at Tier-1 and Tier-2}
\subsubsection{Job Robot}
The CSA06 processing metrics foresaw that sites offering computing
capacity to CMS and participating in CSA06 would complete an aggregate
of 50k jobs per day. The goal was to exercise the job submission
infrastructure and to monitor the input/output rates.

\begin{itemize}
\item About 10k jobs per day were intended as skimming and
reconstruction jobs at the Tier-1 centers
\item About 40k jobs per day were expected to be a combination of user
submitted analysis jobs and robot submitted analysis-like jobs
\end{itemize}

The job robots are automated expert systems that simulate user
analysis tasks using the CMS Remote Analysis Builder (CRAB). They
therefore provide a reasonable method of generating load on the system
by running analysis on all data samples at all sites individually.
They are built from a component/agent based structure which enables
parallel execution. Job distribution to CMS compute resources is
accomplished using Condor-G direct submission on the OSG sites and
gLite bulk submission on the EGEE sites.\\

The job preparation phase comprises four distinct steps:
\begin{itemize}
\item job creation
\begin{itemize}
\item data discovery using DBS/DLS
\item job splitting according to user requirements
\item preparation of job dependent files (incl. the jdl)
\end{itemize}
\item job submission
\begin{itemize}
\item check whether there are any compatible resources known to the submission system in the Grid Information System
\item submit the job to the Grid submission component (Resource Broker or Condor-G) through the CMS bookkeeping component (BOSS)
\end{itemize}
\item job status check
\item job output retrieval
\begin{itemize}
\item retrieve the job output from the sandbox located on the Resource Broker (EGEE sites) or the common filesystem (OSG sites)
\end{itemize}
\end{itemize}

The job robot executes all four steps of the workflow described above
on a large scale; a simplified sketch of one robot cycle is shown
below.\\
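
The following Python fragment is a purely illustrative sketch of such
a robot cycle, with the four steps marked in the comments. All class,
function and dataset names are hypothetical stand-ins; in the real
robots job creation and bookkeeping are handled by CRAB and BOSS, and
submission goes through Condor-G (OSG) or the gLite RB (EGEE).

\begin{verbatim}
class FakeGridSubmitter:
    """Stand-in for the Condor-G / gLite submission layer."""
    def __init__(self):
        self._jobs = {}

    def submit(self, jdl):
        job_id = len(self._jobs)
        self._jobs[job_id] = "Done"       # pretend every job succeeds
        return job_id

    def status(self, job_id):
        return self._jobs[job_id]

    def get_output(self, job_id):
        return "output-sandbox-%d.tgz" % job_id


def robot_cycle(dataset, n_events, events_per_job, submitter):
    # 1. job creation: data discovery (DBS/DLS in the real robot),
    #    job splitting and preparation of the jdl for each job
    n_jobs = (n_events + events_per_job - 1) // events_per_job
    jdls = ["dataset=%s job=%d" % (dataset, i) for i in range(n_jobs)]

    # 2. job submission through the bookkeeping layer
    job_ids = [submitter.submit(jdl) for jdl in jdls]

    # 3. job status check
    done = [j for j in job_ids if submitter.status(j) == "Done"]

    # 4. job output retrieval from the sandbox or shared filesystem
    return [submitter.get_output(j) for j in done]


# One cycle over a hypothetical 10k-event sample split into 1k-event jobs.
outputs = robot_cycle("/MinBias/Hypothetical/RECO", 10000, 1000,
                      FakeGridSubmitter())
print(len(outputs))   # -> 10
\end{verbatim}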

Apart from job submission, the monitoring of the job execution over
the entire chain of steps involved plays an important role. CMS has
chosen to use a product called Dashboard, a development that is part
of the CMS Integration Program. It is a joint effort of LCG's ARDA
project and the MonAlisa team in close collaboration with the CMS
developers working on job submission tools for production and
analysis. The objective of the Dashboard is to provide a complete
view of the CMS activity independently of the Grid flavour (i.e. OSG
vs. EGEE). The Dashboard maintains and displays the quantitative
characteristics of the usage pattern by including CMS-specific
information, and it reports problems of various kinds.\\

The monitoring information used in CSA06 is available via a web
interface and includes the following categories:
\begin{itemize}
\item Quantities -- how many jobs are running, pending, successfully
completed, or failed, per user, per site, per input data collection,
and the distribution of these quantities over time
\item Usage of the resources (CPU, memory consumption, I/O rates), and
its distribution over time with aggregation on different levels
\item Distribution of resources between different application areas
(e.g. analysis vs. production), different analysis groups and
individual users
\item Grid behaviour -- success rate and failure reasons as a function
of time, site and data collection
\item CMS application behaviour
\item Distribution of data samples over sites and analysis groups
\end{itemize}
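
As an illustration of the first and fourth categories, the kind of
aggregation the Dashboard performs can be sketched in a few lines of
Python. The record format and numbers below are hypothetical, not the
Dashboard's actual schema or data.

\begin{verbatim}
from collections import Counter

# Hypothetical job records of the sort a monitoring system collects.
jobs = [
    {"site": "T2_DESY", "state": "succeeded"},
    {"site": "T2_DESY", "state": "failed"},
    {"site": "T2_DESY", "state": "succeeded"},
    {"site": "T1_FNAL", "state": "succeeded"},
]

# Quantities: number of jobs per (site, state).
counts = Counter((j["site"], j["state"]) for j in jobs)

# Grid behaviour: per-site success rate.
for site in sorted({j["site"] for j in jobs}):
    total = sum(n for (s, st), n in counts.items() if s == site)
    ok = counts[(site, "succeeded")]
    print("%s: %d jobs, %.0f%% successful" % (site, total, 100.0 * ok / total))
\end{verbatim}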

Timeline:
\begin{itemize}
\item October 15, 2006: The job robots started analysis submissions.
10k jobs were submitted by two robot instances, with 90\% of them
going to OSG sites using Condor-G direct submission and 10\% going
through the traditional LCG Resource Broker (RB) to EGEE sites. In
preparation for moving to the gLite RB, and thereby improving the
submission rate to EGEE sites, bulk submission was integrated into
CRAB and was being tested.

\item October 17, 2006: Job robot submissions continued at a larger
scale. An issue was found with the bulk submission feature used at
EGEE sites which left jobs hanging indefinitely. The cause was that
the parsing of file names in the RB output sandbox failed for file
names of exactly 100 characters. The problem, located in the gLite
User Interface (UI), was corrected by the EGEE developers within a day
and a new release of the UI was made available to the job robot
operations team.\\

A total of 20k jobs were submitted in the past 24 hours. A large
number of jobs did not report all of the site information to the
Dashboard, which resulted in a major fraction being marked as
``unknown'' in the report. The effect still needed to be
understood.\\
Apart from the jobs affected by this problem, the efficiency in terms
of successfully completed jobs was very high.

\item October 19, 2006: Robotic job submission via both Condor-G
direct submission and gLite RB bulk submission was active. The job
completion efficiency remained very high for some sites; over the
course of the past day nearly 2000 jobs were completed at Caltech with
only 5 failures.

\item October 20, 2006: The number of ``unknown'' jobs decreased
following further investigation by the robot operations team. The job
completion efficiency remained high, though the total number of
submissions was lower than in the previous days. A large number of
sites running the PBS batch system took their resources off the Grid
because of a critical security vulnerability. Sites applied the
corresponding patch at short notice and were back to normal operation
within a day or two.

\item October 23, 2006: Over the weekend significant scaling issues
were encountered in the robots. These were mainly associated with the
mySQL server holding the BOSS DB. On the gLite submission side a
problem was found with projects comprising more than 2000 jobs. A
limit was introduced, with the consequence that the same data files
were accessed more often.

\item October 24, 2006: Scaling problems were again observed in the
job robots. Switching to a central mySQL database for both robots led
to the database developing a lock state. Though the locks cleared
automatically within 10 to 30 minutes, the effect had an impact on the
overall job submission rate. To resolve the issue two databases were
created, one for each robot. While the Condor-G side performed well,
the gLite robot continued to develop locks. A memory leak leading to
robot crashes was observed in CRAB/BOSS submission through gLite. The
robot operations team worked with the BOSS developers on a solution.

\item October 25, 2006: The BOSS developers analyzed the problem
reported the previous day as a ``scaling issue'' and found that an SQL
statement issued by CRAB was incomplete, leading to long table rows
being accessed and a heavy load on the database server. The CRAB
developers made a new release available the same day, and the robots
ran without problems from then on.

\item October 26, 2006: Following the decision to move from analyzing
data produced with CMSSW\_1\_0\_3 to more recent data produced with
CMSSW\_1\_0\_5, many sites were not selected and therefore did not
participate, since they still lacked the corresponding datasets.

\item November 1, 2006: The submission rate reached by the job robots
was about 25k jobs per day. To scale up to the desired rate, 11
robots were set up and were submitting to OSG and EGEE sites.

\item November 2, 2006: The total number of jobs was on the order of
21k. With more sites having datasets published in DBS/DLS that were
created with CMSSW\_1\_0\_5, the number of participating sites
increased. Both the total application efficiency and the Grid
efficiency were over 99\%.

\item November 6, 2006: The number of submitted and completed jobs
continued to increase: 30k jobs successfully passed all steps in the
past 24 hours. 24 Tier-2 sites were now publishing data and accepting
jobs from the robots. The efficiency remained high.

\item November 7, 2006: The combined job robot, production and
analysis submissions reached 55k. The activity breakdown is shown in
Figure~\ref{fig:breakdown}.

\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.7\linewidth]{figs/jobs-breakdown-1102}
\caption{Job breakdown by activity.}
\end{center}
\label{fig:breakdown}
\end{figure}

The job robot submissions by site are shown in
Figure~\ref{fig:jobs-per-site}. Six out of seven Tier-1 centers were
included in the job robot. As expected, the Tier-2 centers still
dominated the submissions. The addition of the Tier-1 centers drove
the job robot submission rates past the load that could be sustained
by a single mySQL job monitor.

\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.7\linewidth]{figs/jobs-per-site-1102}
\caption{Job breakdown by site.}
\end{center}
\label{fig:jobs-per-site}
\end{figure}
\end{itemize}