\section{Pre-challenge Monte Carlo Production}

The Monte Carlo production for CSA06 started in mid-July. The original aim was
to produce a total of 50 million events to be used as input for prompt
reconstruction.

Four teams from CIEMAT, DESY/RWTH, INFN/Bari, and the University of
Wisconsin--Madison volunteered to run production using the Production
Agent (ProdAgent) for the first time at large scale. All production-related
job submissions were Grid-based; no local submissions were used. After a
short ramp-up during which sites prepared for production (most sites received
the CMSSW software via a centrally managed installation mechanism, while a few
performed the installation manually), a total of 28 sites offered resources
for the pre-production step.

Table~\ref{tab:prechallenge} lists the datasets by event category together
with the number of events requested and actually produced. All four teams
started production with the simulation of minimum-bias events.

\begin{table}[htb]
\centering
\caption{CSA06 Pre-challenge Production by event category.}
\label{tab:prechallenge}
\vspace{3mm}
\begin{tabular}{|l|l|r|r|}
\hline
CMSSW version & Dataset & Events produced ($\times 10^6$) & Events requested ($\times 10^6$) \\
\hline
0\_8\_1 & minbias & 39.8 & 25.0 \\
0\_8\_2 & TTbar & 5.8 & 5.0 \\
0\_8\_2 & Zmumu & 2.2 & 2.0 \\
0\_8\_3 & Wenu & 4.6 & 4.0 \\
0\_8\_3 & SoftMuon & 2.0 & 2.0 \\
0\_8\_3 & EWK Soup & 5.6 & 5.0 \\
0\_8\_3 & Jets & 1.2 & 1.2 \\
0\_8\_3 & Exo Soup & 1.0 & 1.0 \\
0\_8\_4 & HLT Soup & 5.0 & 5.0 \\
\hline
 & Total & 67.2 & 50.2 \\
\hline
\end{tabular}
\end{table}

The average event processing time observed on a 3.6~GHz Xeon processor and the
event size are shown in Figure~\ref{fig:minbias-processing-performance}.

\begin{figure}[htp]
\begin{center}
$\begin{array}{c@{\hspace{1in}}c}
\includegraphics[width=0.4\linewidth]{figs/Pre-prod-minbias.pdf} &
\includegraphics[width=0.4\linewidth]{figs/Pre-prod-minbias-size.pdf} \\ [-0.53cm]
\end{array}$
\end{center}
\caption{Minimum-bias event processing time (left) and event size (right);
some jobs processed 1000 events/job and others fewer than 500 events/job.}
\label{fig:minbias-processing-performance}
\end{figure}

More complex signal events, such as those in the TTbar data sample, take
significantly more time to simulate; in practice, about 4 minutes per event
were required.
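
As a rough estimate of the implied CPU cost, taking the 5.8 million produced
TTbar events from Table~\ref{tab:prechallenge} at face value:
\[
5.8\times 10^{6}\ \mathrm{events} \times 4\ \mathrm{min/event}
\approx 3.9\times 10^{5}\ \mathrm{CPU\ hours}
\approx 44\ \mathrm{CPU\ years}
\]
on a 3.6~GHz Xeon-class processor.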

Regarding the job submission strategy, the teams found that the resource usage
information published by the Grid Information System is not useful for
building the site ranking, since it lacks the job information associated with
the particular VO role. A static ranking was therefore used, built from the
resources available at the sites as discovered by the ProdAgent's Job Tracking
component.
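
The idea behind such a static ranking can be sketched as follows; the function,
site names, and slot counts are purely illustrative and do not reflect the
actual ProdAgent code or interface:
\begin{verbatim}
# Illustrative sketch only, not actual ProdAgent code: rank sites once,
# using the CPU slots observed via job tracking rather than the dynamic
# values published by the Grid Information System.
def build_static_ranking(observed_slots):
    """observed_slots maps site name -> slots seen running production jobs."""
    return sorted(observed_slots, key=observed_slots.get, reverse=True)

ranking = build_static_ranking({"site_A": 400, "site_B": 250, "site_C": 120})
# -> ['site_A', 'site_B', 'site_C']
\end{verbatim}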

With an average of up to 100 jobs/hour per agent, the job submission
performance of the ProdAgent was rather low (see
Figure~\ref{fig:submission-rate}). With the level of resources available to
the teams, it took a day or more until all CPUs could be utilized.
Given the anticipated scale of production for CSA06 and the fact that there
were four teams each running two instances of ProdAgent, this was not a
problem for the CSA06 pre-production; however, it needs to be taken into
account for future production activities and is an area that needs to be
improved. Moving to the new gLite Resource Broker with its bulk submission
feature may help to some extent, but this is certainly not the only area that
needs attention.
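
To illustrate the effect (the slot count here is an assumption chosen for
illustration, not a CSA06 measurement): at a submission rate of 100~jobs/hour,
filling 2400 CPU slots with one job per slot takes
\[
\frac{2400\ \mathrm{jobs}}{100\ \mathrm{jobs/hour}} = 24\ \mathrm{hours},
\]
consistent with the observation that a day or more passed before all CPUs
were busy.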

\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.6\linewidth]{figs/Pre-prod-submission-rate.pdf}
\end{center}
\caption{ProdAgent job submission rate.}
\label{fig:submission-rate}
\end{figure}

Rather than returning the produced data in the output sandbox, files are
staged out to the local Storage Element (SE). The performance of copying files
from the Worker Node disk to the SE, illustrated in
Figure~\ref{fig:stage-out-performance}, is very good for all the prominent SE
implementations CMS is using at sites (Castor, dCache and DPM).

\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.6\linewidth]{figs/Pre-prod-stage-out.pdf}
\end{center}
\caption{Local stage-out performance.}
\label{fig:stage-out-performance}
\end{figure}

As reported by the production teams, early production was affected by
instabilities in the JobTracking and MergeSensor components of ProdAgent and
required continuous attention by the operators. Fortunately, the problems
were solved in late August/early September.

Regarding operational problems, one that turned out to be common across almost
all participating sites concerned data access between the farm of Worker Nodes
and the local SE, i.e. stage-out and in particular the merge process. Given
the many processes running in parallel, the latter has been shown to stress
some of the deployed SEs to their limit. It is therefore important to maintain
a suitable ratio of CPU capacity to storage access bandwidth.
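
As an illustration (the numbers are assumptions, not CSA06 measurements): if
500 merge and stage-out processes run in parallel and each reads or writes at
10~MB/s, the local SE must sustain roughly
\[
500 \times 10\ \mathrm{MB/s} = 5\ \mathrm{GB/s}
\]
of aggregate throughput; adding worker nodes without scaling the SE
accordingly will push it towards this limit.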

To help operate the ProdAgent more efficiently, a team from INFN Bari
developed a monitoring tool that provides a comprehensive overview of the
current state and access to log files from a single web page. Screenshots are
shown in Figures~\ref{fig:pa-monitoring} and~\ref{fig:pa-monitoring-1}.

\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.95\linewidth]{figs/Pre-prod-PA-mon.pdf}
\end{center}
\caption{ProdAgent monitoring tool developed by INFN/Bari.}
\label{fig:pa-monitoring}
\end{figure}

\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.95\linewidth]{figs/Pre-prod-PA-mon-1.pdf}
\end{center}
\caption{Bari ProdAgent monitoring summarizing important production job
information.}
\label{fig:pa-monitoring-1}
\end{figure}