\section{Conclusions and Lessons Learned}

As a complete exercise, CSA06 was extremely successful. The technical
metrics were all met, and some were exceeded by large factors. While
there is still considerable work to do, especially in the integration
with data acquisition and on-line computing, the intended
functionality was demonstrated in the challenge. With this success
come valuable lessons for CMS as we transition to operations and
stable running.

\subsection{General}

There are a number of general lessons CMS can take away from the
challenge. The first is in the area of the transition to operations.
CMS needs development work to ease the operations load. CSA06 was
very successful, but it required a higher level of effort and
attention than could reasonably be expended for an experiment running
for years. As CMS transitions from development, through successful
scale demonstrations, to stable operations, it is expected that
development activities will be identified to reduce the operational
load. Several specific areas were identified during the challenge and
are listed in the sections below. More fine-grained operator control
was identified for several elements, while more automation was
identified for others.

Another general lesson was that the strong engagement with the
Worldwide LHC Computing Grid (WLCG) and with the computing sites
themselves was extremely useful. A few problems were encountered with
grid services, and they were addressed very promptly. Sites generally
responded to problems and solved them. An item related to the lesson
about operations is the communication of problems that needed to be
addressed. Sites and services were generally repaired promptly once
problems were identified, but frequently it was attentive operators,
and not automated systems, that saw the problem first.

Scale testing continues to be an extremely important activity.
Initial scaling issues were identified and solved in several CSA06
components. Most of the problems identified were related to
components that were relatively new, or that were used at a new scale
without being thoroughly tested. CMS was fortunate that all of the
scaling problems seen in CSA06 were straightforward to solve and were
fixed promptly. CMS needs to achieve nearly a factor of four in scale
for some of the components before high-energy running in 2008, and the
sooner scaling issues are identified the more time will be available
to solve them.

\subsection{Offline Software}
There were three important lessons from the offline software experience
of CSA06. The first is that the software, in the configuration used,
was able to sustain greater than a 25\% load for the prompt
reconstruction activity. The software error rate, performance, and
memory footprint were well within expectations, contributing to smooth
running of the Tier-0 farm. The performance was somewhat faster than
the time budget allowed, but several of the slower and less stable
reconstruction algorithms were intentionally left out of the prompt
reconstruction workflow. As the reconstruction software evolves, the
performance and stability should be watched.

The second lesson was that the ability to promptly create and
distribute a new software release was invaluable. During CSA06
operations CMS released four versions of the software to address
issues that were encountered and to add functionality. The ability to
make a release promptly, and to install it equally promptly at the
remote sites, was extremely helpful in meeting challenge goals.

The last lesson is that CMS needs a more formal validation process and
checklist for the application before a release is tagged. A problem
in the re-reconstruction application was identified in the final two
weeks of the challenge. With a more rigorous validation procedure the
problem might have been seen in the opening days of the challenge,
giving more time to solve it. While it is impossible to test for all
possible conditions, an agreed-upon checklist of validation tests
would be useful.

\subsection{Production and Grid Tools}

\subsubsection{Organized Processing}
There are a number of lessons to take away from the experience with
production and grid tools. The first is that the Prod\_Agent worked
well in the pre-challenge production. CMS was able to meet the very
ambitious goal of 25M events per month of simulated event
production. The one-pass production chain contributed to the high
efficiency of the production application during July and August. The
production was performed by four teams, which represents a decrease in
operations effort over previous production exercises. CMS will need
to maintain this efficiency and flexibility as the simulation
becomes more complicated for the physics validation.

The Prod\_Agent infrastructure also worked well for accessing existing
data and applying the user selections. The teams operating the agents
were able to apply multiple selections simultaneously. The merging
and data registration components worked well and could be reused from
the simulated event production workflow.

Even though these exercises were successful, there is clearly room for
improvement. CMS needs to continue to improve the automation of the
workflows for re-reconstruction, selection, and skimming of events.
The chain from request, to validation, to scheduling, to large-scale
execution has components that involve people. The human interactions
can be reduced and more automated workflows can be implemented. In
the CSA06 workflows the work assignments and output destinations were
conveyed by e-mail. During CSA06 the production teams were
responsible for combining the skims, testing the configurations, and
executing the skims.

In addition to improving the automation for schedulable items like
skimming, we also need to improve the transparency. In general, users
and groups need a consistent entry point to see the status of their
requests and the location of the output. The request system, the data
transfer system, and the dataset bookkeeping system need to be tied
together to provide consistent end-to-end user views.

The re-reconstruction activity was, by design, a demonstration of
functionality and not a demonstration of the final production
workflow. There is work left to do to ensure that every event is
re-reconstructed and that processing failures are tracked and
addressed.

\subsubsection{User Analysis Workflows}

The largest success for the analysis workflows was the demonstration
that the gLite and Condor-G job submission systems can achieve the
goal of 50k jobs per day. Integration and scale testing continue to
be very important. CMS integrated CRAB with the gLite bulk submission
only shortly before the challenge began. There had been testing of
the underlying infrastructure through the CMS WLCG Integration task
force, but no scale testing with the CMS submission system. The two
problems encountered in achieving scale were both related to the CMS
implementation and not to the underlying infrastructure, and both
issues were promptly addressed by the CMS developers. As the number
of people participating and the number of jobs increase, the
importance of scale testing will only grow.

In order to reach the target submission rate, CMS needed to make heavy
use of load-generating ``job robots''. While the robots generate
workflows that closely resemble user analysis jobs, they are not a
substitute for an active user community for testing. For the next
series of challenges CMS should ensure that a larger number of
individuals perform analysis.
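
The aggregate pacing behind the robots is modest: 50k jobs per day
corresponds to roughly 0.6 submissions per second on average. The
sketch below is a minimal illustration of how a rate-paced load
generator of this kind can be structured; the \texttt{submit\_workflow}
callable and the dataset names are hypothetical placeholders, not part
of CRAB or of the actual job robot code.

\begin{verbatim}
import random
import time

# Hypothetical stand-in for a real submission call (e.g. one wrapping
# CRAB); not an actual CRAB or job-robot API.
def submit_workflow(dataset, n_events):
    print(f"submitting {n_events} events from {dataset}")

TARGET_JOBS_PER_DAY = 50_000
INTERVAL = 86_400.0 / TARGET_JOBS_PER_DAY   # ~1.7 s between submissions

# Illustrative dataset names only.
DATASETS = ["/CSA06-minbias/RECO", "/CSA06-ttbar/RECO", "/CSA06-zmumu/RECO"]

def run_robot(duration_s=3600):
    """Submit analysis-like jobs at a steady rate for duration_s seconds."""
    deadline = time.time() + duration_s
    while time.time() < deadline:
        started = time.time()
        submit_workflow(random.choice(DATASETS), n_events=1000)
        # Sleep off the remainder of the pacing interval, if any.
        time.sleep(max(0.0, INTERVAL - (time.time() - started)))

if __name__ == "__main__":
    run_robot(duration_s=60)
\end{verbatim}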

Though they comprised only about 10\% of the total job submissions, the
user-driven analyses in the challenge were successful, with CRAB
functioning well on both EGEE and OSG sites. This document highlights
some of the types of analysis that were successfully completed.
Nevertheless, there are a number of lessons. The first is that CMS
needs to improve the user support model. Currently user support is
provided by a mailing list in a community support model that works
well for the size of the community currently being supported. It is
not clear whether this informal support will scale to the larger
collaboration, and it is possible for requests to fall through the
cracks. CMS should look at hybrid support models that assign and
track tickets while ensuring that a large enough community of people
sees the support requests to continue to provide a broad base of
supporters.

\subsection{Offline Database and Frontier}

The offline database infrastructure was successful in the challenge.
The calibration data could be distributed to remote locations from a
single database instance at CERN using the Frontier infrastructure.
The initial attempt in the Tier-0 workflow identified scaling
limitations in the CMS web cache configuration for Frontier and
stability issues in the application code. Both of these were promptly
addressed, but they underscore the need for validation and scale
testing.

The other offline database lesson is related to the way CMS stores
calibration constants in the database and to the frequency with which
they are invalidated. Currently CMS stores the calibration
information as a large number of small objects, which are treated as
independent queries by the offline database and which are invalidated
daily. The first application of the day can expect to spend almost an
hour updating the database information in the offline cache, which is
not reasonable in the long term.
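
The scale of the problem follows from simple arithmetic: if each small
object is an independent query, the first cold-cache access of the day
pays one round trip per object. The sketch below is a rough model with
assumed numbers (the object count and per-query latency are
illustrative, not CSA06 measurements); it shows why grouping many small
constants into fewer, larger payloads would change the refresh time
from hours to seconds.

\begin{verbatim}
# Back-of-the-envelope model of the daily cache refresh cost.  The
# object count and per-query latency are illustrative assumptions,
# not measured CSA06 numbers.

N_OBJECTS = 50_000          # calibration payloads stored as separate objects
LATENCY_S = 0.07            # assumed round trip per uncached Frontier query

refresh_separate = N_OBJECTS * LATENCY_S           # one query per object
refresh_grouped = (N_OBJECTS / 1000) * LATENCY_S   # objects grouped 1000:1

print(f"separate objects: {refresh_separate / 3600:.1f} hours")  # ~1.0 hours
print(f"grouped payloads: {refresh_grouped:.1f} seconds")        # ~3.5 seconds
\end{verbatim}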

\subsection{Data Management}

The CMS data management solution, relying on central components for
data bookkeeping, data location, and data transfer management, and on
site-local components for file name resolution, worked well and
reduced the effort required by the site operators. The changes in the
CMS event data model significantly simplified access to the data by
analysis applications.

The general lesson from data management is that CMS needs to ensure
that all the data management components have a consistent picture of
the data, and the synchronization of the various views needs to be
better automated. CMS keeps data management information in the dataset
bookkeeping system (DBS), the data transfer system (PhEDEx), and the
dataset location service, and these components were able to fall out
of sync with one another. Maintaining consistency currently involves
some manual operations.
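
One direction for the automation mentioned above is a periodic
reconciliation pass that compares the view held by each service. The
sketch below is a minimal illustration of that idea; the three
\texttt{list\_blocks\_*} functions and the block names are hypothetical
stand-ins for queries against DBS, PhEDEx, and the location service,
not real client APIs.

\begin{verbatim}
# Minimal reconciliation sketch: compare the set of data blocks each
# service believes is resident at a site.  The query functions below
# are hypothetical placeholders returning canned, invented block names.

def list_blocks_dbs(site):
    return {"/CSA06-minbias/RECO#1", "/CSA06-minbias/RECO#2"}

def list_blocks_phedex(site):
    return {"/CSA06-minbias/RECO#1", "/CSA06-minbias/RECO#2",
            "/CSA06-ttbar/RECO#1"}

def list_blocks_location_service(site):
    return {"/CSA06-minbias/RECO#1"}

def reconcile(site):
    views = {
        "DBS": list_blocks_dbs(site),
        "PhEDEx": list_blocks_phedex(site),
        "DLS": list_blocks_location_service(site),
    }
    everywhere = set.intersection(*views.values())
    for name, blocks in views.items():
        extra = blocks - everywhere
        if extra:
            # Flag blocks known to this service but not to all others.
            print(f"{site}: {name} has blocks not seen everywhere:",
                  sorted(extra))

reconcile("T1_Example")
\end{verbatim}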

A specific element that was identified in CSA06 was the need to
examine the DBS performance in the presence of output merging. The
initial performance estimates did not include this use-case, which
introduces a heavy load on the DBS. For many output streams, the
performance of the bookkeeping system limited the rate at which data
selections could be prepared. The performance limitation is being
addressed in the next generation of the DBS.

Data publication and the trivial file catalog resolution of logical to
physical file names both worked well. The trivial file catalog scaled
well, and applications were able to consistently discover data file
locations with minimal additional services required at the sites.
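
The trivial file catalog is essentially a small set of per-site rewrite
rules applied to the logical file name, so no catalog lookup service is
needed at run time. The sketch below illustrates the idea with invented
rules and storage paths; it is not the actual CMS site configuration
syntax.

\begin{verbatim}
import re

# Illustrative rewrite rules in the spirit of a trivial file catalog:
# match a logical file name (LFN) pattern and rewrite it to a physical
# file name (PFN).  The patterns and prefixes are invented examples,
# not a real site's configuration.
LFN_TO_PFN_RULES = [
    (r"^/store/(.*)",
     r"dcap://dcache.example.org:22125/pnfs/example.org/cms/store/\1"),
    (r"^(.*)", r"file:/data/cms\1"),   # fallback rule
]

def lfn_to_pfn(lfn):
    """Apply the first matching rule to turn an LFN into a PFN."""
    for pattern, replacement in LFN_TO_PFN_RULES:
        if re.match(pattern, lfn):
            return re.sub(pattern, replacement, lfn)
    raise ValueError(f"no rule matches {lfn}")

print(lfn_to_pfn("/store/CSA06/minbias/RECO/file001.root"))
\end{verbatim}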

\subsection{Workflow Management}

Workflow management components both at CERN and at the remote centers
were able to achieve the required level of activity in the challenge.
There is some overlap in the implementation of the Tier-0 workflow and
the Prod\_Agent workflow used at the Tier-1 and Tier-2 centers, which
should be re-examined after the challenge with an eye toward long-term
maintainability and support.

\subsection{Central Services}

Central services and facilities provided at CERN by IT and the WLCG,
including the batch resources and FTS, were carefully monitored and
problems were solved. CASTOR support at CERN was excellent. As an
export system, CASTOR2 performed at a higher rate and more stably than
in past CMS exercises. CMS ran into an issue with the SRM release in
DPM for files greater than 2~GB, which was solved the next day.

\subsection{Tier-0}

The Tier-0 workflow and dataflow management tools performed better than
required for CSA06, showing no significant problems throughout the
challenge. The flexibility of the message-based architecture allowed
the running system to be adapted to changing operational conditions as
the challenge progressed, without any interruption of service.

No inherent scaling problems were found, and the key Tier-0 components
(hardware, software, and people) were far from being stressed during
the challenge. The system achieved the low-latency response required
for real data-taking.

Most of the full range of complexity of the final system was explored
during the challenge; other aspects had already been explored with the
``July prototype''. The design of the Tier-0 can therefore be deemed
validated.

Operationally, the Tier-0 can already be installed, configured, and run
by non-experts. The internal Tier-0 goal of exploring operations during
CSA06 has therefore also been met.

\subsection{Tier-1}

Six of the seven Tier-1 centers met the complete goals for full
participation in the challenge, with successful transfers on 90\% of
the days. The transfer quality, defined in CMS as the number of times
a transfer had to be attempted before succeeding, was significantly
improved for CERN to Tier-1 transfers during the challenge as compared
to previous service challenge exercises. There are several elements to
improve in the final year of experiment preparation. Several Tier-1
centers had problems importing and exporting data simultaneously:
these centers either experienced unstable data export or limited
performance. The majority of Tier-1 sites demonstrated successful
migration of data to tape, but there is substantial work left to
demonstrate that CMS can write the full data rate to tape at the
Tier-1 centers and serve the data to all Tier-2 centers when requested.

A specific technical item was identified: the FTS timeouts were too
tight for sites with low access bandwidth and high latency once CMS
moved to files larger than 4~GB. In the final experiment the raw data
files should be between 5~GB and 10~GB, so CMS will need to revisit
the transfer timeouts again.
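
A simple estimate shows why fixed timeouts interact badly with large
files; the bandwidth figures below are illustrative assumptions, not
CSA06 measurements:
\[
  t_{\mathrm{transfer}} \simeq
  \frac{\mbox{file size}}{\mbox{effective bandwidth}}, \qquad
  \frac{10\,\mathrm{GB}}{50\,\mathrm{MB/s}} \simeq 200\,\mathrm{s}, \qquad
  \frac{10\,\mathrm{GB}}{5\,\mathrm{MB/s}} \simeq 2000\,\mathrm{s}.
\]
A per-file timeout that is adequate for smaller files on well-connected
sites can therefore expire before a 5--10~GB transfer to a site with
low access bandwidth and high latency has a chance to complete.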

\subsection{Tier-2}

The number of Tier-2 centers participating in the challenge was larger
than the original goal, and a broader variety of activities was
successfully performed by the Tier-2 centers. An item to improve is
the amount of effort required to make Tier-2 transfers work. Some
sites accepted data only from particular sites. Early in the
challenge, PhEDEx dynamic routing led to unpredictable Tier-1-to-Tier-2
paths through intermediate Tier-1 centers. The PhEDEx operations team
therefore modified the path cost metrics in PhEDEx to avoid multi-hop
transfers and make the routes more static and prescribed, which makes
the transfers look more like the baseline computing model.
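
The change can be pictured as a shortest-path choice over per-link
costs: raising the cost of Tier-1-to-Tier-1 hops makes the direct
Tier-1-to-Tier-2 link the cheapest route. The sketch below is a
simplified illustration of that idea with invented site names and
costs; it is not PhEDEx code or its actual cost function.

\begin{verbatim}
import heapq

def cheapest_path(links, src, dst):
    """Dijkstra over per-link costs; returns (total_cost, path)."""
    queue = [(0.0, src, [src])]
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nxt, link_cost in links.get(node, []):
            if nxt not in seen:
                heapq.heappush(queue, (cost + link_cost, nxt, path + [nxt]))
    return float("inf"), []

# Invented topology: with comparable costs, the cheapest route to the
# Tier-2 hops through a second Tier-1.
links = {
    "T1_A": [("T1_B", 1.0), ("T2_X", 3.0)],
    "T1_B": [("T2_X", 1.0)],
}
print(cheapest_path(links, "T1_A", "T2_X"))  # (2.0, ['T1_A', 'T1_B', 'T2_X'])

# Penalizing inter-Tier-1 hops makes the direct, prescribed route win.
links["T1_A"] = [("T1_B", 10.0), ("T2_X", 3.0)]
print(cheapest_path(links, "T1_A", "T2_X"))  # (3.0, ['T1_A', 'T2_X'])
\end{verbatim}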

The poor transfer quality seen on the PhEDEx monitoring plots is not
necessarily a Tier-2 site issue. Some Tier-1 centers were better at
importing data from the Tier-0 than at exporting to the Tier-2
centers. One item that was identified is that a large number of
transfer requests could clog the queues and lead to component
failures. The FTS system is designed to throttle transfer requests,
but the developers initially focused on protecting import rather than
export. CMS is continuing the discussion with the developers on the
architecture and implementation of throttling in FTS.
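
Throttling in this context means capping how many transfers a service
will run concurrently in each direction, so that a burst of requests
queues up rather than overwhelming a storage element. The sketch below
illustrates a per-direction concurrency cap; it is an assumption-level
illustration of the concept, not the FTS implementation or its
configuration.

\begin{verbatim}
from collections import deque

class DirectionThrottle:
    """Cap concurrent transfers separately for import and export."""

    def __init__(self, max_import=10, max_export=10):
        self.limits = {"import": max_import, "export": max_export}
        self.active = {"import": 0, "export": 0}
        self.queues = {"import": deque(), "export": deque()}

    def request(self, direction, transfer_id):
        # Start immediately if below the cap, otherwise queue the request.
        if self.active[direction] < self.limits[direction]:
            self.active[direction] += 1
            return f"started {transfer_id} ({direction})"
        self.queues[direction].append(transfer_id)
        return f"queued {transfer_id} ({direction})"

    def finish(self, direction):
        # Free a slot and promote the next queued transfer, if any.
        self.active[direction] -= 1
        if self.queues[direction]:
            nxt = self.queues[direction].popleft()
            self.active[direction] += 1
            return f"started {nxt} ({direction})"
        return None

throttle = DirectionThrottle(max_import=2, max_export=1)
print(throttle.request("export", "T1->T2 block 1"))   # started
print(throttle.request("export", "T1->T2 block 2"))   # queued
print(throttle.finish("export"))                      # starts block 2
\end{verbatim}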

One area where the general lesson about operations load was felt most
strongly was data management at the Tier-2 centers. The data stored at
a Tier-2 center is defined by the community it supports, and a clear
need for tools that allow the Tier-2 operators to control the resident
data was identified during the challenge.