\section{Conclusions and Lessons Learned}

As a complete exercise CSA06 was extremely successful. The technical
metrics were all met and some were exceeded by large factors. While
there is still considerable work to do, especially in the integration
with data acquisition and on-line computing, the intended
functionality was demonstrated in the challenge. With this success
come valuable lessons for CMS as it transitions to operations and
stable running.

\subsection{General}

There are a number of general lessons CMS can take away from the
challenge. The first is in the area of the transition to operations:
CMS needs development work to ease the operations load. CSA06 was
very successful, but it required a higher level of effort and
attention than could reasonably be expended for an experiment running
for years. As CMS transitions from development, through successful
scale demonstrations, to stable operations, it is to be expected that
development activities will be identified to reduce the operational
load. During the challenge several specific areas were identified;
these are listed in the sections below. For several elements the need
for more fine-grained operator control was identified, while for
others more automation was needed.

Another general lesson was that the strong engagement with the
Worldwide LHC Computing Grid (WLCG) and with the computing sites
themselves was extremely useful. The few problems encountered with
grid services were addressed very promptly. Sites generally responded
to problems and solved them. An item related to the lesson about
operations is the communication of problems that need to be addressed.
Sites and services were generally repaired promptly once problems were
identified, but frequently it was attentive operators, not automated
systems, that saw a problem first.

Scale testing continues to be an extremely important activity.
Initial scaling issues were identified and solved in several CSA06
components. Most of the problems identified were related to
components that were relatively new, or that were used at a new scale
without being thoroughly tested. CMS was fortunate that all of the
scaling problems seen in CSA06 were straightforward to solve and were
fixed promptly. CMS needs to achieve nearly a factor of four in scale
for some of the components before high-energy running in 2008, and the
sooner scaling issues are identified the more time will be available
to solve them.

\subsection{Offline Software}

There were three important lessons from the offline software experience
of CSA06. The first is that the software, in the configuration used,
was able to sustain greater than a 25\% load for the prompt
reconstruction activity. The software error rate, performance and
memory footprint were well within expectations, contributing to smooth
running of the Tier-0 farm. The reconstruction ran somewhat faster
than the time budget requires, but several of the slower and less
stable reconstruction algorithms were intentionally left out of the
prompt reconstruction workflow. As the reconstruction software
evolves, the performance and stability should be watched.

The second lesson was that the ability to promptly create and
distribute a new software release was invaluable. During CSA06
operations CMS released four versions of the software to address
issues that were encountered and to add functionality. The ability to
make a release promptly and to install it equally promptly at the
remote sites was extremely helpful in meeting challenge goals.

The last lesson is that CMS needs a more formal validation process and
checklist for the application before a release is tagged. A problem
in the re-reconstruction application was identified in the final two
weeks of the challenge. With a more rigorous validation procedure the
problem might have been seen in the opening days of the challenge,
giving more time to solve it. While it is impossible to test for all
possible conditions, an agreed list of validation checks would be
useful.

\subsection{Production and Grid Tools}

\subsubsection{Organized Processing}

There are a number of lessons to take away from the experience with
production and grid tools. The first is that the Prod\_Agent worked
well in the pre-challenge production. CMS was able to meet the very
ambitious goal of 25M simulated events produced per month. The
one-pass production chain contributed to the high efficiency of the
production application during July and August. The production was
performed by four teams, which represents a decrease in operations
effort over previous production exercises. CMS will need to maintain
this efficiency and flexibility as the simulation becomes more
complicated for the physics validation.

The Prod\_Agent infrastructure also worked well for accessing existing
data and applying the user selections. The teams operating the agents
were able to apply multiple selections simultaneously. The merging
and data registration components worked well and could be reused from
the simulated event production workflow.

Even though the exercises were successful, there is clearly room for
improvement. CMS needs to continue to improve the automation of the
workflows for re-reconstruction, selection and skimming of events. The
chain from request, to validation, to scheduling, to large-scale
execution has steps that involve people. The human interactions can be
reduced and more automated workflows can be implemented. In the CSA06
workflows the work assignments and output destinations were conveyed
by e-mail. During CSA06 the production teams were responsible for
combining the skims, testing the configurations, and executing the
skims.

In addition to improving the automation for schedulable items like
skimming, we also need to improve the transparency. In general, users
and groups need a consistent entry point to see the status of their
requests and the location of the output. The request system, the data
transfer system, and the dataset bookkeeping system need to be tied
together to give consistent end-to-end user views.

The re-reconstruction activity was, by design, a demonstration of
functionality and not a demonstration of the final production
workflow. There is work left to do to ensure that every event is
re-reconstructed and that processing failures are tracked and
addressed.

\subsubsection{User Analysis Workflows}

The largest success for the analysis workflows was the demonstration
that the gLite and Condor-G job submission systems can achieve the
goal of 50k jobs per day. Integration and scale testing continues to
be very important. CMS integrated CRAB with the gLite bulk submission
only shortly before the challenge began. There had been testing of
the underlying infrastructure through the CMS WLCG Integration task
force, but no scale testing with the CMS submission system. The two
problems encountered in achieving scale were both in the CMS
implementation and not in the underlying infrastructure, and both were
promptly addressed by the CMS developers. As the number of people
participating and the number of jobs increase, the importance of scale
testing will only grow.

In order to reach the target submission rate CMS needed to make heavy
use of load-generating ``job robots''. While the robots generate
workflows that closely resemble user analysis jobs, the robots are not
a substitute for an active user community for testing. For the next
series of challenges CMS should ensure that a larger number of
individuals are performing analysis.

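The robots are essentially pacing loops wrapped around the standard
submission tools. The sketch below illustrates the idea only; the
target rate, task size and placeholder submission call are assumptions
and this is not the actual robot implementation.

\begin{verbatim}
import time

TARGET_JOBS_PER_DAY = 50000   # CSA06 goal across all submitters
JOBS_PER_TASK = 100           # jobs created by one bulk submission
SECONDS_PER_DAY = 86400.0

# spacing between bulk submissions needed to reach the daily target
INTERVAL = SECONDS_PER_DAY / (TARGET_JOBS_PER_DAY / JOBS_PER_TASK)

def submit_task(task_id):
    """Placeholder for one bulk submission of JOBS_PER_TASK jobs;
    a real robot would invoke the CRAB client here."""
    print("submitting task %d (%d jobs)" % (task_id, JOBS_PER_TASK))

def run_robot(n_tasks):
    for task_id in range(n_tasks):
        start = time.time()
        submit_task(task_id)
        # sleep off the remainder of the submission interval
        time.sleep(max(0.0, INTERVAL - (time.time() - start)))

run_robot(n_tasks=10)
\end{verbatim}

Because the workload is synthetic, such a loop exercises the
submission and scheduling infrastructure but not the data access
patterns or failure modes of real analyses, which is why the robots do
not replace an active user community.
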
Though it accounted for only about 10\% of the total job submissions,
the user-driven analysis in the challenge was successful, with CRAB
functioning well on both EGEE and OSG sites. This document highlights
some of the types of analysis that were successfully completed.
Nevertheless, there are a number of lessons. The first is that CMS
needs to improve the user support model. Currently user support is
provided by a mailing list in a community support model that works
well for the size of the community currently being supported. It is
not clear whether this informal support will scale to the larger
collaboration, and it is possible for requests to fall through the
cracks. CMS should look at hybrid support models that assign and
track tickets while ensuring that a large enough community of people
sees the support requests, in order to maintain a broad base of
supporters.

\subsection{Offline Database and Frontier}

The offline database infrastructure was successful in the challenge.
The calibration data could be distributed to remote locations from a
single database instance at CERN using the Frontier infrastructure.
The initial attempt in the Tier-0 workflow identified scaling
limitations in the CMS web cache configuration for Frontier and
stability issues in the application code. Both of these were promptly
addressed, but they underscore the need for validation and scale
testing.

The other offline database lesson is related to the way CMS stores
calibration constants in the database and the frequency with which
they are invalidated. Currently CMS stores the calibration
information as a large number of small objects, which are treated as
independent queries by the offline databases and are invalidated
daily. The first application of the day can expect to spend almost an
hour updating the database information in the offline cache, which is
not reasonable in the long term.

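The scale of the problem can be illustrated with a rough estimate; the
numbers below are assumptions chosen for illustration, not measured
CSA06 values. If the calibrations are split into roughly $10^{5}$
independently queried objects and each cold-cache query costs of order
30\,ms in round-trip and database time, the first refill of the day
takes
\[
10^{5} \times 30\,\mathrm{ms} \approx 3000\,\mathrm{s} \approx
50\ \mathrm{minutes},
\]
which is consistent with the hour-long delay observed. Grouping the
constants into larger objects, or relaxing the daily invalidation,
reduces either the number of queries or their frequency.
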
\subsection{Data Management}

The CMS data management solution, relying on central components for
data bookkeeping, data location, and data transfer management, and on
site-local components for data resolution, worked well and reduced the
effort required of the site operators. The changes in the CMS event
data model significantly simplified the access to the data by analysis
applications.

The general lesson from data management is that CMS needs to ensure
that all the data management components have a consistent picture of
the data. The synchronization of the various views needs to be better
automated. CMS has data management information in the dataset
bookkeeping system (DBS), the data transfer system (PhEDEx), and the
dataset location service. These components were able to fall out of
sync with one another, and maintaining consistency currently involves
some manual operations.

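A minimal sketch of the kind of automated cross-check implied here is
shown below. The two fetch functions are hypothetical placeholders
standing in for queries to DBS and PhEDEx; they are not the real
interfaces of either system.

\begin{verbatim}
# Illustrative consistency check between data management views.
# The two fetch functions are hypothetical placeholders, not the
# real DBS or PhEDEx interfaces.

def files_in_bookkeeping(dataset):
    """Return the set of logical file names DBS lists for a dataset."""
    raise NotImplementedError("placeholder for a DBS query")

def files_at_site(dataset, site):
    """Return the set of logical file names PhEDEx believes reside
    at a site for the dataset."""
    raise NotImplementedError("placeholder for a PhEDEx query")

def check_consistency(dataset, site):
    booked = files_in_bookkeeping(dataset)
    resident = files_at_site(dataset, site)
    missing = booked - resident    # known centrally, not at the site
    orphaned = resident - booked   # at the site, unknown centrally
    return missing, orphaned
\end{verbatim}

Running such a comparison routinely, rather than on demand, is one way
to reduce the manual operations mentioned above.
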
A specific element that was identified in CSA06 was the need to
examine the DBS performance in the presence of merging output. The
initial performance estimates did not include this use case, which
introduces a heavy load on the DBS. For many output streams the
performance of the bookkeeping system limited the rate at which data
selections could be prepared. The performance limitation is being
addressed in the next generation of the DBS.

Data publication and the trivial file catalog resolution of the
logical to physical file names both worked well. The trivial file
catalog scaled well and applications were able to consistently
discover data file locations with minimal additional services required
at the sites.

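The trivial file catalog is essentially a short, site-local list of
rewriting rules applied to the logical file name. The sketch below
shows the idea; the rule and path shown are invented for illustration
and are not a real site's configuration or the actual CMS catalog
syntax.

\begin{verbatim}
import re

# Illustrative logical-to-physical name resolution in the spirit of a
# trivial file catalog: an ordered list of (pattern, replacement)
# rules kept at the site.  The rule below is a made-up example.
RULES = [
    (r"^/store/(.+)$", r"/storage/cms/store/\1"),
]

def lfn_to_pfn(lfn):
    """Apply the first matching rewrite rule to turn an LFN into a PFN."""
    for pattern, replacement in RULES:
        if re.match(pattern, lfn):
            return re.sub(pattern, replacement, lfn)
    raise ValueError("no rule matches %s" % lfn)

# lfn_to_pfn("/store/CSA06/file.root")
#   -> "/storage/cms/store/CSA06/file.root"
\end{verbatim}

Because the mapping is a pure string rewrite, no catalog service has
to be contacted at job run time, which is why the approach scales with
minimal additional services at the sites.
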
\subsection{Workflow Management}

Workflow management components both at CERN and at the remote centers
were able to achieve the level of activity required in the challenge.
There is some overlap in the implementation of the Tier-0 workflow and
the Prod\_Agent workflow used at the Tier-1 and Tier-2 centers, which
should be re-examined after the challenge with an eye toward long-term
maintainability and support.

\subsection{Central Services}

Central services and facilities at CERN from IT and WLCG, including
the batch resources and FTS, were carefully monitored and problems
were solved. CASTOR support at CERN was excellent. As an export
system, CASTOR2 performed at a higher rate and more stably than in
past CMS exercises. CMS ran into an issue with the SRM release for
files greater than 2GB in DPM, which was solved the next day.

\subsection{Tier-0}

The Tier-0 workflow and dataflow management tools performed better than
required for CSA06, showing no significant problems throughout the
challenge. The flexibility of the message-based architecture allowed
adaptation of the running system to the changing operational conditions
as the challenge progressed, without any interruption of service.

No inherent scaling problems were found, and key Tier-0 components
(hardware, software and people) were far from being stressed during the
challenge. The system achieved the low latency response required for
real data-taking.

Most of the complexity of the final system was explored during the
challenge; other aspects had already been exercised with the ``July
prototype''. The design of the Tier-0 can therefore be deemed validated.

Operationally, the Tier-0 can already be installed, configured, and run
by non-experts. The Tier-0 internal goals of exploring the operations
during CSA06 have therefore also been met.

\subsection{Tier-1}

Six of the seven Tier-1 centers met the complete goals for full
participation in the challenge, with successful transfers on 90\% of
the days. The transfer quality, defined in CMS as the number of times
a transfer was attempted before succeeding, was significantly improved
for CERN to Tier-1 transfers during the challenge as compared to
previous service challenge exercises. There are several elements to
improve in the final year of experiment preparation. Several Tier-1
centers had problems importing and exporting data simultaneously:
these centers either experienced unstable data export or limited
performance. The majority of Tier-1 sites demonstrated successful
migration of data to tape, but there is substantial work left to
demonstrate that CMS can write the full data rate to tape at the
Tier-1 centers and serve the data to all Tier-2 centers when requested.

A specific technical item was identified: the FTS timeouts were too
tight for sites with low access bandwidth and high latency once CMS
moved to files larger than 4GB. In the final experiment the raw data
files should be between 5GB and 10GB, so CMS will need to revisit the
transfer timeouts again.

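The sensitivity to file size is easy to see with a rough estimate; the
bandwidth figure is an assumption chosen for illustration. A site that
can read a single file across the wide-area network at only 10\,MB/s
needs
\[
\frac{10\ \mathrm{GB}}{10\ \mathrm{MB/s}} \approx 1000\ \mathrm{s}
\approx 17\ \mathrm{minutes}
\]
per raw data file before any latency or retry overhead, so a per-file
timeout tuned for files of a few hundred megabytes will abort such
transfers even when they are progressing normally.
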
\subsection{Tier-2}

The number of Tier-2 centers participating in the challenge was larger
than the original goal, and a broader variety of activities was
successfully performed by the Tier-2 centers. An item to improve is
the amount of effort required to make Tier-2 transfers work. Some
sites accepted data only from particular sites. Early in the
challenge the PhEDEx dynamic routing led to unpredictable
Tier-1-to-Tier-2 paths through intermediate Tier-1 centers. The PhEDEx
operations team therefore modified the path cost metrics in PhEDEx to
avoid multi-hop transfers and to make the routes more static and
prescribed, which makes the transfers look more like the baseline
computing model.

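The effect of the change can be illustrated with a toy cost model; the
sites, link costs, hop penalty and shortest-path framing are invented
for illustration and are not the PhEDEx router internals.

\begin{verbatim}
# Toy illustration of biasing route selection against multi-hop
# transfers.  The graph and costs are invented.

def path_cost(path, link_cost, hop_penalty):
    """Total cost = sum of link costs + penalty per extra hop."""
    hops = list(zip(path, path[1:]))
    return sum(link_cost[h] for h in hops) + hop_penalty * (len(hops) - 1)

link_cost = {
    ("T1_A", "T2_X"): 5,                       # direct link
    ("T1_A", "T1_B"): 1, ("T1_B", "T2_X"): 1,  # cheap-looking two-hop route
}

direct = ["T1_A", "T2_X"]
twohop = ["T1_A", "T1_B", "T2_X"]

for penalty in (0, 10):
    best = min((direct, twohop),
               key=lambda p: path_cost(p, link_cost, penalty))
    print("hop penalty %2d -> route: %s" % (penalty, " -> ".join(best)))

# With no penalty the two-hop route looks cheaper; with a large hop
# penalty the direct Tier-1-to-Tier-2 link is always chosen, which is
# the static, prescribed behaviour described above.
\end{verbatim}
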
The poor transfer quality on the PhEDEx monitoring plots is not
necessarily a Tier-2 site issue. Some Tier-1 centers could import data
from the Tier-0 better than they could export to Tier-2 centers. One
item that was identified is that a large number of transfer requests
could clog the queues and lead to component failures. The FTS system
is designed to throttle transfer requests, but the developers
initially focused on protection of imports rather than exports. CMS is
continuing the discussion with the developers on the architecture and
implementation of throttling in FTS.

One area where the general lesson about operations load was felt most
strongly was data management at the Tier-2 centers. The data stored at
a Tier-2 center is defined by the community it supports, and a clear
need for tools to allow the Tier-2 to control the resident data was
identified during the challenge.