\section{Lessons Learned}

As a complete exercise CSA06 was extremely successful. The technical
metrics were all met and some were exceeded by large factors. While
there is still considerable work to do, especially in the integration
with data acquisition and on-line computing, the intended
functionality was demonstrated in the challenge. Along with the
success come valuable lessons for CMS as we transition to operations
and stable running.

\subsection{General}

There are a number of general lessons CMS can take away from the
challenge. The first is in the area of the transition to operations:
CMS needs development work to ease the operations load. CSA06 was
very successful, but it required a higher level of effort and
attention than could reasonably be expended for an experiment running
for years. As CMS transitions from development, through successful
scale demonstrations, to stable operations, it is to be expected that
development activities will be identified to reduce the operational
load. Several specific areas were identified during the challenge and
are listed in the sections below; for some elements more fine-grained
operator control is needed, while for others more automation is
needed.

Another general lesson was that the strong engagement with the
Worldwide LHC Computing Grid (WLCG) and the computing sites themselves
was extremely useful. The few problems encountered with grid services
were addressed very promptly, and sites generally responded to
problems and solved them. Related to the lesson about operations is
how problems needing attention were communicated: sites and services
were generally repaired promptly once problems were identified, but
frequently it was attentive operators and not automated systems that
saw the problem first.

Scale testing continues to be an extremely important activity.
Initial scaling issues were identified and solved in several CSA06
components. Most of the problems identified were related to
components that were relatively new, or were used at a new scale
without being thoroughly tested. CMS was lucky that all of the
scaling problems seen in CSA06 were straightforward to solve and were
fixed promptly. CMS needs to achieve nearly a factor of four increase
in scale for some of the components before high energy running in
2008, and the sooner scaling issues are identified the more time will
be available to solve them.

\subsection{Offline Software}
There were three important lessons from the offline software experience
of CSA06. The first is that the software in the configuration used
was able to sustain greater than a 25\% load for the prompt
reconstruction activity. The software error rate, performance, and
memory footprint were well within expectations, contributing to smooth
running of the Tier-0 farm. The performance was somewhat faster than
the time budget allows, but several of the slower and less stable
reconstruction algorithms were intentionally left out of the prompt
reconstruction workflow. As the reconstruction software evolves, the
performance and stability should continue to be watched.

The second lesson was that the ability to promptly create and
distribute a new software release was invaluable. During CSA06
operations CMS released four versions of the software to address
issues that were encountered and to add functionality. The ability to
make a release promptly and to install it equally promptly at the
remote sites was extremely helpful in meeting challenge goals.

The last lesson is that CMS needs a more formal validation process and
checklist for the application before a release is tagged. A problem
in the re-reconstruction application was identified in the final two
weeks of the challenge. With a more rigorous validation procedure the
problem might have been seen in the opening days of the challenge,
giving more time to solve it. While it is impossible to test for all
possible conditions, a well-defined list of validation checks would be
useful.

\subsection{Production and Grid Tools}

\subsubsection{Organized Processing}
There are a number of lessons to take away from the experience with
production and grid tools. The first is that the Prod\_Agent worked
well in the pre-challenge production. CMS was able to meet the very
ambitious goal of 25M events per month of simulated event
production. The one-pass production chain contributed to the high
efficiency of the production application during July and August. The
production was performed by four teams, which is a decrease in
operations effort compared to previous production exercises. CMS will
need to maintain the efficiency and the flexibility as the simulation
becomes more complicated for the physics validation.

The Prod\_Agent infrastructure also worked well for accessing existing
data and applying the user selections. The teams operating the agents
were able to apply multiple selections simultaneously. The merging
and data registration components worked well and could be reused from
the simulated event production workflow.

Even though the exercises were successful there is clearly room for
improvement. CMS needs to continue to improve automation of the
workflow for re-reconstruction, selection, and skimming of events.
The chain from request, to validation, to scheduling, to large-scale
execution has components that involve people. The human interactions
can be reduced and more automated workflows can be implemented. In
the CSA06 workflows the work assignments and output destinations were
conveyed by e-mail, and the production teams were responsible for
combining the skims, testing the configurations, and executing the
skims.
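
One way to remove the e-mail step is to give each processing request
an explicit record with a small set of states that the production
tools can advance automatically. The sketch below is purely
illustrative; the state names, dataset, and destination are
hypothetical and do not correspond to an existing CMS tool:

\begin{verbatim}
from dataclasses import dataclass, field

# Ordered states a processing request passes through; illustrative only.
STATES = ["requested", "validated", "scheduled", "running", "completed"]

@dataclass
class ProcessingRequest:
    dataset: str
    selection: str
    destination: str
    state: str = "requested"
    history: list = field(default_factory=list)

    def advance(self):
        """Move the request to the next state and record the transition."""
        index = STATES.index(self.state)
        if index + 1 < len(STATES):
            self.history.append(self.state)
            self.state = STATES[index + 1]
        return self.state

# Example: a skim request moving from 'requested' to 'validated'.
request = ProcessingRequest("/CSA06/ExampleDataset", "example_skim",
                            "T2_Example")
request.advance()
\end{verbatim}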

In addition to improving the automation for schedulable items like
skimming, we also need to improve the transparency. In general users
and groups need a consistent entry point to see the status of the
requests and the location of the output. The request system, the data
transfer system, and the dataset bookkeeping system need to be tied
together for consistent end-to-end user views.

The re-reconstruction activity was, by design, a demonstration of
functionality and not a demonstration of the final production
workflow. There is work left to do to ensure that every event is
re-reconstructed and that processing failures are tracked and
addressed.


\subsubsection{User Analysis Workflows}

The largest success for the analysis workflows was the demonstration
that the gLite and Condor-G job submission systems can achieve the
goal of 50k jobs per day. Integration and scale testing continue to
be very important. CMS integrated CRAB with the gLite bulk submission
only shortly before the challenge began. There had been testing of
the underlying infrastructure through the CMS WLCG Integration task
force but no scale testing with the CMS submission system. The two
problems in achieving scale were both related to the CMS
implementation and not to the underlying infrastructure, and both
issues were promptly addressed by the CMS developers. As the number
of people participating and the number of jobs increase, the
importance of scale testing will only increase.

In order to reach the target submission rate CMS needed to make heavy
use of load-generating ``job robots''. While the robots generate
workflows that closely resemble user analysis jobs, the robots are not
a substitute for an active user community for testing. For the next
series of challenges CMS should ensure that a larger number of
individuals perform analysis.
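
A job robot in this sense is essentially a loop that keeps submitting
realistic analysis workflows at a steady rate. A minimal sketch of
the idea is shown below; the submission call and dataset names are
placeholders, not the actual CRAB or robot implementation:

\begin{verbatim}
import random
import time

# Placeholder dataset names; a real robot would read these from configuration.
DATASETS = ["/CSA06/ExampleDatasetA", "/CSA06/ExampleDatasetB"]

def submit_analysis_job(dataset):
    """Stand-in for a real submission call (e.g. through CRAB)."""
    print("submitting analysis job on", dataset)

def run_robot(jobs_per_day=50000, duration_s=60):
    """Submit jobs at a steady rate that approximates the daily target."""
    interval = 86400.0 / jobs_per_day   # about 1.7 s between submissions
    deadline = time.time() + duration_s
    while time.time() < deadline:
        submit_analysis_job(random.choice(DATASETS))
        time.sleep(interval)
\end{verbatim}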

Though it accounted for only about 10\% of the total job submissions,
the user-driven analysis in the challenge was successful, with CRAB
functioning well on both EGEE and OSG sites. This document highlights
some of the types of analysis that were successfully completed.
Nevertheless, there are a number of lessons. The first is that CMS
needs to improve the user support model. Currently user support is
provided by a mailing list in a community support model that works
well for the size of the community currently being supported. It is
not clear whether this informal support will scale to the larger
collaboration, and it is possible for requests to fall through the
cracks. CMS should look at hybrid support models that assign and
track tickets while ensuring that a large enough community of people
sees the support requests, to continue to provide a broad base of
supporters.

\subsection{Offline Database and Frontier}

The offline database infrastructure was successful in the challenge.
The calibration data could be distributed to remote locations from a
single database instance at CERN using the Frontier infrastructure.
The initial attempt in the Tier-0 workflow identified scaling
limitations in the CMS web cache configuration for Frontier and
stability issues in the application code. Both of these were promptly
addressed, but they underscore the need for validation and scale
testing.

The other offline database lesson is related to the way CMS stores
calibration constants in the database and the frequency with which
they are invalidated. Currently CMS stores the calibration
information as a large number of small objects, which are treated as
independent queries by the offline database and which are invalidated
daily. The first application of the day can expect to spend almost an
hour updating the database information in the offline cache, which is
not reasonable in the long term.
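
As a rough illustration of the scale of the problem (the object count
and per-query latency below are assumed values for the purpose of the
estimate, not measurements from the challenge), if the calibration
payload is split into roughly $10^{5}$ small objects and each object
is fetched as an independent query against the freshly invalidated
cache with a round-trip time of order 30~ms, the first job of the day
pays
\[
  t_{\mathrm{refresh}} \approx 10^{5} \times 30~\mathrm{ms}
  \approx 50~\mathrm{minutes},
\]
which is consistent with the almost one hour observed. Grouping the
constants into fewer, larger objects, or invalidating the cache less
frequently, would reduce this cost roughly in proportion.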

\subsection{Data Management}

The CMS data management solution, which relies on central components
for data bookkeeping, data location, and data transfer management and
on site components for data resolution, worked well and reduced the
effort required of the site operators. The changes in the CMS event
data model significantly simplified access to the data by analysis
applications.

The general lesson from data management is that CMS needs to ensure
that all the data management components have a consistent picture of
the data. The synchronization of the various views needs to be better
automated. CMS has data management information in the dataset
bookkeeping system (DBS), the data transfer system (PhEDEx), and the
dataset location service, and these components were able to fall out
of sync with one another. Maintaining consistency currently involves
some manual operations.
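
A simple automated cross-check could reduce the manual consistency
work. The sketch below assumes each system can return the set of file
blocks it knows for a dataset; the function name and inputs are
hypothetical and are not actual DBS or PhEDEx interfaces:

\begin{verbatim}
def compare_catalogues(dbs_blocks, phedex_blocks, location_blocks):
    """Report file blocks that are not known to all three systems.

    Each argument is a set of block names as reported by one component;
    how those sets are obtained from each service is not shown here.
    """
    return {
        "missing_from_dbs": (phedex_blocks | location_blocks) - dbs_blocks,
        "missing_from_phedex": (dbs_blocks | location_blocks) - phedex_blocks,
        "missing_from_location": (dbs_blocks | phedex_blocks) - location_blocks,
        "consistent": dbs_blocks & phedex_blocks & location_blocks,
    }

# Example with toy block names.
report = compare_catalogues({"blockA", "blockB"}, {"blockA"},
                            {"blockA", "blockB"})
\end{verbatim}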

A specific element identified in CSA06 was the need to examine the
DBS performance in the presence of merging output. The initial
estimates of the performance needs did not include this use-case,
which introduces a heavy load on the DBS. For many output streams the
performance of the bookkeeping system limited the rate at which data
selections could be prepared. The performance limitation is being
addressed in the next generation of the DBS.

Data publication and the trivial file catalog resolution of
logical to physical file names both worked well. The trivial file
catalog scaled well and applications were able to consistently
discover data file locations with minimal additional services required
at the sites.
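
The trivial file catalog is in essence a small set of site-local
rewriting rules applied to the logical file name. The sketch below
illustrates this kind of rule-based resolution; the rule pattern and
storage element name are invented for the example and are not the
actual CMS configuration:

\begin{verbatim}
import re

# Illustrative site-local rules mapping logical file names (LFNs) to
# physical file names (PFNs); real sites define their own rule sets.
LFN_TO_PFN_RULES = [
    (r"^/store/(.*)$", r"srm://se.example-tier2.org/pnfs/cms/store/\1"),
]

def resolve(lfn):
    """Return the first matching PFN for an LFN, or None if no rule applies."""
    for pattern, replacement in LFN_TO_PFN_RULES:
        if re.match(pattern, lfn):
            return re.sub(pattern, replacement, lfn)
    return None

print(resolve("/store/data/CSA06/example/file.root"))
\end{verbatim}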

\subsection{Workflow Management}

Workflow management components both at CERN and at remote centers
were able to achieve the level of activity required in the challenge.
There is some overlap in the implementation of the Tier-0 workflow
and the Prod\_Agent workflow used at the Tier-1 and Tier-2 centers,
which should be re-examined after the challenge with an eye toward
long-term maintainability and support.

\subsection{Central Services}

Central services and facilities provided at CERN by IT and WLCG,
including the batch resources and FTS, were carefully monitored and
problems were solved. CASTOR support at CERN was excellent. As an
export system, CASTOR2 performed at a higher rate and more stably than
in past CMS exercises. CMS ran into an issue with the SRM release in
DPM for files greater than 2GB, which was solved the next day.

\subsection{Tier-0}

The Tier-0 workflow and dataflow management tools performed better than
required for CSA06, showing no significant problems throughout the
challenge. The flexibility of the message-based architecture allowed
adaptation of the running system to the changing operational conditions,
as the challenge progressed, without any interruption of service.

No inherent scaling problems were found, and key Tier-0 components
(hardware, software and people) were far from being stressed during the
challenge. The system achieved the low latency response required for
real data-taking.

Most of the full range of complexity of the final system was explored
during the challenge. Other aspects were already explored with the ``July
prototype''. The design of the Tier-0 can therefore be deemed validated.

Operationally, the Tier-0 can be installed, configured, and run by
non-experts already. The Tier-0 internal goals of exploring the operations
during CSA06 have therefore also been met.

\subsection{Tier-1}

Six of the seven Tier-1 centers met the complete goals for full
participation in the challenge, with successful transfers on 90\% of
the days. The transfer quality, defined in CMS as the number of times
a transfer was attempted before succeeding, was significantly improved
for CERN to Tier-1 transfers during the challenge as compared to
previous service challenge exercises. There are several elements to
improve in the final year of experiment preparation. Several Tier-1
centers had problems importing and exporting data simultaneously;
these centers experienced either unstable data export or limited
performance. The majority of Tier-1 sites demonstrated successful
migration of data to tape, but there is substantial work left to
demonstrate that CMS can write the full data rate to tape at Tier-1
centers and serve the data to all Tier-2 centers when requested.

A specific technical item was identified: the FTS timeouts were too
tight for sites with low access bandwidth and high latency once CMS
moved to files larger than 4GB. In the final experiment the raw data
files should be between 5GB and 10GB, so CMS will need to revisit the
transfer timeouts again.
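
As a simple illustration of why the timeouts must track the file size
(the transfer rate used here is an assumed example, not a site
measurement), a single transfer to a site with an effective per-file
rate of 20~MB/s needs
\[
  t_{\mathrm{transfer}} \approx \frac{5~\mathrm{GB}}{20~\mathrm{MB/s}}
  \approx 250~\mathrm{s}
\]
for a 5GB raw data file, and proportionally longer for 10GB files or
slower links, so any fixed timeout tuned to the smaller files used
earlier will eventually be exceeded on the slowest links.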

\subsection{Tier-2}

The number of Tier-2 centers participating in the challenge was larger
than the original goal and a broader variety of activities was
successfully performed by the Tier-2 centers. An item to improve is
the amount of effort required to make Tier-2 transfers work. Some
sites accepted data only from particular sites. Early in the
challenge PhEDEx dynamic routing led to unpredictable Tier-1-to-Tier-2
paths through intermediate Tier-1s. In response, the PhEDEx
operations team modified the path cost metrics in PhEDEx to avoid
multi-hop transfers and make the routes more static and prescribed,
which makes the transfers look more like the baseline computing model.
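
Conceptually, the change amounts to weighting the path cost so that a
direct Tier-1-to-Tier-2 link always wins over a multi-hop route. The
sketch below illustrates that selection logic only; the cost values
and penalty are invented and are not the actual PhEDEx metrics:

\begin{verbatim}
def route_cost(path, hop_penalty=1000.0):
    """Cost of a candidate path: the sum of per-link costs plus a large
    penalty for every intermediate hop, so a direct link is preferred
    whenever one exists.  All numbers here are illustrative."""
    link_cost = sum(cost for _, _, cost in path)
    intermediate_hops = len(path) - 1
    return link_cost + hop_penalty * intermediate_hops

def choose_route(candidate_paths):
    """Pick the cheapest candidate path under the penalised metric."""
    return min(candidate_paths, key=route_cost)

# A direct Tier-1 to Tier-2 link beats a nominally cheaper two-hop route.
direct = [("T1_Example", "T2_Example", 10.0)]
via_other_t1 = [("T1_Example", "T1_Other", 2.0),
                ("T1_Other", "T2_Example", 2.0)]
assert choose_route([direct, via_other_t1]) == direct
\end{verbatim}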

The poor transfer quality on the PhEDEx monitoring plot is not
necessarily a Tier-2 site issue. Some Tier-1 centers could import data
from the Tier-0 more effectively than they could export it to Tier-2
centers. One item that was identified is that a large number of
transfer requests could clog the queues and lead to component
failures. The FTS system is designed to throttle transfer requests,
but the developers initially focused on protecting imports rather than
exports. CMS is continuing the discussion with the developers on the
architecture and implementation of throttling in FTS.

One area where the general lesson about operations load was felt most
strongly was data management at the Tier-2 centers. The data stored
at a Tier-2 center is defined by the community it supports, and a
clear need was identified during the challenge for tools that allow
the Tier-2 to control its resident data.