\section{Lessons Learned}

As a complete exercise, CSA06 was extremely successful. The technical
metrics were all met and some were exceeded by large factors. While
there is still considerable work to do, especially in the integration
with data acquisition and on-line computing, the intended
functionality was demonstrated in the challenge. Along with the
success come valuable lessons for CMS as we transition to operations
and stable running.

\subsection{General}

There are a number of general lessons CMS can take away from the
challenge. The first is in the area of the transition to operations:
CMS needs development work to ease the operations load. CSA06 was
very successful, but it required a higher level of effort and
attention than could reasonably be expended for an experiment running
for years. As CMS moves from development through successful scale
demonstrations to stable operations, it is to be expected that
development activities will be identified to reduce the operational
load. During the challenge several specific areas were identified,
and they are listed in the sections below. For some elements more
fine-grained operator control is needed, while for others more
automation is needed.

Another general lesson was that the strong engagement with the
Worldwide LHC Computing Grid (WLCG) and the computing sites themselves
was extremely useful. The few problems encountered with grid services
were addressed very promptly, and sites generally responded to
problems and solved them. Related to the lesson about operations is
the communication of problems that need to be addressed. Sites and
services were generally repaired promptly once problems were
identified, but frequently it was attentive operators, not automated
systems, that saw the problem first.

Scale testing continues to be an extremely important activity.
Initial scaling issues were identified and solved in several CSA06
components. Most of the problems identified were related to
components that were relatively new, or that were used at a new scale
without being thoroughly tested. CMS was fortunate that all of the
scaling problems seen in CSA06 were straightforward to solve and were
fixed promptly. CMS needs to achieve nearly a factor of four in scale
for some of the components before high-energy running in 2008, and the
sooner scaling issues are identified the more time will be available
to solve them.

\subsection{Offline Software}
There were three important lessons from the offline software experience
of CSA06. The first is that the software, in the configuration used,
was able to sustain greater than a 25\% load for the prompt
reconstruction activity. The software error rate, performance and
memory footprint were well within expectations, contributing to smooth
running of the Tier-0 farm. The reconstruction was somewhat faster than
the time budget allows, but several of the slower and less stable
reconstruction algorithms were intentionally left out of the prompt
reconstruction workflow. As the reconstruction software evolves, the
performance and stability should continue to be watched.

The second lesson was that the ability to promptly create and
distribute a new software release was invaluable. During CSA06
operations, CMS released four versions of the software to address
issues that were encountered and to add functionality. The ability to
make a release promptly, and to install it equally promptly at the
remote sites, was extremely helpful in meeting the challenge goals.

The last lesson is that CMS needs a more formal validation process and
checklist for the application before a release is tagged. A problem
in the re-reconstruction application was identified in the final two
weeks of the challenge. With a more rigorous validation procedure the
problem might have been seen in the opening days of the challenge,
giving more time to solve it. While it is impossible to test for all
possible conditions, an agreed list of validation checks would be
useful.
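
As an illustration only, such a checklist could be driven by a small
script that runs each check and refuses to tag the release if any of
them fails. The individual checks below (reference-sample
reconstruction, memory budget, re-reconstruction workflow) and the
release label are hypothetical placeholders, not the actual CMS
validation suite:

\begin{verbatim}
# Hypothetical sketch of a pre-release validation checklist runner.
# The individual checks are placeholders, not the real CMS procedures.

def check_reco_reference_sample():
    """Run the reconstruction on a small reference sample (placeholder)."""
    return True   # would return False if the job crashes or output is missing

def check_memory_footprint():
    """Compare the measured memory footprint against the budget (placeholder)."""
    measured_mb, budget_mb = 950, 1000   # illustrative numbers
    return measured_mb <= budget_mb

def check_rereco_workflow():
    """Exercise the re-reconstruction configuration end to end (placeholder)."""
    return True

CHECKLIST = [
    ("prompt reconstruction on reference sample", check_reco_reference_sample),
    ("memory footprint within budget",            check_memory_footprint),
    ("re-reconstruction workflow",                check_rereco_workflow),
]

def validate_release(version):
    failures = [name for name, check in CHECKLIST if not check()]
    if failures:
        print("do NOT tag %s, failed checks: %s" % (version, ", ".join(failures)))
        return False
    print("all checks passed, %s can be tagged" % version)
    return True

validate_release("CMSSW_release_candidate")
\end{verbatim}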

\subsection{Production and Grid Tools}

\subsubsection{Organized Processing}
There are a number of lessons to take away from the experience with
the production and grid tools. The first is that the Prod\_Agent worked
well in the pre-challenge production. CMS was able to meet the very
ambitious goal of 25M simulated events per month. The one-pass
production chain contributed to the high efficiency of the production
application during July and August. The production was performed by
four teams, which represents a decrease in operations effort compared
to previous production exercises. CMS will need to maintain this
efficiency and flexibility as the simulation becomes more complicated
for the physics validation.

The Prod\_Agent infrastructure also worked well for accessing existing
data and applying the user selections. The teams operating the agents
were able to apply multiple selections simultaneously. The merging
and data registration components worked well and could be reused from
the simulated event production workflow.

Even though the exercises were successful, there is clearly room for
improvement. CMS needs to continue to improve the automation of the
workflow for re-reconstruction, selection and skimming of events. The
chain from request, to validation, to scheduling, to large-scale
execution has steps that involve people. The human interactions can
be reduced and more automated workflows can be implemented. In the
CSA06 workflows the work assignments and output destinations were
conveyed by e-mail. During CSA06 the production teams were
responsible for combining the skims, testing the configurations, and
executing the skims.
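
A minimal sketch of what a more automated hand-off could look like is
given below: a request record that moves through explicit states
instead of being conveyed by e-mail. The states, fields, dataset and
site names are assumptions for illustration, not the interface of
Prod\_Agent or of the actual CMS request system:

\begin{verbatim}
# Hypothetical sketch: tracking a skim/re-reconstruction request through
# explicit states rather than e-mail.  States and fields are illustrative.

STATES = ["requested", "validated", "scheduled", "running", "done", "failed"]

class Request:
    def __init__(self, dataset, selection, destination):
        self.dataset = dataset          # input dataset path
        self.selection = selection      # skim/selection configuration name
        self.destination = destination  # site where output should be placed
        self.state = "requested"
        self.history = [self.state]

    def advance(self, new_state):
        if new_state not in STATES:
            raise ValueError("unknown state: %s" % new_state)
        self.state = new_state
        self.history.append(new_state)

# Example: a skim request that an operator (or an automated agent)
# moves from validation to execution without any e-mail exchange.
req = Request("/CSA06-ExampleSample/RECO", "dimuon-skim", "T2_Example_Site")
for step in ["validated", "scheduled", "running", "done"]:
    req.advance(step)
print(req.dataset, "->", req.destination, ":", " -> ".join(req.history))
\end{verbatim}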

In addition to improving the automation for schedulable items like
skimming, we also need to improve the transparency. In general, users
and groups need a consistent entry point to see the status of their
requests and the location of the output. The request system, the data
transfer system, and the dataset bookkeeping system need to be tied
together to give consistent end-to-end user views.

The re-reconstruction activity was, by design, a demonstration of
functionality and not a demonstration of the final production
workflow. There is work left to do to ensure that every event is
re-reconstructed and that processing failures are tracked and
addressed.

\subsubsection{User Analysis Workflows}

The largest success for the analysis workflows was the demonstration
that the gLite and Condor-G job submission systems can achieve the
goal of 50k jobs per day. Integration and scale testing continue to
be very important. CMS integrated CRAB with the gLite bulk submission
only shortly before the challenge began. There had been testing of
the underlying infrastructure through the CMS WLCG Integration task
force, but no scale testing with the CMS submission system. The two
problems encountered in reaching the target scale were both in the
CMS implementation rather than in the underlying infrastructure, and
both were promptly addressed by the CMS developers. As the number of
people participating and the number of jobs increase, the importance
of scale testing will only grow.

In order to reach the target submission rate, CMS needed to make heavy
use of load-generating ``job robots''. While the robots generate
workflows that closely resemble user analysis jobs, they are not a
substitute for an active user community in testing. For the next
series of challenges, CMS should ensure that a larger number of
individuals perform analysis.
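
For illustration, the essence of such a job robot is a loop that
repeatedly builds analysis-like tasks over a set of published datasets
and submits them at a controlled rate. The sketch below is
hypothetical: \texttt{submit\_task} stands in for the real submission
command (e.g.\ a CRAB invocation), and the dataset names, task size
and pacing are illustrative, not CSA06 values:

\begin{verbatim}
# Hypothetical sketch of a load-generating "job robot".
# submit_task() is a stand-in for the real submission command (e.g. CRAB);
# dataset names, job counts and pacing are illustrative only.
import random
import time

DATASETS = ["/CSA06-Example-A/RECO", "/CSA06-Example-B/RECO"]  # placeholders
JOBS_PER_TASK = 50            # illustrative task size
TARGET_JOBS_PER_DAY = 50000

def submit_task(dataset, njobs):
    """Placeholder: would wrap the actual grid submission for one task."""
    print("submitting %d jobs on %s" % (njobs, dataset))
    return njobs

def run_robot():
    submitted = 0
    # pace submissions so the daily target is spread over 24 hours
    pause = 86400.0 * JOBS_PER_TASK / TARGET_JOBS_PER_DAY
    while submitted < TARGET_JOBS_PER_DAY:
        submitted += submit_task(random.choice(DATASETS), JOBS_PER_TASK)
        time.sleep(pause)
    print("target of %d jobs reached" % TARGET_JOBS_PER_DAY)

# run_robot()  # commented out: a full day's submission loop
\end{verbatim}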

Though user-driven analysis accounted for only about 10\% of the total
job submissions, it was successful, with CRAB functioning well on
both EGEE and OSG sites. This document highlights some of the types
of analysis that were successfully completed. Nevertheless, there are
a number of lessons. The first is that CMS needs to improve the user
support model. Currently, user support is provided by a mailing list
in a community-support model that works well for the size of the
community currently being supported. It is not clear that this
informal support will scale to the larger collaboration, and it is
possible for requests to fall through the cracks. CMS should look at
hybrid support models that assign and track tickets while ensuring
that a large enough community of people sees the support requests to
maintain a broad base of supporters.

\subsection{Offline Database and Frontier}

The offline database infrastructure was successful in the challenge,
though the initial attempt in the Tier-0 workflow identified scaling
limitations in the CMS web cache configuration for Frontier and
stability issues in the application code. Both of these were promptly
addressed, but they underscore the need for validation and scale
testing.

The other offline database lesson is related to the way CMS stores
calibration constants in the database and the frequency with which
they are invalidated. Currently CMS stores the calibration
information as a large number of small objects, which are retrieved
as independent queries by the offline database caches and are
invalidated daily. The first application of the day can expect to
spend almost an hour updating the database information in the offline
cache, which is not reasonable in the long run.
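
To make the scale of the problem concrete, a back-of-the-envelope
estimate (with assumed, not measured, numbers) shows how a large
number of small, independently cached objects translates into a long
refresh for the first job of the day once the cache entries expire:

\begin{verbatim}
# Back-of-the-envelope estimate of the daily cache-refresh penalty.
# The object count and per-query latency are assumptions for
# illustration, not measured CSA06 values.

n_objects = 30000            # calibration constants stored as small objects
latency_per_query = 0.1      # seconds per independent query to refill the cache

refresh_time = n_objects * latency_per_query
print("first job of the day waits ~%.0f minutes" % (refresh_time / 60.0))
# ~50 minutes -- of the same order as the "almost an hour" observed.
# Grouping the constants into fewer, larger objects (or staggering the
# expiry) reduces the number of queries and hence the wait.
\end{verbatim}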

\subsection{Data Management}

The general lesson from data management is that CMS needs to ensure
that all the data management components have a consistent picture of
the data. The synchronization of the various views needs to be better
automated. CMS has data management information in the dataset
bookkeeping system (DBS), the data transfer system (PhEDEx), and the
dataset location service. During the challenge these components were
able to fall out of sync, and maintaining consistency currently
involves some manual operations.
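
A minimal sketch of the kind of automated consistency check that could
replace the manual operations: compare the set of file (or block)
names known to each component and report the differences. The
in-memory inventories below are placeholders for the real DBS, PhEDEx
and site storage queries:

\begin{verbatim}
# Hypothetical consistency check between data management views.
# The inventories are placeholders for real DBS / PhEDEx / site queries.

dbs_files    = {"fileA.root", "fileB.root", "fileC.root"}   # bookkeeping view
phedex_files = {"fileA.root", "fileB.root"}                 # transfer-system view
site_files   = {"fileA.root", "fileB.root", "fileC.root",
                "fileD.root"}                               # storage at a site

def report(name, expected, found):
    missing = expected - found
    orphans = found - expected
    if missing:
        print("%s: missing %s" % (name, sorted(missing)))
    if orphans:
        print("%s: orphan %s" % (name, sorted(orphans)))

report("PhEDEx vs DBS", dbs_files, phedex_files)  # fileC.root not yet transferred
report("site vs DBS",   dbs_files, site_files)    # fileD.root unknown to bookkeeping
\end{verbatim}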

A specific element that was identified in CSA06 was the need to
examine the DBS performance in the presence of output merging. The
initial performance estimates did not include this use case, which
introduces a heavy load on the DBS. For workflows with many output
streams, the bookkeeping system limited the rate at which data
selections could be prepared. The performance limitation is being
addressed in the next generation of the DBS.

Data publication and the trivial file catalog resolution of the
logical to physical file names both worked well. The trivial file
catalog scaled well and applications were able to consistently
discover data file locations with minimal additional services required
at the sites.
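
The trivial file catalog achieves this by applying simple, site-local
rewrite rules to turn a logical file name (LFN) into a physical file
name (PFN), so no central lookup service is needed. A schematic
sketch of this rule-based mapping follows; the rule and the storage
paths are invented for illustration and do not correspond to any real
site configuration:

\begin{verbatim}
# Schematic sketch of trivial-file-catalog style LFN -> PFN resolution.
# The rewrite rules and storage paths are invented for illustration.
import re

# Ordered list of (pattern, replacement) rules for a hypothetical site.
RULES = [
    (r"^/store/(.*)", r"srm://se.example.site/cms/store/\1"),
]

def lfn_to_pfn(lfn):
    for pattern, replacement in RULES:
        if re.match(pattern, lfn):
            return re.sub(pattern, replacement, lfn)
    raise ValueError("no rule matches %s" % lfn)

print(lfn_to_pfn("/store/CSA06/ExampleSample/reco_001.root"))
# -> srm://se.example.site/cms/store/CSA06/ExampleSample/reco_001.root
\end{verbatim}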

\subsection{Tier-0}

The Tier-0 workflow and dataflow management tools performed better than
required for CSA06, showing no significant problems throughout the
challenge. The flexibility of the message-based architecture allowed
adaptation of the running system to the changing operational conditions
as the challenge progressed, without any interruption of service.

No inherent scaling problems were found, and key Tier-0 components
(hardware, software and people) were far from being stressed during the
challenge. The system achieved the low-latency response required for
real data-taking.

Most of the full range of complexity of the final system was explored
during the challenge. Other aspects had already been explored with the
``July prototype''. The design of the Tier-0 can therefore be deemed
validated.

Operationally, the Tier-0 can already be installed, configured, and run
by non-experts. The Tier-0 internal goals of exploring operations
during CSA06 have therefore also been met.

\subsection{Tier-1}

While 6 of the 7 Tier-1 centers met the complete goals for full
participation in the challenge, with successful transfers on 90\% of
the days, several Tier-1 centers had problems importing and exporting
data simultaneously: they experienced either unstable data export or
limited performance. The majority of Tier-1 sites demonstrated
successful migration of data to tape, but there is substantial work
left to demonstrate that CMS can write the full data rate to tape at
the Tier-1 centers and serve the data to all Tier-2 centers when
requested.

A specific technical item identified was that the FTS timeouts were
too tight for sites with low access bandwidth and high latency once
CMS moved to files larger than 4GB. In the final experiment the raw
data files should be between 5GB and 10GB, so CMS will need to
revisit the transfer timeouts.
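
A simple bandwidth estimate (with assumed numbers, not CSA06
measurements) illustrates why fixed timeouts tuned for smaller files
break down as file sizes grow, and how a timeout could instead be
scaled with the file size:

\begin{verbatim}
# Illustrative timeout estimate for large-file transfers.
# Bandwidth and safety margin are assumptions, not CSA06 measurements.

def minimum_timeout(file_size_gb, bandwidth_mb_s, safety_factor=2.0):
    """Seconds a single transfer may legitimately take on a slow link."""
    transfer_time = file_size_gb * 1024.0 / bandwidth_mb_s
    return safety_factor * transfer_time

for size_gb in (4, 10):
    t = minimum_timeout(size_gb, bandwidth_mb_s=20.0)
    print("%2d GB file over a 20 MB/s share: allow ~%4.0f s" % (size_gb, t))
# A timeout adequate for 4 GB files (~7 min here) is clearly too tight
# once raw data files reach 5-10 GB on a low-bandwidth, high-latency path.
\end{verbatim}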

\subsection{Tier-2}

The number of Tier-2 centers participating in the challenge was larger
than the original goal, but a lot of effort was needed to make Tier-2
transfers work. Some sites accepted data only from particular sites.
Early in the challenge, PhEDEx dynamic routing led to unpredictable
Tier-1-to-Tier-2 paths through intermediate Tier-1s. The PhEDEx
operations team therefore modified the path cost metrics in PhEDEx to
avoid multi-hop transfers and make the routes more static and
prescribed, which makes the transfers look more like the baseline
computing model.
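
As a schematic illustration of the change, route selection in a
cost-based system picks the path with the lowest total cost, so
penalising hops through intermediate Tier-1s makes the direct
Tier-1-to-Tier-2 link win. The topology, site names and cost values
below are invented and are not PhEDEx internals:

\begin{verbatim}
# Schematic illustration of cost-based route selection.
# Sites, links and cost values are invented; this is not PhEDEx internals.

# link costs for a tiny topology: a source Tier-1, an intermediate
# Tier-1, and the destination Tier-2
COSTS = {
    ("T1_A", "T2_X"): 10.0,   # direct link
    ("T1_A", "T1_B"): 3.0,
    ("T1_B", "T2_X"): 3.0,    # multi-hop total 6.0 -> preferred by default
}

def best_route(src, dst, hop_penalty=0.0):
    candidates = [
        ([src, dst], COSTS[(src, dst)]),
        ([src, "T1_B", dst],
         COSTS[(src, "T1_B")] + COSTS[("T1_B", dst)] + hop_penalty),
    ]
    return min(candidates, key=lambda route_cost: route_cost[1])

print(best_route("T1_A", "T2_X"))                  # dynamic: via T1_B
print(best_route("T1_A", "T2_X", hop_penalty=20))  # penalised: direct link
\end{verbatim}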

The poor transfer quality seen on the PhEDEx monitoring plots is not
necessarily a Tier-2 site issue. Some Tier-1s could import data from
the Tier-0 better than they could export to the Tier-2s. One item
that was identified is that a large number of transfer requests could
clog the queues and lead to component failures. The FTS system is
designed to throttle transfer requests, but the developers initially
focused on protecting imports rather than exports. CMS is continuing
the discussion with the developers on the architecture and
implementation of throttling in FTS.

One area where the general lesson about operations load was felt most
strongly was data management at the Tier-2 centers. The data stored
at a Tier-2 center is defined by the community it supports, and a
clear need for tools that allow the Tier-2 to control its resident
data was identified during the challenge.