\section{Conclusions and Lessons Learned}

As a complete exercise, CSA06 was extremely successful. The technical
metrics were all met, and some were exceeded by large factors. While
there is still considerable work to do, especially in the integration
with data acquisition and on-line computing, the intended
functionality was demonstrated in the challenge. With this success
come valuable lessons for CMS as we transition to operations and
stable running.

\subsection{General}

There are a number of general lessons CMS can take away from the
challenge. The first is in the area of the transition to operations.
CMS needs development work to ease the operations load. CSA06 was
very successful, but it required a higher level of effort and
attention than could reasonably be expended for an experiment running
for years. As CMS transitions from development, through successful
scale demonstrations, to stable operations, it is expected that
development activities will be identified to reduce the operational
load. Several specific areas were identified during the challenge and
are listed in the sections below. More fine-grained operator control
was identified for several elements, while more automation was
identified for others.

Another general lesson was that the strong engagement with the
Worldwide LHC Computing Grid (WLCG) and with the computing sites
themselves was extremely useful. A few problems were encountered with
grid services, and they were addressed very promptly. Sites generally
responded to problems and solved them. An item related to the lesson
about operations is the communication of problems that needed to be
addressed. Sites and services were generally repaired promptly once
problems were identified, but frequently it was attentive operators,
and not automated systems, that saw the problem first.

Scale testing continues to be an extremely important activity.
Initial scaling issues were identified and solved in several CSA06
components. Most of the problems identified were related to
components that were relatively new, or that were used at a new scale
without being thoroughly tested. CMS was fortunate that all of the
scaling problems seen in CSA06 were straightforward to solve and were
fixed promptly. CMS needs to achieve nearly a factor of four in scale
for some of the components before high-energy running in 2008, and the
sooner scaling issues are identified the more time will be available
to solve them.

\subsection{Offline Software}
There were three important lessons from the offline software experience
of CSA06. The first is that the software, in the configuration used,
was able to sustain greater than a 25\% load for the prompt
reconstruction activity. The software error rate, performance, and
memory footprint were well within expectations, contributing to smooth
running of the Tier-0 farm. The performance was somewhat faster than
the time budget allowed, but several of the slower and less stable
reconstruction algorithms were intentionally left out of the prompt
reconstruction workflow. As the reconstruction software evolves, the
performance and stability should be watched.

The second lesson was that the ability to promptly create and
distribute a new software release was invaluable. During CSA06
operations CMS released four versions of the software to address
issues that were encountered and to add functionality. The ability to
make a release promptly, and to install it equally promptly at the
remote sites, was extremely helpful in meeting challenge goals.

The last lesson is that CMS needs a more formal validation process and
checklist for the application before a release is tagged. A problem
in the re-reconstruction application was identified in the final two
weeks of the challenge. With a more rigorous validation procedure the
problem might have been seen in the opening days of the challenge,
giving more time to solve it. While it is impossible to test for all
possible conditions, an agreed-upon checklist of validation tests
would be useful.

\subsection{Production and Grid Tools}

\subsubsection{Organized Processing}
There are a number of lessons to take away from the experience with
production and grid tools. The first is that the Prod\_Agent worked
well in the pre-challenge production. CMS was able to meet the very
ambitious goal of 25M events per month of simulated event
production. The one-pass production chain contributed to the high
efficiency of the production application during July and August. The
production was performed by four teams, which represents a decrease in
operations effort over previous production exercises. CMS will need
to maintain this efficiency and flexibility as the simulation
becomes more complicated for the physics validation.

The Prod\_Agent infrastructure also worked well for accessing existing
data and applying the user selections. The teams operating the agents
were able to apply multiple selections simultaneously. The merging
and data registration components worked well and could be reused from
the simulated event production workflow.

Even though these exercises were successful, there is clearly room for
improvement. CMS needs to continue to improve the automation of the
workflows for re-reconstruction, selection, and skimming of events.
The chain from request, to validation, to scheduling, to large-scale
execution has components that involve people. The human interactions
can be reduced and more automated workflows can be implemented. In
the CSA06 workflows the work assignments and output destinations were
conveyed by e-mail. During CSA06 the production teams were
responsible for combining the skims, testing the configurations, and
executing the skims.

In addition to improving the automation for schedulable items like
skimming, we also need to improve the transparency. In general, users
and groups need a consistent entry point to see the status of their
requests and the location of the output. The request system, the data
transfer system, and the dataset bookkeeping system need to be tied
together to provide consistent end-to-end user views.

The re-reconstruction activity was, by design, a demonstration of
functionality and not a demonstration of the final production
workflow. There is work left to do to ensure that every event is
re-reconstructed and that processing failures are tracked and
addressed.

\subsubsection{User Analysis Workflows}

The largest success for the analysis workflows was the demonstration
that the gLite and Condor-G job submission systems can achieve the
goal of 50k jobs per day. Integration and scale testing continue to
be very important. CMS integrated CRAB with the gLite bulk submission
only shortly before the challenge began. There had been testing of
the underlying infrastructure through the CMS WLCG Integration task
force, but no scale testing with the CMS submission system. The two
problems encountered in achieving scale were both related to the CMS
implementation and not to the underlying infrastructure, and both
issues were promptly addressed by the CMS developers. As the number
of people participating and the number of jobs increase, the
importance of scale testing will only grow.

In order to reach the target submission rate, CMS needed to make heavy
use of load-generating ``job robots''. While the robots generate
workflows that closely resemble user analysis jobs, they are not a
substitute for an active user community for testing. For the next
series of challenges CMS should ensure that a larger number of
individuals perform analysis.
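
The aggregate pacing behind the robots is modest: 50k jobs per day
corresponds to roughly 0.6 submissions per second on average. The
sketch below is a minimal illustration of how a rate-paced load
generator of this kind can be structured; the \texttt{submit\_workflow}
callable and the dataset names are hypothetical placeholders, not part
of CRAB or of the actual job robot code.

\begin{verbatim}
import random
import time

# Hypothetical stand-in for a real submission call (e.g. one wrapping
# CRAB); not an actual CRAB or job-robot API.
def submit_workflow(dataset, n_events):
    print(f"submitting {n_events} events from {dataset}")

TARGET_JOBS_PER_DAY = 50_000
INTERVAL = 86_400.0 / TARGET_JOBS_PER_DAY   # ~1.7 s between submissions

# Illustrative dataset names only.
DATASETS = ["/CSA06-minbias/RECO", "/CSA06-ttbar/RECO", "/CSA06-zmumu/RECO"]

def run_robot(duration_s=3600):
    """Submit analysis-like jobs at a steady rate for duration_s seconds."""
    deadline = time.time() + duration_s
    while time.time() < deadline:
        started = time.time()
        submit_workflow(random.choice(DATASETS), n_events=1000)
        # Sleep off the remainder of the pacing interval, if any.
        time.sleep(max(0.0, INTERVAL - (time.time() - started)))

if __name__ == "__main__":
    run_robot(duration_s=60)
\end{verbatim}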

Though they comprised only about 10\% of the total job submissions, the
user-driven analyses in the challenge were successful, with CRAB
functioning well on both EGEE and OSG sites. This document highlights
some of the types of analysis that were successfully completed.
Nevertheless, there are a number of lessons. The first is that CMS
needs to improve the user support model. Currently user support is
provided by a mailing list in a community support model that works
well for the size of the community currently being supported. It is
not clear whether this informal support will scale to the larger
collaboration, and it is possible for requests to fall through the
cracks. CMS should look at hybrid support models that assign and
track tickets while ensuring that a large enough community of people
sees the support requests to continue to provide a broad base of
supporters.

\subsection{Offline Database and Frontier}

The offline database infrastructure was successful in the challenge.
The calibration data could be distributed to remote locations from a
single database instance at CERN using the Frontier infrastructure.
The initial attempt in the Tier-0 workflow identified scaling
limitations in the CMS web cache configuration for Frontier and
stability issues in the application code. Both of these were promptly
addressed, but they underscore the need for validation and scale
testing.

The other offline database lesson is related to the way CMS stores
calibration constants in the database and to the frequency with which
they are invalidated. Currently CMS stores the calibration
information as a large number of small objects, which are treated as
independent queries by the offline database and which are invalidated
daily. The first application of the day can expect to spend almost an
hour updating the database information in the offline cache, which is
not reasonable in the long term.
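
The scale of the problem follows from simple arithmetic: if each small
object is an independent query, the first cold-cache access of the day
pays one round trip per object. The sketch below is a rough model with
assumed numbers (the object count and per-query latency are
illustrative, not CSA06 measurements); it shows why grouping many small
constants into fewer, larger payloads would change the refresh time
from hours to seconds.

\begin{verbatim}
# Back-of-the-envelope model of the daily cache refresh cost.  The
# object count and per-query latency are illustrative assumptions,
# not measured CSA06 numbers.

N_OBJECTS = 50_000          # calibration payloads stored as separate objects
LATENCY_S = 0.07            # assumed round trip per uncached Frontier query

refresh_separate = N_OBJECTS * LATENCY_S           # one query per object
refresh_grouped = (N_OBJECTS / 1000) * LATENCY_S   # objects grouped 1000:1

print(f"separate objects: {refresh_separate / 3600:.1f} hours")  # ~1.0 hours
print(f"grouped payloads: {refresh_grouped:.1f} seconds")        # ~3.5 seconds
\end{verbatim}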

\subsection{Data Management}

The CMS data management solution, relying on central components for
data bookkeeping, data location, and data transfer management, and on
site-local components for file name resolution, worked well and
reduced the effort required by the site operators. The changes in the
CMS event data model significantly simplified access to the data by
analysis applications.

The general lesson from data management is that CMS needs to ensure
that all the data management components have a consistent picture of
the data, and the synchronization of the various views needs to be
better automated. CMS keeps data management information in the dataset
bookkeeping system (DBS), the data transfer system (PhEDEx), and the
dataset location service, and these components were able to fall out
of sync with one another. Maintaining consistency currently involves
some manual operations.
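
One direction for the automation mentioned above is a periodic
reconciliation pass that compares the view held by each service. The
sketch below is a minimal illustration of that idea; the three
\texttt{list\_blocks\_*} functions and the block names are hypothetical
stand-ins for queries against DBS, PhEDEx, and the location service,
not real client APIs.

\begin{verbatim}
# Minimal reconciliation sketch: compare the set of data blocks each
# service believes is resident at a site.  The query functions below
# are hypothetical placeholders returning canned, invented block names.

def list_blocks_dbs(site):
    return {"/CSA06-minbias/RECO#1", "/CSA06-minbias/RECO#2"}

def list_blocks_phedex(site):
    return {"/CSA06-minbias/RECO#1", "/CSA06-minbias/RECO#2",
            "/CSA06-ttbar/RECO#1"}

def list_blocks_location_service(site):
    return {"/CSA06-minbias/RECO#1"}

def reconcile(site):
    views = {
        "DBS": list_blocks_dbs(site),
        "PhEDEx": list_blocks_phedex(site),
        "DLS": list_blocks_location_service(site),
    }
    everywhere = set.intersection(*views.values())
    for name, blocks in views.items():
        extra = blocks - everywhere
        if extra:
            # Flag blocks known to this service but not to all others.
            print(f"{site}: {name} has blocks not seen everywhere:",
                  sorted(extra))

reconcile("T1_Example")
\end{verbatim}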

A specific element that was identified in CSA06 was the need to
examine the DBS performance in the presence of output merging. The
initial performance estimates did not include this use-case, which
introduces a heavy load on the DBS. For many output streams, the
performance of the bookkeeping system limited the rate at which data
selections could be prepared. The performance limitation is being
addressed in the next generation of the DBS.

Data publication and the trivial file catalog resolution of logical to
physical file names both worked well. The trivial file catalog scaled
well, and applications were able to consistently discover data file
locations with minimal additional services required at the sites.
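
The trivial file catalog is essentially a small set of per-site rewrite
rules applied to the logical file name, so no catalog lookup service is
needed at run time. The sketch below illustrates the idea with invented
rules and storage paths; it is not the actual CMS site configuration
syntax.

\begin{verbatim}
import re

# Illustrative rewrite rules in the spirit of a trivial file catalog:
# match a logical file name (LFN) pattern and rewrite it to a physical
# file name (PFN).  The patterns and prefixes are invented examples,
# not a real site's configuration.
LFN_TO_PFN_RULES = [
    (r"^/store/(.*)",
     r"dcap://dcache.example.org:22125/pnfs/example.org/cms/store/\1"),
    (r"^(.*)", r"file:/data/cms\1"),   # fallback rule
]

def lfn_to_pfn(lfn):
    """Apply the first matching rule to turn an LFN into a PFN."""
    for pattern, replacement in LFN_TO_PFN_RULES:
        if re.match(pattern, lfn):
            return re.sub(pattern, replacement, lfn)
    raise ValueError(f"no rule matches {lfn}")

print(lfn_to_pfn("/store/CSA06/minbias/RECO/file001.root"))
\end{verbatim}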

\subsection{Workflow Management}

Workflow management components both at CERN and at the remote centers
were able to achieve the required level of activity in the challenge.
There is some overlap in the implementation of the Tier-0 workflow and
the Prod\_Agent workflow used at the Tier-1 and Tier-2 centers, which
should be re-examined after the challenge with an eye toward long-term
maintainability and support.

\subsection{Central Services}

Central services and facilities provided at CERN by IT and the WLCG,
including the batch resources and FTS, were carefully monitored and
problems were solved. CASTOR support at CERN was excellent. As an
export system, CASTOR2 performed at a higher rate and more stably than
in past CMS exercises. CMS ran into an issue with the SRM release in
DPM for files greater than 2~GB, which was solved the next day.

\subsection{Tier-0}

The Tier-0 workflow and dataflow management tools performed better than
required for CSA06, showing no significant problems throughout the
challenge. The flexibility of the message-based architecture allowed
the running system to be adapted to changing operational conditions as
the challenge progressed, without any interruption of service.

No inherent scaling problems were found, and the key Tier-0 components
(hardware, software, and people) were far from being stressed during
the challenge. The system achieved the low-latency response required
for real data-taking.

Most of the full range of complexity of the final system was explored
during the challenge; other aspects had already been explored with the
``July prototype''. The design of the Tier-0 can therefore be deemed
validated.

Operationally, the Tier-0 can already be installed, configured, and run
by non-experts. The internal Tier-0 goal of exploring operations during
CSA06 has therefore also been met.

\subsection{Tier-1}

Six of the seven Tier-1 centers met the complete goals for full
participation in the challenge, with successful transfers on 90\% of
the days. The transfer quality, defined in CMS as the number of times
a transfer had to be attempted before succeeding, was significantly
improved for CERN to Tier-1 transfers during the challenge as compared
to previous service challenge exercises. There are several elements to
improve in the final year of experiment preparation. Several Tier-1
centers had problems importing and exporting data simultaneously:
these centers either experienced unstable data export or limited
performance. The majority of Tier-1 sites demonstrated successful
migration of data to tape, but there is substantial work left to
demonstrate that CMS can write the full data rate to tape at the
Tier-1 centers and serve the data to all Tier-2 centers when requested.

A specific technical item was identified: the FTS timeouts were too
tight for sites with low access bandwidth and high latency once CMS
moved to files larger than 4~GB. In the final experiment the raw data
files should be between 5~GB and 10~GB, so CMS will need to revisit
the transfer timeouts again.
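
A simple estimate shows why fixed timeouts interact badly with large
files; the bandwidth figures below are illustrative assumptions, not
CSA06 measurements:
\[
  t_{\mathrm{transfer}} \simeq
  \frac{\mbox{file size}}{\mbox{effective bandwidth}}, \qquad
  \frac{10\,\mathrm{GB}}{50\,\mathrm{MB/s}} \simeq 200\,\mathrm{s}, \qquad
  \frac{10\,\mathrm{GB}}{5\,\mathrm{MB/s}} \simeq 2000\,\mathrm{s}.
\]
A per-file timeout that is adequate for smaller files on well-connected
sites can therefore expire before a 5--10~GB transfer to a site with
low access bandwidth and high latency has a chance to complete.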

\subsection{Tier-2}

The number of Tier-2 centers participating in the challenge was larger
than the original goal, and a broader variety of activities was
successfully performed by the Tier-2 centers. An item to improve is
the amount of effort required to make Tier-2 transfers work. Some
sites accepted data only from particular sites. Early in the
challenge, PhEDEx dynamic routing led to unpredictable Tier-1-to-Tier-2
paths through intermediate Tier-1 centers. The PhEDEx operations team
therefore modified the path cost metrics in PhEDEx to avoid multi-hop
transfers and make the routes more static and prescribed, which makes
the transfers look more like the baseline computing model.
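
The change can be pictured as a shortest-path choice over per-link
costs: raising the cost of Tier-1-to-Tier-1 hops makes the direct
Tier-1-to-Tier-2 link the cheapest route. The sketch below is a
simplified illustration of that idea with invented site names and
costs; it is not PhEDEx code or its actual cost function.

\begin{verbatim}
import heapq

def cheapest_path(links, src, dst):
    """Dijkstra over per-link costs; returns (total_cost, path)."""
    queue = [(0.0, src, [src])]
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nxt, link_cost in links.get(node, []):
            if nxt not in seen:
                heapq.heappush(queue, (cost + link_cost, nxt, path + [nxt]))
    return float("inf"), []

# Invented topology: with comparable costs, the cheapest route to the
# Tier-2 hops through a second Tier-1.
links = {
    "T1_A": [("T1_B", 1.0), ("T2_X", 3.0)],
    "T1_B": [("T2_X", 1.0)],
}
print(cheapest_path(links, "T1_A", "T2_X"))  # (2.0, ['T1_A', 'T1_B', 'T2_X'])

# Penalizing inter-Tier-1 hops makes the direct, prescribed route win.
links["T1_A"] = [("T1_B", 10.0), ("T2_X", 3.0)]
print(cheapest_path(links, "T1_A", "T2_X"))  # (3.0, ['T1_A', 'T2_X'])
\end{verbatim}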

The poor transfer quality seen on the PhEDEx monitoring plots is not
necessarily a Tier-2 site issue. Some Tier-1 centers were better at
importing data from the Tier-0 than at exporting to the Tier-2
centers. One item that was identified is that a large number of
transfer requests could clog the queues and lead to component
failures. The FTS system is designed to throttle transfer requests,
but the developers initially focused on protecting import rather than
export. CMS is continuing the discussion with the developers on the
architecture and implementation of throttling in FTS.
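
Throttling in this context means capping how many transfers a service
will run concurrently in each direction, so that a burst of requests
queues up rather than overwhelming a storage element. The sketch below
illustrates a per-direction concurrency cap; it is an assumption-level
illustration of the concept, not the FTS implementation or its
configuration.

\begin{verbatim}
from collections import deque

class DirectionThrottle:
    """Cap concurrent transfers separately for import and export."""

    def __init__(self, max_import=10, max_export=10):
        self.limits = {"import": max_import, "export": max_export}
        self.active = {"import": 0, "export": 0}
        self.queues = {"import": deque(), "export": deque()}

    def request(self, direction, transfer_id):
        # Start immediately if below the cap, otherwise queue the request.
        if self.active[direction] < self.limits[direction]:
            self.active[direction] += 1
            return f"started {transfer_id} ({direction})"
        self.queues[direction].append(transfer_id)
        return f"queued {transfer_id} ({direction})"

    def finish(self, direction):
        # Free a slot and promote the next queued transfer, if any.
        self.active[direction] -= 1
        if self.queues[direction]:
            nxt = self.queues[direction].popleft()
            self.active[direction] += 1
            return f"started {nxt} ({direction})"
        return None

throttle = DirectionThrottle(max_import=2, max_export=1)
print(throttle.request("export", "T1->T2 block 1"))   # started
print(throttle.request("export", "T1->T2 block 2"))   # queued
print(throttle.finish("export"))                      # starts block 2
\end{verbatim}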

One area where the general lesson about operations load was felt most
strongly was data management at the Tier-2 centers. The data stored at
a Tier-2 center is defined by the community it supports, and a clear
need for tools that allow the Tier-2 operators to control the resident
data was identified during the challenge.