\section{Lessons Learned}

As a complete exercise, CSA06 was extremely successful. The technical
metrics were all met and some were exceeded by large factors. While
there is still considerable work to do, especially in the integration
with data acquisition and on-line computing, the intended
functionality was demonstrated in the challenge. Along with the
success come valuable lessons for CMS as we transition to operations
and stable running.

\subsection{General}

There are a number of general lessons CMS can take away from the
challenge. The first is in the area of the transition to operations:
CMS needs development work to ease the operations load. CSA06 was
very successful, but it required a higher level of effort and
attention than could reasonably be expended for an experiment running
for years. As CMS moves from development through successful scale
demonstrations to stable operations, it is to be expected that
development activities will be identified to reduce the operational
load. During the challenge several specific areas were identified,
and they are listed in the sections below. For some elements more
fine-grained operator control is needed, while for others more
automation is needed.

Another general lesson was that the strong engagement with the
Worldwide LHC Computing Grid (WLCG) and the computing sites themselves
was extremely useful. The few problems encountered with grid services
were addressed very promptly, and sites generally responded to
problems and solved them. Related to the lesson about operations is
the communication of problems that need to be addressed. Sites and
services were generally repaired promptly once problems were
identified, but frequently it was attentive operators, not automated
systems, that saw the problem first.

Scale testing continues to be an extremely important activity.
Initial scaling issues were identified and solved in several CSA06
components. Most of the problems identified were related to
components that were relatively new, or that were used at a new scale
without being thoroughly tested. CMS was fortunate that all of the
scaling problems seen in CSA06 were straightforward to solve and were
fixed promptly. CMS needs to achieve nearly a factor of four in scale
for some of the components before high-energy running in 2008, and the
sooner scaling issues are identified the more time will be available
to solve them.

\subsection{Offline Software}
There were three important lessons from the offline software experience
of CSA06. The first is that the software, in the configuration used,
was able to sustain greater than a 25\% load for the prompt
reconstruction activity. The software error rate, performance and
memory footprint were well within expectations, contributing to smooth
running of the Tier-0 farm. The reconstruction was somewhat faster than
the time budget allows, but several of the slower and less stable
reconstruction algorithms were intentionally left out of the prompt
reconstruction workflow. As the reconstruction software evolves, the
performance and stability should continue to be watched.

The second lesson was that the ability to promptly create and
distribute a new software release was invaluable. During CSA06
operations, CMS released four versions of the software to address
issues that were encountered and to add functionality. The ability to
make a release promptly, and to install it equally promptly at the
remote sites, was extremely helpful in meeting the challenge goals.

The last lesson is that CMS needs a more formal validation process and
checklist for the application before a release is tagged. A problem
in the re-reconstruction application was identified in the final two
weeks of the challenge. With a more rigorous validation procedure the
problem might have been seen in the opening days of the challenge,
giving more time to solve it. While it is impossible to test for all
possible conditions, an agreed list of validation checks would be
useful.
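
As an illustration only, such a checklist could be driven by a small
script that runs each check and refuses to tag the release if any of
them fails. The individual checks below (reference-sample
reconstruction, memory budget, re-reconstruction workflow) and the
release label are hypothetical placeholders, not the actual CMS
validation suite:

\begin{verbatim}
# Hypothetical sketch of a pre-release validation checklist runner.
# The individual checks are placeholders, not the real CMS procedures.

def check_reco_reference_sample():
    """Run the reconstruction on a small reference sample (placeholder)."""
    return True   # would return False if the job crashes or output is missing

def check_memory_footprint():
    """Compare the measured memory footprint against the budget (placeholder)."""
    measured_mb, budget_mb = 950, 1000   # illustrative numbers
    return measured_mb <= budget_mb

def check_rereco_workflow():
    """Exercise the re-reconstruction configuration end to end (placeholder)."""
    return True

CHECKLIST = [
    ("prompt reconstruction on reference sample", check_reco_reference_sample),
    ("memory footprint within budget",            check_memory_footprint),
    ("re-reconstruction workflow",                check_rereco_workflow),
]

def validate_release(version):
    failures = [name for name, check in CHECKLIST if not check()]
    if failures:
        print("do NOT tag %s, failed checks: %s" % (version, ", ".join(failures)))
        return False
    print("all checks passed, %s can be tagged" % version)
    return True

validate_release("CMSSW_release_candidate")
\end{verbatim}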

\subsection{Production and Grid Tools}

\subsubsection{Organized Processing}
There are a number of lessons to take away from the experience with
the production and grid tools. The first is that the Prod\_Agent worked
well in the pre-challenge production. CMS was able to meet the very
ambitious goal of 25M simulated events per month. The one-pass
production chain contributed to the high efficiency of the production
application during July and August. The production was performed by
four teams, which represents a decrease in operations effort compared
to previous production exercises. CMS will need to maintain this
efficiency and flexibility as the simulation becomes more complicated
for the physics validation.

The Prod\_Agent infrastructure also worked well for accessing existing
data and applying the user selections. The teams operating the agents
were able to apply multiple selections simultaneously. The merging
and data registration components worked well and could be reused from
the simulated event production workflow.

Even though the exercises were successful, there is clearly room for
improvement. CMS needs to continue to improve the automation of the
workflow for re-reconstruction, selection and skimming of events. The
chain from request, to validation, to scheduling, to large-scale
execution has steps that involve people. The human interactions can
be reduced and more automated workflows can be implemented. In the
CSA06 workflows the work assignments and output destinations were
conveyed by e-mail. During CSA06 the production teams were
responsible for combining the skims, testing the configurations, and
executing the skims.
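
A minimal sketch of what a more automated hand-off could look like is
given below: a request record that moves through explicit states
instead of being conveyed by e-mail. The states, fields, dataset and
site names are assumptions for illustration, not the interface of
Prod\_Agent or of the actual CMS request system:

\begin{verbatim}
# Hypothetical sketch: tracking a skim/re-reconstruction request through
# explicit states rather than e-mail.  States and fields are illustrative.

STATES = ["requested", "validated", "scheduled", "running", "done", "failed"]

class Request:
    def __init__(self, dataset, selection, destination):
        self.dataset = dataset          # input dataset path
        self.selection = selection      # skim/selection configuration name
        self.destination = destination  # site where output should be placed
        self.state = "requested"
        self.history = [self.state]

    def advance(self, new_state):
        if new_state not in STATES:
            raise ValueError("unknown state: %s" % new_state)
        self.state = new_state
        self.history.append(new_state)

# Example: a skim request that an operator (or an automated agent)
# moves from validation to execution without any e-mail exchange.
req = Request("/CSA06-ExampleSample/RECO", "dimuon-skim", "T2_Example_Site")
for step in ["validated", "scheduled", "running", "done"]:
    req.advance(step)
print(req.dataset, "->", req.destination, ":", " -> ".join(req.history))
\end{verbatim}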

In addition to improving the automation for schedulable items like
skimming, we also need to improve the transparency. In general, users
and groups need a consistent entry point to see the status of their
requests and the location of the output. The request system, the data
transfer system, and the dataset bookkeeping system need to be tied
together to give consistent end-to-end user views.

The re-reconstruction activity was, by design, a demonstration of
functionality and not a demonstration of the final production
workflow. There is work left to do to ensure that every event is
re-reconstructed and that processing failures are tracked and
addressed.

\subsubsection{User Analysis Workflows}

The largest success for the analysis workflows was the demonstration
that the gLite and Condor-G job submission systems can achieve the
goal of 50k jobs per day. Integration and scale testing continue to
be very important. CMS integrated CRAB with the gLite bulk submission
only shortly before the challenge began. There had been testing of
the underlying infrastructure through the CMS WLCG Integration task
force, but no scale testing with the CMS submission system. The two
problems encountered in reaching the target scale were both in the
CMS implementation rather than in the underlying infrastructure, and
both were promptly addressed by the CMS developers. As the number of
people participating and the number of jobs increase, the importance
of scale testing will only grow.

In order to reach the target submission rate, CMS needed to make heavy
use of load-generating ``job robots''. While the robots generate
workflows that closely resemble user analysis jobs, they are not a
substitute for an active user community in testing. For the next
series of challenges, CMS should ensure that a larger number of
individuals perform analysis.
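
For illustration, the essence of such a job robot is a loop that
repeatedly builds analysis-like tasks over a set of published datasets
and submits them at a controlled rate. The sketch below is
hypothetical: \texttt{submit\_task} stands in for the real submission
command (e.g.\ a CRAB invocation), and the dataset names, task size
and pacing are illustrative, not CSA06 values:

\begin{verbatim}
# Hypothetical sketch of a load-generating "job robot".
# submit_task() is a stand-in for the real submission command (e.g. CRAB);
# dataset names, job counts and pacing are illustrative only.
import random
import time

DATASETS = ["/CSA06-Example-A/RECO", "/CSA06-Example-B/RECO"]  # placeholders
JOBS_PER_TASK = 50            # illustrative task size
TARGET_JOBS_PER_DAY = 50000

def submit_task(dataset, njobs):
    """Placeholder: would wrap the actual grid submission for one task."""
    print("submitting %d jobs on %s" % (njobs, dataset))
    return njobs

def run_robot():
    submitted = 0
    # pace submissions so the daily target is spread over 24 hours
    pause = 86400.0 * JOBS_PER_TASK / TARGET_JOBS_PER_DAY
    while submitted < TARGET_JOBS_PER_DAY:
        submitted += submit_task(random.choice(DATASETS), JOBS_PER_TASK)
        time.sleep(pause)
    print("target of %d jobs reached" % TARGET_JOBS_PER_DAY)

# run_robot()  # commented out: a full day's submission loop
\end{verbatim}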

Though user-driven analysis accounted for only about 10\% of the total
job submissions, it was successful, with CRAB functioning well on
both EGEE and OSG sites. This document highlights some of the types
of analysis that were successfully completed. Nevertheless, there are
a number of lessons. The first is that CMS needs to improve the user
support model. Currently, user support is provided by a mailing list
in a community-support model that works well for the size of the
community currently being supported. It is not clear that this
informal support will scale to the larger collaboration, and it is
possible for requests to fall through the cracks. CMS should look at
hybrid support models that assign and track tickets while ensuring
that a large enough community of people sees the support requests to
maintain a broad base of supporters.

\subsection{Offline Database and Frontier}

The offline database infrastructure was successful in the challenge,
though the initial attempt in the Tier-0 workflow identified scaling
limitations in the CMS web cache configuration for Frontier and
stability issues in the application code. Both of these were promptly
addressed, but they underscore the need for validation and scale
testing.

The other offline database lesson is related to the way CMS stores
calibration constants in the database and the frequency with which
they are invalidated. Currently CMS stores the calibration
information as a large number of small objects, which are retrieved
as independent queries by the offline database caches and are
invalidated daily. The first application of the day can expect to
spend almost an hour updating the database information in the offline
cache, which is not reasonable in the long run.
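
To make the scale of the problem concrete, a back-of-the-envelope
estimate (with assumed, not measured, numbers) shows how a large
number of small, independently cached objects translates into a long
refresh for the first job of the day once the cache entries expire:

\begin{verbatim}
# Back-of-the-envelope estimate of the daily cache-refresh penalty.
# The object count and per-query latency are assumptions for
# illustration, not measured CSA06 values.

n_objects = 30000            # calibration constants stored as small objects
latency_per_query = 0.1      # seconds per independent query to refill the cache

refresh_time = n_objects * latency_per_query
print("first job of the day waits ~%.0f minutes" % (refresh_time / 60.0))
# ~50 minutes -- of the same order as the "almost an hour" observed.
# Grouping the constants into fewer, larger objects (or staggering the
# expiry) reduces the number of queries and hence the wait.
\end{verbatim}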

\subsection{Data Management}

The general lesson from data management is that CMS needs to ensure
that all the data management components have a consistent picture of
the data. The synchronization of the various views needs to be better
automated. CMS has data management information in the dataset
bookkeeping system (DBS), the data transfer system (PhEDEx), and the
dataset location service. During the challenge these components were
able to fall out of sync, and maintaining consistency currently
involves some manual operations.
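
A minimal sketch of the kind of automated consistency check that could
replace the manual operations: compare the set of file (or block)
names known to each component and report the differences. The
in-memory inventories below are placeholders for the real DBS, PhEDEx
and site storage queries:

\begin{verbatim}
# Hypothetical consistency check between data management views.
# The inventories are placeholders for real DBS / PhEDEx / site queries.

dbs_files    = {"fileA.root", "fileB.root", "fileC.root"}   # bookkeeping view
phedex_files = {"fileA.root", "fileB.root"}                 # transfer-system view
site_files   = {"fileA.root", "fileB.root", "fileC.root",
                "fileD.root"}                               # storage at a site

def report(name, expected, found):
    missing = expected - found
    orphans = found - expected
    if missing:
        print("%s: missing %s" % (name, sorted(missing)))
    if orphans:
        print("%s: orphan %s" % (name, sorted(orphans)))

report("PhEDEx vs DBS", dbs_files, phedex_files)  # fileC.root not yet transferred
report("site vs DBS",   dbs_files, site_files)    # fileD.root unknown to bookkeeping
\end{verbatim}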

A specific element that was identified in CSA06 was the need to
examine the DBS performance in the presence of output merging. The
initial performance estimates did not include this use case, which
introduces a heavy load on the DBS. For workflows with many output
streams, the bookkeeping system limited the rate at which data
selections could be prepared. The performance limitation is being
addressed in the next generation of the DBS.

Data publication and the trivial file catalog resolution of the
logical to physical file names both worked well. The trivial file
catalog scaled well and applications were able to consistently
discover data file locations with minimal additional services required
at the sites.
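
The trivial file catalog achieves this by applying simple, site-local
rewrite rules to turn a logical file name (LFN) into a physical file
name (PFN), so no central lookup service is needed. A schematic
sketch of this rule-based mapping follows; the rule and the storage
paths are invented for illustration and do not correspond to any real
site configuration:

\begin{verbatim}
# Schematic sketch of trivial-file-catalog style LFN -> PFN resolution.
# The rewrite rules and storage paths are invented for illustration.
import re

# Ordered list of (pattern, replacement) rules for a hypothetical site.
RULES = [
    (r"^/store/(.*)", r"srm://se.example.site/cms/store/\1"),
]

def lfn_to_pfn(lfn):
    for pattern, replacement in RULES:
        if re.match(pattern, lfn):
            return re.sub(pattern, replacement, lfn)
    raise ValueError("no rule matches %s" % lfn)

print(lfn_to_pfn("/store/CSA06/ExampleSample/reco_001.root"))
# -> srm://se.example.site/cms/store/CSA06/ExampleSample/reco_001.root
\end{verbatim}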

\subsection{Tier-0}

The Tier-0 workflow and dataflow management tools performed better than
required for CSA06, showing no significant problems throughout the
challenge. The flexibility of the message-based architecture allowed
adaptation of the running system to the changing operational conditions
as the challenge progressed, without any interruption of service.

No inherent scaling problems were found, and key Tier-0 components
(hardware, software and people) were far from being stressed during the
challenge. The system achieved the low-latency response required for
real data-taking.

Most of the full range of complexity of the final system was explored
during the challenge. Other aspects had already been explored with the
``July prototype''. The design of the Tier-0 can therefore be deemed
validated.

Operationally, the Tier-0 can already be installed, configured, and run
by non-experts. The Tier-0 internal goals of exploring operations
during CSA06 have therefore also been met.

\subsection{Tier-1}

While 6 of the 7 Tier-1 centers met the complete goals for full
participation in the challenge, with successful transfers on 90\% of
the days, several Tier-1 centers had problems importing and exporting
data simultaneously: they experienced either unstable data export or
limited performance. The majority of Tier-1 sites demonstrated
successful migration of data to tape, but there is substantial work
left to demonstrate that CMS can write the full data rate to tape at
the Tier-1 centers and serve the data to all Tier-2 centers when
requested.

A specific technical item identified was that the FTS timeouts were
too tight for sites with low access bandwidth and high latency once
CMS moved to files larger than 4GB. In the final experiment the raw
data files should be between 5GB and 10GB, so CMS will need to
revisit the transfer timeouts.
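
A simple bandwidth estimate (with assumed numbers, not CSA06
measurements) illustrates why fixed timeouts tuned for smaller files
break down as file sizes grow, and how a timeout could instead be
scaled with the file size:

\begin{verbatim}
# Illustrative timeout estimate for large-file transfers.
# Bandwidth and safety margin are assumptions, not CSA06 measurements.

def minimum_timeout(file_size_gb, bandwidth_mb_s, safety_factor=2.0):
    """Seconds a single transfer may legitimately take on a slow link."""
    transfer_time = file_size_gb * 1024.0 / bandwidth_mb_s
    return safety_factor * transfer_time

for size_gb in (4, 10):
    t = minimum_timeout(size_gb, bandwidth_mb_s=20.0)
    print("%2d GB file over a 20 MB/s share: allow ~%4.0f s" % (size_gb, t))
# A timeout adequate for 4 GB files (~7 min here) is clearly too tight
# once raw data files reach 5-10 GB on a low-bandwidth, high-latency path.
\end{verbatim}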

\subsection{Tier-2}

The number of Tier-2 centers participating in the challenge was larger
than the original goal, but a lot of effort was needed to make Tier-2
transfers work. Some sites accepted data only from particular sites.
Early in the challenge, PhEDEx dynamic routing led to unpredictable
Tier-1-to-Tier-2 paths through intermediate Tier-1s. The PhEDEx
operations team therefore modified the path cost metrics in PhEDEx to
avoid multi-hop transfers and make the routes more static and
prescribed, which makes the transfers look more like the baseline
computing model.
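
As a schematic illustration of the change, route selection in a
cost-based system picks the path with the lowest total cost, so
penalising hops through intermediate Tier-1s makes the direct
Tier-1-to-Tier-2 link win. The topology, site names and cost values
below are invented and are not PhEDEx internals:

\begin{verbatim}
# Schematic illustration of cost-based route selection.
# Sites, links and cost values are invented; this is not PhEDEx internals.

# link costs for a tiny topology: a source Tier-1, an intermediate
# Tier-1, and the destination Tier-2
COSTS = {
    ("T1_A", "T2_X"): 10.0,   # direct link
    ("T1_A", "T1_B"): 3.0,
    ("T1_B", "T2_X"): 3.0,    # multi-hop total 6.0 -> preferred by default
}

def best_route(src, dst, hop_penalty=0.0):
    candidates = [
        ([src, dst], COSTS[(src, dst)]),
        ([src, "T1_B", dst],
         COSTS[(src, "T1_B")] + COSTS[("T1_B", dst)] + hop_penalty),
    ]
    return min(candidates, key=lambda route_cost: route_cost[1])

print(best_route("T1_A", "T2_X"))                  # dynamic: via T1_B
print(best_route("T1_A", "T2_X", hop_penalty=20))  # penalised: direct link
\end{verbatim}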

The poor transfer quality seen on the PhEDEx monitoring plots is not
necessarily a Tier-2 site issue. Some Tier-1s could import data from
the Tier-0 better than they could export to the Tier-2s. One item
that was identified is that a large number of transfer requests could
clog the queues and lead to component failures. The FTS system is
designed to throttle transfer requests, but the developers initially
focused on protecting imports rather than exports. CMS is continuing
the discussion with the developers on the architecture and
implementation of throttling in FTS.

One area where the general lesson about operations load was felt most
strongly was data management at the Tier-2 centers. The data stored
at a Tier-2 center is defined by the community it supports, and a
clear need for tools that allow the Tier-2 to control its resident
data was identified during the challenge.