\section{Lessons Learned}

As a complete exercise CSA06 was extremely successful. The technical
metrics were all met and some were exceeded by large factors. While
there is still considerable work to do, especially in the integration
with data acquisition and on-line computing, the intended
functionality was demonstrated in the challenge. Along with the
success come valuable lessons for CMS as we transition to operations
and stable running.

\subsection{General}

There are a number of general lessons CMS can take away from the
challenge. The first is in the area of the transition to operations:
CMS needs development work to ease the operations load. CSA06 was
very successful, but it required a higher level of effort and
attention than could reasonably be expended for an experiment running
for years. As CMS transitions from development, through successful
scale demonstrations, to stable operations, it is to be expected that
development activities will be identified to reduce the operational
load. Several specific areas were identified during the challenge and
are listed in the sections below; for some elements more fine-grained
operator control is needed, while for others more automation is
needed.

Another general lesson was that the strong engagement with the
Worldwide LHC Computing Grid (WLCG) and the computing sites themselves
was extremely useful. The few problems encountered with grid services
were addressed very promptly, and sites generally responded to
problems and solved them. Related to the lesson about operations is
how problems needing attention were communicated: sites and services
were generally repaired promptly once problems were identified, but
frequently it was attentive operators and not automated systems that
saw the problem first.

Scale testing continues to be an extremely important activity.
Initial scaling issues were identified and solved in several CSA06
components. Most of the problems identified were related to
components that were relatively new, or were used at a new scale
without being thoroughly tested. CMS was lucky that all of the
scaling problems seen in CSA06 were straightforward to solve and were
fixed promptly. CMS needs to achieve nearly a factor of four increase
in scale for some of the components before high energy running in
2008, and the sooner scaling issues are identified the more time will
be available to solve them.

\subsection{Offline Software}
There were three important lessons from the offline software experience
of CSA06. The first is that the software in the configuration used
was able to sustain greater than a 25\% load for the prompt
reconstruction activity. The software error rate, performance, and
memory footprint were well within expectations, contributing to smooth
running of the Tier-0 farm. The performance was somewhat faster than
the time budget allows, but several of the slower and less stable
reconstruction algorithms were intentionally left out of the prompt
reconstruction workflow. As the reconstruction software evolves, the
performance and stability should continue to be watched.

The second lesson was that the ability to promptly create and
distribute a new software release was invaluable. During CSA06
operations CMS released four versions of the software to address
issues that were encountered and to add functionality. The ability to
make a release promptly and to install it equally promptly at the
remote sites was extremely helpful in meeting challenge goals.

The last lesson is that CMS needs a more formal validation process and
checklist for the application before a release is tagged. A problem
in the re-reconstruction application was identified in the final two
weeks of the challenge. With a more rigorous validation procedure the
problem might have been seen in the opening days of the challenge,
giving more time to solve it. While it is impossible to test for all
possible conditions, a well-defined list of validation checks would be
useful.

\subsection{Production and Grid Tools}

\subsubsection{Organized Processing}
There are a number of lessons to take away from the experience with
production and grid tools. The first is that the Prod\_Agent worked
well in the pre-challenge production. CMS was able to meet the very
ambitious goal of 25M events per month of simulated event
production. The one-pass production chain contributed to the high
efficiency of the production application during July and August. The
production was performed by four teams, which is a decrease in
operations effort compared to previous production exercises. CMS will
need to maintain the efficiency and the flexibility as the simulation
becomes more complicated for the physics validation.

The Prod\_Agent infrastructure also worked well for accessing existing
data and applying the user selections. The teams operating the agents
were able to apply multiple selections simultaneously. The merging
and data registration components worked well and could be reused from
the simulated event production workflow.

Even though the exercises were successful there is clearly room for
improvement. CMS needs to continue to improve automation of the
workflow for re-reconstruction, selection, and skimming of events.
The chain from request, to validation, to scheduling, to large-scale
execution has components that involve people. The human interactions
can be reduced and more automated workflows can be implemented. In
the CSA06 workflows the work assignments and output destinations were
conveyed by e-mail, and the production teams were responsible for
combining the skims, testing the configurations, and executing the
skims.
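
One way to remove the e-mail step is to give each processing request
an explicit record with a small set of states that the production
tools can advance automatically. The sketch below is purely
illustrative; the state names, dataset, and destination are
hypothetical and do not correspond to an existing CMS tool:

\begin{verbatim}
from dataclasses import dataclass, field

# Ordered states a processing request passes through; illustrative only.
STATES = ["requested", "validated", "scheduled", "running", "completed"]

@dataclass
class ProcessingRequest:
    dataset: str
    selection: str
    destination: str
    state: str = "requested"
    history: list = field(default_factory=list)

    def advance(self):
        """Move the request to the next state and record the transition."""
        index = STATES.index(self.state)
        if index + 1 < len(STATES):
            self.history.append(self.state)
            self.state = STATES[index + 1]
        return self.state

# Example: a skim request moving from 'requested' to 'validated'.
request = ProcessingRequest("/CSA06/ExampleDataset", "example_skim",
                            "T2_Example")
request.advance()
\end{verbatim}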

In addition to improving the automation for schedulable items like
skimming, we also need to improve the transparency. In general users
and groups need a consistent entry point to see the status of the
requests and the location of the output. The request system, the data
transfer system, and the dataset bookkeeping system need to be tied
together for consistent end-to-end user views.

The re-reconstruction activity was, by design, a demonstration of
functionality and not a demonstration of the final production
workflow. There is work left to do to ensure that every event is
re-reconstructed and that processing failures are tracked and
addressed.


\subsubsection{User Analysis Workflows}

The largest success for the analysis workflows was the demonstration
that the gLite and Condor-G job submission systems can achieve the
goal of 50k jobs per day. Integration and scale testing continue to
be very important. CMS integrated CRAB with the gLite bulk submission
only shortly before the challenge began. There had been testing of
the underlying infrastructure through the CMS WLCG Integration task
force but no scale testing with the CMS submission system. The two
problems in achieving scale were both related to the CMS
implementation and not to the underlying infrastructure, and both
issues were promptly addressed by the CMS developers. As the number
of people participating and the number of jobs increase, the
importance of scale testing will only increase.

In order to reach the target submission rate CMS needed to make heavy
use of load-generating ``job robots''. While the robots generate
workflows that closely resemble user analysis jobs, the robots are not
a substitute for an active user community for testing. For the next
series of challenges CMS should ensure that a larger number of
individuals perform analysis.
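
A job robot in this sense is essentially a loop that keeps submitting
realistic analysis workflows at a steady rate. A minimal sketch of
the idea is shown below; the submission call and dataset names are
placeholders, not the actual CRAB or robot implementation:

\begin{verbatim}
import random
import time

# Placeholder dataset names; a real robot would read these from configuration.
DATASETS = ["/CSA06/ExampleDatasetA", "/CSA06/ExampleDatasetB"]

def submit_analysis_job(dataset):
    """Stand-in for a real submission call (e.g. through CRAB)."""
    print("submitting analysis job on", dataset)

def run_robot(jobs_per_day=50000, duration_s=60):
    """Submit jobs at a steady rate that approximates the daily target."""
    interval = 86400.0 / jobs_per_day   # about 1.7 s between submissions
    deadline = time.time() + duration_s
    while time.time() < deadline:
        submit_analysis_job(random.choice(DATASETS))
        time.sleep(interval)
\end{verbatim}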

Though it accounted for only about 10\% of the total job submissions,
the user-driven analysis in the challenge was successful, with CRAB
functioning well on both EGEE and OSG sites. This document highlights
some of the types of analysis that were successfully completed.
Nevertheless, there are a number of lessons. The first is that CMS
needs to improve the user support model. Currently user support is
provided by a mailing list in a community support model that works
well for the size of the community currently being supported. It is
not clear whether this informal support will scale to the larger
collaboration, and it is possible for requests to fall through the
cracks. CMS should look at hybrid support models that assign and
track tickets while ensuring that a large enough community of people
sees the support requests, to continue to provide a broad base of
supporters.

\subsection{Offline Database and Frontier}

The offline database infrastructure was successful in the challenge.
The calibration data could be distributed to remote locations from a
single database instance at CERN using the Frontier infrastructure.
The initial attempt in the Tier-0 workflow identified scaling
limitations in the CMS web cache configuration for Frontier and
stability issues in the application code. Both of these were promptly
addressed, but they underscore the need for validation and scale
testing.

The other offline database lesson is related to the way CMS stores
calibration constants in the database and the frequency with which
they are invalidated. Currently CMS stores the calibration
information as a large number of small objects, which are treated as
independent queries by the offline database and which are invalidated
daily. The first application of the day can expect to spend almost an
hour updating the database information in the offline cache, which is
not reasonable in the long term.
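
As a rough illustration of the scale of the problem (the object count
and per-query latency below are assumed values for the purpose of the
estimate, not measurements from the challenge), if the calibration
payload is split into roughly $10^{5}$ small objects and each object
is fetched as an independent query against the freshly invalidated
cache with a round-trip time of order 30~ms, the first job of the day
pays
\[
  t_{\mathrm{refresh}} \approx 10^{5} \times 30~\mathrm{ms}
  \approx 50~\mathrm{minutes},
\]
which is consistent with the almost one hour observed. Grouping the
constants into fewer, larger objects, or invalidating the cache less
frequently, would reduce this cost roughly in proportion.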

\subsection{Data Management}

The CMS data management solution, which relies on central components
for data bookkeeping, data location, and data transfer management and
on site components for data resolution, worked well and reduced the
effort required of the site operators. The changes in the CMS event
data model significantly simplified access to the data by analysis
applications.

The general lesson from data management is that CMS needs to ensure
that all the data management components have a consistent picture of
the data. The synchronization of the various views needs to be better
automated. CMS has data management information in the dataset
bookkeeping system (DBS), the data transfer system (PhEDEx), and the
dataset location service, and these components were able to fall out
of sync with one another. Maintaining consistency currently involves
some manual operations.
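
A simple automated cross-check could reduce the manual consistency
work. The sketch below assumes each system can return the set of file
blocks it knows for a dataset; the function name and inputs are
hypothetical and are not actual DBS or PhEDEx interfaces:

\begin{verbatim}
def compare_catalogues(dbs_blocks, phedex_blocks, location_blocks):
    """Report file blocks that are not known to all three systems.

    Each argument is a set of block names as reported by one component;
    how those sets are obtained from each service is not shown here.
    """
    return {
        "missing_from_dbs": (phedex_blocks | location_blocks) - dbs_blocks,
        "missing_from_phedex": (dbs_blocks | location_blocks) - phedex_blocks,
        "missing_from_location": (dbs_blocks | phedex_blocks) - location_blocks,
        "consistent": dbs_blocks & phedex_blocks & location_blocks,
    }

# Example with toy block names.
report = compare_catalogues({"blockA", "blockB"}, {"blockA"},
                            {"blockA", "blockB"})
\end{verbatim}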

A specific element identified in CSA06 was the need to examine the
DBS performance in the presence of merging output. The initial
estimates of the performance needs did not include this use-case,
which introduces a heavy load on the DBS. For many output streams the
performance of the bookkeeping system limited the rate at which data
selections could be prepared. The performance limitation is being
addressed in the next generation of the DBS.

Data publication and the trivial file catalog resolution of
logical to physical file names both worked well. The trivial file
catalog scaled well and applications were able to consistently
discover data file locations with minimal additional services required
at the sites.
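
The trivial file catalog is in essence a small set of site-local
rewriting rules applied to the logical file name. The sketch below
illustrates this kind of rule-based resolution; the rule pattern and
storage element name are invented for the example and are not the
actual CMS configuration:

\begin{verbatim}
import re

# Illustrative site-local rules mapping logical file names (LFNs) to
# physical file names (PFNs); real sites define their own rule sets.
LFN_TO_PFN_RULES = [
    (r"^/store/(.*)$", r"srm://se.example-tier2.org/pnfs/cms/store/\1"),
]

def resolve(lfn):
    """Return the first matching PFN for an LFN, or None if no rule applies."""
    for pattern, replacement in LFN_TO_PFN_RULES:
        if re.match(pattern, lfn):
            return re.sub(pattern, replacement, lfn)
    return None

print(resolve("/store/data/CSA06/example/file.root"))
\end{verbatim}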

\subsection{Workflow Management}

Workflow management components both at CERN and at remote centers
were able to achieve the level of activity required in the challenge.
There is some overlap in the implementation of the Tier-0 workflow
and the Prod\_Agent workflow used at the Tier-1 and Tier-2 centers,
which should be re-examined after the challenge with an eye toward
long-term maintainability and support.

\subsection{Central Services}

Central services and facilities provided at CERN by IT and WLCG,
including the batch resources and FTS, were carefully monitored and
problems were solved. CASTOR support at CERN was excellent. As an
export system, CASTOR2 performed at a higher rate and more stably than
in past CMS exercises. CMS ran into an issue with the SRM release in
DPM for files greater than 2GB, which was solved the next day.

\subsection{Tier-0}

The Tier-0 workflow and dataflow management tools performed better than
required for CSA06, showing no significant problems throughout the
challenge. The flexibility of the message-based architecture allowed
adaptation of the running system to the changing operational conditions,
as the challenge progressed, without any interruption of service.

No inherent scaling problems were found, and key Tier-0 components
(hardware, software and people) were far from being stressed during the
challenge. The system achieved the low latency response required for
real data-taking.

Most of the full range of complexity of the final system was explored
during the challenge. Other aspects were already explored with the ``July
prototype''. The design of the Tier-0 can therefore be deemed validated.

Operationally, the Tier-0 can be installed, configured, and run by
non-experts already. The Tier-0 internal goals of exploring the operations
during CSA06 have therefore also been met.

\subsection{Tier-1}

Six of the seven Tier-1 centers met the complete goals for full
participation in the challenge, with successful transfers on 90\% of
the days. The transfer quality, defined in CMS as the number of times
a transfer was attempted before succeeding, was significantly improved
for CERN to Tier-1 transfers during the challenge as compared to
previous service challenge exercises. There are several elements to
improve in the final year of experiment preparation. Several Tier-1
centers had problems importing and exporting data simultaneously;
these centers experienced either unstable data export or limited
performance. The majority of Tier-1 sites demonstrated successful
migration of data to tape, but there is substantial work left to
demonstrate that CMS can write the full data rate to tape at Tier-1
centers and serve the data to all Tier-2 centers when requested.

A specific technical item was identified: the FTS timeouts were too
tight for sites with low access bandwidth and high latency once CMS
moved to files larger than 4GB. In the final experiment the raw data
files should be between 5GB and 10GB, so CMS will need to revisit the
transfer timeouts again.
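
As a simple illustration of why the timeouts must track the file size
(the transfer rate used here is an assumed example, not a site
measurement), a single transfer to a site with an effective per-file
rate of 20~MB/s needs
\[
  t_{\mathrm{transfer}} \approx \frac{5~\mathrm{GB}}{20~\mathrm{MB/s}}
  \approx 250~\mathrm{s}
\]
for a 5GB raw data file, and proportionally longer for 10GB files or
slower links, so any fixed timeout tuned to the smaller files used
earlier will eventually be exceeded on the slowest links.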

\subsection{Tier-2}

The number of Tier-2 centers participating in the challenge was larger
than the original goal and a broader variety of activities was
successfully performed by the Tier-2 centers. An item to improve is
the amount of effort required to make Tier-2 transfers work. Some
sites accepted data only from particular sites. Early in the
challenge PhEDEx dynamic routing led to unpredictable Tier-1-to-Tier-2
paths through intermediate Tier-1s. In response, the PhEDEx
operations team modified the path cost metrics in PhEDEx to avoid
multi-hop transfers and make the routes more static and prescribed,
which makes the transfers look more like the baseline computing model.
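
Conceptually, the change amounts to weighting the path cost so that a
direct Tier-1-to-Tier-2 link always wins over a multi-hop route. The
sketch below illustrates that selection logic only; the cost values
and penalty are invented and are not the actual PhEDEx metrics:

\begin{verbatim}
def route_cost(path, hop_penalty=1000.0):
    """Cost of a candidate path: the sum of per-link costs plus a large
    penalty for every intermediate hop, so a direct link is preferred
    whenever one exists.  All numbers here are illustrative."""
    link_cost = sum(cost for _, _, cost in path)
    intermediate_hops = len(path) - 1
    return link_cost + hop_penalty * intermediate_hops

def choose_route(candidate_paths):
    """Pick the cheapest candidate path under the penalised metric."""
    return min(candidate_paths, key=route_cost)

# A direct Tier-1 to Tier-2 link beats a nominally cheaper two-hop route.
direct = [("T1_Example", "T2_Example", 10.0)]
via_other_t1 = [("T1_Example", "T1_Other", 2.0),
                ("T1_Other", "T2_Example", 2.0)]
assert choose_route([direct, via_other_t1]) == direct
\end{verbatim}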

The poor transfer quality on the PhEDEx monitoring plot is not
necessarily a Tier-2 site issue. Some Tier-1 centers could import data
from the Tier-0 more effectively than they could export it to Tier-2
centers. One item that was identified is that a large number of
transfer requests could clog the queues and lead to component
failures. The FTS system is designed to throttle transfer requests,
but the developers initially focused on protecting imports rather than
exports. CMS is continuing the discussion with the developers on the
architecture and implementation of throttling in FTS.

One area where the general lesson about operations load was felt most
strongly was data management at the Tier-2 centers. The data stored
at a Tier-2 center is defined by the community it supports, and a
clear need was identified during the challenge for tools that allow
the Tier-2 to control its resident data.