\section{Offline Database and Frontier}

\subsection{Frontier}

The Frontier infrastructure was installed and tested prior to CSA06 and
was used for the T0 operation and at T1 and T2 centers. The goal was to
observe the behavior of the Frontier central servers at CERN, referred
to as the ``launchpad'', and to monitor the squids deployed at each
participating site. The setup at CERN is shown in
Fig.~\ref{fig:frontier-setup}. There were three production servers,
each running a Tomcat server and a squid server in tandem in
accelerator mode. Load balancing and failover among the three servers
are done via DNS round robin, and this worked flawlessly. The squids
were configured in cache-peer-sharing mode, which reduces traffic to
the database for objects that are not yet cached.
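
The round-robin alias also gives clients a simple failover path: if one
launchpad server does not respond, the next address behind the alias can
be tried. A minimal sketch of the idea is shown below; the alias name and
port are illustrative assumptions rather than the actual launchpad
settings.
\begin{verbatim}
# Sketch of client-side use of a DNS round-robin alias: resolve the
# alias and try each launchpad address until one answers.  The alias
# name and port here are illustrative assumptions.
import socket
import urllib.request

ALIAS, PORT = "cmsfrontier.cern.ch", 8000

def fetch_with_failover(path):
    addresses = sorted({info[4][0] for info in
                        socket.getaddrinfo(ALIAS, PORT,
                                           proto=socket.IPPROTO_TCP)})
    last_error = None
    for addr in addresses:
        try:
            url = "http://%s:%d%s" % (addr, PORT, path)
            with urllib.request.urlopen(url, timeout=10) as reply:
                return reply.read()
        except OSError as err:
            last_error = err      # server unreachable; try the next one
    raise RuntimeError("all servers behind %s failed: %s"
                       % (ALIAS, last_error))
\end{verbatim}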
\begin{figure}[hbtp]
\begin{center}
\resizebox{15cm}{!}{\includegraphics{figs/frontier-setup}}
\caption{Frontier overview of launchpad and connection to WAN and T0 Farm.}
\label{fig:frontier-setup}
\end{center}
\end{figure}

Monitoring was in place to observe the activity of each squid through
its SNMP interface, and plots of (1) request rate, (2) data throughput,
and (3) number of cached objects were available for each installed
squid. Lemon was used to monitor CPU, network I/O, and other important
machine operating parameters on the servers at CERN.
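
For reference, the same counters can be read directly with the standard
net-snmp command-line tools. The sketch below assumes squid's SNMP module
is enabled on its default port 3401; the host, community string, and OIDs
are illustrative and should be checked against the deployed SQUID-MIB.
\begin{verbatim}
# Sketch of polling a site squid through SNMP, in the spirit of the
# monitoring described above.  Host, community and OIDs are
# illustrative and should be checked against the deployed SQUID-MIB.
import subprocess

SQUID = "squid.example-site.org"   # hypothetical site squid
PORT, COMMUNITY = 3401, "public"   # squid's default SNMP port

OIDS = {   # roughly: request counter, bytes served, objects in cache
    "client_http_requests":   ".1.3.6.1.4.1.3495.1.3.2.1.1",
    "client_http_kbytes_out": ".1.3.6.1.4.1.3495.1.3.2.1.3",
    "stored_objects":         ".1.3.6.1.4.1.3495.1.1.5",
}

def read_counter(oid):
    cmd = ["snmpget", "-v", "2c", "-c", COMMUNITY,
           "%s:%d" % (SQUID, PORT), oid]
    return subprocess.run(cmd, capture_output=True,
                          text=True).stdout.strip()

for name, oid in OIDS.items():
    print(name, read_counter(oid))
\end{verbatim}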
Initial tests with 200 T0 clients running CMSSW\_0\_8\_1 were
successful for the calibrations available at the time, ECAL and
HCAL. However, when CMSSW\_1\_0\_3 was tried, a significant fraction
($\sim 5\%$) of the jobs ended with segmentation faults. Several
additional problems associated with the Si alignment emerged when the
software was run for the first time on the T0 system for the CSA
exercise. The Si alignment C++ object comprises a vector of vectors,
which POOL-ORA translates into a very large number of tiny database
queries. This makes loading the object quite slow, and Frontier
somewhat slower than direct Oracle access. Due to the large number of
calls to Frontier, the squid access logs filled more quickly than we
had observed in our previous testing, and we were forced to turn them
off temporarily.
A patch was found for the segmentation-fault problem and was implemented and released in CMSSW\_1\_0\_6. The root cause of the segmentation faults was non-thread-safe code in the SEAL library. By commenting out logging in the CORAL Frontier access libraries, it was found that the failure rate could be reduced to a few per mil. Additional work is underway to solve this problem.

\subsubsection{Performance under T0 processing load}
After the problems were resolved, there was a week of extensive operation during which the T0 farm was ramped up to 1000 nodes. The number of requests and the data throughput are shown in Fig.~\ref{fig:frontier-t0-requests} and Fig.~\ref{fig:frontier-t0-throughput}. These show how the system behaved under loads ranging from 200 (Sunday) to 1000 (Wednesday) concurrent clients. The figures are for one of the three Frontier server machines, although the other two servers looked very similar, indicating that the load balancing was working as expected. The blue line in these plots indicates the requests that were not in the squid cache and had to be retrieved from the central database. The observed throughput for each of the three servers peaked at around 660~kB/s, which indicates that the 100~Mbps network was not a bottleneck. The total throughput for the three servers was 1.8~MB/s.
Fig.~\ref{fig:launchpad-t0-cpu} shows the CPU usage for one of the servers during this same time period. Spikes are observed when new objects are brought into the cache, but no severe loads are observed. Fig.~\ref{fig:launchpad-t0-1000node-cpu} shows the CPU load for the same server under steady load during the 1000-client T0 test; it remains below 10\% for the duration. The I/O during this same time is shown in Fig.~\ref{fig:launchpad-t0-1000node-io}. The fact that the input to the server is almost two-thirds of the output was somewhat surprising, but it is the result of HTTP and TCP overhead, which is significant because for the small objects it is of the same order as the payload itself.
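
A rough estimate of the request-side cost illustrates this point. The
sketch below builds an HTTP request whose shape only approximates what the
Frontier client sends (the URL, header set, and lengths are assumptions)
and compares it to a typical 317-byte payload.
\begin{verbatim}
# Rough, illustrative estimate of the upstream HTTP cost of one tiny
# Frontier query.  The URL shape and headers are assumptions, meant
# only to show the order of magnitude relative to a ~317-byte payload.
query = "X" * 200                  # stand-in for the encoded query string
request = (
    "GET /Frontier/type=frontier_request:1:DEFAULT"
    "&encoding=BLOBzip&p1=" + query + " HTTP/1.1\r\n"
    "Host: cmsfrontier.cern.ch:8000\r\n"
    "User-Agent: frontier_client\r\n"
    "Accept: */*\r\n\r\n"
)
payload = 317                      # typical compressed payload size
overhead = len(request.encode())   # HTTP bytes, before TCP/IP headers
print("request: %d bytes, payload: %d bytes, ratio %.1f"
      % (overhead, payload, overhead / payload))
\end{verbatim}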

\begin{figure}[hbtp]
\begin{center}
\resizebox{15cm}{!}{\includegraphics{figs/frontier-t0-requests}}
\caption{Requests to one of the three Frontier servers from the T0 processing farm. The number of T0 nodes is ramped up from 200 to 1000 during the time shown on the chart.}
\label{fig:frontier-t0-requests}
\end{center}
\end{figure}
\begin{figure}[hbtp]
\begin{center}
\resizebox{15cm}{!}{\includegraphics{figs/frontier-t0-throughput}}
\caption{Data throughput for one of the three Frontier servers from the T0 processing farm. The number of T0 nodes is ramped up from 200 to 1000 during the time shown on the chart.}
\label{fig:frontier-t0-throughput}
\end{center}
\end{figure}
\begin{figure}[hbtp]
\begin{center}
\resizebox{15cm}{!}{\includegraphics{figs/launchpad-t0-cpu}}
\caption{Frontier server CPU usage during the ramp-up of T0 activity.}
\label{fig:launchpad-t0-cpu}
\end{center}
\end{figure}
\begin{figure}[hbtp]
\begin{center}
\resizebox{15cm}{!}{\includegraphics{figs/launchpad-t0-1000node-cpu}}
\caption{Steady-state CPU usage on a Frontier server node during 1000-node T0 operation.}
\label{fig:launchpad-t0-1000node-cpu}
\end{center}
\end{figure}
\begin{figure}[hbtp]
\begin{center}
\resizebox{15cm}{!}{\includegraphics{figs/launchpad-t0-1000node-io}}
\caption{Steady-state I/O on a Frontier server node during 1000-node T0 operation.}
\label{fig:launchpad-t0-1000node-io}
\end{center}
\end{figure}

\subsubsection{Tier N operation}

In addition to the launchpad at CERN, there were 28 Tier-1 and Tier-2 sites where squid was installed and properly configured. Each of these squids is monitored through the SNMP interface, and its activity and history are available at the web site http://cdfdbfrontier4.fnal.gov:8888/indexcms.html. No remarkable issues were observed during this testing; however, the large number of tiny objects makes a typical client startup take 15 minutes or more. For data not in the local squid caches, the startup was observed to take as long as 40 minutes.

\subsubsection{Si Alignment Object Characteristics}
To better understand the characteristics of the Si alignment object, we looked at the size and number of objects being requested.
A single run of RECO081\_onlyCkf.cfg with the patched FrontierAccess had the following Frontier statistics:
\begin{verbatim}
28116 queries
138 no-cache queries
342 queries of the database version
27502 unique queries

These are the largest payloads (full size = uncompressed):

1369 byte (full size 12630), 25033 byte (full size 152389)
20221 byte (full size 157849), 54251 byte (full size 575482)
57316 byte (full size 597911), 109821 byte (full size 843757)
392046 byte (full size 2948642), 419859 byte (full size 3250885)
411531 byte (full size 3555809), 431981 byte (full size 6728489)
\end{verbatim}
Everything else is under 4000 bytes full size, and 99\% of the total
queries are 317 bytes full size or smaller. The data was compressed by
Frontier for the network transfer; the ``full size'' numbers refer to
the uncompressed size.
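
A short script over the numbers quoted above makes the effect of the
compression explicit; this is only a check of the listed payloads,
computing the on-the-wire size as a fraction of the uncompressed size.
\begin{verbatim}
# Compression check for the largest payloads listed above:
# (compressed bytes sent over the network, uncompressed "full size").
largest = [
    (1369, 12630), (25033, 152389), (20221, 157849), (54251, 575482),
    (57316, 597911), (109821, 843757), (392046, 2948642),
    (419859, 3250885), (411531, 3555809), (431981, 6728489),
]
for wire, full in largest:
    print("%8d / %8d bytes -> %4.1f%% of full size"
          % (wire, full, 100.0 * wire / full))
print("totals: %.2f MB compressed, %.2f MB uncompressed"
      % (sum(w for w, _ in largest) / 1e6,
         sum(f for _, f in largest) / 1e6))
\end{verbatim}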
The performance effect of this very large number (27k+) of small
requests per job is being investigated. Job startup time with
Frontier is about 25\% longer than with direct Oracle access at CERN
(when running one job at a time and when the squid cache has been
preloaded). We have prototyped reusing a single persistent TCP
connection for all the Frontier queries, but this appears to account
for only about half of that difference in job startup time. Even
with the persistent TCP connection, the small packets keep the
maximum network throughput with many parallel jobs down to around
1~MB/s. By contrast, we have seen throughput as high as 35~MB/s with
larger queries over Gigabit Ethernet (at Fermilab). The large number
of requests is also responsible for producing squid access log
entries at a rate of about 2~GB/hour when 400 jobs run in parallel.
The bottom line is that many tiny objects are bad for overall
performance and must be avoided.
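
The persistent-connection prototype mentioned above amounts to keeping
one HTTP/1.1 keep-alive connection open for the whole series of queries
instead of reconnecting per request; a minimal sketch of the idea (host,
port, and paths are illustrative) is:
\begin{verbatim}
# Minimal sketch of reusing one persistent (keep-alive) TCP connection
# for a long series of small queries.  Host, port and paths are
# illustrative; this is the idea, not the frontier_client code.
import http.client

def fetch_all(paths, host="cmsfrontier.cern.ch", port=8000):
    conn = http.client.HTTPConnection(host, port, timeout=30)
    results = []
    try:
        for path in paths:
            conn.request("GET", path)        # same TCP socket each time
            reply = conn.getresponse()
            results.append(reply.read())     # drain before the next request
    finally:
        conn.close()
    return results
\end{verbatim}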

\subsubsection{Site Configuration}
Admins at each site are responsible for configuring their squid(s) to match the hardware being used. One important question was whether the instrumentation we have is sufficient to diagnose specific site problems and help the site administrators fix them. One example we encountered during the CSA06 tests was an improperly configured cache at one of the sites.
We noticed that the cached-objects chart (\# in cache) had ``hair'', as seen in Fig.~\ref{fig:frontier-bari-cache-problem}. The requests-per-minute chart, Fig.~\ref{fig:frontier-bari-cache-problem-requests}, showed a correlation with these unusual features. The precise problem was that the site's squid configuration had a very small disk cache, causing objects to be ``thrashed'' out of the cache quickly.
The other important part of the site configuration is the so-called site-local-config file. This file is a bootstrap for jobs running at the site, and contains the Frontier server URL and the local squid proxy URLs. Many site-local-config files have been debugged and fixed over the course of CSA06. The CMS job robot has started submitting jobs that include Frontier access to an ever-increasing number of Tier-1 and Tier-2 sites.
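
As an illustration of what the job bootstrap needs from this file, the
sketch below pulls the Frontier server and proxy URLs out of a
site-local-config fragment; the element and attribute names shown only
approximate the CMS schema, and a real file at the site should be taken
as the reference.
\begin{verbatim}
# Sketch of extracting the Frontier server and local squid proxy URLs
# from a site-local-config file.  The element and attribute names are
# an approximation of the CMS schema and should be verified against a
# real file at the site.
import xml.etree.ElementTree as ET

EXAMPLE = """
<site-local-config>
  <site name="T2_XX_Example">
    <calib-data>
      <frontier-connect>
        <proxy url="http://squid.example-site.org:3128"/>
        <server url="http://cmsfrontier.cern.ch:8000/FrontierInt"/>
      </frontier-connect>
    </calib-data>
  </site>
</site-local-config>
"""

root = ET.fromstring(EXAMPLE)
proxies = [e.get("url") for e in root.iter("proxy")]
servers = [e.get("url") for e in root.iter("server")]
print("local squid proxies:", proxies)
print("frontier servers:   ", servers)
\end{verbatim}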
\begin{figure}[hbtp]
\begin{center}
\resizebox{15cm}{!}{\includegraphics{figs/frontier-bari-cache-problem}}
\caption{Squid configuration problem caused cache thrashing, as indicated by the ``hair'' on the number-of-cached-objects chart.}
\label{fig:frontier-bari-cache-problem}
\end{center}
\end{figure}
\begin{figure}[hbtp]
\begin{center}
\resizebox{15cm}{!}{\includegraphics{figs/frontier-bari-cache-problem-requests}}
\caption{Request rate during the cache configuration problem.}
\label{fig:frontier-bari-cache-problem-requests}
\end{center}
\end{figure}

\subsubsection{Cache Coherency}
One issue of concern for the objects in the squid caches is coherency with the objects stored in the central database. CMS has agreed to a policy of never changing objects that are stored in the central database, and ultimately this and other cache-refresh options will be implemented. During the startup period, however, it was desired to have a mechanism that would provide a periodic cache refresh in case an object was changed. This mechanism is implemented as an expiration time included in the HTTP header of each object, which causes it to expire at 5~AM CERN time (03:00 UTC) the next day. The effect of this can be seen in Fig.~\ref{fig:frontier-cach-expire-objects} and Fig.~\ref{fig:frontier-cach-expire-requests}. At 22:00 UTC the cache was dumped by hand and the servlet that writes the expiration time into the header was installed. Subsequently, the objects are observed to expire and be refreshed between 03:00 and 04:00 UTC.
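
The mechanism itself is simple: each reply carries an HTTP Expires header
pointing at the next 03:00 UTC. The sketch below shows one way such a
timestamp can be computed and formatted; it illustrates the mechanism only
and is not the servlet code itself.
\begin{verbatim}
# Sketch of computing an "expire at the next 03:00 UTC" timestamp for
# use as an HTTP Expires header, as described above.  This illustrates
# the mechanism only; it is not the actual servlet code.
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime

def next_expiration(now=None):
    """Return the next 03:00 UTC instant after 'now'."""
    now = now or datetime.now(timezone.utc)
    candidate = now.replace(hour=3, minute=0, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate

print("Expires:", format_datetime(next_expiration(), usegmt=True))
\end{verbatim}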
\begin{figure}[hbtp]
\begin{center}
\resizebox{15cm}{!}{\includegraphics{figs/frontier-cach-expire-objects}}
\caption{Object count on a launchpad server when the refresh is done by hand and through the expiration at 03:00 UTC.}
\label{fig:frontier-cach-expire-objects}
\end{center}
\end{figure}
\begin{figure}[hbtp]
\begin{center}
\resizebox{15cm}{!}{\includegraphics{figs/frontier-cach-expire-requests}}
\caption{Requests on a launchpad server during cache refresh and expiration.}
\label{fig:frontier-cach-expire-requests}
\end{center}
\end{figure}
\begin{figure}[hbtp]
\begin{center}
\resizebox{15cm}{!}{\includegraphics{figs/frontier-cach-expire-cpu}}
\caption{CPU usage on a launchpad server during cache refresh and expiration.}
\label{fig:frontier-cach-expire-cpu}
\end{center}
\end{figure}
\begin{figure}[hbtp]
\begin{center}
\resizebox{15cm}{!}{\includegraphics{figs/frontier-cach-expire-io}}
\caption{I/O on a launchpad server during cache refresh and expiration.}
\label{fig:frontier-cach-expire-io}
\end{center}
\end{figure}
This is an adequate solution for the short term and solves the cache coherency problem to within a few hours. However, reloading every cached object at every site all over the world will have significant performance implications, and we must have a better solution for the final system. The impact on the launchpad servers is apparent in Fig.~\ref{fig:frontier-cach-expire-cpu} and Fig.~\ref{fig:frontier-cach-expire-io}, which show spikes in the CPU and I/O for one of the Frontier server machines as the caches are reloaded.

\subsubsection{Conclusion}

CSA06 Calibration and Alignment DB access via the Frontier infrastructure was successfully exercised with up to 1000 clients on the T0 farm, and at T1 and T2 sites. The monitoring we have in place is extremely useful for observing the activity and understanding performance at several levels of the system.
The activity helped to uncover several issues that need additional work, including:
\begin{itemize}
\item A threading problem was found that causes segmentation faults.
\item The many tiny objects in the Si alignment need to be consolidated.
\item Logging of squid access information can be copious.
\item TCP connection overhead should be reduced.
\item Squid configuration and site-local-config files at some sites needed debugging.
\item The cache coherency concern has a temporary solution, but needs more work.
\end{itemize}

These areas and others will be addressed in the future. The configuration of the service at CERN is not final, and work is needed to provide a dedicated squid for the T0 farm. More work will be done at Tier-1 centers to provide failover solutions that make the service more reliable at that layer, although no problems were encountered there over the course of the CSA06 tests.