\section{Offline Database and Frontier}

\subsection{Frontier}

The Frontier infrastructure was installed and tested prior to CSA06 and
was used for the T0 operation and at T1 and T2 centers. The goal was to
observe the behavior of the Frontier central servers at CERN, referred
to as the ``launchpad'', and to monitor the squids deployed at each
participating site. The setup at CERN is shown in
Fig.~\ref{fig:frontier-setup}. There were three production servers,
each running a Tomcat server and a squid server in tandem in
accelerator mode. Load balancing and failover among the three servers
are done via DNS round robin, and this worked flawlessly. The squids
were configured in cache-peer-sharing mode, which reduces traffic to
the database for objects that are not yet cached.
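
The round-robin alias also gives clients a simple failover path: if one
launchpad server does not respond, the next address behind the alias can
be tried. A minimal sketch of the idea is shown below; the alias name and
port are illustrative assumptions rather than the actual launchpad
settings.
\begin{verbatim}
# Sketch of client-side use of a DNS round-robin alias: resolve the
# alias and try each launchpad address until one answers.  The alias
# name and port here are illustrative assumptions.
import socket
import urllib.request

ALIAS, PORT = "cmsfrontier.cern.ch", 8000

def fetch_with_failover(path):
    addresses = sorted({info[4][0] for info in
                        socket.getaddrinfo(ALIAS, PORT,
                                           proto=socket.IPPROTO_TCP)})
    last_error = None
    for addr in addresses:
        try:
            url = "http://%s:%d%s" % (addr, PORT, path)
            with urllib.request.urlopen(url, timeout=10) as reply:
                return reply.read()
        except OSError as err:
            last_error = err      # server unreachable; try the next one
    raise RuntimeError("all servers behind %s failed: %s"
                       % (ALIAS, last_error))
\end{verbatim}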
\begin{figure}[hbtp]
\begin{center}
\resizebox{15cm}{!}{\includegraphics{figs/frontier-setup}}
\caption{Frontier overview of launchpad and connection to WAN and T0 Farm.}
\label{fig:frontier-setup}
\end{center}
\end{figure}

Monitoring was in place to observe the activity of each squid through
its SNMP interface, and plots of (1) request rate, (2) data throughput,
and (3) number of cached objects were available for each installed
squid. Lemon was used to monitor CPU, network I/O, and other important
machine operating parameters on the servers at CERN.
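
For reference, the same counters can be read directly with the standard
net-snmp command-line tools. The sketch below assumes squid's SNMP module
is enabled on its default port 3401; the host, community string, and OIDs
are illustrative and should be checked against the deployed SQUID-MIB.
\begin{verbatim}
# Sketch of polling a site squid through SNMP, in the spirit of the
# monitoring described above.  Host, community and OIDs are
# illustrative and should be checked against the deployed SQUID-MIB.
import subprocess

SQUID = "squid.example-site.org"   # hypothetical site squid
PORT, COMMUNITY = 3401, "public"   # squid's default SNMP port

OIDS = {   # roughly: request counter, bytes served, objects in cache
    "client_http_requests":   ".1.3.6.1.4.1.3495.1.3.2.1.1",
    "client_http_kbytes_out": ".1.3.6.1.4.1.3495.1.3.2.1.3",
    "stored_objects":         ".1.3.6.1.4.1.3495.1.1.5",
}

def read_counter(oid):
    cmd = ["snmpget", "-v", "2c", "-c", COMMUNITY,
           "%s:%d" % (SQUID, PORT), oid]
    return subprocess.run(cmd, capture_output=True,
                          text=True).stdout.strip()

for name, oid in OIDS.items():
    print(name, read_counter(oid))
\end{verbatim}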
Initial tests with 200 T0 clients running CMSSW\_0\_8\_1 were
successful for the calibrations available at the time, ECAL and
HCAL. However, when CMSSW\_1\_0\_3 was tried, a significant fraction
($\sim 5\%$) of the jobs ended with segmentation faults. Several
additional problems associated with the Si alignment emerged when the
software was run for the first time on the T0 system for the CSA
exercise. The Si alignment C++ object comprises a vector of vectors,
which POOL-ORA translates into a very large number of tiny database
queries. This makes loading the object quite slow, and Frontier
somewhat slower than direct Oracle access. Due to the large number of
calls to Frontier, the squid access logs filled more quickly than we
had observed in our previous testing, and we were forced to turn them
off temporarily.
A patch was found for the segmentation-fault problem and was implemented and released in CMSSW\_1\_0\_6. The root cause of the segmentation faults was non-thread-safe code in the SEAL library. By commenting out logging in the CORAL Frontier access libraries, it was found that the failure rate could be reduced to a few per mil. Additional work is underway to solve this problem.

\subsubsection{Performance under T0 processing load}
After the problems were resolved, there was a week of extensive operation during which the T0 farm was ramped up to 1000 nodes. The number of requests and the data throughput are shown in Fig.~\ref{fig:frontier-t0-requests} and Fig.~\ref{fig:frontier-t0-throughput}. These show how the system behaved under loads ranging from 200 (Sunday) to 1000 (Wednesday) concurrent clients. The figures are for one of the three Frontier server machines, although the other two servers looked very similar, indicating that the load balancing was working as expected. The blue line in these plots indicates the requests that were not in the squid cache and had to be retrieved from the central database. The observed throughput for each of the three servers peaked at around 660~kB/s, which indicates that the 100~Mbps network was not a bottleneck. The total throughput for the three servers was 1.8~MB/s.
Fig.~\ref{fig:launchpad-t0-cpu} shows the CPU usage for one of the servers during this same time period. Spikes are observed when new objects are brought into the cache, but no severe loads are observed. Fig.~\ref{fig:launchpad-t0-1000node-cpu} shows the CPU load for the same server under steady load during the 1000-client T0 test; it remains below 10\% for the duration. The I/O during this same time is shown in Fig.~\ref{fig:launchpad-t0-1000node-io}. The fact that the input to the server is almost two-thirds of the output was somewhat surprising, but it is the result of HTTP and TCP overhead, which is significant because for the small objects it is of the same order as the payload itself.
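
A rough estimate of the request-side cost illustrates this point. The
sketch below builds an HTTP request whose shape only approximates what the
Frontier client sends (the URL, header set, and lengths are assumptions)
and compares it to a typical 317-byte payload.
\begin{verbatim}
# Rough, illustrative estimate of the upstream HTTP cost of one tiny
# Frontier query.  The URL shape and headers are assumptions, meant
# only to show the order of magnitude relative to a ~317-byte payload.
query = "X" * 200                  # stand-in for the encoded query string
request = (
    "GET /Frontier/type=frontier_request:1:DEFAULT"
    "&encoding=BLOBzip&p1=" + query + " HTTP/1.1\r\n"
    "Host: cmsfrontier.cern.ch:8000\r\n"
    "User-Agent: frontier_client\r\n"
    "Accept: */*\r\n\r\n"
)
payload = 317                      # typical compressed payload size
overhead = len(request.encode())   # HTTP bytes, before TCP/IP headers
print("request: %d bytes, payload: %d bytes, ratio %.1f"
      % (overhead, payload, overhead / payload))
\end{verbatim}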

\begin{figure}[hbtp]
\begin{center}
\resizebox{15cm}{!}{\includegraphics{figs/frontier-t0-requests}}
\caption{Requests to one of the three Frontier servers from the T0 processing farm. The number of T0 nodes is ramped up from 200 to 1000 during the time shown on the chart.}
\label{fig:frontier-t0-requests}
\end{center}
\end{figure}
\begin{figure}[hbtp]
\begin{center}
\resizebox{15cm}{!}{\includegraphics{figs/frontier-t0-throughput}}
\caption{Data throughput for one of the three Frontier servers from the T0 processing farm. The number of T0 nodes is ramped up from 200 to 1000 during the time shown on the chart.}
\label{fig:frontier-t0-throughput}
\end{center}
\end{figure}
\begin{figure}[hbtp]
\begin{center}
\resizebox{15cm}{!}{\includegraphics{figs/launchpad-t0-cpu}}
\caption{Frontier server CPU usage during the ramp-up of T0 activity.}
\label{fig:launchpad-t0-cpu}
\end{center}
\end{figure}
\begin{figure}[hbtp]
\begin{center}
\resizebox{15cm}{!}{\includegraphics{figs/launchpad-t0-1000node-cpu}}
\caption{Steady-state CPU usage on a Frontier server node during 1000-node T0 operation.}
\label{fig:launchpad-t0-1000node-cpu}
\end{center}
\end{figure}
\begin{figure}[hbtp]
\begin{center}
\resizebox{15cm}{!}{\includegraphics{figs/launchpad-t0-1000node-io}}
\caption{Steady-state I/O on a Frontier server node during 1000-node T0 operation.}
\label{fig:launchpad-t0-1000node-io}
\end{center}
\end{figure}

\subsubsection{Tier N operation}

In addition to the launchpad at CERN, there were 28 Tier-1 and Tier-2 sites where squid was installed and properly configured. Each of these squids is monitored through the SNMP interface, and its activity and history are available at the web site http://cdfdbfrontier4.fnal.gov:8888/indexcms.html. No remarkable issues were observed during this testing; however, the large number of tiny objects makes a typical client startup take 15 minutes or more. For data not in the local squid caches, the startup was observed to take as long as 40 minutes.

\subsubsection{Si Alignment Object Characteristics}
To better understand the characteristics of the Si alignment object, we looked at the size and number of objects being requested.
A single run of RECO081\_onlyCkf.cfg with the patched FrontierAccess had the following Frontier statistics:
\begin{verbatim}
28116 queries
138 no-cache queries
342 queries of the database version
27502 unique queries

These are the largest payloads (full size = uncompressed):

1369 byte (full size 12630), 25033 byte (full size 152389)
20221 byte (full size 157849), 54251 byte (full size 575482)
57316 byte (full size 597911), 109821 byte (full size 843757)
392046 byte (full size 2948642), 419859 byte (full size 3250885)
411531 byte (full size 3555809), 431981 byte (full size 6728489)
\end{verbatim}
Everything else is under 4000 bytes full size, and 99\% of the total
queries are 317 bytes full size or smaller. The data was compressed by
Frontier for the network transfer; the ``full size'' numbers refer to
the uncompressed size.
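
A short script over the numbers quoted above makes the effect of the
compression explicit; this is only a check of the listed payloads,
computing the on-the-wire size as a fraction of the uncompressed size.
\begin{verbatim}
# Compression check for the largest payloads listed above:
# (compressed bytes sent over the network, uncompressed "full size").
largest = [
    (1369, 12630), (25033, 152389), (20221, 157849), (54251, 575482),
    (57316, 597911), (109821, 843757), (392046, 2948642),
    (419859, 3250885), (411531, 3555809), (431981, 6728489),
]
for wire, full in largest:
    print("%8d / %8d bytes -> %4.1f%% of full size"
          % (wire, full, 100.0 * wire / full))
print("totals: %.2f MB compressed, %.2f MB uncompressed"
      % (sum(w for w, _ in largest) / 1e6,
         sum(f for _, f in largest) / 1e6))
\end{verbatim}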
The performance effect of this very large number (27k+) of small
requests per job is being investigated. Job startup time with
Frontier is about 25\% longer than with direct Oracle access at CERN
(when running one job at a time and when the squid cache has been
preloaded). We have prototyped reusing a single persistent TCP
connection for all the Frontier queries, but this appears to account
for only about half of that difference in job startup time. Even
with the persistent TCP connection, the small packets keep the
maximum network throughput with many parallel jobs down to around
1~MB/s. By contrast, we have seen throughput as high as 35~MB/s with
larger queries over Gigabit Ethernet (at Fermilab). The large number
of requests is also responsible for producing squid access log
entries at a rate of about 2~GB/hour when 400 jobs run in parallel.
The bottom line is that many tiny objects are bad for overall
performance and must be avoided.
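
The persistent-connection prototype mentioned above amounts to keeping
one HTTP/1.1 keep-alive connection open for the whole series of queries
instead of reconnecting per request; a minimal sketch of the idea (host,
port, and paths are illustrative) is:
\begin{verbatim}
# Minimal sketch of reusing one persistent (keep-alive) TCP connection
# for a long series of small queries.  Host, port and paths are
# illustrative; this is the idea, not the frontier_client code.
import http.client

def fetch_all(paths, host="cmsfrontier.cern.ch", port=8000):
    conn = http.client.HTTPConnection(host, port, timeout=30)
    results = []
    try:
        for path in paths:
            conn.request("GET", path)        # same TCP socket each time
            reply = conn.getresponse()
            results.append(reply.read())     # drain before the next request
    finally:
        conn.close()
    return results
\end{verbatim}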

\subsubsection{Site Configuration}
Admins at each site are responsible for configuring their squid(s) to match the hardware being used. One important question was whether the instrumentation we have is sufficient to diagnose specific site problems and help the site administrators fix them. One example we encountered during the CSA06 tests was an improperly configured cache at one of the sites.
We noticed that the cached-objects chart (\# in cache) had ``hair'', as seen in Fig.~\ref{fig:frontier-bari-cache-problem}. The requests-per-minute chart, Fig.~\ref{fig:frontier-bari-cache-problem-requests}, showed a correlation with these unusual features. The precise problem was that the site's squid configuration had a very small disk cache, causing objects to be ``thrashed'' out of the cache quickly.
The other important part of the site configuration is the so-called site-local-config file. This file is a bootstrap for jobs running at the site, and contains the Frontier server URL and the local squid proxy URLs. Many site-local-config files have been debugged and fixed over the course of CSA06. The CMS job robot has started submitting jobs that include Frontier access to an ever-increasing number of Tier-1 and Tier-2 sites.
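
As an illustration of what the job bootstrap needs from this file, the
sketch below pulls the Frontier server and proxy URLs out of a
site-local-config fragment; the element and attribute names shown only
approximate the CMS schema, and a real file at the site should be taken
as the reference.
\begin{verbatim}
# Sketch of extracting the Frontier server and local squid proxy URLs
# from a site-local-config file.  The element and attribute names are
# an approximation of the CMS schema and should be verified against a
# real file at the site.
import xml.etree.ElementTree as ET

EXAMPLE = """
<site-local-config>
  <site name="T2_XX_Example">
    <calib-data>
      <frontier-connect>
        <proxy url="http://squid.example-site.org:3128"/>
        <server url="http://cmsfrontier.cern.ch:8000/FrontierInt"/>
      </frontier-connect>
    </calib-data>
  </site>
</site-local-config>
"""

root = ET.fromstring(EXAMPLE)
proxies = [e.get("url") for e in root.iter("proxy")]
servers = [e.get("url") for e in root.iter("server")]
print("local squid proxies:", proxies)
print("frontier servers:   ", servers)
\end{verbatim}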
\begin{figure}[hbtp]
\begin{center}
\resizebox{15cm}{!}{\includegraphics{figs/frontier-bari-cache-problem}}
\caption{Squid configuration problem caused cache thrashing, as indicated by the ``hair'' on the number-of-cached-objects chart.}
\label{fig:frontier-bari-cache-problem}
\end{center}
\end{figure}
\begin{figure}[hbtp]
\begin{center}
\resizebox{15cm}{!}{\includegraphics{figs/frontier-bari-cache-problem-requests}}
\caption{Request rate during the cache configuration problem.}
\label{fig:frontier-bari-cache-problem-requests}
\end{center}
\end{figure}

\subsubsection{Cache Coherency}
One issue of concern for the objects in the squid caches is coherency with the objects stored in the central database. CMS has agreed to a policy of never changing objects that are stored in the central database, and ultimately this and other cache-refresh options will be implemented. During the startup period, however, it was desired to have a mechanism that would provide a periodic cache refresh in case an object was changed. This mechanism is implemented as an expiration time included in the HTTP header of each object, which causes it to expire at 5~AM CERN time (03:00 UTC) the next day. The effect of this can be seen in Fig.~\ref{fig:frontier-cach-expire-objects} and Fig.~\ref{fig:frontier-cach-expire-requests}. At 22:00 UTC the cache was dumped by hand and the servlet that writes the expiration time into the header was installed. Subsequently, the objects are observed to expire and be refreshed between 03:00 and 04:00 UTC.
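
The mechanism itself is simple: each reply carries an HTTP Expires header
pointing at the next 03:00 UTC. The sketch below shows one way such a
timestamp can be computed and formatted; it illustrates the mechanism only
and is not the servlet code itself.
\begin{verbatim}
# Sketch of computing an "expire at the next 03:00 UTC" timestamp for
# use as an HTTP Expires header, as described above.  This illustrates
# the mechanism only; it is not the actual servlet code.
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime

def next_expiration(now=None):
    """Return the next 03:00 UTC instant after 'now'."""
    now = now or datetime.now(timezone.utc)
    candidate = now.replace(hour=3, minute=0, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate

print("Expires:", format_datetime(next_expiration(), usegmt=True))
\end{verbatim}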
\begin{figure}[hbtp]
\begin{center}
\resizebox{15cm}{!}{\includegraphics{figs/frontier-cach-expire-objects}}
\caption{Object count on a launchpad server when the refresh is done by hand and through the expiration at 03:00 UTC.}
\label{fig:frontier-cach-expire-objects}
\end{center}
\end{figure}
\begin{figure}[hbtp]
\begin{center}
\resizebox{15cm}{!}{\includegraphics{figs/frontier-cach-expire-requests}}
\caption{Requests on a launchpad server during cache refresh and expiration.}
\label{fig:frontier-cach-expire-requests}
\end{center}
\end{figure}
\begin{figure}[hbtp]
\begin{center}
\resizebox{15cm}{!}{\includegraphics{figs/frontier-cach-expire-cpu}}
\caption{CPU usage on a launchpad server during cache refresh and expiration.}
\label{fig:frontier-cach-expire-cpu}
\end{center}
\end{figure}
\begin{figure}[hbtp]
\begin{center}
\resizebox{15cm}{!}{\includegraphics{figs/frontier-cach-expire-io}}
\caption{I/O on a launchpad server during cache refresh and expiration.}
\label{fig:frontier-cach-expire-io}
\end{center}
\end{figure}
This is an adequate solution for the short term and solves the cache coherency problem to within a few hours. However, reloading every cached object at every site all over the world will have significant performance implications, and we must have a better solution for the final system. The impact on the launchpad servers is apparent in Fig.~\ref{fig:frontier-cach-expire-cpu} and Fig.~\ref{fig:frontier-cach-expire-io}, which show spikes in the CPU and I/O for one of the Frontier server machines as the caches are reloaded.

\subsubsection{Conclusion}

CSA06 Calibration and Alignment DB access via the Frontier infrastructure was successfully exercised with up to 1000 clients on the T0 farm, and at T1 and T2 sites. The monitoring we have in place is extremely useful for observing the activity and understanding performance at several levels of the system.
The activity helped to uncover several issues that need additional work, including:
\begin{itemize}
\item A threading problem was found that causes segmentation faults.
\item The many tiny objects in the Si alignment need to be consolidated.
\item Logging of squid access information can be copious.
\item TCP connection overhead should be reduced.
\item Squid configuration and site-local-config files at some sites needed debugging.
\item The cache coherency concern has a temporary solution, but needs more work.
\end{itemize}

These areas and others will be addressed in the future. The configuration of the service at CERN is not final, and work is needed to provide a dedicated squid for the T0 farm. More work will be done at Tier-1 centers to provide failover solutions that make the service more reliable at that layer, although no problems were encountered there over the course of the CSA06 tests.