>> >> I haven't mentioned it yet, but several times I've seen the website
>> >> perform fine all day until I browse to it myself and then all of a
>> >> sudden it's super slow for me and my third-party monitor. WTF???
>> >
>> > I had a similar problem once when routing through an IPsec VPN
>> > tunnel. I needed to reduce the MTU in front of the tunnel to make it
>> > work correctly. But I think your problem is different.
>>
>> I'm not using IPsec or a VPN.
>>
>> > Does the http server build up a backlog on the other side? Do you
>> > have performance graphs for other parts of the system to see them in
>> > relation? Maybe some router on the path doesn't work as expected.
>>
>> I've attached a graph of http response time, CPU usage, and TCP
>> queueing over the past week. It seems clear from watching top, iotop,
>> and free that my CPU is always the bottleneck on my server.
>
> What kind of application stack is running in the http server? CPU is a
> bottleneck you cannot always circumvent by throwing more CPUs at the
> problem. Maybe that stack needs tuning...
>
> At the point when requests start queuing up in the http server, the
> load on the server will rise exponentially. It's like a traffic jam on
> a multi-lane highway. If one car brakes, things may still work. If a
> car in every lane brakes, you suddenly have a huge traffic jam backed
> up for a few miles, and it takes time to recover from that. You need
> to address the cause of the "braking" in the first place and add some
> alternative routes for "cars that never brake" (static files and
> cacheable content). Each lane corresponds to one CPU. Adding more
> lanes when you have just 4 CPUs will only make each lane slower. The
> key is to drastically lower the response times, which look much too
> high in your graphs. What do memory and IO say?

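As an aside, the queueing blow-up he describes can be sketched with a toy
M/M/1 model (an idealized illustration with made-up rates, not numbers
from my server): mean response time is 1/(mu - lam), which explodes as
the arrival rate approaches the service rate.

```python
# Toy M/M/1 queue illustrating the "traffic jam" effect: mean response
# time T = 1 / (mu - lam) explodes as utilization approaches 100%.
# The rates below are invented for illustration only.

def mean_response_time(mu, lam):
    """Mean time in system for an M/M/1 queue (rates in requests/sec)."""
    if lam >= mu:
        return float("inf")  # the queue grows without bound
    return 1.0 / (mu - lam)

mu = 100.0  # pretend the server can handle 100 requests/sec
for lam in (50.0, 90.0, 99.0, 99.9):
    t = mean_response_time(mu, lam)
    print(f"load {lam / mu:5.1%}: mean response {t * 1000:8.1f} ms")
```

With these made-up rates, 50% load gives 20 ms but 99% load gives a full
second, which matches the "one braking car per lane" picture: near
saturation, a tiny increase in load causes a huge increase in delay.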
It turned out this was a combination of two problems, which made it
much more difficult to figure out.

First, I didn't have enough apache2 processes. That seems like it
should have been obvious, but it wasn't, for two reasons. For one, my
apache2 processes are always idle or nearly idle, even when traffic
levels are high. But each request that nginx hands off to apache2 must
monopolize an apache2 process for its full duration, even though my
backend application server, not apache2, is the one using all the CPU.
The other thing that made it difficult to track down was the way munin
graphs apache2 processes. On my graph, busy and free processes appeared
only as tiny dots at the bottom, because apache2's ServerLimit, which
is many times greater than the number of busy and free processes, is
drawn on the same graph. It would be better to draw MaxClients instead
of ServerLimit, since MaxClients is more likely to be tuned; it at
least appears in the default config file on Gentoo. Since busy and free
apache2 processes were virtually invisible on the munin graph, I wasn't
able to correlate their ebb and flow with my server's response times.
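
For anyone else hitting this: the knobs involved live in apache2's MPM
configuration. A prefork sketch (illustrative numbers only, not my
actual settings) looks like:

```apache
# mpm_prefork sketch -- the numbers here are made up for illustration.
# Every request nginx proxies to apache2 occupies one child process for
# the entire time the backend application is computing, even though the
# child itself sits idle, so MaxClients must cover peak *concurrent*
# requests, not peak apache2 CPU usage.
<IfModule mpm_prefork_module>
    StartServers           10
    MinSpareServers        10
    MaxSpareServers        20
    ServerLimit           150
    MaxClients            150   # called MaxRequestWorkers in apache 2.4
</IfModule>
```

MaxClients can never exceed ServerLimit, which is exactly why plotting
ServerLimit on the same munin graph flattens the busy/free curves.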
|
Once I fixed the apache2 problem, I was sure I had it nailed. That's
when I emailed here a few days ago to say I thought I had it. But it
turned out there was another problem, and that was Odoo (formerly known
as OpenERP), which is also running in a reverse proxy configuration
behind nginx. Whenever someone uses Odoo on my server, it absolutely
destroys performance for my non-Odoo website. That would have been
really easy to test, and I did try stopping the odoo service early on,
but I ruled it out when the problem persisted with Odoo stopped. I now
realize that must have been because of the apache2 problem.
|
So this was much harder to figure out because I had multiple problems
interacting with each other.

- Grant |