Linearly increasing delays in HTTPS proxy CONNECTS / 3.5.20


Ilari Laitinen
Hi!

I am experiencing a slowdown between CONNECT requests and corresponding "200 Connection established" responses. This happens when performance testing a load-balanced pair of squid servers with hundreds of simultaneous, independent, proxied HTTPS requests per second per server.

Most new connections get delayed by an increasing amount of time, often rising to more than 5 seconds. This obviously fails the performance test. Existing connections seem to be unaffected. The problematic situation lasts for a couple of seconds, with strikingly linearly increasing delays. Expected operation resumes afterwards with a notable spike in traffic. Sometimes the delay stays at a specific level for a couple of seconds before the issue resolves itself.

squid 3.5.20 on CentOS 7

I have recorded the traffic using tcpdump and it boils down to the following. Everything else is very, very fast even during these slowdown periods.


Legend:
“source” is one of the servers where the load test originates
“squid” is the server where squid runs
“target” is a cloud service

source —> squid [SYN]
source <— squid [SYN, ACK]
source —> squid [ACK]
source —> squid [CONNECT target:443 HTTP/1.1]
source <— squid [ACK]

[Unexpected delay here]

squid —> target [SYN]
squid <— target [SYN, ACK]
squid —> target [ACK]
source <— squid [HTTP/1.1 200 Connection established]
source —> squid [ACK]

[Rest of the connection here without such delays]
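
For reference, the capture was along these lines (the interface choice and Squid's listening port, 3128 being the compiled-in default, are assumptions about this particular setup):

    tcpdump -ni any -w connect-delays.pcap 'tcp port 3128 or tcp port 443'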


System load stays at a comfortable level (around 0.6 even while experiencing the issue), and memory is not an issue. From SNMP data, I noticed small but consistent spikes in squid's disk cache usage coinciding with the issue at hand. This seemed strange, given there was no other traffic during the tests and proxied HTTPS means there's nothing to cache (right?). I nevertheless tried switching the cache from ufs to aufs and also using the no-cache option with ufs. Didn't help. (And the spikes remained…)
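
For the record, the ufs-to-aufs switch amounts to a one-line change in squid.conf (the path and size values here are illustrative, not necessarily ours):

    # before: cache_dir ufs  /var/spool/squid 10000 16 256
    cache_dir aufs /var/spool/squid 10000 16 256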

I tried increasing the net.ipv4.ip_local_port_range sysctl value (it was set to default). That didn’t help, either.
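
For concreteness, this is the kind of check and change involved (the exact numbers are illustrative):

    # show the current ephemeral port range (the stock default is around 32768-61000)
    sysctl net.ipv4.ip_local_port_range
    # widen it
    sysctl -w net.ipv4.ip_local_port_range='15000 64999'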

The servers are located in an IPv4-only local network. Every outgoing request is supposed to be IPv4. The servers do have IPv6 interfaces but there is no traffic at all. Squid periodically queries AAAA records. Is it possible that new connections get queued while squid is busy trying to use IPv6 after receiving the new AAAAs? I have very little control over the environment. Is dns_v4_first worth a try in my scenario?

Is there something I’m missing?
What should I look into next? Could setting up "workers N" help, for example?
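
For concreteness, the directives I have in mind, as they would appear in squid.conf (the worker count is arbitrary):

    dns_v4_first on
    workers 2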

Best Regards,

--
Ilari Laitinen


Re: Linearly increasing delays in HTTPS proxy CONNECTS / 3.5.20

Alex Rousskov
On 8/20/19 9:12 AM, Ilari Laitinen wrote:

> I am experiencing a slowdown between CONNECT requests and corresponding "200 Connection established" responses.

I assume you are not using any SslBump features, but please correct me if
I am wrong.


> source —> squid [SYN]
> source <— squid [SYN, ACK]
> source —> squid [ACK]
> source —> squid [CONNECT target:443 HTTP/1.1]
> source <— squid [ACK]
>
> [Unexpected delay here]
>
> squid —> target [SYN]
> squid <— target [SYN, ACK]
> squid —> target [ACK]
> source <— squid [HTTP/1.1 200 Connection established]
> source —> squid [ACK]
>
> [Rest of the connection here without such delays]


> I noticed small but consistent spikes in squid's disk cache usage
> coinciding with the issue at hand. This seemed strange, given there
> was no other traffic during the tests and proxied HTTPS means there's
> nothing to cache (right?).

Correct. To avoid suspecting disks, configure Squid to log to a
RAM-based partition and remove cache_dirs [until you solve the problem].
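
A minimal sketch of such a test setup (the mount point and size are placeholders):

    # put Squid's logs on a RAM-based filesystem
    mkdir -p /var/log/squid-ram
    mount -t tmpfs -o size=512m tmpfs /var/log/squid-ram

    # squid.conf: point the logs there and configure no cache_dir at all;
    # without any cache_dir line, Squid 3.5 keeps no disk cache
    cache_log /var/log/squid-ram/cache.log
    access_log /var/log/squid-ram/access.log squid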


> The servers are located in an IPv4-only local network. Every outgoing
> request is supposed to be IPv4. The servers do have IPv6 interfaces
> but there is no traffic at all. Squid periodically queries AAAA
> records. Is it possible that new connections get queued while squid
> is busy trying to use IPv6 after receiving the new AAAAs?

If "no traffic at all" means "zero IPv6 packets", then it is not
possible. Otherwise, it is possible (only the latest Squid, i.e. the
future v5, does not have this kind of problem).


> I have very little control over the environment. Is dns_v4_first
> worth a try in my scenario?

It is not a reliable solution, but it would not hurt as far as
performance is concerned.


> What should I look into next?

1. Check system logs.

2. Check atop output while the problem is present. If this is a resource
bottleneck, atop may expose it.

3. If there is IPv6 traffic, to eliminate IPv6 as a suspect, you can
disable IPv6 on the box, use Squid built without IPv6 support, or even
use a DNS forwarder that, for example, rejects all AAAA queries. All
these solutions can and should be validated by examining actual IPv6
traffic. And none of them are needed if there is no IPv6 traffic at all.

4. With delays ranging into _seconds_ it should be fairly easy for a
capable Squid developer to figure out what your Squid is doing by
looking at Squid debugging logs. You can post a link to compressed
cache.log here for analysis, but you should first simplify your workload
so that it has CONNECT tunnels and nothing else (if you have not
already) and enable debugging when the problem is present (e.g., use
"squid -k debug" although it is currently better to send the right
signal manually).
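
A minimal sketch of items 3 and 4 above (the PID is a placeholder):

    # item 3: disable IPv6 at the kernel level, then re-verify with tcpdump
    sysctl -w net.ipv6.conf.all.disable_ipv6=1
    sysctl -w net.ipv6.conf.default.disable_ipv6=1

    # item 4: "squid -k debug" toggles full debugging by sending SIGUSR2;
    # the equivalent manual signal, aimed at the exact process you want:
    kill -USR2 <squid-pid>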


> Could setting up "workers N" help, for example?

The answer depends on your definition of "help": a large number of workers
may mask the problem to the point where it no longer bothers you, but I
would not make the setup a lot more complex until you know where the
current bottleneck is. In fact, I would go in the opposite direction
and make the setup as simple as possible!



HTH,

Alex.

Re: Linearly increasing delays in HTTPS proxy CONNECTS / 3.5.20

Ilari Laitinen
>> I noticed small but consistent spikes in squid's disk cache usage
>> coinciding with the issue at hand. This seemed strange, given there
>> was no other traffic during the tests and proxied HTTPS means there's
>> nothing to cache (right?).
>
> Correct. To avoid suspecting disks, configure Squid to log to a
> RAM-based partition and remove cache_dirs [until you solve the problem].

The behaviour persisted with this configuration. Not a disk bottleneck, then.

>> The servers are located in an IPv4-only local network. Every outgoing
>> request is supposed to be IPv4. The servers do have IPv6 interfaces
>> but there is no traffic at all. Squid periodically queries AAAA
>> records. Is it possible that new connections get queued while squid
>> is busy trying to use IPv6 after receiving the new AAAAs?
>
> If "no traffic at all" means "zero IPv6 packets", then it is not
> possible. Otherwise, it is possible (only the latest Squid, i.e. the
> future v5, does not have this kind of problem).
>
>> I have very little control over the environment. Is dns_v4_first
>> worth a try in my scenario?
>
> It is not a reliable solution, but it would not hurt as far as
> performance is concerned.

Tried dns_v4_first; the problem persisted. I also noticed that the platform (in general) tries to resolve IPv6 first, but the TCP dumps contain no IPv6 packets at all. This is baffling, because there were indeed some unrelated open IPv6 connections on the Squid server (reported by netstat).

>> What should I look into next?
>
> 1. Check system logs.

Nothing out of the ordinary.

> 2. Check atop output while the problem is present. If this is a resource
> bottleneck, atop may expose it.

No bottlenecks identified this way.

> 3. If there is IPv6 traffic, to eliminate IPv6 as a suspect, you can
> disable IPv6 on the box, use Squid built without IPv6 support, or even
> use a DNS forwarder that, for example, rejects all AAAA queries. All
> these solutions can and should be validated by examining actual IPv6
> traffic. And none of them are needed if there is no IPv6 traffic at all.

This is something I may need to look into further.

> 4. With delays ranging into _seconds_ it should be fairly easy for a
> capable Squid developer to figure out what your Squid is doing by
> looking at Squid debugging logs. You can post a link to compressed
> cache.log here for analysis, but you should first simplify your workload
> so that it has CONNECT tunnels and nothing else (if you have not
> already) and enable debugging when the problem is present (e.g., use
> "squid -k debug" although it is currently better to send the right
> signal manually).

I unfortunately cannot share the debug log because it contains some sensitive information. We nevertheless recorded what ended up being a huge sample.

I suspect Squid might be waiting for local TCP ports from the kernel (or something related). Why do I think this?

Right now, there are four different IP addresses returned for the target cloud service. For practical purposes, they are returned in a random order. The traffic would ideally be spread over all of them. Unfortunately, it is evident both from the debug log and from the TCP dump that Squid is using only one of the addresses at a time. The number of connections in the TIME_WAIT state for that single IP address gets very close to the maximum defined by the net.ipv4.ip_local_port_range sysctl. After a while (a minute or so in the recording), this address changes, presumably in response to a new DNS query result.
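
For reference, this is the kind of tally I used, counting TIME_WAIT entries per remote address with the same netstat:

    netstat -tan | awk '$6 == "TIME_WAIT" {split($5, a, ":"); print a[1]}' |
        sort | uniq -c | sort -rn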

Could this be the bottleneck? Is there a way to configure Squid to use all returned IP addresses?

One possible workaround that I can think of is setting a short positive_dns_ttl, but this doesn’t fully guarantee an even distribution, now does it?
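
For concreteness, I mean something like this in squid.conf (the value is a guess; the documentation says it should not be set lower than negative_dns_ttl):

    positive_dns_ttl 1 minute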

>> Could setting up "workers N" help, for example?
>
> The answer depends on your definition of "help": a large number of workers
> may mask the problem to the point where it no longer bothers you, but I
> would not make the setup a lot more complex until you know where the
> current bottleneck is. In fact, I would go in the opposite direction
> and make the setup as simple as possible!

Thanks, this is what I wanted to hear. :)


Best,

--
Ilari Laitinen


Re: Linearly increasing delays in HTTPS proxy CONNECTS / 3.5.20

Alex Rousskov
On 8/30/19 8:16 AM, Ilari Laitinen wrote:

> I also noticed that the platform (in general) tries to resolve IPv6
> first, but the TCP dumps contain no IPv6 packets at all. This is
> baffling, because there were indeed some unrelated open IPv6
> connections on the Squid server (reported by netstat).

You may be able to validate your packet collection rules by adjusting
them to include those known IPv6 connections/ports. Perhaps you are just
not collecting IPv6 traffic.
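
For example, a capture that matches all IPv6 packets regardless of port or address (pick the interface as appropriate):

    tcpdump -ni any ip6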

It is also possible that Squid gets AAAA records but never uses them
because Squid thinks that IPv6 is disabled on your server.


> I unfortunately cannot share the debug log because it contains some
> sensitive information. We nevertheless recorded what ended up being a
> huge sample.

If you hire a Squid developer to help you, they should be willing to
sign a reasonable NDA and/or view data on your servers, without copying.
IMHO, it does not make much sense to sit on likely valuable direct
information while, at the same time, spending a lot of time chasing
distant echoes of that same information elsewhere!


> I suspect Squid might be waiting for local TCP ports from the kernel
> (or something related).

IIRC, the ephemeral source port allocator is instantaneous -- Squid either
gets a port or a port allocation error, without waiting. When we
overload the server with high-performance tests (without an explicit
port manager), we see port allocation errors rather than stalled tests.
However, perhaps that is not true in your OS/environment.


> Right now, there are four different IP addresses returned for the
> target cloud service. For practical purposes, they are returned in a
> random order. The traffic would ideally be spread over all of them.
> Unfortunately, it is evident both from the debug log and from the TCP
> dump that Squid is using only one of the addresses at a time. The
> number of connections in the TIME_WAIT state for that single IP
> address gets very close to the maximum defined by the
> net.ipv4.ip_local_port_range sysctl. After a while (a minute or so in
> the recording), this address changes, presumably in response to a new
> DNS query result.

In theory, Squid should round-robin across all destination IP addresses
for a single host name. If your Squid v3 does not, it is probably a
Squid bug that can be fixed [by upgrading].

That said, IIRC, the notion of "round robin" is rather vague in Squid
because there are several places where an IP may be requested for the
same host name inside the same transaction. I would not be surprised if
that low-level round-robin behavior results in the same IP being used
for most transactions in some environments (until an error or a new DNS
query reshuffles the IPs). Debugging logs may expose this problem.


> Could this be the bottleneck?

I would expect the lack of ports to lead to errors, not stalled
transactions. However, there may be some hidden dependency that I am
missing. For example: the lack of ports leads to errors; the errors are
not logged where you can see them, but they trigger excessive DNS
retries and/or Squid bugs that result in delays.


> One possible workaround that I can think of is setting a short
> positive_dns_ttl, but this doesn’t fully guarantee an even
> distribution, now does it?

No, it does not. Moreover, Squid v3 had some TTL handling bugs that were
fixed (in v4 and later code) by the Happy Eyeballs project. Taking all
the known problems into account, it is difficult for me to predict
the effect of changing TTLs. That said, it does not hurt to try! Maybe
you will be lucky, and a simple configuration change will remove the
cause of increasing transaction delays.


HTH,

Alex.

Re: Linearly increasing delays in HTTPS proxy CONNECTS / 3.5.20

Ilari Laitinen
> On 30 Aug 2019, at 16.40, Alex Rousskov <[hidden email]> wrote:
>
> On 8/30/19 8:16 AM, Ilari Laitinen wrote:
>
>> I suspect Squid might be waiting for local TCP ports from the kernel
>> (or something related).
>
> IIRC, the ephemeral source port allocator is instantaneous -- Squid either
> gets a port or a port allocation error, without waiting. When we
> overload the server with high-performance tests (without an explicit
> port manager), we see port allocation errors rather than stalled tests.
> However, perhaps that is not true in your OS/environment.

Ah yes, of course. I actually saw this error first-hand when setting up the test environment.

>> Right now, there are four different IP addresses returned for the
>> target cloud service. For practical purposes, they are returned in a
>> random order. The traffic would ideally be spread over all of them.
>> Unfortunately, it is evident both from the debug log and from the TCP
>> dump that Squid is using only one of the addresses at a time. The
>> number of connections in the TIME_WAIT state for that single IP
>> address gets very close to the maximum defined by the
>> net.ipv4.ip_local_port_range sysctl. After a while (a minute or so in
>> the recording), this address changes, presumably in response to a new
>> DNS query result.
>
> In theory, Squid should round-robin across all destination IP addresses
> for a single host name. If your Squid v3 does not, it is probably a
> Squid bug that can be fixed [by upgrading].
>
> That said, IIRC, the notion of "round robin" is rather vague in Squid
> because there are several places where an IP may be requested for the
> same host name inside the same transaction. I would not be surprised if
> that low-level round-robin behavior results in the same IP being used
> for most transactions in some environments (until an error or a new DNS
> query reshuffles the IPs). Debugging logs may expose this problem.

I looked into this further. All our TCP dumps so far show the same thing: Squid uses (almost always) exactly one remote address at a time. The increasing delays start right after Squid has switched to a new remote IP and last precisely until another switch happens (typically the next one). The problem does not occur every time and is not limited to a single target IP.

Now that I know that this is not expected and is possibly related to a bug, I’ll look into upgrading Squid from the platform default.

>> One possible workaround that I can think of is setting a short
>> positive_dns_ttl, but this doesn’t fully guarantee an even
>> distribution, now does it?
>
> No, it does not. Moreover, Squid v3 had some TTL handling bugs that were
> fixed (in v4 and later code) by the Happy Eyeballs project. Taking all
> the known problems into account, it is difficult for me to predict
> the effect of changing TTLs. That said, it does not hurt to try! Maybe
> you will be lucky, and a simple configuration change will remove the
> cause of increasing transaction delays.

Thank you very much for your informed and timely replies!

I’ll report our results here.

Best,

--
Ilari Laitinen

