
squid 3.2.0.5 smp scaling issues

squid 3.2.0.5 smp scaling issues

david-2
test setup

box A running apache and ab

test against local IP address >13000 requests/sec

box B running squid, 8 2.3 GHz Opteron cores with 16G ram

non acl/cache-peer related lines in the config are (including typos from
me manually entering this)

http_port 8000
icp_port 0
visible_hostname gromit1
cache_effective_user proxy
cache_effective_group proxy
append_domain .invalid.server.name
pid_filename /var/run/squid.pid
cache_dir null /tmp
client_db off
cache_access_log syslog squid
cache_log /var/log/squid/cache.log
cache_store_log none
coredump_dir none
no_cache deny all
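
(the worker counts in the results below were presumably set with squid
3.2's workers directive, which is not shown in the config above; e.g.:)

workers 4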


results when requesting short html page
squid 3.0.STABLE12 4200 requests/sec
squid 3.1.11 2100 requests/sec
squid 3.2.0.5 1 worker 1400 requests/sec
squid 3.2.0.5 2 workers 2100 requests/sec
squid 3.2.0.5 3 workers 2500 requests/sec
squid 3.2.0.5 4 workers 2900 requests/sec
squid 3.2.0.5 5 workers 2900 requests/sec
squid 3.2.0.5 6 workers 2500 requests/sec
squid 3.2.0.5 7 workers 2000 requests/sec
squid 3.2.0.5 8 workers 1900 requests/sec

in all these tests the squid process was using 100% of the cpu

I tried it pulling a large file (100K instead of <50 bytes) on the theory
that it might be bottlenecking on accepting the connections, and that with
something that took more time to service it could do better. However, what
I found is that with 8 workers all 8 were using <50% of the CPU at 1000
requests/sec

local machine would do 7000 requests/sec to itself

1 worker 500 requests/sec
2 workers 957 requests/sec

from there it remained at about 1000 requests/sec, with the cpu
utilization slowly dropping off (but not dropping as fast as it should
given the number of cores available)

so it looks like there is some significant bottleneck in version 3.2 that
makes the SMP support fairly ineffective.


in reading the wiki page at wiki.squid-cache.org/Features/SmpScale I see
you worrying about fairness between workers. If you have put in code to
try and ensure fairness, you may want to remove it and see what happens to
performance. What you are describing on that page in terms of fairness is
what I would expect from a 'first-come-first-served' approach to multiple
processes grabbing new connections. The worker that last ran is hot in the
cache and so has an 'unfair' advantage in noticing and processing the new
request, but as that worker gets busier, it will be spending more time
servicing requests and the other processes will get more of a chance to
grab the new connection, so it will appear unfair under light load but
become more fair under heavy load.

David Lang

Re: squid 3.2.0.5 smp scaling issues

david-2
re-sending and adding -dev list

performance drops going from 3.0 -> 3.1 -> 3.2, and in addition squid 3.2
scales poorly (it only reaches about 2x single-worker performance at 4
cores and drops off again beyond that)

the result is that I actually get better performance on 3.0 than on 3.2,
even with multiple workers

David Lang


Re: squid 3.2.0.5 smp scaling issues

david-2
still no response from anyone.

Is there any interest in investigating this issue? Or should I just write
off squid for future use due to its degrading performance?

David Lang


Re: squid 3.2.0.5 smp scaling issues

Amos Jeffries
Administrator
On 03/04/11 12:52, [hidden email] wrote:
> still no response from anyone.
>
> Is there any interest in investigating this issue? or should I just
> write off squid for future use due to it's performance degrading?

It is a very ambiguous issue...
  * We have your report with some nice rate benchmarks indicating regression
  * We have two others saying me-too with less details
  * We have an independent report indicating that 3.1 is faster than
2.7. With benchmarks to prove it.
  * We have several independent reports indicating that 3.2 is faster
than 3.1. One like yours with benchmark proof.
  * We have someone responding to your report saying the CPU type
affects things in a large way (likely due to SMP using CPU-level features)
  * We have our own internal testing which shows also a mix of results
with the variance being dependent on which component of Squid is tested.

Your test in particular is testing both the large object pass-thru
(proxy only) capacity and the parser CPU ceiling.

Could you try your test on 3.2.0.6 and 3.1.12 please? They both now have
a server-facing buffer change which should directly affect your test
results in a good way.

Amos
--
Please be using
   Current Stable Squid 2.7.STABLE9 or 3.1.12
   Beta testers wanted for 3.2.0.6

Re: squid 3.2.0.5 smp scaling issues

david-2

thanks for the response, part of my frustration was just not hearing
anything back.

I'll do the tests on the new versions shortly (hopefully on Monday)

if there are other tests that people would like me to perform on the
hardware I have available, please let me know.

right now I am just testing proxy/ACL with no caching, but I am testing
four traffic types

1. small static files
2. large static files
3. small dynamic files (returning the exact same data as 1, but only after
a fixed delay)
4. large dynamic files.

while I see a dramatic difference in the performance on the different
tests, so far the ratios between the different versions have been
consistent across all four scenarios.

David Lang

Re: squid 3.2.0.5 smp scaling issues

david-2
sorry for the delay. I got a chance to do some more testing (slightly
different environment on the apache server, so these numbers are a
little lower for the same versions than the last ones I posted)

results when requesting short html page


squid 3.0.STABLE12 4000 requests/sec
squid 3.1.11 1500 requests/sec
squid 3.1.12 1530 requests/sec
squid 3.2.0.5 1 worker 1300 requests/sec
squid 3.2.0.5 2 workers 2050 requests/sec
squid 3.2.0.5 3 workers 2700 requests/sec
squid 3.2.0.5 4 workers 2950 requests/sec
squid 3.2.0.5 5 workers 2900 requests/sec
squid 3.2.0.5 6 workers 2530 requests/sec
squid 3.2.0.6 1 worker 1400 requests/sec
squid 3.2.0.6 2 workers 2050 requests/sec
squid 3.2.0.6 3 workers 2730 requests/sec
squid 3.2.0.6 4 workers 2950 requests/sec
squid 3.2.0.6 5 workers 2830 requests/sec
squid 3.2.0.6 6 workers 2530 requests/sec
squid 3.2.0.6 7 workers 2160 requests/sec (instead of all processes being at 100%, several were at 99%)
squid 3.2.0.6 8 workers 1950 requests/sec (instead of all processes being at 100%, some were as low as 92%)

so the new versions are really about the same

moving to large requests cut these numbers by about 1/3, but the squid
processes were not maxing out the CPU

one issue I saw: on 3.2 I had to reduce the number of concurrent
connections or requests would time out (3.2 vs earlier versions); I had to
keep ab's -c at ~100-150, where I could go significantly higher on 3.1 and
3.0
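
for reference, the sort of ab run this implies looks something like the
following (the request count and URL are placeholders, and I'm assuming
ab's -X option to point it at the proxy):

ab -n 100000 -c 150 -X gromit1:8000 http://boxA/short.html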

David Lang

Re: squid 3.2.0.5 smp scaling issues

Amos Jeffries
Administrator

Thank you.
  So with small files the gain is about 2% on 3.1 (1500 -> 1530) and ~7%
on 3.2 with a single worker (1300 -> 1400), but under 1% with multiple 3.2
workers.
  And the large-file runs are overloading/flooding the I/O bandwidth.

NP: when overloading I/O one cannot compare runs with different object
sizes, only runs with the same traffic. Also, only the CPU max load is
reliable there, since requests/sec bottlenecks behind the I/O.
  So... your measurement that CPU usage dropped is a good sign for large
files.

Amos
--
Please be using
   Current Stable Squid 2.7.STABLE9 or 3.1.12
   Beta testers wanted for 3.2.0.6

Re: squid 3.2.0.5 smp scaling issues

david-2
A couple more things about the ACLs used in my test

all of them are allow ACLs (no deny rules to worry about precedence of)
except for a deny-all at the bottom

the ACL line that permits the test source to the test destination has zero
overlap with the rest of the rules

every rule has an IP based restriction (even the ones with url_regex are
source -> URL regex)

I moved the ACL that allows my test from the bottom of the ruleset to the
top, and the resulting performance numbers went up as if the other ACLs
didn't exist. As such it is very clear that 3.2 is evaluating every rule.

I changed one of the url_regex rules to just match one line rather than a
file containing 307 lines to see if that made a difference, and it made no
significant difference. So this indicates to me that it's not having to
fully evaluate every rule (it's able to skip doing the regex if the IP
match doesn't work)

I then changed all the acl lines that used hostnames to use IP addresses
instead, and this also made no significant difference.

I then changed all subnet matches to single IP addresses (just nuked /##
throughout the config file), and this also made no significant difference.



so why are the address matches so expensive?

and as noted in the e-mail below, why do these checks not scale nicely
with the number of worker processes? If they did, the fact that one
3.2 process is about 1/3 the speed of a 3.0 process in checking the acls
wouldn't matter nearly as much when it's so easy to get an 8+ core
system.




it seems to me that all accept/deny rules in a set should be able to be
combined into a tree to make searching them very fast.

so for example if you have

accept 1
accept 2
deny 3
deny 4
accept 5

you need to create three trees (one with accept 1 and accept 2, one with
deny3 and deny4, and one with accept 5) and then check each tree to see
if you have a match.
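
to make this concrete, a rough sketch of the grouping idea (made-up types
and names, nothing to do with squid's real code) could look like:

// group consecutive http_access lines that share the same action, then
// give each group a sorted array of source addresses that can be
// binary-searched instead of walked linearly
#include <algorithm>
#include <cstdint>
#include <vector>

enum class Action { Allow, Deny };

struct RuleGroup {
    Action action;
    std::vector<uint32_t> src;        // sorted IPv4 sources for this block

    bool matches(uint32_t client) const {
        return std::binary_search(src.begin(), src.end(), client);
    }
};

// consecutive rules with the same action collapse into one group;
// evaluation still honours first-match-wins across the groups
Action check(const std::vector<RuleGroup> &groups, uint32_t client) {
    for (const RuleGroup &g : groups)
        if (g.matches(client))
            return g.action;
    return Action::Deny;              // the deny-all at the bottom
}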

the types of match could be done in order of increasing cost, so if you
have acl entries of type port, src, dst, and url regex, organize the tree
so that you check ports first, then src, then dst, then only if all that
matches do you need to do the regex. This would be very similar to the
shortcut logic that you use today with a single rule where you bail out
when you don't find a match.

you could go with a complex tree structure, but since this only needs to
be changed at boot time, it seems to me that a simple array that you can
do a binary search on will work for the port, src, and dst trees. The url
regex is probably easiest to initially create by just doing a list of
regex strings to match and working down that list, but eventually it may
be best to create a parse tree so that you only have to walk down the
string once to see if you have a match.

you wouldn't quite be able to get it this fast, as you would have to
actually do two checks at each level: one for rules that specify something
in the current tree and one for the rules that don't (one check for
http_access lines that specify a port number and one for those that don't,
for example)

this sort of acl structure would reduce a complex ruleset down to ~O(log
n) instead of the current O(n) (a really complex ruleset would be log n of
each tree added together)

there are cases where this sort of thing would be more expensive than the
current, simple approach, but those would be on simple rulesets which
aren't affected much by a few extra checks.


David Lang


On Fri, 8 Apr 2011, [hidden email] wrote:

> I did some more testing with this, and it looks like the bottleneck is in the
> ACL checking.
>
> if I remove all the ACLs (except the one I actually use for testing), I am
> able to get 16,750 requests/sec with 3.2.0.5 on 8 workers, with them all only
> using ~30% cpu (I think this is the limit of the apache server I am hitting
> behind squid)
>
> I have the following ACLs defined
>
> port 13
> src 89
> dst 173
> url_regex 338
>
> used in 292 http_access rules
>
> so what has changed since 3.0 in terms of the ACL handling to slow it down so
> much? And why do multiple processes kill scaling so badly when they should all
> be busy checking ACLs? (does each process lock the table of ACLs or somehow
> block other threads from doing checks?) This would seem like the problem
> space that is ideal for multiple processes: each has its own copy of
> the ACL rules, gets a connection, and then does its own checking with no need
> to communicate with the other processes at all.
>
> now the performance numbers
>
> with the minimal ACLs
>
> 3.2.0.5 with 1 worker gets 3300 requests/sec
> 3.2.0.5 with 2 workers gets 8400 requests/sec
> 3.2.0.5 with 3 workers gets 10,800 requests/sec
> 3.2.0.5 with 4 workers gets 13,600 requests/sec
> 3.2.0.5 with 5 workers gets 15,700 requests/sec
> 3.2.0.5 with 6 workers gets 16,400 requests/sec
> 3.2.0.6 with 1 worker gets 4400 requests/sec
> 3.2.0.6 with 2 workers gets 8400 requests/sec
> 3.2.0.6 with 3 workers gets 11,300 requests/sec
> 3.2.0.6 with 4 workers gets 15,600 requests/sec
> 3.2.0.6 with 5 workers gets 15,800 requests/sec
> 3.2.0.6 with 6 workers gets 16,400 requests/sec
>
> David Lang

Re: squid 3.2.0.5 smp scaling issues

Amos Jeffries
Administrator
On 09/04/11 14:27, [hidden email] wrote:

> A couple more things about the ACLs used in my test
>
> all of them are allow ACLs (no deny rules to worry about precidence of)
> except for a deny-all at the bottom
>
> the ACL line that permits the test source to the test destination has
> zero overlap with the rest of the rules
>
> every rule has an IP based restriction (even the ones with url_regex are
> source -> URL regex)
>
> I moved the ACL that allows my test from the bottom of the ruleset to
> the top and the resulting performance numbers were up as if the other
> ACLs didn't exist. As such it is very clear that 3.2 is evaluating every
> rule.
>
> I changed one of the url_regex rules to just match one line rather than
> a file containing 307 lines to see if that made a difference, and it
> made no significant difference. So this indicates to me that it's not
> having to fully evaluate every rule (it's able to skip doing the regex
> if the IP match doesn't work)
>
> I then changed all the acl lines that used hostnames to have IP
> addresses in them, and this also made no significant difference
>
> I then changed all subnet matches to single IP address (just nuked /##
> throughout the config file) and this also made no significant difference.
>

Squid has always worked this way. It will *test* every rule from the top
down to the one that matches. Also testing each line left-to-right until
one fails or the whole line matches.

>
> so why are the address matches so expensive
>

3.0 and older IP address is a 32-bit comparison.
3.1 and newer IP address is a 128-bit comparison with memcmp().

If something like a word-wise comparison can be implemented faster than
memcmp() we would welcome it.


> and as noted in the e-mail below, why do these checks not scale nicely
> with the number of worker processes? If they did, the fact that one 3.2
> process is about 1/3 the speed of a 3.0 process in checking the acls
> wouldn't matter nearly as much when it's so easy to get an 8+ core system.
>

There you have the unknown.

>
> it seems to me that all accept/deny rules in a set should be able to be
> combined into a tree to make searching them very fast.
>
> so for example if you have
>
> accept 1
> accept 2
> deny 3
> deny 4
> accept 5
>
> you need to create three trees (one with accept 1 and accept 2, one with
> deny3 and deny4, and one with accept 5) and then check each tree to see
> if you have a match.
>
> the types of match could be done in order of increasing cost, so if you

The config file is a specific structure, configured by the admin under
guaranteed rules of operation for access lines (top-down, left-to-right,
first-match-wins), to perform boolean-logic calculations using ACL sets.
  Sorting access line rules is not an option.
  Sorting ACL values and tree-forming them is already done (regex being
the one exception AFAIK).
  Sorting position-wise on a single access line is also ruled out by
interactions with deny_info, auth and external ACL types.


> have acl entries of type port, src, dst, and url regex, organize the
> tree so that you check ports first, then src, then dst, then only if all
> that matches do you need to do the regex. This would be very similar to
> the shortcut logic that you use today with a single rule where you bail
> out when you don't find a match.
>
> you could go with a complex tree structure, but since this only needs to
> be changed at boot time,

Um, "boot"/startup time and arbitrary "-k reconfigure" times.
With a reverse-configuration display dump on any cache manager request.

> it seems to me that a simple array that you can
> do a binary search on will work for the port, src, and dst trees. The
> url regex is probably easiest to initially create by just doing a list
> of regex strings to match and working down that list, but eventually it

This is already how we do these. But with a splay tree instead of binary.

> may be best to create a parse tree so that you only have to walk down
> the string once to see if you have a match.

That would be nice. Care to implement?
  You just have to get the regex library to adjust its pre-compiled
patterns with OR into (existing|new) whenever a new pattern string is
added to an ACL.
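
As a toy illustration of that merging only (std::regex used here purely
for brevity):

#include <regex>
#include <string>
#include <vector>

// join the individual patterns of one ACL into a single alternation and
// compile it once, so each URL is scanned a single time
std::regex mergedRegex(const std::vector<std::string> &patterns) {
    std::string merged;
    for (const std::string &p : patterns)
        merged += merged.empty() ? "(" + p + ")" : "|(" + p + ")";
    return std::regex(merged);
}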

>
> you wouldn't quite be able to get this fast as you would have to
> actually do two checks, one if you have a match on that level and one
> for the rules that don't specify something in the current tree (one
> check for if the http_access line specifies a port number and one for if
> it doesn't for example)

We get around this problem by using C++ types. ACLChecklist walks the
tree holding the current location, expected result, and all details
available about the transaction. Each node in the tree has a match()
function which gets called at most once per walk. Each ACL data type
provides its own match() algorithm.

That is why the following config is invalid:
  acl foo src 1.2.3.4
  acl foo port 80
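
A very rough sketch of that shape (illustrative only; the real classes
live under src/acl/ and are considerably more involved):

#include <memory>
#include <string>
#include <vector>

struct Checklist {                    // stands in for the transaction details
    std::string srcIp;
    std::string url;
};

struct AclNode {
    virtual ~AclNode() = default;
    // called at most once per walk; each ACL data type supplies its own test
    virtual bool match(const Checklist &ch) const = 0;
};

struct SrcIpAcl : AclNode {
    std::vector<std::string> addrs;   // the real code keeps these in a splay tree
    bool match(const Checklist &ch) const override {
        for (const std::string &a : addrs)
            if (a == ch.srcIp)
                return true;
        return false;
    }
};

// one http_access line: every node must match, left-to-right,
// bailing out on the first failure
bool lineMatches(const std::vector<std::unique_ptr<AclNode>> &line,
                 const Checklist &ch) {
    for (const auto &node : line)
        if (!node->match(ch))
            return false;
    return true;
}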

>
> this sort of acl structure would reduce a complex ruleset down to ~O(log
> n) instead of the current O(n) (a really complex ruleset would be log n
> of each tree added togeather)
>
> there are cases where this sort of thing would be more expensive than
> the current, simple approach, but those would be on simple rulesets
> which aren't affected much by a few extra checks.

Um, you have pretty much described our existing code. Even with the
details of *how* hidden away, the small bit exposed to the admin is fairly
complex.


  I am thinking you did not understand ACLs very well earlier when
designing your config rules. With large rulesets that can easily lead to
an inefficient config and the worst-case results you seem to have achieved.
  If you care to share your actual configuration file contents I'm happy
to read through and point out any optimizations that can be made.
  Though you may want to use the above info and see if you can find any
optimizations first.



The ACL storage types are all defined in the src/acl/*Data.* source
files. If you wish to work on finding us some faster types, or even a
faster matching algorithm for an existing type, that would be welcome.
  We do ask for some unit/micro-benchmarks of the old vs new match()
function so we know the change is an actual improvement.
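
A bare-bones shape for such a micro-benchmark (both match functions being
compared are placeholders):

#include <chrono>

// time a candidate match() implementation over many iterations
template <typename Fn>
double secondsFor(Fn fn, int iterations) {
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        fn();
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

// usage (oldMatch/newMatch wrap the two implementations under test):
//   double told = secondsFor(oldMatch, 1000000);
//   double tnew = secondsFor(newMatch, 1000000);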

Amos
--
Please be using
   Current Stable Squid 2.7.STABLE9 or 3.1.12
   Beta testers wanted for 3.2.0.6

Re: squid 3.2.0.5 smp scaling issues

david-2
On Sat, 9 Apr 2011, Amos Jeffries wrote:

> On 09/04/11 14:27, [hidden email] wrote:
>> A couple more things about the ACLs used in my test
>>
>> all of them are allow ACLs (no deny rules to worry about precidence of)
>> except for a deny-all at the bottom
>>
>> the ACL line that permits the test source to the test destination has
>> zero overlap with the rest of the rules
>>
>> every rule has an IP based restriction (even the ones with url_regex are
>> source -> URL regex)
>>
>> I moved the ACL that allows my test from the bottom of the ruleset to
>> the top and the resulting performance numbers were up as if the other
>> ACLs didn't exist. As such it is very clear that 3.2 is evaluating every
>> rule.
>>
>> I changed one of the url_regex rules to just match one line rather than
>> a file containing 307 lines to see if that made a difference, and it
>> made no significant difference. So this indicates to me that it's not
>> having to fully evaluate every rule (it's able to skip doing the regex
>> if the IP match doesn't work)
>>
>> I then changed all the acl lines that used hostnames to have IP
>> addresses in them, and this also made no significant difference
>>
>> I then changed all subnet matches to single IP address (just nuked /##
>> throughout the config file) and this also made no significant difference.
>>
>
> Squid has always worked this way. It will *test* every rule from the top down
> to the one that matches. Also testing each line left-to-right until one fails
> or the whole line matches.
>
>>
>> so why are the address matches so expensive
>>
>
> 3.0 and older IP address is a 32-bit comparison.
> 3.1 and newer IP address is a 128-bit comparison with memcmp().
>
> If something like a word-wise comparison can be implemented faster than
> memcmp() we would welcome it.

I wonder if there should be a different version that's used when IPv6 is
disabled; this is a pretty large hit.

If the data is aligned properly, on a 64-bit system this should still only
be 2 compares. Do you do any alignment on the data now?
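
something like this sketch is what I have in mind (memcpy keeps it legal
even without alignment guarantees, and compilers turn it into two plain
64-bit loads):

#include <cstdint>
#include <cstring>

// compare two 16-byte (IPv6-sized) addresses as two 64-bit words
inline bool sameAddress(const unsigned char *a, const unsigned char *b) {
    uint64_t a0, a1, b0, b1;
    std::memcpy(&a0, a, 8);
    std::memcpy(&a1, a + 8, 8);
    std::memcpy(&b0, b, 8);
    std::memcpy(&b1, b + 8, 8);
    return a0 == b0 && a1 == b1;
}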

>> and as noted in the e-mail below, why do these checks not scale nicely
>> with the number of worker processes? If they did, the fact that one 3.2
>> process is about 1/3 the speed of a 3.0 process in checking the acls
>> wouldn't matter nearly as much when it's so easy to get an 8+ core system.
>>
>
> There you have the unknown.

I think this is a fairly critical thing to figure out.

>>
>> it seems to me that all accept/deny rules in a set should be able to be
>> combined into a tree to make searching them very fast.
>>
>> so for example if you have
>>
>> accept 1
>> accept 2
>> deny 3
>> deny 4
>> accept 5
>>
>> you need to create three trees (one with accept 1 and accept 2, one with
>> deny3 and deny4, and one with accept 5) and then check each tree to see
>> if you have a match.
>>
>> the types of match could be done in order of increasing cost, so if you
>
> The config file is specific structure configured by admin under guaranteed
> rules of operation for access lines (top-down, left-to-right,
> first-match-wins) to perform boolean-logic calculations using ACL sets.
> Sorting access line rules is not an option.
> Sorting ACL values and tree-forming them is already done (regex being the
> one exception AFAIK).
> Sorting position-wise on a single access line is also ruled out by
> interactions with deny_info, auth and external ACL types.

It would seem that as long as you don't cross boundaries between the
different types, you should be able to optimize within a group.

Using my example above, you couldn't combine the 'accept 5' with any of
the other accepts, but you could combine accept 1 and 2 and combine deny 3
and 4 together.

now, I know that I don't fully understand all the possible ACL types, so
this may not work for some of them, but I believe that a fairly common use
case is to have either a lot of allow rules, or a lot of deny rules as a
block (either a list of sites you are allowed to access, or a list of
sites that are blocked), so an ability to optimize these use cases may be
well worth it.

>> have acl entries of type port, src, dst, and url regex, organize the
>> tree so that you check ports first, then src, then dst, then only if all
>> that matches do you need to do the regex. This would be very similar to
>> the shortcut logic that you use today with a single rule where you bail
>> out when you don't find a match.
>>
>> you could go with a complex tree structure, but since this only needs to
>> be changed at boot time,
>
> Um, "boot"/startup time and arbitrary "-k reconfigure" times.
> With a reverse-configuration display dump on any cache manager request.

still a pretty rare case, and one where you can build a completely new
ruleset and swap it out. My point was that this isn't something that you
have to be able to update dynamically.

>> it seems to me that a simple array that you can
>> do a binary search on will work for the port, src, and dst trees. The
>> url regex is probably easiest to initially create by just doing a list
>> of regex strings to match and working down that list, but eventually it
>
> This is already how we do these. But with a splay tree instead of binary.

I wondered about that. I've gotten splay tree related warning messages.

>> may be best to create a parse tree so that you only have to walk down
>> the string once to see if you have a match.
>
> That would be nice. Care to implement?
> You just have to get the regex library to adjust its pre-compiled patterns
> with OR into (existing|new) whenever a new pattern string is added to an ACL.

I'm actually watching the libnorm project closely, with an intention of
leveraging it for this sort of thing. It's a project being developed by
the maintainer of rsyslog to efficiently match log entries. It doesn't
support regex notation for defining its rules at this point, but
I've poked Rainer in that direction, so we'll see how things go.

>> you wouldn't quite be able to get this fast as you would have to
>> actually do two checks, one if you have a match on that level and one
>> for the rules that don't specify something in the current tree (one
>> check for if the http_access line specifies a port number and one for if
>> it doesn't for example)
>
> We get around this problem by using C++ types. ACLChecklist walks the tree
> holding the current location, expected result, and all details available
> about the transaction. Each node in the tree has a match() function which
> gets called at most once per walk. Each ACL data type provides its own
> match() algorithm.
>
> That is why the following config is invalid:
> acl foo src 1.2.3.4
> acl foo port 80

Ok, makes sense. I'll have to dig into that; thanks for the pointer to
the function to look for.

>>
>> this sort of acl structure would reduce a complex ruleset down to ~O(log
>> n) instead of the current O(n) (a really complex ruleset would be log n
>> of each tree added togeather)
>>
>> there are cases where this sort of thing would be more expensive than
>> the current, simple approach, but those would be on simple rulesets
>> which aren't affected much by a few extra checks.
>
> Um, You have pretty much described our existing code. Even with the details
> of *how* hidden away the small bit exposed to admin is fairly complex.

good ;-) I'm glad when I suggest an approach and find that the project is
already doing things the way I think is best.

>
> I am thinking you did not understand ACLs very well earlier when designing
> your config rules. With large rulsets that can easily lead to an inefficient
> config and the worst-case results you seem to have achieved.
> If you care to share your actual configuration file contents I'm happy to
> read through and point out any optimizations that can be made.
> Though you may want to use the above info and see if you can find any
> optimizations first.

I'll have to do some sanitizing of the rules before I can send it out.
I'll try and figure out how to do this without destroying the ability to
check things.

the basic problem is that this is a whitelist of what is allowed to be
accessed. I know that there are some problems with rules that overlap, but
that's not a large part of the issue (usually if rules overlap, the more
general rule is wrong, but the application developers have done something
stupid and it needs to be in there until they fix the application)

> The ACL storage types are all defined in the src/acl/*Data.* source files. If
> you wish to work on finding us some faster types or even faster matching
> algorithm for an existing type that would be welcome.
> We do ask for some unit/micro-benchmarks of the old vs new match() function
> so we know the change is an actual improvement.

I don't know how much of a chance I will get to do any coding on this (as
opposed to being available to test anything that you folks try; testing I
will make time for), but I definitely agree on the need to do benchmarks.

again, thanks for the info.

David Lang

Res: [squid-users] squid 3.2.0.5 smp scaling issues

x-14-2
Hi David,

could you run and publish your benchmark with squid 2.7? I'd like to know
whether there is any regression between 2.7 and the 3.x series.

thanks.

Marcos



Re: Res: [squid-users] squid 3.2.0.5 smp scaling issues

david-2
sorry, haven't had time to do that yet. I will try and get this done
today.

David Lang

On Wed, 13 Apr 2011, Marcos wrote:

> Date: Wed, 13 Apr 2011 04:11:09 -0700 (PDT)
> From: Marcos <[hidden email]>
> To: [hidden email], Amos Jeffries <[hidden email]>
> Cc: [hidden email], [hidden email]
> Subject: Res: [squid-users] squid 3.2.0.5 smp scaling issues
>
> Hi David,
>
> could you run and publish your benchmark with squid 2.7 ???
> i'd like to know if is there any regression between 2.7 and 3.x series.
>
> thanks.
>
> Marcos
>
>
> ----- Mensagem original ----
> De: "[hidden email]" <[hidden email]>
> Para: Amos Jeffries <[hidden email]>
> Cc: [hidden email]; [hidden email]
> Enviadas: S?bado, 9 de Abril de 2011 12:56:12
> Assunto: Re: [squid-users] squid 3.2.0.5 smp scaling issues
>
> On Sat, 9 Apr 2011, Amos Jeffries wrote:
>
>> On 09/04/11 14:27, [hidden email] wrote:
>>> A couple more things about the ACLs used in my test
>>>
>>> all of them are allow ACLs (no deny rules to worry about precidence of)
>>> except for a deny-all at the bottom
>>>
>>> the ACL line that permits the test source to the test destination has
>>> zero overlap with the rest of the rules
>>>
>>> every rule has an IP based restriction (even the ones with url_regex are
>>> source -> URL regex)
>>>
>>> I moved the ACL that allows my test from the bottom of the ruleset to
>>> the top and the resulting performance numbers were up as if the other
>>> ACLs didn't exist. As such it is very clear that 3.2 is evaluating every
>>> rule.
>>>
>>> I changed one of the url_regex rules to just match one line rather than
>>> a file containing 307 lines to see if that made a difference, and it
>>> made no significant difference. So this indicates to me that it's not
>>> having to fully evaluate every rule (it's able to skip doing the regex
>>> if the IP match doesn't work)
>>>
>>> I then changed all the acl lines that used hostnames to have IP
>>> addresses in them, and this also made no significant difference
>>>
>>> I then changed all subnet matches to single IP address (just nuked /##
>>> throughout the config file) and this also made no significant difference.
>>>
>>
>> Squid has always worked this way. It will *test* every rule from the top down
>> to the one that matches. Also testing each line left-to-right until one fails or
>> the whole line matches.
>>
>>>
>>> so why are the address matches so expensive
>>>
>>
>> 3.0 and older IP address is a 32-bit comparison.
>> 3.1 and newer IP address is a 128-bit comparison with memcmp().
>>
>> If something like a word-wise comparison can be implemented faster than
>> memcmp() we would welcome it.
>
> I wonder if there should be a different version that's used when IPv6 is
> disabled. this is a pretty large hit.
>
> if the data is aligned properly, on a 64 bit system this should still only be 2
> compares. do you do any alignment on the data now?
>
>>> and as noted in the e-mail below, why do these checks not scale nicely
>>> with the number of worker processes? If they did, the fact that one 3.2
>>> process is about 1/3 the speed of a 3.0 process in checking the acls
>>> wouldn't matter nearly as much when it's so easy to get an 8+ core system.
>>>
>>
>> There you have the unknown.
>
> I think this is a fairly critical thing to figure out.
>
>>>
>>> it seems to me that all accept/deny rules in a set should be able to be
>>> combined into a tree to make searching them very fast.
>>>
>>> so for example if you have
>>>
>>> accept 1
>>> accept 2
>>> deny 3
>>> deny 4
>>> accept 5
>>>
>>> you need to create three trees (one with accept 1 and accept 2, one with
>>> deny3 and deny4, and one with accept 5) and then check each tree to see
>>> if you have a match.
>>>
>>> the types of match could be done in order of increasing cost, so if you
>>
>> The config file is specific structure configured by admin under guaranteed
>> rules of operation for access lines (top-down, left-to-right, first-match-wins)
>> to perform boolean-logic calculations using ACL sets.
>> Sorting access line rules is not an option.
>> Sorting ACL values and tree-forming them is already done (regex being the one
>> exception AFAIK).
>> Sorting position-wise on a single access line is also ruled out by interactions
>> with deny_info, auth and external ACL types.
>
> It would seem that as long as you don't cross boundaries between the different
> types, you should be able to optimize within a group.
>
> Using my example above, you couldn't combine the 'accept 5' with any of the
> other accepts, but you could combine accept 1 and 2, and combine deny 3 and 4
> together.
>
> Now, I know that I don't fully understand all the possible ACL types, so this
> may not work for some of them, but I believe that a fairly common use case is to
> have either a lot of allow rules or a lot of deny rules as a block (either a
> list of sites you are allowed to access, or a list of sites that are blocked),
> so an ability to optimize these use cases may be well worth it.
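A minimal sketch of that grouping idea (the types and names below are made up
for illustration, and it deliberately ignores ACL types with side effects such
as deny_info, auth or external lookups): consecutive rules with the same action
collapse into one sorted block, and first-match-wins is preserved because only
which block matches first matters, not which rule inside it.

#include <algorithm>
#include <cstdint>
#include <vector>

struct Rule  { bool allow; std::vector<uint32_t> srcs; };  // simplified src-IP-only rule
struct Block { bool allow; std::vector<uint32_t> srcs; };  // a run of same-action rules

// Collapse consecutive same-action rules into sorted blocks (done once at startup).
static std::vector<Block> groupRules(const std::vector<Rule> &rules)
{
    std::vector<Block> blocks;
    for (const Rule &r : rules) {
        if (blocks.empty() || blocks.back().allow != r.allow)
            blocks.push_back(Block{r.allow, {}});
        blocks.back().srcs.insert(blocks.back().srcs.end(), r.srcs.begin(), r.srcs.end());
    }
    for (Block &b : blocks)
        std::sort(b.srcs.begin(), b.srcs.end());
    return blocks;
}

// First block containing src decides the verdict; O(log n) per block.
static bool allowed(const std::vector<Block> &blocks, uint32_t src)
{
    for (const Block &b : blocks)
        if (std::binary_search(b.srcs.begin(), b.srcs.end(), src))
            return b.allow;
    return false;  // implicit deny-all at the bottom
}
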
>
>>> have acl entries of type port, src, dst, and url regex, organize the
>>> tree so that you check ports first, then src, then dst, then only if all
>>> that matches do you need to do the regex. This would be very similar to
>>> the shortcut logic that you use today with a single rule where you bail
>>> out when you don't find a match.
>>>
>>> you could go with a complex tree structure, but since this only needs to
>>> be changed at boot time,
>>
>> Um, "boot"/startup time and arbitrary "-k reconfigure" times.
>> With a reverse-configuration display dump on any cache manager request.
>
> still a pretty rare case, and one where you can build a completely new ruleset
> and swap it out. My point was that this isn't something that you have to be able
> to update dynamically.
>
>>> it seems to me that a simple array that you can
>>> do a binary search on will work for the port, src, and dst trees. The
>>> url regex is probably easiest to initially create by just doing a list
>>> of regex strings to match and working down that list, but eventually it
>>
>> This is already how we do these. But with a splay tree instead of binary.
>
> I wondered about that. I've gotten splay tree related warning messages.
>
>>> may be best to create a parse tree so that you only have to walk down
>>> the string once to see if you have a match.
>>
>> That would be nice. Care to implement?
>> You just have to get the regex library to adjust its pre-compiled patterns with
>> OR into (existing|new) whenever a new pattern string is added to an ACL.
>
> I'm actually watching the libnorm project closely, with an intention of
> leveraging it for this sort of thing. It's a project being developed by the
> maintainer of rsyslog, trying to efficiently match log entries. It doesn't
> support regex notation for defining its rules at this point, but I've poked
> Rainer in that direction, so we'll see how things go.
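As a rough illustration of the "(existing|new)" joining described above,
folding N url_regex patterns into one POSIX extended regex so a URL is scanned
only once (illustrative helper names, not how Squid stores its ACL regexes
today):

#include <regex.h>
#include <string>
#include <vector>

// Join all patterns into "(p1)|(p2)|..." and compile once.
static bool buildCombined(const std::vector<std::string> &patterns, regex_t *re)
{
    std::string joined;
    for (const std::string &p : patterns) {
        if (!joined.empty())
            joined += "|";
        joined += "(" + p + ")";       // group each pattern before OR-ing them
    }
    return regcomp(re, joined.c_str(), REG_EXTENDED | REG_NOSUB) == 0;
}

static bool matchesAny(const regex_t *re, const char *url)
{
    return regexec(re, url, 0, nullptr, 0) == 0;
}
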
>
>>> you wouldn't quite be able to get this fast as you would have to
>>> actually do two checks, one if you have a match on that level and one
>>> for the rules that don't specify something in the current tree (one
>>> check for if the http_access line specifies a port number and one for if
>>> it doesn't for example)
>>
>> We get around this problem by using C++ types. ACLChecklist walks the tree
>> holding the current location, expected result, and all details available about
>> the transaction. Each node in the tree has a match() function which gets called
>> at most once per walk. Each ACL data type provides its own match() algorithm.
>>
>> That is why the following config is invalid:
>> acl foo src 1.2.3.4
>> acl foo port 80
>
> Ok, makes sense. I'll have to dig into that, thanks for the pointer to the
> function to look for.
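To make the tree walk described above concrete, here is a heavily simplified,
hypothetical illustration (ACLChecklist and the per-type match() are real Squid
concepts; every class and name below is invented for the example): top-down
over access lines, left-to-right within a line, short-circuiting on the first
ACL that fails.

#include <memory>
#include <string>
#include <vector>

struct Transaction { std::string srcIp; int port; std::string url; };

struct AclNode {
    virtual ~AclNode() = default;
    virtual bool match(const Transaction &t) const = 0;  // called at most once per walk
};

struct SrcIpAcl : AclNode {
    std::string ip;
    explicit SrcIpAcl(std::string v) : ip(std::move(v)) {}
    bool match(const Transaction &t) const override { return t.srcIp == ip; }
};

struct AccessLine {
    bool allow;
    std::vector<std::unique_ptr<AclNode>> acls;  // all must match, left-to-right
};

// Top-down, left-to-right, first-match-wins.
static bool checkAccess(const std::vector<AccessLine> &lines, const Transaction &t)
{
    for (const AccessLine &line : lines) {
        bool all = true;
        for (const auto &acl : line.acls)
            if (!acl->match(t)) { all = false; break; }  // short-circuit on first failure
        if (all)
            return line.allow;
    }
    return false;  // nothing matched
}
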
>
>>>
>>> this sort of acl structure would reduce a complex ruleset down to ~O(log
>>> n) instead of the current O(n) (a really complex ruleset would be log n
>>> of each tree added together)
>>>
>>> there are cases where this sort of thing would be more expensive than
>>> the current, simple approach, but those would be on simple rulesets
>>> which aren't affected much by a few extra checks.
>>
>> Um, You have pretty much described our existing code. Even with the details of
>> *how* hidden away the small bit exposed to admin is fairly complex.
>
> good ;-) I'm glad when I suggest an approach and find that the project is
> already doing things the way I think is best.
>
>>
>> I am thinking you did not understand ACLs very well earlier when designing your
>> config rules. With large rulesets that can easily lead to an inefficient config
>> and the worst-case results you seem to have achieved.
>> If you care to share your actual configuration file contents I'm happy to read
>> through and point out any optimizations that can be made.
>> Though you may want to use the above info and see if you can find any
>> optimizations first.
>
> I'll have to do some sanitizing of the rules before I can send it out. I'll try
> and figure out how to do this without destroying the ability to check things.
>
> the basic problem is that this is a whitelist of what is allowed to be accessed.
> I know that there are some problems with rules that overlap, but that's not a
> large part of the issue (usually if rules overlap, the more general rule is
> wrong, but the application developers have done something stupid and it needs to
> be in there until they fix the application)
>
>> The ACL storage types are all defined in the src/acl/*Data.* source files. If
>> you wish to work on finding us some faster types or even faster matching
>> algorithm for an existing type that would be welcome.
>> We do ask for some unit/micro-benchmarks of the old vs new match() function so
>> we know the change is an actual improvement.
>
> I don't know how much I will get a chance to do any coding on this (as opposed
> to being available to test anything that you folks try; testing I will make time
> for), but I definitely agree on the need to do benchmarks.
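A toy example of the kind of micro-benchmark being asked for (standalone, not
Squid's test harness): time N address comparisons with memcmp() against the
word-wise variant sketched earlier.

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Two 64-bit loads instead of a byte-wise memcmp() (same idea as the earlier sketch).
static bool wordCompare(const unsigned char *a, const unsigned char *b)
{
    uint64_t a0, a1, b0, b1;
    std::memcpy(&a0, a, 8); std::memcpy(&a1, a + 8, 8);
    std::memcpy(&b0, b, 8); std::memcpy(&b1, b + 8, 8);
    return ((a0 ^ b0) | (a1 ^ b1)) == 0;
}

int main()
{
    unsigned char x[16] = {1, 2, 3, 4};
    unsigned char y[16] = {1, 2, 3, 5};          // differ in the 4th byte
    const long N = 50000000;
    long hits = 0;

    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < N; ++i)
        hits += (std::memcmp(x, y, 16) == 0);
    auto t1 = std::chrono::steady_clock::now();
    for (long i = 0; i < N; ++i)
        hits += wordCompare(x, y);
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::milliseconds;
    std::printf("memcmp:    %lld ms\nword-wise: %lld ms\nhits: %ld\n",
                (long long)std::chrono::duration_cast<ms>(t1 - t0).count(),
                (long long)std::chrono::duration_cast<ms>(t2 - t1).count(), hits);
    return 0;
}
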
>
> again, thanks for the info.
>
> David Lang
>

Re: Res: [squid-users] squid 3.2.0.5 smp scaling issues

david-2
Ok, I finally got a chance to test 2.7STABLE9

it performs about the same as squid 3.0, possibly a little better.

with my somewhat stripped down config (smaller regex patterns, replacing
CIDR blocks and names that would need to be looked up in /etc/hosts with
individual IP addresses)

2.7 gives ~4800 requests/sec
3.0 gives ~4600 requests/sec
3.2.0.6 with 1 worker gives ~1300 requests/sec
3.2.0.6 with 5 workers gives ~2800 requests/sec

the numbers for 3.0 are slightly better than what I was getting with the
full ruleset, but the numbers for 3.2.0.6 are pretty much exactly what I
got from the last round of tests (with either the full or simplified
ruleset)

so 3.1 and 3.2 are a very significant regression from 2.7 or 3.0, and the
ability to use multiple worker processes in 3.2 doesn't make up for this.

the time taken seems to be almost all in the ACL evaluation, as eliminating
all the ACLs takes 1 worker with 3.2 up to 4200 requests/sec.

one theory is that even though I have IPv6 disabled on this build, the
added space and more expensive checks needed to compare IPv6 addresses
instead of IPv4 addresses account for the single-worker drop of ~66%.
That seems rather expensive, even though there are 293 http_access lines
(and one of them uses external file contents in its ACLs, so it is a total
of ~2400 source/destination pairs; however, due to the ability to shortcut
the comparison, the number of tests that need to be done should be <400).



In addition, there seems to be some sort of locking between the multiple
worker processes in 3.2 when checking the ACLs, as the test with almost no
ACLs scales close to 100% per worker, while with the ACLs it scales much
more slowly, and above 4-5 workers throughput actually drops off dramatically
(to the point where with 8 workers it is down to about what you get with
1-2 workers). I don't see any conceptual reason why the ACL checks of the
different worker processes should impact each other in any way, let alone
in a way that limits scalability to ~4 workers before adding more workers
is a net loss.

David Lang


> On Wed, 13 Apr 2011, Marcos wrote:
>
>> Hi David,
>>
>> could you run and publish your benchmark with squid 2.7 ???
>> I'd like to know if there is any regression between 2.7 and the 3.x series.
>>
>> thanks.
>>
>> Marcos

Re: Res: [squid-users] squid 3.2.0.5 smp scaling issues

david-2
Ping: I haven't seen a response to the additional information that I sent
out last week.

squid 3.1 and 3.2 are a significant regression in performance from squid
2.7 or 3.0

David Lang


Res: Res: [squid-users] squid 3.2.0.5 smp scaling issues

x-14-2
Thanks for your answer, David.

I'm seeing a lot of features being included in squid 3.x, but it's getting
slower as new features are added.
I think squid 3.2 with 1 worker should be as fast as 2.7, but it's getting
slower and hungrier.


Marcos




Re: Res: [squid-users] squid 3.2.0.5 smp scaling issues

Alex Rousskov
In reply to this post by david-2
On 04/14/2011 09:06 PM, [hidden email] wrote:

> Ok, I finally got a chance to test 2.7STABLE9
>
> it performs about the same as squid 3.0, possibly a little better.
>
> with my somewhat stripped down config (smaller regex patterns, replacing
> CIDR blocks and names that would need to be looked up in /etc/hosts with
> individual IP addresses)
>
> 2.7 gives ~4800 requests/sec
> 3.0 gives ~4600 requests/sec
> 3.2.0.6 with 1 worker gives ~1300 requests/sec
> 3.2.0.6 with 5 workers gives ~2800 requests/sec

Glad you did not see a significant regression between v2.7 and v3.0. We
have heard rather different stories. Every environment is different, and
many lab tests are misguided, of course, but it is still good to hear
positive reports.

The difference between v3.2 and v3.0 is known and has been discussed on
squid-dev. A few specific culprits are also known, but more need to be
identified. We are working on identifying these performance bugs and
reducing that difference.

As for the 1 versus 5 worker difference, it seems to be specific to your
environment (as discussed below).


> the numbers for 3.0 are slightly better than what I was getting with the
> full ruleset, but the numbers for 3.2.0.6 are pretty much exactly what I
> got from the last round of tests (with either the full or simplified
> ruleset)
>
> so 3.1 and 3.2 are a very significant regression from 2.7 or 3.0, and
> the ability to use multiple worker processes in 3.2 doesn't make up for
> this.
>
> the time taken seems to almost all be in the ACL avaluation as
> eliminating all the ACLs takes 1 worker with 3.2 up to 4200 requests/sec.

If ACLs are the major culprit in your environment, then this is most
likely not a problem in Squid source code. AFAIK, there are no locks or
other synchronization primitives/overheads when it comes to Squid ACLs.
The solution may lie in optimizing some 3rd-party libraries (used by
ACLs) or in optimizing how they are used by Squid, depending on what
ACLs you use. As far as Squid-specific code is concerned, you should see
nearly linear ACL scaling with the number of workers.


> one theory is that even though I have IPv6 disabled on this build, the
> added space and more expensive checks needed to compare IPv6 addresses
> instead of IPv4 addresses accounts for the single worker drop of ~66%.
> that seems rather expensive, even though there are 293 http_access lines
> (and one of them uses external file contents in it's acls, so it's a
> total of ~2400 source/destination pairs, however due to the ability to
> shortcut the comparison the number of tests that need to be done should
> be <400)

Yes, IPv6 is one of the known major performance regression culprits, but
IPv6 ACLs should still scale linearly with the number of workers, AFAICT.

Please note that I am not an ACL expert. I am just talking from the
overall Squid SMP design point of view and from our testing/deployment
experience point of view.


>> In addition, there seems to be some sort of locking between the multiple
> worker processes in 3.2 when checking the ACLs

There are pretty much no locks in the current official SMP code. This
will change as we start adding shared caches in a week or so, but even
then the ACLs will remain lock-free. There could be some internal
locking in the 3rd-party libraries used by ACLs (regex and such), but I
do not know much about them.


HTH,

Alex.




Re: Res: [squid-users] squid 3.2.0.5 smp scaling issues

david-2
On Mon, 25 Apr 2011, Alex Rousskov wrote:

> On 04/14/2011 09:06 PM, [hidden email] wrote:
>> Ok, I finally got a chance to test 2.7STABLE9
>>
>> it performs about the same as squid 3.0, possibly a little better.
>>
>> with my somewhat stripped down config (smaller regex patterns, replacing
>> CIDR blocks and names that would need to be looked up in /etc/hosts with
>> individual IP addresses)
>>
>> 2.7 gives ~4800 requests/sec
>> 3.0 gives ~4600 requests/sec
>> 3.2.0.6 with 1 worker gives ~1300 requests/sec
>> 3.2.0.6 with 5 workers gives ~2800 requests/sec
>
> Glad you did not see a significant regression between v2.7 and v3.0. We
> have heard rather different stories. Every environment is different, and
> many lab tests are misguided, of course, but it is still good to hear
> positive reports.
>
> The difference between v3.2 and v3.0 is known and have been discussed on
> squid-dev. A few specific culprits are also known, but more need to be
> identified. We are working on identifying these performance bugs and
> reducing that difference.

Let me know if there are any tests I can run that will help you.

> As for 1 versus 5 worker difference, it seems to be specific to your
> environment (as discussed below).
>
>
>> the numbers for 3.0 are slightly better than what I was getting with the
>> full ruleset, but the numbers for 3.2.0.6 are pretty much exactly what I
>> got from the last round of tests (with either the full or simplified
>> ruleset)
>>
>> so 3.1 and 3.2 are a very significant regression from 2.7 or 3.0, and
>> the ability to use multiple worker processes in 3.2 doesn't make up for
>> this.
>>
>> the time taken seems to almost all be in the ACL avaluation as
>> eliminating all the ACLs takes 1 worker with 3.2 up to 4200 requests/sec.
>
> If ACLs are the major culprit in your environment, then this is most
> likely not a problem in Squid source code. AFAIK, there are no locks or
> other synchronization primitives/overheads when it comes to Squid ACLs.
> The solution may lie in optimizing some 3rd-party libraries (used by
> ACLs) or in optimizing how they are used by Squid, depending on what
> ACLs you use. As far as Squid-specific code is concerned, you should see
> nearly linear ACL scale with the number of workers.

Given that my ACLs are IP/port matches or regex matches (and I've tested
replacing the regex matches with IP matches with no significant change in
performance), what components would be used?

>
>> one theory is that even though I have IPv6 disabled on this build, the
>> added space and more expensive checks needed to compare IPv6 addresses
>> instead of IPv4 addresses accounts for the single worker drop of ~66%.
>> that seems rather expensive, even though there are 293 http_access lines
>> (and one of them uses external file contents in it's acls, so it's a
>> total of ~2400 source/destination pairs, however due to the ability to
>> shortcut the comparison the number of tests that need to be done should
>> be <400)
>
> Yes, IPv6 is one of the known major performance regression culprits, but
> IPv6 ACLs should still scale linearly with the number of workers, AFAICT.
>
> Please note that I am not an ACL expert. I am just talking from the
> overall Squid SMP design point of view and from our testing/deployment
> experience point of view.

That makes sense and is what I would have expected, but in my case (lots
of ACLs) I am seeing a definite problem with more workers not completing
more work, and beyond about 5 workers I am seeing the total work being
completed drop. I can't think of any reason besides locking why this may
be the case.

>> In addition, there seems to be some sort of locking betwen the multiple
>> worker processes in 3.2 when checking the ACLs
>
> There are pretty much no locks in the current official SMP code. This
> will change as we start adding shared caches in a week or so, but even
> then the ACLs will remain lock-free. There could be some internal
> locking in the 3rd-party libraries used by ACLs (regex and such), but I
> do not know much about them.

What are the 3rd-party libraries that I would be using?

David Lang

>
> HTH,
>
> Alex.
>
>

Re: Res: Res: [squid-users] squid 3.2.0.5 smp scaling issues

david-2
In reply to this post by x-14-2
On Mon, 25 Apr 2011, Marcos wrote:

> Thanks for your answer, David.
>
> I'm seeing a lot of features being included in squid 3.x, but it's getting
> slower as new features are added.

That's unfortunately fairly normal.

> I think squid 3.2 with 1 worker should be as fast as 2.7, but it's getting
> slower and hungrier.

That's one major problem, but the fact that the ACL matching isn't scaling
with more workers is, I think, what's killing us.

One 3.2 worker is ~1/3 the speed of 2.7, but with the easy availability of 8+
real cores (not hyperthreaded 'fake' cores), you should still be able to
get ~3x the performance of 2.7 by using 3.2.

Unfortunately that's not what's happening, and we end up topping out at
around 1/2-2/3 of the performance of 2.7.

David Lang


Re: Res: [squid-users] squid 3.2.0.5 smp scaling issues

david-2
In reply to this post by david-2
On Mon, 25 Apr 2011, [hidden email] wrote:

> On Mon, 25 Apr 2011, Alex Rousskov wrote:
>
>> On 04/14/2011 09:06 PM, [hidden email] wrote:
>>
>>>> In addition, there seems to be some sort of locking between the multiple
>>> worker processes in 3.2 when checking the ACLs
>>
>> There are pretty much no locks in the current official SMP code. This
>> will change as we start adding shared caches in a week or so, but even
>> then the ACLs will remain lock-free. There could be some internal
>> locking in the 3rd-party libraries used by ACLs (regex and such), but I
>> do not know much about them.
>
> what are the 3rd party libraries that I would be using?

One thought I had is that this could be locking on name lookups. How hard
would it be to create a quick patch that would bypass the name lookups
entirely and only do the lookups by IP?

If that regains the speed and/or scalability, it would point the finger
fairly conclusively at the DNS components.

This is the only thing I can think of that should be shared between
multiple workers processing ACLs.

David Lang

Re: Res: [squid-users] squid 3.2.0.5 smp scaling issues

Alex Rousskov
On 04/25/2011 05:31 PM, [hidden email] wrote:

> On Mon, 25 Apr 2011, [hidden email] wrote:
>> On Mon, 25 Apr 2011, Alex Rousskov wrote:
>>> On 04/14/2011 09:06 PM, [hidden email] wrote:
>>>
>>>> In addition, there seems to be some sort of locking betwen the multiple
>>>> worker processes in 3.2 when checking the ACLs
>>>
>>> There are pretty much no locks in the current official SMP code. This
>>> will change as we start adding shared caches in a week or so, but even
>>> then the ACLs will remain lock-free. There could be some internal
>>> locking in the 3rd-party libraries used by ACLs (regex and such), but I
>>> do not know much about them.
>>
>> what are the 3rd party libraries that I would be using?

See "ldd squid". Here is a sample based on a randomly picked Squid:

    libnsl, libresolv, libstdc++, libgcc_s, libm, libc, libz, libepol

Please note that I am not saying that any of these have problems in an SMP
environment. I am only saying that Squid itself does not lock anything at
runtime, so if our suspect is SMP-related locks, they would have to
reside elsewhere. The other possibility is that we should suspect
something else, of course. IMHO, it is more likely to be something else:
after all, Squid does not use threads, where such problems are expected.

BTW, do you see more-or-less even load across CPU cores? If not, you may
need a patch that we find useful on older Linux kernels. It is discussed
in the "Will similar workers receive similar amount of work?" section of
http://wiki.squid-cache.org/Features/SmpScale


> one thought I had is that this could be locking on name lookups. how
> hard would it be to create a quick patch that would bypass the name
> lookups entirely and only do the lookups by IP.

I did not realize your ACLs use DNS lookups. Squid internal DNS code
does not have any runtime SMP locks. However, the presence of DNS
lookups increases the number of suspects.

A patch you propose does not sound difficult to me, but since I cannot
contribute such a patch soon, it is probably better to test with ACLs
that do not require any DNS lookups instead.


> if that regains the speed and/or scalability it would point fingers
> fairly conclusively at the DNS components.
>
> this is the only think that I can think of that should be shared between
> multiple workers processing ACLs

but it is _not_ currently shared from Squid's point of view.


Cheers,

Alex.