squid 3.5.27 does not respect cache_dir-size but uses 100% of partition and fails


squid 3.5.27 does not respect cache_dir-size but uses 100% of partition and fails

pete dawgg
Hello list,

I run squid 3.5.27 with some special settings for Windows updates, as suggested here: https://wiki.squid-cache.org/ConfigExamples/Caching/WindowsUpdates It had been running almost trouble-free for some time, but for ~2 months the cache partition has been filling up to 100% (disk space; inodes were OK) and squid then failed.

the cache-dir is on a 100GB ext2-partition and configured like this:

cache_dir aufs /mnt/cache/squid 75000 16 256
cache_swap_low 60
cache_swap_high 75
minimum_object_size 0 KB
maximum_object_size 6000 MB

some special settings for the windows updates:
range_offset_limit 6000 MB
maximum_object_size 6000 MB
quick_abort_min -1
quick_abort_max -1
quick_abort_pct -1

When I restart squid with its init script, it sometimes expunges some objects from the cache, but then fails again after a short while:
before restart:
/dev/sdb2        99G     93G  863M  100% /mnt/cache
after restart:
/dev/sdb2        99G     87G  7,4G   93% /mnt/cache

there are two types of errors in cache.log:
FATAL: Ipc::Mem::Segment::open failed to shm_open(/squid-cf__metadata.shm): (2) No such file or directory
FATAL: Failed to rename log file /mnt/cache/squid/swap.state.new to /mnt/cache/squid/swap.state

What should I do to make squid cache Windows updates reliably again?

_______________________________________________
squid-users mailing list
[hidden email]
http://lists.squid-cache.org/listinfo/squid-users

Re: squid 3.5.27 does not respect cache_dir-size but uses 100% of partition and fails

Amos Jeffries
Administrator
On 11/07/18 22:39, pete dawgg wrote:
> Hello list,
>
> i run squid 3.5.27 with some special settings for windows updates as suggested here: https://wiki.squid-cache.org/ConfigExamples/Caching/WindowsUpdates It's been running almost trouble-free for some time, but for ~2 months the cache-partition has been filling up to 100% (space; inodes were OK) and squid then failed.
>

That implies that either your cache_dir size accounting is VERY badly
broken, something else is filling the disk (eg failing to rotate
swap.state journals), or disk purging is not able to keep up with the
traffic flow.


> the cache-dir is on a 100GB ext2-partition and configured like this:
>

Hmm, a partition. What else is using the same physical disk?
 Squid puts such a random I/O pattern on cache disks that it's best not
to use the actual physical drive for other things in parallel - they can
slow Squid down, and conversely Squid can cause problems for other uses
by flooding the disk controller queues.


> cache_dir aufs /mnt/cache/squid 75000 16 256

These numbers matter more for ext2 than for other FS types. You need
them to be large enough that Squid does not allocate too many inodes per
directory. I would use "64 256" here, or even "128 256" for a bigger
safety margin.

(I *think* modern ext2 implementations have resolved the core issue, but
that may be wrong and ext2 is old enough to be wary.)
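
With those values the cache_dir line would become (a sketch keeping the
existing path and size):

```
cache_dir aufs /mnt/cache/squid 75000 64 256
```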


> cache_swap_low 60
> cache_swap_high 75
> minimum_object_size 0 KB
> maximum_object_size 6000 MB

If you bumped this for the Win8 sizes mentioned in our wiki, note that
the Win10 major updates have pushed sizes up again, past 10GB. So you
may need to increase this further.


>
> some special settings for the windows updates:
> range_offset_limit 6000 MB

Add the ACLs necessary to restrict this to WU traffic. It's really hard
on cache space**, so it should not be allowed for just any traffic.
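
As a sketch of what that could look like in squid.conf (the dstdomain
list here is illustrative, not a complete set of Windows Update domains):

```
# Allow large range-offset fetches only for Windows Update traffic
acl windowsupdate dstdomain .windowsupdate.com .update.microsoft.com
range_offset_limit 6000 MB windowsupdate
# Everything else keeps the default (no fetch-ahead)
range_offset_limit 0
```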


** What I mean by that is it may result in N parallel fetches of the
entire object unless the collapsed forwarding feature is used.
 In regard to your situation: consider a 10GB WU object being fetched
10 times -> 10*10 GB = 100 GB of disk space required just for the
fetches. That over-fills your available 45GB (cache_swap_low/100 *
cache_dir = 60% of 75000 MB). And 11 such fetches would overflow your
whole disk.
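
That arithmetic can be checked quickly (the 10GB object size and fetch
count are illustrative numbers, not measurements from this proxy):

```python
cache_dir_mb = 75000   # configured cache_dir size
cache_swap_low = 60    # percent kept after purging down to the low mark

usable_gb = cache_dir_mb * cache_swap_low / 100 / 1000
print(usable_gb)       # 45.0 GB retained at the low watermark

object_gb = 10         # one large Windows Update object
fetches = 10           # un-collapsed parallel fetches of the same object
print(object_gb * fetches)  # 100 GB of disk needed just for the fetches
```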



> maximum_object_size 6000 MB
> quick_abort_min -1
> quick_abort_max -1
> quick_abort_pct -1
>
> when i restart squid with its initscript it sometimes expunges some stuff from the cache but then fails again after a short while:
> before restart:
> /dev/sdb2        99G     93G  863M  100% /mnt/cache
> after restart:
> /dev/sdb2        99G     87G  7,4G   93% /mnt/cache
>

How much of that /mnt/cache size is in /mnt/cache/squid?

Is it one physical HDD spindle (versus a RAID drive)?


>
> there are two types of errors in cache.log:
> FATAL: Ipc::Mem::Segment::open failed to
shm_open(/squid-cf__metadata.shm): (2) No such file or directory

The cf__metadata.shm error is quite bad - it means your collapsed
forwarding is not working well. Which implies it is not preventing the
disk overflow on parallel huge WU fetches.

Are you able to try the new Squid-4? There are some collapsed forwarding
and cache management changes that may fix these errors, or at least
allow better diagnosis of them - and maybe of your disk usage problem.


> FATAL: Failed to rename log file /mnt/cache/squid/swap.state.new to
/mnt/cache/squid/swap.state

This is suspicious; how large are those swap.state files?

Does your proxy have the correct access permissions on them and on the
directories in their path? Both the Unix filesystem permissions and
SELinux / AppArmor / whatever your system uses for advanced access
control matter here.

Check the same things for the /dev/shm device and the *.shm file access
error above. But /dev/shm should be a root thing rather than Squid user
access.


>
> What should i do to make squid work with windows updates reliably again?

Some other things you can check:

You can try making cache_swap_low/cache_swap_high closer together and
much larger (e.g. the default 90 and 95 values). Current 3.5 releases
have fixed the bug that made smaller values necessary on some earlier
installs.
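
A sketch of that change (these are the defaults mentioned above):

```
cache_swap_low 90
cache_swap_high 95
```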


If you can afford the delays it introduces to a restart, you could run a
full scan of the cached data (stop Squid, delete the swap.state* files,
then restart Squid and wait).
 - you could do that with a copy of Squid not handling user traffic if
necessary, but the running instance cannot use the cache while it's
happening.
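
As a sketch of that procedure, assuming the standard squid binary and
the cache_dir path from the config above:

```shell
squid -k shutdown                   # stop Squid cleanly
rm -f /mnt/cache/squid/swap.state*  # discard the index journals
squid                               # on restart Squid rescans the cache_dir and rebuilds swap.state
```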


Otherwise, have you tried purging the entire cache and starting Squid
with a clean slate?
 That would be a lot faster for recovery than the above scan, but does
cost a bit more bandwidth short-term while the cache re-fills.


Amos


Re: squid 3.5.27 does not respect cache_dir-size but uses 100% of partition and fails

Alex Rousskov
In reply to this post by pete dawgg
On 07/11/2018 04:39 AM, pete dawgg wrote:

> cache_dir aufs /mnt/cache/squid 75000 16 256

> FATAL: Ipc::Mem::Segment::open failed to shm_open(/squid-cf__metadata.shm): (2) No such file or directory

If you are using a combination of an SMP-unaware disk cache (AUFS) with
SMP features such as multiple workers or a shared memory cache, please
note that this combination is not supported.

The FATAL message above is about a shared memory segment used for
collapsed forwarding. IIRC, Squid v3 attempted to create those segments
even if they were not needed, so I cannot tell for sure whether you are
using an unsupported combination of SMP/non-SMP features.

I can tell you that you cannot use a combination of collapsed
forwarding, AUFS cache_dir, and multiple workers. Also, non-SMP
collapsed forwarding was primarily tested with UFS cache_dirs.
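
For reference, a supported SMP combination would pair workers with a
rock cache_dir instead of AUFS - a sketch only (the size is
illustrative, and note that rock has its own object-size constraints in
v3.5):

```
workers 8
cache_dir rock /mnt/cache/squid-rock 75000
```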


Unfortunately, I cannot answer your question regarding overflowing AUFS
cache directories. One possibility is that Squid is not cleaning up old
cache files fast enough. You already set cache_swap_low/cache_swap_high
aggressively. Does Squid actively remove objects from the full disk
cache when you start it up _without_ any traffic? If not, it could be a
Squid bug. Unfortunately, nobody has worked on AUFS code for years
(AFAIK) so it may be difficult to fix anything that might be broken there.


Cheers,

Alex.

Re: squid 3.5.27 does not respect cache_dir-size but uses 100% of partition and fails

pete dawgg
THX for your reply!

> Subject: Re: [squid-users] squid 3.5.27 does not respect cache_dir-size but uses 100% of partition and fails
>
> On 07/11/2018 04:39 AM, pete dawgg wrote:
>
> > cache_dir aufs /mnt/cache/squid 75000 16 256
>
> > FATAL: Ipc::Mem::Segment::open failed to shm_open(/squid-cf__metadata.shm): (2) No such file or directory
>
> If you are using a combination of an SMP-unaware disk cache (AUFS) with
> SMP features such as multiple workers or a shared memory cache, please
> note that this combination is not supported.
I have set workers 8 just recently; but the disk-full error had definitely been occurring before.

> The FATAL message above is about a shared memory segment used for
> collapsed forwarding. IIRC, Squid v3 attempted to create those segments
> even if they were not needed, so I cannot tell for sure whether you are
> using an unsupported combination of SMP/non-SMP features.
>
> I can tell you that you cannot use a combination of collapsed
> forwarding, AUFS cache_dir, and multiple workers. Also, non-SMP
> collapsed forwarding was primarily tested with UFS cache_dirs.
I was not aware of that - I can deactivate the workers 8 setting again.
"Collapsed forwarding" was not set intentionally. This error seems to
occur when the disk is really full and squid is restarted.

>
> Unfortunately, I cannot answer your question regarding overflowing AUFS
> cache directories. One possibility is that Squid is not cleaning up old
> cache files fast enough. You already set cache_swap_low/cache_swap_high
> aggressively. Does Squid actively remove objects from the full disk
> cache when you start it up _without_ any traffic? If not, it could be a
> Squid bug. Unfortunately, nobody has worked on AUFS code for years
> (AFAIK) so it may be difficult to fix anything that might be broken there.
When there is no traffic squid seems to be cleaning up well enough: overnight (no traffic)
disk usage went down to 30GB (now it's at 50GB again).

There was another error I just fixed:
> FATAL: Failed to open swap log /mnt/cache/squid/swap.state.new
It was not a permissions or disk-space problem; it was caused by workers 8.
I deactivated workers 8 and this error went away.

THX for your input!
pete

Re: squid 3.5.27 does not respect cache_dir-size but uses 100% of partition and fails

Alex Rousskov
On 07/12/2018 05:53 AM, pete dawgg wrote:


> I have set workers 8 just recently; but the disk full error had
> definitely been occuring before.

AUFS cache_dirs are not compatible with SMP Squid. Removing workers was
the right thing to do even if that incompatibility was not causing disk
overflows.


> FATAL: Ipc::Mem::Segment::open failed to shm_open(/squid-cf__metadata.shm): (2) No such file or directory

> This error seems to occur when the disk is really full and squid is restarted.

Ah, then it could be a side effect of poor PID management (and
associated shared resource locking) in Squid v3. You can probably ignore
this error until you fix the restarts. FWIW, Squid v4 addressed those
shortcomings.


> When there is no traffic squid seems to cleaning up well enough: over
> night (no traffic) disk usage went down to 30GB (now it's at 50GB
> again)

This may be a sign that your Squid cannot keep up with the load. IIRC,
AUFS uses lazy garbage collection so it is possible for the stream of
new objects to outpace the stream of object deletion events, resulting
in a gradually increasing cache size. Using even more aggressive
cache_swap_high might help, but there is no good configuration solution
to this UFS problem AFAIK.


> There was another error i just fixed:
>> FATAL: Failed to open swap log /mnt/cache/squid/swap.state.new
> Not a permissions or diskspace problem, caused by workers 8.
> I have deactivated workers 8 and this error went away.

Yes, that error is one of the signs that AUFS cache_dirs are not SMP-aware.

Alex.

Re: squid 3.5.27 does not respect cache_dir-size but uses 100% of partition and fails

Amos Jeffries
Administrator
On 13/07/18 04:16, Alex Rousskov wrote:

> On 07/12/2018 05:53 AM, pete dawgg wrote:
>
>
>> When there is no traffic squid seems to cleaning up well enough: over
>> night (no traffic) disk usage went down to 30GB (now it's at 50GB
>> again)
>
> This may be a sign that your Squid cannot keep up with the load. IIRC,
> AUFS uses lazy garbage collection so it is possible for the stream of
> new objects to outpace the stream of object deletion events, resulting
> in a gradually increasing cache size. Using even more aggressive
> cache_swap_high might help, but there is no good configuration solution
> to this UFS problem AFAIK.
>

FYI, to be more aggressive, place the two limits closer together.

I made the removal rate grow in steps of the difference between the
marks. A low of 60 and a high of 70 means there are 4 steps of 10
between a 60% and a 100% full cache - so Squid will be removing 4*200
objects/sec when the cache is 99.999% full. But a low of 90 and a high
of 91 will remove 10*200 objects/sec at the same fullness point.

Low numbers like 60, 70 etc. are only needed now if you have to push the
removal rate past 2K objects/sec - e.g. low 60, high 61 will be removing
40*200 = 8K objects/sec.
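
The step model above can be sketched as follows (the 200 objects/sec
per-step base rate is from the description above; the function name is
mine, for illustration only):

```python
def purge_rate(low, high, per_step=200):
    """Objects/sec removed when the cache is nearly 100% full."""
    step = high - low                    # removal rate grows in steps of this size
    steps_to_full = (100 - low) // step  # steps between the low mark and 100%
    return steps_to_full * per_step

print(purge_rate(60, 70))  # 4 steps of 10 -> 800 objects/sec
print(purge_rate(90, 91))  # 10 steps of 1 -> 2000 objects/sec
print(purge_rate(60, 61))  # 40 steps of 1 -> 8000 objects/sec
```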


If you know your peak traffic rate in req/sec you should be able to tune
the purge rate to match that peak traffic rate. The speed traffic
reaches that peak should inform what the gap is between the watermarks.

Amos

Re: squid 3.5.27 does not respect cache_dir-size but uses 100% of partition and fails

Eliezer Croitoru
Hey Amos,

From the docs:
http://www.squid-cache.org/Versions/v4/cfgman/cache_swap_low.html

I see that this is only for UFS/AUFS/diskd and not rock cache_dir.
What about rock cache_dir?

Eliezer

----
Eliezer Croitoru
Linux System Administrator
Mobile: +972-5-28704261
Email: [hidden email]




Re: squid 3.5.27 does not respect cache_dir-size but uses 100% of partition and fails

Alex Rousskov
On 07/12/2018 06:20 PM, Eliezer Croitoru wrote:

> From the docs:
> http://www.squid-cache.org/Versions/v4/cfgman/cache_swap_low.html
>
> I see that this is only for UFS/AUFS/diskd and not rock cache_dir.
> What about rock cache_dir?

Rock cache_dirs cannot overflow by design. Rock reserves a configured
amount of disk space and uses nothing but that amount of disk space. Due
to optimistic allocation by file systems, you can still run out of disk
space if something else consumes space on the same partition, but the
rock database itself cannot overflow.

Alex.



Re: squid 3.5.27 does not respect cache_dir-size but uses 100% of partition and fails

Eliezer Croitoru
Great!

I am testing it with 4.1, since UFS and AUFS are great but... don't support SMP.

Eliezer

* another thread on the way to the list.

----
Eliezer Croitoru
Linux System Administrator
Mobile: +972-5-28704261
Email: [hidden email]


