squid stores multiple copies of identical ETags


squid stores multiple copies of identical ETags

Tabacchiera, Stefano

Hello there,

I have the following issue with squid (squid-3.5.20-12.el7_6.1.x86_64 on RHEL 7.6)

 

A client requests a JSON file with a “no-cache” header via the proxy.

Squid forwards the request to the origin server, which replies with an “ETag” header (the object is cachable).

Squid stores the object in a cache_dir and forwards it back to the client.

 

The client is pushing the same request at a high rate (~10/sec), regardless of its cachable status.

Squid keeps forwarding the request and – here is my issue – keeps storing the same identical object on disk.

I have thousands of copies of the same ETag on disk.

 

Is there a way to avoid this? I think Squid should store a single copy of the object per URL/ETag.

I’d like to avoid an ad-hoc reload-into-ims refresh-pattern.

 

Here’s an example of headers sequence:

 

 

  1. CLIENT → SQUID

GET http://xxx.xxx.xxx.xxx/blah/FEED.json HTTP/1.1

Accept-Encoding: gzip

User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2

Content-Language: en-US

Cache-Control: no-cache

Pragma: no-cache

Host: xxx.xxx.xxx.xxx

Accept: text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2

Proxy-Connection: keep-alive

 

  2. SQUID → ORIGIN SERVER

GET /blah/FEED.json HTTP/1.1

Accept-Encoding: gzip

User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2

Content-Language: en-US

Pragma: no-cache

Accept: text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2

Host: xxx.xxx.xxx.xxx

Via: 1.1 RP-PRXSQUID-2 (squid/3.5.20)

X-Forwarded-For: unknown

Cache-Control: no-cache

Connection: keep-alive

 

  3. ORIGIN SERVER → SQUID

HTTP/1.1 200 OK

Server: Apache/2.2.16 (Debian)

Last-Modified: Fri, 26 Jun 2020 11:46:00 GMT

ETag: "62b905-a7cfa-5a8fb40bdee00"

Accept-Ranges: bytes

Content-Length: 687354

Keep-Alive: timeout=4, max=45

Connection: Keep-Alive

Content-Type: text/plain

 

  4. SQUID → CLIENT

HTTP/1.1 200 OK

Date: Fri, 26 Jun 2020 11:52:15 GMT

Server: Apache/2.2.16 (Debian)

Last-Modified: Fri, 26 Jun 2020 11:46:00 GMT

ETag: "62b905-a7cfa-5a8fb40bdee00"

Accept-Ranges: bytes

Content-Length: 687354

Content-Type: text/plain

X-Cache: MISS from RP-PRXSQUID-2

X-Cache-Lookup: HIT from RP-PRXSQUID-2:3128

Via: 1.1 RP-PRXSQUID-2 (squid/3.5.20)

Connection: keep-alive

 

Thanks

ST

____________________________________________________________________________________ La presente comunicazione ed i suoi allegati e' destinata esclusivamente ai destinatari. Qualsiasi suo utilizzo, comunicazione o diffusione non autorizzata e' proibita. Se ha ricevuto questa comunicazione per errore, la preghiamo di darne immediata comunicazione al mittente e di cancellare tutte le informazioni erroneamente acquisite. Grazie This message and its attachments are intended only for use by the addressees. Any use, re-transmission or dissemination not authorized of it is prohibited. If you received this e-mail in error, please inform the sender immediately and delete all the material. Thank you.
_______________________________________________
squid-users mailing list
[hidden email]
http://lists.squid-cache.org/listinfo/squid-users

Re: squid stores multiple copies of identical ETags

Amos Jeffries
Administrator
On 27/06/20 12:19 am, Tabacchiera, Stefano wrote:

> Hello there,
>
> I have the following issue with squid (squid-3.5.20-12.el7_6.1.x86_64 
> on RHEL 7.6)
>
> A client requests a json with “no-cache” header via proxy.
>
> Squid forwards the request to origin server, which replies with “ETag”
> header (object is cachable).
>
> Squid stores the object in cache_dir and forwards back to the client.
>
>
> The client is pushing same request at high rate (~10/sec), regardless
> its cachable status.
>
> Squid keeps forwarding the request and – here is my issue – keeps
> storing the same identical object on disk.
>
> I have thousands of copies of the same Etag on disk.
>

Objects are *not* stored by ETag. They are stored by *URL* (or URL+Vary
header).

Also, an object existing on disk does not mean it is considered to be
"latest" version of an object. It only means that no other object has
needed to use that same cache slot/file since the existing object was
stored there.
Clearing cache slots/files the instant their content becomes obsolete
would cause up to _double_ the amount of disk writing to happen. Squid
already does a huge amount of writes.

In general, if the same object occurring N times on disk is a problem,
you have misconfigured cache_dir parameters, e.g. the cache_dir size is
too big for the physical disk it is stored on. Each type of cache_dir
is optimized for different object types on different OSes; selecting
which to use can be important for high-performance installations.
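[Editor's illustration: a ufs-family cache_dir line takes the size in MB plus L1/L2 directory counts. The path and figures below are hypothetical, not a recommendation from this thread; the general rule is to keep the declared size comfortably below the filesystem's free space:]

```
# squid.conf sketch (hypothetical path and size)
# cache_dir <type> <path> <size-MB> <L1> <L2>
cache_dir aufs /var/spool/squid 60000 16 256
# On a 100 GB disk, leaving tens of GB of headroom gives swap files,
# metadata and lazy garbage collection room to work.
```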


>
> Is there a way to avoid this? I think Squid should store a single copy
> per-URL/ETAg of the object.
>
> I’d like to avoid an ad-hoc reload-into-ims refresh-pattern.
>

Squid is doing exactly what the client is demanding with its use of
"no-cache".


There is nothing wrong with that configuration. It is the best way to
make Squid cope with such a nasty client. The alternative is to ignore
*all* Cache-Control headers from all clients on all traffic, which is
overkill.
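[Editor's illustration: such a refresh_pattern would look roughly like this in squid.conf; the URL pattern here is a hypothetical match for the feed, not taken from the thread:]

```
# squid.conf sketch: downgrade client "no-cache"/reload requests into
# If-Modified-Since revalidations, for matching URLs only.
refresh_pattern -i \.json$ 0 20% 4320 reload-into-ims
```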


Amos

Re: squid stores multiple copies of identical ETags

Tabacchiera, Stefano
In reply to this post by Tabacchiera, Stefano

 

>In general, if the same object occuring N times on disk is a problem,

That's the point. There are a LOT of identical objects on disk.

>you have issues with misconfigured cache_dir parameters. eg the
>cache_dir size is too big for the physical disk it is stored on.

I have 2x100gb dedicated disks, ext4 noatime.
Each cache_dir is aufs 80000 16 256.

Where's the issue? I didn't even imagine this would lead to multiple stored copies of the same object.

Can you please advise on this?
Thx!

ST


Re: squid stores multiple copies of identical ETags

Amos Jeffries
Administrator
On 27/06/20 5:39 am, Tabacchiera, Stefano wrote:

>
>>In general, if the same object occuring N times on disk is a problem,
>
> That's the point. There's  a LOT of identical objects on disk.
>
>>you have issues with misconfigured cache_dir parameters. eg the
> cache_dir size is too big for the physical disk it is stored on.
>
> I have 2x100gb dedicated disks, ext4 noatime.
> Each cache_dir is aufs 80000 16 256.
>
> Where's the issue? I didn't even imagine this would lead to multiple
> stored copies of the same object.

So far the problem appears to be you not understanding how caching
works. My previous response contains the explanation that should have
resolved that.


>
> Can you please advise on this?

Only with what I stated already in my previous response.


Amos

Re: squid stores multiple copies of identical ETags

Tabacchiera, Stefano
>>> In general, if the same object occurring N times on disk is a problem,
>>> you have issues with misconfigured cache_dir parameters. eg the
>>> cache_dir size is too big for the physical disk it is stored on.

>> That's the point. There's a LOT of identical objects on disk.
>> I have 2x100gb dedicated disks, ext4 noatime.
>> Each cache_dir is aufs 80000 16 256.
>> Where's the issue? I didn't even imagine this would lead to multiple
>> stored copies of the same object.

> So far the problem appears to be you not understanding how caching
> works. My previous response contains the explanation that should have
> resolved that.

Amos, I'm sorry, but I'm still confused.

Please follow me on this:
Consider a cachable object, e.g. a static image, with all its response headers set (content-length/last-modified/etag/expiration/etc).
When the client requests it with “no-cache”, that prevents squid from serving the cached on-disk object and forces the retrieval of a fresh copy from the origin server.
So far, so good.
But THIS new copy is the same identical object that is already on disk (same URL/size/etc.), because the client is requesting the same object many times per second, all day.

I understand that squid must serve the fresh copy to the client (no-cache), but what I don't get is why squid stores a new copy of THIS object on disk every time.
In my (maybe faulty) understanding, this could be avoided by simply looking up the store index and finding that this particular object already exists on disk.

Since this doesn't seem to be happening, chances are: squid doesn't care about storing multiple copies on disk OR (more probably) I'm still missing something vital.

In the real case, the object is a JSON feed which is modified every 5 minutes. Every time it changes, it obviously gets a new ETag, a new Last-Modified, a proper Content-Length, etc.
The client requests it about 10 times per second: 10*300 ≈ 3000 copies on disk. With a mean object size of 500KB: 3000*500KB ≈ 1.4GB.
A single object is wasting ~1.4GB of disk space every 5 minutes. Indeed, during a restart, squid does a lot of purging of duplicate objects.
Is this really necessary? I don't see the point.
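[Editor's note: the arithmetic above can be double-checked in a few lines; the rates and sizes are the rough estimates from this message, not measurements.]

```python
# Back-of-the-envelope check of the disk-waste estimate.
req_per_sec = 10     # client request rate (estimate)
interval_s = 300     # feed regenerated every 5 minutes
object_kb = 500      # mean object size (estimate)

copies = req_per_sec * interval_s              # stale copies per interval
wasted_gb = copies * object_kb / (1024 * 1024)

print(copies)               # 3000
print(round(wasted_gb, 2))  # ~1.43
```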

You mentioned the cache_dir parameters, like the cache size compared to the disk size, or the L1/L2 ratio.
Can you please be more specific or point me at the right documentation?
I'd appreciate a lot your help.

Thanks
ST


Re: squid stores multiple copies of identical ETags

Alex Rousskov
On 6/27/20 2:31 PM, Tabacchiera, Stefano wrote:

> Consider a cachable object ... with all its response headers set
> (content-length/last-modified/etag/expiration/etc). When the client
> requests it with "no-cache", it prevents squid from providing the
> cached on-disk object, and it forces the retrieve of a new copy from
> the origin server.

> But THIS new copy is the same identical object which is already on
> disk (same url/size/etc.)

Squid does not know that the response headers and body have not changed.
Squid could, in theory, trust the URL+Vary+ETag+etc. combination as a
precise response identity, but it is a bit risky to do that by default
because ETags/etc. might lie. There is currently no code implementing
that optimization either.


> In my (maybe faulty) understandings this could be avoided, by simply
> look up in the store log and find that this particular object already
> exists on disk.

Squid could do that if it trusts ETag/etc and updates stored headers.
Squid does not do that (yet?). Even the header update part is not fully
supported yet!


> Since this doesn't seem to be happening, chances are: squid doesn't
> care about storing multiple copies on disk

To be more accurate, Squid does not store multiple copies of (what Squid
considers to be) the same response -- only one object can be indexed per
URL/Vary. Bugs notwithstanding, Squid will overwrite the old response
(for some definition of "overwrite") with the new one.

I do not know much about aufs -- that code has been neglected for a
while -- but perhaps aufs simply does not have enough time to delete its
old/unused files? Try setting cache_swap_low and cache_swap_high to the
same very low value, perhaps even zero (to avoid backgrounding the
cleanup task).
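[Editor's illustration of the suggested experiment in squid.conf terms; the values are only for testing, not production tuning:]

```
# squid.conf sketch: make eviction kick in almost immediately instead
# of waiting until the cache_dir approaches its configured size.
cache_swap_low 1
cache_swap_high 2
```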



HTH,

Alex.

Re: squid stores multiple copies of identical ETags

Tabacchiera, Stefano
Alex,
first of all, thank you for your clarification.

>    Squid does not know that the response headers and body have not changed.
>    Squid could, in theory, trust the URL+Vary+ETag+etc. combination as a
>    precise response identity, but it is a bit risky to do that by default
>    because ETags/etc. might lie. There is currently no code implementing
>    that optimization either.

Ok, now I know that only URL/Vary are taken into account when storing an object.

>    To be more accurate, Squid does not store multiple copies of (what Squid
>    considers to be) the same response -- only one object can be indexed per
>    URL/Vary. Bugs notwithstanding, Squid will overwrite the old response
>    (for some definition of "overwrite") with the new one.

Since in my case there's no Vary header and the object's full URL never changes,
I'm starting to think about a bug (?!).

>    I do not know much about aufs -- that code has been neglected for a
>    while -- but perhaps aufs simply does not have enough time to delete its
>    old/unused files? Try setting cache_swap_low and cache_swap_high to the
>    same very low value, perhaps even zero (to avoid backgrounding the
>  cleanup task).

Uhm, are you saying that the process of replacing an object on disk is not atomic?
I mean: squid would store a new copy of the object while leaving the old copy's deletion to a cleanup task?
If it's not atomic, I still suspect a bug.
I'm hesitant to turn the store_log on, because of the performance impact.
Btw, is there a specific debug_options section?

Thanks a lot.
ST



Re: squid stores multiple copies of identical ETags

Alex Rousskov
On 6/28/20 2:41 PM, Tabacchiera, Stefano wrote:

>>    Squid does not know that the response headers and body have not changed.
>>    Squid could, in theory, trust the URL+Vary+ETag+etc. combination as a
>>    precise response identity, but it is a bit risky to do that by default
>>    because ETags/etc. might lie. There is currently no code implementing
>>    that optimization either.

> now I know that only URL/Vary are taken into account when storing an object.

For the record: I did not say or imply that only URL/Vary are taken into
account. Other request aspects such as request method also influence the
cache key.


> I mean: squid would store a new copy of the object while leaving the
> old copy deletion to cleanup task?

Some parts of the cleanup process may be delegated. The details depend
on the cache_dir type. I do not know or remember aufs specifics, but I
suspect that all ufs-based cache_dirs, including aufs, use lazy garbage
collection (under normal circumstances). The cache_swap_low and
cache_swap_high directives should determine what is "normal".


> I'm hesitant to turn the store_log on, 'cause the performance impact.
> Btw, is there a specific debug_option?

FWIW, I would not try to debug this on a live/production cache,
especially if you are not used to navigating debugging cache.logs from
busy proxies. The same bug (if any) should be reproducible in lab settings.

These debugging sections may be relevant (but this is not meant as a
comprehensive list, and I do not know what exactly you need to look at):

> doc/debug-sections.txt:section 20    Memory Cache
> doc/debug-sections.txt:section 20    Storage Manager MD5 Cache Keys
> doc/debug-sections.txt:section 20    Store Controller
> doc/debug-sections.txt:section 32    Asynchronous Disk I/O
> doc/debug-sections.txt:section 43    AIOPS
> doc/debug-sections.txt:section 47    Store Directory Routines
> doc/debug-sections.txt:section 81    Store HEAP Removal Policies

Most of the AUFS/AIO code is in src/DiskIO/DiskThreads
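[Editor's illustration: assuming standard debug_options syntax, the store/disk sections listed above could be raised for a lab run with something like this; the levels are illustrative:]

```
# squid.conf sketch: verbose logging for store and disk I/O sections
# only, everything else at the default level.
debug_options ALL,1 20,5 32,5 43,5 47,5 81,5
```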


HTH,

Alex.

R: squid stores multiple copies of identical ETags

Tabacchiera, Stefano

>> I mean: squid would store a new copy of the object while leaving the

>> old copy deletion to cleanup task?

 

>Some parts of the cleanup process may be delegated. The details depend on the cache_dir type. I do not know or remember aufs specifics, but I suspect that all ufs-based cache_dirs, including aufs, use lazy garbage collection (under normal circumstances). The cache_swap_low and cache_swap_high directives should determine what is "normal".

 

Alex, you were absolutely right.

I managed to reproduce the case.

On a test environment I set up "cache_swap_low 1" and "cache_swap_high 2" and enabled the store_log.

Then I tailed the store_log and watched the evolution of the cache_dir, while running squidclient toward the origin server every 100ms.

 

Store.log:

1593504359.704 SWAPOUT 00 000098C5 1865A3A26D411E7C0D8D87770720E405  200 1593504359 1593504061        -1 text/plain 544275/544275 GET http://xxx.xxx.xxx.xxx/blah/FEED.json

1593504359.858 SWAPOUT 00 000098C6 1865A3A26D411E7C0D8D87770720E405  200 1593504359 1593504061        -1 text/plain 544275/544275 GET http://xxx.xxx.xxx.xxx/blah/FEED.json

1593504360.015 SWAPOUT 00 000098C7 1865A3A26D411E7C0D8D87770720E405  200 1593504359 1593504061        -1 text/plain 544275/544275 GET http://xxx.xxx.xxx.xxx/blah/FEED.json

1593504360.170 SWAPOUT 00 000098C8 1865A3A26D411E7C0D8D87770720E405  200 1593504359 1593504061        -1 text/plain 544275/544275 GET http://xxx.xxx.xxx.xxx/blah/FEED.json

1593504360.324 SWAPOUT 00 000098C9 1865A3A26D411E7C0D8D87770720E405  200 1593504359 1593504061        -1 text/plain 544275/544275 GET http://xxx.xxx.xxx.xxx/blah/FEED.json

1593504360.476 SWAPOUT 00 000098CA 1865A3A26D411E7C0D8D87770720E405  200 1593504359 1593504061        -1 text/plain 544275/544275 GET http://xxx.xxx.xxx.xxx/blah/FEED.json

1593504360.634 SWAPOUT 00 000098CB 1865A3A26D411E7C0D8D87770720E405  200 1593504360 1593504061        -1 text/plain 544275/544275 GET http://xxx.xxx.xxx.xxx/blah/FEED.json

1593504360.788 SWAPOUT 00 000098CC 1865A3A26D411E7C0D8D87770720E405  200 1593504360 1593504061        -1 text/plain 544275/544275 GET http://xxx.xxx.xxx.xxx/blah/FEED.json

1593504360.941 SWAPOUT 00 000098CD 1865A3A26D411E7C0D8D87770720E405  200 1593504360 1593504061        -1 text/plain 544275/544275 GET http://xxx.xxx.xxx.xxx/blah/FEED.json

1593504361.096 SWAPOUT 00 000098CE 1865A3A26D411E7C0D8D87770720E405  200 1593504360 1593504061        -1 text/plain 544275/544275 GET http://xxx.xxx.xxx.xxx/blah/FEED.json

1593504361.249 SWAPOUT 00 000098CF 1865A3A26D411E7C0D8D87770720E405  200 1593504360 1593504061        -1 text/plain 544275/544275 GET http://xxx.xxx.xxx.xxx/blah/FEED.json

1593504361.403 SWAPOUT 00 000098D0 1865A3A26D411E7C0D8D87770720E405  200 1593504360 1593504061        -1 text/plain 544275/544275 GET http://xxx.xxx.xxx.xxx/blah/FEED.json

1593504361.556 SWAPOUT 00 000098D1 1865A3A26D411E7C0D8D87770720E405  200 1593504360 1593504061        -1 text/plain 544275/544275 GET http://xxx.xxx.xxx.xxx/blah/FEED.json

1593504361.607 RELEASE 00 000098D1 1865A3A26D411E7C0D8D87770720E405   ?         ?         ?         ? ?/? ?/? ? ?

 

Copies of the cached object were gradually accumulating in the cache_dir, until the “RELEASE” line showed up in store.log.

At that moment, all copies of the object stored on disk were deleted.
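[Editor's sketch of how such duplicates could be counted from a store.log excerpt like the one above; the field layout is inferred from the sample lines: timestamp, action, dir number, file number, cache key.]

```python
from collections import Counter

# Count SWAPOUT records per cache key (the MD5 hash in the 5th field).
sample = """\
1593504359.704 SWAPOUT 00 000098C5 1865A3A26D411E7C0D8D87770720E405 200
1593504359.858 SWAPOUT 00 000098C6 1865A3A26D411E7C0D8D87770720E405 200
1593504361.607 RELEASE 00 000098D1 1865A3A26D411E7C0D8D87770720E405 ?
"""

swapouts = Counter()
for line in sample.splitlines():
    fields = line.split()
    if len(fields) >= 5 and fields[1] == "SWAPOUT":
        swapouts[fields[4]] += 1

# A key swapped out more than once means duplicate on-disk copies.
for key, count in swapouts.items():
    print(key, count)  # prints the key and 2 for this sample
```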

 

So I’m assuming that only one object on disk (the last one retrieved) is referenced as “active” by squid, all the rest being trashable.

Since the client is forcing a “no-cache” header, squid does what the client is asking for, and stores the object on disk every time.

I’m also assuming that IF another client asked the same object without the “no-cache” header, squid would serve the latest cached object on disk.

If I’m right so far, squid never “overwrites” the old copy of an object on disk. Instead, it stores a new one, marks it as “active”, and leaves the deletion to the (a)ufs threads.

 

Could it work this way?

 

Thanks!

ST

 


Re: R: squid stores multiple copies of identical ETags

Alex Rousskov
On 6/30/20 5:10 AM, Tabacchiera, Stefano wrote:

> So I’m assuming that only one object on disk (the last one retrieved) is
> the object referenced as “active” by squid, all the rest being trashable.
>
> Since the client is forcing a “no-cache” header, squid does what the
> client is asking for, and every time it stores the object on disk.
>
> I’m also assuming that IF another client asked the same object without
> the “no-cache” header, squid would serve the latest cached object on disk.
>
> If I’m right so far, squid never “overwrites” the old copy of an object
> on disk. Instead, it stores a new one, marking it as “active”,

Yes, the above matches my understanding (for some definition of "last
one", "overwrites", and "active"). The actual situation is a bit more
nuanced (e.g., Squid could be storing and using multiple copies of the
same resource concurrently, even though any new request will never see
more than one copy), but those low-level details may not matter to your
investigation.


> and leaves the deletion to (a)ufs threads.

I cannot confirm or deny this part -- I do not know whether garbage
collection is delegated to aufs thread(s) -- but it sounds plausible.


HTH,

Alex.