What would be the maximum ufs\aufs cache_dir objects?


What would be the maximum ufs\aufs cache_dir objects?

Eliezer Croitoru
What would be the maximum ufs\aufs cache_dir objects?
Let's say I have unlimited disk space, inodes, and RAM. What would be the
maximum number of objects I can store in a single ufs\aufs cache_dir?
It's very easy to test, but first I want to understand what the limit might be.
I am asking because the structure is top-level dirs and sub-level dirs, so
what would be the maximum object capacity (assuming each object uses about
0.5 KB)?
If there is a known number, I would like to know it.

Thanks in advance,
Eliezer

----
Eliezer Croitoru
Linux System Administrator
Mobile: +972-5-28704261
Email: [hidden email]





Re: What would be the maximum ufs\aufs cache_dir objects?

Alex Rousskov
On 07/14/2017 06:37 AM, Eliezer Croitoru wrote:
> What would be the maximum ufs\aufs cache_dir objects?

The maximum number of objects currently supported by any single
cache_dir (rock or ufs-based) is approximately 16777215.

> src/store/forward.h:enum { SwapFilenMax = 0xFFFFFF }; // keep in sync with StoreEntry::swap_filen
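
As a quick back-of-the-envelope sketch (plain arithmetic only; the 0.5 KB
average object size is taken from the question above, not a Squid constant):

    #include <cstdio>

    int main() {
        const long maxObjects = 0xFFFFFF;          // SwapFilenMax == 16777215
        const double avgObjectKB = 0.5;            // assumption from the question
        const double capacityGB = maxObjects * avgObjectKB / (1024.0 * 1024.0);
        printf("max objects per cache_dir: %ld\n", maxObjects);
        printf("payload at %.1f KB/object: ~%.1f GB\n", avgObjectKB, capacityGB);
        return 0;
    }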


There is no practical limit on the number of cache_dirs, although Squid
does use a linear search through cache_dirs in some cases/configurations.

Alex.

Re: What would be the maximum ufs\aufs cache_dir objects?

Amos Jeffries
On 15/07/17 00:37, Eliezer Croitoru wrote:
> What would be the maximum ufs\aufs cache_dir objects?
> Let say I have unlimited disk space and inodes and RAM, what would be the
> maximum objects I can store on a single ufs\aufs cache_dir?


One UFS cache_dir can hold a maximum of (2^27)-1 safely.

Technically it does not need the -1, but the old C code uses a mess of
signed and unsigned types to store the hash ID value. Some (not all)
people hit assertions when the cache reaches that boundary.


> It's very easy to test but first I want to understand what might be the
> limit?

The index hash entries are stored as a 32-bit bitmask (sfileno) - with 5
bits for cache_dir ID and 27 bits for hash of the file details.


> I am asking since the structure is top level dirs and sub level dirs, so if
> I want to get the maximum object capacity (assuming each one would use
> 0.5kb)?

The L1/L2 separation is there to cope with old filesystems that had limits
on the number of files in a directory.

Apparently, that limitation is no longer relevant with the current
generation of filesystems.
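
For context, a minimal cache_dir sketch showing where those L1/L2 values go
(the path and size here are just placeholders; 16 and 256 are the conventional
defaults):

    # aufs cache_dir: 100000 MB of disk, 16 top-level (L1) directories and 256
    # second-level (L2) directories under each, i.e. 16 * 256 = 4096 directories
    # to spread the object files across.
    cache_dir aufs /var/spool/squid 100000 16 256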


Amos

Re: What would be the maximum ufs\aufs cache_dir objects?

Alex Rousskov
On 07/14/2017 10:47 AM, Amos Jeffries wrote:

> One UFS cache_dir can hold a maximum of (2^27)-1 safely.

You probably meant to say (2^25)-1 but the actual number is (2^24)-1
because the sfileno is signed. This is why you get 16'777'215 (a.k.a.
0xFFFFFF) as the actual limit.


> The index hash entries are stored as a 32-bit bitmask (sfileno) - with 5
> bits for cache_dir ID and 27 bits for hash of the file details.

The cache index entries are hashed on their keys, not file numbers (of
any kind). The index entry uses 25 bits for the file number, but
IIRC, those 25 bits are never merged/combined with the 7 bits of the
cache_dir ID in any meaningful way.


Alex.

> typedef signed_int32_t sfileno;
>     sfileno swap_filen:25; // keep in sync with SwapFilenMax
>     sdirno swap_dirn:7;
> enum { SwapFilenMax = 0xFFFFFF }; // keep in sync with StoreEntry::swap_filen
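
A small standalone sketch (not Squid code itself; the typedefs below are local
stand-ins for Squid's signed_int32_t) showing why a signed 25-bit bitfield
tops out at 0xFFFFFF:

    #include <cstdint>
    #include <cstdio>

    typedef int32_t sfileno;   // stand-in for Squid's signed_int32_t
    typedef int32_t sdirno;

    struct SwapId {
        sfileno swap_filen:25; // signed, so the largest positive value is 2^24 - 1
        sdirno  swap_dirn:7;
    };

    int main() {
        const long maxFilen = (1L << 24) - 1;  // 16777215 == 0xFFFFFF == SwapFilenMax
        printf("max swap_filen: %ld (0x%lX)\n", maxFilen, maxFilen);
        printf("bitfield struct size: %zu bytes\n", sizeof(SwapId)); // typically 4
        return 0;
    }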


Re: What would be the maximum ufs\aufs cache_dir objects?

Eliezer Croitoru
So basically, from what I understand, the limit of an AUFS\UFS cache_dir is
16,777,215 objects.
For a very loaded system that might be pretty "small".

I asked because:
I have seen the MongoDB eCAP adapter that stores chunks, and I didn't like it.
On the other hand, I wrote a cache_dir in Go which I am using for the Windows Updates caching proxy, and for now it surpasses the AUFS\UFS limits.

Based on the success of the Windows Updates Cache proxy, which strives to cache only public objects, I was thinking about writing something similar for more general usage.
The basic constraint on what would be cached is that the object must have Cache-Control "public".
The first step would be an ICAP service (respmod) which will log requests and responses and decide which GET results are worth fetching later.
Squid currently does things on-the-fly, while the response is being delivered to the client.
For an effective cache I believe we can compromise on another approach, one which relies on statistics.
The first rule is: not everything is worth caching!
Then, after understanding and configuring this, we can move on to fetching *public* objects only once they show a high number of repeated downloads.
This is actually how Google's cache and other similar cache systems work.
They first let traffic reach the "DB" or "DATASTORE" if it is seen for the first time.
Then, once it crosses a specific threshold, the object is fetched by the cache system independently of the transactions the clients consume.
It might not be the most effective caching method for very loaded systems, very large files, or *very* high-cost upstream connections, but for many it will be fine.
And the actual logic can be implemented with a couple of algorithms, such as LRU as the default and a couple of others as options.
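
To make the statistics-driven idea concrete, here is a purely illustrative C++
sketch of the threshold logic; the class name, the threshold value, and the URL
are hypothetical and not part of any existing tool:

    #include <iostream>
    #include <string>
    #include <unordered_map>

    // Toy hit counter: decide when a URL has been requested often enough to be
    // worth fetching into the store in the background.
    class FetchDecider {
    public:
        explicit FetchDecider(unsigned threshold) : threshold_(threshold) {}

        // Record one sighting of a cacheable (Cache-Control: public) GET result.
        // Returns true the first time the URL reaches the threshold.
        bool recordAndCheck(const std::string &url) {
            unsigned &hits = counts_[url];
            ++hits;
            return hits == threshold_;
        }

    private:
        unsigned threshold_;
        std::unordered_map<std::string, unsigned> counts_;
    };

    int main() {
        FetchDecider decider(3); // fetch only after the 3rd request for the same URL
        const std::string url = "http://example.com/big-public-object";
        for (int i = 0; i < 4; ++i) {
            if (decider.recordAndCheck(url))
                std::cout << "enqueue background fetch of " << url << "\n";
        }
        return 0;
    }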

I believe that this logic will be good for specific systems and will remove all sorts of weird store\cache_dir limitations.
I already have a ready-to-use system, which I named "YouTube-Store", that allows the admin to download and serve specific YouTube videos from a local web service.
It can be used together with an external_acl helper that redirects clients to a special page hosting the cached\stored video, with an option to bypass the cached version.

I hope to publish this system soon under a BSD license.

Thanks,
Eliezer

----
Eliezer Croitoru
Linux System Administrator
Mobile: +972-5-28704261
Email: [hidden email]




Re: What would be the maximum ufs\aufs cache_dir objects?

Amos Jeffries
On 18/07/17 05:34, Eliezer Croitoru wrote:

> So basically from I understand the limit of the AUFS\UFS cache_dir is at:
> 16,777,215 Objects.
> So for a very loaded system it might be pretty "small".
>
> I have asked since:
> I have seen the mongodb ecap adapter that stores chunks and I didn't liked it.
> In the other way I wrote a cache_dir in GoLang which I am using for the windows updates caching proxy and for now it's surpassing the AUFS\UFS limits.
>
> Based on the success of the Windows Updates Cache proxy which strives to cache only public objects, I was thinking about writing something similar for a more global usage.
> The basic constrain on what would be cached is only If the object has Cache-Control "public".

You would end up with only a small subset of HTTP ever being cached.

CC:public's main reason for existence is to re-enable cacheability of
responses that contain security credentials - which is prevented by
default as a security fail-safe.

I know a fair number of servers still send it when they should not. But
that is declining as content gets absorbed by CDNs, which take more care
with their bandwidth expenditure.



> The first step would be an ICAP service (respmod) which will log requests and response and will decide what GET results are worthy of later fetch.
> Squid currently does things on-the-fly while the client transaction is fetched by the client.

What things are you speaking about here?

How do you define "later"? Is that 1 nanosecond or 64 years?
  And what makes a 1 nanosecond difference in request timing for a 6GB
object any less costly than 1 second?

Most of what Squid does, and the timing of it, have good reasons behind
them. I am not saying change is bad, but to make real improvements instead of
re-inventing some long-lost wheel design one has to know those reasons
to keep them from becoming problems.
  E.g. the often-laughed-at square wheel is a real and useful design for
some circumstances. And its lesser brethren, cogwheels and the like,
are an age-proven design in rail history for places where roundness
actively inhibits movement.


> For an effective cache I believe we can compromise on another approach which relays or statistics.
> The first rule is: Not everything worth caching!!!
> Then after understanding and configuring this we can move on to fetch *Public* only objects when they get a high repeated downloads.
> This is actually how google cache and other similar cache systems work.
> They first let traffic reach the "DB" or "DATASTORE" if it's the first time seen.

FYI: that is the model Squid is trying to move away from, because it
slows down traffic processing. As far as I'm aware, G has a farm of
servers to throw at any task, unlike most sysadmins trying to stand up a
cache.


> Then after more the a specific threshold they object is being fetched by the cache system without any connection to the transaction which the clients consume.

Introducing the slow-loris attack.

It has several variants:
1) A client sends a request very, very, ... very slowly. Many thousands
of bots all do this at once, or build up over time.
   -> an unwary server gets crushed under the weight of open TCP
sockets, and its normal clients get pushed out into DoS.

2) A client sends a request, then ACKs delivery very, very, ... very slowly.
   -> an unwary server gets crushed under the weight of open TCP
sockets, and its normal clients get pushed out into DoS, AND it suffers for
each byte of bandwidth it spent fetching content for that client.

3) Both of the above.

The slower a server is at detecting this attack, the more damage can be
done. This is magnified by whatever resource expenditure the
server goes to before detection can kick in: RAM, disk I/O, CPU time,
TCP sockets, and, most relevant here, upstream bandwidth.

Also, a slow loris and clients on old tech like 6K modems or worse are
indistinguishable.

To help resolve this problem Squid does the _opposite_ of what you
propose above. It aligns the client delivery and the server fetch
to avoid mistakenly detecting these attacks and disconnecting legitimate
clients.
  The read_ahead_gap directive configures the threshold amount of server
fetch which can be done at full server-connection speed before slowing
down to client speed. The various I/O timeouts can be tuned to what a
sysadmin knows about their clients' expected I/O capabilities.
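
For illustration, a minimal squid.conf sketch of the knobs mentioned above; the
values shown are the documented defaults, not recommendations:

    # How much of a response Squid may fetch from the server ahead of what the
    # client has consumed, before server reads slow down to the client's speed.
    read_ahead_gap 16 KB

    # I/O timeouts that can be tuned to the clients' expected connection speeds.
    read_timeout 15 minutes
    request_timeout 5 minutes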


> It might not be the most effective caching "method" for specific very loaded systems or specific big files and *very* high cost up-stream connections but for many it will be fine.
> And the actual logic and implementation can be each of couple algorithms like LRU as the default and couple others as an option.
>
> I believe that this logic will be good for specific systems and will remove all sort of weird store\cache_dir limitations.

Which weird limitations are you referring to?

The limits you started this thread about are caused directly by the size
of a specific integer representation and the mathematical properties
inherent in a hashing algorithm.

Those types of limit can be eliminated or changed in the relevant code
without redesigning how HTTP protocol caching behaves.


Amos

Re: What would be the maximum ufs\aufs cache_dir objects?

Eliezer Croitoru
OK so time for a response!

I first want to describe the different "cache" models, which are similar to real-world scenarios.
The first and most basic example would be the small local "store", which supplies food or basic things you need, like work tools such as a screwdriver and many other small items for basic house maintenance.
In these types of stores you have "on-demand" ordering, which is divided into "fast" or "slow" supply times.
Compared to these there is the "warehouse" or some big place which in some cases just supplies what they "have" or "sell", like IKEA or similar brands.
And of course, in the world of workshops everything is "on-demand" and almost nothing is on the shelf; while they supply their services to almost anyone, they also stock basic standard materials which can be used for each order.

In the world of proxies we have a "storage" system, but it is not 100% similar to any of the real-world "storage" scenarios in stores.
For this reason it is hard to just pick a specific model like LRU, which local food stores use most of the time, but with a "pre-fetch" flavour.

For a cache admin there are a couple of things to think about before implementing any cache:
- purpose
- available resources (bandwidth, storage space, etc.)
- funding for the system\project

So, for example, some admins just blindly try to force caching on their clients while harming themselves and their clients (which Squid 3.3+ fixed).
They just don't get that the time to cache is when you need it, not when you want it.
If you have a limited amount of bandwidth and your clients know it but blindly "steal" the whole line from others, the real measure to enforce the bandwidth policy is not a cache but a QoS system.
There are solutions which can help admins give their clients the best Internet experience.
I know that on some ships, for example, which have expensive satellite Internet links, you pay per MB; Windows 10 update downloads are out of the question, and Microsoft sites and updates should be blocked by default and allowed only for specific major bug-fix cases.
For places which have lots of users but limited bandwidth, a cache might not be the right solution for every scenario, and you (the admin) need a bandwidth policy rather than a cache.
A cache is something I would call a "luxury"; it is only an enhancement of a network.
On today's Internet there is so much content out there that we actually need to kind of "limit" our usage and consumption to something reasonable for our environment.
With all my desire to watch some YouTube video in 720p or 1080p HD, it is not the right choice if someone else on my network needs the Internet link for medical-related things.

With all the above in mind, I believe the Squid way of doing things is good and fits most of the harshest environments, and 3.5 does a good job of restricting the admin from caching what is dangerous to cache.
This is one of the things I consider "stable" in Squid!

And now, the statement "cache only public" divides into two types:
- everything that is worth caching, is permitted by law, and will do no harm
- caching only what is required to be cached

For example: Why should I cache a 13KB object if I have a 10Gbit line to the WAN?

From my experience with caching there is no "general" solution, and the job of a cache admin is a task that takes time: tuning the system for the right purpose.
For example, if Squid caches every new object, there is a chance that the clients' way of using the Internet will fill up the disk and start a never-ending cycle of cache cleanup and population, a chicken-and-egg situation where the cache never serves even one hit because the admin tried to "catch all".

So "cache public objects" might come to mean something other than "CC =~ public", and the successor of that tiny technical term would become a "smarter" one that defines "public" as what really should be cached: objects that have a chance of being re-fetched and re-downloaded more than once and will not cause the cache to just cycle around, writing and cleaning up over and over again.

Squid currently does "things" (analyses requests and responses and populates the cache) on-the-fly and gives the users a very "active" part in populating the cache.
So the admin has the means to control the cache server and how much influence the users have, but most admins I know tend to just spin up a cache instance, maybe google a couple of useless "refresh_pattern" lines and use them, causing this endless loop of store... cleanup...

Squid is a great product, but the combination of:
- Greed
- Lust
- Pressure
- Budget
- Lack of experience
- Laziness

leads cache systems around the world to be less effective than they would have been with a bit more effort to understand the subject practically.

You asked about "later", and the definition is admin-dependent.
Of course, for costly links such as satellite, "later" might not be cheaper... but it can be effective.
Depending on the scenario, "later" could be the right way of doing things, while in many other cases it might not be, and the cache admin needs to do some digging to understand what he is doing.
A cache is actually a shelf product, and if you need something that works "in-line", i.e. transparent, "on-the-fly", and "user-driven", it might be a good idea to pay somebody who can deliver results.
As was mentioned here (on the list) in the past, when you weigh a sysadmin's hours against a ready-to-use product, there are scenarios in which a working product is the better choice from all the aspects mentioned above in this email.

I am pretty sure I understand why Squid's download timing works the way it does... (I work at a big ISP after all, one of the top 10 in the whole area).
But I want to make clear that I don't want to reinvent the wheel, only to respond to some specific cases which I already handle.
For example, the MS updates caching proxy will not work for domains other than those of MS updates.
Also, I just want to mention that MS updates have a very remarkable way of making sure the client will receive the file and ensuring its integrity; MS deserves big respect for their way of implementing the CIA!
(Despite the fact that many describe why and how much they dislike MS.)

Indeed, G has more than one server farm that helps with harvesting, rendering, analyzing, categorizing, etc., which many don't have, and I claim that for specific, targeted things I can offer free solutions that might help many ISPs that are already using Squid.
Also, I believe that video tutorials from a Squid developer might help cache admins understand how to tune their systems without creating the "cycle" I mentioned earlier.
(I do intend to release a couple of tutorials and would like recommendations for some key points that need to be covered.)

About the mentioned cons\attacks that the server would be vulnerable to...
The service I am offering would work with Squid as an ICAP service, and it will not download just any request over and over again.
Also, it is good you mentioned these specific attack patterns, because the solution should eventually be integrated with Squid log analysis to find out how many unique requests have been made for a specific url\object, which will help mitigate some of these attacks.

I do like read_ahead_gap and I like the concept, but eventually we are talking about a couple of things:
- understand the scenario and the requirements from the cache
- limit the cache "scope"
- allow an object to be fetched only once, and based on statistics
- allow Squid to cache what it can, with the ICAP service acting as an "addon" for Squid, helping the admin with specific scenarios like MS updates, YouTube, Vimeo, and a couple of other sites of interest (a rough sketch of the wiring follows below)
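
As a rough sketch of how such a service could be attached to Squid (the service
name, port, and ICAP URL path here are placeholders):

    # Hand responses to a local ICAP RESPMOD service for logging/analysis.
    icap_enable on
    icap_service stats_respmod respmod_precache bypass=1 icap://127.0.0.1:1344/respmod
    adaptation_access stats_respmod allow all
    # bypass=1 lets traffic continue untouched if the ICAP service is down.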

Currently Squid cannot use an AUFS\UFS cache_dir with SMP, and the cache store system I wrote utilizes the FS and has the option to choose a hashing algorithm other than MD5, such as SHA256\512.
I believe it is time to start thinking about more than MD5, i.e. SHA256, and maybe make it configurable, as we discussed a year or more ago. (I cannot do this myself and I do not have a funder for it.)

And just to grasp the differences: the caching service I am running for MS updates utilizes less CPU, balances CPU load, and gets a very high number of cache HITs and high throughput.

*My solutions act as addons to Squid, not as replacements for it.*
So if Squid is vulnerable to something, it will be hit before my service.

Currently I am just finishing a local YouTube store solution, for public videos only.
It consists of a couple of modules:
- Queue and fetch system
- Storage system (NFS)
- Web Server(Apache, PHP)
- Object storage server
- Squid traffic analysis utilities
- External ACL helper that redirects traffic from the YouTube page to the locally cached version (with an option to bypass the cached version)

I did write some things from scratch, but the concept has been around for a very long time and was built up over time.
From my testing, MS updates have been a pain in the neck for the last few years, but I have seen improvement.
I noticed that Akamai services are sometimes broken, and MS systems tend to start fetching the object from Akamai and then switch to streaming it directly from an MS farm, so...
CDNs are nice, but if you implement them in the wrong way they can "block" the traffic.
This specific issue I have seen with MS updates spanned a couple of countries, and I didn't manage to contact any Akamai personnel using the public email contacts for a while.
So they just don't get paid by MS, due to their lack of effort to take their service up a level.

I hope a couple of things were cleared up.
If you have any more comments, I'm here for them.

Eliezer

----
Eliezer Croitoru
Linux System Administrator
Mobile: +972-5-28704261
Email: [hidden email]




Re: What would be the maximum ufs\aufs cache_dir objects?

Omid Kosari
Interesting, because I was going to create a new topic like this, but Eliezer read my mind ;)

Nowadays I can see that HTTP traffic is shrinking more and more, and every day I think about retiring Squid.

But currently I see that most of the remaining HTTP traffic which is worth caching is:
Microsoft (Windows Updates + App Updates)
Apple (iOS Updates + App Updates)
Game Consoles (PlayStation + Xbox + Game Updates)
Google (Android Apps + Chrome Apps)
Samsung (Firmware Updates + App Updates)
CDNs (Akamai + llnwd)
Antivirus Updates

International HTTP traffic is less than 20% of all international traffic. The sites mentioned above account for more than 60% of international HTTP traffic, so they are more than 10% of all international traffic.

Now I prefer to cache only the mentioned sites. But each line needs special customization, like Eliezer's tool for Windows updates.

Squid is an advanced, general-purpose caching platform, and customizing it for each website is far from its roadmap. So I think supporting others like Eliezer in creating custom helpers/services for each website may help Squid be more popular and active.