squid.service with Type=Notify is not always reliable (Arch Linux)

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

squid.service with Type=Notify is not always reliable (Arch Linux)

Amish
Hello,

I have frequently observed an issue with squid.service but I am not able
to detect the real cause. As mostly it works but sometimes does not.

I am using Arch Linux with squid 4.13 (but I had noticed same issue in
4.10 as well)

Most likely I am noticing this issue ever since service Type was changed
from forking to notify in squid.service

Commit:
https://github.com/squid-cache/squid/commit/6fa8c66435d55a2e713db0222cdca3a9dccf5bbe

What happens is squid starts correctly, but systemd does not seem to be
getting the notification from squid that it has started.

Sep 01 06:40:04 foo systemd[1]: Starting Squid Web Proxy Server...
Sep 01 06:41:34 foo systemd[1]: squid.service: start operation timed
out. Terminating.
Sep 01 06:42:06 foo systemd[1]: squid.service: Failed with result 'timeout'.
Sep 01 06:42:06 foo systemd[1]: Failed to start Squid Web Proxy Server.

Then it kills squid process.

Squid cache.log shows no error w.r.t. notify and seems to start correctly.

What surprises me is that it works most of the times. Just randomly the
above timeout error occurs.

If in service override file, I change type to forking and remove
--foreground. It starts working again.

Like this: (comment lines are original lines when the issue occurs)

[Service]
#Type=notify
Type=forking
ExecStart=
#ExecStart=/usr/bin/squid -f /etc/squid/custom.conf --foreground -sYC
ExecStart=/usr/bin/squid -f /etc/squid/custom.conf -sYC

Any idea what could be issue? Is there any squid.conf setting which I
may using and stopping notify randomly?

Any help would be highly appreciated.

Thanks and regards,

Amish

_______________________________________________
squid-users mailing list
[hidden email]
http://lists.squid-cache.org/listinfo/squid-users
Reply | Threaded
Open this post in threaded view
|

Re: squid.service with Type=Notify is not always reliable (Arch Linux)

Alex Rousskov
On 9/1/20 2:32 AM, Amish wrote:

> I have frequently observed an issue with squid.service but I am not able
> to detect the real cause. As mostly it works but sometimes does not.

> What happens is squid starts correctly, but systemd does not seem to be
> getting the notification from squid that it has started.

> Sep 01 06:40:04 foo systemd[1]: Starting Squid Web Proxy Server...
> Sep 01 06:41:34 foo systemd[1]: squid.service: start operation timed out. Terminating.
> Sep 01 06:42:06 foo systemd[1]: squid.service: Failed with result 'timeout'.
> Sep 01 06:42:06 foo systemd[1]: Failed to start Squid Web Proxy Server.

When did your Squid workers print their "Accepting ... connections at
..." message(s)? Those messages will give you the approximate time of
sd_notify() calls made by Squid workers.


> Squid cache.log shows no error w.r.t. notify and seems to start correctly.

AFAICT, Squid code does not report sd_notify() failure to notify iff
that failure was due to an unset $NOTIFY_SOCKET variable. The first
sd_notify() call in Squid also unsets $NOTIFY_SOCKET variable (in the
calling worker). However, I cannot explain why that variable would be
unset during the first sd_notify() call _sometimes_.

Perhaps systemd is confused by concurrent notifications coming from
multiple workers?

Can you enable some kind of sd_notify() debugging that would show us
what the first sd_notify() call was doing and when/whether systemd
received the notification from Squid?

Alex.
_______________________________________________
squid-users mailing list
[hidden email]
http://lists.squid-cache.org/listinfo/squid-users
Reply | Threaded
Open this post in threaded view
|

Re: squid.service with Type=Notify is not always reliable (Arch Linux)

Amish

On 01/09/20 7:17 pm, Alex Rousskov wrote:

> On 9/1/20 2:32 AM, Amish wrote:
>
>> I have frequently observed an issue with squid.service but I am not able
>> to detect the real cause. As mostly it works but sometimes does not.
>> What happens is squid starts correctly, but systemd does not seem to be
>> getting the notification from squid that it has started.
>> Sep 01 06:40:04 foo systemd[1]: Starting Squid Web Proxy Server...
>> Sep 01 06:41:34 foo systemd[1]: squid.service: start operation timed out. Terminating.
>> Sep 01 06:42:06 foo systemd[1]: squid.service: Failed with result 'timeout'.
>> Sep 01 06:42:06 foo systemd[1]: Failed to start Squid Web Proxy Server.
> When did your Squid workers print their "Accepting ... connections at
> ..." message(s)? Those messages will give you the approximate time of
> sd_notify() calls made by Squid workers.

Accepting ... connections at ...  message came almost immediately (in 1
second).

Sep 01 06:40:05 foo squid[8446]: Set Current Directory to /var/cache/squid
Sep 01 06:40:05 foo squid[8446]: Starting Squid Cache version 4.13 for
x86_64-pc-linux-gnu...
Sep 01 06:40:05 foo squid[8446]: Service Name: squid
Sep 01 06:40:05 foo squid[8446]: Process ID 8446
Sep 01 06:40:05 foo squid[8446]: Process Roles: worker
Sep 01 06:40:05 foo squid[8446]: With 16384 file descriptors available
Sep 01 06:40:05 foo squid[8446]: Initializing IP Cache...
Sep 01 06:40:05 foo squid[8446]: DNS Socket created at [::], FD 5
Sep 01 06:40:05 foo squid[8446]: DNS Socket created at 0.0.0.0, FD 10
Sep 01 06:40:05 foo squid[8446]: Adding domain localdomain from
/etc/resolv.conf
Sep 01 06:40:05 foo squid[8446]: Adding nameserver 127.0.0.1 from
/etc/resolv.conf
Sep 01 06:40:05 foo squid[8446]: helperOpenServers: Starting 5/32
'security_file_certgen' processes
Sep 01 06:40:05 foo squid[8446]: helperOpenServers: Starting 1/1 'prrdr'
processes
Sep 01 06:40:05 foo squid[8446]: helperOpenServers: Starting 1/1 'prusr'
processes
Sep 01 06:40:05 foo squid[8446]: Logfile: opening log
daemon:/var/log/squid/access.log
Sep 01 06:40:05 foo squid[8446]: Logfile Daemon: opening log
/var/log/squid/access.log
Sep 01 06:40:05 foo squid[8446]: Local cache digest enabled;
rebuild/rewrite every 3600/3600 sec
Sep 01 06:40:05 foo squid[8446]: Store logging disabled
Sep 01 06:40:05 foo squid[8446]: Swap maxSize 0 + 262144 KB, estimated
20164 objects
Sep 01 06:40:05 foo squid[8446]: Target number of buckets: 1008
Sep 01 06:40:05 foo squid[8446]: Using 8192 Store buckets
Sep 01 06:40:05 foo squid[8446]: Max Mem  size: 262144 KB
Sep 01 06:40:05 foo squid[8446]: Max Swap size: 0 KB
Sep 01 06:40:05 foo squid[8446]: Using Least Load store dir selection
Sep 01 06:40:05 foo squid[8446]: Set Current Directory to /var/cache/squid
Sep 01 06:40:05 foo squid[8446]: Finished loading MIME types and icons.
Sep 01 06:40:05 foo squid[8446]: HTCP Disabled.
Sep 01 06:40:05 foo squid[8446]: Squid plugin modules loaded: 0
Sep 01 06:40:05 foo squid[8446]: Adaptation support is on
Sep 01 06:40:05 foo squid[8446]: Accepting SSL bumped HTTP Socket
connections at local=[::]:3128 remote=[::] FD 27 flags=9
Sep 01 06:40:06 foo squid[8446]: storeLateRelease: released 0 objects
Sep 01 06:40:06 foo squid[8446]: ERROR: negotiating TLS on FD 23:
error:1416F086:SSL routines:tls_process_server_certificate:certificate
verify failed (1/-1/0)
Sep 01 06:40:06 foo squid[8446]: ERROR: negotiating TLS on FD 26:
error:1416F086:SSL routines:tls_process_server_certificate:certificate
verify failed (1/-1/0)

Browser was able to access the websites after this point. Many such
lines and then

Sep 01 06:41:26 foo squid[8446]: parse URL too large (11871 bytes)
Sep 01 06:41:26 foo squid[8446]: ERROR: negotiating TLS on FD 96:
error:1416F086:SSL routines:tls_process_server_certificate:certificate
verify failed (1/-1/0)
Sep 01 06:41:27 foo squid[8446]: ERROR: negotiating TLS on FD 76:
error:1416F086:SSL routines:tls_process_server_certificate:certificate
verify failed (1/-1/0)
Sep 01 06:41:30 foo squid[8446]: Error negotiating SSL connection on FD
54: error:00000001:lib(0):func(0):reason(1) (1/-1)
Sep 01 06:41:32 foo squid[8446]: ERROR: negotiating TLS on FD 78:
error:1416F086:SSL routines:tls_process_server_certificate:certificate
verify failed (1/-1/0)
Sep 01 06:41:33 foo squid[8446]: ERROR: negotiating TLS on FD 90:
error:1416F086:SSL routines:tls_process_server_certificate:certificate
verify failed (1/-1/0)
Sep 01 06:41:33 foo squid[8446]: ERROR: negotiating TLS on FD 95:
error:1416F086:SSL routines:tls_process_server_certificate:certificate
verify failed (1/-1/0)
Sep 01 06:41:34 foo systemd[1]: squid.service: start operation timed
out. Terminating.
Sep 01 06:41:34 foo squid[8446]: Preparing for shutdown after 692 requests
Sep 01 06:41:34 foo squid[8446]: Waiting 30 seconds for active
connections to finish
Sep 01 06:41:34 foo squid[8446]: Closing HTTP(S) port [::]:3128
Sep 01 06:42:05 foo squid[8446]: Shutdown: NTLM authentication.
Sep 01 06:42:05 foo squid[8446]: Shutdown: Negotiate authentication.
Sep 01 06:42:05 foo squid[8446]: Shutdown: Digest authentication.
Sep 01 06:42:05 foo squid[8446]: Shutdown: Basic authentication.
Sep 01 06:42:06 foo squid[8446]: Shutting down...
Sep 01 06:42:06 foo squid[8446]: storeDirWriteCleanLogs: Starting...
Sep 01 06:42:06 foo squid[8446]:   Finished.  Wrote 0 entries.
Sep 01 06:42:06 foo squid[8446]:   Took 0.00 seconds (  0.00 entries/sec).
Sep 01 06:42:06 foo squid[8446]: Logfile: closing log
daemon:/var/log/squid/access.log
Sep 01 06:42:06 foo squid[8446]: Logfile Daemon: closing log
daemon:/var/log/squid/access.log
Sep 01 06:42:06 foo squid[8446]: Open FD READ/WRITE   11
security_file_certgen #1
Sep 01 06:42:06 foo squid[8446]: Open FD READ/WRITE   13
security_file_certgen #2
Sep 01 06:42:06 foo squid[8446]: Open FD UNSTARTED    15
security_file_certgen #3
Sep 01 06:42:06 foo squid[8446]: Open FD UNSTARTED    17
security_file_certgen #4
Sep 01 06:42:06 foo squid[8446]: Open FD UNSTARTED    19
security_file_certgen #5
Sep 01 06:42:06 foo squid[8446]: Open FD READ/WRITE   21 prrdr #1
Sep 01 06:42:06 foo squid[8446]: Open FD READ/WRITE   24 prusr #1
Sep 01 06:42:06 foo squid[8446]: Open FD UNSTARTED    25 IPC UNIX STREAM
Parent
Sep 01 06:42:06 foo squid[8446]: Squid Cache (Version 4.13): Exiting
normally.
Sep 01 06:42:06 foo squid[8444]: Squid Parent: squid-1 process 8446
exited with status 0
Sep 01 06:42:06 foo squid[8444]: Removing PID file (/run/squid.pid)
Sep 01 06:42:06 foo systemd[1]: squid.service: Failed with result 'timeout'.
Sep 01 06:42:06 foo systemd[1]: Failed to start Squid Web Proxy Server.

Not sure that parse URL line has anything to do with this bug as
sd_notify() was expected to happen long before that.

There are 3-4 other servers (all Arch linux and squid 4.13) which I
control, as of now they are working fine with Type=Notify.

>> Squid cache.log shows no error w.r.t. notify and seems to start correctly.
> AFAICT, Squid code does not report sd_notify() failure to notify iff
> that failure was due to an unset $NOTIFY_SOCKET variable. The first
> sd_notify() call in Squid also unsets $NOTIFY_SOCKET variable (in the
> calling worker). However, I cannot explain why that variable would be
> unset during the first sd_notify() call _sometimes_.
>
> Perhaps systemd is confused by concurrent notifications coming from
> multiple workers?
>
> Can you enable some kind of sd_notify() debugging that would show us
> what the first sd_notify() call was doing and when/whether systemd
> received the notification from Squid?

I have no idea how to do that. If you know then please do let me know.

I will try if I can. Server is production / live server, and without
knowing how to replicate this bug, its difficult for me to test. As I
can not have too much downtime for testing.

Thank you for your response,

Amish.
_______________________________________________
squid-users mailing list
[hidden email]
http://lists.squid-cache.org/listinfo/squid-users
Reply | Threaded
Open this post in threaded view
|

Re: squid.service with Type=Notify is not always reliable (Arch Linux)

Alex Rousskov
On 9/1/20 10:27 AM, Amish wrote:

> Accepting ... connections at ...  message came almost immediately (in 1
> second).
> Sep 01 06:40:05 foo squid[8446]: Accepting SSL bumped HTTP Socket
> connections at local=[::]:3128 remote=[::] FD 27 flags=9

OK, so you are not using SMP Squid and, assuming your Squid build
supports calling sd_notify(), sd_notify() was called. We need to figure
out why it had no effect. Suggested next steps:

1. Adjust Squid to print the value of $NOTIFY_SOCKET together with the
"Accepting..." line. This will confirm that the variable is set. It
should be set. This debugging can be added without fear of producing too
much info but it is unlikely to be insightful.

2. Add some kind of sd_notify() debugging that would show us what the
first sd_notify() call was doing and when/whether systemd received the
notification from Squid. I have not researched how to do that, but I am
sure it is possible. I bet there are not enough notifications happening
on your production server to cause problems, but you should practice on
a lab server first, of course.


> Not sure that parse URL line has anything to do with this bug as
> sd_notify() was expected to happen long before that.

Right. I would only worry if that line appears every time Squid starts,
indicating some kind of automated systemd-related(?) probe/test (which
fails).


HTH,

Alex.
_______________________________________________
squid-users mailing list
[hidden email]
http://lists.squid-cache.org/listinfo/squid-users
Reply | Threaded
Open this post in threaded view
|

Re: squid.service with Type=Notify is not always reliable (Arch Linux)

Amish

On 01/09/20 8:31 pm, Alex Rousskov wrote:

> On 9/1/20 10:27 AM, Amish wrote:
>
>> Accepting ... connections at ...  message came almost immediately (in 1
>> second).
>> Sep 01 06:40:05 foo squid[8446]: Accepting SSL bumped HTTP Socket
>> connections at local=[::]:3128 remote=[::] FD 27 flags=9
> OK, so you are not using SMP Squid and, assuming your Squid build
> supports calling sd_notify(), sd_notify() was called. We need to figure
> out why it had no effect. Suggested next steps:
>
> 1. Adjust Squid to print the value of $NOTIFY_SOCKET together with the
> "Accepting..." line. This will confirm that the variable is set. It
> should be set. This debugging can be added without fear of producing too
> much info but it is unlikely to be insightful.
>
> 2. Add some kind of sd_notify() debugging that would show us what the
> first sd_notify() call was doing and when/whether systemd received the
> notification from Squid. I have not researched how to do that, but I am
> sure it is possible. I bet there are not enough notifications happening
> on your production server to cause problems, but you should practice on
> a lab server first, of course.

Ok. I will try above. But here is a note from "man sd_notify" about race
condition that MAY occur.

Conversely, if an auxiliary process of the unit sends an sd_notify()
message and immediately exits, the service manager might not be able to
properly attribute the message to the unit, and thus will ignore it,
even if NotifyAccess=all is set for it.

Hence, to eliminate all race conditions involving lookup of the client's
unit and attribution of notifications to units correctly,
sd_notify_barrier() may be used. This call acts as a synchronization
point and ensures all notifications sent before this call have been
picked up by the service manager when it returns successfully. Use of
sd_notify_barrier() is needed for clients which are not invoked by the
service manager, otherwise this synchronization mechanism is unnecessary
for attribution of notifications to the unit.

Example 5. Eliminating race conditions

When the client sending the notifications is not spawned by the service
manager, it may exit too quickly and the service manager may fail to
attribute them correctly to the unit. To prevent such races, use
sd_notify_barrier() to synchronize against reception of all
notifications sent before this call is made.

            sd_notify(0, "READY=1");
                  /* set timeout to 5 seconds */
                  sd_notify_barrier(0, 5 * 1000000);


I am not sure if this is related. I have no clue about sd_notify(). But
can it be the reason?

Amish

_______________________________________________
squid-users mailing list
[hidden email]
http://lists.squid-cache.org/listinfo/squid-users
Reply | Threaded
Open this post in threaded view
|

Re: squid.service with Type=Notify is not always reliable (Arch Linux)

Alex Rousskov
On 9/2/20 3:01 AM, Amish wrote:

> On 01/09/20 8:31 pm, Alex Rousskov wrote:
>> 1. Adjust Squid to print the value of $NOTIFY_SOCKET together with the
>> "Accepting..." line. This will confirm that the variable is set. It
>> should be set. This debugging can be added without fear of producing too
>> much info but it is unlikely to be insightful.
>>
>> 2. Add some kind of sd_notify() debugging that would show us what the
>> first sd_notify() call was doing and when/whether systemd received the
>> notification from Squid. I have not researched how to do that, but I am
>> sure it is possible. I bet there are not enough notifications happening
>> on your production server to cause problems, but you should practice on
>> a lab server first, of course.

> Ok. I will try above. But here is a note from "man sd_notify" about race
> condition that MAY occur.

> Conversely, if an auxiliary process of the unit sends an sd_notify()
> message and immediately exits, the service manager might not be able to
> properly attribute the message to the unit, and thus will ignore it,
> even if NotifyAccess=all is set for it.

Since the Squid worker that calls sd_notify() keeps running under normal
conditions (and if it _does_ exit quickly, then its notification should
be indeed ignored!), the above caveat seems irrelevant to me.

Disclaimer: I do not know much about systemd and sd_notify(). I am just
reading the documentation you found and applying it to Squid.

Alex.
_______________________________________________
squid-users mailing list
[hidden email]
http://lists.squid-cache.org/listinfo/squid-users
Reply | Threaded
Open this post in threaded view
|

Re: squid.service with Type=Notify is not always reliable (Arch Linux)

Amos Jeffries
Administrator
In reply to this post by Amish
On 2/09/20 7:01 pm, Amish wrote:

>
> On 01/09/20 8:31 pm, Alex Rousskov wrote:
>> On 9/1/20 10:27 AM, Amish wrote:
>>
>>> Accepting ... connections at ...  message came almost immediately (in 1
>>> second).
>>> Sep 01 06:40:05 foo squid[8446]: Accepting SSL bumped HTTP Socket
>>> connections at local=[::]:3128 remote=[::] FD 27 flags=9
>> OK, so you are not using SMP Squid and, assuming your Squid build
>> supports calling sd_notify(), sd_notify() was called. We need to figure
>> out why it had no effect. Suggested next steps:
>>
>> 1. Adjust Squid to print the value of $NOTIFY_SOCKET together with the
>> "Accepting..." line. This will confirm that the variable is set. It
>> should be set. This debugging can be added without fear of producing too
>> much info but it is unlikely to be insightful.
>>
>> 2. Add some kind of sd_notify() debugging that would show us what the
>> first sd_notify() call was doing and when/whether systemd received the
>> notification from Squid. I have not researched how to do that, but I am
>> sure it is possible. I bet there are not enough notifications happening
>> on your production server to cause problems, but you should practice on
>> a lab server first, of course.
>
> Ok. I will try above. But here is a note from "man sd_notify" about race
> condition that MAY occur.


You are guessing now. Please don't do that until every possible check
has been done and narrowed the vast range of possibilities down.

The purpose of the checks suggested by Alex was to identify whether a
race is occuring or not.

Amos
_______________________________________________
squid-users mailing list
[hidden email]
http://lists.squid-cache.org/listinfo/squid-users
Reply | Threaded
Open this post in threaded view
|

Re: squid.service with Type=Notify is not always reliable (Arch Linux)

Amish

On 02/09/20 8:05 pm, Amos Jeffries wrote:

> On 2/09/20 7:01 pm, Amish wrote:
>> On 01/09/20 8:31 pm, Alex Rousskov wrote:
>>> On 9/1/20 10:27 AM, Amish wrote:
>>>
>>>> Accepting ... connections at ...  message came almost immediately (in 1
>>>> second).
>>>> Sep 01 06:40:05 foo squid[8446]: Accepting SSL bumped HTTP Socket
>>>> connections at local=[::]:3128 remote=[::] FD 27 flags=9
>>> OK, so you are not using SMP Squid and, assuming your Squid build
>>> supports calling sd_notify(), sd_notify() was called. We need to figure
>>> out why it had no effect. Suggested next steps:
>>>
>>> 1. Adjust Squid to print the value of $NOTIFY_SOCKET together with the
>>> "Accepting..." line. This will confirm that the variable is set. It
>>> should be set. This debugging can be added without fear of producing too
>>> much info but it is unlikely to be insightful.
>>>
>>> 2. Add some kind of sd_notify() debugging that would show us what the
>>> first sd_notify() call was doing and when/whether systemd received the
>>> notification from Squid. I have not researched how to do that, but I am
>>> sure it is possible. I bet there are not enough notifications happening
>>> on your production server to cause problems, but you should practice on
>>> a lab server first, of course.
>> Ok. I will try above. But here is a note from "man sd_notify" about race
>> condition that MAY occur.
> You are guessing now. Please don't do that until every possible check
> has been done and narrowed the vast range of possibilities down.
>
> The purpose of the checks suggested by Alex was to identify whether a
> race is occuring or not.

For initial testing, I had already added debug message for step 1. And I
tried replicate the problem at my end.

But I simply could not replicate the problem at my end. Debug output was
on expected lines. (on my own machine)

Sep 03 05:11:54 amish squid[1086]: Accepting NAT intercepted HTTP Socket
connections at local=[::]:3128 remote=[::] FD 33 flags=41
Sep 03 05:11:54 amish squid[1086]: NOTIFY_SOCKET before sd_notify() is
/run/systemd/notify
Sep 03 05:11:54 amish systemd[1]: Started Squid Web Proxy Server.
Sep 03 05:11:54 amish squid[1086]: NOTIFY_SOCKET after sd_notify() is
Sep 03 05:11:54 amish squid[1086]: Accepting SSL bumped HTTP Socket
connections at local=[::]:8080 remote=[::] FD 34 flags=9
Sep 03 05:11:54 amish squid[1086]: NOTIFY_SOCKET before sd_notify() is
Sep 03 05:11:54 amish squid[1086]: NOTIFY_SOCKET after sd_notify() is
Sep 03 05:11:54 amish squid[1086]: Accepting NAT intercepted SSL bumped
HTTPS Socket connections at local=[::]:8081 remote=[::] FD 35 flags=41
Sep 03 05:11:54 amish squid[1086]: NOTIFY_SOCKET before sd_notify() is
Sep 03 05:11:54 amish squid[1086]: NOTIFY_SOCKET after sd_notify() is

For step 2, I first need to read systemd documentation about sd_notify()
and thats when I came across mention of above race condition.

Race conditions are not easy to reproduce. That's when you look at the
code and guess where the race can be, if there can be a race. Since
sd_notify() documentation kind of insists on using sd_notify_barrier()
and that does not exist in squid code. I thought that could be missing
piece.

What is strange is I had observed that when there was this issue, no
matter how many times I restart the squid, the problem came every single
time. And then changing Type=forking worked. May be there is no race.
But something else?

But now based on your and Alex replies, I will go ahead and find a way
to add more debug to code for tracking systemd and squid notify
communication. And then hope that issue occurs in my system.

Now I will post back about this, if at all I can do something and find
something. :)

Thank you to both and regards,

Amish.

_______________________________________________
squid-users mailing list
[hidden email]
http://lists.squid-cache.org/listinfo/squid-users