BUG 3556

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

BUG 3556

Stephen Borrill
At a few installations of squid 4.12 (patched for GREASE) on NetBSD 9,
I'm seeing that occasionally one of the listening ports no longer
accepts connections (it doesn't reject them, but a connection does not
get established). The port appears random; it's not the same every time
and isn't related to ports with SSL interception. A restart of squid
fixes it.

Looking through the logs, this appears to coincide with lines such as:

2020/10/14 22:32:16 kid1| ERROR: getsockname() failed to locate local-IP
on local=[::] remote=10.0.106.147:61996 FD 25 flags=1: (22) Invalid argument
2020/10/14 22:32:16 kid1| BUG 3556: FD 25 is not an open socket.

This looks similar to Alex Rousskov's recent observations:
https://bugs.squid-cache.org/show_bug.cgi?id=3556#c15

However, we have also seen with at sites where there is no SSL
interception (the above lines are from such an installation).

Based on this, I have 4 questions;

1) Am I right that triggering BUG 3556 could lead to the described symptoms?

2) Is this a squid bug or triggered by a problem in the underlying OS?
If the latter, where to start looking?

3) What workarounds are there? Given that a restart fixes it, in some
respects it would be better for squid to quit so it can be restarted
rather than continue to run in a half-working state.

4) Related to 3), are there any other ways to detect the problem when it
is happening besides parsing logs or testing if all ports are accepting
connections? This could be used to trigger an automated restart as a
temporary workaround.

--
Stephen
_______________________________________________
squid-users mailing list
[hidden email]
http://lists.squid-cache.org/listinfo/squid-users
Reply | Threaded
Open this post in threaded view
|

Re: BUG 3556

Alex Rousskov
On 10/15/20 4:07 AM, Stephen Borrill wrote:
> At a few installations of squid 4.12 (patched for GREASE) on NetBSD
> 9, I'm seeing that occasionally one of the listening ports no longer
> accepts connections (it doesn't reject them, but a connection does
> not get established).> The port appears random; it's not the same
> every time and isn't related to ports with SSL interception. A
> restart of squid fixes it.

> Looking through the logs, this appears to coincide with lines such as:
>
> 2020/10/14 22:32:16 kid1| ERROR: getsockname() failed to locate local-IP
> on local=[::] remote=10.0.106.147:61996 FD 25 flags=1: (22) Invalid argument
> 2020/10/14 22:32:16 kid1| BUG 3556: FD 25 is not an open socket.
>
> This looks similar to Alex Rousskov's recent observations:
> https://bugs.squid-cache.org/show_bug.cgi?id=3556#c15

Please keep in mind that those "BUG 3556" messages warn us about Squid
bugs elsewhere/somewhere in Squid code. For each particular message
instance, the exact bug is unknown a priori, and several different bugs
have triggered these messages in the past. While the original bug 3556
report was for a specific bug, the log messages were not (and are not).


> However, we have also seen with at sites where there is no SSL
> interception (the above lines are from such an installation).

> 1) Am I right that triggering BUG 3556 could lead to the described symptoms?

I would rephrase this as "Failure to obtain the (intended) IP address of
an (intercepted) connection leads to BUG 3556 messages."


> 2) Is this a squid bug or triggered by a problem in the underlying OS?

Those "BUG 3556" messages indicate a Squid bug. There is no question
about that. However, the ERROR messages may indicate a Squid
bug/deficiency and/or an environment (OS configuration, etc.) problem.
In summary, you are dealing with multiple problems here. You should
focus on the ERROR messages, not "BUG 3556" messages.


> If the latter, where to start looking?

Check system log for errors. Perhaps you are exhausting some system
resource?

I would also try to map ERROR messages to client transactions in hope to
spot some common pattern behind those failed transactions.
Unfortunately, I do not know whether Squid (especially Squid v4) would
log these failed transactions.


> 3) What workarounds are there?

a) Monitor logs and automatically restart the Squid instance if needed.

b) Patch Squid to kill the affected process. Adding "assert(false);"
after the ERROR message is printed in Comm::TcpAcceptor::oldAccept()
will kill the process. Killing a single worker may or may not be enough
in SMP mode; it would be interesting and potentially useful to know
whether that is enough.

You may be able to easily test your workaround using the trick I
outlined in https://bugs.squid-cache.org/show_bug.cgi?id=3556#c15


> Given that a restart fixes it, in some
> respects it would be better for squid to quit so it can be restarted
> rather than continue to run in a half-working state.

Yes, earlier Squids were written using the "Do whatever you can to stay
up" or "Damn the torpedoes!" principle. FWIW, I am pushing for reversing
the relevant code logic to follow the "Squid instance that cannot
provide an essential service explicitly requested by the admin should
quit with an error" principle, but it will take time to achieve that ideal.


> 4) Related to 3), are there any other ways to detect the problem when it
> is happening besides parsing logs or testing if all ports are accepting
> connections? This could be used to trigger an automated restart as a
> temporary workaround.

Yes, see suggestion 3b above.


HTH,

Alex.
_______________________________________________
squid-users mailing list
[hidden email]
http://lists.squid-cache.org/listinfo/squid-users
Reply | Threaded
Open this post in threaded view
|

Re: BUG 3556

Stephen Borrill
On 15/10/2020 14:59, Alex Rousskov wrote:

> On 10/15/20 4:07 AM, Stephen Borrill wrote:
>> At a few installations of squid 4.12 (patched for GREASE) on NetBSD
>> 9, I'm seeing that occasionally one of the listening ports no longer
>> accepts connections (it doesn't reject them, but a connection does
>> not get established).> The port appears random; it's not the same
>> every time and isn't related to ports with SSL interception. A
>> restart of squid fixes it.
>
>> Looking through the logs, this appears to coincide with lines such as:
>>
>> 2020/10/14 22:32:16 kid1| ERROR: getsockname() failed to locate local-IP
>> on local=[::] remote=10.0.106.147:61996 FD 25 flags=1: (22) Invalid argument
>> 2020/10/14 22:32:16 kid1| BUG 3556: FD 25 is not an open socket.
>>
>> This looks similar to Alex Rousskov's recent observations:
>> https://bugs.squid-cache.org/show_bug.cgi?id=3556#c15
>
> Please keep in mind that those "BUG 3556" messages warn us about Squid
> bugs elsewhere/somewhere in Squid code. For each particular message
> instance, the exact bug is unknown a priori, and several different bugs
> have triggered these messages in the past. While the original bug 3556
> report was for a specific bug, the log messages were not (and are not).
>
>
>> However, we have also seen with at sites where there is no SSL
>> interception (the above lines are from such an installation).
>
>> 1) Am I right that triggering BUG 3556 could lead to the described symptoms?
>
> I would rephrase this as "Failure to obtain the (intended) IP address of
> an (intercepted) connection leads to BUG 3556 messages."
>
>
>> 2) Is this a squid bug or triggered by a problem in the underlying OS?
>
> Those "BUG 3556" messages indicate a Squid bug. There is no question
> about that. However, the ERROR messages may indicate a Squid
> bug/deficiency and/or an environment (OS configuration, etc.) problem.
> In summary, you are dealing with multiple problems here. You should
> focus on the ERROR messages, not "BUG 3556" messages.
>
>
>> If the latter, where to start looking?
>
> Check system log for errors. Perhaps you are exhausting some system
> resource?

That's always my first port of call anyway, there are no reports.

> I would also try to map ERROR messages to client transactions in hope to
> spot some common pattern behind those failed transactions.
> Unfortunately, I do not know whether Squid (especially Squid v4) would
> log these failed transactions.
>
>
>> 3) What workarounds are there?
>
> a) Monitor logs and automatically restart the Squid instance if needed.
>
> b) Patch Squid to kill the affected process. Adding "assert(false);"
> after the ERROR message is printed in Comm::TcpAcceptor::oldAccept()
> will kill the process. Killing a single worker may or may not be enough
> in SMP mode; it would be interesting and potentially useful to know
> whether that is enough.

We aren't using SMP mode.

> You may be able to easily test your workaround using the trick I
> outlined in https://bugs.squid-cache.org/show_bug.cgi?id=3556#c15

I have also been pointed to your comment here:
https://bugs.squid-cache.org/show_bug.cgi?id=5069#c1

I'm currently testing the COMM_ERROR -> NOMESSAGE patch as bug 5069
sounds identical.

>> Given that a restart fixes it, in some
>> respects it would be better for squid to quit so it can be restarted
>> rather than continue to run in a half-working state.
>
> Yes, earlier Squids were written using the "Do whatever you can to stay
> up" or "Damn the torpedoes!" principle. FWIW, I am pushing for reversing
> the relevant code logic to follow the "Squid instance that cannot
> provide an essential service explicitly requested by the admin should
> quit with an error" principle, but it will take time to achieve that ideal.
>
>
>> 4) Related to 3), are there any other ways to detect the problem when it
>> is happening besides parsing logs or testing if all ports are accepting
>> connections? This could be used to trigger an automated restart as a
>> temporary workaround.
>
> Yes, see suggestion 3b above.
>
>
> HTH,
>
> Alex.
>

_______________________________________________
squid-users mailing list
[hidden email]
http://lists.squid-cache.org/listinfo/squid-users
Reply | Threaded
Open this post in threaded view
|

Re: BUG 3556

Alex Rousskov
On 10/15/20 10:06 AM, Stephen Borrill wrote:
> I have also been pointed to your comment here:
> https://bugs.squid-cache.org/show_bug.cgi?id=5069#c1

Aha, I thought there was another related bug but did not check. Glad you
found it!


> I'm currently testing the COMM_ERROR -> NOMESSAGE patch as bug 5069
> sounds identical.

Please update 5069 bug report if that code change works for you too. It
will give whoever volunteers the final solution more confidence that
they are on the right path. The key here is whether the error has a
transient nature and does not affect the listening socket ability to
accept other transactions.

Alex.
_______________________________________________
squid-users mailing list
[hidden email]
http://lists.squid-cache.org/listinfo/squid-users
Reply | Threaded
Open this post in threaded view
|

Re: BUG 3556

Stephen Borrill
On 15/10/2020 16:02, Alex Rousskov wrote:

> On 10/15/20 10:06 AM, Stephen Borrill wrote:
>> I have also been pointed to your comment here:
>> https://bugs.squid-cache.org/show_bug.cgi?id=5069#c1
>
> Aha, I thought there was another related bug but did not check. Glad you
> found it!
>
>> I'm currently testing the COMM_ERROR -> NOMESSAGE patch as bug 5069
>> sounds identical.
>
> Please update 5069 bug report if that code change works for you too.

Will do.

> It
> will give whoever volunteers the final solution more confidence that
> they are on the right path. The key here is whether the error has a
> transient nature and does not affect the listening socket ability to
> accept other transactions.

Yes, that's what I'm curious to know too. The problem seems to occur
once every few days and because the error will still be logged, it
should be easy to test whether operation continues afterwards.

--
Stephen


_______________________________________________
squid-users mailing list
[hidden email]
http://lists.squid-cache.org/listinfo/squid-users