Change of server hardware (?) resulted in massive increase of crashes

Ralf Hildebrandt
I've been running squid-5 and squid-6 (both release and HEAD) on my
old proxy farm. Four machines, ubuntu bionic. All was well, except for
the occasional crash.

I filed bug reports, and the preliminary patches helped with the
crashes:

https://bugs.squid-cache.org/show_bug.cgi?id=5055
https://bugs.squid-cache.org/show_bug.cgi?id=5056

Recently, I set up a new cluster of four up-to-date machines with new
hardware (still ubuntu, focal though, still x64, assloads of CPUs and
memory) and tested that cluster with the users from our department.
All went well, squid never once crashed.

Then we took the old cluster offline, and let the new cluster take over.

Now I'm getting (with the same hand-built squid versions!) a LOT
(about once every 15 minutes) of crashes like this one:

2020/09/22 09:34:07| FATAL: check failed: opening()
    exception location: tunnel.cc(1305) noteDestinationsEnd
    current master transaction: master359979

My infrastructure generates backtraces upon crash, but in this case I'm
not getting any. Which is odd, given that I start squid in gdb with
"/usr/sbin/squid -sYNC".

--
Ralf Hildebrandt
Charité - Universitätsmedizin Berlin
Geschäftsbereich IT | Abteilung Netzwerk

Campus Benjamin Franklin (CBF)
Haus I | 1. OG | Raum 105
Hindenburgdamm 30 | D-12203 Berlin

Tel. +49 30 450 570 155
[hidden email]
https://www.charite.de
_______________________________________________
squid-users mailing list
[hidden email]
http://lists.squid-cache.org/listinfo/squid-users

Re: Change of server hardware (?) resulted in massive increase of crashes

Ralf Hildebrandt
* Ralf Hildebrandt <[hidden email]>:

> 2020/09/22 09:34:07| FATAL: check failed: opening()
>     exception location: tunnel.cc(1305) noteDestinationsEnd
>     current master transaction: master359979

I had to go back as far as 5.0.2 to exclude master commit 25b0ce4; now
it's stable (running for an hour without a crash).


Re: Change of server hardware (?) resulted in massive increase of crashes

Alex Rousskov
In reply to this post by Ralf Hildebrandt
On 9/22/20 3:47 AM, Ralf Hildebrandt wrote:
> I'm getting (with the same hand-built squid versions!) a LOT
> (about once every 15 minutes) of crashes like this one:
>
> 2020/09/22 09:34:07| FATAL: check failed: opening()
>     exception location: tunnel.cc(1305) noteDestinationsEnd
>     current master transaction: master359979

This is still bug #5055. I hope we will post an official pull request
properly addressing it soon.

In my environment, Squid v5 is hardly usable without those fixes but, as
you know, YMMV. Your OS upgrade could trigger different DNS resolution
timings, the new cluster may have a different IPv6 connectivity profile,
or there can be similar minor/innocent changes that result in slightly
different Squid state and more exceptions. I would not spend time trying
to pinpoint the exact trigger.

I updated bug #5055 with a patch that covers the tunneling case:
https://bugs.squid-cache.org/show_bug.cgi?id=5055#c5


> My infrastructure generates backtraces upon crash, but in this case I'm
> not getting any.

Unlike "assertion failed" FATAL messages, the "check failed" FATAL
messages are the result of an unhandled (for the lack of a better word)
exception. Today, such exceptions do not generate core dumps because the
low-level stack is pretty much lost by the time the exception is caught
by the high-level code. Unhandled exception handling (yes, I know) may
change in the future, but that is a separate issue.
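To illustrate the difference, here is a minimal C++ sketch. It is not Squid code: Frame, checkOpening(), noteDestinationsEndSketch() and highLevel() are hypothetical names made up for this example. It only shows the general C++ behavior that, by the time a high-level catch block runs, the low-level frames have already been unwound, whereas a failed assert() aborts while those frames still exist.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <string>

// Counts simulated low-level stack frames that are still alive (RAII).
struct Frame {
    static int live;
    Frame() { ++live; }
    ~Frame() { --live; }
};
int Frame::live = 0;

// Assertion style: a failed assert() aborts right here, while the
// failing frames still exist, so a core dump can show them.
void assertOpening(bool opening) {
    Frame f;
    assert(opening);
}

// Exception style: hypothetical stand-in for a "check failed" test.
int lastFramesAtThrow = 0;
void checkOpening(bool opening) {
    Frame f;
    if (!opening) {
        lastFramesAtThrow = Frame::live; // frames alive at the point of failure
        throw std::runtime_error("check failed: opening()");
    }
}

// Hypothetical mid-level caller, named after the log line for readability.
void noteDestinationsEndSketch(bool opening) {
    Frame f;
    checkOpening(opening);
}

struct Report {
    bool failed;
    int framesAtThrow;
    int framesAtCatch;
    std::string what;
};

// High-level code that catches the exception far away from its origin.
Report highLevel(bool opening) {
    lastFramesAtThrow = 0;
    try {
        noteDestinationsEndSketch(opening);
    } catch (const std::exception &e) {
        // The two low-level frames were destroyed during unwinding,
        // before this handler started running.
        return {true, lastFramesAtThrow, Frame::live, e.what()};
    }
    return {false, 0, Frame::live, ""};
}
```

Calling highLevel(false) reports two live frames at the throw point and none at the catch point, which is why a dump taken from the handler would not show where the check failed.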


HTH,

Alex.

Re: [ext] Re: Change of server hardware (?) resulted in massive increase of crashes

Ralf Hildebrandt
> > 2020/09/22 09:34:07| FATAL: check failed: opening()
> >     exception location: tunnel.cc(1305) noteDestinationsEnd
> >     current master transaction: master359979
>
> This is still bug #5055. I hope we will post an official pull request
> properly addressing it soon.

I can easily test it here :)

> In my environment, Squid v5 is hardly usable without those fixes but, as
> you know, YMMV.

Yes.

> Your OS upgrade could trigger different DNS resolution
> timings, the new cluster may have a different IPv6 connectivity profile,
> or there can be similar minor/innocent changes that result in slightly
> different Squid state and more exceptions. I would not spend time trying
> to pinpoint the exact trigger.
>
> I updated bug #5055 with a patch that covers the tunneling case:
> https://bugs.squid-cache.org/show_bug.cgi?id=5055#c5

Thanks, I'll try that once the dust has settled.



Re: [ext] Re: Change of server hardware (?) resulted in massive increase of crashes

Ralf Hildebrandt
In reply to this post by Alex Rousskov
> This is still bug #5055. I hope we will post an official pull request
> properly addressing it soon.
>
> In my environment, Squid v5 is hardly usable without those fixes but, as
> you know, YMMV. Your OS upgrade could trigger different DNS resolution
> timings, the new cluster may have a different IPv6 connectivity profile,
> or there can be similar minor/innocent changes that result in slightly
> different Squid state and more exceptions. I would not spend time trying
> to pinpoint the exact trigger.
>
> I updated bug #5055 with a patch that covers the tunneling case:
> https://bugs.squid-cache.org/show_bug.cgi?id=5055#c5

I applied your renewed patch to squid-6.0.0-20200811-r983fab6e9 -- and
so far the resulting binary seems to be much more stable than with the
previous patch to #5055.

We're currently giving that instance 10% of the connections for
testing (in contrast to the usual 25%).

5.0.2 (running on the other 3 nodes) gives us about 21.7h average
uptime with a median uptime of 28.6h.
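An average below the median is what you would see if most runs last more than a day and a few short-lived ones pull the mean down. A minimal C++ sketch of that arithmetic follows. The sample values in sampleUptimes() are made up for illustration and are not the real measurements; meanHours() and medianHours() are hypothetical helper names.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <stdexcept>
#include <vector>

// Arithmetic mean of uptimes in hours.
double meanHours(const std::vector<double> &uptimes) {
    if (uptimes.empty())
        throw std::invalid_argument("no uptimes given");
    return std::accumulate(uptimes.begin(), uptimes.end(), 0.0) / uptimes.size();
}

// Median of uptimes in hours; takes a copy so the caller's order is kept.
double medianHours(std::vector<double> uptimes) {
    if (uptimes.empty())
        throw std::invalid_argument("no uptimes given");
    std::sort(uptimes.begin(), uptimes.end());
    const std::size_t n = uptimes.size();
    return n % 2 ? uptimes[n / 2]
                 : (uptimes[n / 2 - 1] + uptimes[n / 2]) / 2.0;
}

// Hypothetical sample: three long runs and two early crashes.
std::vector<double> sampleUptimes() {
    return {2.0, 28.0, 30.0, 31.0, 4.0};
}
```

With this made-up sample the two short runs drag the mean well below the median, the same shape as the 21.7h versus 28.6h figures.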
