Unbound is always unhealthy - watchdog nagios check_dns issue?

irotsoma · Apr 17, 2024

I’ve been running the nightly version for a while and have one issue I haven’t been able to resolve. The watchdog always reports the unbound container as unhealthy even though it seems to be fine: running the healthcheck.sh script inside it always succeeds, and the container shows as healthy on startup:

Network mailcowdockerized_mailcow-network Created 0.1s
Container mailcowdockerized-sogo-mailcow-1 Started 0.0s
Container mailcowdockerized-unbound-mailcow-1 Healthy 0.0s
Container mailcowdockerized-netfilter-mailcow-1 Started 0.0s
Container mailcowdockerized-olefy-mailcow-1 Started 0.0s
Container mailcowdockerized-dockerapi-mailcow-1 Started 0.0s
Container mailcowdockerized-memcached-mailcow-1 Started 0.0s
Container mailcowdockerized-clamd-mailcow-1 Started 0.0s
Container mailcowdockerized-redis-mailcow-1 Started 0.0s
Container mailcowdockerized-solr-mailcow-1 Started 0.0s
Container mailcowdockerized-mysql-mailcow-1 Started 0.0s
Container mailcowdockerized-dovecot-mailcow-1 Started 0.0s
Container mailcowdockerized-postfix-mailcow-1 Started 0.0s
Container mailcowdockerized-php-fpm-mailcow-1 Started 0.0s
Container mailcowdockerized-nginx-mailcow-1 Started 0.0s
Container mailcowdockerized-ofelia-mailcow-1 Started 0.0s
Container mailcowdockerized-rspamd-mailcow-1 Started 0.0s
Container mailcowdockerized-acme-mailcow-1 Started 0.0s
Container mailcowdockerized-watchdog-mailcow-1 Started 0.0s

But I still see this error in the logs:

watchdog-mailcow-1 | Tue Apr 16 21:52:27 PDT 2024 Unbound health level: 0% (0/5), health trend: -139
watchdog-mailcow-1 | Tue Apr 16 21:52:28 PDT 2024 Unbound hit error limit

After reading many articles, I finally decided that nothing was actually wrong with the unbound container and moved on to the watchdog container. I see that it is running this check in the unbound_checks() function inside the watchdog.sh script:

/usr/lib/nagios/plugins/check_dns -s ${host_ip} -H stackoverflow.com 2>> /tmp/unbound-mailcow 1>&2; err_count=$(( ${err_count} + $? ))
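A minimal sketch of what that line does with the plugin's exit status. `fake_check` is a hypothetical stand-in for check_dns (it simply kills itself with SIGSEGV); only the `err_count` accumulation pattern is taken from the line quoted above. A process killed by a signal is reported by the shell as 128 plus the signal number, so a segfault (signal 11 on Linux) adds 139 to the error count in one go, which would be consistent with the "health trend: -139" in the log.

```shell
#!/bin/sh
# Hypothetical stand-in for check_dns: a child shell that kills itself with
# SIGSEGV, so it terminates the same way the crashing plugin does.
fake_check() {
  sh -c 'kill -s SEGV $$' 2>/dev/null
}

err_count=0

# Same accumulation pattern as the watchdog line quoted above:
# the raw exit status of the check is added to err_count.
fake_check; err_count=$(( err_count + $? ))

# Killed by a signal => status is 128 + signal number (SIGSEGV = 11 on Linux).
echo "err_count=${err_count}"
```

With a plugin that merely fails, the status would be a small Nagios code (1 to 3), so a single run adding more than 128 is a hint that the check itself crashed.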

I tried running the check_dns tool with various options and always get Segmentation fault (core dumped). However, dig works perfectly fine.

ad4040ce952b:/tmp# /usr/lib/nagios/plugins/check_dns -H stackoverflow.com
Segmentation fault (core dumped)
ad4040ce952b:/tmp# /usr/lib/nagios/plugins/check_dns -s 127.0.0.11 -H stackoverflow.com
Segmentation fault (core dumped)
ad4040ce952b:/tmp# dig stackoverflow.com

; <<>> DiG 9.18.19 <<>> stackoverflow.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 47554
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;stackoverflow.com. IN A

;; ANSWER SECTION:
stackoverflow.com. 300 IN A 104.18.32.7
stackoverflow.com. 300 IN A 172.64.155.249

;; Query time: 76 msec
;; SERVER: 127.0.0.11#53(127.0.0.11) (UDP)
;; WHEN: Tue Apr 16 22:01:11 PDT 2024
;; MSG SIZE rcvd: 78

So I’m guessing something is wrong with that tool. I’m not familiar with nagios tools and could use some help figuring it out. I’m running this on a VPS on Rocky Linux 9.3 with no additional firewall or anything else running on this server.

DocFraggle · Apr 17, 2024

It’s working fine with 2024-04

f87b47a44e88:/# /usr/lib/nagios/plugins/check_dns -H stackoverflow.com
DNS OK: 0.024 seconds response time. stackoverflow.com returns 104.18.32.7,172.64.155.249|time=0.024036s;;;0.000000

What’s the output if you use dig from inside the container? Should look like this:

f87b47a44e88:/# dig @127.0.0.11 stackoverflow.com

; <<>> DiG 9.18.19 <<>> @127.0.0.11 stackoverflow.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 47677
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;stackoverflow.com.             IN      A

;; ANSWER SECTION:
stackoverflow.com.      241     IN      A       104.18.32.7
stackoverflow.com.      241     IN      A       172.64.155.249

;; Query time: 1 msec
;; SERVER: 127.0.0.11#53(127.0.0.11) (UDP)
;; WHEN: Wed Apr 17 09:55:57 CEST 2024
;; MSG SIZE  rcvd: 78

irotsoma · Apr 18, 2024

DocFraggle wrote: “What’s the output if you use dig from inside the container?”

Yes, the output I posted in the original message is from within the watchdog container. dig works fine; check_dns segfaults.

[root@delta watchdog]# docker compose exec watchdog-mailcow /bin/bash

fc042ba8c1d5:/# dig @127.0.0.11 stackoverflow.com

; <<>> DiG 9.18.19 <<>> @127.0.0.11 stackoverflow.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 11654
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;stackoverflow.com. IN A

;; ANSWER SECTION:
stackoverflow.com. 300 IN A 172.64.155.249
stackoverflow.com. 300 IN A 104.18.32.7

;; Query time: 28 msec
;; SERVER: 127.0.0.11#53(127.0.0.11) (UDP)
;; WHEN: Wed Apr 17 16:44:40 PDT 2024
;; MSG SIZE rcvd: 78

fc042ba8c1d5:/# /usr/lib/nagios/plugins/check_dns -H stackoverflow.com
Segmentation fault (core dumped)

I’m testing the Keycloak functionality, which is why I’m using the nightly rather than a release version. I don’t see any differences between the nightly and 2024-04 branches with regard to the watchdog image or scripts. The nagios tool doesn’t seem to write a log or offer any debug or verbose output option, so I’m at a loss as to how to troubleshoot it.
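Even without plugin-level debug output, the exit status alone separates "the check ran and reported a problem" from "the check crashed". The following is only a sketch: `describe_status` is a hypothetical helper, not part of mailcow or nagios-plugins. It relies on the standard Nagios plugin return codes (0 to 3) and on the shell convention that a status above 128 means death by signal.

```shell
#!/bin/sh
# Hypothetical helper: turn an exit status into a human-readable verdict.
describe_status() {
  status=$1
  case "$status" in
    0) echo "OK" ;;
    1) echo "WARNING" ;;
    2) echo "CRITICAL" ;;
    3) echo "UNKNOWN" ;;
    *)
      if [ "$status" -gt 128 ]; then
        # 128 + N means the process was killed by signal N
        # (N = 11 is SIGSEGV on Linux).
        echo "killed by signal $(( status - 128 ))"
      else
        echo "unexpected exit status ${status}"
      fi
      ;;
  esac
}

describe_status 139
describe_status 2
```

Inside the container it could be used as `/usr/lib/nagios/plugins/check_dns -H stackoverflow.com; describe_status $?`.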

DocFraggle · Apr 18, 2024

Are you running an ARM server?

irotsoma · Apr 18, 2024

DocFraggle

No it’s x86

# inxi -Sz
System: Kernel: 5.14.0-362.24.1.el9_3.0.1.x86_64 arch: x86_64 bits: 64 Console: pty pts/0 Distro: Rocky Linux 9.3 (Blue Onyx)

DocFraggle · Apr 18, 2024

Hmm… is there anything special about your setup of mailcow?

There is already an older GitHub issue: mailcow/mailcow-dockerized#5033

irotsoma · Apr 18, 2024

DocFraggle

Yeah, looks like the same issue. See also mailcow/mailcow-dockerized/issues/5121. It seems everyone there gave up because the nagios plugin offers no way to debug it. I’ve commented out the check in the script for now so it doesn’t keep restarting the unbound container, but I don’t like to give up. Lol.
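Commenting out the check removes DNS monitoring altogether. One possible stopgap, offered purely as an assumption and not something proposed in the thread or in mailcow itself, would be to base the check on dig, which works in this container. `has_ipv4_answer` is a hypothetical helper that succeeds when its input (for example the output of `dig +short`) contains at least one IPv4 address; the demonstration below runs offline on the A records shown earlier.

```shell
#!/bin/sh
# Hypothetical helper: succeed if stdin has at least one line that looks
# like an IPv4 address (a loose pattern check, not full validation).
has_ipv4_answer() {
  grep -Eq '^[0-9]{1,3}(\.[0-9]{1,3}){3}$'
}

# Offline demonstration using the two A records from the dig output above.
if printf '104.18.32.7\n172.64.155.249\n' | has_ipv4_answer; then
  dig_check=ok
else
  dig_check=fail
fi
echo "dig_check=${dig_check}"
```

A live variant might look like `dig +short @"${host_ip}" stackoverflow.com A | has_ipv4_answer; err_count=$(( err_count + $? ))`, which adds 1 to the error count when no address comes back.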

Nothing really special about the setup except that I’m using the Keycloak functionality. The server itself has nothing significant installed, not even firewalls. It’s a pretty ordinary KVM based VPS. Clean install of docker and mailcow. I do have IPv6 in use, but I see someone else mentioned they had the issue with it disabled. No real customizations added yet.

DocFraggle · Apr 18, 2024

Is -V throwing the segfault as well?

f87b47a44e88:/# /usr/lib/nagios/plugins/check_dns -V
check_dns v (nagios-plugins 2.4.5)

The version of the installed APK would be interesting, too:

f87b47a44e88:/# apk info nagios-plugins-dns
nagios-plugins-dns-2.4.5-r2 description:
Nagios plugin check_dns

nagios-plugins-dns-2.4.5-r2 webpage:
https://nagios-plugins.org/

nagios-plugins-dns-2.4.5-r2 installed size:
80 KiB

irotsoma · Apr 18, 2024

DocFraggle

It does respond fine to the -h and -V options. And the version is the same as you posted. I also verified that the Dockerfile is the same between the latest nightly branch commit and the 2024-04 tag. So the actual docker images should be basically identical.

I also ran nslookup, since I read that check_dns just calls that tool, and it returns the expected values.

fc042ba8c1d5:/# /usr/lib/nagios/plugins/check_dns -V
check_dns v (nagios-plugins 2.4.5)

fc042ba8c1d5:/# apk info nagios-plugins-dns
nagios-plugins-dns-2.4.5-r2 description:
Nagios plugin check_dns

nagios-plugins-dns-2.4.5-r2 webpage:
https://nagios-plugins.org/

nagios-plugins-dns-2.4.5-r2 installed size:
80 KiB

fc042ba8c1d5:/# nslookup stackexchange.com
Server: 127.0.0.11
Address: 127.0.0.11#53

Non-authoritative answer:
Name: stackexchange.com
Address: 172.64.144.30
Name: stackexchange.com
Address: 104.18.43.226
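If check_dns does wrap nslookup as described, the job it has to do with this output is roughly "collect the addresses in the answer section". A rough sketch of that parsing step, run on the nslookup output from this post; `parse_answers` is a hypothetical helper and is not how the plugin is actually implemented.

```shell
#!/bin/sh
# nslookup output copied from the post above.
sample_output='Server: 127.0.0.11
Address: 127.0.0.11#53

Non-authoritative answer:
Name: stackexchange.com
Address: 172.64.144.30
Name: stackexchange.com
Address: 104.18.43.226'

# Hypothetical helper: print only the answer addresses. The resolver's own
# "Address:" line comes before the first "Name:" line, so it is skipped.
parse_answers() {
  awk '/^Name:/ { in_answer = 1; next }
       in_answer && /^Address:/ { print $2 }'
}

# Join the addresses with commas, similar to the plugin's "returns ..." text.
answers=$(printf '%s\n' "$sample_output" | parse_answers | paste -sd, -)
echo "$answers"
```

Against a live resolver the same helper could be fed with `nslookup stackexchange.com 127.0.0.11 | parse_answers`.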

DocFraggle · Apr 18, 2024

If you use the Google DNS, does it segfault as well?

/usr/lib/nagios/plugins/check_dns -s 8.8.8.8 -H stackoverflow.com

irotsoma · Apr 18, 2024

DocFraggle
Yes, I tried various combinations of servers and hosts. Everything I’ve tried causes the segfault so far.

DocFraggle · Apr 18, 2024

Last thing I can think of is that the installed package is corrupt… what’s your md5sum?

f87b47a44e88:/# md5sum /usr/lib/nagios/plugins/check_dns
42aaa5fc36dcecda78f83d1048e7861b  /usr/lib/nagios/plugins/check_dns

Maybe try reinstalling it inside the watchdog container with the apk command.
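The checksum comparison can be scripted so both sides run exactly the same test. This is only a sketch: `md5_matches` is a hypothetical helper, and the demonstration uses a temporary file with known content so it runs anywhere.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the file's MD5 digest equals the expected one.
md5_matches() {
  file=$1
  expected=$2
  actual=$(md5sum "$file" | awk '{ print $1 }')
  [ "$actual" = "$expected" ]
}

# Demonstration on a temporary file containing "hello" plus a newline.
tmpfile=$(mktemp)
printf 'hello\n' > "$tmpfile"
if md5_matches "$tmpfile" b1946ac92492d2347c6235b4d2611184; then
  result=match
else
  result=mismatch
fi
rm -f "$tmpfile"
echo "$result"
```

Inside the watchdog container the call would be `md5_matches /usr/lib/nagios/plugins/check_dns 42aaa5fc36dcecda78f83d1048e7861b`, using the digest posted above as the reference.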

irotsoma · Apr 19, 2024

DocFraggle

Yep md5sum is identical, too.

I also tried deleting all the nagios plugins and reinstalling them, but that didn’t help. So I’m guessing there’s some kind of conflict or bug involving docker or the host OS rather than anything wrong with the container itself. Too bad it can’t be debugged.

DocFraggle · Apr 19, 2024

What’s your current setup? Do you have selinux in place?

irotsoma · Apr 20, 2024

DocFraggle

It’s a totally fresh install of Rocky Linux 9.3. SELinux is enabled but in permissive mode. Just for the heck of it I disabled it completely and rebooted, but nothing changed.