Unbound is always unhealthy - watchdog nagios check_dns issue?

I’ve been running the nightly version for a while and have one issue I haven’t been able to resolve. The watchdog always reports the unbound container as unhealthy, even though it seems to be fine: running the healthcheck.sh script manually always succeeds, and the container shows as healthy on startup:

✔ Network mailcowdockerized_mailcow-network Created 0.1s
✔ Container mailcowdockerized-sogo-mailcow-1 Started 0.0s
✔ Container mailcowdockerized-unbound-mailcow-1 Healthy 0.0s
✔ Container mailcowdockerized-netfilter-mailcow-1 Started 0.0s
✔ Container mailcowdockerized-olefy-mailcow-1 Started 0.0s
✔ Container mailcowdockerized-dockerapi-mailcow-1 Started 0.0s
✔ Container mailcowdockerized-memcached-mailcow-1 Started 0.0s
✔ Container mailcowdockerized-clamd-mailcow-1 Started 0.0s
✔ Container mailcowdockerized-redis-mailcow-1 Started 0.0s
✔ Container mailcowdockerized-solr-mailcow-1 Started 0.0s
✔ Container mailcowdockerized-mysql-mailcow-1 Started 0.0s
✔ Container mailcowdockerized-dovecot-mailcow-1 Started 0.0s
✔ Container mailcowdockerized-postfix-mailcow-1 Started 0.0s
✔ Container mailcowdockerized-php-fpm-mailcow-1 Started 0.0s
✔ Container mailcowdockerized-nginx-mailcow-1 Started 0.0s
✔ Container mailcowdockerized-ofelia-mailcow-1 Started 0.0s
✔ Container mailcowdockerized-rspamd-mailcow-1 Started 0.0s
✔ Container mailcowdockerized-acme-mailcow-1 Started 0.0s
✔ Container mailcowdockerized-watchdog-mailcow-1 Started 0.0s

But I still see this error in the logs:

watchdog-mailcow-1 | Tue Apr 16 21:52:27 PDT 2024 Unbound health level: 0% (0/5), health trend: -139
watchdog-mailcow-1 | Tue Apr 16 21:52:28 PDT 2024 Unbound hit error limit

After reading many articles, I finally decided that nothing was actually wrong with the unbound container and moved on to the watchdog container. I see that it is running this check in the unbound_checks() function inside the watchdog.sh script:

/usr/lib/nagios/plugins/check_dns -s ${host_ip} -H stackoverflow.com 2>> /tmp/unbound-mailcow 1>&2; err_count=$(( ${err_count} + $? ))

I tried running the check_dns tool with various options and always get Segmentation fault (core dumped). However, dig works perfectly fine:

ad4040ce952b:/tmp# /usr/lib/nagios/plugins/check_dns -H stackoverflow.com
Segmentation fault (core dumped)
ad4040ce952b:/tmp# /usr/lib/nagios/plugins/check_dns -s 127.0.0.11 -H stackoverflow.com
Segmentation fault (core dumped)
ad4040ce952b:/tmp# dig stackoverflow.com

; <<>> DiG 9.18.19 <<>> stackoverflow.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 47554
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;stackoverflow.com. IN A

;; ANSWER SECTION:
stackoverflow.com. 300 IN A 104.18.32.7
stackoverflow.com. 300 IN A 172.64.155.249

;; Query time: 76 msec
;; SERVER: 127.0.0.11#53(127.0.0.11) (UDP)
;; WHEN: Tue Apr 16 22:01:11 PDT 2024
;; MSG SIZE rcvd: 78

So I’m guessing something is wrong with that tool. I’m not familiar with nagios tools and could use some help figuring it out. I’m running this on a VPS on Rocky Linux 9.3 with no additional firewall or anything else running on this server.
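A side note on the log above: the health trend of -139 is consistent with the segfault. A shell reports a process killed by a signal as 128 plus the signal number, and SIGSEGV is 11 on Linux, so each crashing check_dns call would add 139 to err_count in one step instead of 1. Below is a minimal sketch of that arithmetic. It assumes the trend is derived from the change in err_count, which I have not verified against watchdog.sh, and the `sh -c` child is only a stand-in for check_dns:

```shell
#!/bin/sh
# Stand-in for check_dns: a child shell that kills itself with SIGSEGV.
err_count=0
status=0
sh -c 'kill -s SEGV $$' 2>/dev/null || status=$?

# Same accumulation pattern as the watchdog line: add the raw exit status.
err_count=$(( err_count + status ))
echo "exit status: ${status}, err_count: ${err_count}"
```

On Linux this should report an exit status of 139, which would explain why a single failed check is enough to hit the error limit.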

It’s working fine with 2024-04

f87b47a44e88:/# /usr/lib/nagios/plugins/check_dns -H stackoverflow.com
DNS OK: 0.024 seconds response time. stackoverflow.com returns 104.18.32.7,172.64.155.249|time=0.024036s;;;0.000000

What’s the output if you use dig from inside the container? Should look like this:

f87b47a44e88:/# dig @127.0.0.11 stackoverflow.com

; <<>> DiG 9.18.19 <<>> @127.0.0.11 stackoverflow.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 47677
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;stackoverflow.com.             IN      A

;; ANSWER SECTION:
stackoverflow.com.      241     IN      A       104.18.32.7
stackoverflow.com.      241     IN      A       172.64.155.249

;; Query time: 1 msec
;; SERVER: 127.0.0.11#53(127.0.0.11) (UDP)
;; WHEN: Wed Apr 17 09:55:57 CEST 2024
;; MSG SIZE  rcvd: 78


    DocFraggle What’s the output if you use dig from inside the container? Should look like this:

    Yes, the output I posted in the original message is from within the watchdog container. dig works fine; check_dns segfaults.

    [root@delta watchdog]# docker compose exec watchdog-mailcow /bin/bash

    fc042ba8c1d5:/# dig @127.0.0.11 stackoverflow.com

    ; <<>> DiG 9.18.19 <<>> @127.0.0.11 stackoverflow.com
    ; (1 server found)
    ;; global options: +cmd
    ;; Got answer:
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 11654
    ;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

    ;; OPT PSEUDOSECTION:
    ; EDNS: version: 0, flags:; udp: 1232
    ;; QUESTION SECTION:
    ;stackoverflow.com. IN A

    ;; ANSWER SECTION:
    stackoverflow.com. 300 IN A 172.64.155.249
    stackoverflow.com. 300 IN A 104.18.32.7

    ;; Query time: 28 msec
    ;; SERVER: 127.0.0.11#53(127.0.0.11) (UDP)
    ;; WHEN: Wed Apr 17 16:44:40 PDT 2024
    ;; MSG SIZE rcvd: 78

    fc042ba8c1d5:/# /usr/lib/nagios/plugins/check_dns -H stackoverflow.com
    Segmentation fault (core dumped)

    I’m testing the Keycloak functionality, which is why I’m using the nightly rather than the release version. I’m not seeing any differences between the nightly and 2024-04 branches with regard to the watchdog image or scripts. And the nagios tool doesn’t seem to create a log or have any debug or verbose output option, so I’m at a loss for how to troubleshoot it.

    Are you running an ARM server?

      DocFraggle

      No it’s x86

      # inxi -Sz
      System:
        Kernel: 5.14.0-362.24.1.el9_3.0.1.x86_64 arch: x86_64 bits: 64
        Console: pty pts/0 Distro: Rocky Linux 9.3 (Blue Onyx)

      DocFraggle

      Yeah, looks like the same issue. See also mailcow/mailcow-dockerized/issues/5121. It seems everyone there gave up since nagios has no way to debug it. I did comment out the check in the script for now so it’s not constantly restarting the unbound container, but I don’t like to give up. Lol.

      Nothing really special about the setup except that I’m using the Keycloak functionality. The server itself has nothing significant installed, not even firewalls. It’s a pretty ordinary KVM based VPS. Clean install of docker and mailcow. I do have IPv6 in use, but I see someone else mentioned they had the issue with it disabled. No real customizations added yet.
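Instead of commenting the check out entirely, a dig-based stand-in might keep some DNS monitoring in place, since dig works in this container. This is only a sketch and not something from the mailcow repository: `dig_check` and the `DIG` override are hypothetical names, and the return codes (1 for an empty answer, 2 for a dig failure) are my own choice.

```shell
#!/bin/sh
# Hypothetical replacement check: succeed only if dig returns at least one
# answer line. DIG can be overridden for testing; it defaults to dig.
dig_check() {
  server=$1
  name=$2
  # dig exits non-zero when no server could be reached (e.g. a timeout).
  answer=$(${DIG:-dig} +short +time=2 +tries=1 "@${server}" "${name}" A 2>/dev/null) || return 2
  # dig exits 0 even for empty answers, so also require some output.
  [ -n "${answer}" ]
}

# Possible use in place of the check_dns line (host_ip as in watchdog.sh):
# dig_check "${host_ip}" stackoverflow.com; err_count=$(( err_count + $? ))
```

Because the function returns at most 2, a failure would raise err_count gradually rather than by 139 at once.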

      Is -V throwing the segfault as well?

      f87b47a44e88:/# /usr/lib/nagios/plugins/check_dns -V
      check_dns v (nagios-plugins 2.4.5)

      The version of the installed APK would be interesting, too:

      f87b47a44e88:/# apk info nagios-plugins-dns
      nagios-plugins-dns-2.4.5-r2 description:
      Nagios plugin check_dns
      
      nagios-plugins-dns-2.4.5-r2 webpage:
      https://nagios-plugins.org/
      
      nagios-plugins-dns-2.4.5-r2 installed size:
      80 KiB

        DocFraggle

        It does respond fine to the -h and -V options. And the version is the same as you posted. I also verified that the Dockerfile is the same between the latest nightly branch commit and the 2024-04 tag. So the actual docker images should be basically identical.

        I also ran nslookup, since I read that check_dns just calls that tool, and it returns the expected values.

        fc042ba8c1d5:/# /usr/lib/nagios/plugins/check_dns -V
        check_dns v (nagios-plugins 2.4.5)
        fc042ba8c1d5:/# apk info nagios-plugins-dns
        nagios-plugins-dns-2.4.5-r2 description:
        Nagios plugin check_dns
        nagios-plugins-dns-2.4.5-r2 webpage:
        https://nagios-plugins.org/
        nagios-plugins-dns-2.4.5-r2 installed size:
        80 KiB
        fc042ba8c1d5:/# nslookup stackexchange.com
        Server: 127.0.0.11
        Address: 127.0.0.11#53
        Non-authoritative answer:
        Name: stackexchange.com
        Address: 172.64.144.30
        Name: stackexchange.com
        Address: 104.18.43.226

        If you use the Google DNS, does it segfault as well?

        /usr/lib/nagios/plugins/check_dns -s 8.8.8.8 -H stackoverflow.com

          DocFraggle
          Yes, I tried various combinations of servers and hosts. Everything I’ve tried causes the segfault so far.

          Last thing I can think of is that the installed package is corrupt… what’s your md5sum?

          f87b47a44e88:/# md5sum /usr/lib/nagios/plugins/check_dns
          42aaa5fc36dcecda78f83d1048e7861b  /usr/lib/nagios/plugins/check_dns

          Maybe try to reinstall it inside the watchdog container with the apk command
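To compare the binary against a posted checksum without eyeballing the hex string, md5sum can do the comparison itself with -c. A small sketch, where `verify_md5` is just a hypothetical helper name:

```shell
#!/bin/sh
# verify_md5 EXPECTED FILE: returns 0 if FILE's MD5 matches EXPECTED.
verify_md5() {
  # md5sum -c expects "<hash>  <file>" lines; "-" reads them from stdin.
  printf '%s  %s\n' "$1" "$2" | md5sum -c - >/dev/null 2>&1
}

# Usage with the checksum posted above:
# verify_md5 42aaa5fc36dcecda78f83d1048e7861b /usr/lib/nagios/plugins/check_dns && echo match
```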

            DocFraggle

            Yep md5sum is identical, too.

            And I tried deleting all nagios plugins and reinstalling and no help. So I’m guessing there’s some kind of conflict or bug with docker or the host OS rather than anything wrong with the container itself. Too bad it can’t be debugged.

            What’s your current setup? Do you have selinux in place?

              DocFraggle

              It’s a totally fresh install of Rocky Linux 9.3. SELinux is enabled but in permissive mode. Just for the heck of it, I disabled it completely and restarted, but no change.
