I’ve been running the nightly version for a while and have one issue I haven’t been able to resolve. The watchdog always reports the unbound container as unhealthy, even though it seems to be fine: running the healthcheck.sh script always succeeds, and the container shows as healthy when I start up the stack:
✔ Network mailcowdockerized_mailcow-network Created 0.1s
✔ Container mailcowdockerized-sogo-mailcow-1 Started 0.0s
✔ Container mailcowdockerized-unbound-mailcow-1 Healthy 0.0s
✔ Container mailcowdockerized-netfilter-mailcow-1 Started 0.0s
✔ Container mailcowdockerized-olefy-mailcow-1 Started 0.0s
✔ Container mailcowdockerized-dockerapi-mailcow-1 Started 0.0s
✔ Container mailcowdockerized-memcached-mailcow-1 Started 0.0s
✔ Container mailcowdockerized-clamd-mailcow-1 Started 0.0s
✔ Container mailcowdockerized-redis-mailcow-1 Started 0.0s
✔ Container mailcowdockerized-solr-mailcow-1 Started 0.0s
✔ Container mailcowdockerized-mysql-mailcow-1 Started 0.0s
✔ Container mailcowdockerized-dovecot-mailcow-1 Started 0.0s
✔ Container mailcowdockerized-postfix-mailcow-1 Started 0.0s
✔ Container mailcowdockerized-php-fpm-mailcow-1 Started 0.0s
✔ Container mailcowdockerized-nginx-mailcow-1 Started 0.0s
✔ Container mailcowdockerized-ofelia-mailcow-1 Started 0.0s
✔ Container mailcowdockerized-rspamd-mailcow-1 Started 0.0s
✔ Container mailcowdockerized-acme-mailcow-1 Started 0.0s
✔ Container mailcowdockerized-watchdog-mailcow-1 Started 0.0s
But I still see this error in the logs:
watchdog-mailcow-1 | Tue Apr 16 21:52:27 PDT 2024 Unbound health level: 0% (0/5), health trend: -139
watchdog-mailcow-1 | Tue Apr 16 21:52:28 PDT 2024 Unbound hit error limit
After reading many articles, I concluded that nothing was actually wrong with the unbound container and moved on to the watchdog container. It runs this check in the unbound_checks() function inside the watchdog.sh script:
/usr/lib/nagios/plugins/check_dns -s ${host_ip} -H stackoverflow.com 2>> /tmp/unbound-mailcow 1>&2; err_count=$(( ${err_count} + $? ))
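To illustrate how that line turns a crash into an error count, here is a minimal sketch. It assumes a Linux shell, where a process killed by SIGSEGV (signal 11) is reported with exit status 128 + 11 = 139. The fake_check function is a hypothetical stand-in for check_dns, not part of mailcow. If the real plugin dies the same way, a single failed run would add 139 to err_count, which may be where the -139 in the health trend comes from.

```shell
# Hypothetical stand-in for check_dns: a child shell that kills itself with SIGSEGV.
fake_check() {
  sh -c 'kill -s SEGV $$'
}

err_count=0
# Same accumulation pattern as the watchdog line: add the exit status to err_count.
fake_check 2>/dev/null; err_count=$(( err_count + $? ))
echo "err_count=${err_count}"   # prints err_count=139 on Linux
```

The calling shell may also print its own "Segmentation fault" notice on stderr; that message comes from the shell reaping the child, not from the check itself.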
I tried running the check_dns tool with various options and I always get the result Segmentation fault (core dumped). However, dig works perfectly fine:
ad4040ce952b:/tmp# /usr/lib/nagios/plugins/check_dns -H stackoverflow.com
Segmentation fault (core dumped)
ad4040ce952b:/tmp# /usr/lib/nagios/plugins/check_dns -s 127.0.0.11 -H stackoverflow.com
Segmentation fault (core dumped)
ad4040ce952b:/tmp# dig stackoverflow.com
; <<>> DiG 9.18.19 <<>> stackoverflow.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 47554
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;stackoverflow.com. IN A
;; ANSWER SECTION:
stackoverflow.com. 300 IN A 104.18.32.7
stackoverflow.com. 300 IN A 172.64.155.249
;; Query time: 76 msec
;; SERVER: 127.0.0.11#53(127.0.0.11) (UDP)
;; WHEN: Tue Apr 16 22:01:11 PDT 2024
;; MSG SIZE rcvd: 78
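For anyone trying to reproduce this, the following sketch may help tell a real DNS failure apart from a crashing plugin. It relies on the usual Nagios plugin convention (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN) and on the shell convention that a status above 128 means the process was killed by a signal. The classify_status helper is a hypothetical name introduced here for illustration only.

```shell
# Hypothetical helper: interpret the exit status of a Nagios-style plugin.
classify_status() {
  status=$1
  if [ "$status" -gt 128 ]; then
    # Status above 128: the process was terminated by signal (status - 128).
    echo "killed by signal $(( status - 128 ))"
  else
    case "$status" in
      0) echo "OK" ;;
      1) echo "WARNING" ;;
      2) echo "CRITICAL" ;;
      3) echo "UNKNOWN" ;;
      *) echo "unexpected exit status $status" ;;
    esac
  fi
}

# Inside the watchdog container, one could run (path taken from watchdog.sh):
#   /usr/lib/nagios/plugins/check_dns -s 127.0.0.11 -H stackoverflow.com; classify_status $?
classify_status 139   # prints: killed by signal 11
```

A result of "killed by signal 11" would point at the check_dns binary itself crashing rather than at unbound returning bad answers.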
So I’m guessing something is wrong with that tool. I’m not familiar with the Nagios tools and could use some help figuring it out. I’m running this on a VPS with Rocky Linux 9.3, with no additional firewall or anything else running on the server.