Backup MX only sending 3 messages an hour to the primary

kiril

I’ve just configured a second mailcow system as a backup MX. My primary was down for about 8 hours for maintenance, and I now have hundreds of queued emails waiting to be transferred to the primary. For some reason, the process is taking forever.

The primary’s log shows 3 concurrent connections, each with 7 commands before disconnecting, which I think probably corresponds to a single email per connection. This happens every 15 minutes.

The secondary is showing the probable problem (hostnames, IPs and emails changed):
postfix/error[2165]: 5AA0419C0: to=<x@x.com>, relay=none, delay=21276, delays=20973/303/0/0, dsn=4.4.2, status=deferred (delivery temporarily suspended: conversation with mail.x.com[4.4.4.4] timed out while receiving the initial server greeting)

What could be causing the timeout and the resulting delay here? I’ve whitelisted the backup MX, as confirmed by the primary’s logs:
postfix/postscreen[2501]: WHITELISTED [5.5.5.5]:33560
When I telnet in, I get the 220 response within a second or so.

Update: I’ve tried adding the backup IP to the postscreen whitelist following https://docs.mailcow.email/manual-guides/Postfix/u_e-postfix-postscreen_whitelist/?h=postscreen, but this has made no difference.
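
A note on reading the log line above: Postfix reports delay= and delays= in seconds, so delay=21276 is just under six hours. The four delays= fields are, in order, time before the queue manager, time in the queue manager, connection setup (DNS, HELO, TLS) and message transmission. Below is a minimal sketch that splits them out. explain_delays is a hypothetical helper name, not a Postfix tool, and the sample is the anonymised line from this post with the reason text shortened.

```shell
#!/bin/sh
# explain_delays: hypothetical helper that labels the delay=/delays= fields
# of a single Postfix delivery log line. All values are in seconds.
explain_delays() {
    line=$1
    # Total time since the message entered the queue.
    total=$(printf '%s\n' "$line" | sed -n 's|.* delay=\([0-9.]*\),.*|\1|p')
    # Per-stage breakdown in the form a/b/c/d.
    stages=$(printf '%s\n' "$line" | sed -n 's|.* delays=\([0-9./]*\),.*|\1|p')

    # Split a/b/c/d on "/" into $1..$4.
    old_ifs=$IFS; IFS=/
    set -- $stages
    IFS=$old_ifs

    echo "total=${total}s"
    echo "before_queue_manager=${1}s"
    echo "in_queue_manager=${2}s"
    echo "connection_setup=${3}s"
    echo "transmission=${4}s"
}

sample='postfix/error[2165]: 5AA0419C0: to=<x@x.com>, relay=none, delay=21276, delays=20973/303/0/0, dsn=4.4.2, status=deferred (delivery temporarily suspended)'
explain_delays "$sample"
```

For this line, almost all of the time is spent before and inside the queue manager, while connection setup and transmission are 0. That fits a message that was deferred without a fresh delivery attempt (relay=none, logged by postfix/error).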

esackbauer

I don’t get why people use a backup MX for maintenance tasks (or at all; either use a single instance or build a proper cluster). Sending mail servers will retry for at least 48 hours anyway.
The problem is that once the backup MX accepts the mail, the sender can consider it delivered, which can cause problems with legal deadlines. Also, no bounce message is sent to the original sender to tell them that the mail hasn’t reached the recipient yet.

Anyway, did you also whitelist your backup MX under System - Configuration - Options - Forwarding hosts?
Or are any rate limits configured?

kiril

The experience has definitely taught me the error of my ways. There are just 20 messages left in the queue, and I’ll turn the backup MX off once it has cleared. The main reason for setting it up was that it was unclear how long the outage would be. 24 hours was expected, which led me to consider 48 hours a possibility, in which case we’d have been bouncing mail.

I did whitelist it, and I have made sure there aren’t any rate limits configured in the UI. I assume there are some default rate limits in Postfix that I haven’t changed. I also tried adding a transport map pointing to port 25 of the primary MX. The next thing to try was altering the transport map to use port 465 with a username and password, but I think I’ll just wait for the queue to clear and give up. I am still curious to understand what the limits might be, how to change them, and how mailcow’s Postfix might behave for a busier domain than one that handles just a couple of hundred emails a day.
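
On the question of default limits, here is a hedged sketch for reading the values Postfix is actually running with, rather than guessing. The parameter names are real Postfix settings, but which of them (if any) explains the pattern of 3 messages per queue run is an assumption, not something established in this thread. pf_conf and the two show_* functions are hypothetical helper names, and postfix-mailcow is the service name used elsewhere in this thread.

```shell
#!/bin/sh
# Run from the mailcow directory (e.g. /opt/mailcow-dockerized).
# pf_conf: hypothetical wrapper that runs postconf inside the Postfix container.
pf_conf() {
    docker compose exec postfix-mailcow postconf "$@"
}

# On the backup MX (sending side): timeouts, concurrency and retry backoff.
show_client_limits() {
    pf_conf smtp_helo_timeout smtp_connect_timeout \
        initial_destination_concurrency default_destination_concurrency_limit \
        minimal_backoff_time maximal_backoff_time queue_run_delay
}

# On the primary (receiving side): per-client connection and rate limits.
show_server_limits() {
    pf_conf smtpd_client_connection_count_limit smtpd_client_connection_rate_limit \
        smtpd_client_message_rate_limit anvil_rate_time_unit \
        postscreen_client_connection_count_limit
}
```

Call show_client_limits on the backup and show_server_limits on the primary; each prints name = value pairs. smtp_helo_timeout is the client-side limit that covers waiting for the initial server greeting. As far as I know, mailcow’s place for Postfix overrides is data/conf/postfix/extra.cf, followed by a restart of the postfix-mailcow container, but check the mailcow docs before relying on that.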

What I’ll probably do is configure a cold standby in a different DC and lower the TTL on my MX and A records so it can take over within 10 minutes or so.

DocFraggle

You could try to requeue all the mails on your secondary:

cd /opt/mailcow-dockerized # (I assume that's your mailcow location)
docker compose exec postfix-mailcow /bin/bash
root@d41ce1d74c00:/# postsuper -r ALL
postsuper: Requeued: XXX messages

postsuper -r ALL is the command, the formatting is not optimal above 😃

kiril

Did that. It caused Postfix to deliver 3 emails on the next queue run, with the rest failing with the same timeout error.