Verifying the cold standby backup process

chriss

Hello all

I have some questions for the mailcow community members that have implemented the mailcow cold standby backup process.

Having backups is good, however a tested and verified restore procedure is best.

So you’ve followed the process documented here: Cold-standby (rolling backup)

You’ve set up a new server mailcow-backup.host.name and configured it (hopefully correctly);
You’ve installed docker and docker compose on it, but not mailcow;
You’ve run the create_cold_standby.sh script several times and not seen any errors. In fact, you may have even automated it with cron.
You have the expectation that everything is in order.

But how can you be sure that when you decide to do a docker compose up -d on the new server, happiness is guaranteed?

What process or procedure do you follow to verify that your cold standby system will function properly when you need it?
How do you verify and test it without interfering with mail delivery to your primary server?

DocFraggle

chriss without interfering with mail delivery to your primary server?

Why would it interfere with your primary server? I run the script on a daily basis and start it up right away on the backup server. As it has a different IP address it doesn’t interfere with anything.

ETNyx

DocFraggle

I believe the question is a little different. He does not want to interfere; he is suspicious about how it works and, most importantly, whether it is safe. Is it consistent? And how can a user be sure about the answers to these questions?

chriss

Yes, I can see how it seems to be a leap of faith… Start by inspecting the _cold-standby.sh script and try to understand what is going on. Then you will see that the developers are using (at least by my understanding) an approach that produces a consistent backup.

Not sure if you can do any post-check of consistency, since your production will be in a different state by the time you can do it.

DocFraggle

ETNyx I believe the question is a little different.

I disagree, I think that’s exactly what he was asking:

chriss What process or procedure do you follow to verify that your cold standby system will function properly when you need it?
How do you verify and test it without interfering with mail delivery to your primary server?

I verify it by just starting it up and logging in, and it doesn’t interfere with the primary server at all.

But only @chriss can tell us if this answers his question or not 🙂

chriss

@DocFraggle
@ETNyx

Thank you for the feedback, let me try and clarify a couple of things.

I haven’t run the cold standby script yet; I’m just contemplating introducing it as part of the backup plan to supplement the existing backups and speed up the recovery process. I did have a look through the helper script.

As I see it, the script creates an identical copy of your current setup, including mailcow.conf, which has a MAILCOW_HOSTNAME pointing to your primary server. This is different from your standby server, which is configured as mailcow-backup.host.name (not literally, just sticking to the documented example, but it is different).

It’s not explicitly stated in the docs, but I expect that the intention is that when you want to bring this into production, you will shut down the primary server (if it hasn’t died), rename the standby server to match the primary server and update DNS with the new IP address. Maybe you restore from a more recent backup than the last cold standby run if you have one at this stage. Finally you will “UP” the backup server and it takes over the role of the previous primary.
And apart from a few mail delays while DNS propagates the new IP address and perhaps some mails lost between the most recent state of the primary and the latest restored backup, everything is back to normal.

But my question is, how do you test this? Can this be tested?

Without testing, it’s only after you’ve committed to DNS changes and actively switched servers that you may start to encounter problems, and then you have to deal with them on the fly. What problems might we encounter? Probably ones with the server config itself, since mailcow is a duplicate enhanced with a restore. Maybe the technician that set it up left postfix/exim running on port 25, or didn’t disable ufw, or the hosting provider blocks incoming port 25… Any number of things can happen here. Some you can imagine and build tests for into your recovery procedure, but there may be others that you don’t imagine, and that is why you would want to run tests.
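
The "things you can imagine" part can be partly automated. Below is a minimal sketch, not an official mailcow tool, that checks whether the usual mail ports on the standby accept TCP connections. The port list and the function names are assumptions made for illustration; adjust them to whatever your setup actually exposes.

```python
import socket

# Ports a mail server commonly exposes (assumed list, adjust to your setup).
MAIL_PORTS = {
    25: "smtp",
    465: "smtps",
    587: "submission",
    143: "imap",
    993: "imaps",
    443: "https",
}


def check_port(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False


def check_ports(host, ports=MAIL_PORTS, timeout=5.0):
    """Map each port number to True (reachable) or False (closed or filtered)."""
    return {port: check_port(host, port, timeout) for port in ports}
```

Calling `check_ports("mailcow-backup.host.name")` from a machine outside the hosting provider's network is what would reveal a provider-side block on port 25; running it on the standby itself only shows that something is listening locally. An open port says nothing about which daemon answers, so a leftover postfix/exim on the host would still pass this check.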

As to the part about not interfering with the primary, part of the switching-over process is making DNS changes so that the new server replaces the old server. Testing this would definitely interfere with your primary mail server. Or am I missing something here? Is the intention to run this standby server under a different hostname and edit mailcow.conf accordingly before bringing it up? But then what about the LE and ACME situation?

There was a previous post I read about a user that followed this process and, while it appeared to function just fine, they reported that they’d later discovered that not everything had migrated and they’d lost emails in the process. Unfortunately they didn’t follow up to say whether they’d found and fixed the problem. It is probably worthwhile my asking the poster, but these are the things that raise the questions of data integrity.

Now I have an idea on how to at least test the deliverability aspects without interference, but I was hoping to hear how others in the community may be doing it before putting a proposal forward and testing it. As to the data integrity testing, that one seems a bit more intense and elusive, unless someone has already developed a test. I freely admit that the backup is based on a clone and, assuming the clone is a bit-for-bit identical copy, it should work. That is the “leap of faith” at play here: trust the tools and the process. But… bugs… which is why we test.
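
One rough way to approach the integrity question is to compare a file manifest of the mail store on both machines. This is only a sketch under assumptions: it treats the mail store as a plain directory tree that you can read on each host, the function names are made up for illustration, and it compares file sizes rather than contents.

```python
import os


def build_manifest(root):
    """Map each file's path (relative to root) to its size in bytes."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = os.path.getsize(full)
    return manifest


def diff_manifests(primary, standby):
    """Return (missing_on_standby, extra_on_standby, size_mismatch) as sorted lists."""
    missing = sorted(set(primary) - set(standby))
    extra = sorted(set(standby) - set(primary))
    common = set(primary) & set(standby)
    changed = sorted(path for path in common if primary[path] != standby[path])
    return missing, extra, changed
```

You would run `build_manifest` on each host, ship one result to the other (as JSON, for example) and feed both to `diff_manifests`. Because the primary keeps receiving mail after the sync, some entries in the "missing" list are expected; the interesting cases are files that are older than the last sync and still missing or different in size.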

At the moment this is still theoretical for me, but I do intend to execute the process in the next few days to see for myself how it all works. I’m just exploring ideas and solidifying my understanding of the process before I start and was wondering what the community is currently doing regarding testing the process.

DocFraggle

chriss OK, now I understand 🙂 So here’s what I did for testing purposes:

given my MAILCOW_HOSTNAME is mail.domain.tld, I created a CNAME testcow.domain.tld
I added “testcow.domain.tld” to ADDITIONAL_SAN and ADDITIONAL_SERVER_NAMES in my primary’s mailcow.conf. So ACME is adding the SAN testcow.domain.tld to the certificate.
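
To confirm that both settings really contain the test name before waiting on ACME, a small check over mailcow.conf could look like the sketch below. It assumes the file consists of plain KEY=VALUE lines and that both variables hold comma-separated lists; the function names are hypothetical.

```python
def parse_conf(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        conf[key.strip()] = value.strip()
    return conf


def missing_test_name(conf, name,
                      keys=("ADDITIONAL_SAN", "ADDITIONAL_SERVER_NAMES")):
    """Return the keys whose comma-separated value does not list name."""
    missing = []
    for key in keys:
        entries = [e.strip() for e in conf.get(key, "").split(",") if e.strip()]
        if name not in entries:
            missing.append(key)
    return missing
```

For example, `missing_test_name(parse_conf(open("mailcow.conf").read()), "testcow.domain.tld")` returns an empty list when both settings include the name. The same parser can also be pointed at the standby's copy of mailcow.conf to see which MAILCOW_HOSTNAME was cloned over.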

As I told you, I have a daily cronjob which runs the create_cold_standby.sh script and creates the clone of my mailcow on another VM. I created a local hosts entry on my Windows machine that points testcow.domain.tld to the IP of the backup VM. This way I can connect via https://testcow.domain.tld without SSL errors. It works great and I can test basic stuff there. Of course, as you mentioned, I can’t test everything, like sending mail, as I don’t have the correct DNS setup for testcow.domain.tld, but if you want you could set this up too, by adding the IP of your backup VM to your SPF record etc.
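
The hosts-entry trick can also be scripted, which is handy if the check should run after every nightly clone. The sketch below connects straight to the backup VM's IP while presenting testcow.domain.tld as the TLS server name, so no hosts file change is needed. It uses only Python's standard `ssl` and `socket` modules; the function names and any IP you pass in are placeholders.

```python
import socket
import ssl


def fetch_cert(ip, hostname, port=443, timeout=10.0):
    """Connect to ip:port, send hostname as SNI, return the validated peer cert.

    Raises ssl.SSLCertVerificationError if the chain or hostname check fails.
    """
    context = ssl.create_default_context()
    with socket.create_connection((ip, port), timeout=timeout) as raw:
        with context.wrap_socket(raw, server_hostname=hostname) as tls:
            return tls.getpeercert()


def san_dns_names(cert):
    """Extract the DNS names from the subjectAltName of a getpeercert() dict."""
    return [value for kind, value in cert.get("subjectAltName", ()) if kind == "DNS"]
```

If `fetch_cert("<backup VM IP>", "testcow.domain.tld")` returns without raising, the cloned certificate is valid for the test name, and `san_dns_names` on the result should list both mail.domain.tld and testcow.domain.tld. This covers only the HTTPS side; it does not log in or exercise SMTP/IMAP.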

Some months ago I switched from a Rocky 8 VM to Debian 12. I use Hetzner as VPS provider. I set up a new VM with Debian 12, installed Docker and let the create_cold_standby.sh script do its job. Then I shut down the primary VM, removed the IPv4 and IPv6 addresses from the new VM and moved the addresses from the old VM to the new VM. This way I didn’t have to go through the painful process of re-establishing all the mail reputation stuff again.

I didn’t lose any mails, everything was cloned perfectly. I don’t know what exactly the problems were with the other guy…