@DocFraggle
@ETNyx
Thank you for the feedback, let me try and clarify a couple of things.
I haven’t run the cold standby script yet. I’m just contemplating introducing it as part of the backup plan, to supplement the existing backups and speed up the recovery process. I did have a look through the helper script.
As I see it, the script creates an identical copy of your current setup, including mailcow.conf, which has a MAILCOW_HOSTNAME pointing to your primary server. This is different from your standby server, which is configured as mailcow-backup.host.name (not literally, just sticking to the documented example, but it is different).
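To make the hostname mismatch concrete, here is a minimal sketch that reads MAILCOW_HOSTNAME out of a copied mailcow.conf and compares it with the hostname you expect on the standby. It assumes mailcow.conf consists of plain KEY=VALUE lines with # comments; the function names and the primary hostname mail.example.org are hypothetical.

```python
def parse_mailcow_conf(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding quotes, in case a value was written as KEY="value"
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def hostname_matches(conf_text: str, expected: str) -> bool:
    """True if the MAILCOW_HOSTNAME in the config equals the expected hostname."""
    return parse_mailcow_conf(conf_text).get("MAILCOW_HOSTNAME") == expected


# Hypothetical example: a conf copied from the primary still names the primary.
copied = "# mailcow config\nMAILCOW_HOSTNAME=mail.example.org\n"
print(hostname_matches(copied, "mailcow-backup.host.name"))  # False
```

On a standby synced from the primary, you would expect this to report a mismatch against the standby’s own name, which is the point: the copied config describes the primary, not the box it sits on.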
It’s not explicitly stated in the docs, but I expect that the intention is that when you want to bring this into production, you will shut down the primary server (if it hasn’t died), rename the standby server to match the primary server and update DNS with the new IP address. Maybe you restore from a more recent backup than the last cold standby run if you have one at this stage. Finally you will “UP” the backup server and it takes over the role of the previous primary.
And apart from a few mail delays while DNS propagates the new IP address and perhaps some mails lost between the most recent state of the primary and the latest restored backup, everything is back to normal.
But my question is, how do you test this? Can this be tested?
Without testing, it’s only after you’ve committed to the DNS changes and actively switched servers that you start to encounter problems, and then you have to deal with them on the fly. What problems might we encounter? Most likely problems with the server configuration itself, since mailcow is a duplicate enhanced with a restore, but the host underneath it is not. Maybe the technician who set it up left postfix/exim running on port 25, or didn’t disable ufw, or the hosting provider blocks incoming port 25... Any number of things can happen here. You can imagine many of them and build tests for them into your recovery procedure, but there may be things you don’t imagine, and that is why you would want to run tests.
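Some of the host-level problems listed above can be checked without touching DNS. Below is a minimal sketch of one such check, assuming nothing more than a plain TCP connect: run it on the standby against 127.0.0.1 port 25 before bringing mailcow up (a True result suggests something like postfix/exim is already holding the port), and later run it from an outside machine against the standby’s public IP to see whether the provider lets port 25 through. The function name is hypothetical, and a successful connect only proves that something is listening, not that it is mailcow.

```python
import socket


def port_is_listening(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts
        return False


# Hypothetical pre-flight check on the standby, before mailcow is started:
if port_is_listening("127.0.0.1", 25):
    print("Port 25 is already in use on the standby; check for a local MTA")
```

A connection that times out from outside, while mailcow is up and listening locally, points towards a firewall (ufw or the provider) rather than mailcow itself.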
As to the part about not interfering with the primary: part of the switch-over process is making DNS changes so that the new server replaces the old one. Testing this would definitely interfere with your primary mail server. Or am I missing something here? Is the intention to run this standby server under a different hostname and edit mailcow.conf accordingly before bringing it up? But then what about Let’s Encrypt and the ACME certificate issuance?
There was a previous post I read about a user who followed this process, and while it appeared to function just fine, they reported that they later discovered not everything had migrated and they’d lost emails in the process. Unfortunately they didn’t follow up to say whether they’d found and fixed the problem. It’s probably worth asking the poster directly, but these are the things that raise questions about data integrity.
Now, I have an idea of how to at least test the deliverability aspects without interference, but I was hoping to hear how others in the community are doing it before putting a proposal forward and testing it. Data integrity testing seems more involved and elusive, unless someone has already developed a test. I freely admit that the backup is based on a clone, and assuming the clone is a bit-for-bit identical copy, it should work. That’s the “leap of faith” at play here: trust the tools and the process. But… bugs… which is why we test.
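One possible shape for a data integrity test is to compare a manifest of messages on the primary with one taken on the restored standby. The sketch below only does the comparison step. It assumes you have already collected, for each folder, the set of Message-ID headers on each server (for example over IMAP); that collection step is not shown, and the manifest format and function name are hypothetical. Messages that arrived on the primary after the last backup will legitimately show up as missing, and messages without a Message-ID would need some other key.

```python
from typing import Dict, Set

# Hypothetical manifest format: folder name -> set of Message-ID header values
Manifest = Dict[str, Set[str]]


def missing_on_standby(primary: Manifest, standby: Manifest) -> Dict[str, Set[str]]:
    """Return, per folder, the Message-IDs present on the primary but not on the standby."""
    gaps: Dict[str, Set[str]] = {}
    for folder, ids in primary.items():
        missing = ids - standby.get(folder, set())
        if missing:
            gaps[folder] = missing
    return gaps


primary = {"INBOX": {"<a@example.org>", "<b@example.org>"}}
standby = {"INBOX": {"<a@example.org>"}}
print(missing_on_standby(primary, standby))
```

Keying on Message-ID rather than raw message counts makes it easier to tell which specific mails went missing, which is the situation described in the earlier post.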
At the moment this is still theoretical for me, but I do intend to run through the process in the next few days to see for myself how it all works. I’m just exploring ideas and solidifying my understanding of the process before I start, and was wondering what the community is currently doing to test it.