pgpool HA failover issue



Not sure if anyone here will have experience with this and be able to help, but it's worth a shot.

 

I'm working on setting up a fault-tolerant database backend for a high-traffic web service (no, not Neowin!), which uses PostgreSQL for the database layer.

 

There are two database servers, one configured as the master, and the other as a slave replicating from it using hot-standby streaming replication. This is all working fine.

 

On each of the database servers, we're also running a pgpool instance, configured with a virtual IP. These instances handle connections from the web service and direct them to the databases as appropriate (reads are balanced; writes all go to the master). Clients connect to pgpool using the virtual IP address, the idea being that if one server or pgpool instance fails, the virtual IP is picked up by the other server and clients carry on as if nothing happened.
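For anyone unfamiliar with this layout, it corresponds to something like the following pgpool-II configuration fragment. All hostnames, the VIP, and the values are placeholders, not the actual config from this setup:

```
# pgpool.conf (fragment) -- illustrative values only
listen_addresses = '*'
port = 9999

# Backends: node 0 is the initial master, node 1 the streaming slave
backend_hostname0 = 'db1.example.com'
backend_port0 = 5432
backend_weight0 = 1
backend_hostname1 = 'db2.example.com'
backend_port1 = 5432
backend_weight1 = 1

# Balance reads across backends; route writes to the master
load_balance_mode = on
master_slave_mode = on
master_slave_sub_mode = 'stream'

# Watchdog: the virtual IP that clients connect to
use_watchdog = on
delegate_IP = '10.0.0.100'
```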

 

Pgpool is configured with failover and restore scripts, so that if the master postgres instance fails, the slave is quickly promoted to master. The old master can then be reattached as a slave later, once an admin has had a chance to investigate and resolve whatever caused it to fail.
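The failover script is registered in pgpool.conf as something like `failover_command = '/etc/pgpool/failover.sh %d %P %H'` (`%d` = failed node id, `%P` = old primary node id, `%H` = new master host). A minimal sketch of its decision logic follows; the path, user, and trigger-file location are assumptions, and the `ssh` command is echoed rather than executed so the logic can be read in isolation:

```shell
#!/bin/sh
# promote_if_primary FAILED_NODE_ID OLD_PRIMARY_ID NEW_MASTER_HOST
# Decide whether the failed backend was the primary and, if so, emit
# the command that would promote the surviving standby. The command is
# echoed (dry run) here; a real failover script would execute it.
promote_if_primary() {
    failed_node_id=$1
    old_primary_id=$2
    new_master_host=$3

    # A failed standby needs no promotion; only act on primary failure.
    if [ "$failed_node_id" != "$old_primary_id" ]; then
        echo "node $failed_node_id was a standby; nothing to do"
        return 0
    fi

    # Creating the trigger file named in the standby's recovery.conf
    # promotes it; on PostgreSQL 9.1+ "pg_ctl promote" is equivalent.
    echo "ssh postgres@$new_master_host touch /tmp/pgsql.trigger"
}

promote_if_primary 0 0 db2.example.com
```

The same script usually does nothing when a standby fails, which is why the check on the old primary id comes first.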

 

There is also a third pgpool instance running on a separate server, with no database running on it, so that there will always be a quorum in an election between the pgpool instances when deciding which is master. This third instance has a low priority set, so it should never be promoted to master.
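The third node's watchdog section would differ from the others mainly in its priority. Parameter names are pgpool-II's watchdog settings; the hostname and values are illustrative:

```
# pgpool.conf on the quorum-only node (fragment) -- illustrative values
use_watchdog = on
wd_hostname = 'arbiter.example.com'
wd_port = 9000

# A lower value means lower priority in the master election, so this
# node provides quorum but should never take over the virtual IP.
wd_priority = 1
```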

 

I have tested this by stopping postgresql on the master server: pgpool spots the failure and successfully fails over to the slave as expected. The database is unavailable for about five seconds, and then connections continue as before.

 

However, if I pull the network cable from the server running the master postgres instance (to simulate a server crash or network failure), the behaviour is different. The pgpool instances on the remaining two servers hold an election to decide a new pgpool master (if the pgpool master was on the same server as the postgres master), and the virtual IP is successfully moved across where needed. But the postgres instance on the slave is never promoted to master: the failover script is never executed. The pgpool logs show the connection to the previous master failing, as expected.

 

If anyone has any experience with this kind of setup of pgpool, and has any idea where it may be going wrong, I'd be really grateful!


This sounds like it could be an issue with gratuitous ARPs on the network, since you are using virtual IPs. What's running on your network?
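One way to check this is to watch ARP traffic during a failover and confirm that the new holder of the virtual IP announces it. A sketch, assuming iputils `arping`, interface `eth0`, and VIP `10.0.0.100` (all placeholders); the announcement command is composed and echoed rather than run, since it needs root and a real interface:

```shell
#!/bin/sh
# Build the gratuitous-ARP announcement for a virtual IP.
# With iputils arping, -U sends an unsolicited ("gratuitous") ARP
# claiming the address from the given interface.
garp_cmd() {
    iface=$1
    vip=$2
    echo "arping -U -I $iface -c 3 $vip"
}

# To watch whether the VIP takeover is being announced at all:
#   tcpdump -n -i eth0 arp and host 10.0.0.100
garp_cmd eth0 10.0.0.100
```

If no announcement shows up in the capture after a failover, sending one manually from the new master tells you whether the switches honour it.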

 

https://wiki.wireshark.org/Gratuitous_ARP

 

From a skim read, there seems to be a lot of good information in there.

 

http://scale-out-blog.blogspot.co.uk/2011/01/virtual-ip-addresses-and-their.html

The virtual IP was failing over fine, but the failover script to promote the slave postgresql instance to master wasn't being fired, so clients weren't able to make any writes to the database.

 

I managed to solve the issue by moving the pgpool instances to VMs on the client servers, rather than running them on the database servers. The problem only seemed to occur when a pgpool instance went offline at the same time as a postgres instance: the pgpool cluster would get distracted electing a new master and simply ignore the fact that one of the backends was offline. Separating them, so that a dying postgres server no longer takes a pgpool node down with it, resolved the issue.

