Kubegres uses Streaming Replication between the Primary and Replica PostgreSql pods. This is the default replication mode of PostgreSql.Pod anti-affinity
For a given PostgreSql cluster created with Kubegres, if there are sufficient number of worker nodes, Kubegres tries to apply "pod anti-affinity" based on nodes. This means, each PostgreSql instance is deployed on a separate worker node as long as there are sufficient nodes. This approach is to ensure that if a physical worker node is not available, databases are available on other physical worker nodes. This guarantees a higher resiliency and availability of the database.
However, if there are not sufficient worker nodes in a Kubernetes cluster for the number of PostgreSql instances to deploy, then Kubegres will deploy more than one PostgreSql instances on the same worker node.
- On a Kubernetes cluster of 3 worker nodes, we would like to deploy 3 instances of PostgreSql: Kubegres will deploy each PostgreSql instance on a separate worker node.
- On a Kubernetes cluster of 2 worker nodes, we would like to deploy 3 instances of PostgreSql: Kubegres will deploy one PostgreSql instance on one worker node and 2 PostgreSql instances on the other worker node.
In a production environment, we highly recommend having nodes on multiple datacenters to increase resiliency in the event where one datacenter is not available.
You can modify the default Pod/Node scheduling behaviour by using the fields 'scheduler.affinity' and/or 'scheduler.tolerations', as with the example below:
apiVersion: kubegres.reactive-tech.io/v1 kind: Kubegres metadata: name: mypostgres namespace: default spec: scheduler: affinity: podAntiAffinity: preferredDuringSchedulingIgnoredDuringExecution: - weight: 100 podAffinityTerm: labelSelector: matchExpressions: - key: app operator: In values: - postgres-name topologyKey: "kubernetes.io/hostname" tolerations: - key: group operator: Equal value: critical ...
To explain how replication and failover work, we are going to take a use-case with:
- a Kubernetes cluster of 4 nodes
- a Kubegres resource is deployed with a PostgreSql cluster of 3 pods
For this use case, we will assume that the YAML in the page Getting started was used when deploying Kubegres resource.
Deployment, Replication and Services
The first time a Kubegres resource is created with the command "kubectl", the Kubegres operator deploys each PostgreSql pod in a specific and distinct node. If later a pod is redeployed, it will be located in the same node where it was initially deployed. In this sense it behaves like a standard StatefulSet resource. This also means a node cannot have more than one PostgreSql pod deployed with the same Kubegres YAML.
Since we have a cluster of 3 PostgreSql servers, Kubegres operator deploys a Primary PostgreSql pod first and on success, it deploys 2 Replica PostgreSql pods, as follows:
The Primary PostgreSql pod can be accessed for read and write SQL requests, when the Replica PostgreSql pods can only be accessed for read SQL requests. On data change, Primary replicates the changes to the Replica pods using real-time data streaming.
Kubegres creates 2 Kubernetes services allowing client applications to access to either Primary or Replica pods, as follows:
- a primary service for read/write requests to the Primary PostgreSql pod (e.g. with name "mypostgres")
- a replica service for read requests to the Replica PostgreSql pods (e.g. with name "mypostgres-replica")
Kubegres operator is constantly monitoring whether the Primary PostgreSql requires a failover. For that it relies on:
- PostgreSql container's health: it regularly runs the executable pg_isready provided by PostgreSql.
- the availability of nodes: it relies on Kubernetes to notify it when a node is down
It is possible to disable the automatic failover feature of Kubegres by setting this flag to false: 'failover.isDisabled: false', e.g.:
kind: Kubegres metadata: name: mypostgres namespace: default spec: ... failover: isDisabled: true ...
It is also possible to manually promote a Replica as a Primary by specifying a Replica Pod's name in the field 'failover.promotePod: "mypostgres-2-0"', e.g.:
kind: Kubegres metadata: name: mypostgres namespace: default spec: ... failover: promotePod: "mypostgres-2-0" ...
Assuming that the automatic failover is enabled (and it is by default), when a Primary PostgreSql is unavailable, Kubegres starts the failover process in 2 steps:
- Kubegres checks if there is a Replica PostgreSql pod available. In this example, we have 2 pods available on nodes 2 and 3. Kubegres would promote the Replica PostgreSql pod on node 2 as a Primary pod. This process relies on the standard PostgreSql promotion process which is triggered by creating a log file.
- Once the promotion is completed, Kubegres deploys a new Replica PostgreSql pod. Kubernetes will automatically locate it on the next available node, in our case it would be node 4. It will never deploy more than 1 PostgreSql pod in the same node.
Please see below the state of the PostgreSql cluster after node 1 was reported as unavailable triggering a failover of Primary on node 1 to a Replica on node 2:
How to mitigate failover for client applications
A failover can take few seconds depending on how reactive is Kubernetes in notifying Kubegres about an issue. To make this process as transparent as possible, client applications connecting to PostgreSql must follow the following approaches:
- Connect to Primary PostgreSql via Kubernetes service (Service "mypostgres" in diagram below). When a failover happens, the Kubernetes service for Primary will automatically reconnect to the newly promoted Primary PostgreSql. This is transparent for client applications using it.
- When a new Primary is in the process of being promoted during a failover, some read/write requests for Primary PostgreSql could be rejected. To mitigate this, client apps are encouraged to have a retry logic on failure so that these requests can be successful when a new Primary PostgreSql is promoted.
- For read only requests, client apps could connect to the Replica service (Service "mypostgres-replica" in diagram below). This allows client apps to not be affected for read requests when a new Primary is in the process of being promoted during a failover.