Edge Node Cluster App Instance Failover Behavior

This is a series of articles. You will likely follow them in this order. 

  1. Edge Node Cluster Overview
  2. Create an Edge Node Cluster 
  3. Manage an Edge Node Cluster
  4. Use the ZEDEDA CLI to Manage an Edge Node Cluster
  5. Edge Node Cluster App Instance Failover Behavior - You are here!

Edge Node User-Initiated Maintenance

Three operations count as user-initiated node maintenance: reboots, shutdowns, and EVE-OS upgrades. All three are initiated from the ZEDEDA Cloud management interfaces. The cluster handles App Instances identically in all three cases: the node must be drained of workloads before the outage event is allowed to continue. The drain process stops the application instances on the node and allows them to be scheduled on another node in the cluster. The drain can be delayed if a clustered volume instance is degraded, for example when one or more replicas need to be rebuilt after a recent unplanned outage of another node. In that case, the drain process blocks the node until the clustered volume instance has read data from the remaining healthy replicas and has been rebuilt to full health.

The drain and rebuild process can take an extended period of time. Its duration depends on the disk I/O performance of the cluster nodes, the inter-node network performance, and the size of the data to be replicated.

If the clustered volume instances do not need to be rebuilt, the application instances should fail over within minutes of the node receiving the request, and the node then continues with the requested maintenance operation.
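The drain gating described above can be pictured with a small sketch. This is an illustrative model only, not EVE-OS or ZEDEDA code: the `ClusteredVolume` class, its fields, the replica count of 3, and both function names are assumptions made up for this example. It shows the rule that a node may finish draining only when no clustered volume instance is waiting on a replica rebuild.

```python
from dataclasses import dataclass


@dataclass
class ClusteredVolume:
    # Hypothetical model for illustration; not an EVE-OS data structure.
    name: str
    healthy_replicas: int
    desired_replicas: int

    @property
    def degraded(self) -> bool:
        # A volume is treated as degraded while any replica is missing.
        return self.healthy_replicas < self.desired_replicas


def drain_blockers(volumes):
    """Return the names of volumes that would hold up the drain."""
    return [v.name for v in volumes if v.degraded]


def drain_may_complete(volumes):
    """The drain is allowed to finish only when nothing is rebuilding."""
    return not drain_blockers(volumes)


volumes = [
    ClusteredVolume("app-a-data", healthy_replicas=3, desired_replicas=3),
    ClusteredVolume("app-b-data", healthy_replicas=2, desired_replicas=3),
]
print(drain_blockers(volumes))
print(drain_may_complete(volumes))
```

In this model, the maintenance request stays blocked on `app-b-data` until its missing replica has been rebuilt.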

Reboot and Prepare-Shutdown

The drain process described above begins immediately, and Application Instances can see approximately 1 to 2 minutes of downtime.
Note: Downtime will vary depending on hardware differences and the health of the clustered volume instance for that Application Instance.

EVE-OS Updates

For EVE-OS update requests, the drain process described above begins after the image has been downloaded locally to the edge node, verified successfully, and activated from the ZEDEDA Cloud interface. Downtime for Application Instances should follow the existing EVE-OS update expectations for systems that are not part of an edge node cluster.

Application Failback

Failback is the process that occurs when an edge node comes back online and applications are moved back to the node on which they were originally scheduled or requested. During this process, the edge node cluster identifies all application instances that are not running on their originally preferred node, stops them, and allows the cluster to reschedule them on the requested node. The failback process begins after a cluster node completes its boot process and rejoins the cluster.
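As a rough sketch of the selection step in failback, the function below picks the application instances that are running somewhere other than their preferred node. It is not the cluster's actual scheduler logic: the `placements` mapping, the node names, and the extra check that the preferred node is online are assumptions made for this example.

```python
def failback_candidates(placements, online_nodes):
    """Pick instances to stop so they can be rescheduled on their preferred node.

    placements maps an instance name to a (preferred_node, current_node) pair.
    Both the mapping and the online-node check are illustrative assumptions.
    """
    return sorted(
        name
        for name, (preferred, current) in placements.items()
        if current != preferred and preferred in online_nodes
    )


placements = {
    "app-a": ("node-1", "node-2"),  # failed over earlier, node-1 is back
    "app-b": ("node-2", "node-2"),  # already on its preferred node
    "app-c": ("node-3", "node-1"),  # preferred node is still offline
}
print(failback_candidates(placements, online_nodes={"node-1", "node-2"}))
```

Only `app-a` is selected here, because it is the only instance that is off its preferred node while that node is back in the cluster.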

Edge Node Unexpected Outages

For all unexpected node outages, there is a delay before the outage status is shown in the ZEDEDA Cloud management tools. The edge node cluster first recognizes the node outage after a minimum of 1 minute. The status may not appear in ZEDEDA Cloud for at least 1 more minute, because of the interval at which EVE-OS reports status.

Application failover in unexpected outages is subject to much longer delays, because the cluster must avoid unnecessary failovers caused by intermittent inter-node network outages. If the edge node cluster detects that a node has been unreachable for more than 5 minutes, the failover process begins for all application instances running on that node.
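The timings above can be summarized as a simple timeline. The sketch below is only an illustration: the constant names, the stage labels, and the choice to measure every threshold from the moment the node became unreachable are assumptions, and the first two thresholds are minimums rather than guarantees.

```python
# Thresholds in seconds, taken from the text above.
CLUSTER_DETECT_MIN_S = 60   # earliest the cluster recognizes the outage
CLOUD_REPORT_MIN_S = 120    # earliest the status may appear in ZEDEDA Cloud
FAILOVER_AFTER_S = 300      # failover begins once this is exceeded


def outage_stage(unreachable_s):
    """Map how long a node has been unreachable to the furthest possible stage."""
    if unreachable_s > FAILOVER_AFTER_S:
        return "failover begins"
    if unreachable_s >= CLOUD_REPORT_MIN_S:
        return "may be visible in ZEDEDA Cloud"
    if unreachable_s >= CLUSTER_DETECT_MIN_S:
        return "may be recognized by the cluster"
    return "not yet detected"


for seconds in (30, 90, 180, 300, 301):
    print(seconds, outage_stage(seconds))
```

A node that recovers before the 5 minute mark never reaches the last stage in this model, which mirrors the intent of not failing over on short network interruptions.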

 
