The control plane (CP) includes the systems that configure resources running in the data plane. Control planes provide the administrative APIs used to create, read/describe, update, delete, and list (CRUDL) resources. In other words, the control plane allocates and configures the resources, and the data plane runs them.
The data plane (DP) consists of the systems that consume those resources and deliver the primary function of the service.
Control planes and data planes are decoupled to improve resiliency, so that a failure in the control plane does not impact the data plane. AWS incorporates this design principle in most of its services to improve their performance and availability.
The control plane replicates its configuration data to multiple data plane replicas distributed across Regions. This enables the data plane to continue working in the event of a control plane impairment. The data plane can access and operate on resources that have already been provisioned, with or without the control plane.
For instance, if the ability to deploy and configure a load balancer is unavailable, we can continue to use pre-deployed load balancers to serve requests based on their current configuration. Similarly, the EC2 control plane is responsible for allocating and reconfiguring instances. The data plane is responsible for running and interacting with existing EC2 instances, and it can do so even when the control plane becomes unavailable.
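The behavior described above, where the data plane keeps working from the configuration it already holds, can be illustrated with a minimal Python sketch. It is not how any AWS service is implemented; the `DataPlane` class, the `ControlPlaneUnavailable` exception, and the `"default-backend"` fallback are hypothetical names introduced only for illustration.

```python
from typing import Callable, Dict, Optional


class ControlPlaneUnavailable(Exception):
    """Raised by the (hypothetical) control plane client when it cannot be reached."""


class DataPlane:
    """Serves requests from the last configuration it successfully received."""

    def __init__(self, fetch_config: Callable[[], Dict[str, str]]):
        self._fetch_config = fetch_config
        self._config: Optional[Dict[str, str]] = None

    def refresh(self) -> bool:
        """Try to pull new configuration; keep the previous one on failure."""
        try:
            self._config = self._fetch_config()
            return True
        except ControlPlaneUnavailable:
            return False

    def route(self, path: str) -> str:
        """Pick a backend for a request using only locally held configuration."""
        if self._config is None:
            raise RuntimeError("no configuration has ever been received")
        return self._config.get(path, "default-backend")
```

If `refresh()` succeeds once and the control plane then becomes unreachable, later `refresh()` calls return `False` while `route()` keeps answering from the last known configuration. Only changes to the configuration are blocked, not the serving path.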
Control planes are usually more complex than data planes, so failures are more common in control planes. Based on the principles described above, you can generally tell whether you are using the control plane or the data plane from the service and the API action. The following are some examples of control plane actions:
a) Launch an EC2 instance
b) Create an S3 bucket
c) Create a Lambda function
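One way to make this distinction concrete is a small lookup that labels each (service, action) pair. The sketch below is only an illustration: the table is deliberately tiny and hand-written, the function names are hypothetical, and a real inventory would be built from the calls your workload actually issues. The data plane entries (`GetObject`, `PutObject`, `Invoke`) are added here for contrast and do not come from the list above.

```python
from typing import Iterable, List, Tuple

# Hand-maintained, illustrative table; extend it for your own workload.
CONTROL_PLANE_ACTIONS = {
    ("ec2", "RunInstances"),      # launch an EC2 instance
    ("s3", "CreateBucket"),       # create an S3 bucket
    ("lambda", "CreateFunction"), # create a Lambda function
}
DATA_PLANE_ACTIONS = {
    ("s3", "GetObject"),
    ("s3", "PutObject"),
    ("lambda", "Invoke"),
}


def classify(service: str, action: str) -> str:
    """Return 'control', 'data', or 'unknown' for one API call."""
    key = (service.lower(), action)
    if key in CONTROL_PLANE_ACTIONS:
        return "control"
    if key in DATA_PLANE_ACTIONS:
        return "data"
    return "unknown"


def control_plane_calls(calls: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Filter a workload's API calls down to those that hit a control plane."""
    return [c for c in calls if classify(*c) == "control"]
```

Running `control_plane_calls` over the API calls a workload makes during normal operation gives a first estimate of how exposed it is to a control plane impairment.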
The degree to which an application depends on the control plane varies by use case. Each workload must be assessed individually by thoroughly examining the AWS API calls it issues and analyzing the answers to the following questions:
Depending on the application’s exposure to control plane impairments, the risk mitigation strategy may differ. Here are some examples:
Once the application’s dependency on the control plane is established, with a detailed list of the API calls used in its business as usual (BAU) activities, those API calls should be incorporated into the application health checks. For example, instead of returning a static HTML status page, the health check can serve a dynamic page that performs all the required API calls on test/dummy resources (to avoid data corruption) and only then returns an OK (200) status.
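A minimal sketch of such a dynamic health check is shown below. It assumes each required API call is wrapped in a probe function that raises an exception on failure. The `health_check` function, the probe dictionary, and the choice of 503 for the failure case are assumptions made for illustration; the text above only specifies the 200 response.

```python
from typing import Callable, Dict, Tuple


def health_check(probes: Dict[str, Callable[[], None]]) -> Tuple[int, str]:
    """Run every probe and report 200 only if all of them succeed.

    Each probe is expected to exercise one API call against a test/dummy
    resource and to raise an exception if the call fails.
    """
    failures = []
    for name, probe in probes.items():
        try:
            probe()
        except Exception as exc:  # any failure marks the check unhealthy
            failures.append(f"{name}: {exc}")
    if failures:
        # 503 is an assumed status code for "dependency unavailable".
        return 503, "FAILED\n" + "\n".join(failures)
    return 200, "OK"
```

The returned status and body can be handed to whatever web framework serves the health endpoint. Keeping the probes as plain callables means the same function can be exercised with stubs in tests.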
A canary probe (e.g. a Lambda function) for S3 control plane actions may consist of the following S3 API operations performed on a dummy resource within a specific Region.
1. CreateBucket:
PUT /v20180820/bucket/
Host: Bucket.s3-control.amazonaws.com
…LocationConstraint…
…
2. PutBucketPolicy:
PUT /v20180820/bucket/
Host: Bucket.s3-control.amazonaws.com
x-amz-account-id: AccountId
…
3. PutBucketTagging:
PUT /?tagging HTTP/1.1
Host: Bucket.s3-control.amazonaws.com
x-amz-account-id: AccountId
…
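The S3 operations above could be driven from a canary along the following lines. This is a hedged sketch, not a reference implementation: it assumes a boto3-style standard S3 client (for example `boto3.client("s3", region_name=region)`) rather than the raw `s3-control` requests shown above. The `run_s3_canary` function, the tag values, and the policy contents are hypothetical, and the final `delete_bucket` cleanup step is an addition that does not appear in the list above.

```python
import json
from typing import Any, Dict


def run_s3_canary(s3: Any, bucket: str, region: str) -> Dict[str, str]:
    """Exercise S3 control plane calls on a dummy bucket and report each step."""
    create_args: Dict[str, Any] = {"Bucket": bucket}
    if region != "us-east-1":  # us-east-1 does not take a LocationConstraint
        create_args["CreateBucketConfiguration"] = {"LocationConstraint": region}

    # Illustrative policy: deny any access that does not use TLS.
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "CanaryDenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }],
    }
    tagging = {"TagSet": [{"Key": "purpose", "Value": "canary"}]}

    steps = [
        ("CreateBucket", lambda: s3.create_bucket(**create_args)),
        ("PutBucketPolicy", lambda: s3.put_bucket_policy(
            Bucket=bucket, Policy=json.dumps(policy))),
        ("PutBucketTagging", lambda: s3.put_bucket_tagging(
            Bucket=bucket, Tagging=tagging)),
    ]

    results: Dict[str, str] = {}
    for name, call in steps:
        try:
            call()
            results[name] = "ok"
        except Exception as exc:  # stop at the first failing step
            results[name] = f"failed: {exc}"
            break

    # Cleanup (assumed extra step): remove the dummy bucket if it was created.
    if results.get("CreateBucket") == "ok":
        try:
            s3.delete_bucket(Bucket=bucket)
            results["DeleteBucket"] = "ok"
        except Exception as exc:
            results["DeleteBucket"] = f"failed: {exc}"
    return results
```

Passing the client in as a parameter keeps the function easy to test with a stub. A step that is missing from the result was never attempted because an earlier step failed.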
A canary probe for API Gateway actions may consist of the following control plane API operations performed on a dummy resource within a specific Region.
1. CreateRestApi:
Creates a new RestApi resource.
POST /restapis HTTP/1.1
Content-type: application/json
2. CreateDeployment:
POST /restapis/restapi_id/deployments HTTP/1.1
Content-type: application/json
3. DeleteDeployment:
DELETE /restapis/restapi_id/deployments/deployment_id HTTP/1.1
4. DeleteRestApi:
DELETE /restapis/restapi_id HTTP/1.1
Image Credit: AWS Architecture Blog: Doing Constant Work to Avoid Failures