Project and Environment Separation
Using the Hipster Shop as an example workload of eleven coordinating microservices that form a "service offering", there are a few different approaches for how to organize GCP projects and GKE clusters. Each has pros and cons from a security perspective.
Single GCP Project, Single GKE Cluster, Namespaces per Environment - A single GCP project named `my-hipster-shop` with a GKE cluster named `gke-hipster-shop` and three Kubernetes namespaces, one per environment.
- Least expensive
- Weakest "isolation" strategy
- Weakest "defense in depth" strategy
- Weakest "resource contention" strategy
Single GCP Project, Three GKE Clusters - A single GCP project named `my-hipster-shop` with three GKE clusters, such as `gke-hipster-shop-prod`. A single namespace named `hipster-shop` is in each cluster.
- Good "isolation" strategy
- Good "defense in depth" strategy
- Good "resource contention" strategy
- More complex
- More expensive
Three GCP Projects, Single GKE Cluster per Project - Three GCP projects, such as `my-hipster-shop-prod`. In each project, a GKE cluster named `gke-hipster-shop-<env>`, and a single namespace named `hipster-shop` in each cluster.
- Strongest "isolation" strategy
- Strongest "defense in depth" strategy
- Strongest "resource contention" strategy
- Most complex
- Most expensive
- Project Complexity - Although GCP projects are "free", this refers to the ongoing maintenance and operational overhead of managing and caring for GCP projects. Managing 5 projects vs 15 vs 150 has challenges that require more sophisticated tooling and processes to do well.
- Network Complexity - Having all GKE clusters on the same shared VPC network or in separate VPC networks will vary the amount of ongoing maintenance of IP CIDR allocation/consumption, Interconnect/Cloud VPN configuration complexity, and complexity of collecting network-related flow logs.
- Cluster Cost - While the control plane of GKE is free, the additional GKE worker nodes in three clusters vs one, plus the additional maintenance cost of managing (upgrading, securing, monitoring) more clusters, increases the overall cost.
- Isolation - Does the configuration leverage harder boundary mechanisms like GCP projects and separate GKE clusters on separate GCE instances, or does it rely on softer boundaries like Kubernetes namespaces?
- Defense in Depth - Should a security incident occur, which strategy serves to reduce the available attack surface by default, make lateral movement more difficult to perform successfully, and make malicious activity easier to identify vs normal activity?
- Resource Contention - Are workloads competing for resources in the same clusters or projects? Is it possible for a single workload to over-consume resources and starve the other pods? If the cluster autoscales to add more nodes, does that consume all available disk quota in the project and prevent other clusters from having available resources to autoscale if they need to grow?
- Standardize Early - Understand that you will be building tools and processes implicitly and explicitly around your approach, so choose the one that meets your requirements and be consistent across all your GKE deployments. Automation with infrastructure-as-code tools like Terraform for standardizing configuration and naming conventions of clusters is strongly encouraged.
- One Cluster per Project - While it is the most expensive approach to operate, it offers the best permissions isolation, defense-in-depth, and resource contention strategy by default. If you are serious about service offering and environment separation and have any compliance requirements, this is the best way to achieve those objectives.
- One Cluster per Service Offering - Running one cluster per colocated set of services with similar data gravity and redundancy needs is ideal for reducing the "blast radius" of an incident. It might seem convenient to place three or four smaller "production" services into a single GKE cluster, but consider the scope of an investigation should a compromise occur. All workloads in that cluster and all the data they touch would have to be "in scope" for the remediation efforts.
The previous section covers the "cluster per project" approaches, and this section attempts to guide you through the "workloads per cluster" decisions. Two important definitions to cover first are Hard and Soft tenancy as written by Jessie Frazelle:
- Soft multi-tenancy - multiple users within the same organization in the same cluster. Soft multi-tenancy could have possible bad actors such as people leaving the company, etc. Users are not thought to be actively malicious since they are within the same organization, but there is potential for accidents or "evil leaving employees." A large focus of soft multi-tenancy is to prevent accidents.
- Hard multi-tenancy - multiple users, from various places, in the same cluster. Hard multi-tenancy means that anyone on the cluster is thought to be potentially malicious and therefore should not have access to any other tenants' resources.
From experience, building and operating a cluster with a hard-tenancy use case in mind is very difficult. The tools and capabilities are improving in this area, but it requires extreme attention to detail, careful planning, 100% visibility into activity, and near-hyperactive monitoring. For these reasons, your journey with Kubernetes and GKE should first solve for the soft-tenancy use case. The understanding and lessons learned will overlap nearly 100% if you decide to go for hard-tenancy and will absolutely give you the proper frame of reference to decide if your organization can tackle the added challenges.
There is no such thing as a "single-tenant" cluster.
When it comes to workloads, there are always a minimum of two classes: "System" and "User" workloads. Workloads that are responsible for the operation of the cluster (CNI, log export, metrics export, etc.) should be isolated from workloads that run actual applications, and vice versa.
In Kubernetes, the default separation between these workload types is likely not sufficient for production needs. System components run on the same physical resources as user workloads, share a common administrative mechanism, share a common layer 3 network with no default access controls, and often run with higher privileges. Even in GKE, you will want to take steps to address these concerns.
- API Isolation - Using separate Kubernetes namespaces to segment workloads for different purposes when it comes to how those resources interact with the Kubernetes API only. Service accounts per namespace tied to granular RBAC policies are the primary approach.
- Network Isolation - Using Kubernetes namespaces as an anchor point, defining which pods are allowed to talk with each other explicitly via NetworkPolicy objects is the primary approach. For instance, preventing all ingress traffic from other namespaces with the exception of explicitly allowed sources.
- Privilege Isolation - Leveraging well-formed containers running as non-privileged users in combination with PodSecurityPolicies to prevent user workloads from being able to access sensitive or privileged resources on the worker node and undermine the security of all workloads.
- Resource Isolation - Using features like ResourceQuotas to cap overall cpu/memory/persistent disk resource consumption, resource requests and limits to ensure pods are given the resources they need without overcrowding other workloads, and separating security- or performance-sensitive workloads onto separate node pools.
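As a sketch of the network- and resource-isolation points above, a default-deny `NetworkPolicy` paired with a `ResourceQuota` in a single namespace might look like the following. The namespace name `hipster-shop-dev` and the quota values are illustrative assumptions, not values from this guide:

```yaml
# Deny all ingress traffic to pods in this namespace by default;
# additional NetworkPolicies can then allow specific flows explicitly.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: hipster-shop-dev   # illustrative namespace name
spec:
  podSelector: {}               # an empty selector matches all pods in the namespace
  policyTypes:
  - Ingress
---
# Cap the total resources this namespace can consume so one team's
# workloads cannot crowd out others sharing the cluster.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: hipster-shop-dev
spec:
  hard:
    requests.cpu: "8"           # illustrative values; size to your workloads
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
    persistentvolumeclaims: "5"
```

With both applied, pods in the namespace only receive traffic that later policies explicitly allow, and the API server rejects new pods whose aggregate requests would exceed the quota.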
Whether you are a single developer running a couple of microservices in one cluster or a large organization with many teams sharing large clusters, these concerns are important and should not be overlooked. The remainder of this guide will attempt to show you how to implement these features in combination and give you confidence in the decisions you make for your use case.
- Tenants per Cluster - From a security perspective, having a single tenant per cluster provides the highest degree of separation among tenants, but it is common to allow multiple workloads from different users/teams of similar trust levels to share a cluster for cost and operational efficiency.
- Multiple Untrusted Tenants in a Single Cluster - This approach is not generally recommended as the level of effort to sufficiently isolate workloads is high and the risk of a vulnerability or mistake leading to a tenant escape is much higher.
- Separate the System from the Workloads - No matter which approach is taken, you should take steps to properly isolate the services that control and manage your cluster from the workloads that operate in them. The system components can have permissions to GCP resources outside your cluster, and it's important that an incident with a user workload can't escape to the system workloads and then escape "outside" the cluster.
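One concrete way to keep system and user workloads apart is to schedule user workloads onto a dedicated, tainted node pool so that system components without the matching toleration never land on those nodes. This is a minimal sketch; the node pool name, taint key/value, and image are illustrative assumptions:

```yaml
# A user workload pinned to a dedicated "user" node pool. System pods
# that lack the toleration will not be scheduled onto these nodes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
spec:
  replicas: 2
  selector:
    matchLabels:
      app: frontend
  template:
    metadata:
      labels:
        app: frontend
    spec:
      nodeSelector:
        cloud.google.com/gke-nodepool: user-pool  # illustrative pool name
      tolerations:
      - key: dedicated        # illustrative taint, e.g. applied to the pool
        operator: Equal       # with a NoSchedule taint of dedicated=user
        value: user
        effect: NoSchedule
      containers:
      - name: frontend
        image: gcr.io/my-project/frontend:latest  # illustrative image
```

The `cloud.google.com/gke-nodepool` label is applied by GKE to every node, so the selector keeps the deployment off the default pool while the taint keeps everything else off the user pool.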
GCP projects have quotas or limits on how many resources are available for potential use. This doesn't mean that there is available capacity to fulfill the request, and on rare occasions, a particular zone might not be able to provide another GKE worker right away when your cluster workloads need more capacity and node-pool autoscaling kicks in.
GKE uses several GCE-related quotas like cpu, disk, and gpu of a particular type, but it also uses lesser-known quotas like "Number of secondary ranges per VPC". The resource-related quotas are per region and sometimes per zone, so it's important to monitor your quota usage to avoid quota-capped resource exhaustion scenarios from negatively affecting your application's performance during a scale-up event or from preventing certain types of upgrade scenarios.
- Monitor Quota Consumption - Smaller GKE clusters will most likely fit into the project quota defaults, but clusters of a dozen nodes or more may start to bump into the limits.
- Request Quota Increases Ahead of Time - Quota increases can take as much as 48 hrs to be approved, so it's best to plan ahead and ask early.
- Aim for 110% - If the single GKE cluster in a project uses 10 32-core nodes, the total cpu cores needed is 320 or more. To give enough headroom to perform a "blue/green" cluster upgrade if needed (bringing up an identical cluster in parallel), the project quota should be at least 640 cpu cores in that region to facilitate that approach. Following the 110% guideline, this would actually be more like 700 cpus. This allows for two full-sized clusters to be possible for a short duration while the workloads are migrated between them.
- GCP APIs have Rate Limits - If your application makes 100s of requests per second or more to say, GCS or GCR, you may run into rate limits designed to protect overuse of the GCP APIs from a single customer affecting all customers. If you run into these, you may or may not be able to get them increased. Consider working with GCP support and implementing a different approach with your application.
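The "Aim for 110%" arithmetic above can be sketched as a quick calculation. This is a minimal sketch; the function name and the fixed 2x blue/green factor are assumptions for illustration:

```python
def recommended_cpu_quota(nodes: int, cores_per_node: int,
                          headroom_percent: int = 110) -> int:
    """Regional CPU quota needed to support a blue/green cluster upgrade.

    Doubles the steady-state core count (two full-sized clusters exist
    briefly during the migration), then adds the headroom percentage.
    Integer arithmetic avoids floating-point rounding surprises.
    """
    steady_state = nodes * cores_per_node   # e.g. 10 * 32 = 320 cores
    blue_green = steady_state * 2           # both clusters up at once
    return blue_green * headroom_percent // 100

# 10 nodes of 32 cores -> 320 in use, 640 for blue/green, 704 with 110% headroom
print(recommended_cpu_quota(10, 32))  # 704
```

This matches the guide's "more like 700" figure for the 10-node, 32-core example.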