Project Organization
Project and Environment Separation
Using the Hipster Shop as an example workload of eleven coordinating microservices that form a "service offering", there are a few different approaches for how to organize GCP projects and GKE clusters. Each has pros and cons from a security perspective.
- Single GCP Project, Single GKE Cluster, Namespaces per Environment - A single `project` named `my-hipster-shop` with a GKE `cluster` named `gke-hipster-shop` and three Kubernetes `namespaces`: `hipster-shop-dev`, `hipster-shop-test`, and `hipster-shop-prod` (see the sketch after this list).
    - Pros
        - Simplest `project` strategy
        - Simplest `network` strategy
        - Least expensive `cluster` strategy
    - Cons
        - Weakest "isolation" strategy
        - Weakest "defense in depth" strategy
        - Weakest "resource contention" strategy
- Single GCP Project, Three GKE Clusters - A single `project` named `my-hipster-shop` with three GKE `clusters` named `gke-hipster-shop-dev`, `gke-hipster-shop-test`, and `gke-hipster-shop-prod`. A single `namespace` named `hipster-shop` is in each `cluster`.
    - Pros
        - Good "isolation" strategy
        - Good "defense in depth" strategy
        - Good "resource contention" strategy
    - Cons
        - Most complex `project` strategy
        - Simple `network` strategy
        - Most expensive `cluster` strategy
- Three GCP Projects, Single GKE Cluster per Project - Three `projects` named `my-hipster-shop-dev`, `my-hipster-shop-test`, and `my-hipster-shop-prod`. In each `project`, a GKE `cluster` named `gke-hipster-shop-<env>`, and a single `namespace` named `hipster-shop` in each `cluster`.
    - Pros
        - Strongest "isolation" strategy
        - Strongest "defense in depth" strategy
        - Strongest "resource contention" strategy
    - Cons
        - Most complex `project` strategy
        - Most complex `network` strategy
        - Most expensive `cluster` strategy
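For the namespace-per-environment approach, the environment boundary is nothing more than three Kubernetes `namespace` objects in one `cluster`. A minimal sketch is shown below, using the `hipster-shop-*` names from the example; the `environment` labels are illustrative assumptions, not something GKE requires.

```yaml
# Sketch: one namespace per environment inside the single gke-hipster-shop cluster.
# Namespace names come from the example above; the labels are illustrative only.
apiVersion: v1
kind: Namespace
metadata:
  name: hipster-shop-dev
  labels:
    environment: dev
---
apiVersion: v1
kind: Namespace
metadata:
  name: hipster-shop-test
  labels:
    environment: test
---
apiVersion: v1
kind: Namespace
metadata:
  name: hipster-shop-prod
  labels:
    environment: prod
```

Because a `namespace` is a soft boundary, the weaker isolation and defense-in-depth trade-offs listed above still apply; RBAC, `NetworkPolicy`, and quota controls (covered later) are what give these `namespaces` any real separation.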
Strategy Descriptions
- Project Complexity - Although GCP `projects` are "free", this refers to the ongoing maintenance and operational overhead of managing and caring for GCP `projects`. Managing 5 `projects` vs 15 vs 150 presents challenges that require more sophisticated tooling and processes to do well.
- Network Complexity - Having all GKE `clusters` on the same shared VPC `network` or in separate VPC `networks` changes the amount of ongoing maintenance of IP CIDR allocation/consumption, the complexity of Interconnect/Cloud VPN configuration, and the complexity of collecting network-related flow logs.
- Cluster Cost - While the control plane of GKE is free, the additional GKE worker nodes in three `clusters` vs one, plus the additional maintenance cost of managing (upgrading, securing, monitoring) more `clusters`, increases the overall cost.
- Isolation - Does the configuration leverage harder boundary mechanisms like GCP `projects` and separate GKE `clusters` on separate GCE `instances`, or does it rely on softer boundaries like Kubernetes `namespaces`?
- Defense in Depth - Should a security incident occur, which strategy serves to reduce the available attack surface by default, make lateral movement more difficult to perform successfully, and make malicious activity easier to distinguish from normal activity?
- Resource Contention - Are workloads competing for resources on the same `nodes`, `clusters`, or `projects`? Is it possible for a single workload to over-consume resources such that the `nodes` evict `pods`? If the `cluster` autoscales to add more `nodes`, does that consume all available `cpu`/`memory`/`disk` quota in the `project` and prevent other `clusters` from autoscaling when they need to grow? (See the `requests`/`limits` sketch after this list.)
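The first defense against node-level contention is declaring per-container resource `requests` and `limits`, so the scheduler can place `pods` without overcommitting `nodes` and the kubelet knows which workloads are exceeding their share under pressure. The sketch below is illustrative only; the Deployment name, image, and values are assumptions, not recommendations.

```yaml
# Sketch: declare what a workload needs (requests) and the most it may
# consume (limits) so a single pod cannot starve its neighbors.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend                      # hypothetical workload name
  namespace: hipster-shop-dev
spec:
  replicas: 2
  selector:
    matchLabels:
      app: frontend
  template:
    metadata:
      labels:
        app: frontend
    spec:
      containers:
        - name: frontend
          image: gcr.io/example-project/frontend:v1   # placeholder image
          resources:
            requests:
              cpu: 100m               # capacity reserved for scheduling
              memory: 128Mi
            limits:
              cpu: 500m               # CPU is throttled above this
              memory: 256Mi           # exceeding this can get the container OOM-killed
```

Namespace-wide caps via `ResourceQuota` objects build on these per-`pod` settings and are covered below.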
Best Practices
- Standardize Early - Understand that you will be building tools and processes implicitly and explicitly around your approach, so choose the one that meets your requirements and be consistent across all your GKE deployments. Automation with infrastructure-as-code tools like Terraform for standardizing the configuration and naming conventions of `projects` and `clusters` is strongly encouraged.
- One Cluster per Project - While it is the most expensive approach to operate, it offers the best permissions isolation, defense-in-depth, and resource contention strategy by default. If you are serious about service offering and environment separation and have any compliance requirements, this is the best way to achieve those objectives.
- One Cluster per Service Offering - Running one `cluster` per colocated set of services with similar data gravity and redundancy needs is ideal for reducing the "blast radius" of an incident. It might seem convenient to place three or four smaller "production" services into a single GKE `cluster`, but consider the scope of an investigation should a compromise occur: all workloads in that `cluster` and all the data they touch would have to be "in scope" for the remediation efforts.
Resources
Separating Tenants
The previous section covers the "cluster per project" approaches, and this section attempts to guide you through the "workloads per cluster" decisions. Two important definitions to cover first are Hard and Soft tenancy as written by Jessie Frazelle:
- Soft multi-tenancy - multiple users within the same organization in the same `cluster`. Soft multi-tenancy could have possible bad actors such as people leaving the company, etc. Users are not thought to be actively malicious since they are within the same organization, but there is potential for accidents or "evil leaving employees." A large focus of soft multi-tenancy is to prevent accidents.
- Hard multi-tenancy - multiple users, from various places, in the same `cluster`. Hard multi-tenancy means that anyone on the `cluster` is thought to be potentially malicious and therefore should not have access to any other tenant's resources.
From experience, building and operating a cluster with a hard-tenancy use case in mind is very difficult. The tools and capabilities are improving in this area, but it requires extreme attention to detail, careful planning, 100% visibility into activity, and near-hyperactive monitoring. For these reasons, your journey with Kubernetes and GKE should first solve for the soft-tenancy use case. The understanding and lessons learned will overlap nearly 100% if you decide to go for hard-tenancy, and they will absolutely give you the proper frame of reference to decide if your organization can tackle the added challenges.
There is no such thing as a "single-tenant" cluster
When it comes to workloads, there are always a minimum of two classes: "System" and "User" workloads. Workloads that are responsible for the operation of the cluster (CNI, log export, metrics export, etc.) should be isolated from workloads that run actual applications, and vice versa.
In Kubernetes, the default separation between these workload types is likely not sufficient for production needs. System components run on the same physical resources as user workloads, share a common administrative mechanism, share a common layer 3 network with no default access controls, and often run with higher privileges. Even in GKE, you will want to take steps to address these concerns.
- API Isolation - Using separate Kubernetes `namespaces` to segment workloads by purpose as far as their interaction with the Kubernetes API is concerned. Service accounts per `namespace` tied to granular RBAC policies are the primary approach.
- Network Isolation - Using Kubernetes `namespaces` as an anchor point, explicitly defining which `pods` are allowed to talk with each other via `NetworkPolicy` objects is the primary approach. For instance, preventing all ingress traffic from non-`kube-system` `namespaces` with the exception of `udp/53` for `kube-dns` (see the sketch after this list).
- Privilege Isolation - Leveraging well-formed containers running as non-privileged users in combination with `PodSecurityPolicies` to prevent user workloads from accessing sensitive or privileged resources on the worker node and undermining the security of all workloads.
- Resource Isolation - Using features like `ResourceQuotas` to cap overall cpu/memory/persistent disk resource consumption, resource `requests` and `limits` to ensure `pods` are given the resources they need without overcrowding other workloads, and separating security- or performance-sensitive workloads onto separate `Node Pools`.
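The following is a minimal sketch of the network and resource isolation approaches described above, applied to the `hipster-shop-dev` namespace from earlier. The policy names, quota values, and the assumption that DNS is the only egress these pods need are all illustrative, not recommendations.

```yaml
# Sketch: deny all ingress to pods in hipster-shop-dev by default.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: hipster-shop-dev
spec:
  podSelector: {}        # selects every pod in the namespace
  policyTypes:
    - Ingress            # no ingress rules defined, so all ingress is denied
---
# Sketch: allow egress only to kube-dns on udp/53. Because this policy lists
# the Egress type, all other egress from these pods is denied until additional
# allow rules are added for legitimate application traffic.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: hipster-shop-dev
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              # set automatically on recent Kubernetes versions; older clusters
              # need to label kube-system themselves
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
---
# Sketch: cap total resource consumption for the namespace as a whole.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: hipster-shop-dev-quota
  namespace: hipster-shop-dev
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
    requests.storage: 100Gi
```

Note that `NetworkPolicy` objects are only enforced when network policy enforcement is enabled on the cluster, and once a compute `ResourceQuota` is in place, the API server rejects `pods` that don't declare the corresponding `requests` and `limits`, which is another reason to set them on every workload.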
Whether you are a single developer running a couple microservices in one cluster or a large organization with many teams sharing large clusters, these concerns are important and should not be overlooked. The remainder of this guide will attempt to show you how to implement these features in combination and give you confidence in the decisions you make for your use case.
Best Practices
- Tenants per Cluster - From a security perspective, having a single tenant per `cluster` provides the highest degree of separation among tenants, but it is common to allow multiple workloads from different users/teams of similar trust levels to share a `cluster` for cost and operational efficiency.
- Multiple Untrusted Tenants in a Single Cluster - This approach is not generally recommended, as the level of effort to sufficiently isolate workloads is high and the risk of a vulnerability or mistake leading to a tenant escape is much higher.
- Separate the System from the Workloads - No matter which approach is taken, you should take steps to properly isolate the `pods` and `services` that control and manage your `cluster` from the workloads that operate in them (see the sketch after this list). The system components can have permissions to GCP resources outside your `cluster`, and it's important that an incident with a user workload can't escape to the system workloads and then escape "outside" the `cluster`.
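One common, partial step toward that separation is to run user workloads on their own node pool so they don't share `nodes` with the cluster add-ons running in the default pool. The sketch below assumes a node pool named `user-workloads` that was created with a `dedicated=user-workloads:NoSchedule` taint; the workload name, image, and resource values are placeholders. (GKE applies the `cloud.google.com/gke-nodepool` label to nodes automatically.)

```yaml
# Sketch: pin a user workload to a dedicated "user-workloads" node pool
# and tolerate the taint applied to that pool at creation time.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkoutservice              # hypothetical Hipster Shop service
  namespace: hipster-shop
spec:
  replicas: 1
  selector:
    matchLabels:
      app: checkoutservice
  template:
    metadata:
      labels:
        app: checkoutservice
    spec:
      nodeSelector:
        cloud.google.com/gke-nodepool: user-workloads   # only schedule onto the user pool
      tolerations:
        - key: dedicated
          operator: Equal
          value: user-workloads
          effect: NoSchedule         # allows scheduling despite the pool's taint
      containers:
        - name: server
          image: gcr.io/example-project/checkoutservice:v1   # placeholder image
          resources:
            requests:
              cpu: 100m
              memory: 64Mi
            limits:
              cpu: 200m
              memory: 128Mi
```

Keep in mind that `DaemonSets` for logging, monitoring, and networking still run on every `node`, so node pool separation reduces, but does not eliminate, the sharing between system and user workloads.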
Resources
Project Quotas
GCP projects have quotas, or limits, on how many resources are available for potential use. Having quota available doesn't guarantee there is capacity to fulfill a request at that moment, and on rare occasions a particular zone might not be able to provide another GKE worker right away when your cluster workloads need more capacity and node-pool autoscaling kicks in.
GKE uses several GCE-related quotas like cpu, memory, disk, and gpu of a particular type, but it also uses lesser-known quotas like "Number of secondary ranges per VPC". The resource-related quotas are per region and sometimes per zone, so it's important to monitor your quota usage to keep quota-capped resource exhaustion from degrading your application's performance during a scale-up event or from blocking certain types of upgrade scenarios.
Best Practices
- Monitor Quota Consumption - Smaller GKE `clusters` will most likely fit into the `project` quota defaults, but `clusters` of a dozen nodes or more may start to bump into the limits.
- Request Quota Increases Ahead of Time - Quota increases can take as much as 48 hours to be approved, so it's best to plan ahead and ask early.
- Aim for 110% - If the single GKE `cluster` in a project uses 10 32-core `nodes`, the total `cpu` cores needed is 320 or more. To give enough headroom to perform a "blue/green" `cluster` upgrade if needed (bringing up an identical `cluster` in parallel), the `project` quota should be at least 640 `cpu` cores in that region. Following the 110% guideline, this would actually be more like 700 `cpus`. This allows two full-sized `clusters` to run for a short duration while the workloads are migrated between them.
- GCP APIs have Rate Limits - If your application makes 100s of requests per second or more to, say, GCS or GCR, you may run into rate limits designed to keep overuse of the GCP APIs by a single customer from affecting all customers. If you run into these, you may or may not be able to get them increased, so consider working with GCP support and implementing a different approach in your application.