Cluster Lifecycle¶
Infrastructure as Code¶
Creating GKE clusters with gcloud or the UI console is sufficient for testing purposes, but production-ready deployments should be managed with purpose-built tooling that declaratively defines the desired state of the infrastructure in code form. Terraform handles this well, and it provides an easy-to-read example code base showing how the security-related features are configured:
provider "google-beta" { version = "2.13.0" project = "my-project-id" region = "us-central1" } resource "google_container_cluster" "my-cluster-name" { provider = "google-beta" name = "my-cluster-name" project = "my-project-id" // Configures a highly-available control plane spread across three // zones in this region location = "us-central1" // It's best to create these networks and subnets in terraform // and then reference them here. network = google_compute_network.network.self_link subnetwork = google_compute_subnetwork.subnetwork.self_link // Set this to the version desired. It will become the starting // point version for the cluster. Upgrades of the control plane // can be initiated by simply bumping this version and running // terraform apply. min_master_version = "1.13.7-gke19" // Specify the newer Kubernetes logging and monitoring features // of the Stackdriver integration. // Previously "logging.googleapis.com" logging_service = "logging.googleapis.com/kubernetes" monitoring_service = "monitoring.googleapis.com/kubernetes" // Do not use the default node pool that GKE provides and instead // use the node pool(s) define below explicitly. GKE will actually // provision a 1 node node pool and then remove it before making the // node pools below. remove_default_node_pool = true initial_node_count = 1 // This is false by default. RBAC is enabled by default and preferred. enable_legacy_abac = false // Enable the Binary Authorization admission controller to allow // this cluster to evaluate a BinAuthZ policy when pods are created // to ensure images come from the proper sources. enable_binary_authorization = true // Defines the kubelet setting for how many pods could potentially // run on this node. Depending on instance type, setting this to // 64 would be more appropriate and use half the IP space. default_max_pods_per_node = 110 // Enable transparent "application level" encryption of etcd secrets // using a KMS keyring/key. Can be a software or HSM-backed KMS key // for extra compliance requirements. database_encryption { key_name = "my-existing-kms-key" state = "ENCRYPTED" } addons_config { // Do not deploy the in-cluster K8s dashboard and defer to kubectl // and the GCP UI console. kubernetes_dashboard { disabled = true } // Enable network policy (Calico) as an addon. network_policy_config { disabled = false } // Provide the ability to scale pod replicas based on real-time metrics horizontal_pod_autoscaling { disabled = false } } // Enable intranode visibility to expose pod-to-pod traffic to the VPC // for flow logging potential. Requires enabling VPC Flow Logging // on the subnet first enable_intranode_visibility = true // Enables the PSP admission controller in the cluster. DO NOT enable this // on an existing cluster without first configuring the necessary pod security // policies and RBAC role bindings or you will inhibit pods from running // until those are correctly configured. Recommend success in a test env first. pod_security_policy_config { enabled = true } // Enable the VPA addon in the cluster to track actual usage vs requests/limits. // Safe to enable at any time. vertical_pod_autoscaling { enabled = true } // Configure the workload identity "identity namespace". Requires additional // configuration on the node pool for workload identity to function. workload_identity_config { identity_namespace = "my-project-id.svc.goog.id" } // Disable basic authentication and cert-based authentication. 
// Empty fields for username and password are how to "disable" the // credentials from being generated. master_auth { username = "" password = "" client_certificate_config { issue_client_certificate = false } } // Enable network policy configurations via Calico. Must be configured with // the block in the addons section. network_policy { enabled = true } // The Google Security group that contains the "allowed list" of other Google // Security groups that can be referenced via in-cluster RBAC bindings instead // of having to specify Users one by one. // e.g. // gke-security-groups // - groupA // - groupB // And now, an RBAC RoleBinding/ClusterRoleBinding can reference Group: groupA authenticator_groups_config { security_group = "gke-security-groups@mydomain.com" } // Give GKE a 4 hour window each day in which to perform maintenance operations // and required security patches. In UTC. maintenance_policy { daily_maintenance_window { start_time = "TODO" } } // Use VPC Aliasing to improve performance and reduce network hops between nodes and load balancers. References the secondary ranges specified in the VPC subnet. ip_allocation_policy { use_ip_aliases = true cluster_secondary_range_name = google_compute_subnetwork.subnetwork.secondary_ip_range.0.range_name services_secondary_range_name = google_compute_subnetwork.subnetwork.secondary_ip_range.1.range_name } // Specify the list of CIDRs which can access the master's API. This can be // a list of up to 50 CIDRs. It's basically a control plane firewall rulebase. // Nodes automatically/always have access, so these are for users and automation // systems. master_authorized_networks_config { cidr_blocks { cidr_block = "10.0.0.0/8" display_name = "VPC Subnet" } cidr_blocks { cidr_block = "192.168.0.0/24" display_name = "Admin Subnet" } } // Configure the cluster to have private nodes and private control plane access only private_cluster_config { // Enables the control plane to not be exposed via public IP and which subnet to // use an IP from. enable_private_endpoint = true master_ipv4_cidr_block = "172.20.20.0/28" // Specifies that all nodes in all node pools for this cluster should not have a // public IP automatically assigned. enable_private_nodes = true } } resource "google_container_node_pool" "my-node-pool" { provider = "google-beta" name = "my-node-pool" // Spread nodes in this node pool evenly across the three zones in this region. location = "us-central1" cluster = google_container_cluster.my-cluster-name.name // Must be at or below the control plane version. Bump this field to trigger a // rolling node pool upgrade. version = "1.13.7-gke19" // Because this is a regional cluster and a regional node pool, this is the // number of nodes per-zone to create. 1 will create 3 total nodes. node_count = 1 // Overrides the cluster setting on a per-node-pool basis. max_pods_per_node = 110 // The min and max number of nodes (per-zone) to scale to. This defines a three // to 30 node cluster. autoscaling { min_node_count = 1 max_node_count = 10 } // Fix broken nodes automatically and keep them updated with the control plane. management { auto_repair = "true" auto_upgrade = "true" } node_config { machine_type = "n1-standard-1" // pd-standard is often too slow, so using pd-ssd's is recommended for pods // that do any scratch disk operations. disk_type = "pd-ssd" // COS or COS_containerd are ideal here. Ubuntu if specific kernel features or // disk drivers are necessary. image_type = "COS" // Use a custom service account for this node pool. 
Be sure to grant it // a minimal amount of IAM roles and not Project Editor like the default SA. service_account = "dedicated-sa-name@my-project-id.iam.gserviceaccount.com" // Use the default/minimal oauth scopes to help restrict the permissions to // only those needed for GCR and stackdriver logging/monitoring/tracing needs. oauth_scopes = [ "https://www.googleapis.com/auth/devstorage.read_only", "https://www.googleapis.com/auth/logging.write", "https://www.googleapis.com/auth/monitoring", "https://www.googleapis.com/auth/servicecontrol", "https://www.googleapis.com/auth/service.management.readonly", "https://www.googleapis.com/auth/trace.append" ] // Enable GKE Sandbox (Gvisor) on this node pool. This cannot be set on the // only node pool in the cluster as system workloads are not all compatible // with the restrictions and protections offered by gvisor. sandbox_config { sandbox_type = "gvisor" } // Protect node metadata and enable Workload Identity // for this node pool. "SECURE" just protects the metadata. // "EXPOSE" or not set allows for cluster takeover. // "GKE_METADATA_SERVER" specifies that each pod's requests to the metadata // API for credentials should be intercepted and given the specific // credentials for that pod only and not the node's. workload_metadata_config { node_metadata = "GKE_METADATA_SERVER" } metadata = { // Set metadata on the VM to supply more entropy google-compute-enable-virtio-rng = "true" // Explicitly remove GCE legacy metadata API endpoint to prevent most SSRF // bugs in apps running on pods inside the cluster from giving attackers // a path to pull the GKE metadata/bootstrapping credentials. disable-legacy-endpoints = "true" } } }
Upgrades and Security Fixes¶
Control Plane Upgrades¶
One of the most important problems that GKE helps solve is maintaining the control plane components for you. Using gcloud or the UI console, upgrading the control plane version is a single API call. There are a few important concepts to understand with this feature:
- The version of the control plane can and should be kept updated.
  - GKE supports the current "Generally Available" (GA) GKE version back to two older minor revisions. For instance, if the GA version is `1.13.7-gke.19`, then `1.11.x` and `1.12.x` are supported.
  - GKE versions tend to trail Kubernetes OSS slightly.
  - Kubernetes OSS releases a minor revision approximately every 90-120 days, and GKE is quick to offer "Alpha" releases with those newest versions, but it may take some time for them to graduate to "GA" in GKE.
- Control plane upgrades of zonal clusters incur a few minutes of downtime.
  - Upgrading to a newer GKE version causes each control plane instance to be recreated in-place. On zonal clusters, there are a few minutes where that single instance is unavailable while it is being recreated. The data in `etcd` is kept as-is and all workloads running on the node pools remain in place, but access to the API with `kubectl` is temporarily halted.
  - Regional clusters take a "rolling upgrade" approach to upgrading the three control plane instances, upgrading one instance at a time. This leaves the control plane available on the other instances and should incur no noticeable downtime.
For the sake of a highly-available control plane, and because it comes at no additional cost, it's recommended to run regional clusters. If cross-zone traffic costs are a concern, use the node locations setting to place nodes in a single zone while still keeping the regional control plane.
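As a minimal sketch of the upgrade workflow (a standalone example rather than the full hardened cluster above, with a placeholder version string), pinning the control plane version in code means an upgrade is just a one-line change followed by `terraform apply`:

```hcl
// Minimal sketch: the control plane version is pinned in code, so an upgrade
// is a one-line change. Name and location mirror the hypothetical values used
// earlier; the version string is a placeholder for whatever GKE currently offers.
resource "google_container_cluster" "upgrade-example" {
  name               = "my-cluster-name"
  location           = "us-central1" // regional: the upgrade rolls one control plane instance at a time
  initial_node_count = 1

  // Bump this value and run `terraform apply` to initiate a control plane
  // upgrade. Zonal clusters see a few minutes of API unavailability;
  // regional clusters should see none.
  min_master_version = "1.13.7-gke.19"
}
```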
Node Pool Upgrades¶
Upgrades of node pools are handled per node pool, which gives the operator a lot of control over the behavior. If auto-upgrades are enabled, the node pools are scheduled for "rolling upgrade" style upgrades during the configured maintenance window. While it's possible for node pools to trail the control plane version by two minor revisions, it's best not to trail by more than one.
As node pools are where workloads are running, a few points are important to understand:
- Node Pools are resource boundaries.
  - If certain workloads are resource-intensive or prone to over-consuming resources in large bursts, it may make sense to dedicate a node pool to them to reduce the potential for negatively affecting the performance and availability of other workloads.
- Each Node Pool adds operational maintenance overhead.
  - Quite simply, each node pool adds more work for the operations team: either ensuring it is automatically upgraded, or performing the upgrades manually if auto-upgrades are disabled.
- Separating workloads onto different Node Pools can help support security goals.
  - While not a perfect solution, ensuring that workloads handling certain sensitive data are scheduled on separate node pools (using node taints and tolerations on workloads) can help reduce the chance of compromising secrets in a container "escape-to-the-host" situation. For example, placing PCI-related workloads on a separate node pool from other non-PCI workloads means that a compromise of a non-PCI pod that allows access to the underlying host will be less likely to have access to secrets that the PCI workloads are using. A sketch of such a dedicated node pool follows this list.
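As a rough sketch of the node pool separation pattern described above (the pool name, label, and taint key/value are hypothetical examples, not from the original), a dedicated node pool can carry a taint so that only workloads with a matching toleration schedule onto it:

```hcl
// Minimal sketch: a dedicated node pool for sensitive (e.g. PCI) workloads,
// attached to the cluster defined earlier. Label and taint values are examples.
resource "google_container_node_pool" "pci-pool" {
  provider   = "google-beta"
  name       = "pci-pool"
  location   = "us-central1"
  cluster    = google_container_cluster.my-cluster-name.name
  node_count = 1

  node_config {
    machine_type = "n1-standard-2"

    // Label for targeting these nodes via nodeSelector/affinity.
    labels = {
      workload-class = "pci"
    }

    // Keep all other pods off these nodes unless they declare a matching
    // toleration in their pod spec.
    taint {
      key    = "workload-class"
      value  = "pci"
      effect = "NO_SCHEDULE"
    }
  }
}
```

PCI workloads would then declare both a toleration for the `workload-class=pci:NoSchedule` taint and a nodeSelector on the `workload-class: pci` label so they land only on this pool.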
Security Bulletins¶
Another benefit of a managed service like GKE is the reduced operational burden of having to triage, test, and manage upgrades driven by security issues. The GKE Security Bulletins website and its Atom/RSS feed provide a consolidated view of all things security-related in GKE. It is strongly recommended that teams subscribe to and receive automatic updates from the feed, as the information is timely and typically very clear about which actions the GKE service is taking and which actions are left to you.
Resources¶
- GKE Cluster Upgrades
- GKE Node Pool Upgrades
- Automatic Node Upgrades
- GKE Security Bulletins
- GKE Release Notes
Scaling¶
While scaling isn't a purely security-related topic, there are a few security and resource availability concerns with each of the common scaling patterns:
- Node Sizing
  - The resources available to GKE worker nodes are shared by all pods scheduled on them. While there are solid limits on CPU and RAM usage, there are not yet strong resource isolation mechanisms for disk usage that lands on the node. It is possible for a single pod to consume enough disk space and/or inodes to disrupt the health of the node.
  - Depending on the workload profile you plan to run in the cluster, it might make sense to have more nodes of a smaller instance type with the faster `pd-ssd` disk type than fewer large instances sharing the slower `pd-standard` disk type, for example. When workload resource profiles drastically differ, it might make sense to use separate node pools or even separate clusters.
- Horizontal Pod Autoscaler
  - The built-in capability of scaling replicas in a deployment based on CPU usage may or may not be granular enough for your needs. Ensure that the total capacity of the cluster (the capacity of the maximum number of nodes allowed by autoscaling) is greater than the resources used by the maximum number of replicas in the deployment. Also, consider using "external" metrics from Stackdriver, such as the number of tasks in a Pub/Sub queue or web requests per pod, as a more accurate and efficient basis for scaling your workloads. A minimal HPA sketch follows this list.
- Vertical Pod Autoscaler
  - Horizontal Pod Autoscaling adds more replicas as needed, but that assumes the pod has properly set resource requests and limits for CPU and RAM. Each pod's requests should be set to the CPU and RAM it uses at idle plus 5-10%, and its limits should be the maximum CPU and RAM the pod will ever use.
  - Having accurate requests and limits set for every pod gives the scheduler the correct information for how to place workloads, increases the efficiency of resource usage, and reduces the risk of "noisy neighbor" issues from pods that consume more than their set share of resources.
  - VPA runs inside the cluster and monitors workloads, comparing their configured requests and limits with actual usage. It can either "audit" (just make recommendations) or actually modify the deployments dynamically. It's recommended to run VPA in a cluster in "audit" mode to help find misconfigured pods before they cause cluster-wide outages; see the VPA sketch after this list.
- Node Autoprovisioner
  - The GKE cluster autoscaler adds or removes nodes from existing node pools based on their available capacity. If a deployment needs more replicas, there are not enough running nodes to handle them, the node pool is configured to autoscale, and the maximum node count has not yet been reached, the autoscaler will add nodes until the workloads can be scheduled.
  - But what if a pod asks for resources that a new node can't provide? For example, a pod that asks for 16 CPU cores when the nodes are only 8-core instances, or a pod that needs a GPU but the node pool doesn't attach GPUs? The node auto-provisioner can be configured to dynamically manage node pools for you to satisfy these situations. Instead of waiting for an administrator to add a new node pool, the cluster autoscaler can do that for you, as shown in the node auto-provisioning sketch after this list.
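For the Horizontal Pod Autoscaler, here is a minimal CPU-based sketch using the Terraform kubernetes provider (the deployment name, namespace, and thresholds are illustrative; scaling on external Stackdriver metrics would require the `autoscaling/v2beta2` HPA API instead):

```hcl
// Minimal sketch: CPU-based HPA for a hypothetical "my-app" Deployment.
// Assumes the kubernetes provider is configured against this cluster.
resource "kubernetes_horizontal_pod_autoscaler" "my-app" {
  metadata {
    name      = "my-app"
    namespace = "default"
  }

  spec {
    min_replicas = 2
    max_replicas = 20

    scale_target_ref {
      api_version = "apps/v1"
      kind        = "Deployment"
      name        = "my-app"
    }

    // Add replicas when average CPU utilization across the pods exceeds 70%.
    target_cpu_utilization_percentage = 70
  }
}
```

Double-check that the maximum replica count multiplied by the pod's resource requests still fits within the node pool autoscaler's maximum node count.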
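For VPA in "audit" mode, a sketch of a recommendation-only VerticalPodAutoscaler object, expressed here via the `kubernetes_manifest` resource (this assumes a newer kubernetes provider than the rest of the example and reuses the hypothetical "my-app" Deployment):

```hcl
// Minimal sketch: a recommendation-only VPA object. updateMode "Off" means
// VPA only records recommendations and never evicts or modifies pods.
resource "kubernetes_manifest" "my-app-vpa" {
  manifest = {
    apiVersion = "autoscaling.k8s.io/v1"
    kind       = "VerticalPodAutoscaler"
    metadata = {
      name      = "my-app"
      namespace = "default"
    }
    spec = {
      targetRef = {
        apiVersion = "apps/v1"
        kind       = "Deployment"
        name       = "my-app"
      }
      updatePolicy = {
        updateMode = "Off"
      }
    }
  }
}
```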
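And for node auto-provisioning, a minimal standalone sketch of the `cluster_autoscaling` block on a cluster resource (the CPU and memory bounds are arbitrary examples, not recommendations):

```hcl
// Minimal sketch: node auto-provisioning lets GKE create and delete node
// pools to fit pods that existing pools cannot satisfy (e.g. larger CPU
// requests or GPU requirements). The bounds below are illustrative only.
resource "google_container_cluster" "nap-example" {
  provider           = "google-beta"
  name               = "my-cluster-name"
  location           = "us-central1"
  initial_node_count = 1

  cluster_autoscaling {
    enabled = true

    // Cluster-wide resource bounds the auto-provisioner may scale within.
    resource_limits {
      resource_type = "cpu"
      minimum       = 4
      maximum       = 128
    }

    resource_limits {
      resource_type = "memory"
      minimum       = 16
      maximum       = 512
    }
  }
}
```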