Study Guide: GCP - Data Engineer Certification Complete Study Guide
Source: https://www.fatskills.com/law/chapter/gcp-data-engineer-certification-complete-study-guide

By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.

Hosting a static site
- Create a cloud storage bucket that uses a domain name. Ex. Bucket reports.example.com for hosting http://reports.example.com.
- Domain name ownership verification is required by adding TXT records, HTML tags in header, etc.
- Define:
>>> MainPageSuffix = "index.html" or "main.html", etc
>>> NotFoundPage = "404.html"
- Copy content directly to the bucket, OR
- Store it in GitHub and use a webhook to run an update script, OR
- Use CI/CD tools like Jenkins with the Cloud Storage plugin for post-build steps.

Best option if the content is static and also if users can upload content like files, photos, videos, etc.
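
As a rough sketch of the two settings defined above, the snippet below builds the `website` section of a bucket resource in the shape used by the Cloud Storage JSON API. The helper name `website_config` and the default file names are illustrative assumptions, and nothing is sent to Cloud Storage.

```python
import json


def website_config(main_page_suffix="index.html", not_found_page="404.html"):
    """Return the 'website' section of a bucket resource as a dict."""
    return {
        "website": {
            "mainPageSuffix": main_page_suffix,  # served for directory-style URLs
            "notFoundPage": not_found_page,      # served when an object is missing
        }
    }


if __name__ == "__main__":
    print(json.dumps(website_config(), indent=2))
```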

Google Compute Engine (IaaS)
- You need complete control over your infrastructure and direct access.
- You want to tune the hardware and squeeze out the last drop of performance.
- Control load balancing, scaling yourself.
- Configuration, administration and management - all on you.
- No need to buy the machines, install OS, etc
- Have custom made applications that can't be containerized.

Hosting with Google Compute Engine
- Google cloud launcher.
- Choose machine types, disk sizes before deployment.
- Get accurate cost estimates before deployment.
- Can customise configuration, rename instances, etc
- After deployment, have full control of the VM instances.

Storage options:
Cloud Storage Buckets, Standard persistent disks, SSD (solid state disks), Local SSD

Storage Technologies:
Cloud SQL (MySQL, PostgreSQL), NoSQL, GCP NoSQL tools - Bigtable, Datastore

Load Balancing:
- Network load balancing: forwarding rules based on address, port, protocol
- HTTP load balancing: inspects request content and cookies, e.g. to send certain clients to one server.

StackDriver for logging and monitoring

Hosting with Google Container Engine
Can use prebuilt container images for hosting specific websites like wordpress, etc.
- Need for DevOps is largely mitigated
- Can use Jenkins for CI/CD
- StackDriver for logging and monitoring

Mix-and-Match Use Cases
Hybrid Use Cases:
- Use App Engine for the front end serving layer, while running Redis in Compute Engine.
- Use Container Engine for a rendering micro-service that uses Compute Engine VMs running Windows to do the actual frame rendering.
- Use App Engine for your web front end, Cloud SQL as your database, and Container Engine for your big data processing.

Container
A container image is a lightweight, stand-alone, executable package of a piece of software that includes everything needed to run it: code, runtime, system tools, system libraries, settings

Container vs VM: Layers
Containers:
- App + Bins/Libs - per container.
- Docker Runtime (virtualize the OS)
- Host OS
- Infrastructure

Virtual Machines:
- App + Bins/Libs + Guest OS - per VM.
- Hypervisor (virtualize the hardware)
- Infrastructure

Kubernetes Container Cluster
- Kubernetes Master
- Kubernetes Node agent/client - Kubelet
- Pods - consists of a set of containers.

Containers and VMs
Containers:
- Virtualise the Operating System
- More portable
- Quick to boot
- Size - tens of MBs

Virtual Machines:
- Virtualise hardware
- Less portable
- Slow to boot
- Size - tens of GBs

Google App Engine
Fully managed serverless application platform
- Just write the code - leave the rest to platform.
- Open & familiar languages and tools
- Pay only for what you use
- Comparable PaaS examples: Heroku, Engine Yard - support many languages

You can run your applications in App Engine using the flexible environment or standard environment. You can also choose to simultaneously use both environments for your application and allow your services to take advantage of each environment's individual benefits.

Google App Engine: Flexible environment
Using the App Engine flexible environment means that your application instances run within Docker containers on Google Compute Engine virtual machines (VMs).

Optimal for applications with the following characteristics:
- Source code that is written in a version of any of the supported programming languages:
Python, Java, Node.js, Go, Ruby, PHP, or .NET
- Runs in a Docker container that includes a custom runtime or source code written in other programming languages.
- Depends on other software, including operating system packages such as imagemagick, ffmpeg, libgit2, or others through apt-get.
- Uses or depends on frameworks that include native code.
- Accesses the resources or services of your Cloud Platform project that reside in the Compute Engine network.

Google App Engine: Standard environment
Using the App Engine standard environment means that your application instances run in a sandbox, using the runtime environment of a supported language.

Optimal for applications with the following characteristics:
- Source code is written in specific versions of the supported programming languages:
>> Python 2.7
>> Java 8, Java 7
>> Node.js 8 (beta)
>> PHP 5.5
>> Go 1.6, 1.8, and 1.9
- Intended to run for free or at very low cost, where you pay only for what you need and when you need it.
- Experiences sudden and extreme spikes of traffic which require immediate scaling.

Google App Engine: Flexible environment vs Compute Engine
While the flexible environment runs services in instances on Compute Engine VMs, the flexible environment differs from Compute Engine in the following ways:

- The VM instances used in the flexible environment are restarted on a weekly basis. During restarts, Google's management services apply any necessary operating system and security updates.
- You always have root access to Compute Engine VM instances. By default, SSH access to the VM instances in the flexible environment is disabled. If you choose, you can enable root access to your app's VM instances.
- The geographical region of the VM instances used in the flexible environment is determined by the location that you specify for the App Engine application of your GCP project. Google's management services ensure that the VM instances are co-located for optimal performance.

Google Compute Choices
App Engine:
- A flexible, zero ops (serverless!) platform for building highly available apps
- Focus on writing code, no need to worry about servers, clusters, OS, or other infrastructure.
- Support for several languages... or bring your own app runtime.
- Use cases:
>> Web sites; Mobile app and gaming backends
>> RESTful APIs
>> Internet of things (IoT) apps.

Container Engine:
- Logical infra powered by Kubernetes, the open source container orchestration system.
- Increase velocity and improve operability by separating the app from the OS.
- Don't have dependencies on a specific operating system & run the application anywhere.
- Use cases
>> Containerized workloads
>> Cloud-native distributed systems
>> Hybrid applications.

Compute Engine:
- Virtual machines running in Google's global data center network.
- Gives complete control over infra and direct access to high-performance hardware (GPUs and local SSDs).
- Lets you make OS-level changes and install the drivers needed for optimal performance.
- Direct access to GPUs that you can use to accelerate specific workloads.
- Use cases
>> Any workload requiring a specific OS or OS configuration
>> Currently deployed, on-premises software that you want to run in the cloud.
>> Anything which can't be containerised easily; or need existing VM images

Google AppEngine Environments
Standard: Pre-configured with: Java 7, Python 2.7, Go, PHP
Flexible: More choices: Java 8, Python 3.x, .NET

- Serverless!
- Instance classes determine price, billing
- Laundry list of services - pay for what you use

Google AppEngine: Standard Environment
- Based on container instances running on Google's infrastructure
- Preconfigured with one of several available runtimes (Java 7, Python 2.7, Go and PHP)
- Each runtime also includes libraries that support App Engine Standard APIs
- Mostly, this is all you will need.

- Applications run in a secure, sandboxed environment
- App Engine standard environment distributes requests across multiple servers and scales servers to meet traffic demands
- Your application runs within its own secure, reliable environment that is independent of the hardware, operating system, or physical location of the server.

Google AppEngine: Flexible Environment
- Allows you to customize your runtime and even the operating system of your virtual machine using Dockerfiles
- Under the hood, they are merely instances of Google Compute Engine VMs

Google Cloud Functions
Google Cloud Functions is a serverless execution environment for building and connecting cloud services. With Cloud Functions you write simple, single-purpose functions that are attached to events emitted from your cloud infrastructure and services. Your Cloud Function is triggered when an event being watched is fired. Your code executes in a fully managed environment. There is no need to provision any infrastructure or worry about managing any servers.

Cloud Functions are written in JavaScript and execute in a Node.js v6.14.0 environment on Google Cloud Platform. You can take your Cloud Function and run it in any standard Node.js runtime, which makes both portability and local testing a breeze.

Cloud Storage: Key Terms: Projects
All data in Cloud Storage belongs inside a project. A project consists of a set of users, a set of APIs, and billing, authentication, and monitoring settings for those APIs. You can have one project or multiple projects.

Cloud Storage: Key Terms: Buckets
Buckets are the basic containers that hold your data. Everything that you store in Cloud Storage must be contained in a bucket. You can use buckets to organize your data and control access to your data, but unlike directories and folders, you cannot nest buckets.

When you create a bucket, you specify a globally-unique name, a geographic location where the bucket and its contents are stored, and a default storage class. The default storage class you choose applies to objects added to the bucket that don't have a storage class specified explicitly.

Bucket names
Bucket names have more restrictions than object names and must be globally unique, because every bucket resides in a single Cloud Storage namespace.

Bucket labels
Bucket labels are key:value metadata pairs that allow you to group your buckets along with other Google Cloud Platform resources such as virtual machine instances and persistent disks.

Cloud Storage: Key Terms: Objects
Objects are the individual pieces of data that you store in Cloud Storage. There is no limit on the number of objects that you can create in a bucket.

Objects have two components: object data and object metadata.
- Object data is typically a file that you want to store in Cloud Storage.
- Object metadata is a collection of name-value pairs that describe various object qualities.

Object names
- An object's name is treated as a piece of object metadata in Cloud Storage.
- Object names can contain any combination of Unicode characters (UTF-8 encoded) and must be less than 1024 bytes in length.
- Can use '/' to give an impression of directory type structure.
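
The naming rules above can be illustrated with a small sketch. It measures a name's UTF-8 size against the limit as stated in this section and emulates a directory-style listing by filtering on a prefix. The function names and sample object names are hypothetical, and the prefix filter only mimics what a listing with a prefix would return.

```python
def utf8_size(name: str) -> int:
    """Size of an object name in bytes once UTF-8 encoded."""
    return len(name.encode("utf-8"))


def within_name_limit(name: str, limit_bytes: int = 1024) -> bool:
    """True if the name is non-empty and under the byte limit stated above."""
    return 0 < utf8_size(name) < limit_bytes


def list_prefix(object_names, prefix):
    """Emulate a 'directory' listing: names that start with the prefix."""
    return sorted(n for n in object_names if n.startswith(prefix))


# Flat object names; the '/' is just another character in the name.
names = ["photos/2018/cat.jpg", "photos/2018/dog.jpg", "reports/q1.csv"]
```

Note that a non-ASCII character can take more than one byte, so the byte count can exceed the character count.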

Cloud Storage: Key Terms: Resources
A resource is an entity within Google Cloud Platform. Each project, bucket, and object in Google Cloud Platform is a resource, as are things such as Compute Engine instances.

Cloud Storage: Key Terms: Data opacity
An object's data component is completely opaque to Cloud Storage. It is just a chunk of data to Cloud Storage.

Cloud Storage: Key Terms: Namespace
There is only one Cloud Storage namespace, which means every bucket must have a unique name across the entire Cloud Storage namespace. Object names must be unique only within a given bucket.
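
A toy model can make the two uniqueness rules concrete. The `Namespace` class below is purely illustrative (it is not a Cloud Storage client): bucket names collide globally, while the same object name can exist in different buckets.

```python
class Namespace:
    """Toy model of the single, global Cloud Storage namespace."""

    def __init__(self):
        self._buckets = {}  # bucket name -> {object name -> data}

    def create_bucket(self, name):
        # Bucket names must be unique across the whole namespace.
        if name in self._buckets:
            raise ValueError(f"bucket name already taken: {name}")
        self._buckets[name] = {}

    def put_object(self, bucket, object_name, data):
        # Object names only need to be unique within their own bucket.
        self._buckets[bucket][object_name] = data

    def get_object(self, bucket, object_name):
        return self._buckets[bucket][object_name]
```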

Storage Options for Compute
- Standard persistent disks: Efficient and reliable block storage.
- Regional persistent disks: Efficient and reliable block storage with synchronous replication across two zones in a region.
- SSD persistent disks: Fast and reliable block storage.
- Regional SSD persistent disks: Fast and reliable block storage with synchronous replication across two zones in a region.
- Local SSD: High performance local block storage.
- Cloud Storage Buckets: Affordable object storage.

DevOps
- Compute Engine Management with Puppet, Chef, Salt and Ansible
- Automated Image Builds with Jenkins, Packer, and Kubernetes.
- Distributed Load Testing with Kubernetes.
- Continuous Delivery with Travis CI.
- Managing Deployments with Spinnaker.

GCE: Image Types
- Public images for Linux and Windows Server that Google provides
- Private images that you create or import to Compute Engine
- Community supported images of other OS.

GCE: Image Creation
- Creator has full root privileges, SSH capability
>> Can share with other users

Steps:
- In the Google Cloud Platform Console, go to the Create an image page.
- Specify the source from which you want to create an image. This can be a persistent disk, another image, or a disk.raw file in Google Cloud Storage.
- Specify the properties for your image. For example, you can specify an image family name for your image to organize this image as part of an image family.
- If you are creating an image from a disk attached to a running instance, check "Force creation from running instance" to confirm that you want to create the image while the instance is running.
- Click Create to create the image.

GCE: Projects and Instances
- Each instance belongs to a project
- Projects can have any number of instances
- Projects can have up to 5 VPC (Virtual Private Cloud) networks
- Each instance belongs in one VPC
>> instances within VPC communicate on LAN
>> instances across VPC communicate on internet

GCE: Machine Types
- Standard
- High-memory
- High-CPU
- Shared-core (small, non-resource intensive)

- Can attach GPU dies to most machine types

GCE: Standard Machine Types
- 3.75 GB memory per vCPU
- naming: n1-standard-<1,2,4,8,16,32,64,96 vCPUs>
- Fixed at 16 persistent disks, 64TB total size.

GCE: High-memory Machine Types
- 6.5 GB memory per vCPU
- naming: n1-highmem-<2,4,8,16,32,64,96 vCPUs>. Total RAM = 6.5 x vCPU count.
- Fixed at 16 persistent disks, 64TB total size.

GCE: High-CPU Machine Types
- 0.9 GB memory per vCPU
- naming: n1-highcpu-<2,4,8,16,32,64,96 vCPUs>.
- Fixed at 16 persistent disks, 64TB total size.
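
The per-vCPU ratios in the three sections above give the total memory of a predefined type directly. The sketch below applies them to a machine type name; the function name `n1_memory_gb` is an assumption for illustration, and it only understands the `n1-<family>-<vCPUs>` names listed here.

```python
# Memory per vCPU (GB) for the three predefined n1 families listed above.
GB_PER_VCPU = {"standard": 3.75, "highmem": 6.5, "highcpu": 0.9}


def n1_memory_gb(machine_type: str) -> float:
    """Total memory in GB for a name like 'n1-highmem-8'."""
    series, family, vcpus = machine_type.split("-")
    if series != "n1" or family not in GB_PER_VCPU:
        raise ValueError(f"unsupported machine type: {machine_type}")
    return GB_PER_VCPU[family] * int(vcpus)
```

For example, `n1_memory_gb("n1-standard-4")` multiplies 3.75 GB by 4 vCPUs.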

GCE: Other Machine Types
- Memory-optimized machine types
- Shared-core machine types (f1-micro, g1-small)
- Custom machine types
- Provides GPUs that you can add to your VM instances (NVIDIA Tesla V100, P100 and K80 GPUs)

GCE: Preemptible Instances
- Much much cheaper than regular Compute Engine instances
- Can be terminated at any time if GCE needs the resources and definitely after running for 24 hours.
- Suitable for batch or fault-tolerant applications.
- Probability of termination varies by day/zone etc.
- Cannot live migrate (stay alive during updates) or auto-restart on maintenance.
- Not billed for instances preempted in the first 10 minutes.

GCE: Preemptible Instances: Termination Steps
Step 1: Compute Engine sends a Soft Off signal
Step 2: Shutdown script should clean up and give up control within 30 seconds.
Step 3: If not, Compute Engine sends a Mechanical Off signal.
Step 4: Compute Engine transitions to Terminated state
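
The four steps can be read as a small state sequence. The sketch below is a simulation only, with made-up event labels; it shows that the outcome depends on whether the shutdown script finishes inside the 30-second window.

```python
def preemption_sequence(cleanup_seconds: float, grace_seconds: float = 30.0):
    """Return the ordered events for one preemption, following the steps above."""
    events = ["SOFT_OFF"]                          # Step 1: preemption notice
    if cleanup_seconds <= grace_seconds:
        events.append("SHUTDOWN_SCRIPT_FINISHED")  # Step 2: cleaned up in time
    else:
        events.append("MECHANICAL_OFF")            # Step 3: forced power-off
    events.append("TERMINATED")                    # Step 4: final instance state
    return events
```

A practical consequence: a shutdown script should checkpoint the most important state first, since anything still running after the window is cut off.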

GCE: Preemptible Instances: Ways for handling graceful shutdown
- Using metadata
>> startup-script-url or startup-script
>> shutdown-script or shutdown-script-url
- API: instances.delete request or instances.stop

GCE: Storage Options
Each instance comes with a small root persistent disk containing the OS

Add additional storage options:
- Standard Persistent disks
- SSD
- Local SSDs
- Cloud Storage

GCE: Storage: Persistent Disks
- Durable network storage devices that instances can access like physical disks in a desktop or a server.
- Compute Engine manages physical disks and data distribution to ensure redundancy and optimize performance.
- Encrypted (custom encryption possible)
- Built-in redundancy
- Restricted to the zone where instance is located

Two types: Standard and SSD
- Standard Persistent: These are regular hard disks. They are cheap. OK to use for sequential access.
- SSD Persistent: Fast and expensive. Good for random access.

GCE: Storage: Local SSD
- Physically attached to the server that hosts your virtual machine instance
- Local SSDs have higher throughput and lower latency
- The data that you store on a local SSD persists only until you stop or delete the instance
- Small - each local SSD is 375 GB in size (can go up to 8 SSDs, i.e. 3 TB per instance).
- Very high IOPS and low latency
- Unlike persistent disks, you must manage the striping on local SSDs yourself
- Encrypted, custom encryption not possible

GCE: Storage: Cloud Storage Buckets
Use when latency and throughput are not a priority && when you must share data easily between multiple instances or zones.

- Flexible, scalable, durable - infinite size possible
- Performance depends on storage class.

Cloud Storage: Lifecycle Management
- Assign a lifecycle management configuration to a bucket; it applies to current and future objects in the bucket.
- Each lifecycle management configuration contains a set of rules. When defining a rule, you can specify any set of conditions for any action.
- Each rule should contain only one action.
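
A lifecycle configuration is usually written as JSON. The sketch below builds one with two rules, each holding exactly one action and a set of conditions, in the shape accepted by `gsutil lifecycle set`. The helper name `lifecycle_rule` and the chosen ages are illustrative assumptions.

```python
import json


def lifecycle_rule(action_type, storage_class=None, **conditions):
    """Build one rule: exactly one action plus any set of conditions."""
    action = {"type": action_type}
    if storage_class is not None:
        action["storageClass"] = storage_class
    return {"action": action, "condition": conditions}


config = {
    "rule": [
        # Move objects to Nearline once they are 30 days old.
        lifecycle_rule("SetStorageClass", storage_class="NEARLINE", age=30),
        # Delete objects once they are 365 days old.
        lifecycle_rule("Delete", age=365),
    ]
}

if __name__ == "__main__":
    print(json.dumps(config, indent=2))
```

Saving the printed JSON to a file gives something that could be passed to `gsutil lifecycle set [CONFIG_FILE] gs://[BUCKET_NAME]`.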

Cloud Storage: Create Bucket
gsutil mb -p [PROJECT_NAME] -c [STORAGE_CLASS] -l [BUCKET_LOCATION] gs://[BUCKET_NAME]/

Notes:
- you cannot nest buckets
- you need to specify a globally-unique name of 3 to 63 characters (names containing dots can be up to 222 characters, with each dot-separated component at most 63 characters).
- you can only change the bucket name and location by deleting and re-creating the bucket.
- you store objects in the bucket, which are immutable (cannot change throughout its storage lifetime).
- objects have two components: object data and object metadata.

Cloud Storage: Moving and Renaming Buckets
When you create a bucket, you permanently define its name, its geographic location, and the project it is part of.

However, you can effectively move or rename your bucket:
- If there is no data in your old bucket, simply delete the bucket and create another bucket with a new name, in a new location, or in a new project.
- If you have data in your old bucket, create a new bucket with the desired name, location, and/or project, copy data from the old bucket to the new bucket, and delete the old bucket and its contents.

Cloud Storage: Object Metadata
Objects stored in Cloud Storage have metadata associated with them. Metadata identifies properties of the object, as well as specifies how the object should be handled when it's accessed. Metadata exists as key:value pairs.

The mutability of metadata varies: some metadata you can edit at any time, some metadata you can only set at the time the object is created, and some metadata you can only view. For example, you can edit the value of the Cache-Control metadata at any time, but you can only assign the storageClass metadata when the object is created or rewritten, and you cannot directly edit the value for the generation metadata, though the generation value changes when the object is overwritten.

There are two categories of metadata that users can change for objects:
- Fixed-key metadata: Metadata whose keys are set, but for which you can specify a value.
- Custom metadata: Metadata that you add by specifying both a key and a value associated with the key.

Setting metadata:
gsutil setmeta -h "[METADATA_KEY]:[METADATA_VALUE]" gs://[BUCKET_NAME]/[OBJECT_NAME]
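
The difference between the two metadata categories shows up in the header passed to `gsutil setmeta -h`: fixed keys are used as-is, while custom keys carry the `x-goog-meta-` prefix. The sketch below formats such headers; the helper name is hypothetical and the fixed-key set is a partial list chosen for illustration.

```python
# A partial set of fixed-key metadata names, for illustration.
FIXED_KEYS = {
    "Cache-Control", "Content-Disposition", "Content-Encoding",
    "Content-Language", "Content-Type",
}


def setmeta_header(key: str, value: str) -> str:
    """Return the header string to pass to `gsutil setmeta -h`."""
    if key in FIXED_KEYS:
        return f"{key}:{value}"
    # Custom metadata keys carry the x-goog-meta- prefix.
    return f"x-goog-meta-{key}:{value}"
```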

Cloud Storage: List Bucket
gsutil ls

gsutil ls -L -b gs://[BUCKET_NAME]/
(it will give you geographic Location and default Storage class of the bucket)

To get the size of the bucket:
gsutil du -s gs://[BUCKET_NAME]/

Cloud Storage: List Objects in Bucket
gsutil ls -r gs://[BUCKET_NAME]/**

Cloud Storage: Copy Object from Bucket
gsutil cp gs://[bucket_name]/[object_name] [object_destination]

Cloud Storage: Bucket Labels
- Bucket labels are key:value metadata pairs that allow you to group your buckets.
- You can apply multiple labels to each bucket, with a maximum of 64 labels per bucket.

View bucket labels:
gsutil ls -L -b gs://[BUCKET_NAME]/

Remove a bucket label:
gsutil label ch -d [KEY_1] gs://[BUCKET_NAME]/

Cloud Storage: Bucket Versioning
Object Versioning can be turned on and off using the gsutil tool, the JSON API, and the XML API, but not with the Console.

- gsutil versioning set on gs://[BUCKET_NAME]
- gsutil versioning get gs://[BUCKET_NAME]
- gsutil ls -a gs://[BUCKET_NAME]

Compute Instance Creation Example
CLI:
$ gcloud compute instances create example-instance-1 example-instance-2 example-instance-3 --zone us-central1-a

To create an instance with the latest Red Hat Enterprise Linux 7 image available, run:
$ gcloud compute instances create example-instance --image-family rhel-7 --image-project rhel-cloud --zone us-central1-a

API Method: instances.insert

How can a user switch between two different projects from Cloud Shell using the gcloud command?
Get which projects you have:
- gcloud projects list

Get which project you are in:
- gcloud config get-value project
- gcloud config list

Set the project you want to switch to:
- gcloud config set project project-id

- In other commands, give --project "Project_ID" as flag.
- Set the CLOUDSDK_CORE_PROJECT environment variable
>> export CLOUDSDK_CORE_PROJECT="my-project-123456"
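
With three places to set a project, it helps to know which one wins. The sketch below assumes the usual precedence (a `--project` flag overrides the `CLOUDSDK_CORE_PROJECT` environment variable, which overrides the value stored with `gcloud config set project`). The function is an illustration of that assumed order, not part of the gcloud tooling.

```python
import os


def effective_project(flag=None, env=None, configured=None):
    """Resolve the project: flag first, then environment, then config."""
    env = os.environ if env is None else env
    return flag or env.get("CLOUDSDK_CORE_PROJECT") or configured
```

Passing `env` explicitly keeps the function easy to test without touching the real environment.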

How do you prevent a VM from accidental deletion?
- Set deletionProtection property (Only a user that has been granted a role with compute.instances.create permission can reset the flag to allow the resource to be deleted.)
- gcloud compute instances describe example-instance | grep "deletionProtection"
- gcloud compute instances update [INSTANCE_NAME] [--deletion-protection | --no-deletion-protection]

VM: How to reset or restart an instance
- gcloud compute instances reset example-instance
- gcloud compute instances stop example-instance
- gcloud compute instances start example-instance

A stopped instance does not incur charges, but all of the resources that are attached to the instance will still be charged.

VM: How to move instance between Zones
gcloud compute instances move jjain1-vm --zone us-east1-b --destination-zone us-east1-c

Instances can be moved along with their resources only within a region: from one zone to another.

VM: How do you add persistent disks after you have created the instance?
- Remember, you already created the boot disk when you created the instance.
- Disks are zonal resources, so they reside in a particular zone for their entire lifetime.
- A persistent disk can be a standard (HDD) or solid-state (SSD) drive. You can also attach an ephemeral local SSD for high-performance I/O. Each local SSD is 375 GB in size, but you can attach up to eight devices for 3 TB of total SSD storage space per instance.

Commands:
- gcloud compute disks create my-disk-1 my-disk-2 --zone us-east1-a --size 100GB
- gcloud compute instances attach-disk INSTANCE_NAME --disk=DISK

VM: Availability Policy
--maintenance-policy MIGRATE | TERMINATE

The default is MIGRATE.
By default, instances are automatically set to restart unless you provide --no-restart-on-failure flag.

VM: How do you retrieve Meta Data?
gcloud compute instances create example-instance --metadata foo=bar

gcloud compute instances add-metadata INSTANCE --metadata lettuce=green

gcloud compute instances remove-metadata INSTANCE --keys [KEY]

curl "http://metadata.google.internal/computeMetadata/v1/instance/disks/0/type" -H "Metadata-Flavor: Google"
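
The same metadata-server query can be prepared from Python. The sketch below only builds the request, including the `Metadata-Flavor: Google` header that v1 metadata requests need, and does not send it, since the hostname resolves only from inside a Compute Engine instance. The helper name `metadata_request` is an assumption.

```python
import urllib.request

METADATA_ROOT = "http://metadata.google.internal/computeMetadata/v1"


def metadata_request(path: str) -> urllib.request.Request:
    """Build (but do not send) a request for one metadata key."""
    url = f"{METADATA_ROOT}/{path.lstrip('/')}"
    return urllib.request.Request(url, headers={"Metadata-Flavor": "Google"})


req = metadata_request("instance/disks/0/type")
# On a Compute Engine VM, the value could then be read with:
#   urllib.request.urlopen(req).read().decode()
```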

VM: Apply tags
gcloud compute instances add-tags jjain3 --tags "http-server"

VM: Instances and projects
Each instance belongs to a Google Cloud Platform Console project, and a project can have one or more instances. When you create an instance in a project, you specify the zone, operating system, and machine type of that instance. When you delete an instance, it is removed from the project.

VM: Instances and storage options
By default, each Compute Engine instance has a small root persistent disk that contains the operating system. When applications running on your instance require more storage space, you can add additional storage options to your instance.

VM: Instances and networks
A project can have up to five VPC networks, and each Compute Engine instance belongs to one VPC network. Instances in the same network communicate with each other through a local area network protocol. An instance uses the Internet to communicate with any machine, virtual or physical, outside of its own network.

VM: Create an Image and launch an instance using it
- Stop the instance
- Go to Images link on left navigator
- Click Create Image
- Choose source as Disk and disk source as Instance name of Stopped Instance.
- Give a nice name and create image.
- Create a new instance with this image. Choose image from custom images.
- Launch it with HTTP/HTTPS checked so you can verify the instance

Cloud Launcher
A way to launch common software packages and stacks on Google Compute Engine with just a few clicks.

- Click on Cloud Launcher
- Choose WordPress
- Choose - zone, region, machine type, boot disk size, network and admin name.
- Click Deploy
- Open the web page and see if you are able to access its front end (assuming you have installed WordPress or something like that)

Load Balancer
- It's a global load balancer.
- Provides both layer 4 (transport layer) and layer 7 (application layer) load balancing functionality.

Load Balancing: Few concepts of L7/HTTP or HTTPS LB
- Port 80 or port 8080 or port 443.
- Support URL-based or Content Based routing. Create URL Maps and direct traffic to different instances based on the incoming URL.
>> you can send requests for http://www.example.com/audio to one backend service, which contains instances configured to deliver audio files, and requests for http://www.example.com/video to another backend service, which contains instances configured to deliver video files.
>> Route requests for static content to a Cloud Storage bucket.
- Supports session affinity - sends all requests from the same client to the same virtual machine instance as long as the instance stays healthy and has capacity.
>> (two types: client IP affinity, cookie affinity)
- The health of each backend instance is verified using an HTTP health check
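
URL-based routing can be pictured as a lookup from path prefix to backend. The sketch below is a simplified stand-in for a URL map, not the load balancer's actual matching logic; the backend names and the longest-prefix rule are assumptions made for illustration.

```python
# Hypothetical URL map: path prefix -> backend service (or bucket).
URL_MAP = {
    "/audio": "audio-backend",
    "/video": "video-backend",
    "/static": "static-bucket",
}
DEFAULT_BACKEND = "web-backend"


def route(path: str) -> str:
    """Pick the backend whose path prefix matches; the longest prefix wins."""
    matches = [p for p in URL_MAP if path == p or path.startswith(p + "/")]
    if not matches:
        return DEFAULT_BACKEND
    return URL_MAP[max(matches, key=len)]
```

Matching on `p + "/"` rather than a bare prefix keeps a path such as `/audiobooks` from being sent to the audio backend.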

GKE: Advantages
- Componentization - microservices
- Portability
- Rapid deployment
- Orchestration - Kubernetes clusters
- Image registration - pull images from container registry
- Flexibility - mix-and-match with other cloud providers, on-premise

GKE: Storage options
- Storage options as with Compute Engine
- However, remember that container disks are ephemeral.
- Need to use gcePersistentDisk abstraction for persistent disk

GKE: Load Balancing
- Network load balancing works out of the box with Container Engine
- For HTTP load balancing, need to integrate with Compute Engine load balancing

GKE: Container Cluster
- Group of Compute Engine instances running Kubernetes.
- It consists of
-- one or more node instances, and
-- a managed Kubernetes master endpoint.

GKE: Container Cluster: Node Instances
- Managed from the master
- Run the services necessary to support Docker containers
- Each node runs the docker runtime and hosts a Kubelet agent, which manages the Docker containers scheduled on the host

GKE: Container Cluster: Master Endpoint
Managed master also runs the Kubernetes API server, which
- services REST requests
- schedules pod creation and deletion on worker nodes
- synchronizes pod information (such as open ports and location)

GKE: Container Cluster: Node Pool
- Subset of machines within a cluster that all have the same configuration.
- Useful for customizing instance profiles in your cluster
- You can also:
>> run multiple Kubernetes node versions on each node pool in your cluster
>> update each node pool independently
>> target different node pools for specific deployments.

GKE: Container Builder
Tool that executes your container image builds on Google Cloud Platform's infrastructure

- Working:
>> import source code from a variety of repositories or cloud storage spaces
>> execute a build to your specifications
>> produce artifacts such as Docker containers or Java archives.

GKE: Container Registry
- Private registry for Docker images
- Can access Container Registry through secure HTTPS endpoints, which lets you push, pull, and manage images from any system, whether it's a Compute Engine instance or your own hardware
- Can use the Docker credential helper command-line tool to configure Docker to authenticate directly with Container Registry
- Can use third-party cluster management, continuous integration, or other solutions outside of Google Cloud Platform

GKE: Container Cluster: Autoscaling
- Automatic resizing of clusters with Cluster Autoscaler
- Periodically checks whether there are any pods waiting, resizes cluster if needed
- Also monitors node usage and deletes a node if all of its pods can be scheduled elsewhere
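
The two checks described above can be sketched as a single decision function. This is a heavily simplified model that counts pod slots in aggregate, whereas the real Cluster Autoscaler reasons about per-pod resource requests and scheduling constraints; the function name and return labels are assumptions.

```python
def autoscale(pending_pods: int, nodes: list) -> str:
    """nodes: list of (capacity, running_pods) tuples, measured in pod slots."""
    free = [cap - used for cap, used in nodes]
    if pending_pods > sum(free):
        return "scale_up"        # waiting pods do not fit on current nodes
    for i, (cap, used) in enumerate(nodes):
        # Free slots on the other nodes, after placing the pending pods.
        free_elsewhere = sum(free) - free[i] - pending_pods
        if len(nodes) > 1 and used <= free_elsewhere:
            return "scale_down"  # this node's pods fit on the other nodes
    return "no_change"
```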

CDN
Content Delivery Network

Use Case: Storage for Compute, Block Storage
- Traditional: persistent hard disks, SSDs.
- GCP: the same options (standard persistent disks, SSD persistent disks).

Use Case: Storing media, Blob Storage
- File system - maybe HDFS.
- GCP: Cloud Storage

Use Case: SQL Interface atop file data
- Hive (SQL-like, but MapReduce on HDFS)
- GCP: BigQuery

Use Case: Document database, NoSQL
- CouchDB, MongoDB (key-value/indexed database)
- GCP: DataStore

Use Case: Fast scanning, NoSQL
- HBase (columnar database)
- GCP: BigTable

Use Case: Transaction Processing (OLTP)
- RDBMS
- GCP: Cloud SQL, Cloud Spanner

Use Case: Analytics/Data Warehouse (OLAP)
- Hive (SQL-like, but MapReduce on HDFS)
- GCP: BigQuery

Use Case: Storage for Compute, Block Storage along with mobile SDKs
- Cloud Storage for Firebase

Use Case: Fast random access with mobile SDKs
- Firebase Realtime DB

Transfer Service: Importing Data
- The transfer service helps get data into Cloud Storage from:
-- AWS, i.e. an S3 bucket
-- HTTP/HTTPS location
-- Local files
-- Another Cloud Storage Bucket

Bells & Whistles
- One-time vs recurring transfers
- Delete from destination if they don't exist in source
- Delete from source after copying over
- Periodic synchronization of source and destination based on file filters

gsutil or Transfer Service?
- gsutil can be used to get data into cloud storage buckets
- Prefer the transfer service when transferring from AWS, etc
- If copying files over from on-premise, use gsutil

Block Storage
Characteristics of block storage:
- This is the lowest level of storage without any abstraction and structure to data.
- Meant for use from VMs but independent of the VM. Retains data through VM removal or reboots.
- Location tied to VM location.

Remember the options available on Compute Engine VMs:
- Standard persistent disks
- Regional SSDs
- Local SSDs

Bucket Storage Classes
- Multi-regional - frequent access from anywhere in the world
- Regional - frequent access from specific region
- Nearline - accessed once a month at max
- Coldline - accessed once a year at max
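
The access-frequency guidance above can be written as a small decision function. The thresholds mirror the list (at most once a year for Coldline, at most once a month for Nearline); the function name, parameter names, and class labels are illustrative assumptions.

```python
def pick_storage_class(reads_per_year: float, global_audience: bool = False) -> str:
    """Suggest a storage class from expected access frequency."""
    if reads_per_year <= 1:
        return "COLDLINE"    # accessed once a year at most
    if reads_per_year <= 12:
        return "NEARLINE"    # accessed once a month at most
    # Frequently accessed data: choose by where the readers are.
    return "MULTI_REGIONAL" if global_audience else "REGIONAL"
```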

Cloud Storage: Bucket Storage Classes: Multi-regional Storage
- Frequently accessed ("hot" objects), such as serving website content, interactive workloads, or mobile and gaming applications.
- Highest availability of the storage classes
- Geo-redundant - Cloud Storage stores your data redundantly in at least two regions separated by at least 100 miles within the multi-regional location of the bucket.

Cloud Storage: Bucket Storage Classes: Regional Storage
- Appropriate for storing data that is used by Compute Engine instances.
- Better performance for data-intensive computations, as opposed to storing your data in a multi-regional location

Cloud Storage: Bucket Storage Classes: Nearline Storage
- Slightly lower availability. 99.0% availability SLA.
- 30-day minimum storage duration
- Data retrieval costs.
- Very low cost per GB stored. Higher per operation costs.

Use case:
- Data you plan to read or modify on average once a month or less
- Data backup, disaster recovery, and archival storage.

Cloud Storage: Bucket Storage Classes: Coldline Storage
- Unlike other "cold" storage services, same throughput and latency (i.e. not slower to access)
- 90-day minimum storage duration, costs for data access, and higher per-operation costs
- Infrequently accessed data, such as data stored for legal or regulatory reasons

Working with Cloud Storage
Different ways to interact with Cloud Storage:
- XML and JSON APIs
- Command line (gsutil)
- GCP Console (web)
- Client SDK

Cloud Storage: Domain-Named Buckets
- Cloud Storage considers bucket names that contain dots to be domain names
- Must be syntactically valid DNS names
-- E.g. bucket...example.com is not valid (consecutive dots are not allowed).
- End with a currently-recognized top-level domain, such as .com
- Pass domain ownership verification.

Cloud Storage: Domain Verification
Number of ways to demonstrate ownership of a site or domain, including:
-- Adding a special Meta tag to the site's homepage.
-- Uploading a special HTML file to the site.
-- Verifying ownership directly from Search Console.
-- Adding a DNS TXT or CNAME record to the domain's DNS configuration.

Cloud Storage: Access Control Options
- Identity and Access Management (IAM) permissions: Grant access to buckets as well as bulk access to a bucket's objects. IAM permissions give you broad control over your projects and buckets, but not fine-grained control over individual objects.
- Access Control Lists (ACLs): Grant read or write access to users for individual buckets or objects. In most cases, you should use IAM permissions instead of ACLs. Use ACLs only when you need fine-grained control over individual objects.
- Signed URLs (query string authentication): Give time-limited read or write access to an object through a URL you generate. Anyone with whom you share the URL can access the object for the duration of time you specify, regardless of whether or not they have a Google account.
- Signed Policy Documents: Specify what can be uploaded to a bucket. Policy documents allow greater control over size, content type, and other upload characteristics than signed URLs, and can be used by website owners to allow visitors to upload files to Cloud Storage.
- Firebase Security Rules: Provide granular, attribute-based access control to mobile and web apps using the Firebase SDKs for Cloud Storage. For example, you can specify who can upload or download objects, how large an object can be, or when an object can be downloaded.

Cloud Storage: Data Encryption Options
- Server-side encryption: encryption that occurs after Cloud Storage receives your data, but before the data is written to disk and stored.
>>> Google-managed encryption keys: Cloud Storage uses its server-side encryption keys to encrypt your data. This is the default for Cloud Storage encryption.
>>> Customer-supplied encryption keys: You can create and manage your own encryption keys for server-side encryption, which replace the Google-managed encryption keys.
>>> Customer-managed encryption keys: You can generate and manage your encryption keys using Cloud Key Management Service. These replace the Google-managed encryption keys.
- Client-side encryption: encryption that occurs before data is sent to Cloud Storage. Such data arrives at Cloud Storage already encrypted but also undergoes server-side encryption.

Cloud Storage: Consistency
Strongly consistent operations:
- Read-after-write
- Read-after-metadata-update
- Read-after-delete
- Bucket listing
- Object listing
- Granting access to resources

Eventually consistent operations:
- Revoking access from resources

Cache control and consistency
Cached objects that are publicly readable might not exhibit strong consistency. If you allow an object to be cached, and the object is in the cache when it is updated or deleted, the cached object is not updated or deleted until its cache lifetime expires. The cache lifetime of an object is defined by the Cache-Control metadata associated with the object.

Cloud Storage: Auto-Scaling
Cloud Storage is a multi-tenant service, meaning that users share the same set of underlying resources. In order to make the best use of these shared resources, buckets have an initial IO capacity of around 1000 write requests per second and 5000 read requests per second, which average to 2.5PB written and 13PB read in a month for 1MB objects. As the request rate for a given bucket grows, Cloud Storage automatically increases the IO capacity for that bucket by distributing the request load across multiple servers.
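
The monthly figures quoted above can be checked with a few lines of arithmetic. This sketch assumes a 30-day month and decimal units (1 PB = 10^9 MB); the function name is made up for illustration.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumption: 30-day month
OBJECT_SIZE_MB = 1                      # 1 MB objects, as in the text


def monthly_volume_pb(requests_per_second: int) -> float:
    """Data volume moved in a month, in PB (decimal: 1 PB = 10**9 MB)."""
    total_mb = requests_per_second * OBJECT_SIZE_MB * SECONDS_PER_MONTH
    return total_mb / 10**9


written_pb = monthly_volume_pb(1000)  # roughly 2.6 PB written
read_pb = monthly_volume_pb(5000)     # roughly 13 PB read
```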

Cloud Storage: Load redistribution time
As a bucket approaches its IO capacity limit, Cloud Storage typically takes on the order of minutes to detect and accordingly redistribute the load across more servers. Consequently, if the request rate on your bucket increases faster than Cloud Storage can perform this redistribution, you may run into temporary limits, specifically higher latency and error rates. Ramping up the request rate gradually for your buckets avoids such latency and errors.

Cloud Storage: Object key indexing
Cloud Storage supports consistent object listing, which enables users to run data processing workflows easily against Cloud Storage. In order to provide consistent object listing, Cloud Storage maintains an index of object keys for each bucket. This index is stored in lexicographical order and is updated whenever objects are written to or deleted from a bucket. Adding and deleting objects whose keys all exist in a small range of the index naturally increases the chances of contention.

Cloud Storage detects such contention and automatically redistributes the load on the affected index range across multiple servers. Similar to scaling a bucket's IO capacity, when accessing a new range of the index, such as when writing objects under a new prefix, you should ramp up the request rate gradually. Not doing so may result in temporarily higher latency and error rates.

Cloud Storage: Best Practices
Ramp up request rate gradually
To ensure that Cloud Storage auto-scaling always provides the best performance, you should ramp up your request rate gradually for any bucket that hasn't had a high request rate in several weeks or that has a new range of object keys. If your request rate is less than 1000 write requests per second or 5000 read requests per second, then no ramp-up is needed. If your request rate is expected to go over these thresholds, you should start with a request rate below or near the thresholds and then double the request rate no faster than every 20 minutes.
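
A minimal sketch of such a ramp-up plan, assuming you start at the write threshold and double every 20 minutes until a target rate is reached. The function name and the target of 10,000 requests per second are illustrative assumptions.

```python
def ramp_up_schedule(start_rate: int, target_rate: int, step_minutes: int = 20):
    """Return (minute, requests_per_second) pairs, doubling the rate each step."""
    if start_rate <= 0:
        raise ValueError("start_rate must be positive")
    schedule = []
    minute, rate = 0, start_rate
    while rate < target_rate:
        schedule.append((minute, rate))
        rate *= 2                 # double the rate ...
        minute += step_minutes    # ... no faster than every step_minutes
    schedule.append((minute, target_rate))  # cap the final step at the target
    return schedule


plan = ramp_up_schedule(start_rate=1000, target_rate=10000)
```

With these inputs the plan reaches the target after 80 minutes: 1000, 2000, 4000, 8000, then 10000 requests per second.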

Use a naming convention that distributes load evenly across key ranges
Auto-scaling of an index range can be slowed when using sequential names, such as object keys based on a sequence of numbers or timestamp. This occurs because requests are constantly shifting to a new index range, making redistributing the load harder and less effective.

In order to maintain a high request rate, avoid using sequential names. Using completely random object names will give you the best load distribution. If you want to use sequential numbers or timestamps as part of your object names, introduce randomness to the object names by adding a hash value before the sequence number or timestamp.
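
One way to add such a hash prefix is sketched below. The choice of SHA-256, the 6-character prefix, and the underscore separator are assumptions for illustration; any stable hash that spreads names across the key space would serve.

```python
import hashlib


def hashed_object_name(sequential_name: str, prefix_len: int = 6) -> str:
    """Prefix a sequential or timestamped name with a short hash of itself.

    The prefix is deterministic, so the full object name can always be
    recomputed from the original name.
    """
    digest = hashlib.sha256(sequential_name.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}_{sequential_name}"


name = hashed_object_name("2024-01-01-000001.log")
```

The trade-off is that listing objects in timestamp order now needs a separate index or a client-side sort, since the hash prefix dominates the lexicographic order.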

Reorder bulk operations to distribute load evenly across key ranges
Even if you are not able to choose the object names, you can control the order in which the objects are uploaded or deleted to achieve the same effect as using random names.

If you have many folders and many files under each folder to upload, a good strategy is to upload from multiple folders in parallel and randomly choose which folders and files are uploaded. Doing so allows the system to distribute the load more evenly across entire key ranges, which allows you to achieve a high request rate after the initial ramp-up.
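
A rough sketch of randomizing the upload order across folders. The input shape (a dict of folder name to file names) and the function name are assumptions; the actual parallel upload step is left out.

```python
import random


def shuffled_upload_order(files_by_folder, seed=None):
    """Return every 'folder/file' path in a random order.

    files_by_folder: dict mapping a folder name to a list of file names.
    seed: optional, only to make the order reproducible in tests.
    """
    rng = random.Random(seed)
    paths = [
        f"{folder}/{name}"
        for folder, names in files_by_folder.items()
        for name in names
    ]
    rng.shuffle(paths)  # consecutive uploads now hit different key ranges
    return paths
```

The shuffled list can then be split among parallel workers, so no single worker walks one folder in key order.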

Cloud SQL, Cloud Spanner
- Relational databases - super-structured data, constraints etc
- ACID properties - use for transaction processing (OLTP)
- Too slow and too many checks for analytics/BI/warehousing (OLAP)
- Recall that OLTP needs strict write consistency, OLAP does not

- Cloud Spanner is Google proprietary, more advanced than Cloud SQL
- Cloud Spanner offers "horizontal scaling" - i.e. bigger data, more instances, replication etc.

Relational Data
- Data organized in structured schemas, with primary keys, etc.
- Data is split into different tables and linked (normalized).
- Can't handle missing values well (as opposed to columnar database).

Cloud SQL: Concepts
- Cloud SQL is a fully-managed MySQL and PostgreSQL database service.
- Fully Managed Relational Databases.
- SLA: 99.95% availability.
- Cloud SQL - for up to 10TB of storage capacity, 40,000 IOPS, and 416GB of RAM per instance. Anything beyond - Use Spanner.
- Cloud SQL is SSAE 16, ISO 27001, PCI DSS v3.0, and HIPAA compliant.
- Import and export databases using mysqldump, or import and export CSV files.
- Data replication between multiple zones with automatic fail-over.
- Automated and on-demand backups, and point-in-time recovery.

Cloud SQL: Automatic Storage Increase
- Available storage is checked every 30 seconds. If available storage falls below a threshold size (calculated from the currently provisioned size), additional storage capacity is automatically added to your instance.
- Storage size can be increased, but it can't be decreased.

Cloud SQL: Backups and Binary logging
- Determine whether automated backups are performed and binary logging is enabled or not.
- Required for the creation of replicas and clones, and for point-in-time recovery.

Cloud SQL: Supported open source SQLs
MySQL - fast and the usual.
PostgreSQL - complex queries.

CloudSQL: Instances
- Instances need to be created explicitly
-- Not serverless; a database instance is required.
-- Specify region while creating instance
- First vs. second generation instances
-- Second generation instances allow proxy support - no need to white list IP addresses or configure SSL
-- Higher availability configuration
-- Maintenance won't take down the server

CloudSQL: High Availability Configuration
- A Second Generation instance is in a high availability configuration when it has a failover replica.
- The failover replica must be in a different zone than the original instance, also called the master.
- All changes made to the data on the master, including to user tables, are replicated to the failover replica using semisynchronous replication.

CloudSQL: External Read Replicas
- External read replicas are external MySQL instances that are replicating from Cloud SQL master.
- Example: A MySQL instance running on Compute Engine is considered an external instance.
- Replicating to a MySQL instance hosted by another cloud platform or on-premises is not possible.

CloudSQL: On-demand and Automatic Backups
- Cloud SQL retains up to 7 automated backups for each instance. They are incremental.
- Automated backups are deleted automatically when the master is deleted.
- On-demand backups are not automatically deleted.
- Backup data is stored in two regions for redundancy.

CloudSQL: Point-in-time-recovery (PITR)
- A point-in-time recovery always creates a new instance; you cannot perform a point-in-time recovery to an existing instance.
- The target instance should have the same database version as the instance from which the backup was taken.
- You cannot restore an instance using a backup taken in a different GCP project.
- If you are restoring to an instance that is in a high availability configuration (it has a fail-over replica) or to an instance with read replicas, you must delete all replicas and recreate them after the restore operation completes.

CloudSQL: Operate SQL DB
- gcloud sql connect jj123 --user root
- Enter the password when prompted
- CREATE DATABASE jj123db;
- USE jj123db;
- CREATE TABLE students (studentName VARCHAR(255), idStudent INT NOT NULL AUTO_INCREMENT, PRIMARY KEY(idStudent));
- SELECT * FROM students;

CloudSQL: Cloud Proxy
- Provides secure access to your Cloud SQL Second Generation instances without having to whitelist IP addresses or configure SSL.
- Secure connections: The proxy automatically encrypts traffic to and from the database; SSL certificates are used to verify client and server identities.
- Easier connection management: The proxy handles authentication with Google Cloud SQL, removing the need to provide static IP addresses.

CloudSQL: Cloud Proxy: Operation
- The Cloud SQL Proxy works by having a local client, called the proxy, running in the local environment
- Your application communicates with the proxy with the standard database protocol used by your database.
- The proxy uses a secure tunnel to communicate with its companion process running on the server.

When you start the proxy, you need to tell it:
- What Cloud SQL instances it should establish connections to
- Where it will listen for data coming from your application to be sent to Cloud SQL
- Where it will find the credentials it will use to authenticate your application to Cloud SQL

You can install the proxy anywhere in your local environment. The location of the proxy binaries does not impact where it listens for data from your application.

Cloud Spanner
Cloud Spanner is a fully managed, mission-critical, relational database service that offers transactional consistency at global scale, schemas, SQL (ANSI 2011 with extensions), and automatic, synchronous replication for high availability.

Cloud Spanner offers:
- Strong consistency, including strongly consistent secondary indexes.
- SQL support, with ALTER statements for schema changes.
- Managed instances with high availability through transparent, synchronous, built-in data replication.

Cloud Spanner offers regional and multi-region instance configurations.

Use when you need high availability, strong consistency, transactional reads and writes (especially writes!).

Don't use if
- Data is not relational, or not even structured
- Want an open source RDBMS
- Strong consistency and availability are overkill

Cloud Spanner: Replication
Cloud Spanner automatically gets replication at the byte level from the underlying distributed filesystem that it's built on. Cloud Spanner writes database mutations to files in this filesystem, and the filesystem takes care of replicating and recovering the files when a machine or disk fails.

Cloud Spanner also replicates data to provide the additional benefits of data availability and geographic locality.

- Cloud Spanner creates multiple copies, or "replicas," of these rows, then stores these replicas in different geographic areas.
- Cloud Spanner uses a synchronous, Paxos-based replication scheme, in which voting replicas take a vote on every write request before the write is committed.
- This property of globally synchronous replication gives you the ability to read the most up-to-date data from any Cloud Spanner read-write or read-only replica.

Cloud Spanner: Benefits of replication
Benefits of Cloud Spanner replication include:
- Data availability: Having more copies of your data makes the data more available to clients that want to read it. Also, Cloud Spanner can still serve writes even if some of the replicas are unavailable, because only a majority of voting replicas are required in order to commit a write.
- Geographic locality: Having the ability to place data across different regions and continents with Cloud Spanner means data can be geographically closer to the users and services that need it.
- Single database experience: Because of the synchronous replication and global strong consistency, at any scale Cloud Spanner behaves the same, delivering a single database experience.
- Easier application development: Cloud Spanner's ACID transactions with global strong consistency mean developers don't have to add extra logic in the applications to deal with eventual consistency, making application development and subsequent maintenance faster and easier.
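
The availability point (a write needs only a majority of voting replicas) can be modeled in a couple of lines. This is a simplified quorum check for intuition only, not the Paxos protocol itself; the function name is hypothetical.

```python
def can_commit(voting_replicas: int, available_voters: int) -> bool:
    """A write can commit when a strict majority of voting replicas is reachable."""
    majority = voting_replicas // 2 + 1
    return available_voters >= majority
```

With 3 voting replicas a write survives the loss of one replica; with 5 it survives the loss of two.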

Cloud Spanner: Types of replicas
Read-write: Read-write replicas support both reads and writes.
- Maintain a full copy of your data.
- Serve reads.
- Can vote whether to commit a write.
- Participate in leadership election.
- Are eligible to become a leader.
- Are the only type used in single-region instances.

Read-only: Read-only replicas only support reads (not writes).
- Are only used in multi-region instances.
- Maintain a full copy of your data, which is replicated from read-write replicas.
- Serve reads.
- Do not participate in voting to commit writes.
- Can usually serve stale reads without needing a round-trip to the default leader region. Strong reads may require a round-trip to the leader replica.
- Are not eligible to become a leader.

Witness: Witness replicas don't support reads but do participate in voting to commit writes. These replicas:
- Are only used in multi-region instances.
- Do not maintain a full copy of data.
- Do not serve reads.
- Vote whether to commit writes.
- Participate in leader election but are not eligible to become leader.

Cloud Spanner: Instances
Instance configuration:
An instance configuration defines the geographic placement and replication of the databases in that instance. When you create an instance, you must configure it as either regional or multi-region. You make this choice by selecting an instance configuration, which determines where your data is stored for that instance.

Node count:
Your choice of node count determines the amount of serving and storage resources that are available to the databases in that instance. Each node provides up to 2 TiB of storage. The peak read and write throughput values that nodes can provide depend on the instance configuration, as well as on schema design and data-set characteristics.

Cloud Spanner: Data Model
- A Cloud Spanner database can contain one or more tables.
- Tables look like relational database tables in that they are structured with rows, columns, and values, and they contain primary keys.
- Data in Cloud Spanner is strongly typed: you must define a schema for each database and that schema must specify the data types of each column of each table. Allowable data types include scalar and array types.
- You can also define one or more secondary indexes on a table, and parent-child relationships between tables.

Cloud Spanner: Primary keys
Every table must have a primary key, and that primary key can be composed of zero or more columns of that table. If you declare a table to be a child of another table, the primary key column(s) of the parent table must be the prefix of the primary key of the child table. This means if a parent table's primary key is composed of N columns, the primary key of each of its child tables must also be composed of those same N columns, in the same order and starting with the same column.
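
The prefix rule can be expressed as a simple check on ordered column lists. This is a sketch of the rule as stated above, not Cloud Spanner's own validation; the function name is hypothetical, and the column names used in the usage note are illustrative.

```python
def is_valid_child_key(parent_key, child_key) -> bool:
    """True if the child's primary key starts with the parent's key columns,
    in the same order."""
    parent_key, child_key = list(parent_key), list(child_key)
    return child_key[:len(parent_key)] == parent_key
```

For instance, a child key of (SingerId, AlbumId) is valid under a parent key of (SingerId), while (AlbumId, SingerId) is not, because the order of the shared columns differs.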

Cloud Spanner: Parent - Child
- Parent-child relationships between tables
- These cause physical co-location for fast access
- If you query Students and Grades together, make Grades child of Students
-- Data locality will be enforced between 2 independent tables!
- Every table must have primary keys
- To declare a table as a child of another, prefix the parent's primary key onto the primary key of the child.

(This storage model resembles HBase)

Cloud Spanner: Interleaving
Cloud Spanner stores rows in sorted order by primary key values, with child rows inserted between parent rows that share the same primary key prefix. This insertion of child rows between parent rows along the primary key dimension is called interleaving, and child tables are also called interleaved tables.

This enables fast access like HBase.
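
Interleaving can be pictured by sorting parent and child rows together on their primary key tuples. The rows below are made-up sample data, and the in-memory sort is only a model of the physical layout, not how Cloud Spanner is implemented.

```python
# Each row: (primary key tuple, table name, payload). Sample data is invented.
singers = [((1,), "Singers", "Singer A"), ((2,), "Singers", "Singer B")]
albums = [
    ((2, 1), "Albums", "Album B1"),
    ((1, 2), "Albums", "Album A2"),
    ((1, 1), "Albums", "Album A1"),
]

# Sorting on the key tuple places each child row right after its parent,
# because a shorter tuple sorts before any longer tuple it is a prefix of.
rows = sorted(singers + albums, key=lambda row: row[0])
layout = [f"{table}{key}" for key, table, _ in rows]
```

The resulting order is Singers(1), its two albums, then Singers(2) and its album, so reading a singer together with their albums touches one contiguous key range.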

Cloud Spanner: Choosing a primary key
The primary key uniquely identifies each row in a table. If you want to update or delete existing rows in a table, then the table must have a primary key composed of one or more columns. Often your application already has a field that's a natural fit for use as the primary key.

There are techniques that can spread the load across multiple servers and avoid hotspots:
- Hash the key and store it in a column. Use the hash column as the primary key.
- Swap the order of the columns in the primary key.
- Use a Universally Unique Identifier (UUID). Version 4 UUID is recommended, because it uses random values in the high-order bits. Don't use a UUID algorithm that stores the timestamp in the high order bits.
- Bit-reverse sequential values.
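
Two of these techniques are easy to sketch. The helpers below are illustrative assumptions, not Cloud Spanner APIs: the bit reversal treats values as unsigned 64-bit integers (mapping the result into a signed INT64 column is left out), and the UUID helper relies on Python's standard version 4 generator.

```python
import uuid


def bit_reverse_64(value: int) -> int:
    """Reverse the bits of an unsigned 64-bit integer.

    Consecutive inputs (1, 2, 3, ...) differ in their high-order bits after
    reversal, so they land far apart in a sorted key space.
    """
    if not 0 <= value < 2**64:
        raise ValueError("value must fit in 64 unsigned bits")
    return int(format(value, "064b")[::-1], 2)


def new_uuid_key() -> str:
    """Version 4 UUID: random bits, no timestamp in the high-order positions."""
    return str(uuid.uuid4())
```

Bit reversal is its own inverse, so the original sequence number can be recovered by applying the function again.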

Cloud Spanner: Hotspotting
- As in HBase - need to choose Primary key carefully
- Do not use monotonically increasing values, else writes will be on same locations - hot spotting
- Use a hash of the key value if you have naturally monotonically ordered keys
- Under the hood, Cloud Spanner divides data among servers across key ranges

Cloud Spanner: Splits
Cloud Spanner lets you define hierarchies of parent-child relationships between tables up to seven layers deep, which means you can co-locate rows of seven logically independent tables.

As your database grows, Cloud Spanner divides your data into chunks called "splits", where individual splits can move independently from each other and get assigned to different servers, which can be in different physical locations.

A split is defined as a range of rows in a top-level table, where the rows are ordered by primary key. The start and end keys of this range are called "split boundaries". Cloud Spanner automatically adds and removes split boundaries, which changes the number of splits in the database.

Cloud Spanner splits data based on load: it adds split boundaries automatically when it detects high read or write load spread among many keys in a split.

The parent-child table relationships that you define, along with the primary key values that you set for rows of related tables, give you control over how data is split under the hood.

Cloud Spanner: Interleaved Table Example
-- Schema hierarchy:
-- + Singers
-- + Albums (interleaved table, child table of Singers)
-- + Songs (interleaved table, child table of Albums)

CREATE TABLE Singers (
SingerId INT64 NOT NULL,
FirstName STRING(1024),
LastName STRING(1024),
SingerInfo BYTES(MAX),
) PRIMARY KEY (SingerId);

CREATE TABLE Albums (
SingerId INT64 NOT NULL,
AlbumId INT64 NOT NULL,
AlbumTitle STRING(MAX),
) PRIMARY KEY (SingerId, AlbumId),
INTERLEAVE IN PARENT Singers ON DELETE CASCADE;

CREATE TABLE Songs (
SingerId INT64 NOT NULL,
AlbumId INT64 NOT NULL,
TrackId INT64 NOT NULL,
SongName STRING(MAX),
) PRIMARY KEY (SingerId, AlbumId, TrackId),
INTERLEAVE IN PARENT Albums ON DELETE CASCADE;

Cloud Spanner: Secondary Indices
A secondary index is helpful for quickly looking up data when searching by one or more non-key columns.

- Like in HBase, key-based storage ensures fast sequential scan of keys
>>> Remember that tables must have primary keys
- Unlike in HBase, can also add secondary indices
>>> Might cause same data to be stored twice
- Fine-grained control on use of indices
>>> Force query to use a specific index (index directives)
>>> Force column to be copied into a secondary index (STORING clause)

Example:
CREATE INDEX AlbumsByAlbumTitle ON Albums(AlbumTitle);

-- Assumes Albums also has a MarketingBudget column (not part of the schema shown earlier).
SELECT AlbumId, AlbumTitle, MarketingBudget
FROM Albums@{FORCE_INDEX=AlbumsByAlbumTitle}
WHERE SingerId = 1 AND AlbumTitle >= 'Aardvark' AND AlbumTitle < 'Goo';

Cloud Spanner: Data Types
- Remember that tables are strongly-typed (schemas must have types)
- Non-normalized types such as ARRAY and STRUCT available too
- STRUCTs are not OK in tables, but can be returned by queries (eg if query returns ARRAY of ARRAYs)
- ARRAYs are OK in tables, but ARRAYs of ARRAYs are not

Cloud Spanner: Transactions
- Supports serialisability
- Cloud Spanner transaction support is super-strong, even stronger than traditional ACID
-- Transactions commit in an order that is reflected in their commit timestamps
-- These commit timestamps are "real time" so you can compare them to your watch
- Two transaction modes
-- Locking read-write (slow)
-- Read-only (fast)
- If making a one-off read, use something known as a "Single Read Call"
-- Fastest, no transaction checks needed!

Cloud Spanner: Staleness
- Can set timestamp bounds
- Strong - "read latest data"
- Bounded Staleness - "read a version no older than ..."
- Exact Staleness - "read at exactly ..."
-- (could be in past or future)
- Cloud Spanner has a version-gc that reclaims versions more than 1 hour old

Cloud Spanner: Efficient Bulk Loading
The common theme for optimal bulk loading performance is to minimize the number of machines that are involved in each write, because aggregate write throughput is maximized when fewer machines are involved.

Cloud Spanner uses load-based splitting to evenly distribute your data load across nodes: after a few minutes of high load, Cloud Spanner introduces split boundaries between rows of non-interleaved tables and assigns each split to a different server.

- Partition your data by primary key: A good rule of thumb for your number of partitions is 10 times the number of nodes in your Cloud Spanner instance. So if you have N nodes, with a total of 10*N partitions, you can assign rows to partitions by:
>>> Sorting your data by primary key.
>>> Dividing it into 10*N separate sections.
>>> Creating a set of worker tasks that upload the data.
- Commit between 1 MiB and 5 MiB of mutations at a time
- Upload data before creating secondary indexes
- Periodic bulk uploads to an existing database
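
The partitioning steps above (sort by primary key, divide into 10*N sections) can be sketched as follows. The function name, the row shape (primary key in the first position), and the even-split policy are assumptions for illustration; the worker tasks that upload each partition are not shown.

```python
def partition_by_primary_key(rows, node_count, key=lambda row: row[0],
                             partitions_per_node=10):
    """Sort rows by primary key and cut them into contiguous partitions.

    Produces up to node_count * partitions_per_node partitions, each covering
    one contiguous key range, so each commit touches few splits.
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    ordered = sorted(rows, key=key)
    if not ordered:
        return []
    num_partitions = min(len(ordered), node_count * partitions_per_node)
    base, extra = divmod(len(ordered), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        size = base + (1 if i < extra else 0)  # spread the remainder evenly
        partitions.append(ordered[start:start + size])
        start += size
    return partitions
```

Each partition would then go to one worker, which commits its rows in batches sized as recommended in the list above.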

Inefficient Practices:
- Don't write rows one at a time
- Don't package N random rows into a commit with N mutations
- Don't sequentially add all rows in primary key order

Cloud Spanner: SQL Best Practices
- Use query parameters to speed up frequently executed queries
- Understand how Cloud Spanner executes queries
- Use secondary indexes to speed up common queries
- Write efficient queries for range key lookup
- Write efficient queries for joins
- Avoid large reads inside read-write transactions
- Use ORDER BY to ensure the ordering of your SQL results

BigQuery
- Latency is a bit higher than BigTable, DataStore - prefer those for low latency
- No ACID properties - can't use for transaction processing
- Great for analytics/business intelligence/data warehouse
- Superficially similar in use-case to Hive

DataStore
- Document data - eg XML or HTML - has a characteristic pattern
- Key-value structure, i.e. structured data
- Typically not used either for OLTP or OLAP
- Fast lookup on keys is the most common use-case
- Speciality of DataStore is that query execution time depends on size of returned result (not size of data set)
- So, a query returning 10 rows will take the same length of time whether the dataset is 10 rows or 10 billion rows
- Ideal for "needle-in-a-haystack" type applications, i.e. lookups of nonsequential keys
- Indices are always fast to read, slow to write
- So, don't use for write-intensive data

Major Blocks of Hadoop
- HDFS
- MapReduce
- YARN - Yet Another Resource Negotiator.

Hadoop Blocks: Co-ordination
- User defines map and reduce tasks using the MapReduce API
- A job is triggered on the cluster
- YARN figures out where and how to run the job, and stores the result in HDFS

Hadoop Ecosystem
- Hive
- HBase
- Pig
- Kafka
- Spark
- Oozie

Hadoop Ecosystem: Hive
- Provides an SQL interface to Hadoop
- The bridge to Hadoop for folks who don't have exposure to OOP in Java

Hadoop Ecosystem: HBase
- A database management system on top of Hadoop
- Integrates with your application just like a traditional database

Hadoop Ecosystem: Pig
- A data manipulation language
- Transforms unstructured data into a structured format
- Query this structured data using interfaces like Hive

Hadoop Ecosystem: Kafka
Stream processing for unbounded datasets

Hadoop Ecosystem: Spark
- A distributed computing engine used along with Hadoop
- Interactive shell to quickly process datasets
- Has a bunch of built in libraries for machine learning, stream processing, graph processing etc.

Hadoop Ecosystem: Oozie
A tool to schedule workflows on all the Hadoop ecosystem technologies

HDFS
- Built on commodity hardware
- Highly fault tolerant, hardware failure is the norm
- Suited to batch processing - data access has high throughput rather than low latency
- Supports very large data sets

- Manage file storage across multiple disks
- A cluster of machines. Each machine is a node in the cluster.
- Each disk on a different machine in a cluster.
- One node is the name node and others are data nodes.

HDFS: Name Node
Manages the overall file system

Stores
- The directory structure
- Metadata of the files

HDFS: Data nodes
Physically stores the data in the files

HDFS: Block Size
- Size is 128MB.
- Block size is a trade-off
-- Larger blocks reduce parallelism
-- Smaller blocks increase overhead
- This size helps minimize the time taken to seek to the block on the disk
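
The trade-off is easy to see by counting blocks for a given file. A minimal sketch, assuming sizes in MB and the 128 MB block size noted above; the function name is made up for illustration.

```python
import math

BLOCK_SIZE_MB = 128  # block size from the notes above


def num_blocks(file_size_mb: float, block_size_mb: float = BLOCK_SIZE_MB) -> int:
    """Number of HDFS blocks needed to hold a file (the last block may be partial)."""
    return math.ceil(file_size_mb / block_size_mb)
```

A 1 GB file needs 8 blocks at 128 MB but 16 at 64 MB: smaller blocks allow more parallel tasks, at the cost of more block metadata for the name node to track.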

HDFS: Reading a File
- Use metadata in the name node to look up block locations
- Read the blocks from respective locations

HDFS: Replication
- Replicate blocks based on the replication factor
- Store replicas in different locations
- The replica locations are also stored in the name node.

HDFS: Choosing Replica Locations
Maximize redundancy:
- Store replicas "far away" i.e. on different nodes

Minimize write bandwidth:
- This requires that replicas be stored close to each other

Default Hadoop Replication Strategy
- First location chosen at random
- Second location has to be on a different rack (if possible)
- Third replica is on the same rack as the second but on a different node
- Reduces inter-rack traffic and improves write performance
- Read operations are sent to the rack closest to the client
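
The placement strategy above can be sketched as below. This is a toy model, not Hadoop's actual placement code: the cluster is assumed to be a dict of rack name to node names, and the sketch simply raises an error if no second rack with two nodes exists, rather than falling back as Hadoop would.

```python
import random


def place_replicas(cluster, rng=None):
    """Pick three (rack, node) locations following the default strategy.

    cluster: dict mapping rack name -> list of node names (hypothetical shape).
    """
    rng = rng or random.Random()
    racks = sorted(cluster)
    # First replica: a node chosen at random.
    first_rack = rng.choice(racks)
    first = (first_rack, rng.choice(cluster[first_rack]))
    # Second replica: a different rack (one with room for the third replica too).
    other_racks = [r for r in racks if r != first_rack and len(cluster[r]) >= 2]
    if not other_racks:
        raise ValueError("need a second rack with at least two nodes")
    second_rack = rng.choice(other_racks)
    # Third replica: same rack as the second, but a different node.
    second_node, third_node = rng.sample(cluster[second_rack], 2)
    return [first, (second_rack, second_node), (second_rack, third_node)]
```

Only one copy crosses racks during the write pipeline, which is why this layout keeps inter-rack traffic low while still surviving the loss of a whole rack.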

MapReduce
- Processing huge amounts of data requires running processes on many machines.
- MapReduce is a programming paradigm
- Takes advantage of the inherent parallelism in data processing
- A task of large scale is processed in two stages
-- map
-- reduce
- Programmer defines these 2 functions. Hadoop does the rest - behind the scenes

MapReduce: Map
An operation performed in parallel, on small portions of the dataset

MapReduce: Reduce
An operation to combine the results of the map step
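
A word count is the classic illustration of the two stages. The sketch below runs in a single process in plain Python, so it only mimics the paradigm; in Hadoop the map calls would run in parallel on different nodes and the grouping step would be handled by the framework. All function names are illustrative.

```python
from collections import defaultdict


def map_phase(chunk: str):
    """Map: emit a (word, 1) pair for every word in one chunk of the input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group values by key (done by the framework between map and reduce)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values emitted for one key."""
    return key, sum(values)


def word_count(chunks):
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

Calling word_count(["the quick fox", "the lazy dog the end"]) counts "the" three times and every other word once.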

YARN
- Co-ordinates tasks running on the cluster
- Assigns new nodes in case of failure

Two major components:
- Resource Manager
- Node Manager.

YARN: Resource Manager
- Runs on a single master node
- Schedules tasks across nodes

YARN: Node Manager
- Run on all other nodes
- Manages tasks on the individual node

YARN: Application Master Process
- All processes on a node are run within a container.
- A container is the logical unit for the resources the process needs - memory, CPU etc.
- A container executes a specific application
- 1 NodeManager can have multiple containers.
- The ResourceManager starts off the Application Master within the Container
- Performs the computation required for the task
- If additional resources are required, the Application Master makes the request

YARN: Scheduling Policies
- FIFO Scheduler
- Capacity Scheduler
- Fair Scheduler

Hive
- Hive runs on top of the Hadoop distributed computing framework
- Hive stores its data in HDFS
- Hive runs all processes in the form of MapReduce jobs under the hood
- Don't need to write MapReduce code to work with Hive.

HiveQL
Hive Query Language: A SQL-like interface to the underlying data

- Modeled on the Structured Query Language (SQL)
- Familiar to analysts and engineers
- Simple query constructs
-- select
-- group by
-- join

- Hive exposes files in HDFS in the form of tables to the user
- Write SQL-like query in HiveQL and submit it to Hive
- Hive will translate the query to MapReduce tasks and run them on Hadoop
- MapReduce will process files on HDFS and return results to Hive

Hive Metastore
Hive Metastore: The bridge between data stored in files and the tables exposed to users

- Stores metadata for all the tables in Hive
- Maps the files and directories in Hive to tables
- Holds table definitions and the schema for each table
- Has information on converting files to table representations

Hive Metastore: Requirements
- Any database with a JDBC driver can be used as a metastore.
- Development environments use the built-in Derby database: Embedded metastore
- Same Java process as Hive itself
- One Hive session to connect to the database

Production environments:
- Local metastore: Allows multiple sessions to connect to Hive
- Remote metastore: Separate processes for Hive and the metastore

Hive vs. RDBMS
Hive:
Large datasets
- Gigabytes or petabytes
- Calculating trends
Parallel computations
- Distributed system with multiple machines
- Semi-structured data files partitioned across machines
- Disk space cheap, can add space by adding machines
High latency
- Records not indexed, cannot be accessed quickly
- Fetching a row will run a MapReduce that might take minutes
Read operations
- Not the owner of data
- Schema-on-read
Not ACID compliant by default
- Data can be dumped into Hive tables from any source
HiveQL

RDBMS:
Small datasets
- Megabytes or gigabytes
- Accessing and updating individual records
Serial computations
- Single computer with backup
- Structured data in tables on one machine
- Disk space expensive on a single machine
Low latency
- Records indexed, can be accessed and updated fast
- Queries can be answered in milliseconds or microseconds
Read/write operations
- Sole gatekeeper for data
- Schema-on-write
ACID compliant
- Only data which satisfies constraints is stored in the database
SQL

Hive Data Ownership
- Hive stores files in HDFS
- Hive files can be read and written by many technologies - Hadoop, Pig, Spark
- Hive database schema cannot be enforced on these files

Schema-on-read
- Number of columns, column types, constraints specified at table creation
- Hive tries to impose this schema when data is read
- It may not succeed, may pad data with nulls

Hive vs RDBMS (cont)
Hive:
- Schema on read, no constraints enforced
- Minimal index support
- Row level updates, deletes as a special case
- Many more built-in functions
- Only equi-joins allowed
- Restricted subqueries

RDBMS:
- Schema on write: keys, not null, unique all enforced
- Indexes allowed
- Row level operations allowed in general
- Basic built-in functions
- No restriction on joins
- Whole range of subqueries

OLAP
Online Analytical Processing

OLTP
Online Transactional Processing

OLAP: Splitting
- Two ways to narrow down the data to be examined or processed.
-- Partitioning
-- Bucketing
- Splits data into smaller, manageable parts
- Enables performance optimizations

OLAP: Partitioning
- Data may be naturally split into logical units. Eg. customers in US based on the state.
- Each of these units will be stored in a different directory
- State specific queries will run only on data in one directory
- Splits may not be of the same size

OLAP: Bucketing
- Size of each split should be the same
- Based on hash of a column value - address, name, timestamp anything
- Each bucket is a separate file
- Makes sampling and joining data more efficient
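
A minimal sketch of hash-based bucket assignment. It assumes MD5 as a stable hash purely for illustration (Hive uses its own hash function), and the helper name bucket_for and the sample names are hypothetical.

```python
import hashlib

def bucket_for(value, num_buckets):
    """Return the bucket index for a column value using a stable hash."""
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

# Assign each row to one of 4 buckets based on the hash of its name column.
names = ["alice", "bob", "carol", "dave", "erin", "frank"]
buckets = {i: [] for i in range(4)}
for name in names:
    buckets[bucket_for(name, 4)].append(name)
```

Because the same value always hashes to the same bucket, two tables bucketed on the same column can be joined bucket by bucket, and a single bucket can serve as a sample.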

OLAP: Queries on Big Data
- Partitioning and Bucketing of Tables
- Join Optimizations
- Window Functions

OLAP: Join Optimizations
- Join operations are MapReduce jobs under the hood.
- Optimize joins by reducing the amount of data held in memory.
- Or by structuring joins as a map-only operation
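
A map-only join can be sketched as follows, assuming one side of the join is small enough to hold in memory on every mapper. The function map_side_join and the sample tables are hypothetical and only illustrate the idea.

```python
def map_side_join(large_rows, small_table):
    """Join each row of the large input against an in-memory lookup table.

    No reduce step is needed because every mapper holds the whole small table.
    """
    joined = []
    for order_id, state_code, amount in large_rows:
        state_name = small_table.get(state_code)
        if state_name is not None:  # inner join: drop rows with no match
            joined.append((order_id, state_name, amount))
    return joined

states = {"CA": "California", "NY": "New York"}            # small table
orders = [(1, "CA", 20.0), (2, "NY", 35.5), (3, "TX", 12.0)]  # large table
result = map_side_join(orders, states)
```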

OLAP: Window Functions
- A suite of functions which are syntactic sugar for complex queries
- Make complex operations simple without needing many intermediate calculations
- Reduces the need for intermediate tables to store temporary data
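
To see what a window function saves, the sketch below computes by hand the equivalent of SUM(amount) OVER (PARTITION BY state ORDER BY day). The table layout (state, day, amount) and the function name running_total are assumptions made for the example.

```python
from itertools import groupby
from operator import itemgetter

def running_total(rows):
    """Emulate SUM(amount) OVER (PARTITION BY state ORDER BY day)."""
    ordered = sorted(rows, key=itemgetter(0, 1))  # partition key, then order key
    output = []
    for state, group in groupby(ordered, key=itemgetter(0)):
        total = 0  # the running total restarts for every partition
        for _, day, amount in group:
            total += amount
            output.append((state, day, amount, total))
    return output

sales = [("CA", 2, 5), ("NY", 1, 7), ("CA", 1, 10), ("NY", 2, 3)]
totals = running_total(sales)
```

Without a window function, the same result in SQL would typically need a self-join or an intermediate table.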

Apache Pig
A high level scripting language to work with data with unknown or inconsistent schema.

- Part of the Hadoop eco-system
- Works well with unstructured, incomplete data
- Can work directly on files in HDFS

Used to get data into a data warehouse

Pig: Extract, Transform, Load
Pull unstructured, inconsistent data from source, clean it and place it in another database where it can be analyzed

Pig Latin
A procedural, data flow language to extract, transform and load data.

Procedural:
- Series of well-defined steps to perform operations
- No if statements or for loops
Data Flow:
- Focused on transformations applied to the data
- Written with a series of data operations in mind

- Data from one or more sources can be read, processed and stored in parallel.
- Cleans data, precomputes common aggregates before storing in a data warehouse

Pig vs. SQL
Pig:
- A data flow language, transforms data to store in a warehouse.
- Specifies exactly how data is to be modified at every step.
- Purpose of processing is to store in a queryable format.
- Used to clean data with inconsistent or incomplete schema.

SQL:
- A query language, is used for retrieving results
- Abstracts away how queries are executed
- Purpose of data extraction is analysis
- Extract insights, generate reports, drive decisions

Pig on Hadoop
- Pig runs on top of the Hadoop distributed computing framework.
- Reads files from HDFS, stores intermediate records in HDFS and writes its final output to HDFS.
- Decomposes operations into MapReduce jobs which run in parallel.
- Provides non-trivial, built-in implementations of standard data operations, which are very efficient.
- Pig optimizes operations before MapReduce jobs are run, to speed operations up

Pig on Other Technologies
Apache Tez:
- Tez is an extensible framework which improves on MapReduce by making its operations faster.

Apache Spark:
- Spark is another distributed computing technology which is scalable, flexible and fast.

Pig vs. Hive
Pig:
- Used to extract, transform and load data into a data warehouse
- Used by developers to bring together useful data in one place
- Uses Pig Latin, a procedural, data flow language

Hive:
- Used to query data from a data warehouse to generate reports
- Used by analysts to retrieve business information from data
- Uses HiveQL, a structured query language

Spark
An engine for data processing and analysis.

- General Purpose
-- Exploring
-- Cleaning and Preparing
-- Applying Machine Learning
-- Building Data Applications

- Interactive
-- REPL: Read-Evaluate-Print-Loop
-- Interactive environments, fast feedback

- Distributed Computing
-- Process data across a cluster of machines
-- Integrate with Hadoop
-- Read data from HDFS

Spark APIs
Scala
Python
Java

Spark: Resilient Distributed Datasets
- RDDs are the main programming abstraction in Spark
- RDDs are in-memory collections of objects
- With RDDs, you can interact and play with billions of rows of data

Spark Core
Spark Core is just a computing engine. It needs two additional components.
- A Storage System that stores the data to be processed
-- Local file system
-- HDFS
- A Cluster Manager to help Spark run tasks across a cluster of machines
-- Built-in Cluster Manager
-- YARN

Both of these are plug and play components.

PySpark
- This is just like a Python shell
- Use Python functions, dicts, lists etc
- You can import and use any installed Python modules
- Launches by default in a local non-distributed mode

SparkContext
- When the shell is launched it initializes a SparkContext.
- The SparkContext represents a connection to the Spark Cluster
- Used to load data into memory from a specified source
- The data gets loaded into an RDD

Resilient Distributed Datasets
Partitions
- Data is divided into partitions
- Distributed to multiple machines, called nodes
- Nodes process data in parallel

Read-only
- RDDs are immutable
- Only Two Types of Operations
-- Transformation: The user may define a chain of transformations on the dataset
-- Action: Request a result using an action

Lineage
- When created, an RDD just holds metadata
-- A transformation
-- Its parent RDD
- Every RDD knows where it came from
- Lineage can be traced back all the way to the source

Spark: Lazy Evaluation
- Spark keeps a record of the series of transformations requested by the user.
- It groups the transformations in an efficient way when an Action is requested.
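
Immutability, lineage and lazy evaluation can be illustrated with a toy class. LazyDataset is a hypothetical stand-in written for this note; it is not the Spark API and runs in a single process.

```python
class LazyDataset:
    """Toy stand-in for an RDD: immutable, records lineage, computes only on an action."""

    def __init__(self, source, lineage=()):
        self._source = source
        self._lineage = tuple(lineage)  # chain of (kind, function) steps

    def map(self, fn):
        # Transformation: returns a new dataset, nothing is computed yet.
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, fn):
        # Transformation: returns a new dataset, nothing is computed yet.
        return LazyDataset(self._source, self._lineage + (("filter", fn),))

    def collect(self):
        # Action: replay the recorded lineage over the source in a single pass.
        results = []
        for item in self._source:
            keep = True
            for kind, fn in self._lineage:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                results.append(item)
        return results

calls = []  # records when the mapped function actually runs

def square(x):
    calls.append(x)
    return x * x

numbers = LazyDataset(range(1, 6))
evens_squared = numbers.map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)   # still 0: only lineage has been recorded
collected = evens_squared.collect()
```

Defining the chain costs nothing; square only runs once collect() is requested, and both steps are applied in one pass over the data.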

Bounded vs Unbounded Datasets
- Bounded datasets are processed in batches
- Unbounded datasets are processed as streams

Batch vs. Stream Processing
Batch:
- Bounded, finite datasets
- Slow pipeline from data ingestion to analysis
- Periodic updates as jobs complete
- Order of data received unimportant
- Single global state of the world at any point in time

Stream:
- Unbounded, infinite datasets
- Processing immediate, as data is received
- Continuous updates as jobs run constantly
- Order important, out of order arrival tracked
- No global state, only history of events received

Stream Processing
Data is received as a stream
- Log messages
- Tweets
- Climate sensor data

Process the data one entity at a time
- Filter error messages
- Find references to the latest movies
- Track weather patterns

Store, display, act on filtered messages
- Trigger an alert
- Show trending graphs
- Warn of sudden squalls

Stream-first Architecture
Data can come from multiple sources.
- Files
- Databases
- Streams

Major components:
- Message transport
- Stream processing

Stream-first Arch: Message Transport
- Buffer for event data
- Performant and persistent
- Decoupling multiple sources from processing
- Popular products: Kafka, MapR streams

Stream-first Arch: Stream Processing
- High throughput, low latency
- Fault tolerance with low overhead
- Manage out of order events
- Easy to use, maintainable
- Replay streams
- Examples: Spark Streaming, Storm, Flink

Streams Using Micro-batches
- A stream of integers grouped into batches.
- If the batches are small enough... it approximates real-time stream processing.
- Exactly once semantics, replay micro-batches. Each item is processed exactly once.
- Latency-throughput trade-off based on batch sizes. Depending on the application, achieve the sweet spot.
- Examples: Spark Streaming, Storm Trident.

Streaming: Micro batches: Types of Windows
- Tumbling Window
- Sliding Window
- Session Window

Streaming: Micro batches: Tumbling Window
- Fixed window size.
- Non-overlapping time
- Number of entities differ within a window
- The window tumbles over the data, in a non-overlapping manner

Streaming: Micro batches: Sliding Window
- Fixed window size
- Overlapping time - sliding interval
- Number of entities differ within a window

Streaming: Micro batches: Session Window
- Changing window size based on session data
- No overlapping time
- Number of entities differ within a window
- Session gap determines window size
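
The three window types can be sketched as window-assignment functions. This assumes non-negative integer timestamps and half-open [start, end) windows; the function names are hypothetical and do not come from any streaming library.

```python
def tumbling_window(ts, size):
    """Return the single [start, end) window that contains timestamp ts."""
    start = (ts // size) * size
    return (start, start + size)

def sliding_windows(ts, size, slide):
    """Return every [start, end) window of the given size that contains ts."""
    windows = []
    start = (ts // slide) * slide  # latest window start at or before ts
    while start > ts - size:
        if start >= 0:
            windows.append((start, start + size))
        start -= slide
    return sorted(windows)

def session_windows(timestamps, gap):
    """Group timestamps into sessions; a pause longer than `gap` starts a new one."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions
```

An event at time 7 falls in exactly one tumbling window of size 5, in two sliding windows of size 10 with a slide of 5, and session boundaries depend only on the gaps in the data itself.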

Apache Flink
Apache Flink is an open source, distributed system built using the stream-first architecture.

The stream is the source of truth

Streaming execution model: Processing is continuous, one event at a time.
Everything is a stream: Batch processing with bounded datasets is a special case of the unbounded dataset.
Same engine for all: Streaming and batch APIs both use the same underlying architecture

Stream Processing with Flink
- Handles out of order or late arriving data.
- Exactly once processing for stateful computations.
- Flexible windowing based on time, sessions, count, etc.
- Lightweight fault tolerance and checkpointing.
- Distributed, runs in large scale clusters.

Apache Flink Stack
Layer 3:
- CEP Event Processing
- Table Relational
- Gelly Graph
- Flink ML Machine Learning

Layer 2:
- DataStream API Stream Processing
- DataSet API Batch Processing

Layer 1:
- Runtime

Layer 0:
- Local Single JVM
- Cluster Standalone, YARN
- Cloud GCE, EC2

CRUD
Create
Read
Update
Delete

BigTable
- Fast scanning of sequential key values - use BigTable
- Columnar database, good for sparse data
- Sensitive to hot spotting - need to design key structure carefully
- Similar to HBase

BigTable and HBase
- BigTable is basically GCP's managed HBase
- Usual advantages of GCP -
-- scalability
-- low ops/admin burden
-- cluster resizing without downtime
-- many more column families before performance drops (~100 OK)

Properties of HBase
Columnar store
Denormalized storage
Only CRUD operations
ACID at the row level

Columnar Store
The different column values of each row ID are stored as contiguous rows.
Good for:
- Sparse tables: No wastage of space when storing sparse data
- Dynamic attributes: Update attributes dynamically without changing storage structure
Empty cells are acceptable in traditional databases, but not in big data stores whose size is in the petabyte range.

Denormalized storage
Traditional databases use normalized forms of database design to minimize redundancy.
In normalized form, data is made more granular by splitting it across multiple tables. This optimizes storage space.
In distributed systems, network bandwidth is more costly than storage. Storage is cheap.
Need to optimize number of disk seeks.
Read a single record to get all details about an employee in one read operation.

CRUD operations
Traditional databases and SQL support:
- Joins: Combining information across tables using keys
- Group By: Grouping and aggregating data for the groups
- Order By: Sorting rows by a certain column

HBase does not support SQL (NoSQL).
Only a limited set of operations are allowed in HBase (CRUD).
- No operations involving multiple tables
- No indexes on tables
- No constraints

ACID at the row level
- Updates to a single row are atomic.
- All columns in a row are updated or none are updated.

Traditional RDBMS vs. HBase
Traditional RDBMS:
- Data arranged in rows and columns
- Supports SQL
- Complex queries such as grouping, aggregates, joins etc
- Normalized storage to minimize redundancy and optimize space
- ACID compliant

HBase:
- Data arranged in a column-wise manner
- NoSQL database
- Only basic operations such as create, read, update and delete
- Denormalized storage to minimize disk seeks
- ACID compliant at the row level

HBase: 4-dimensional Data Model
- Row Key
- Column Family
- Column
- Timestamp

Example:
Row Key = Employee ID
Column Family = Work, Personal
Column = Dept Grade Title (Work), Name SSN (Personal)

HBase: Row Key
- Uniquely identifies a row
- Can be primitives, structures, arrays
- Represented internally as a byte array
- Sorted in ascending order

HBase: Column Family
- All rows have the same set of column families
- Each column family is stored in a separate data file
- Set up at schema definition time
- Can have different columns for each row

HBase: Column
- Columns are units within a column family
- New columns can be added on the fly
- ColumnFamily:ColumnName = Work:Department

HBase: Timestamp
- Used as the version number for the values stored in a column
- The value for any version can be accessed
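
The four coordinates can be pictured as a nested map. The sketch below is an in-memory illustration only, not the HBase client API; the put and get helpers and the sample employee data are hypothetical.

```python
table = {}  # row key -> column family -> column -> {timestamp: value}

def put(row_key, family, column, value, timestamp):
    """Store one cell version addressed by all four coordinates."""
    cell = table.setdefault(row_key, {}).setdefault(family, {}).setdefault(column, {})
    cell[timestamp] = value

def get(row_key, family, column, timestamp=None):
    """Return the value at a given version, or the latest version by default."""
    versions = table[row_key][family][column]
    if timestamp is None:
        timestamp = max(versions)
    return versions[timestamp]

put("emp-001", "Work", "Title", "Engineer", timestamp=1)
put("emp-001", "Work", "Title", "Senior Engineer", timestamp=2)
put("emp-001", "Personal", "Name", "Ada", timestamp=1)

latest_title = get("emp-001", "Work", "Title")
original_title = get("emp-001", "Work", "Title", timestamp=1)
```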

SQL vs. HBase Shell Commands
SQL:

1. select * from census

2. select name from census

3. select * from census limit 1

4. select * from census where rowkey = 1

HBase Shell:

1. scan 'census'

2. scan 'census', {COLUMNS => ['personal:name']}

3. scan 'census', {LIMIT => 1}

4. get 'census', 1

HBase: Filters
Filters allow you to control what data is returned from a scan operation.

Built-in Filters:
- Conditions on row keys
- Conditions on columns
- Multiple conditions on columns
- Timestamp range

BigTable: Overview
- A sparsely populated table that can scale to billions of rows and thousands of columns, enabling you to store terabytes or even petabytes of data.
- A single value in each row is indexed; this value is known as the row key.
- Ideal for storing very large amounts of single-keyed data with very low latency.
- It supports high read and write throughput at low latency, and it is an ideal data source for MapReduce operations.

Cloud Bigtable is exposed to applications through multiple client libraries, including a supported extension to the Apache HBase library for Java. As a result, it integrates with the existing Apache ecosystem of open-source Big Data software.

Cloud Bigtable: storage model
Cloud Bigtable stores data in massively scalable tables, each of which is a sorted key/value map. The table is composed of rows, each of which typically describes a single entity, and columns, which contain individual values for each row. Each row is indexed by a single row key, and columns that are related to one another are typically grouped together into a column family. Each column is identified by a combination of the column family and a column qualifier, which is a unique name within the column family.

Each row/column intersection can contain multiple cells at different timestamps, providing a record of how the stored data has been altered over time. Cloud Bigtable tables are sparse; if a cell does not contain any data, it does not take up any space.

BigTable: Avoid BigTable When
- Don't use if you need transaction support (OLTP) - use Cloud SQL or Cloud Spanner
- Don't use for data less than 1 TB (can't parallelize)
- Don't use if analytics/business intelligence/data warehousing - use BigQuery instead
- Don't use for documents or highly structured hierarchies - use DataStore instead
- Don't use for immutable blobs like movies each > 10 MB - use Cloud Storage instead

BigTable: Use BigTable When
- Use for very fast scanning and high throughput
- Use for non-structured key/value data
- Where each data item < 10 MB and total data > 1 TB
- Use where writes infrequent/unimportant (no ACID) but fast scans crucial
- Use for Time Series data

BigTable: Use for Time Series
- BigTable is a natural fit for Timestamp data (range queries)
- Say IOT sensor network emitting data at intervals
-- Use Device ID # Time as row key if common query = "All data for a device over period of time"
-- Use Time # Device ID as row key if common query = "All data for a period for all devices"
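
The two key layouts can be compared with a small sketch. It assumes string row keys with "#" as separator and zero-padded timestamps so that string order matches numeric order; the helper names and device IDs are hypothetical, and a sorted Python list stands in for the table.

```python
def device_first_key(device_id, timestamp):
    """Row key for 'all data for one device over a period of time'."""
    return f"{device_id}#{timestamp:010d}"

def time_first_key(device_id, timestamp):
    """Row key for 'all data for a period across all devices'."""
    return f"{timestamp:010d}#{device_id}"

def prefix_scan(sorted_keys, prefix):
    """Emulate a range scan: keys sharing a prefix are adjacent when sorted."""
    return [k for k in sorted_keys if k.startswith(prefix)]

readings = [("dev-b", 1700000060), ("dev-a", 1700000000), ("dev-a", 1700000060)]
by_device = sorted(device_first_key(d, t) for d, t in readings)
by_time = sorted(time_first_key(d, t) for d, t in readings)

one_device = prefix_scan(by_device, "dev-a#")
one_moment = prefix_scan(by_time, "1700000060#")
```

Whichever field comes first in the key decides which query becomes a single contiguous scan. Note that the time-first layout sends all new writes to the same end of the key range, which is why timestamp prefixes appear under row keys to avoid.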

BigTable: What it's good for
- Time-series data, such as CPU and memory usage over time for multiple servers.
- Marketing data, such as purchase histories and customer preferences.
- Financial data, such as transaction histories, stock prices, and currency exchange rates.
- Internet of Things data, such as usage reports from energy meters and home appliances.
- Graph data, such as information about how users are connected to one another.

BigTable: Hotspotting and Schema Design
- Like Cloud Spanner, data is stored in sorted lexicographic order of keys
- Data is distributed based on key values
- So, performance will be really poor if
-- Reads/writes are concentrated in some ranges
-- For instance if key values are sequential
- Use hashing of key values, or non-sequential keys

BigTable: Avoiding Hotspotting
Field Promotion: Use in reverse URL order like Java package names
-- This way keys have similar prefixes, differing endings
Salting
-- Hash the key value
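
Both techniques can be sketched as key-building helpers. The function names, the choice of MD5 and the number of salt values are assumptions made for illustration, not a Bigtable API.

```python
import hashlib

def reverse_domain(domain):
    """Turn 'maps.google.com' into 'com.google.maps' so related hosts sort together."""
    return ".".join(reversed(domain.split(".")))

def salted_key(key, num_salts=4):
    """Prefix the key with a small hash-derived salt to spread sequential keys."""
    salt = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % num_salts
    return f"{salt}#{key}"

reversed_key = reverse_domain("maps.google.com")
salted = salted_key("event-0001")
```

The salt is derived from the key itself, so a reader can recompute it for a point lookup; a range scan over salted keys has to query every salt prefix.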

BigTable: Warming the Cache
- BigTable will improve performance over time
- Will observe read and write patterns and redistribute data so that shards are evenly hit
- Will try to store roughly same amount of data in different nodes
- This is why testing over hours is important to get true sense of performance

BigTable: SSD or HDD Disks
- Use SSD unless skimping on cost
- SSD can be 20x faster on individual row reads
- More predictable throughput too (no disk seek variance)
- Don't even think about HDD unless storing > 10 TB and all batch queries
- The more random access, the stronger the case for SSD

BigTable: Reasons for Poor Performance
- Poor schema design (eg sequential keys)
- Inappropriate workload
-- too small (<300 GB)
-- used in short bursts (needs hours to tune performance internally)
- Cluster too small
- Cluster just fired up or scaled up
- HDD used instead of SSD
- Development vs. production instance

BigTable: Schema Design
- Each table has just one index - the row key. Choose it well
- Rows are sorted lexicographically by row key
- All operations are atomic at row level
- Related entities in adjacent rows

BigTable: Size Limits
- Row keys: 4KB per key
- Column Families: ~100 per table
- Column Values: ~ 10 MB each
- Total Row Size: ~100 MB

BigTable: Types of Row Keys
- Reverse domain names
- String identifiers
- Timestamps as suffix in key

BigTable: Row Keys to Avoid
- Domain names
- Sequential numeric values
- Timestamps alone
- Timestamps as prefix of row-key
- Mutable or repeatedly updated values

BigTable: Architecture
All client requests go through a front-end server before they are sent to a Cloud Bigtable node. The nodes are organized into a Cloud Bigtable cluster, which belongs to a Cloud Bigtable instance, a container for the cluster.

Each node in the cluster handles a subset of the requests to the cluster. By adding nodes to a cluster, you can increase the number of simultaneous requests that the cluster can handle, as well as the maximum throughput for the entire cluster. If you enable replication by adding a second cluster, you can also send different types of traffic to different clusters, and you can fail over to one cluster if the other cluster becomes unavailable.

A Cloud Bigtable table is sharded into blocks of contiguous rows, called tablets, to help balance the workload of queries. Tablets are stored on Colossus, Google's file system, in SSTable format. An SSTable provides a persistent, ordered immutable map from keys to values, where both keys and values are arbitrary byte strings.

In addition to the SSTable files, all writes are stored in Colossus's shared log as soon as they are acknowledged by Cloud Bigtable, providing increased durability.

Importantly, data is never stored in Cloud Bigtable nodes themselves; each node has pointers to a set of tablets that are stored on Colossus.

BigTable: Load Balancing
Each Cloud Bigtable zone is managed by a master process, which balances workload and data volume within clusters. The master splits busier/larger tablets in half and merges less-accessed/smaller tablets together, redistributing them between nodes as needed. Cloud Bigtable manages all of the splitting, merging, and rebalancing automatically, saving users the effort of manually administering their tablets.

To get the best write performance from Cloud Bigtable, it's important to distribute writes as evenly as possible across nodes. One way to achieve this goal is by using row keys that do not follow a predictable order.

At the same time, it's useful to group related rows so they are adjacent to one another, which makes it much more efficient to read several rows at the same time.

Cloud Bigtable and other storage options
Cloud Bigtable is not a relational database; it does not support SQL queries or joins, nor does it support multi-row transactions. Also, it is not a good solution for storing less than 1 TB of data.

- If you need full SQL support for an online transaction processing (OLTP) system, consider Cloud Spanner or Cloud SQL.
- If you need interactive querying in an online analytical processing (OLAP) system, consider BigQuery.
- If you need to store immutable blobs larger than 10 MB, such as large images or movies, consider Cloud Storage.
- If you need to store highly structured objects in a document database, with support for ACID transactions and SQL-like queries, consider Cloud Datastore.

BigTable: Instances
A Cloud Bigtable instance is mostly just a container for your clusters and nodes, which do all of the real work.

Important properties:
- The instance type (production or development)
- The storage type (SSD or HDD)
- The application profiles, for instances that use replication.

BigTable: Application Profiles
App profiles store settings that tell your Cloud Bigtable instance how to handle incoming requests from an application. App profiles affect how your applications communicate with an instance that uses replication. As a result, app profiles are especially useful for instances that have 2 clusters. An app profile defines the routing policy that Cloud Bigtable uses. It also controls whether single-row transactions are allowed.

BigTable: Routing policy
An app profile specifies the routing policy that Cloud Bigtable should use for each request:
- Single-cluster routing routes all requests to 1 cluster in your instance.
- Multi-cluster routing tells Cloud Bigtable that it can route each request to any available cluster.

If one cluster becomes unavailable, and an app profile uses multi-cluster routing, any traffic that uses that app profile automatically fails over to the other cluster. In contrast, if an app profile uses single-cluster routing, you must manually fail over.

BigTable: Single-row transactions
In Cloud Bigtable, reads and writes are always atomic at the row level. Cloud Bigtable does not provide atomicity above the row level; for example, Cloud Bigtable does not support transactions that atomically update more than one row.

However, Cloud Bigtable also supports some write operations that would require a transaction in other databases:
- Read-modify-write operations, including increments and appends. A read-modify-write operation reads an existing value; increments or appends to the existing value; and writes the updated value to the table.
- Check-and-mutate operations, also known as conditional mutations or conditional writes. In a check-and-mutate operation, Cloud Bigtable checks a row to see if it meets a specified condition. If the condition is met, Cloud Bigtable writes new values to the row.
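
The semantics of these two operations can be illustrated with a toy in-memory row guarded by a lock. The Row class and its method names are hypothetical; this is not the Cloud Bigtable client library, only a sketch of row-level atomicity.

```python
import threading

class Row:
    """Toy single row whose operations are atomic, guarded by one lock."""

    def __init__(self):
        self._cells = {}
        self._lock = threading.Lock()

    def increment(self, column, amount=1):
        # Read-modify-write: read the value, add to it, write it back atomically.
        with self._lock:
            self._cells[column] = self._cells.get(column, 0) + amount
            return self._cells[column]

    def check_and_mutate(self, column, expected, new_values):
        # Conditional write: apply new_values only if the condition holds.
        with self._lock:
            if self._cells.get(column) != expected:
                return False
            self._cells.update(new_values)
            return True

    def read(self, column):
        with self._lock:
            return self._cells.get(column)

row = Row()
threads = [threading.Thread(target=row.increment, args=("visits",)) for _ in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Only the first writer that sees an empty "status" cell wins.
first_try = row.check_and_mutate("status", None, {"status": "claimed"})
second_try = row.check_and_mutate("status", None, {"status": "stolen"})
```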

BigTable: Clusters
A cluster represents the actual Cloud Bigtable service. Each cluster belongs to a single Cloud Bigtable instance, and an instance can have up to 2 clusters. When your application sends requests to a Cloud Bigtable instance, those requests are actually handled by one of the clusters in the instance.

BigTable: Nodes
Each cluster in a production instance has 3 or more nodes, which are compute resources that Cloud Bigtable uses to manage your data.

Cloud Bigtable splits all of the data from your tables into smaller tablets, which are stored on disk, separate from the nodes. Each node is responsible for keeping track of specific tablets on disk; handling incoming reads and writes for its tablets; and performing maintenance tasks on its tablets, such as periodic compactions.

A cluster must have enough nodes to support its current workload and the amount of data it stores. Otherwise, the cluster might not be able to handle incoming requests, and latency could go up.

BigTable: Designing Schema: General Concepts
- Each table has only one index, the row key. There are no secondary indices.
- Rows are sorted lexicographically by row key, from the lowest to the highest byte string. Row keys are sorted in big-endian, or network, byte order, the binary equivalent of alphabetical order.
- All operations are atomic at the row level. Avoid schema designs that require atomicity across rows.
- Ideally, both reads and writes should be distributed evenly across the row space of the table.
- In general, keep all information for an entity in a single row. An entity that doesn't need atomic updates and reads can be split across multiple rows. Splitting across multiple rows is recommended if the entity data is large (hundreds of MB).
- Related entities should be stored in adjacent rows, which makes reads more efficient.
- Cloud Bigtable tables are sparse. Empty columns don't take up any space. As a result, it often makes sense to create a very large number of columns, even if most columns are empty in most rows.
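
The effect of byte-string ordering on numeric keys can be seen in a short sketch. It assumes non-negative integers and an 8-byte fixed width; the helper int_key is hypothetical.

```python
def int_key(n, width=8):
    """Encode a non-negative integer as fixed-width big-endian bytes."""
    return n.to_bytes(width, byteorder="big")

numbers = [2, 10, 100, 9]

# Decimal strings sort by their first differing character, not by value.
as_strings = sorted(str(n).encode("ascii") for n in numbers)

# Fixed-width big-endian bytes sort in the same order as the numbers.
as_big_endian = sorted(int_key(n) for n in numbers)
decoded = [int.from_bytes(k, byteorder="big") for k in as_big_endian]
```

With plain decimal strings, "10" and "100" sort before "2", so range scans over numeric keys need either big-endian encoding or zero padding.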

Datastore
Google Cloud Datastore is a NoSQL document database built for automatic scaling, high performance, and ease of application development.

- Document data - eg XML or HTML - has a characteristic pattern
- Key-value structure, i.e. structured data
- Typically not used either for OLTP or OLAP
- Fast lookup on keys is the most common use-case
- SQL-like queries and REST API
- For mobile and web development
- Stackdriver monitoring is integrated.
- Support MapReduce framework on top of Datastore, for processing large amounts of data in parallel and distributed fashion.

- Speciality of DataStore is that query execution time depends on size of returned result (not size of data set)
- Ideal for "needle-in-a-haystack" type applications, i.e. lookups of nonsequential keys

Datastore Features
- Atomic transactions: can execute a set of operations where either all succeed, or none occur.
- High availability of reads and writes.
- Massive scalability with high performance: uses a distributed architecture to automatically manage scaling. Uses a mix of indexes and query constraints so your queries scale with the size of your result set, not the size of your data set.
- Flexible storage and querying of data: maps naturally to object-oriented and scripting languages, and is exposed to applications through multiple clients. It also provides a SQL-like query language.
- Balance of strong and eventual consistency: ensures that entity lookups by key and ancestor queries always receive strongly consistent data. All other queries are eventually consistent.
- Encryption at rest: automatically encrypts all data before it is written to disk and automatically decrypts the data when read by an authorized user.
- Fully managed with no planned downtime.

Traditional RDBMS vs. DataStore: Similarities
Traditional RDBMS:
- Atomic transactions
- Indices for fast lookup
- Some queries use indices - not all
- Query time depends on both size of data set and size of result set

DataStore:
- Atomic transactions
- Indices for fast lookup
- All queries use indices!
- Query time independent of data set, depends on result set alone

Traditional RDBMS vs. DataStore: Differences
Traditional RDBMS:
- Structured relational data
- Rows stored in Tables
- Rows consist of fields
- Primary Keys for unique ID
- Rows of table have same properties (Schema is strongly enforced)
- Types of all values in a column are the same
- Lots of joins
- Filtering on subqueries
- Multiple inequality conditions

DataStore:
- Structured hierarchical data (XML, HTML)
- Entities of different Kinds (think HTML tags)
- Entities consist of Properties
- Keys for unique ID
- Entities of a kind can have different properties (think optional tags in HTML)
- A property with the same name can hold values of different types in different entities.
- No joins
- No filtering on subqueries
- Only one inequality filter OK per query

DataStore: When to avoid
- Don't use if you need very strong transaction support (OLTP) - OK for basic ACID support though
- Don't use for non-hierarchical or unstructured data - BigTable is better
- Don't use if analytics/business intelligence/data warehousing - use BigQuery instead
- Don't use for immutable blobs like movies each > 10 MB - use Cloud Storage instead
- Don't use if application has lots of writes and updates on key columns

DataStore: When to use
- Use for crazy scaling of read performance - to virtually any size
- Use for hierarchical documents with key/value data.
- Product catalogs that provide real-time inventory and product details for a retailer.
- User profiles that deliver a customized experience based on the user's past activities and preferences.
- Transactions based on ACID properties, for example, transferring funds from one bank account to another.

DataStore: Index
An index is defined on a list of properties of a given entity kind, with a corresponding order (ascending or descending) for each property. For use with ancestor queries, the index may also optionally include an entity's ancestors.

Two types of indexes:
- Built-in: By default, Cloud Datastore automatically predefines an index for each property of each entity kind. These single property indexes are suitable for simple types of queries.
- Composite Index: Composite indexes index multiple property values per indexed entity. Composite indexes support complex queries and are defined in an index configuration file (index.yaml)

DataStore: Full Indexing
- "Built-in" Indices on each property (~field) of each entity kind (~table)
- "Composite" Indices on multiple property values
- If you are certain a property will never be queried, can explicitly exclude it from indexing
- Each query is evaluated using its "perfect index"

DataStore: Perfect Index
- Given a query, which is the index that most optimally returns query results?
- Depends on following (in order)
-- equality filter
-- inequality filter (only 1 allowed)
-- sort conditions if any specified
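
The ordering above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in these notes, not Datastore's planner: the function name perfect_index, the kind "Task" and its property names are all hypothetical, and equality properties are given an arbitrary ascending direction.

```python
def perfect_index(kind, equality=(), inequality=None, sort_orders=()):
    """Return the property order of the index that would best serve a query.

    Order follows the notes above: equality-filter properties first, then the
    single inequality-filter property, then any remaining sort-order properties.
    """
    props = [(name, "asc") for name in equality]
    directions = dict(sort_orders)
    if inequality is not None:
        # The inequality property keeps the direction requested by the sort, if any.
        props.append((inequality, directions.get(inequality, "asc")))
    for name, direction in sort_orders:
        if name not in [p for p, _ in props]:
            props.append((name, direction))
    return {"kind": kind, "properties": props}


# Hypothetical query: done = False AND priority > 3 ORDER BY priority DESC, created ASC
print(perfect_index(
    "Task",
    equality=["done"],
    inequality="priority",
    sort_orders=[("priority", "desc"), ("created", "asc")],
))
```

A composite index with this property order is the kind of entry that would be declared in index.yaml.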

DataStore: Implications of Full Indexing
- Updates are really slow
- No joins possible
- Can't filter results based on subquery results
- Can't include more than one inequality filter (one is OK)

DataStore: Multitenancy
- Separate data partitions for each client organization
- Can use the same schema for all clients, but vary the values
- Specified via a namespace (inside which kinds and entities can exist)

DataStore: Transaction Support
- Can optionally use transactions - not required
- Not as strong as Cloud Spanner (which is ACID++), but stronger than BigQuery or BigTable

DataStore: Consistency
- Two consistency levels possible for query results
-- Strongly consistent: return up-to-date result, however long it takes
-- Eventually consistent: faster, but might return stale

DataStore: Locations
You can store your Cloud Datastore data in either a multi-region location or a regional location.

Data in a multi-region location operates in a multi-zone and multi-region replicated configuration. Select a multi-region location if you want to maximize the availability and durability of your database.

Data in a regional location operates in a multi-zone replicated configuration. Select a regional location if your application is more sensitive to write latency or if you want co-location with other Google Cloud Platform resources that your application may use.

DataStore: Server-Side Encryption
Google Cloud Datastore automatically encrypts all data before it is written to disk. There is no setup or configuration required and no need to modify the way you access the service. The data is automatically and transparently decrypted when read by an authorized user.

DataStore: Best Practices: API calls
- Use batch operations for your reads, writes, and deletes instead of single operations. Batch operations are more efficient because they perform multiple operations with the same overhead as a single operation.
- If a transaction fails, ensure you try to rollback the transaction. The rollback minimizes retry latency for a different request contending for the same resource(s) in a transaction. Note that a rollback itself might fail, so the rollback should be a best-effort attempt only.
- Use asynchronous calls where available instead of synchronous calls.

DataStore: Best Practices: Entities
- Group highly related data in entity groups. Entity groups enable ancestor queries, which return strongly consistent results. Ancestor queries also rapidly scan an entity group with minimal I/O because the entities in an entity group are stored at physically close places on Cloud Datastore servers.
- Avoid writing to an entity group more than once per second. Writing at a sustained rate above that limit makes eventually consistent reads more eventual, leads to time outs for strongly consistent reads, and results in slower overall performance of your application. A batch or transactional write to an entity group counts as only a single write against this limit.
- Do not include the same entity (by key) multiple times in the same commit. Including the same entity multiple times in the same commit could impact Cloud Datastore latency.

DataStore: Best Practices: Keys
- Key names are auto-generated if not provided at entity creation. They are allocated so as to be evenly distributed in the key space.
- For a key that uses a custom name, always use UTF-8 characters except a forward slash (/).
- For a key that uses a numeric ID:
>>> Do not use a negative number for the ID.
>>> Do not use the value 0 (zero) for the ID. If you do, you will get an automatically allocated ID.
>>> If you wish to manually assign your own numeric IDs to the entities you create, have your application obtain a block of IDs with the allocateIds() method. This will prevent Cloud Datastore from assigning one of your manual numeric IDs to another entity.

DataStore: Best Practices: Indexes
- If a property will never be needed for a query, exclude the property from indexes. Unnecessarily indexing a property could result in increased latency to achieve consistency, and increased storage costs of index entries.
- Avoid having too many composite indexes. Excessive use of composite indexes could result in increased latency to achieve consistency, and increased storage costs of index entries. If you need to execute ad hoc queries on large datasets without previously defined indexes, use Google BigQuery.
- Do not index properties with monotonically increasing values (such as a NOW() timestamp). Maintaining such an index could lead to hotspots that impact Cloud Datastore latency for applications with high read and write rates.

DataStore: Best Practices: Queries
- If you need to access only the key from query results, use a keys-only query. A keys-only query returns results at lower latency and cost than retrieving entire entities.
- If you need to access only specific properties from an entity, use a projection query. A projection query returns results at lower latency and cost than retrieving entire entities.
- Likewise, if you need to access only the properties that are included in the query filter (for example, those listed in an order by clause), use a projection query.
- Do not use offsets. Instead use cursors. Using an offset only avoids returning the skipped entities to your application, but these entities are still retrieved internally. The skipped entities affect the latency of the query, and your application is billed for the read operations required to retrieve them.
- If you need strong consistency for your queries, use an ancestor query. To use ancestor queries, you first need to structure your data for strong consistency.
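
The difference between offsets and cursors can be illustrated with an in-memory stand-in. This is a toy model, not the Datastore client API: the ENTITIES list and both function names are hypothetical, and the "entities read" count simply mirrors the billing behaviour described above.

```python
# Sorted keys standing in for the entities of one kind (hypothetical data).
ENTITIES = [f"task-{i:03d}" for i in range(10)]


def page_with_offset(offset, limit):
    """Offset paging: the skipped entities are still read internally."""
    scanned = ENTITIES[: offset + limit]
    return scanned[offset:], len(scanned)  # (results, entities read)


def page_with_cursor(cursor, limit):
    """Cursor paging: resume right after the last key that was returned."""
    # list.index stands in for an index seek; a real cursor does not rescan.
    start = 0 if cursor is None else ENTITIES.index(cursor) + 1
    results = ENTITIES[start:start + limit]
    next_cursor = results[-1] if results else cursor
    return results, len(results), next_cursor


print(page_with_offset(6, 3))           # reads 9 entities to return 3
print(page_with_cursor("task-005", 3))  # reads 3 entities to return 3
```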

DataStore: Designing for scale
- A single entity group in Cloud Datastore should not be updated too rapidly.
- Avoid high read or write rates to Cloud Datastore keys that are lexicographically close.
- Gradually ramp up traffic to new Cloud Datastore kinds or portions of the keyspace.
- Avoid deleting large numbers of Cloud Datastore entities across a small range of keys.
- Use sharding or replication for hot Cloud Datastore keys.
>>> You can use replication if you need to read a portion of the key range at a higher rate than permitted.
>>> You can use sharding if you need to write to a portion of the key range at a higher rate than permitted.
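
A sharded counter is one common way to apply the sharding advice. The sketch below uses a plain dict in place of Datastore, and NUM_SHARDS, the key naming scheme and the function names are assumptions made for illustration only.

```python
import random

NUM_SHARDS = 5  # hypothetical; more shards allow a higher sustained write rate
store = {}      # key -> count; in-memory stand-in for Datastore entities


def shard_key(name, shard):
    return f"{name}-shard-{shard}"


def increment(name, rng=random):
    """Write to one randomly chosen shard so no single key takes every write."""
    key = shard_key(name, rng.randrange(NUM_SHARDS))
    store[key] = store.get(key, 0) + 1


def total(name):
    """Read all shards and sum them to get the logical counter value."""
    return sum(store.get(shard_key(name, s), 0) for s in range(NUM_SHARDS))


for _ in range(100):
    increment("page-views")
print(total("page-views"))  # 100, spread over at most NUM_SHARDS keys
```

The trade-off is that reads become more expensive (one read per shard) in exchange for spreading writes across keys.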

Google App Engine: Standard environment
Application instances run in a sandbox, using the runtime environment of a supported language.

Optimal for applications with the following characteristics:
- Python, Java, Node.js, PHP, Go
- Intended to run for free or at very low cost, where you pay only for what you need and when you need it.
- Experiences sudden and extreme spikes of traffic which require immediate scaling.

BigQuery Overview
- Google's serverless, highly scalable, low cost enterprise data warehouse
- no infrastructure to manage
- enables you to analyze all your data by creating a logical data warehouse over managed, columnar storage as well as data from object storage, and spreadsheets.
- allows organizations to capture and analyze data in real-time using its powerful streaming ingestion capability so that your insights are always current.
- free for up to 1TB of data analyzed each month and 10GB of data stored.

BigQuery: Data Model
- Dataset = set of tables and views
- Table must belong to dataset
- Dataset must belong to a project
- Tables contain records with rows and columns (fields)
- Nested and repeated fields are supported.

BigQuery: Table Schema
A table contains individual records organized in rows. Each record is composed of columns (also called fields).
Every table is defined by a schema that describes the column names, data types, and other information. You can specify the schema of a table when it is created, or you can create a table without a schema and declare the schema in the query job or load job that first populates it with data.

BigQuery: Table Types
- Native tables: tables backed by native BigQuery storage.
- External tables: tables backed by storage external to BigQuery, such as Bigtable, Cloud Storage, or Google Drive.
- Views: Virtual tables defined by a SQL query.

BigQuery: Schema Auto-Detection
- Available while
-- Loading data
-- Querying external data
- selects a random file in the data source and scans up to 100 rows of data to use as a representative sample
- Then examines each field and attempts to assign a data type to that field based on the values in the sample
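
The sampling idea can be sketched as follows. This is a simplified illustration, not BigQuery's actual detection logic: the function names are hypothetical, only four type names are considered, and the input is assumed to be CSV text with a header row.

```python
import csv
import io


def infer_type(values):
    """Pick the narrowest type that fits every non-empty sample value."""
    values = [v for v in values if v]
    if not values:
        return "STRING"

    def all_parse(cast):
        try:
            for v in values:
                cast(v)
            return True
        except ValueError:
            return False

    if all(v.lower() in ("true", "false") for v in values):
        return "BOOLEAN"
    if all_parse(int):
        return "INTEGER"
    if all_parse(float):
        return "FLOAT"
    return "STRING"


def detect_schema(csv_text, sample_rows=100):
    """Scan up to sample_rows rows and assign a type to each column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [row for _, row in zip(range(sample_rows), reader)]
    return {name: infer_type([r[name] for r in rows]) for name in reader.fieldnames}


sample = "id,price,active,name\n1,9.5,true,widget\n2,10,false,gadget\n"
print(detect_schema(sample))
```

Because only a sample is inspected, a value further down the file can contradict the inferred type, which is one reason to supply an explicit schema for important tables.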

BigQuery: Loading Data

1. BigQuery manages the technical aspects of storing your structured data, including compression, encryption, replication, performance tuning, and scaling.

2. BigQuery stores data in the Capacitor columnar data format, and offers the standard database concepts of tables, partitions, columns, and rows.

BigQuery: Batch and Streaming Loads
- Batch loads
-- CSV
-- JSON (newline delimited)
-- Avro
-- GCP Datastore backups
- Streaming loads
-- High volume event tracking logs
-- Realtime dashboards

BigQuery: Other Sources
- Cloud storage
- Analytics 360
- Datastore
- Dataflow
- Cloud storage logs

BigQuery: Data Formats
- CSV
- JSON (newline delimited)
- Avro (open source data format that bundles serialized data with the data's schema in the same file)
- Cloud Datastore backups (BigQuery converts data from each entity in Cloud Datastore backup files to BigQuery's data types)
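
Newline-delimited JSON pairs naturally with nested and repeated fields. The sketch below folds a hypothetical orders table and its line items into one record per order and writes one JSON object per line; the table and field names are invented for illustration.

```python
import json

# Hypothetical normalized source data.
orders = [{"order_id": 1, "customer": "alice"}, {"order_id": 2, "customer": "bob"}]
line_items = [
    {"order_id": 1, "sku": "A", "qty": 2},
    {"order_id": 1, "sku": "B", "qty": 1},
    {"order_id": 2, "sku": "A", "qty": 5},
]


def denormalize(orders, line_items):
    """Attach each order's line items as a repeated, nested 'items' field."""
    by_order = {}
    for item in line_items:
        nested = {"sku": item["sku"], "qty": item["qty"]}
        by_order.setdefault(item["order_id"], []).append(nested)
    return [dict(order, items=by_order.get(order["order_id"], [])) for order in orders]


def to_ndjson(records):
    """Serialize one JSON object per line (newline-delimited JSON)."""
    return "\n".join(json.dumps(record, sort_keys=True) for record in records)


print(to_ndjson(denormalize(orders, line_items)))
```

Each output line is a complete record, so a query over the loaded table would not need a join to reach the line items.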

BigQuery: Alternatives to Loading
- Public datasets
- Shared datasets
- Stackdriver log files (needs export - but direct)

BigQuery: Querying and Viewing
- Interactive queries
- Batch queries
- Views
- Partitioned tables

BigQuery: Interactive Queries
- Default mode (executed as soon as possible)
- Count towards limits on
-- Daily usage
-- Concurrent usage

BigQuery: Batch Queries
- BigQuery will schedule these to run whenever possible (idle resources)
- Don't count towards limit on concurrent usage
- If not started within 24 hours, BigQuery makes them interactive

BigQuery: Views
A view is a virtual table defined by a SQL query. When you create a view, you query it in the same way you query a table. When a user queries the view, the query results contain data only from the tables and fields specified in the query that defines the view.

- Can't assign access control - based on user running view
- Can create authorised view: share query results with groups without giving read access to underlying data
- Can give row-level permissions to different users within same view
- Can't export data from a view
- Can't use JSON API to retrieve data
- Can't mix standard and legacy SQL, e.g., standard SQL query can't access legacy-SQL view.
- No user-defined functions allowed
- No wildcard table references allowed
- Limit of 1000 authorized views per dataset.

Queries are billed according to the total amount of data in all table fields referenced directly or indirectly by the top-level query.

BigQuery: Partitioned Tables
- Special table where data is partitioned for you
- No need to create partitions manually or programmatically
- Manual partitions - performance degrades
- Limit of 1000 tables per query does not apply
- Date-partitioned tables offered by BigQuery
- Need to declare table as partitioned at creation time
- No need to specify schema (can do while loading data)
- BigQuery automatically creates date partitions

BigQuery: Query Plan Explanation
In the web UI, click on "Explanation". Helps in debugging complex queries.

Embedded within query jobs, BigQuery includes diagnostic query plan and timing information. This is similar to the information provided by statements such as EXPLAIN in other database and analytical systems. This information can be retrieved from the API responses of methods such as jobs.get.

For long running queries, BigQuery will periodically update these statistics. These updates happen independently of the rate at which the job status is polled, but typically will not happen more frequently than every 30 seconds. Additionally, query jobs that do not leverage execution resources, such as dry run requests or results that can be served from cached results will not include the additional diagnostic information, though other statistics may be present.

BigQuery: Slots
- Unit of Computational capacity needed to run queries
- BigQuery calculates on basis of query size, complexity
- Usually default slots sufficient
- Might need to be expanded for very large, complex queries
- Slots are subject to quota policies
- Can use StackDriver Monitoring to track slot usage

BigQuery Best Practices: Controlling Costs
- Avoid SELECT *. Query only the columns that you need.
- Sample data using preview options.
- Price your queries before running them.
- Limit query costs by restricting the number of bytes billed. Best practice: Use the maximum bytes billed setting to limit query costs.
- LIMIT doesn't affect cost.
- Partition data by date.
- Materialize query results in stages.
- Keeping large result sets in BigQuery storage has a cost. If you don't need permanent access to the results, use the default table expiration to automatically delete the data for you.
- There is no charge for loading data into BigQuery. There is a charge, however, for streaming data into BigQuery.

BigQuery: Estimating Costs
- When you enter a query in the web UI, the query validator verifies the query syntax and provides an estimate of the number of bytes read. You can use this estimate to calculate query cost in the pricing calculator.
- When you run a query in the CLI, you can use the --dry_run flag to estimate the number of bytes read. You can use this estimate to calculate query cost in the pricing calculator.
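
Turning a bytes-read estimate into a cost is simple arithmetic. In the sketch below the price is a placeholder argument, not a quoted rate, and billing per TiB (2^40 bytes) is an assumption; check the current pricing page or the pricing calculator for real values.

```python
TIB = 2 ** 40  # bytes in one tebibyte (assumed billing unit)


def estimate_query_cost(bytes_processed, price_per_tib, free_bytes_remaining=0):
    """Estimate on-demand cost from a dry-run byte count.

    price_per_tib is supplied by the caller; any free monthly allowance still
    available is subtracted before the price is applied.
    """
    billable = max(0, bytes_processed - free_bytes_remaining)
    return billable / TIB * price_per_tib


# Hypothetical: a dry run reports 2 TiB read, at a placeholder price of 5.0 per TiB.
print(estimate_query_cost(2 * TIB, price_per_tib=5.0))  # 10.0
```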

BigQuery: Performance Factors
the amount of work required depends on a number of factors:
- Input data and data sources (I/O): How many bytes does your query read?
- Communication between nodes (shuffling): How many bytes does your query pass to the next stage? How many bytes does your query pass to each slot?
- Computation: How much CPU work does your query require?
- Outputs (materialization): How many bytes does your query write?
- Query anti-patterns: Are your queries following SQL best practices?

BigQuery: Performance: Input Data and Data Sources
Best Practices:
- Control projection - Query only the columns that you need. Avoid SELECT *
- When querying a time-partitioned table, use the _PARTITIONTIME pseudo column to filter the partitions.
- BigQuery performs best when your data is denormalized. Rather than preserving a relational schema such as a star or snowflake schema, denormalize your data and take advantage of nested and repeated fields.
- Querying tables in BigQuery managed storage is typically much faster than querying external tables in Google Cloud Storage, Google Drive, or Google Cloud Bigtable. Use an external data source for these use cases:
>>> Performing extract, transform, and load (ETL) operations when loading data
>>> Frequently changing data
>>> Periodic loads such as recurring ingestion of data from Cloud Bigtable
- Use wildcards to query multiple tables by using concise SQL statements. Wildcard tables are a union of tables that match the wildcard expression. Wildcard tables are useful if your dataset contains:
>>> Multiple, similarly named tables with compatible schemas
>>> Sharded tables

BigQuery: Performance: Optimizing Communication Between Slots
When evaluating your communication throughput, consider the amount of shuffling that is required by your query. How many bytes are passed between stages? How many bytes are passed to each slot? For example, a GROUP BY clause passes like values to the same slot for processing. The amount of data that is shuffled directly impacts communication throughput and as a result, query performance.

Best practice:
- Trim the data as early in the query as possible, before the query performs a JOIN.
- WITH clauses are used primarily for readability because they are not materialized. For example, placing all your queries in WITH clauses and then running UNION ALL is a misuse of the WITH clause. If a query appears in more than one WITH clause, it executes in each clause.
- Partitioned Tables perform better than date-named tables. When you create tables sharded by date, BigQuery must maintain a copy of the schema and metadata for each date-named table. Also, when date-named tables are used, BigQuery might be required to verify permissions for each queried table.
- Table sharding refers to dividing large datasets into separate tables and adding a suffix to each table name. Avoid creating too many table shards. If you are sharding tables by date, use time-partitioned tables instead.

BigQuery: Performance: Optimizing Query Computation
Best practice:
- Avoid using JavaScript user-defined functions. If possible, use a native (SQL) UDF instead.
- If the SQL aggregation function you're using has an equivalent approximation function, the approximation function will yield faster query performance. For example, instead of using COUNT(DISTINCT), use APPROX_COUNT_DISTINCT().
- Use ORDER BY only in the outermost query or within window clauses (analytic functions). Push complex operations to the end of the query.
- For queries that join data from multiple tables, optimize your join patterns. Start with the largest table.
- When you query partitioned tables, use the _PARTITIONTIME pseudo column. Filtering the data using _PARTITIONTIME allows you to specify a date or range of dates.

BigQuery: Performance: Avoiding SQL Anti-Patterns
Best practice:
- Self-joins are typically used to compute row-dependent relationships. A self-join can potentially square the number of output rows, which hurts performance. Use a window (analytic) function instead to reduce the number of additional bytes that are generated by the query.
- Partition skew, or data skew, is when data is partitioned into very unequally sized partitions. This creates an imbalance in the amount of data sent between slots. You can't share partitions between slots, so if one partition is especially large, it can slow down, or even crash the slot that processes the oversized partition.
- Cross joins are queries where each row from the first table is joined to every row in the second table (there are non-unique keys on both sides). The worst case output is the number of rows in the left table multiplied by the number of rows in the right table. In extreme cases, the query might not finish.
- Using point-specific DML statements is an attempt to treat BigQuery like an OLTP system. BigQuery focuses on analytics (OLAP) by using table scans and not point lookups. If you need OLTP-like behavior (single-row updates or inserts), consider a database designed to support OLTP use cases such as Google Cloud SQL.
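
The self-join versus window-function point can be illustrated outside SQL. The toy below computes a day-over-day change two ways on hypothetical (day, sales) rows: a nested loop that pairs every row with every other row, as a self-join would, and a single ordered pass that looks at the previous row, in the spirit of a LAG window function. It assumes the days are consecutive.

```python
rows = [(1, 100), (2, 120), (3, 90), (4, 130)]  # hypothetical (day, sales)


def delta_self_join(rows):
    """Pair every row with every row, keeping matches on day - 1."""
    out, comparisons = [], 0
    for day, sales in rows:
        for prev_day, prev_sales in rows:
            comparisons += 1
            if prev_day == day - 1:
                out.append((day, sales - prev_sales))
    return out, comparisons


def delta_window(rows):
    """Sort once, then compare each row with the one just before it."""
    ordered = sorted(rows)
    out = [
        (day, sales - prev_sales)
        for (_, prev_sales), (day, sales) in zip(ordered, ordered[1:])
    ]
    return out, len(ordered)


print(delta_self_join(rows))  # same deltas, 16 row pairs examined
print(delta_window(rows))     # same deltas, 4 rows visited
```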

Google Data Studio: Introduction
Visualize your data
Data Studio turns your data into informative, easy to read, easy to share, and fully customizable dashboards and reports. Use the drag and drop report editor to create charts, apply filters, color themes, etc.

Connect to your data
Reports in Data Studio get their information from one or more data sources. Using the Data Sources tool, you can easily connect to a wide variety of data, without programming.

Share and collaborate
Data Studio reports and data sources are stored as files on Google Drive. Just as in Drive, it's easy to share your files with individuals, teams, or the world. To tell your data stories as broadly as possible, embed your reports in other pages, such as Google Sites, blog posts, marketing articles, and annual reports.

Google Data Studio: Dimensions and Metrics
Dimensions describe. Metrics measure.
Dimensions are data categories. Dimension values are names, descriptions or other characteristics of a category.

Metrics measure the things contained in dimensions. In Data Studio, metrics values are always aggregated: given metric X, the value can be a sum, a count, a ratio of X, etc.

Metrics are always numbers. Dimensions can be any other kind of data, including unaggregated numbers, dates, text, and boolean (true/false) values.
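
As a rough illustration of "dimensions describe, metrics measure", the sketch below groups hypothetical records by a dimension and aggregates a metric. The field names and the aggregate helper are invented for this example and are not part of Data Studio.

```python
from collections import defaultdict

# Hypothetical rows: "country" is a dimension, "sessions" is a metric.
records = [
    {"country": "US", "sessions": 10},
    {"country": "CA", "sessions": 4},
    {"country": "US", "sessions": 6},
]

AGGREGATIONS = {
    "sum": sum,
    "count": len,
    "avg": lambda values: sum(values) / len(values),
}


def aggregate(records, dimension, metric, how="sum"):
    """Group by the dimension's values, then aggregate the metric per group."""
    groups = defaultdict(list)
    for record in records:
        groups[record[dimension]].append(record[metric])
    return {value: AGGREGATIONS[how](numbers) for value, numbers in groups.items()}


print(aggregate(records, "country", "sessions"))           # {'US': 16, 'CA': 4}
print(aggregate(records, "country", "sessions", "count"))  # {'US': 2, 'CA': 1}
```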

Google Data Studio: Filters
Filters work by including or excluding records in your data that meet a set of conditions that you specify. Include filters retrieve only the records that match the conditions, while exclude filters retrieve only the records that DON'T match the conditions. Note that filters do not transform your data in any way. They simply reduce the amount of data displayed in the report.

Filter conditions consist of one or more clauses. Multiple clauses can be joined with "OR" logic (true if any conditions are met), "AND" logic (true if all conditions are met), or both.
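
A minimal sketch of include and exclude filters with combined AND/OR logic follows. The representation (a list of AND-ed groups, each holding OR-ed conditions) is an assumption chosen for the example, as are the function and field names.

```python
def matches(record, and_groups):
    """True if every group has at least one condition that holds (AND of ORs)."""
    return all(any(cond(record) for cond in or_group) for or_group in and_groups)


def apply_filter(records, and_groups, mode="include"):
    """Include keeps matching records; exclude keeps the ones that don't match."""
    if mode == "include":
        return [r for r in records if matches(r, and_groups)]
    return [r for r in records if not matches(r, and_groups)]


sales = [
    {"country": "US", "qty": 150},
    {"country": "CA", "qty": 50},
    {"country": "JP", "qty": 300},
]

# (country is US OR country is CA) AND qty > 100
best_sellers = [
    [lambda r: r["country"] == "US", lambda r: r["country"] == "CA"],
    [lambda r: r["qty"] > 100],
]

print(apply_filter(sales, best_sellers, "include"))  # only the US row
print(apply_filter(sales, best_sellers, "exclude"))  # the CA and JP rows
```

Both calls return new lists and leave the source records untouched, which mirrors the point that filters reduce what is displayed without transforming the data.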

Google Data Studio: What you can filter
You can apply filters to the following components:
- Charts. For example, you can display a pie chart of new versus returning users in your biggest markets with a filter including Country IN "United States,Canada,Mexico,Japan"
- Filter controls. For example, you can let your viewers select from a list of best selling products on Quantity Sold Greater than (>) 100.
- Groups. For example, you can group 2 sets of charts and filter on Device Category to show website traffic in one set and mobile traffic in the other.
- Pages. Page-level filters apply to every chart on that page. For example, you can dedicate page 1 of your Google Analytics report to mobile app traffic, and page 2 to desktop traffic by filtering on the Device Category dimension.
- Reports. Every chart in the report is subject to the filter. For example, you can create a report that focuses on your best customers by setting the report-level filter property to Lifetime Value Greater than or equal to 10,000.

Google Data Studio: How the cache works
There are 2 parts to Data Studio cache: the query cache, and the prefetch cache.

Query cache: The query cache remembers the queries (requests for data) issued by the components in a report. When a person viewing the report requests the exact same query (i.e., the same dimensions, metrics, filter conditions, and date range) as a previously received query, then the data is served from the cache. If the response can't be served from the query cache, Data Studio next looks to the prefetch cache.

Prefetch cache: The prefetch cache (A.K.A. the "Smart cache") predicts the data that a component could request by analyzing the dimensions, metrics, filters, and date range properties and controls on the report. Data Studio then stores (prefetches) as much of the data as possible that could be used to answer the predicted queries.

When a query can't be answered by the query cache, Data Studio tries to answer it using this prefetched data. If the query can't be answered by the prefetch cache, the data will come from the underlying data set.

Cache refresh and expiration: Both the query cache and prefetch cache automatically expire periodically (approximately every 12 hours). If you can edit the report, you can refresh both caches at any time by viewing the report and clicking Refresh data.
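
The lookup order described above (query cache, then prefetch cache, then the underlying data set) can be modelled with a small class. This is a toy, not Data Studio's implementation: ReportCache and its methods are hypothetical, the prefetch cache is simplified to one entry per predicted query, and the 12-hour expiry is taken from the approximate figure in these notes.

```python
import time

TTL_SECONDS = 12 * 60 * 60  # "approximately every 12 hours"


class ReportCache:
    def __init__(self, fetch, now=time.time):
        self.fetch = fetch        # function that queries the underlying data set
        self.now = now            # injectable clock, handy for experiments
        self.query_cache = {}     # query -> (stored_at, data)
        self.prefetch_cache = {}  # query -> (stored_at, data)

    def _lookup(self, cache, query):
        entry = cache.get(query)
        if entry is not None and self.now() - entry[0] < TTL_SECONDS:
            return entry
        return None

    def prefetch(self, predicted_queries):
        """Store data for queries that components are predicted to issue."""
        for query in predicted_queries:
            self.prefetch_cache[query] = (self.now(), self.fetch(query))

    def refresh(self):
        """Equivalent of clicking Refresh data: drop both caches."""
        self.query_cache.clear()
        self.prefetch_cache.clear()

    def get(self, query):
        """Return (where the answer came from, data)."""
        for source, cache in (("query cache", self.query_cache),
                              ("prefetch cache", self.prefetch_cache)):
            entry = self._lookup(cache, query)
            if entry is not None:
                return source, entry[1]
        data = self.fetch(query)
        self.query_cache[query] = (self.now(), data)
        return "data set", data


clock = [0]
cache = ReportCache(fetch=lambda q: f"rows for {q}", now=lambda: clock[0])
print(cache.get("sessions by country"))  # first request goes to the data set
print(cache.get("sessions by country"))  # identical request hits the query cache
```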

Google Data Studio: Sharing Permissions
Owner access: "Is owner" access means you have complete control over the file.

View access: "Can view" access lets users see the report as a whole, or see the schema of the data source. View access to a report lets people interact with any filters or date range controls available. It does not let users change the data source or report in any way.

Edit access: "Can edit" access lets users modify the report or data source. For reports, users can add, change or remove charts and controls. They can add and remove data sources, change the report styling, and set up new filters or modify existing ones.

Edit access to a data source lets users modify its schema. They can add or change calculated fields, disable and enable fields, and change data types and field aggregations (when permitted by the data source).

DAG
Directed Acyclic Graph

Examples: Flink, Apache Beam, TensorFlow

DataFlow
- Cloud Dataflow is a unified programming model and a managed service for developing and executing a wide variety of data processing patterns.
- Cloud Dataflow includes SDKs for defining data processing workflows, and a Cloud Platform managed service to run those workflows on Google Cloud Platform resources such as Compute Engine, BigQuery, and more.
- Used to transform data
- Loosely semantically equivalent to Apache Spark
- Based on Apache Beam. Dataflow (1.x) was not based on Beam

Dataflow Programming Model
The Dataflow programming model is designed to simplify the mechanics of large-scale data processing. When you program with a Dataflow SDK, you are essentially creating a data processing job to be executed by one of the Cloud Dataflow runner services. This model lets you concentrate on the logical composition of your data processing job, rather than the physical orchestration of parallel processing.

The Dataflow model provides a number of useful abstractions that insulate you from low-level details of distributed processing, such as coordinating individual workers, sharding data sets, and other such tasks. These low-level details are fully managed for you by Cloud Dataflow's runner services.

DataFlow (Apache Beam): Programming Model: Major Concepts
When you think about data processing with Dataflow, you can think in terms of four major concepts:

- Pipeline: single, potentially repeatable job, from start to finish, in Dataflow
- PCollection: specialized container classes that can represent data sets of virtually unlimited size
- Transforms: takes one or more PCollections as input, performs a processing function that you provide on the elements of that PCollection, and produces an output PCollection.
- I/O Sources & Sinks: different data storage formats, such as files in Google Cloud Storage, BigQuery tables

DataFlow (Apache Beam): Pipeline
- Pipeline: single, potentially repeatable job, from start to finish, in Dataflow
- Encapsulates series of computations that accepts some input data from external sources, transforms data to provide some useful intelligence, and produce output
- A pipeline consists of two parts: data (PCollections) and transforms applied to that data (Transforms).
- Defined by driver program
-- The actual pipeline computations run on a backend, abstracted in the driver by a runner.

DataFlow (Apache Beam): Driver and Runner
- Driver defines computation DAG (pipeline)
- Runner executes DAG on a backend
- Beam supports multiple backends
-- Apache Spark
-- Apache Flink
-- Google Cloud Dataflow
-- Beam Model

DataFlow (Apache Beam): Typical Beam Driver Program
- Create a Pipeline object
- Create an initial PCollection for pipeline data
-- Source API to read data from an external source
-- Create transform to build a PCollection from in-memory data.
- Define the Pipeline transforms to change, filter, group, analyse PCollections
-- transforms do not change input collection
- Output the final, transformed PCollection(s), typically using the Sink API to write data to an external source.
- Run the pipeline using the designated Pipeline Runner.

DataFlow (Apache Beam): PCollection
- A PCollection represents a set of data in your pipeline.
- The Dataflow PCollection classes are specialized container classes that can represent data sets of virtually unlimited size.
- A PCollection can hold a data set of a fixed size (such as data from a text file or a BigQuery table), or an unbounded data set from a continuously updating data source (such as a subscription from Google Cloud Pub/Sub).

PCollections are the inputs and outputs for each step in your pipeline.

DataFlow: PCollection: Limitations
A PCollection has several key aspects in which it differs from a regular collection class:
- A PCollection is immutable. Once created, you cannot add, remove, or change individual elements.
- A PCollection does not support random access to individual elements.
- A PCollection belongs to the pipeline in which it is created. You cannot share a PCollection between Pipeline objects.

- A PCollection may be physically backed by data in existing storage, or it may represent data that has not yet been computed.
- You can use a PCollection in computations that generate new pipeline data (as a new PCollection); however, you cannot change the elements of an existing PCollection once it has been created.

DataFlow (Apache Beam): Side Inputs
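Side inputs are additional inputs that a transform's processing function can read each time it processes an element of the main input PCollection.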

DataFlow (Apache Beam): Transforms
A transform is a data processing operation, or a step, in your pipeline. A transform takes one or more PCollections as input, performs a processing function that you provide on the elements of that PCollection, and produces an output PCollection.

Your transforms don't need to be in a strict linear sequence within your pipeline. You can use conditionals, loops, and other common programming structures to create a branching pipeline or a pipeline with repeated structures. You can think of your pipeline as a directed graph of steps, rather than a linear sequence.

DataFlow: Transforms: Types
- Core Transforms
- Composite Transforms

DataFlow: Transforms: Core Transforms
Core transforms form the basic building blocks of pipeline processing. Each core transform provides a generic processing framework for applying business logic that you provide to the elements of a PCollection.

When you use a core transform, you provide the processing logic as a function object. The function you provide gets applied to the elements of the input PCollection(s). Instances of the function may be executed in parallel across multiple Google Compute Engine instances, given a large enough data set, and pending optimizations performed by the pipeline runner service. The worker code function produces the output elements, if any, that are added to the output PCollection(s).

Requirements for User-Provided Function Objects:
- Your function object must be serializable.
- Your function object must be thread-compatible, and be aware that the Dataflow SDKs are not thread-safe.
- We recommend making your function object idempotent.
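
To make the idea of core transforms concrete, here is a toy model in plain Python. It is not the Beam or Dataflow SDK: par_do, group_by_key and combine_per_key are hypothetical stand-ins for ParDo, GroupByKey and Combine, and tuples stand in for immutable PCollections. Each step returns a new collection and leaves its input unchanged.

```python
def par_do(pcoll, fn):
    """Apply fn to each element; fn returns zero or more output elements."""
    return tuple(out for element in pcoll for out in fn(element))


def group_by_key(pcoll):
    """Collect the values of (key, value) pairs under each key."""
    groups = {}
    for key, value in pcoll:
        groups.setdefault(key, []).append(value)
    return tuple((key, tuple(values)) for key, values in groups.items())


def combine_per_key(pcoll, combine_fn):
    """Reduce each key's grouped values to a single value."""
    return tuple((key, combine_fn(values)) for key, values in pcoll)


lines = ("a b", "b c b")                           # input collection
words = par_do(lines, lambda line: line.split())   # one line -> many words
pairs = par_do(words, lambda word: [(word, 1)])    # one word -> one pair
counts = combine_per_key(group_by_key(pairs), sum)

print(dict(counts))  # {'a': 1, 'b': 3, 'c': 1}
print(lines)         # the input is unchanged
```

The lambdas passed in are pure functions of their input, which is what makes it safe for a real runner to retry them or run them in parallel.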

DataFlow: Transforms: Composite Transforms
The model of transforms in the Dataflow SDKs is modular, in that you can build a transform that is implemented in terms of other transforms. You can think of a composite transform as a complex step in your pipeline that contains several nested steps.

Apache Beam: What
- ParDo (Parallel Do)
- GroupByKey
- Flatten
- Combine
- Composite Transforms.
- Side Inputs
- Source API
- Metrics
- Stateful Processing

Apache Beam: Where
- Global windows
- Fixed windows
- Sliding windows
- Session windows
- Custom windows
- Custom merging windows
- Timestamp control
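
The three most common window types can be sketched with integer timestamps. This is a simplified model written for these notes, not Beam's windowing API; the function names are hypothetical and windows are half-open intervals (start, end).

```python
def fixed_window(ts, size):
    """The single fixed window of the given size that contains ts."""
    start = ts - ts % size
    return (start, start + size)


def sliding_windows(ts, size, period):
    """Every window of length size, starting each period, that contains ts."""
    last_start = ts - ts % period
    starts = range(last_start, ts - size, -period)
    return sorted((start, start + size) for start in starts)


def session_windows(timestamps, gap):
    """Merge events into sessions that close after gap units of inactivity."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts < sessions[-1][1]:
            sessions[-1][1] = ts + gap        # event extends the open session
        else:
            sessions.append([ts, ts + gap])   # event starts a new session
    return [tuple(session) for session in sessions]


print(fixed_window(7, size=5))                    # (5, 10)
print(sliding_windows(7, size=10, period=5))      # [(0, 10), (5, 15)]
print(session_windows([1, 3, 20, 22, 40], gap=5)) # [(1, 8), (20, 27), (40, 45)]
```

An element lands in exactly one fixed window, in size / period sliding windows, and in a session whose bounds depend on the neighbouring elements.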

Apache Beam: When
- Configurable triggering
- Event-time triggers
- Processing-time triggers
- Count triggers
- [Meta]data driven triggers
- Composite triggers
- Allowed lateness
- Timers

Apache Beam: How
- Discarding
- Accumulating
- Accumulating & Retracting
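The three accumulation modes differ only in what each trigger firing emits for a window. The sketch below models one window whose aggregate is a sum; fire_panes, the mode strings and the ("emit"/"retract", value) tuples are hypothetical names for this illustration and not part of the Beam SDK.

```python
def fire_panes(pane_inputs: list, mode: str) -> list:
    """Simulate trigger firings for one window that computes a sum.

    pane_inputs holds, per firing, the values that arrived since the
    previous firing. Returns one list of (kind, value) tuples per firing.
    """
    results = []
    total = 0
    last_emitted = None
    for arrived in pane_inputs:
        delta = sum(arrived)
        total += delta
        if mode == "discarding":
            # Only the new data since the last firing.
            results.append([("emit", delta)])
        elif mode == "accumulating":
            # The full result so far; downstream must overwrite.
            results.append([("emit", total)])
        elif mode == "accumulating_and_retracting":
            # Withdraw the previous result, then emit the new one.
            firing = []
            if last_emitted is not None:
                firing.append(("retract", last_emitted))
            firing.append(("emit", total))
            results.append(firing)
        else:
            raise ValueError(f"unknown mode: {mode}")
        last_emitted = total
    return results


panes = [[1, 2], [3], [4, 5]]
for mode in ("discarding", "accumulating", "accumulating_and_retracting"):
    print(mode, fire_panes(panes, mode))
```

Discarding suits consumers that add the panes up themselves, accumulating suits consumers that overwrite the previous value, and retractions are needed when a later stage regroups the data and must undo the earlier contribution.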

DataFlow (Apache Beam): I/O Sources and Sinks
- Source & Sink: different data storage formats, such as files in Google Cloud Storage, BigQuery tables
- Custom sources and sinks possible too

Source:
- Twitter feed
- log messages

Sink:
- BigQuery
- BigTable

Dataproc
- Managed Hadoop + Spark
- Includes: Hadoop, Spark, Hive and Pig
- "No-ops": create cluster, use it, turn it off using Cloud Dataproc automation.
-- Use Google Cloud Storage, not HDFS - else billing will hurt
- Ideal for moving existing code to GCP

Dataproc: Cluster Web Interfaces
Some of the core open source components included with Google Cloud Dataproc clusters, such as Apache Hadoop and Apache Spark, provide Web interfaces. These interfaces can be used to manage and monitor cluster resources and facilities, such as the YARN resource manager, the Hadoop Distributed File System (HDFS), MapReduce, and Spark. Other components or applications that you install on your cluster may also provide Web interfaces (for example, Install and run a Jupyter notebook on a Cloud Dataproc cluster).

YARN ResourceManager: http://master-host-name:8088
HDFS NameNode: http://master-host-name:9870 (In earlier Cloud Dataproc releases (pre-1.2), the HDFS Namenode Web UI port was 50070)

Dataproc: Cluster Machine Types
- Built using Compute Engine VM instances
- Cluster - Need at least 1 master and 2 workers
- Preemptible instances - OK if used with care

Dataproc: Using Preemptible VMs
All preemptible instances added to a cluster use the machine type of the cluster's non-preemptible worker nodes. The addition or removal of preemptible workers from a cluster does not affect the number of non-preemptible workers in the cluster.

Cloud Dataproc adds preemptible instances as secondary workers in a managed instance group, which contains only preemptible workers. The managed group automatically re-adds workers lost due to reclamation as capacity permits.

Rules for using preemptible workers with a Cloud Dataproc cluster:
- Processing only - Since preemptibles can be reclaimed at any time, preemptible workers do not store data. Preemptibles added to a Cloud Dataproc cluster only function as processing nodes.
- No preemptible-only clusters - To ensure clusters do not lose all workers, Cloud Dataproc cannot create preemptible-only clusters. Cloud Dataproc will automatically add two non-preemptible workers to the cluster.
- Persistent disk size - As a default, all preemptible workers are created with the smaller of 100GB or the primary worker boot disk size.

Dataproc: Initialisation Actions
When creating a Cloud Dataproc cluster, you can specify initialization actions in executables or scripts that Cloud Dataproc will run on all nodes in your Cloud Dataproc cluster immediately after the cluster is set up. Initialization actions often set up job dependencies, such as installing Python packages, so that jobs can be submitted to the cluster without having to install dependencies when the jobs are run.

- Can specify scripts to be run from GitHub or Cloud Storage
- Can do so via GCP console, gcloud CLI or programmatically
- Run as root (i.e. no sudo required)
- Use absolute paths
- Use shebang line to indicate script interpreter

Dataproc: Scaling Clusters
- Can scale up/down even when jobs are running
- Operations for scaling are:
-- Add workers to run jobs faster.
-- Remove workers to save on cost.
-- Add HDFS storage
- Because clusters can be scaled more than once, you might want to increase/decrease the cluster size at one time, and then decrease/increase the size later.

Graceful Decommissioning:
When you downscale a cluster, work in progress may terminate before completion. If you are using Cloud Dataproc v 1.2 or later, you can use Graceful Decommissioning, which incorporates graceful YARN decommissioning to finish work in progress on a worker before it is removed from the Cloud Dataproc cluster.

Dataproc: High Availability (Beta)
When creating a Google Cloud Dataproc cluster, you can put the cluster into Hadoop High Availability (HA) mode by specifying the number of master instances in the cluster. The number of masters can only be specified at cluster creation time.

Currently, Cloud Dataproc supports two master configurations:
- 1 master (default, non HA)
- 3 masters (Hadoop HA)

Instance Names
The default master is named cluster-name-m; HA masters are named cluster-name-m-0, cluster-name-m-1, cluster-name-m-2.
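The naming rule is easy to get wrong in scripts that SSH to a master, so here is a small sketch of it. The function name master_instance_names is hypothetical; it only encodes the two configurations and the name patterns listed above.

```python
def master_instance_names(cluster_name: str, num_masters: int = 1) -> list:
    """Return the master VM names for a cluster with 1 or 3 masters."""
    if num_masters == 1:
        # Default, non-HA: a single master named <cluster-name>-m.
        return [f"{cluster_name}-m"]
    if num_masters == 3:
        # Hadoop HA: masters are numbered from 0.
        return [f"{cluster_name}-m-{i}" for i in range(3)]
    raise ValueError("supported configurations are 1 master or 3 masters")


print(master_instance_names("etl"))
print(master_instance_names("etl", 3))
```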

Apache ZooKeeper
In an HA Cloud Dataproc cluster, all masters participate in a ZooKeeper cluster, which enables automatic failover for other Hadoop services.

Dataproc: Single Node Clusters
Single node clusters are Cloud Dataproc clusters with only one node. This single node acts as the master and worker for your Cloud Dataproc cluster.

There are a number of situations where single node Cloud Dataproc clusters can be useful, including:
- Trying out new versions of Spark and Hadoop or other open source components
- Building proof-of-concept (PoC) demonstrations
- Lightweight data science
- Small-scale non-critical data processing
- Education related to the Spark and Hadoop ecosystem

Limitations:
- Single node clusters are not recommended for large-scale parallel data processing.
- n1-standard-1 machine types have limited resources and are not recommended for YARN applications.
- Single node clusters are not available with high-availability since there is only one node in the cluster.
- Single node clusters cannot use preemptible VMs.

Dataproc: Restartable Jobs
- By default, Dataproc jobs do NOT restart on failure
- Can optionally change this - useful for long-running and streaming jobs (eg Spark Streaming)
- Specify the maximum number of retries per hour (the upper limit is 10 retries per hour)
- Mitigates out-of-memory, unscheduled reboots

Dataproc: Connectors
- BigQuery
- BigTable
- Cloud Storage

Dataproc: BigQuery Connector
- You can use a BigQuery connector to enable programmatic read/write access to BigQuery.
- This is an ideal way to process data that is stored in BigQuery. No command-line access is exposed.
- The BigQuery connector is a Java library that enables Hadoop to process data from BigQuery.

Pricing considerations:
When using the connector, you will also be charged for any associated BigQuery usage fees. Additionally, the BigQuery connector downloads data into a Cloud Storage bucket before running a Hadoop job. After the Hadoop job successfully completes, the data is deleted from Cloud Storage. You are charged for storage according to Cloud Storage pricing.

Dataproc: Cloud Storage connector
The Cloud Storage connector is an open source Java library that lets you run Apache Hadoop or Apache Spark jobs directly on data in Cloud Storage, and offers a number of benefits over choosing the Hadoop Distributed File System (HDFS).

Benefits of the Cloud Storage connector:
- Direct data access - Store your data in Cloud Storage and access it directly, with no need to transfer it into HDFS first.
- HDFS compatibility - You can easily access your data in Cloud Storage using the gs:// prefix instead of hdfs://.
- Interoperability - Storing data in Cloud Storage enables seamless interoperability between Spark, Hadoop, and Google services.
- Data accessibility - When you shut down a Hadoop cluster, you still have access to your data in Cloud Storage, unlike HDFS.
- High data availability - Data stored in Cloud Storage is highly available and globally replicated without a loss of performance.
- No storage management overhead - Unlike HDFS, Cloud Storage requires no routine maintenance such as checking the file system, upgrading or rolling back to a previous version of the file system, etc.
- Quick startup - In HDFS, a MapReduce job can't start until the NameNode is out of safe mode, a process that can take from a few seconds to many minutes depending on the size and state of your data. With Cloud Storage, you can start your job as soon as the task nodes start, leading to significant cost savings over time.

Cloud Dataproc vs Cloud Dataflow
Cloud Dataproc:
Cloud Dataproc is good for environments dependent on specific components of the Apache big data ecosystem:
- Tools/packages
- Pipelines
- Skill sets of existing resources

Cloud Dataflow:
Cloud Dataflow is typically the preferred option for greenfield environments:
- Less operational overhead
- Unified approach to development of batch or streaming pipelines
- Uses Apache Beam
- Supports pipeline portability across Cloud Dataflow, Apache Spark, and Apache Flink as runtimes

Cloud Dataproc/Dataflow: Recommended Workflows
- Stream processing (ETL): Dataflow
- Batch processing (ETL): Both Dataproc & Dataflow
- Iterative processing and notebooks: Dataproc
- Machine learning with Spark ML: Dataproc
- Preprocessing for machine learning: Dataflow (with Cloud ML Engine)

Cloud Dataflow vs. Cloud Dataproc: Which should you use?
Cloud Dataproc:
If you have dependencies on specific tools/packages in the Apache Hadoop/Spark ecosystem or if you favour a hands-on/DevOps approach to operations.

Cloud Dataflow:
If you don't have any dependencies on the Hadoop/Spark ecosystem or favour a hands-off/serverless approach.

Pub/Sub
- Messaging "middleware"
- Many-to-many asynchronous messaging
- Decouple sender and receiver

Pub/Sub: Message Life
- A publisher application creates a topic in the Google Cloud Pub/Sub service and sends messages to the topic. A message contains a payload and optional attributes that describe the payload content.
- Messages are persisted in a message store until they are delivered and acknowledged by subscribers.
- The Pub/Sub service forwards messages from a topic to all of its subscriptions, individually.
- Each subscription receives messages either by Pub/Sub pushing them to the subscriber's chosen endpoint, or by the subscriber pulling them from the service.
- The subscriber receives pending messages from its subscription and acknowledges each one to the Pub/Sub service.
- When a message is acknowledged by the subscriber, it is removed from the subscription's queue of messages.
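The message life cycle above can be modelled in a few lines. This is an in-memory toy, not the Cloud Pub/Sub client library: the Topic class and its subscribe, publish, pull and acknowledge methods are hypothetical names, and real delivery, retries and ordering are ignored.

```python
import itertools


class Topic:
    """Toy topic that keeps one pending-message queue per subscription."""

    def __init__(self, name: str):
        self.name = name
        self.subscriptions = {}  # subscription name -> {message id: message}
        self._ids = itertools.count(1)

    def subscribe(self, subscription_name: str) -> None:
        self.subscriptions[subscription_name] = {}

    def publish(self, data: bytes, **attributes) -> int:
        """Store the message for every subscription, individually."""
        message_id = next(self._ids)
        message = {"id": message_id, "data": data, "attributes": attributes}
        for pending in self.subscriptions.values():
            pending[message_id] = message
        return message_id

    def pull(self, subscription_name: str) -> list:
        """Return pending messages; they stay pending until acknowledged."""
        return list(self.subscriptions[subscription_name].values())

    def acknowledge(self, subscription_name: str, message_id: int) -> None:
        """Remove an acknowledged message from that subscription's queue only."""
        self.subscriptions[subscription_name].pop(message_id, None)


topic = Topic("orders")
topic.subscribe("billing")
topic.subscribe("audit")
msg_id = topic.publish(b"order-1", region="eu")

billing_before_ack = topic.pull("billing")
topic.acknowledge("billing", msg_id)
billing_after_ack = topic.pull("billing")
audit_pending = topic.pull("audit")
print(len(billing_before_ack), len(billing_after_ack), len(audit_pending))
```

Acknowledging on the billing subscription leaves the audit subscription's copy untouched, which is the decoupling that lets several independent consumers read the same topic.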

Pub/Sub: Basics
- Publisher apps create and send messages on a Topic
-- Messages persisted in a message store until delivered/acknowledged
-- One queue per subscription
- Subscriber apps subscribe to a topic to receive messages
-- Push - WebHook endpoint
-- Pull - HTTPS request to endpoint
- Subscription is a queue (message stream) to a subscriber
- Message = data + attributes sent by publisher to a topic
- Message Attributes = key-value pairs sent by publisher with message
- Once acknowledged by subscriber, message deleted from queue

Pub/Sub: Use-cases
- Balancing workloads in network clusters
- Asynchronous order processing
- Distributing event notifications
- Refreshing distributed caches
- Logging to multiple systems simultaneously
- Data streaming
- Reliability improvement

Pub/Sub: Architecture
- Data plane, which handles moving messages between publishers and subscribers
- Control plane, which handles the assignment of publishers and subscribers to servers on the data plane
- The servers in the data plane are called forwarders, and the servers in the control plane are called routers.

Pub/Sub: Publishers
- Any application that can make HTTPS requests to googleapis.com
-- App Engine app
-- App running on Compute Engine instance
-- App running on third party network
-- Any mobile or desktop app
-- Even a browser

Pub/Sub: Subscribers
- Pull subscribers - Any app that can make HTTPS requests to googleapis.com
- Push subscribers - must be a WebHook endpoint that can accept POST requests over HTTPS

GCP Virtual Machines: Overview
- VMs offer many useful features such as live migration which allows them to remain up even during maintenance events
- Rightsizing recommendations allow you to use the right sized machines for your workloads
- Google offers sustained use and committed use discounts which help reduce your cloud bill
- Images help you instantiate new VMs with the OS and applications of your choice baked in

VM: Live Migration
- Keeps your VM instances running even during a hardware or software update
- Live migrates your instance to another host in the same zone without rebooting VMs
-- infrastructure maintenance and upgrades
-- network and power grid maintenance
-- Failed hardware
-- Host and BIOS updates
-- Security changes etc
- VM gets a notification that it needs to be evicted
- A new VM is selected for migration, the empty "target"
- A connection is authenticated between the two
- Instances with GPUs cannot be live migrated; they get a 60 minute notice before termination
- Instances with local SSDs attached can be live migrated
- Preemptible instances cannot be live migrated; they are always terminated

VM: Live Migration Stages
- Pre-migration brownout: VM executing on source when most of the state is sent from source to target
- Blackout: A brief moment when the VM is not running anywhere.
- Post-migration brownout: VM is on the target, the source is present and might offer support (forwards packets from the source to target VMs till networking is updated)

VM: Cloud Platform Free Tier
- 1 f1-micro VM instance per month (US regions, excluding Northern Virginia).
- 30 GB of Standard persistent disk storage per month.
- 5 GB of snapshot storage per month.
- 1 GB egress from North America to other destinations per month (excluding Australia and China).

VM: Billing Model
- All machine types are charged for a minimum of 1 minute
- After 1 minute instances are charged in 1 second increments

VM: Shared Core
- Ideal for applications that do not require a lot of resources
- Small, non-resource intensive applications

VM: Shared Core Bursting
- f1-micro machine types offer bursting capabilities that allow instances to use additional physical CPU for short periods of time.
- Bursting happens automatically when needed.
- The instance will automatically take advantage of available CPU in bursts.
- Bursts are not permanent, only possible periodically.

VM: High Memory Machines
- More memory per vCPU as compared with regular machines
- Useful for tasks which require more memory as compared to processing
- 6.5 GB of RAM per core

VM: High CPU Machines
- More vCPUs relative to memory as compared with regular machines

VM: Custom Machines
- If none of the predefined machine types fit your workloads, use a custom machine type
- Save the cost of running on a machine which is more powerful than what you need
- Billed according to the number of vCPUs and the amount of memory used

VM: Sustained Use Discounts
- Discounts for running a VM instance for a significant portion of the billing month
- Once you run an instance for more than 25% of the month, you get a discount for every incremental minute
- Applied automatically, no action needed to avail of these

VM: Inferred Instances
- Compute engine gives you the maximum available discount by clubbing instance usage together
- Different instances running the same predefined machine type are combined to create inferred instances

VM: Sustained Discounts for Custom Machines
- Calculates sustained use discounts by combining memory and CPU usage
- Tries to combine resources to qualify for the biggest sustained usage discounts possible

VM: Rightsizing Recommendations
- Compute Engine provides machine recommendations to help optimize resource utilization
- Automatically generated based on system metrics gathered by Stackdriver monitoring
- Uses last 8 days of data for recommendations

VM: RAM Disk
- Allocate high performance memory to use as a disk
- A RAM disk has very low latency and high performance
- Used when your application expects a file system structure and can't store data in memory
- No storage redundancy or flexibility
- Shares memory with your applications
- Contents stay only as long as the VM is up

VM: Image (definition)
An image in Compute Engine is a cloud resource that provides a reference to an immutable disk.

VM: Images
- Used to create boot disks for VM instances
- Public images:
-- provided and maintained by Google, open source communities, third party vendors
-- all projects have access and can use them
- Custom images:
-- Available only to your project
-- Create a custom image from boot disks and other images
- Most of the public images can be used for no cost
- Some premium images may have an additional cost
- Custom images that you import to compute engine add no cost to your instance
- They incur an image storage charge when stored in your project (tar and gzipped file)
- Images are configured as a part of the instance template of a managed instance group

VM: Image Contents
- Boot loader
- Operating system
- File system structure
- Software
- Customizations

VM: Premium Images
- Additional per second charges, same charges across the world
- Red Hat Enterprise Linux, Microsoft Windows
- Charges vary based on the machine type used
- SQL Server images are charged per minute

VM: Startup Scripts
- Used to customize the instance created using a public image
- The script runs commands that deploy the application as the instance boots
- Script should be idempotent to avoid inconsistent or partially configured state

VM: Baking
- A more efficient way to provision infrastructure
- Create a custom image with your configuration incorporated into the public image

VM: Startup Scripts vs. Baking
Startup Scripts:
- Longer for the instance to be ready
- Startup scripts might fail and have to be idempotent
- Rollback has to be handled for applications and image separately
- The script will need to install dependencies during application deployment
- Each deployment might reference different versions if the latest version of the software has changed

Baking:
- Much faster to go from boot to application readiness
- Much more reliable for application deployments
- Version management is easier, easier to rollback to previous versions
- Fewer external dependencies during application bootstrap
- Scaling up creates instances with identical software versions

VM: Image Lifecycle
DEPRECATED: Images that are no longer the latest, but can still be launched by users. Users will see a warning at launch that they are no longer using the most recent image.
OBSOLETE: Images that should not be launched by users or automation. An attempt to create an instance from these images will fail. You can use this image state to archive images so their data is still available when mounted as a non-boot disk.
DELETED: Images that have already been deleted or are marked for deletion in the future. These cannot be launched, and you should delete them as soon as possible.

VPC
Virtual Private Cloud

A global private isolated virtual network partition that provides managed networking functionality

VPC: Overview
- Resources in GCP projects are split across VPCs (Virtual Private Clouds)
- Routes and forwarding rules must be configured to allow traffic within a VPC and with the outside world
- Traffic flows only after firewall rules are configured specifying what traffic is allowed or not
- VPN, peering, shared VPCs are some of the ways to connect VPCs or a VPC with an on premise network

GCP Virtual Private Cloud
A VPC network is a virtual version of a physical network, like a data center network. It provides connectivity for your Compute Engine virtual machine (VM) instances, Kubernetes Engine clusters, App Engine Flex instances, and other resources in your project.
Projects can contain multiple VPC networks. New projects start with a default network that has one subnet in each region (an auto mode network).

VPC: Features
- Global: Resources from across zones, regions. VPCs are global. Subnets are regional.
- Multi-tenancy: VPCs can be shared across GCP projects
- Private and secure: IAM, firewall rules
- Scalable: Add new VMs, containers to the network, without any workload shutdown or downtime.

- A single project has a quota of 5 networks
- A single network has a limit of 7000 instances
- Within a network the resources communicate with each other often and are trusted
- Resources in other networks are treated just like any other external resource (even if they are in the same project)

VPC: Subnets
Logical partitioning of the network
- Defined by an IP address prefix range
- Specified in CIDR notation
- IP ranges cannot overlap between subnets
- Subnets in the GCP can contain resources only from a single region

CIDR notation
- 10.123.9.0/24
- Contains all IP addresses in the range 10.123.9.0 to 10.123.9.255
- the /24 represents the number of bits which is the network prefix
- Each subnet has a contiguous private RFC1918 IP space
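The CIDR example can be checked with Python's standard ipaddress module. The second and third ranges below are made-up values used only to show an overlapping and a non-overlapping candidate subnet.

```python
import ipaddress

subnet = ipaddress.ip_network("10.123.9.0/24")

# First and last addresses covered by the /24 prefix.
first, last = subnet[0], subnet[-1]
print(first, last, subnet.num_addresses)

# /24 means the first 24 bits are the network prefix, leaving 8 host bits.
host_bits = subnet.max_prefixlen - subnet.prefixlen
print(subnet.prefixlen, host_bits, 2 ** host_bits)

# 10.0.0.0/8 is one of the private RFC 1918 blocks.
print(subnet.is_private)

# Subnet ranges must not overlap; overlaps() detects a clash up front.
clashing = ipaddress.ip_network("10.123.9.128/25")
separate = ipaddress.ip_network("10.123.10.0/24")
print(subnet.overlaps(clashing), subnet.overlaps(separate))
```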

VPC: VPCs are Global
- Instances can be from different zones in the same region
- Instances can be from different regions also.
- All machines communicate using internal IP addresses

VPC: Subnets are Regional
- Instances from different regions cannot be on the same subnet
- Subnets can have resources from multiple zones or from a single zone

VPC: Types of subnets
Auto Mode: Automatically sets up a single subnet in each region - can manually create more subnets. This is default.

Custom Mode: No subnets are set up by default, we have to manually configure all subnets

You can switch a network from auto mode to custom mode. This conversion is one-way; custom mode networks cannot be changed to auto networks.

VPC: The "default" Network
- Every GCP project has an auto-mode network set up by default
- It comes with a number of routes and firewall rules preconfigured
- Gets us up and running without thinking about networks

VPC: IP Addresses
- Can be assigned to resources e.g. VMs
- Each VM has an internal IP address
- One or more secondary IP addresses
- Can also have an external IP address

VPC: Internal IP Addresses
- Use within a VPC
- Cannot be used across VPCs unless we have special configuration (like shared VPCs or VPNs)
- Can be ephemeral or static, typically ephemeral
- VMs know their internal IP address (VM name and IP is available to the network DNS)

VPC: External IP Addresses
- Use to communicate across VPCs
- Traffic using external IP addresses can cause additional billing charges
- Can be ephemeral or static
- VMs are not aware of their external IP address

VPC: IP Addresses: Internal vs External
Internal
- Ephemeral, changes every 24 hours or on VM restarts
- Allocated from the range of IP addresses available to a subnet to which the resource belongs
- VMs know their internal IP
- Hostname is mapped to internal IP "instance-1.c.test-project123.internal"
- VPC networks automatically resolve internal IP addresses to host names

External
- Can be ephemeral or static
- Ephemeral: Allocated from a pool of external IP addresses.
- Static: Reserved - charged when not assigned to VM
- VMs unaware of external IP
- Hosts with external IPs allow connections from outside the VPC
- Need to publish public DNS records to point to the instance with the external IP
- Can use Cloud DNS

VPC: IP Addresses: Ephemeral vs Static
Ephemeral:
- Available only till the VM is stopped, restarted or terminated
- No distinction between regional and global IP addresses

Static:
- Permanently assigned to a project and available till explicitly detached
- Regional or global resources
-- Regional: Allows resource of the region to use the address
-- Global: Used only for global forwarding rules in global load balancing
- Unassigned static IPs incur a cost

VPC: Alias IP Ranges
- A single service on a VM requires just one IP address
- Multiple services on the same VM may need different IP addresses
- Subnets have a primary and secondary CIDR range
- Using IP aliasing can set up multiple IP addresses drawn from the primary or secondary CIDR ranges
- Multiple containers or services on a VM can have their own IP
- VPCs automatically set up routes for the IPs
- Containers don't need to do their own routing, simplifies traffic management
- Can separate infrastructure from containers (infra will draw from the primary range, containers from the secondary range)

VPC: Routes
A route is a mapping of an IP range to a destination. Routes tell the VPC network where to send packets destined for a particular IP address.

VPC: 2 Default Routes
- Direct packets to specific destinations that carry them to the outside world (uses external IP addresses)
- Allow instances on a VPC to send packets directly to each other (uses internal IP addresses)

The existence of a route does not mean that a packet will get to the destination. Firewall rules have to be configured to allow the packet through.

VPC: Route: Creating a Network
- Default route for internet traffic.
- One route for every subnet that is created.

VPC: What is a route made of?
- name: User-friendly name
- network: The name of the network to which this route applies
- destRange: The destination IP range that this route applies to
- instanceTags: Instance tags that this route applies to, applies to all instances if empty
- priority: Used to break ties in case of multiple matches

and one of:
- nextHopInstance: Fully qualified URL. Instance must already exist
- nextHopIp: The IP address
- nextHopNetwork: URL of network
- nextHopGateway: URL of gateway
- nextHopVpnTunnel: URL of VPN tunnel

VPC: Instance Routing Tables
- Every route in a VPC might map to 0 or more instances
- Routes apply to an instance if the tag of the route and instance match
- If no tag, then route applies to all instances in a network
- All routes together form a routes collection
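How tags and priority interact when picking a route can be sketched as follows. This is a simplified model, not the Compute Engine API: the Route dataclass, the route names and next-hop strings are hypothetical, and the ordering used (most specific destination range first, then the lowest priority number) is an assumption about how ties are broken that you should verify against the VPC routing documentation.

```python
import ipaddress
from dataclasses import dataclass, field


@dataclass
class Route:
    name: str
    dest_range: str
    next_hop: str
    priority: int = 1000
    instance_tags: frozenset = field(default_factory=frozenset)


def applicable_routes(routes: list, instance_tags: list) -> list:
    """A route with no tags applies to every instance; otherwise a tag must match."""
    tags = set(instance_tags)
    return [r for r in routes if not r.instance_tags or r.instance_tags & tags]


def select_route(routes: list, instance_tags: list, dest_ip: str):
    """Pick the route an instance would use for dest_ip, or None."""
    ip = ipaddress.ip_address(dest_ip)
    matches = [
        r for r in applicable_routes(routes, instance_tags)
        if ip in ipaddress.ip_network(r.dest_range)
    ]
    if not matches:
        return None
    # Assumed ordering: longest prefix wins, then the lowest priority number.
    return min(
        matches,
        key=lambda r: (-ipaddress.ip_network(r.dest_range).prefixlen, r.priority),
    )


routes = [
    Route("default-internet", "0.0.0.0/0", "default-internet-gateway"),
    Route("subnet-route", "10.123.9.0/24", "vpc-network"),
    Route("nat-route", "0.0.0.0/0", "nat-gateway-instance",
          priority=800, instance_tags=frozenset({"no-ip"})),
]

print(select_route(routes, ["no-ip"], "8.8.8.8").name)
print(select_route(routes, [], "8.8.8.8").name)
print(select_route(routes, ["no-ip"], "10.123.9.7").name)
```

The tagged nat-route is the many-to-one NAT pattern: only instances tagged no-ip send their internet-bound traffic through the NAT instance, while untagged instances keep using the default route.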

VPC: Using Routes
- Many-to-one NATs
-- Multiple hosts mapped to one public IP
- Transparent proxies
-- Direct all external traffic to one machine

VPC: Firewall Rules
Protects your virtual machine (VM) instances from unapproved connections, both inbound (ingress) and outbound (egress). You can create firewall rules to allow or deny specific connections based on a combination of IP addresses, ports, and protocol.

- Action: allow or deny
- Direction: ingress or egress
- Source IPs (ingress), Destination IPs (egress)
- Protocol and port
- Specific instance names
- Priorities and tiebreakers

GCP firewall rules are stateful. If a connection is allowed, all traffic in the flow is also allowed, in both directions.

Some traffic is permanently blocked (outgoing traffic to port 25 (SMTP), GRE (Generic Routing Encapsulation) traffic, etc.)

A rule with a deny action overrides another with an allow if the two rules have same priority.

VPC: Firewall: Rule Assignment
- Every rule is assigned to every instance in a network
- Rule assignment can be restricted using tags or service accounts
-- Allow traffic from instances with source tag "backend"
-- Deny traffic to instances running as service account "[email protected]"

VPC: Service Accounts vs. Tags
Service Accounts:
- Represents the identity that the instance runs with.
- An instance can have just one service account
- Restricted by IAM permissions, permissions to start an instance with a service account has to be explicitly given
- Changing a service account requires stopping and restarting an instance

Tags:
- Logically group resources for billing or applying firewalls
- An instance can have any number of tags
- Tags can be changed by any user who can edit an instance
- Changing tags is metadata update and is a much lighter operation

Prefer service accounts to tags for grouping instances when applying firewall rules.

VPC: Firewall: Implied Rules
- A default "allow egress" rule.
-- Allows all egress connections. Rule has a priority of 65535.
- A default "deny ingress" rule.
-- Denies all ingress connections. Rule has a priority of 65535.
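Rule evaluation, including the two implied rules and the deny-beats-allow tiebreak, can be sketched like this. It is a simplified model and not the Compute Engine API: FirewallRule, decide and the sample rules are hypothetical, connection state is ignored, and the code assumes that a lower priority number means a higher priority.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass
class FirewallRule:
    name: str
    direction: str                # "ingress" or "egress"
    action: str                   # "allow" or "deny"
    priority: int
    ip_range: str = "0.0.0.0/0"   # source range (ingress) or destination range (egress)
    protocol: str = "all"
    port: Optional[int] = None    # None matches any port


# The implied rules always exist, at the lowest possible priority.
IMPLIED_RULES = [
    FirewallRule("implied-allow-egress", "egress", "allow", 65535),
    FirewallRule("implied-deny-ingress", "ingress", "deny", 65535),
]


def decide(rules: list, direction: str, peer_ip: str, protocol: str, port: int) -> tuple:
    """Return (action, rule name) for a new connection."""
    ip = ipaddress.ip_address(peer_ip)
    matches = [
        r for r in list(rules) + IMPLIED_RULES
        if r.direction == direction
        and ip in ipaddress.ip_network(r.ip_range)
        and r.protocol in ("all", protocol)
        and r.port in (None, port)
    ]
    # Lowest priority number first; at equal priority a deny sorts before an allow.
    best = min(matches, key=lambda r: (r.priority, r.action != "deny"))
    return best.action, best.name


rules = [
    FirewallRule("allow-ssh", "ingress", "allow", 1000, protocol="tcp", port=22),
    FirewallRule("deny-ssh-bad-range", "ingress", "deny", 1000,
                 ip_range="203.0.113.0/24", protocol="tcp", port=22),
]

print(decide(rules, "ingress", "203.0.113.5", "tcp", 22))
print(decide(rules, "ingress", "198.51.100.1", "tcp", 22))
print(decide(rules, "ingress", "198.51.100.1", "tcp", 80))
print(decide(rules, "egress", "8.8.8.8", "udp", 53))
```

Because the implied rules sit at priority 65535, any user-defined rule with a smaller number overrides them, which is how a network that denies all ingress by default is opened up selectively.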

VPC: Firewall Rules for the "default" network
- default-allow-internal: Allows ingress network connections of any protocol and port between VM instances on the network
- default-allow-ssh: Allows ingress TCP connections from any source to any instance on the network over port 22
- default-allow-icmp: Allows ingress ICMP traffic from any source to any instance on the network.
- default-allow-rdp: Allows ingress remote desktop protocol traffic to TCP port 3389.

VPC: Firewall: Egress Connections
- Destination CIDR ranges, Protocols, Ports
- Destinations with specific tags or service accounts
-- Allow: Permit matching egress connections
-- Deny: Block the matching egress connections

VPC: Firewall: Ingress Connections
- Source CIDR ranges, Protocols, Ports
- Sources with specific tags or service accounts
-- Allow: Permit matching ingress connections
-- Deny: Block the matching ingress connections

VPC: Interconnecting Networks
3 options
- Virtual Private Networks (VPNs) using Cloud Router
- Dedicated Interconnect
- Direct and Carrier Peering

VPC: Interconnect: VPN
- Connects your on premise network to the Google Cloud VPC or two VPCs.
- Offers 99.9% service availability
- Traffic is encrypted by one VPN gateway and then decrypted by another VPN gateway
- Supports both static and dynamic routes for traffic between on-premise and cloud
- Only IPSec gateway to gateway scenarios are supported, does not work with client software on a laptop
- Must have a static external IP address
- Needs to know what destination IPs are allowed and create routes to forward packets to those IPs
- Can have multiple tunnels to a single VPN gateway, site-to-site VPN

VPN will have higher latency and lower throughput as compared with dedicated interconnect and peering options.

VPC: Cloud Router
- Dynamically exchange routes between Google VPCs and on premise networks
- Fully distributed and managed Google cloud service
- Peers with on premise gateway or router to exchange route information
- Uses the BGP or Border Gateway Protocol
- To enable dynamic routing, create a Cloud Router. Then, configure a BGP session between the Cloud Router and your on-premises gateway or router.
- The new subnets are seamlessly advertised between networks. Instances in the new subnets can start sending and receiving traffic immediately.

VPC: Static Routes
- Create and maintain a routing table
- A topology change in the network requires routes to be manually updated
- Cannot re-route traffic automatically if a link fails
- Suitable for small networks with stable topologies
- Routers do not advertise routes

VPC: Static Routing for VPN tunnels
- A VPN tunnel connecting a gateway at either end (Google Cloud & Peer network)
- A new subnet added to the on premise network
- New routes need to be added to the cloud VPC to reach the new subnet
- VPN tunnel will need to be torn down and re-established to include the new subnet
- Static routes are slow to converge as updates are manual

VPC: Dynamic Routes
- Can be implemented using Cloud Router on the GCP
- Uses BGP to exchange route information between networks
- Networks automatically and rapidly discover changes
- Changes implemented without disrupting traffic

VPC: Dynamic Routing for VPN tunnels
- A Cloud Router belongs to a particular network and a particular region
- Subnets segmenting the network IP space
- Advertises subnet changes using the BGP
- Also learns about subnet changes in the on premise network through BGP
- The IP address of the Cloud Router and the gateway router should both be link local IP addresses (valid only for communication within the network link)

VPC: Dynamic Routing Mode
- Determines which subnets are visible to Cloud Routers
- Global dynamic routing: Cloud router advertises all subnets in the VPC network to the on-premise router
- Regional dynamic routing: Advertises and propagates only those routes in its local region
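
The difference between the two modes can be sketched in a few lines of Python. This is only an illustration of the rule stated above, not how Cloud Router is implemented; the Subnet class, the mode labels and the sample subnets are all made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subnet:
    name: str
    region: str
    cidr: str


def advertised_subnets(subnets, router_region, mode):
    """Return the subnets a Cloud Router would advertise to the on-premise peer."""
    if mode == "GLOBAL":
        # Global mode: every subnet in the VPC network is advertised.
        return list(subnets)
    if mode == "REGIONAL":
        # Regional mode: only subnets in the router's own region are advertised.
        return [s for s in subnets if s.region == router_region]
    raise ValueError(f"unknown dynamic routing mode: {mode}")


# Hypothetical VPC with subnets in two regions.
vpc = [
    Subnet("web", "us-central1", "10.0.1.0/24"),
    Subnet("db", "us-central1", "10.0.2.0/24"),
    Subnet("batch", "europe-west1", "10.1.0.0/24"),
]

regional = advertised_subnets(vpc, "us-central1", "REGIONAL")
print([s.name for s in regional])
```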

VPC: Flow Logs
- Using Flow Logs, you can monitor network traffic to and from your VMs for TCP and UDP protocols.
- Enable or disable VPC Flow Logs per network subnet.
- It captures source and destination IP addresses, source and destination ports, protocol number, timestamp, number of packets, throughput, etc.
- Where do you use?
>>> Network monitoring, diagnostics
>>> Network forensics (e.g.: which IPs talked with whom and when)
>>> Real-time security analysis - This can provide real-time monitoring, correlation of events, analysis, and security alerts.
>>> Cost / expense optimization.
- You can view flow logs in Stackdriver Logging, and you can export them to any destination that Stackdriver Logging export supports (three destinations):
>>> Cloud Storage Buckets
>>> Cloud Pub/Sub (publish and subscribe for real-time messaging), and
>>> BigQuery (fully managed enterprise data warehouse).
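
As a rough idea of the "which IPs talked with whom" forensics use case, the sketch below totals bytes per source and destination pair over exported flow records. The field names (src_ip, dest_ip, bytes_sent) and the sample records are assumptions made for the example; check the actual flow log schema before reusing them.

```python
from collections import defaultdict


def bytes_by_pair(records):
    """Sum bytes for every (source IP, destination IP) pair."""
    totals = defaultdict(int)
    for record in records:
        totals[(record["src_ip"], record["dest_ip"])] += record["bytes_sent"]
    return dict(totals)


def top_talkers(records, n=3):
    """Return the n pairs that exchanged the most bytes, largest first."""
    totals = bytes_by_pair(records)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]


# Hypothetical exported flow records.
sample_records = [
    {"src_ip": "10.0.0.2", "dest_ip": "10.0.0.9", "bytes_sent": 4000},
    {"src_ip": "10.0.0.3", "dest_ip": "10.0.0.9", "bytes_sent": 1500},
    {"src_ip": "10.0.0.2", "dest_ip": "10.0.0.9", "bytes_sent": 3000},
]

print(top_talkers(sample_records, n=1))
```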

VPC: Interconnect: Dedicated Interconnect
- Direct physical connection and RFC 1918 communication between on-premise network and cloud VPC
- Can transfer large amounts of data between networks
- More cost effective than using high bandwidth internet connections or using VPN tunnels
- Capacity of a single connection is 10Gbps
- A maximum of 8 connections supported

Cross connect between the Google network and the on premise router in a common colocation facility.

VPC: Interconnect: Dedicated Interconnect Benefits
- Does not traverse the public internet. Fewer hops between points so fewer points of failure
- Can use internal IP addresses over a dedicated connection
- Scale connection based on needs up to 80Gbps
- Cost of egress traffic from VPC to on-premise network reduced

VPC: Interconnect: Direct Peering
- Direct connection between on-premise network and Google at Google's edge network locations
- BGP routes exchanged for dynamic routing
- Direct peering can be used to reach all of Google's services, including the full suite of GCP products
- Special billing rate for GCP egress traffic, other traffic billed at standard GCP rates

VPC: Interconnect: Carrier Peering
- Enterprise grade network services connecting your infrastructure to Google using a service provider
- Can get high availability and lower latency using one or more links
- No Google SLA, the SLA depends on the carrier
- Special billing rate for GCP egress traffic, other traffic billed at standard GCP rates

Shared VPC
- Used to be called XPN (Cross-Project Networking)
- So far: one project, multiple networks
- Shared VPC allows cross-project networking, i.e. multiple projects, one network.
- Creates a VPC network of RFC1918 IP spaces that associated projects can use.
- Firewall rules and policies apply to all projects on the network

Shared VPC: Host Project
Project that hosts sharable VPC networking resources within a Cloud Organization.

Shared VPC: Service project
Project that has permission to use the shared VPC networking resources from the host project.

Shared VPC: Standalone project
A project that does not share networking resources with any other project.

Shared VPC: Shared VPC network
A VPC network owned by the host project and shared with one or more service projects in the Cloud Organization.

Shared VPC: Organization
The Cloud Organization is the top level in the Cloud Resource Hierarchy and the top-level owner of all the projects and resources created under it. A given host project and its service projects must be under the same Cloud Organization.

Shared VPC: Host and Service projects
- A service project can only be associated with a single host
- A project cannot be a host as well as a service project at the same time
- Instances in a project can only be assigned external IPs from the same project
- Existing projects can use shared VPC networks
- Instances on a shared VPC need to be created explicitly for the VPC

VPC Network Peering
- Allows private RFC1918 connectivity across two VPC networks
- Networks can be in the same or in different projects
- Primary and secondary ranges should not overlap with any peered ranges.
- Build SaaS ecosystems in GCP, services can be made available privately across different VPC networks
- Useful for organizations:
-- With several network administrative domains
-- Which want to peer with other organizations on the GCP
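
The no-overlap requirement can be checked before peering with Python's standard ipaddress module. A minimal sketch, assuming you already have the primary and secondary ranges of both networks as CIDR strings (the ranges in the usage line are invented):

```python
import ipaddress


def find_overlaps(local_ranges, peer_ranges):
    """Return every (local, peer) pair of CIDR ranges that overlap."""
    overlaps = []
    for local in local_ranges:
        local_net = ipaddress.ip_network(local)
        for peer in peer_ranges:
            if local_net.overlaps(ipaddress.ip_network(peer)):
                overlaps.append((local, peer))
    return overlaps


# An empty result means the two networks are safe to peer on this criterion.
print(find_overlaps(["10.0.0.0/16"], ["10.0.1.0/24", "192.168.0.0/24"]))
```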

VPC Network Peering Benefits
- Lower latency as compared with public IP networking
- Better security since services need not expose an external IP address
- Using internal IPs for traffic avoids egress bandwidth pricing on the GCP

VPC Network Peering Properties
- Peered networks are administratively separate - routes, firewalls, VPNs and traffic management applied independently
- One VPC can peer with multiple networks with a limit of 25
- Only directly peered networks can communicate
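
"Only directly peered networks can communicate" means peering is not transitive. The toy model below makes that concrete; the network names are invented and the set of pairs is just a stand-in for real peering configuration.

```python
def can_communicate(peerings, a, b):
    """True only if a and b are the same network or share a direct peering."""
    return a == b or frozenset((a, b)) in peerings


# net-a <-> net-b and net-b <-> net-c are peered; net-a and net-c are not.
peerings = {
    frozenset(("net-a", "net-b")),
    frozenset(("net-b", "net-c")),
}

print(can_communicate(peerings, "net-a", "net-b"))
print(can_communicate(peerings, "net-a", "net-c"))
```

Even though net-b is peered with both, traffic between net-a and net-c needs its own peering.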

Cloud DNS
Google Cloud DNS is a high-performance, resilient, global Domain Name System (DNS) service that publishes your domain names to the global DNS in a cost-effective way.

- Hierarchical distributed database that lets you store IP addresses and other data and look them up by name
- Publish zones and records in the DNS
- No burden of managing your own DNS server

Cloud DNS: Managed Zone
- Entity that manages DNS records for a given suffix (example.com)
- Maintained by Cloud DNS

Cloud DNS: Record types
A - Address record, maps hostnames to IPv4 addresses
SOA - Start of authority - specifies authoritative information on a managed zone
MX - Mail exchange used to route requests to mail servers
NS - Name Server record, delegates a DNS zone to an authoritative server

Cloud DNS: Resource Record Changes
The changes are first made to the authoritative servers and are then picked up by the DNS resolvers when their caches expire

Managed Instance Groups and Load Balancing
- Managed instance groups are a pool of similar machines which can be scaled automatically
- Load balancing can be external or internal, global or regional
- Basic components of HTTP(S) load balancing - target proxy, URL map, backend service and backends
- Use cases and architecture diagrams for all the load balancing types HTTP(S), SSL proxy, TCP proxy, network and internal load balancing

Instance Groups
A group of machines which can be created and managed together to avoid individually controlling each instance in the project
- Managed
- Unmanaged

Managed Instance Group
- Uses an instance template to create a group of identical instances
- Changes to the instance group apply to all instances in the group

- Can automatically scale the number of instances in the group
- Work with load balancing to distribute traffic across instances
- If an instance stops, crashes or is deleted the group automatically recreates the instance with the same template
- Can identify and recreate unhealthy instances in a group (autohealing)
- Two types: 1) Zonal, 2) Regional.

Instance Template
Defines the machine type, image, zone and other properties of an instance. A way to save the instance configuration to use it later to create new instances or groups of instances

- Global resource not bound to a zone or a region
- Can reference zonal resources such as a persistent disk
-- In such cases can be used only within the zone

Zonal vs. Regional MIG
- Prefer regional instance groups to zonal so application load can be spread across multiple zones
- This protects against failures within a single zone
- Choose zonal if you want lower latency and want to avoid cross-zone communication

MIG: Health Checks and Autohealing
- A MIG applies health checks to monitor the instances in the group
- If a service has failed on an instance, that instance is recreated (autohealing)
- Similar to health checks used in load balancing but the objective is different
-- LB health checks are used to determine where to send traffic
-- MIG health checks are used to recreate instances
- Typically configure health checks for both LB and MIGs
- The new instance is recreated based on the template that was used to originally create it (might be different from the default instance template)
- Disk data might be lost unless explicitly snapshotted

MIG: Configuring Health Checks
- Check Interval: The time to wait between attempts to check instance health
- Timeout: The length of time to wait for a response before declaring check attempt failed
- Healthy Threshold: How many consecutive "healthy" responses indicate that the VM is healthy
- Unhealthy Threshold: How many consecutive "failed" responses indicate VM is unhealthy
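
How the two thresholds interact is easier to see as a small state machine. The sketch below models only the consecutive-response counting; the check interval and timeout are left out, and the class name and default values are assumptions for the example rather than GCP defaults.

```python
from dataclasses import dataclass


@dataclass
class HealthTracker:
    healthy_threshold: int = 2
    unhealthy_threshold: int = 3
    healthy: bool = True
    _streak: int = 0  # consecutive probe results that contradict the current state

    def record(self, probe_ok: bool) -> bool:
        """Feed one probe result and return the resulting health state."""
        if probe_ok == self.healthy:
            # Result agrees with the current state: any opposing streak is broken.
            self._streak = 0
            return self.healthy
        self._streak += 1
        needed = self.healthy_threshold if probe_ok else self.unhealthy_threshold
        if self._streak >= needed:
            self.healthy = probe_ok
            self._streak = 0
        return self.healthy


tracker = HealthTracker(healthy_threshold=2, unhealthy_threshold=3)
print([tracker.record(ok) for ok in [False, False, False, True, True]])
```

A single failed probe does not flip the state; only an unbroken run of failures (or successes) of the configured length does.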

Unmanaged Instance Groups
- Groups of dissimilar instances that you can add and remove from the group
- Do not offer autoscaling, rolling updates or instance templates
- Not recommended, used only when you need to apply load balancing to pre-existing configurations

Load Balancing
- Load balancing and autoscaling for groups of instances
- Scale your application to support heavy traffic
- Detect and remove unhealthy VMs, healthy VMs automatically re-added
- Route traffic to the closest VM
- Fully managed service, redundant and highly available

Load Balancing Hierarchy
External:
-- Global
---- HTTP/HTTPS
---- SSL Proxy
---- TCP Proxy
-- Regional
---- Network
Internal
-- Regional

HTTP/HTTPS Load Balancing
A global, external load balancing service offered on the GCP. Distributes HTTP(S) traffic among groups of instances based on:
-- proximity to the user
-- requested URL
-- or both.
- Traffic from the internet is sent to a global forwarding rule - this rule determines which proxy the traffic should be directed to
- The global forwarding rule directs incoming requests to a target HTTP proxy
- The target HTTP proxy checks each request against a URL map to determine the appropriate backend service for the request
- The backend service directs each request to an appropriate backend based on serving capacity, zone, and instance health of its attached backends
- The health of each backend instance is verified using either an HTTP health check or an HTTPS health check - if HTTPS, request is encrypted
- Actual request distribution can happen based on CPU utilization, requests per instance
- Can configure the managed instance groups making up the backend to scale as the traffic scales (based on the parameters of utilization or requests per second)
- HTTPS load balancing requires the target proxy to have a signed certificate to terminate the SSL connection
- Must create firewall rules to allow requests from load balancer and health checker to get through to the instances
- Session affinity: All requests from same client to same server based on either
-- client IP
-- cookie

Global Forwarding Rules
- Route traffic by IP address, port and protocol to a load balancing proxy
- Can only be used with global load balancing HTTP(S), SSL Proxy and TCP Proxy
- Regional forwarding rules can be used with regional load balancing and individual instances

Target Proxy
- Referenced by one or more global forwarding rules
- Route the incoming requests to a URL map to determine where they should be sent
- Specific to a protocol (HTTP, HTTPS, SSL and TCP)
- Should have an SSL certificate if it terminates HTTPS connections (limit of 10 SSL certificates)
- Can connect to backend services via HTTP or HTTPS

URL Map
- Used to direct traffic to different instances based on the incoming URL
-- http://www.example.com/audio -> backend service1
-- http://www.example.com/video -> backend service2
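
A URL map is essentially a lookup from request path to backend service. The sketch below is a simplified stand-in that uses longest-prefix matching on the path only; real URL maps also have host rules and path matchers, and the backend names and default service here are placeholders.

```python
def route_request(url_map, path, default_service):
    """Pick the backend service whose path prefix matches, longest prefix first."""
    best = None
    for prefix in url_map:
        # Match "/audio" itself or anything under "/audio/".
        if path == prefix or path.startswith(prefix.rstrip("/") + "/"):
            if best is None or len(prefix) > len(best):
                best = prefix
    return url_map[best] if best is not None else default_service


url_map = {
    "/audio": "backend-service1",
    "/video": "backend-service2",
}

print(route_request(url_map, "/audio/song.mp3", "default-service"))
print(route_request(url_map, "/images/logo.png", "default-service"))
```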

Load Balancing: Backend Service
- Centralized service for managing backends
- Backends contain instance groups which handle user requests
- Knows which instances it can use, how much traffic they can handle
- Monitors the health of backends and does not send traffic to unhealthy instances

Load Balancing: Backend Service Components
- Health Check: Polls instances to determine which one can receive requests
- Backends: Instance group of VMs which can be automatically scaled
- Session Affinity: Attempts to send requests from the same client to the same VM
- Timeout: Time the backend service will wait for a backend to respond

Load Balancing: Health Checks
- HTTP(S), SSL and TCP health checks
- HTTP(S): Verifies that the instance is healthy and the web server is serving traffic
- TCP, SSL: Used when the service expects TCP or SSL connection i.e. not HTTP(S)
- GCP creates redundant copies of the health checker automatically, so health checks might happen more frequently than you expect

Load Balancing: Session Affinity
- Client IP: Hashes the IP address to send requests from the same IP to the same VM
-- Requests from different users might appear to come from the same IP
-- Users who move between networks might lose affinity
- Cookie: Issues a cookie named GCLB in the first request.
-- Subsequent requests from clients with the cookie are sent to the same instance

Load Balancing: Backends
- Instance Group: Can be a managed or unmanaged instance group
- Balancing Mode: Determines when the backend is at full usage
-- CPU utilization, Requests per second
- Capacity Setting: A % of the balancing mode which determines the capacity of the backend

Load Balancing: Backend Buckets
- Allow you to use Cloud Storage buckets with HTTP(S) load balancing
- Traffic is directed to the bucket instead of a backend
- Useful in load balancing requests to static content

Load Balancing: Load Distribution
- Uses CPU utilization of the backend or requests per second as the balancing mode
- Maximum values can be specified for both
- Short bursts of traffic above the limit can occur
- Incoming requests are first sent to the region closest to the user, if that region has capacity
- Traffic distributed amongst zone instances based on capacity
- Round robin distribution across instances in a zone
- Round robin can be overridden by session affinity

Load Balancing: Firewall Rules
- Allow traffic from 130.211.0.0/22 and 35.191.0.0/16 to reach your instances
- IP ranges that the load balancer and the health checker use to connect to backends
- Allow traffic on the port that the global forwarding rule has been configured to use
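
When debugging a firewall rule, it helps to confirm whether a source address in your logs falls inside the two ranges listed above. A small sketch with the standard ipaddress module; the function name is made up, and the ranges are copied from this section.

```python
import ipaddress

# Source ranges used by the load balancer and health checker (from the notes above).
LB_SOURCE_RANGES = [
    ipaddress.ip_network("130.211.0.0/22"),
    ipaddress.ip_network("35.191.0.0/16"),
]


def is_lb_or_health_check_source(ip):
    """True if the address belongs to one of the load balancer source ranges."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in LB_SOURCE_RANGES)


print(is_lb_or_health_check_source("130.211.1.5"))
print(is_lb_or_health_check_source("8.8.8.8"))
```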

Load Balancing: SSL Proxy Load Balancing
- Remember the OSI network layer stack: physical, data link, network, transport, session, presentation, application?
- The usual combination is TCP/IP: network = IP, transport = TCP, application = HTTP
- For secure traffic: add session layer = SSL (secure socket layer), and application layer = HTTPS
- Use only for non-HTTP(S) SSL traffic
- For HTTP(S), just use HTTP(S) load balancing
- SSL connections are terminated at the global layer then proxied to the closest available instance group

Load Balancing: TCP Proxy Load Balancing
- Perform load balancing based on transport layer (TCP)
- Allows you to use a single IP address for all users around the world.
- Automatically routes traffic to the instances that are closest to the user.
- Advantage of transport layer load balancing:
-- more intelligent routing possible than with network layer load balancing
-- better security - TCP vulnerabilities can be patched at the load balancer

Network Load Balancing
- Based on incoming IP protocol data, such as address, port, and protocol type
- Pass-through, regional load balancer - does not proxy connections from clients
- Use it to load balance UDP traffic, and TCP and SSL traffic
- Load balances traffic on ports that are not supported by the SSL proxy and TCP proxy load balancers

Load Balancing Algorithm
- Picks an instance based on a hash of:
-- the source IP and port
-- destination IP and port
-- protocol
- This means that incoming TCP connections are spread across instances and each new connection may go to a different instance.
- Regardless of the session affinity setting, all packets for a connection are directed to the chosen instance until the connection is closed. Existing connections have no impact on load balancing decisions for new incoming connections.
- This can result in imbalance between backends if long-lived TCP connections are in use.

Load Balancing: Target Pools
- Network load balancing forwards traffic to target pools
- A group of instances which receive incoming traffic from forwarding rules
- Can only be used with forwarding rules for TCP and UDP traffic
- Can have backup pools which will receive requests if the first pool is unhealthy
- failoverRatio is a threshold on the ratio of healthy instances to total instances in the primary pool
- If primary target pool's ratio is below the failoverRatio traffic is sent to the backup pool
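
The failover decision reduces to one comparison. This sketch assumes the ratio is measured as healthy instances over all instances in the primary pool and that failover happens strictly below the threshold, as the bullet above states; the function and argument names are illustrative.

```python
def choose_pool(healthy_count, total_count, failover_ratio):
    """Return which target pool should receive traffic: 'primary' or 'backup'."""
    if total_count <= 0:
        raise ValueError("the primary pool must contain at least one instance")
    healthy_ratio = healthy_count / total_count
    # Traffic fails over when too few primary instances are healthy.
    return "backup" if healthy_ratio < failover_ratio else "primary"


print(choose_pool(healthy_count=2, total_count=10, failover_ratio=0.5))
print(choose_pool(healthy_count=8, total_count=10, failover_ratio=0.5))
```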

Network Load Balancer: Firewall Rules
- HTTP health check probes are sent from the IP ranges 209.85.152.0/22, 209.85.204.0/22, and 35.191.0.0/16.
- The load balancer uses the same ranges to connect to the instances
- Firewall rules should be configured to allow traffic from these IP ranges

Internal Load Balancing
- Private load balancing IP address that only your VPC instances can access
- VPC traffic stays internal - less latency, more security
- No public IP address needed
- Useful to balance requests from your frontend instances to your backend instances

Internal Load Balancing: Load Balancing Algorithm
- The backend instance for a client is selected using a hashing algorithm that takes instance health into consideration.
- Using a 5-tuple hash, five parameters for hashing:
-- client source IP
-- client port
-- destination IP (the load balancing IP)
-- destination port
-- protocol (either TCP or UDP)
- Introduce session affinity by hashing on only some of the 5 parameters
-- Hash based on 3-tuple (Client IP, Dest IP, Protocol)
-- Hash based on 2-tuple (Client IP, Dest IP)
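
Hashing on fewer fields is what produces affinity: connections that differ only in the ignored fields hash to the same backend. The sketch below shows the idea with SHA-256 and a modulo; the real hash function is not described in these notes, and the affinity labels, field names and backend names are assumptions for the example.

```python
import hashlib

# Which connection fields feed the hash for each affinity setting (labels are illustrative).
AFFINITY_FIELDS = {
    "5-tuple": ("src_ip", "src_port", "dst_ip", "dst_port", "protocol"),
    "3-tuple": ("src_ip", "dst_ip", "protocol"),
    "2-tuple": ("src_ip", "dst_ip"),
}


def select_backend(connection, healthy_backends, affinity="5-tuple"):
    """Pick a healthy backend by hashing the fields chosen by the affinity setting."""
    key = "|".join(str(connection[field]) for field in AFFINITY_FIELDS[affinity])
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(healthy_backends)
    return healthy_backends[index]


backends = ["vm-1", "vm-2", "vm-3"]
first = {"src_ip": "10.0.0.5", "src_port": 40001, "dst_ip": "10.0.9.9", "dst_port": 80, "protocol": "TCP"}
second = dict(first, src_port=40002)  # same client, new connection from another port

# With 2-tuple hashing the client port is ignored, so both connections land on the same VM.
print(select_backend(first, backends, "2-tuple") == select_backend(second, backends, "2-tuple"))
```

With the full 5-tuple, the second connection may or may not land on the same VM, which is why it gives no affinity guarantee.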

Internal Load Balancing: Health Checks
- HTTP, HTTPS health checks: These provide the highest fidelity, they verify that the web server is up and serving traffic, not just that the instance is healthy.
- SSL (TLS) health checks: Configure the SSL health checks if your traffic is not HTTPS but is encrypted via SSL(TLS)
- TCP health checks: For all TCP traffic that is not HTTP(S) or SSL(TLS), you can configure a TCP health check

High Availability
- Managed service: no additional configuration needed to ensure high availability
- Can configure multiple instance groups in different zones to guard against failures in a single zone
- With multiple instance groups all instances are treated as if they are in a single pool and the load balancer distributes traffic amongst them using the load balancing algorithm

GCP Internal Load Balancing
- Not proxied - differs from traditional model
- Lightweight load balancing built on top of the Andromeda network virtualization stack
- Provides software-defined load balancing that directly delivers the traffic from the client instance to a backend instance

Autoscaling
- Managed instance groups automatically add or remove instances based on increases and decreases in load
- Helps your applications gracefully handle increases in traffic
- Reduces cost when load is lower
- Define autoscaling policy, the autoscaler takes care of the rest

For GKE groups autoscaling is different, called Cluster Autoscaling

- Autoscaling Policy
- Target Utilization Level

Autoscaling: Autoscaling Policy
- Average CPU utilization
- Stackdriver monitoring metrics
- HTTP(S) load balancing server capacity (utilization or RPS)
- Pub/Sub queueing workload (alpha)

Autoscaling: Target Utilization Level
- The level at which you want to maintain your VMs
- Interpreted differently based on the autoscaling policy that you've chosen

Autoscaling Policy: Average CPU Utilization
- Target utilization level of 0.75 maintains average CPU utilization at 75% across all instances
- If utilization exceeds the target, more instances will be added
- If utilization reaches 100% during times of heavy usage, the autoscaler might increase the number of instances by
-- 50%
-- 4 instances
- whichever is larger
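
The arithmetic behind these bullets can be sketched as follows. The proportional sizing formula is an assumption used to illustrate how a target level translates into a group size; only the "50% or 4 instances, whichever is larger" step comes from the notes above.

```python
import math


def recommended_size(current_size, avg_utilization, target_utilization):
    """Group size that would bring average CPU utilization back to the target."""
    return max(1, math.ceil(current_size * avg_utilization / target_utilization))


def burst_step(current_size):
    """Instances added when utilization hits 100%: 50% of the group or 4, whichever is larger."""
    return max(math.ceil(current_size * 0.5), 4)


# 4 instances at 90% average CPU with a 0.75 target: scale out to 5.
print(recommended_size(current_size=4, avg_utilization=0.9, target_utilization=0.75))
# A group of 20 saturated instances might grow by 10; a group of 4 by 4.
print(burst_step(20), burst_step(4))
```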

Autoscaling Policy: Stackdriver monitoring metrics
- Can configure the autoscaler to use standard or custom metrics
- Not all standard metrics are valid utilization metrics that the autoscaler can use
-- the metric must contain data for a VM instance
-- the metric must define how busy the resource is, the metric value increases or decreases proportional to the number of instances in the group

Autoscaling Policy: HTTP(S) Load Balancing Server Capacity
- Only works with
-- CPU utilization
-- maximum requests per second/instance
- These are the only settings that can be controlled by adding and removing instances

Autoscaling does not work with maximum requests per group. This setting is independent of the number of instances in a group.

Autoscaler with Multiple Policies
- The autoscaler will scale based on the policy which provides the largest number of VMs in the group.
- This ensures that you always have enough machines to handle your workload.
- Can handle a maximum of 5 policies at a time.
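
Combining policies is a "take the maximum" rule. A minimal sketch, where the per-policy recommendations are given as plain numbers and the minimum and maximum group sizes are assumed bounds added for the example:

```python
MAX_POLICIES = 5  # limit stated in the notes above


def group_size(recommendations, min_size, max_size):
    """Scale to the largest size any policy asks for, clamped to the group bounds."""
    if len(recommendations) > MAX_POLICIES:
        raise ValueError("at most 5 policies can be combined")
    wanted = max(recommendations.values())
    return max(min_size, min(max_size, wanted))


# CPU wants 6 VMs, load balancing capacity wants 9, a custom metric wants 4.
print(group_size({"cpu": 6, "lb_capacity": 9, "custom_metric": 4}, min_size=2, max_size=20))
```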

StackDriver Accounts
A Stackdriver account holds monitoring and other configuration information for a group of GCP projects and AWS accounts that are monitored together.

StackDriver: Types of Monitored Projects
- Hosting Projects: holds the monitoring configuration for the Stackdriver account (the dashboards, alert policies, uptime checks, and so on).
- To monitor a single GCP project, create a new StackDriver account within that one project
- To monitor multiple GCP projects, create new StackDriver account in an otherwise empty hosting project
-- Don't use hosting project for any other purpose
- AWS Connector Projects: When you add an AWS account to a Stackdriver account, Stackdriver Monitoring creates the AWS connector project for you, typically giving it a name beginning AWS Link.
- The Monitoring and Logging agents on your EC2 instances send their metrics and logs to this connector project.
- If you use StackDriver logging from AWS, those logs will be in the AWS connector project (not in the host project of the Stackdriver account)
- Don't put any GCP resources in an AWS connector project. This will not be monitored!
- Monitored Projects: Regular (non-AWS) projects within GCP that are being monitored.

StackDriver: Metrics
- Stackdriver Monitoring has metrics for
-- the CPU utilization of your VM instances
-- the number of tables in your SQL databases
-- hundreds more

- Can create custom metrics for StackDriver monitoring to track

- Three types:
-- gauge metrics
-- delta metrics
-- cumulative metrics

- Metric data will be available in StackDriver monitoring for 6 weeks
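
The three kinds differ in what each data point means: a gauge is the value at an instant (for example CPU utilization), a delta is the change over an interval, and a cumulative value keeps accumulating from a start time. The sketch below converts a cumulative series into per-interval deltas; the sample numbers are invented.

```python
def cumulative_to_delta(points):
    """Turn successive cumulative readings into the change between each pair."""
    return [later - earlier for earlier, later in zip(points, points[1:])]


# Hypothetical cumulative request counter sampled once a minute.
requests_total = [100, 130, 130, 175]
print(cumulative_to_delta(requests_total))
```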

StackDriver: Metric Latency
- VM CPU utilisation - once a minute, available with 3-4 minutes lag
- If writing data programmatically to metric time series
-- first write takes a few minutes to show up
-- subsequent writes visible within seconds

StackDriver: Error Reporting
- StackDriver error reporting works on
-- AppEngine Standard Environment - log entries with a stack trace and severity of ERROR or higher automatically show up
-- AppEngine Flexible Environment - anything written to stderr automatically shows up
-- Compute Engine - instrument - throw error in exception catch block
-- Amazon EC2: Enable StackDriver logging

StackDriver: Trace
- Distributed tracing system that collects latency data from Google App Engine, Google HTTP(S) load balancers, and applications instrumented with the Stackdriver Trace SDKs
- Think TensorBoard for Google Cloud Apps

Logging with StackDriver
- Stackdriver Logging includes storage for logs, a user interface (the Logs Viewer), and an API to manage logs programmatically
- Stackdriver Logging lets you
-- read and write log entries
-- search and filter your logs
-- export your logs
-- create logs-based metrics.

Types of Logs
- Audit logs: permanent GCP logs (no retention period)
- Admin activity logs: for actions that modify config or metadata
- Data access logs: API calls that create, modify, or read user-provided data
- Admin activity logs are always on; data access logs need to be enabled (can be big)
- BigQuery data access logs are always on by default

StackDriver: Service Tiers and Retention
- Basic - no StackDriver account - free and 5 GB cap
- Retention period of log data depends on service tier

StackDriver: Using Logs
- Monitor virtually anything - VM instances, AWS EC2 instances, database instances ...
- Exporting to sinks: Cloud Storage, BigQuery datasets, Pub/Sub topics
- Create metrics to view in StackDriver monitoring

Cloud Endpoints
- Helps create, share, maintain, and secure your APIs
- Uses the distributed Extensible Service Proxy to provide low latency and high performance
- Provides authentication, logging, monitoring
- Host your API anywhere Docker is supported so long as it has Internet access to GCP
- Ideally, use with
-- App Engine (flexible or some types of standard)
-- Container Engine instance
-- Compute Engine instance

Note - proxy and API containers must be on same instance to avoid network access

Identity and Security
Authentication - Who are you?
Standard flow - critical to get it right.
End-User Accounts
Service Accounts
API Keys - Not critical to get it right.

Authorization - What can you do?
Identity and Access Management (Cloud IAM)

Service Accounts
- Most flexible and widely supported authentication method
- Different GCP APIs support different credential types, but all GCP APIs support service accounts
- For most applications that run on a server and need to communicate with GCP APIs, use service accounts

Service Accounts: Why use them?
- Service accounts are associated with a project, not a user
- So, any project user gets access to all required resources at one go
- Btw, can also assign roles to service accounts
- Only use end-user accounts if you'd like to differentiate even between different end-users on the same project

Application Credentials
- A service account is a Google account that is associated with your GCP project, as opposed to a specific user.
- Create from
-- GCP Console
-- Programmatically
- Service account is associated with credentials via environment variable GOOGLE_APPLICATION_CREDENTIALS
- At any point, one set of credentials is 'active', called Application Default Credentials.

Application Default Credentials
When your code uses a client library, the strategy checks for your credentials in the following order:
- First, ADC checks to see if the environment variable GOOGLE_APPLICATION_CREDENTIALS is set. If the variable is set, ADC uses the service account file that the variable points to.
- If the environment variable isn't set, ADC uses the default service account that Compute Engine, Container Engine, App Engine, and Cloud Functions provide, for applications that run on those services.
- If ADC can't use either of the above credentials, an error occurs.
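
The lookup order above can be mimicked in plain Python. This is a simulation of the decision order only, not the client library's code: the function name, the on_gcp_runtime flag and the returned strings are all invented for the example.

```python
import os


def resolve_credentials(environ, on_gcp_runtime):
    """Follow the ADC order: explicit key file first, then the platform's default account."""
    key_path = environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if key_path:
        return f"service account file: {key_path}"
    if on_gcp_runtime:
        # Compute Engine, Container Engine, App Engine and Cloud Functions provide one.
        return "default service account"
    raise RuntimeError("could not determine Application Default Credentials")


# Outside GCP with no variable set, the lookup fails, as described in the last bullet.
try:
    print(resolve_credentials(dict(os.environ, GOOGLE_APPLICATION_CREDENTIALS=""), on_gcp_runtime=False))
except RuntimeError as error:
    print("error:", error)
```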

End-user Authentication
- Use service accounts wherever possible
- In certain specific cases, however, end-user authentication is unavoidable
- You need to access resources on behalf of an end user of your application
-- For example, your application needs to access Google BigQuery datasets that belong to users of your application.
- You need to authenticate as yourself (not as your application)
-- For example, because the Cloud Resource Manager API can create and manage projects owned by a specific user, you would need to authenticate as a user to create projects on their behalf.

Scenario: "Sign in to Quora using Google"
- User navigates to quora.com
- Quora needs to access resources on behalf of user
- Quora presents Google sign-in screen to user; user signs in
- Quora requests Google to authenticate user
- Quora has authenticated user, now releases resource

- Resource owner: Quora guarding access to your account
- Resource server: Quora granting access to your account
- Client: Quora talking to Google
- Authorisation server: Google

OAuth 2.0
- Application needs to access resources on behalf of a specific user
- Application presents consent screen to user; user consents
- Application requests credentials from some authorisation server
- Application then uses these credentials to access resources

Creation:
- GCP Console => API Manager => Credentials => Create
- Select "OAuth client ID"
- Will create OAuth client secret

Scenario: "Access API via GCP Project"
- User wants to access some API
- Project needs to access that API on behalf of user
- Project requests GCP API Manager to authenticate user by passing client secret; API manager responds
- Project has authenticated user, now gives API access

- Resource owner: Project guarding access to your account
- Resource server: Project granting access to your account
- Client: Project talking to API manager
- Authorisation server: API manager

OAuth: Caution
- OAuth client ID secrets are viewable by all project owners and editors, but not readers
- If you revoke access to some user, remember to reset these secrets to prevent data exfiltration

Endpoints: API Keys
- Simple encrypted string
- Can be used when calling certain APIs that don't need to access private user data.
- Useful in clients such as browser and mobile applications that don't have a backend server
- The API key is used to track API requests associated with your project for quota and billing.

Creation:
- GCP Console => API Manager => Credentials => Create
- Select "API Key"

Beware:
- Can be used by anyone - Man-in-the-Middle
- Do not identify user or application making request

Identity and Access Management (IAM)
Identities:
- End-user (Google) account
- Service account
- Google group
- G-Suite domain
- Cloud Identity domain
- allUsers, allAuthenticatedUsers
Roles:
- lots of granular roles
- per resource
Resources:
- Projects
- Compute Engine instances
- Cloud Storage buckets
Policy:
- Associate identities with roles

Resource Hierarchy
- Organization >> project >> resource
- Can set an IAM access control policy at any level in the resource hierarchy
- Resources inherit the policies of the parent resource

Organization:
- Not required, but helps separate projects from individual users
- Link with G-suite super admin
- Root of hierarchy, rules cascade down

Folders:
- Logical groupings of projects

Identity-Aware Proxy (IAP)
- Identity-Aware Proxy (IAP) is an HTTPS-based, i.e. web based, way to combine all the identity management.
- IAP acts as an additional safeguard on a particular resource
- Turning on IAP for a resource causes creation of an OAuth 2.0 Client ID & secret (per resource). Don't delete any of these! IAP will stop working.

- Central authorization layer for applications accessed by HTTPS
- Application-level access control model instead of relying on network-level firewalls
- With Cloud IAP, you can set up group-based application access:
- For example, a resource could be accessible to employees and inaccessible to contractors, or accessible only to a specific department.

IAP and IAM
- IAP is an additional step, not a bypassing of IAM
- So, users and groups still need correct Cloud Identity Access Management (Cloud IAM) role

IAP: Authentication & Authorisation
Authentication:
- Requests come from 2 sources:
-- App Engine
-- Cloud Load Balancing (HTTPS)
- Cloud IAP checks the user's browser credentials
- If none exist, the user is redirected to an OAuth 2.0 Google Account sign-in
- Those credentials are sent to IAM for authorisation

Authorisation:
- As before using IAM

IAP Limitations
- Will not protect against activity inside VM, e.g. someone SSH-ing into a VM or AppEngine flexible environment
- Need to configure firewall and load balancer to disallow traffic not from serving infrastructure
- Need to turn on HTTP signed headers

Data Loss Prevention API
- Understand and manage sensitive data in Cloud Storage or Cloud Datastore
- Easily classify and redact sensitive data
-- Classify textual and image-based information
-- Redact sensitive data from text files, and classify

Deployment Manager: Configuration
- A configuration describes all the resources you want for a single deployment. The file is written in YAML syntax.
- It lists each of the resources you want to create and its respective resource properties.
- A configuration must contain a resource. Each resource must contain three components:
a) Name - a user-defined string for identification.
b) Type - the type of resource being deployed.
c) Properties - the parameters of the resource type.
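
As a minimal sketch of that structure, the Python snippet below builds a configuration as a plain dictionary (standing in for the YAML file) and checks that every resource carries the three required components. The resource name, zone and machine type are illustrative assumptions, and the helper functions are hypothetical, not part of Deployment Manager.

```python
REQUIRED_KEYS = ("name", "type", "properties")

# Hypothetical configuration, mirroring what would normally be written in YAML.
config = {
    "resources": [
        {
            "name": "my-first-vm",           # user-defined identifier (assumed)
            "type": "compute.v1.instance",   # type of resource being deployed
            "properties": {                  # parameters of the resource type
                "zone": "us-central1-a",
                "machineType": "zones/us-central1-a/machineTypes/n1-standard-1",
            },
        }
    ]
}


def missing_components(resource):
    """Return the required components that a resource dictionary lacks."""
    return [key for key in REQUIRED_KEYS if key not in resource]


def validate(configuration):
    """Raise ValueError if there are no resources or a resource is incomplete."""
    resources = configuration.get("resources", [])
    if not resources:
        raise ValueError("a configuration must contain at least one resource")
    for resource in resources:
        missing = missing_components(resource)
        if missing:
            raise ValueError(f"resource {resource.get('name', '?')} lacks {missing}")
    return True
```

Calling validate(config) succeeds for the dictionary above; removing "type" or "properties" from the resource makes it raise.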

Deployment Manager: Templates
- Templates are parts of the configuration abstracted into individual building blocks. Template files are written in Python or Jinja2.
- They are much more flexible than individual configuration files and are intended to support easy portability across deployments.
- The interpretation of each template must eventually result in the same YAML syntax.
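
A Python template might look roughly like the sketch below. Deployment Manager calls a generate_config(context) entry point and expects a dictionary with a "resources" list back. The FakeContext class is a hypothetical stand-in so the template can be run locally, and the property names "zone" and "machineType" are assumptions chosen for illustration.

```python
def generate_config(context):
    """Entry point: expand the template into a list of resources."""
    deployment = context.env["deployment"]     # name of the deployment
    zone = context.properties["zone"]          # property passed in by the configuration
    resources = [{
        "name": deployment + "-vm",
        "type": "compute.v1.instance",
        "properties": {
            "zone": zone,
            "machineType": "zones/{}/machineTypes/{}".format(
                zone, context.properties["machineType"]),
        },
    }]
    return {"resources": resources}


class FakeContext:
    """Local stand-in for the context object Deployment Manager supplies (assumption)."""

    def __init__(self, env, properties):
        self.env = env
        self.properties = properties


ctx = FakeContext({"deployment": "demo"},
                  {"zone": "us-central1-a", "machineType": "n1-standard-1"})
expanded = generate_config(ctx)
```

The same template can be reused across deployments by changing only the properties passed in, which is the portability benefit described above.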

Deployment Manager: Resource
- A resource represents a single API resource, which can be provided by a Google-managed base type.
- It can also be an API resource provided by a Type Provider.
- To specify a resource, provide a type for that resource.

Deployment Manager: Types
- A type represents a single API resource or a set of resources, and every resource needs one in order to be created.
- Base type - creates a single primitive resource; a type provider is used to create additional base types.
- Composite type - contains one or more templates preconfigured to work together.

Deployment Manager: Manifest
- A manifest is a read-only object that contains the original configuration.
- Deployment Manager generates a manifest each time a deployment is updated.
- The manifest is useful for troubleshooting a deployment.

Deployment Manager: Deployment
- A deployment is a collection of resources, deployed and managed together.

Access control for users
- Users who have access to your project can create configurations and deployments.
- IAM supports predefined and primitive roles.
- Primitive roles map directly to the legacy project owner, editor and viewer roles.

Access control for Deployment Manager
- Deployment Manager uses the credentials of the Google APIs service account to create Google Cloud Platform resources.
- The Google APIs service account is automatically granted editor permissions on the project.
- The service account exists indefinitely with the project and is only deleted when the project is deleted.

What is Runtime Configurator?
- Runtime Configurator lets you define and store data as a hierarchy of key-value pairs in Google Cloud.
- These key-value pairs can be used to dynamically configure services, communicate service states, send notification of changes to data, and share information between multiple tiers of services.
- Runtime Configurator also offers watcher and waiter services.

Concepts:
- Config resource: contains a hierarchical list of variables.
- Variables: the key-value pairs that belong to a RuntimeConfig resource.
- Watchers: use the watch() method to watch a variable and return when the variable changes.
- Waiters: wait on a cardinality condition, i.e. until a set number of variables exist under a given path.
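
The cardinality condition a waiter checks can be illustrated with a small in-memory sketch. This is not the Runtime Configurator API; the variable paths and the cardinality_met helper are hypothetical and only show the idea of counting variables under a path prefix.

```python
def cardinality_met(variables, path_prefix, number):
    """Return True once at least `number` variables exist under `path_prefix`."""
    prefix = path_prefix.rstrip("/") + "/"
    matching = [key for key in variables if key.startswith(prefix)]
    return len(matching) >= number


# Hypothetical variables, stored as a flat dict of hierarchical keys.
variables = {
    "status/web/vm-1": "ready",
    "status/web/vm-2": "ready",
    "config/port": "8080",
}

two_ready = cardinality_met(variables, "status/web", 2)
three_ready = cardinality_met(variables, "status/web", 3)
```

Here a waiter asking for two web servers would be satisfied, while one asking for three would keep waiting until another "status/web/..." variable is written.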

Cloud Key Management: Object hierarchy: Project
- Cloud KMS resources belong to a project.
- Any account with a primitive IAM role on a project that contains Cloud KMS resources has the corresponding permissions on those resources.

Cloud Key Management: Object hierarchy: Location
- A location represents the geographical data centre location where requests to Cloud KMS regarding a given resource are handled.
- Choosing a location close to you is generally faster and more reliable.
- Global - a special location; KMS resources created in it are available from multiple data centres.

Cloud Key Management: Object hierarchy: KeyRing
- A KeyRing is a grouping of CryptoKeys for organisational purposes.
- CryptoKeys inherit permissions from their KeyRing, so related keys can be managed together rather than individually.

Cloud Key Management: Object hierarchy: CryptoKey
- A CryptoKey represents a cryptographic key used for a specific purpose.
- It is used to protect some corpus of data.
- Users with the appropriate permissions on the CryptoKey can use it to encrypt and decrypt.

Cloud Key Management: Object hierarchy: CryptoKey Version
- A CryptoKeyVersion represents the key material. A CryptoKey can have many versions, numbered from 1.
- Versions have states such as enabled, disabled, scheduled for destruction, and destroyed.
- The primary version is the one used to encrypt data.

CryptoKey Version state
A CryptoKeyVersion has a state:
a) Enabled (ENABLED) - may be used for encryption and decryption requests.
b) Disabled (DISABLED) - may not be used, but can be placed back into the enabled state.
c) Scheduled for destruction (DESTROY_SCHEDULED) - scheduled for destruction and will be destroyed soon.
d) Destroyed (DESTROYED) - the key material is no longer stored in Cloud KMS.

Primary CryptoKey Version
- The primary version is the one used at encryption time.
- At any given point in time, only one version of a CryptoKey can be primary.
- If the primary CryptoKeyVersion is disabled, the CryptoKey cannot be used to encrypt data.

CryptoKey and CryptoKeyVersion states
- The state belongs to each CryptoKeyVersion.
- The CryptoKey can encrypt data only if its primary CryptoKeyVersion is enabled.
- Decryption does not need the primary version; it uses the version that encrypted the data, which must be enabled.
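
The relationship between version states and what a CryptoKey can do is easy to model. The sketch below is a toy in-memory model, not the Cloud KMS client library: the CryptoKey class and its methods are hypothetical, and no cryptography is performed.

```python
from enum import Enum


class State(Enum):
    ENABLED = "ENABLED"
    DISABLED = "DISABLED"
    DESTROY_SCHEDULED = "DESTROY_SCHEDULED"
    DESTROYED = "DESTROYED"


class CryptoKey:
    """Toy model that tracks version states only."""

    def __init__(self):
        self.versions = {}     # version number -> State
        self.primary = None    # number of the primary version

    def add_version(self, make_primary=True):
        number = len(self.versions) + 1        # versions are numbered from 1
        self.versions[number] = State.ENABLED
        if make_primary:
            self.primary = number
        return number

    def can_encrypt(self):
        # Encryption always uses the primary version, which must be enabled.
        return self.primary is not None and self.versions[self.primary] is State.ENABLED

    def can_decrypt(self, version):
        # Decryption uses whichever version encrypted the data.
        return self.versions.get(version) is State.ENABLED


key = CryptoKey()
v1 = key.add_version()
v2 = key.add_version()               # version 2 becomes primary
key.versions[v2] = State.DISABLED    # disable the primary version
```

With the primary version disabled, key.can_encrypt() is false, yet data encrypted under version 1 can still be decrypted because that version remains enabled.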

Crypto Key Rotation
Rotation in cloud KMS:
- Rotating a CryptoKey generates a new CryptoKeyVersion and marks it as the primary version.
- Each CryptoKeyVersion rotated to primary becomes the version used to encrypt data from that point on.
Frequency of key rotation:
- Encryption keys can be rotated regularly or irregularly.
- Regular rotation limits the amount of data encrypted with a single key version.
- Irregular rotation lets you disable a key version to restrict access to the data it protects, for example when a key is suspected to be compromised.
Automatic rotation:
- A CryptoKey rotation schedule can be set using the gcloud command-line tool or the Google Cloud Platform Console.
- A rotation schedule is defined by a rotation period and a next rotation time.
Manual Rotation:
- Used for irregular key rotation.
- Keys are rotated manually using the gcloud command-line tool or the Google Cloud Platform Console.
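
To make the two schedule parameters concrete, the sketch below lists the rotation times implied by a rotation period and a next rotation time. The 90-day period and the start date are assumed values for illustration, and upcoming_rotations is a hypothetical helper, not a Cloud KMS call.

```python
from datetime import datetime, timedelta, timezone


def upcoming_rotations(next_rotation_time, rotation_period, count):
    """List the next `count` rotation times implied by the schedule."""
    return [next_rotation_time + i * rotation_period for i in range(count)]


first = datetime(2024, 1, 1, tzinfo=timezone.utc)   # assumed next rotation time
period = timedelta(days=90)                          # assumed rotation period
schedule = upcoming_rotations(first, period, 3)
```

Each entry in schedule is the moment a new primary version would be created automatically. A manual rotation can still happen between those times.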

Setting up Cloud KMS in separate project
Any user with the owner role on a project can access and manage the keys in it, so Cloud KMS is best run in a separate project.
a) Create the key project without an owner (recommended).
b) Grant an owner role for your key project (not recommended).

Choosing the right IAM roles
- In smaller organisations, the owner, editor and viewer roles provide sufficient granularity for key management.
- In large organisations, separation of duties is required.

Roles are recommended for:
a) The business owner whose application requires encryption.
b) The user managing Cloud KMS.
c) The user or service using keys for encryption and decryption operations.

Cloud KMS: Overview of secret management
Cloud KMS (Key Management Service) allows you to keep encryption keys in one central cloud service, for direct use by other cloud resources and applications. With Cloud KMS you are the ultimate custodian of your data, you can manage encryption in the cloud the same way you do on-premises, and you have a provable and monitorable root of trust over your data.

- Common places to store secrets include code or binaries, deployment management tools, etc.
- The security concerns are authorisation, verification of usage, encryption at rest, rotation, and isolation.
- The functionality concerns of secret management are consistency and version management.

Cloud KMS: Choosing a secret management solution
- Example approaches include storing secrets in code, encrypted with a key from Cloud KMS, and storing secrets in a storage bucket in Google Cloud Storage.
- Considerations for changing secrets include rotating secrets, caching secrets locally, and using a separate solution or platform.
- Encryption options are to use application-layer encryption with a key in Cloud KMS, or to use the default encryption built into the Cloud Storage bucket.
- Access to secrets is managed through access controls on the bucket in which the secret is stored, and access controls on the key that encrypts the bucket in which the secret is stored.
- Key rotation and secret rotation are both part of secret management.
- Permission management without a service account requires several users: an organizational-level administrator, a second user that has a storage role, a third user with the cloudkms.admin role, and a fourth user that has both the storage.objectAdmin and cloudkms.cryptoKeyEncrypterDecrypter roles.

Cloud KMS: Envelope encryption
- The key used to encrypt data itself is called a data encryption key (DEK).
- The DEK is encrypted (or wrapped) by a key encryption key (KEK).
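
The flow can be sketched end to end in a few lines. The toy_cipher function below is an illustrative stand-in built from the standard library and is NOT secure. In practice the data would be encrypted with a vetted cipher, the KEK would stay inside Cloud KMS, and wrapping and unwrapping would be KMS requests.

```python
import hashlib
import secrets


def toy_cipher(key, data):
    """XOR data with a SHA-256-derived keystream (stand-in only, not secure)."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream.extend(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
        counter += 1
    # XOR is its own inverse, so the same function encrypts and decrypts.
    return bytes(a ^ b for a, b in zip(data, stream))


kek = secrets.token_bytes(32)    # key encryption key (would live in Cloud KMS)
dek = secrets.token_bytes(32)    # data encryption key, generated locally

ciphertext = toy_cipher(dek, b"customer record")     # 1. encrypt the data with the DEK
wrapped_dek = toy_cipher(kek, dek)                   # 2. wrap the DEK with the KEK
# Store ciphertext and wrapped_dek together; discard the plaintext DEK.

unwrapped_dek = toy_cipher(kek, wrapped_dek)         # 3. unwrap the DEK with the KEK
plaintext = toy_cipher(unwrapped_dek, ciphertext)    # 4. decrypt the data with the DEK
```

Only the small DEK ever needs to be sent for wrapping, so large data never has to pass through the key service, and rotating the KEK means re-wrapping DEKs rather than re-encrypting the data.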

Data Exfiltration
Data exfiltration occurs when an authorized person extracts data from the secured systems where it belongs, and either shares it with unauthorized third parties or moves it to insecure systems. It can occur due to the actions of malicious or compromised actors, or accidentally.

Data Exfiltration: Types
- Outbound email
- Downloads to insecure devices
- Uploads to external services
- Insecure cloud behaviour
- Rogue admins, pending employee terminations

Data Exfiltration: Don'ts for VMs
- Don't allow outgoing connections to unknown addresses
- Don't make IP addresses public
- Don't allow remote connection software e.g. RDP
- Don't give SSH access unless absolutely necessary

Data Exfiltration: Dos for VMs
- Use VPCs and firewalls between them
- Use a bastion host as a chokepoint for access
- Use Private Google Access
- Use Shared VPC, aka Cross-Project Networking

Data Exfiltration: Bastion Hosts
- Limit source IPs that can communicate with the bastion
- Configure firewall rules to allow SSH traffic to private instances from only the bastion host.
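
The two rules amount to a pair of source-address checks, sketched below with the standard ipaddress module. The addresses are assumed values, and the functions are a hypothetical model of the firewall logic rather than GCP firewall configuration.

```python
import ipaddress

# Assumed addresses for illustration only.
BASTION_IP = ipaddress.ip_address("10.0.0.5")                  # bastion's internal address
ALLOWED_ADMIN_RANGE = ipaddress.ip_network("203.0.113.0/24")   # trusted source range


def bastion_accepts(source_ip):
    """Rule 1: only a limited set of source IPs may reach the bastion."""
    return ipaddress.ip_address(source_ip) in ALLOWED_ADMIN_RANGE


def private_instance_accepts_ssh(source_ip, port):
    """Rule 2: private instances accept SSH (port 22) only from the bastion."""
    return port == 22 and ipaddress.ip_address(source_ip) == BASTION_IP
```

An administrator at 203.0.113.10 can reach the bastion but not a private instance directly; the private instance accepts SSH only when the connection originates from 10.0.0.5.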

A bastion host fills the same role as a jump host.

Cloud Data Transfer Use Cases - Data Center Migration
Managing the data you create and store on-premises cost-effectively, securely, and reliably takes relentless focus and significant resources. As organizations face exponential growth of their data, many are turning to the cloud to scale with them. For your structured and unstructured data sets, whether they are small and frequently accessed or huge and rarely referenced, Google offers solutions to migrate that data quickly to Google Cloud Storage, BigQuery or Dataproc.

Cloud Data Transfer Use Cases - Decommission Tape Libraries and Infrastructure
Many organizations accumulate vast libraries of magnetic tape as they copy data for backup, archival or disaster recovery purposes. You can easily transfer data from tape to Google Cloud Storage. Once in Google Cloud you can generate new insights with advanced analytics, discover it more easily for regulatory and legal purposes and apply machine learning.

Cloud Data Transfer Use Cases - Machine Learning
Google Cloud Machine Learning Engine is a managed service that enables you to easily build machine learning models that work on any type of data, of any size. Create your model with the powerful TensorFlow framework that powers many Google products, from Google Photos to Google Cloud Speech. Build models of any size with our managed scalable infrastructure. Your trained model is immediately available for use with our global prediction platform that can support thousands of users and TBs of data.

Cloud Data Transfer Use Cases - Content Storage and Delivery
To serve users around the world with the highest availability, Google offers multi-regional setups designed for video streaming and frequently accessed content like web sites and images.

For analytics and batch processing, regional setups are available to meet the unique requirements of those workloads.

For content-rich use cases like these you can choose a data transfer option that will have minimal impact on your network while moving large amounts of data.

Cloud Data Transfer Use Cases - Backup and Archival
With the increased frequency of cloud outages, you need to ensure your data is always available. Using our data transfer services you can easily back up data from another cloud storage provider to Google Cloud Storage. You can ensure your data is retained cost-effectively by taking advantage of ultra low-cost, highly durable and highly available archival storage offered through Google's Nearline and Coldline storage classes.

Object lifecycle management enables this automatically, transitioning data from one storage class to the next depending on your business's cost and availability needs at the time.
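
A lifecycle policy of this kind is expressed as a small JSON document attached to the bucket. The sketch below builds one in Python. The 30-day and 365-day thresholds are assumed values, and the layout follows the Cloud Storage lifecycle configuration format as best understood here, so check it against the current documentation before use.

```python
import json

# Move objects to Nearline after 30 days and to Coldline after 365 days (assumed ages).
lifecycle = {
    "lifecycle": {
        "rule": [
            {
                "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
                "condition": {"age": 30},
            },
            {
                "action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
                "condition": {"age": 365},
            },
        ]
    }
}

lifecycle_json = json.dumps(lifecycle, indent=2)
```

Writing lifecycle_json to a file gives a policy document that can then be applied to a bucket with the Cloud Storage tooling.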

Storage Transfer Service Overview
Storage Transfer Service transfers data from an online data source to a data sink. Your data source can be an Amazon Simple Storage Service (Amazon S3) bucket, an HTTP/HTTPS location, or a Cloud Storage bucket. Your data sink (the destination) is always a Cloud Storage bucket.

You can use Storage Transfer Service to:
- Back up data to a Cloud Storage bucket from other storage providers.
- Move data from a Multi-Regional Storage bucket to a Nearline Storage bucket to lower your storage costs.

Storage Transfer Service: Options
Storage Transfer Service has options that make data transfers and synchronization between data sources and data sinks easier. For example, you can:
- Schedule one-time transfer operations or recurring transfer operations.
- Delete existing objects in the destination bucket if they don't have a corresponding object in the source.
- Delete source objects after transferring them.
- Schedule periodic synchronization from data source to data sink with advanced filters based on file creation dates, file-name filters, and the times of day you prefer to import data.
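
The effect of the two delete options can be seen in a small planning sketch. This is a hypothetical model of the behaviour described above, not the service's actual algorithm or API; objects are represented as a mapping from name to content hash.

```python
def plan_sync(source, sink, delete_extra_in_sink=False, delete_from_source=False):
    """Return (to_copy, to_delete_in_sink, to_delete_in_source) as sorted name lists."""
    # Copy anything missing from the sink or whose content differs.
    to_copy = sorted(name for name, digest in source.items() if sink.get(name) != digest)
    # Optionally remove sink objects that have no counterpart in the source.
    to_delete_in_sink = sorted(set(sink) - set(source)) if delete_extra_in_sink else []
    # Optionally remove source objects once they have been transferred.
    to_delete_in_source = sorted(source) if delete_from_source else []
    return to_copy, to_delete_in_sink, to_delete_in_source


source = {"a.csv": "h1", "b.csv": "h2"}
sink = {"a.csv": "h1", "old.csv": "h9"}
plan = plan_sync(source, sink, delete_extra_in_sink=True)
```

With delete_extra_in_sink enabled the sink ends up as a mirror of the source. With delete_from_source enabled the transfer behaves like a move rather than a copy.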

In order to have full access to Storage Transfer Service, you must be the EDITOR or OWNER of the project that creates the transfer job. If you are a VIEWER of the project, you can view and list transfer jobs and transfer operations associated with the data sink.

Google Transfer Appliance
Google Transfer Appliance is a high capacity storage server that enables you to transfer up to one petabyte of data on a single appliance and securely ship it to a Google upload facility, where the data is uploaded to Google Cloud Storage. You can serially lease multiple Transfer Appliances if your data size exceeds one petabyte.

Transfer Appliance offers two models:
- The rackable 100 terabyte (TB), which stores from 100 TB up to potentially 200 TB of data, depending on the deduplication and compression ratio of your data.
- The standalone 480 TB, which stores from 480 TB up to potentially 1 petabyte (PB).
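
A rough sizing calculation shows how many appliances a given data set would need. The sketch uses only the raw capacities quoted above and ignores deduplication and compression, so it is a conservative estimate; the 1,500 TB data size is an assumed example.

```python
import math

RACKABLE_TB = 100      # raw capacity of the rackable model
STANDALONE_TB = 480    # raw capacity of the standalone model


def appliances_needed(data_tb, capacity_tb):
    """Number of appliances required, assuming no deduplication or compression."""
    return math.ceil(data_tb / capacity_tb)


data_tb = 1500         # assumed data size for illustration
standalone_count = appliances_needed(data_tb, STANDALONE_TB)
rackable_count = appliances_needed(data_tb, RACKABLE_TB)
```

For this example the standalone model needs four appliances and the rackable model fifteen, leased one after another.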

Is Transfer Appliance suitable for me?
Transfer Appliance is a good fit for your data transfer needs if:
- You are an existing Google Cloud Platform (GCP) customer.
- Your data resides in the United States.
- Your data size is greater than or equal to 20TB.
- You don't require HIPAA compliance.

BigQuery Data Transfer Service
The BigQuery Data Transfer Service automates data movement from Software as a Service (SaaS) applications such as Google AdWords and DoubleClick on a scheduled, managed basis.

You can access the BigQuery Data Transfer Service using the:
- BigQuery web UI
- BigQuery command-line tool
- BigQuery Data Transfer Service API

After you configure a data transfer, the BigQuery Data Transfer Service automatically loads data into BigQuery on a regular basis. You can also initiate data backfills to recover from any outages or gaps. Currently, you cannot use the BigQuery Data Transfer Service to transfer data out of BigQuery.

BigQuery Data Transfer Service: Supported data sources
BigQuery Data Transfer Service supports loading data from the following data sources:
- Google AdWords
- DoubleClick Campaign Manager
- DoubleClick for Publishers
- Google Play (beta)
- YouTube - Channel Reports
- YouTube - Content Owner Reports

Cloud Datalab
- Cloud Datalab is packaged as a container and run in a VM (Virtual Machine) instance.
- Cloud Datalab uses notebooks instead of the text files containing code. Notebooks bring together code, documentation written as markdown, and the results of code execution, whether as text, image, or HTML/JavaScript.
- Cloud Datalab notebooks can be stored in Google Cloud Source Repository, a git repository. This git repository is cloned onto persistent disk attached to the VM. This clone forms your workspace. To share your work with other users, push your changes from this local workspace to the repository.
- When the executed code accesses Google Cloud services such as BigQuery or Google Machine Learning Engine, it uses the service account available in the VM. Hence, the service account must be authorized to access the data or request the service.
- The VM used for running Cloud Datalab is a shared resource accessible to all the members of the associated cloud project. Therefore, using an individual's personal cloud credentials to access data is strongly discouraged.

Cloud Datalab: Usage Scenarios
Cloud Datalab is an interactive data analysis and machine learning environment designed for Google Cloud Platform. You can use it to explore, analyze, transform, and visualize your data interactively and to build machine learning models from your data.

A few ideas to get you started:
- Write a few SQL queries to explore the data in BigQuery. Put the results in a Dataframe and visualize them as a histogram or a line chart.
- Read data from a CSV file in Google Cloud Storage and put it in a Dataframe to compute statistical measures such as mean, standard deviation, and quantiles using Python.
- Try a TensorFlow or scikit-learn model to predict results or classify data.
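
The second idea above can be sketched with the standard library alone. In a notebook the CSV would be read from Cloud Storage into a Dataframe; here an inline sample with made-up values stands in for the file so the snippet runs anywhere.

```python
import csv
import io
import statistics

# Inline sample standing in for a CSV file read from Cloud Storage (assumed data).
sample = io.StringIO("name,count\nEmma,18\nOlivia,17\nAva,15\nMia,13\nSophia,12\n")
counts = [int(row["count"]) for row in csv.DictReader(sample)]

mean = statistics.mean(counts)
stdev = statistics.stdev(counts)                 # sample standard deviation
quartiles = statistics.quantiles(counts, n=4)    # three cut points: Q1, median, Q3
```

The same three measures are available on a Dataframe column, which is the more natural choice once the data is larger than a toy sample.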

Cloud Datalab: Pricing
There is no charge for using Google Cloud Datalab. However, you do pay for any Google Cloud Platform resources you use with Cloud Datalab, for example:
- Compute resources: You incur costs from the time of creation to the time of deletion of the Cloud Datalab VM instance. The default Cloud Datalab VM machine type is n1-standard-1, but you can choose a different machine type. You are also charged for a 20GB Standard Persistent Disk, which is used as a Boot Disk, and a 200GB Standard Persistent Disk, where user notebooks are stored.
>>> The 20GB boot disk is deleted when the VM instance is deleted, but the 200GB disk remains after the deletion of the VM until you delete it.
- Storage resources: Notebooks are saved to Persistent Disk and backed up to Google Cloud Storage
- Data Analysis Services: You incur Google BigQuery costs when issuing SQL queries within Cloud Datalab notebooks. Also, when you use Google Cloud Machine Learning, you may incur Cloud Machine Learning Engine and/or Google Cloud Dataflow charges.

Cloud Dataprep
- Instant data exploration: Visually explore and interact with data in seconds. Instantly understand data distribution and patterns. You don't need to write code. You can prepare data with a few clicks.
- Intelligent data cleansing: Cloud Dataprep automatically identifies data anomalies and helps you to take corrective action fast. Get data transformation suggestions based on your usage pattern. Standardize, structure, and join datasets easily with a guided approach.
- Serverless: Cloud Dataprep is a serverless service, so you do not need to create or manage infrastructure. This helps you to keep your focus on the data preparation and analysis.
- Seriously powerful: Cloud Dataprep is built on top of the powerful Cloud Dataflow service. Cloud Dataprep is auto-scalable and can easily handle processing massive data sets.
- Supports common data sources of any size: Process diverse datasets, structured and unstructured. Transform data stored in CSV, JSON, or relational table formats. Prepare datasets of any size, megabytes to terabytes, with equal ease.
- Integrated with Google Cloud Platform: Easily process data stored in Cloud Storage, BigQuery, or from your desktop. Export clean data directly into BigQuery for further analysis. Seamlessly manage user access and data security with Cloud Identity and Access Management.

Cloud ML Engine Overview
Cloud ML Engine mainly does two things:
- Enables you to train machine learning models at scale by running TensorFlow training applications in the cloud.
- Hosts those trained models for you in the cloud so that you can use them to get predictions about new data.

Cloud ML Engine manages the computing resources that your training job needs to run, so you can focus more on your model than on hardware configuration or resource management.

Cloud ML: Prepare your trainer and data for the cloud
The key to getting started with Cloud ML Engine is your training application, written in TensorFlow. You can develop your trainer as you would any other TensorFlow application, but you need to follow a few guidelines about your approach to work well with cloud training.

You must make your trainer into a Python package and stage it on Google Cloud Storage where your training job can access it.

As with your application package, your data must be stored where Cloud ML Engine can access it. The easiest solution is to store your data in Google Cloud Storage in a bucket associated with the same project that you use for Cloud ML Engine tasks.

BigQuery & MapReduce Selection Criteria
Use BigQuery:
- Finding particular records with specified conditions. For example, to find request logs with specified account ID.
- Quick aggregation of statistics with dynamically changing conditions. For example, getting a summary of request traffic volume from the previous night for a web application and drawing a graph from it.
- Trial-and-error data analysis. For example, identifying the cause of trouble and aggregating values by various conditions, such as by hour or by day.

Use MapReduce:
- Executing complex data mining on Big Data which requires multiple iterations and paths of data processing with programmed algorithms.
- Executing large join operations across huge datasets.
- Exporting large amounts of data after processing.

DEVSHELL_PROJECT_ID
Environment variable that holds the current project id.

Referring tables in BigQuery
<project id>:<dataset>.<table>

Examples:
publicdata:samples.shakespeare
bigquery-public-data:usa_names.usa_1910_current
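
When scripting around these references it can help to split them into parts. The parse_table_ref function below is a hypothetical helper, not part of the bq tool. It assumes the <project id>:<dataset>.<table> form shown above and treats the project as optional.

```python
def parse_table_ref(ref):
    """Split '<project id>:<dataset>.<table>' into (project, dataset, table).

    The project is None when the reference omits it.
    """
    project, sep, rest = ref.rpartition(":")    # split on the last colon
    dataset, dot, table = rest.partition(".")
    if not dot or not dataset or not table:
        raise ValueError(f"not a table reference: {ref!r}")
    return (project if sep else None), dataset, table


shakespeare = parse_table_ref("publicdata:samples.shakespeare")
names = parse_table_ref("bigquery-public-data:usa_names.usa_1910_current")
local = parse_table_ref("babynames.all_names")
```

Splitting on the last colon keeps project IDs that themselves contain a colon intact.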

BigQuery Shell Commands
Show table info:
bq show publicdata:samples.shakespeare
Show first 10 rows:
bq head -n 10 publicdata:samples.shakespeare

BigQuery Commands: Load data from storage
bq load --source_format=CSV babynames.babynames_2011 gs://<bucket-name>/babynames/yob2011.txt name:string,gender:string,count:integer

Wildcards can be used in the file names, e.g. yob20*.txt.

BigQuery Commands: SQL Query
bq query "SELECT name, count FROM babynames.all_names WHERE gender = 'F' ORDER BY count DESC LIMIT 5"

BigQuery Commands: Export table to cloud storage
bq extract babynames.all_names gs://<bucket-name>/export/all_names.csv

Without work one finishes nothing. - Ralph Waldo Emerson
© 2026 Fatskills.com

All trademarks, logos and brand names are the property of their respective owners. All company, product and service names used in this website are for identification purposes only. Use of these names, trademarks and brands does not imply endorsement.
