Fatskills
Practice. Master. Repeat.
Study Guide: Google Cloud Data Engineer Certification Important Facts To Know
Source: https://www.fatskills.com/google-cloud-certified-professional-data-engineer/chapter/google-cloud-data-engineer-certification-important-facts-to-know

By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.

Bigtable's security can be controlled at what points?
Folder
Project
Instance

BigQuery's security can be controlled at what points?
Folder
Project
Dataset

A project can contain a Bigtable ____, which contains one or two _____, each with 3 to many ____
Instance
Clusters
Nodes

If configured with SSDs, a Bigtable node can handle ____ QPS read or write, with a latency of ____ms, and can hold ____ total data.
10,000
6ms
2.5TB

Remember, with the 3-node minimum in a production cluster, a base production cluster can handle 30,000 QPS and hold 7.5 TB.

If configured with HDDs, a Bigtable node can handle ____ QPS read with a latency of ____, ____ QPS writes, with a latency of ____, and can store ____ of data.
500
200ms
10,000
50ms
8TB

Remember, with the 3-node minimum in a production cluster, a base production cluster can hold 24 TB of data and handle 1,500 read QPS and 30,000 write QPS.
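
The 3-node totals above are just the per-node figures multiplied by the node count. A minimal sketch of that arithmetic, using only the numbers quoted in this guide (the names PER_NODE and cluster_capacity are made up for illustration, not part of any Bigtable API):

```python
# Per-node figures as quoted in this guide (not fetched from any API).
PER_NODE = {
    "ssd": {"read_qps": 10_000, "write_qps": 10_000, "storage_tb": 2.5},
    "hdd": {"read_qps": 500, "write_qps": 10_000, "storage_tb": 8},
}


def cluster_capacity(storage_type, nodes=3):
    """Multiply each per-node figure by the node count."""
    return {name: value * nodes for name, value in PER_NODE[storage_type].items()}


ssd = cluster_capacity("ssd")
hdd = cluster_capacity("hdd")
print(ssd)
print(hdd)
```

Passing a different node count, such as cluster_capacity("ssd", nodes=10), gives the same linear estimate for a larger cluster.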

An Apache Beam ____ aggregates data, which is then emitted by a ____. The emitted data is known as a ____.
Window
Trigger
Pane

The four types of Dataflow (Apache Beam) windows are?
Fixed Time
Sliding Time
Per-Session
Single Global
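
To make the window types concrete, here is a rough plain-Python sketch of how an element's timestamp could map to fixed, sliding, and session windows. It is not the Apache Beam API; the function names and the use of integer seconds are assumptions for illustration only. The single global window needs no code, since every element lands in the same window.

```python
def fixed_window(timestamp, size):
    """Return the single [start, end) fixed window containing the timestamp."""
    start = timestamp - (timestamp % size)
    return (start, start + size)


def sliding_windows(timestamp, size, period):
    """Return every [start, end) sliding window containing the timestamp.

    Windows are `size` seconds long and a new one starts every `period`
    seconds, so one element can belong to several overlapping windows.
    """
    windows = []
    start = timestamp - (timestamp % period)
    while start > timestamp - size:
        windows.append((start, start + size))
        start -= period
    return sorted(windows)


def session_windows(timestamps, gap):
    """Group timestamps into sessions separated by at least `gap` seconds of inactivity."""
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] < gap:
            sessions[-1].append(t)
        else:
            sessions.append([t])
    return sessions


print(fixed_window(65, size=60))
print(sliding_windows(65, size=60, period=30))
print(session_windows([1, 2, 3, 50, 52], gap=10))
```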

The four types of Dataflow (Apache Beam) triggers are?
Event Time
Processing Time
Data-Driven
Composite

Currently, Dataflow only supports the ____ data-driven trigger.
.elementCountAtLeast()
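
The idea behind an element-count trigger can be sketched in plain Python: buffer elements and emit a pane whenever the buffer reaches the count. This only simulates the concept and is not Beam code. The name panes_by_count is hypothetical, and the sketch assumes each pane is discarded after it fires.

```python
def panes_by_count(elements, count):
    """Yield a pane each time at least `count` elements have been buffered."""
    buffer = []
    for element in elements:
        buffer.append(element)
        if len(buffer) >= count:
            yield list(buffer)
            buffer.clear()
    # Leftover elements never reached the count. This sketch simply drops
    # them; a real pipeline decides this through its window and trigger setup.


panes = list(panes_by_count(range(7), count=3))
print(panes)
```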

In Cloud Dataflow the default window is type ____.
Global

In Cloud Dataflow the default trigger for a PCollection is based on ____.
Event Time

In Dataflow, what's a PCollection?
A distributed set of data that your Dataflow pipeline operates on. It's usually initially created by a read operation on an external datasource. Each PTransform in the pipeline then starts with a PCollection, does something to each element in it, and generates 1+ new PCollections.

In Cloud Dataflow if using the default window together with the default trigger, the trigger fires ____ time(s) and late data is ____.
1
Discarded

In Cloud Dataflow data is guaranteed to be processed in a pipeline in the order it was sent? True/False
False

The Cloud Dataflow notion of when all data in a certain window can be expected to have arrived in the pipeline is known as the ____
Watermark. It tracks the lag between when an event happens and when it gets processed at a given point in the pipeline. That lag ebbs and flows, hence the name.

In most cases, BigQuery can automatically deduplicate streaming message inserts, true or false.
True

In order for BigQuery to deduplicate streaming inserts, all inserted records must provide an ____ and the duplicate messages must arrive within ____ minute(s) of each other.
insertId
1
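
A simplified simulation of the rule above: rows that repeat an insertId within the one-minute window are dropped. This is a plain-Python sketch of the idea, not BigQuery's implementation. The tuple layout and the choice to measure the window from the last accepted row are assumptions.

```python
DEDUP_WINDOW_SECONDS = 60  # the one-minute window quoted above


def deduplicate(rows):
    """Drop rows whose insert_id was already accepted within the dedup window.

    `rows` is an iterable of (arrival_seconds, insert_id, payload) tuples,
    assumed to be sorted by arrival time.
    """
    last_accepted = {}
    kept = []
    for arrival, insert_id, payload in rows:
        previous = last_accepted.get(insert_id)
        if previous is None or arrival - previous > DEDUP_WINDOW_SECONDS:
            kept.append((arrival, insert_id, payload))
            last_accepted[insert_id] = arrival
    return kept


rows = [(0, "a", "x"), (30, "a", "x"), (40, "b", "y"), (100, "a", "x")]
kept = deduplicate(rows)
print(kept)
```

The row arriving at 30 seconds is dropped as a duplicate, while the one at 100 seconds falls outside the window and is kept. That is why the insertId alone does not guarantee exactly-once results.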

Google Cloud Machine Learning can train and serve ____, ____, ____, and ____ models
Classification
Regression
Clustering
Dimensionality Reduction

In machine learning, linear regression models are used primarily to:
Estimate real values based on continuous variables

Examples of using Machine Learning Linear Regression models include:
Total Sales
Housing Prices
Retirement Age

In machine learning classification models are used primarily to:
Group items into known categories

Examples of using Machine Learning Classification models include:
Spam, not spam
Good movie, Bad movie
Authorized, Fraudulent
Good wine, bad wine
Picture of: Cat, Dog, Goat,...
All the words in English

In machine learning, Clustering models are used primarily to:
Gain insight into sets of data by using unsupervised learning to see what groups the data points are falling into

Three major types of machine learning are ____, ____, and ____ learning.
Supervised
Unsupervised
Reinforcement

K-Means is an example of an ____ learning algorithm used to spot ____.
Unsupervised
Clusters

In Machine Learning, what is Reinforcement learning?
Reinforcement learning is an area of machine learning concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward. Good puppy, you did well. That idea. Almost a Pavlovian model.

In Machine Learning, what is Supervised learning?
Supervised learning trains the Machine Learning algorithm from a dataset where we know and have labeled the correct answers. The model makes predictions on the training data, and learns from how far off it is from the correct answer. The process continues iteratively until the guess is within a desired error margin.
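
The predict, measure, adjust loop can be sketched with a one-feature linear model trained by gradient descent. This is an illustration only: the data is synthetic, the function name train_linear is made up, and a fixed number of epochs stands in for stopping at a desired error margin.

```python
def train_linear(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y = weight * x + bias by gradient descent on mean squared error."""
    weight, bias = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction error for every labeled example.
        errors = [(weight * x + bias) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to weight and bias.
        grad_weight = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_bias = 2 * sum(errors) / n
        weight -= learning_rate * grad_weight
        bias -= learning_rate * grad_bias
    return weight, bias


xs = [0, 1, 2, 3, 4]
ys = [2 * x + 1 for x in xs]  # labels generated from y = 2x + 1
weight, bias = train_linear(xs, ys)
print(round(weight, 3), round(bias, 3))
```

The learned weight and bias should land close to 2 and 1, the values used to generate the labels.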

In Machine Learning, what is Unsupervised learning?
Unsupervised learning learns from data that has not been labeled, classified, or categorized, by identifying commonalities and differences in the data.

In machine learning, what is Dimension Reduction?
Dimension Reduction is the process of reducing the number of features.

Two major types of Dimensional Reduction are:
Feature Extraction and Feature Selection

Feature Selection is a form of dimensional reduction which works by?
Figuring out which features may be safely removed, leaving the rest.

Feature Extraction is a form of dimensional reduction which works by?
Replacing a group of features with a new feature

In Pub/Sub, large volume message flows should use Push or Pull subscriptions?
Pull. Push delivers one message at a time. Pull can retrieve batches.

Pub/Sub adds what two pieces of data to each message?
messageId and publishTime

In Pub/Sub the messageId is guaranteed to be unique within the ____.
Topic

In Pub/Sub by default, if a recipient doesn't acknowledge a message within ____ seconds the message will be resent.
10

Pub/Sub can store messages for ____, after which time the message will be deleted.
7 days

In Pub/Sub the maximum time a subscriber can wait before acknowledging the receipt of a message is configurable. True/False
True.
gcloud pubsub subscriptions modify-message-ack-deadline ....
FYI, the default is 10 seconds.

It's easy to switch a Pub/Sub subscriber from push to pull. True/False
True

To determine which user has been accessing what in a project, examine the ____ log.
Cloud Audit Logging Data Access

Google Cloud Audit Logging maintains what three log files for every project?
Data Access
Admin Activity
System Events

A common backup format for MySQL databases is?
mysqldump

mysqldump backup files use a basic ____ file format.
SQL, with both the content and structure specified.

What's the difference between a Bigtable Developer instance, and a Bigtable Production instance?
A Developer instance has a single node and is designed for low cost testing and dev work. No SLA, guaranteed response time, etc. Upgradable at any point to Production.

A Production instance is exactly that. It has 1-2 clusters, each with a minimum of 3 nodes. Yes to SLA, etc. You cannot downgrade from Production to Developer

MapReduce is a what?
MapReduce is a massively parallel big data processing technique and programming model for distributed computing. Hadoop's implementation is based on Java.

Apache Pig is a what?
Apache Pig is a high-level data analysis language (Pig Latin) designed to greatly simplify a developer's interaction with Hadoop MapReduce. Remember though, it uses MapReduce behind the scenes, so it is no faster, just easier to code.

Apache Hive does what?
Apache Hive facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Like Pig, it greatly simplifies an analyst's interaction with Hadoop and MapReduce. Unlike Pig, it supports SQL statements.

Apache Spark is a?
Unified analytics engine for big data processing, with built-in modules for streaming, SQL, machine learning, and graph processing.

Apache Spark uses memory to achieve high performance gains over classic MapReduce. True/False
True
Up to 100 times faster than MapReduce for in-memory workloads.

Apache Spark has programming APIs (Application Programming Interfaces) for which languages?
Java
Scala
Python
R

Apache Pig is a Java API. True/False
False
Apache Pig allows users to interact and do data analysis with Hadoop and MapReduce using its own script syntax: Pig Latin

Apache Hive is written using a combination of ___ and ___.
SQL
Java

Apache Pig is much faster than Java/MapReduce. True/False
False
Pig uses its own scripting language, Pig Latin, but uses MapReduce to do all its work.

The technique that can be used to provide secure access to a single BigQuery table or view is called?
Authorized View

What steps are used to set up a BigQuery authorized view?
Create table T1 in dataset DS1
Secure DS1
Create dataset DS2
Allow access from DS2 to DS1
Create a view in DS2 of T1
Provide access to DS2 to the user

Google Cloud Datastore is being upgraded and renamed to?
Firestore

By default, Firestore (Datastore) automatically predefines an index for each property of each entity kind. True/False
In Datastore mode, True. In Firestore by default, False: you have to configure the indexes manually.

By default, Bigtable automatically predefines an index for each property of each entity kind. True/False
False. The only index in Bigtable is the key

In Bigtable, the unique identifier for each record is known as the ____?
Row Key

In Cloud Datastore, what is index explosion and when can it happen.
Cloud Datastore creates an entry in a predefined index for every indexed property of every entity. If a property has multiple values (a movie's actors), an index entry is created for each value. In a custom index that combines several multi-valued properties (a movie's actors, tags, and genres), there must be an entry for every possible combination of values, so the entry count grows combinatorially and can explode past the index limits.

In Cloud Datastore, the Movie entity has a list property for actors, a list property for genres, and a single value for title. How could a custom index be created to avoid index explosion?
indexes:
- kind: Movie
  properties:
  - name: actors
  - name: title
- kind: Movie
  properties:
  - name: genres
  - name: title

In Cloud Datastore, the maximum size for an entity is?
1MiB

MongoDB is what kind of database?
JSON document store

MongoDB has a max database size of?
32TB

Apache Cassandra is what?
A highly scalable, highly available NoSQL database

Apache Kafka is most similar to what GCP product?
Pub/Sub

Cloud SQL is a managed version of what?
MySQL or PostgreSQL database

In Cloud SQL, the maximum amount of data that can be stored in MySQL is?
10.23TB

In BigQuery, what is the proper way to reference a field in a repeated nested column (like a customers column which has a nested country)?
UNNEST the nested column, then work with the field. Like:
SELECT customer FROM `cool.example.table`, UNNEST(customers) AS customer
WHERE customer.country = "USA"
(Note the comma before UNNEST: it cross joins each row with the elements of its own customers array.)

BigQuery can export data in what formats?
JSON
Avro
CSV

The max size of any single file in a BigQuery export is?
1 GB

When BigQuery is exporting more than 1 GB of data, use what format for the export file name?
gs://[YOUR_BUCKET]/file-name-*.json
The wildcard lets it create a series of files of up to 1 GB each.

How does BigQuery charge?
By data storage (Bucket pricing)
By slot (a measure of processing resources)
By GB of data streamed in

BigQuery doesn't charge for data batch loaded from, or exported to, a bucket in the same region. True/False
True, but it does charge for streaming inserts

BigQuery should be able to handle streaming inserts up to ___ rows per second, per project.
100,000

BigQuery can load files from what sources?
File Upload
Google Cloud Storage
Google Drive
Bigtable

When a BigQuery table is set up to partition, the partitions are separated based on ____.
Time. Daily by default, but they can be configured to use other time spans.

What steps should be taken to change a BigQuery standard table to a partitioned table?
The partition type of a BigQuery table can't be changed. Would have to export to a new table.

What two types of BigQuery partitioned tables exist?
Ingestion Time Partitioned Tables
Partitioned Tables

In BigQuery ingestion time partitioned tables, what two pseudo columns are added to the tables?
_PARTITIONTIME
_PARTITIONDATE

In BigQuery ingestion time partitioned tables, how are the partitions created?
BigQuery automatically loads data into daily, date-based partitions that reflect the data's ingestion or arrival date.

It's possible to control the time frames used by BigQuery to create partitions. True/False
True. Partitioned tables allow you to bind the partitioning scheme to a specific TIMESTAMP or DATE column.

Creating a BigQuery table with SQL such that each partition expires seven days after its partition date would require what option?
partition_expiration_days=7

CREATE TABLE cooldataset.coolnewtable (neatfield INT64, transaction_date DATE)
PARTITION BY transaction_date
OPTIONS(
  partition_expiration_days=7,
  description="a table partitioned by transaction_date"
)

In many cases it's more efficient to denormalize data and load it into a single big BigQuery table because?
BigQuery only indexes on the key column, so joins are relatively inefficient. Yes, denormalization uses more disk space, but that's cheap, so...

The best way to denormalize data in BigQuery is to take advantage of its ____.
Native support for nested and repeated structures.

In a BigQuery query, what does the LIMIT do? What doesn't it do?
LIMIT does limit the number of records in the result, but it does not change the number of records processed by the query

When a machine learning model is training, it adjusts what values?
Weights and Bias

In Machine Learning, a neuron does what?
It accepts a group of weighted inputs, applies an activation function, and returns an output

In Machine Learning, each neuron accepts what?
Features from a training set or outputs from a previous layer of neurons.

In Machine Learning, what does a Bias term represent?
It represents a constant value added to the input of a neuron. So if a neuron calculation comes to 0, the bias can overcome that so the actual output has a value.

In Machine Learning, what is Weight?
The input value to a neuron is the sum of the outputs from the previous neurons, each with a weight value attached (multiplied). You can think of that as a value adding extra importance to the decisions from certain neurons. So the sum of Wi*Xi. Then you'd add the bias. So
Sum Of (Wi*Xi) + bias
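
The weighted sum plus bias can be written out directly. A minimal sketch of one neuron follows. The sigmoid activation and the function names are assumptions chosen for illustration; the guide does not name a specific activation function.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1 / (1 + math.exp(-z))


def neuron_output(inputs, weights, bias, activation=sigmoid):
    """Compute activation(sum(w_i * x_i) + bias) for one neuron."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return activation(weighted_sum + bias)


# With all-zero inputs the weighted sum is 0, so the bias alone decides the output.
print(neuron_output([0.0, 0.0], [0.4, -0.2], bias=0.0))
print(neuron_output([0.0, 0.0], [0.4, -0.2], bias=2.0))
```

The second call shows the point made in the bias card: the inputs contribute nothing, yet the bias still shifts the output away from the neutral value.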

Machine Learning code in GCP can be created with which languages?
Java and Python

Machine Learning code in GCP can be created with which libraries?
TensorFlow for Java
Scikit and XGBoost for Python

In a nutshell, Machine Learning breaks down into which major steps?
Data Preparation
Code the Model
Train the Model
Evaluate the accuracy of the Model
Tune Hyperparameters
Deploy
Handle prediction requests
Monitor/Evaluate

In GCP Machine Learning, what are some fundamental differences between web and batch prediction requests?
Web requests need to be optimized for handling single requests in a reasonable amount of time (person's waiting for the response to load). Batch requests can handle larger sets of requests and predictions, that both originate and end up in a Google Cloud Storage bucket.

In Machine Learning, what are hyperparameters?
Hyperparameters contain the data that controls the training process and include: input neuron count, network layers (how deep the network is), neurons in each layer, output neuron count.

What are the three layer types in a deep neural network
Input layer, output layer, and hidden layers

How would you move a Machine Learning model?
Package it, export the serialized model to a staging bucket, redeploy it to its new home

What gcloud command submits a training model?
gcloud ml-engine jobs submit training
specify job, package, details, region, machine type or scale

How do Machine Learning scale tiers work?
There are several standard scales which specify the type for the master, and the number and type for the worker, and parameter servers. So for example
STANDARD_1: One master: n1-highcpu-8, four workers: n1-highcpu-8, three parameter servers: n1-standard-4

In the Machine Learning custom tier, what can you specify?
CUSTOM allows you to control the type of machine for your master node, the number and type of workers, and the number and type of parameter servers

In a Wide and Deep neural network, what does wide mean? What does deep mean?
Wide refers to the number of neurons in the input layer. Deep refers to the number of hidden layers, the layers between the input and output layers.

In a neural network, how does it being Wide help you?
Wide refers to the number of neurons in the input tier and it tends to help with exact matching and memorization learning.

In a neural network, how does it being Deep help you?
Deep refers to the number of hidden layers in a network and it tends to help generalize learning. "You liked X, so you might also like..."

K-Means clustering is what?
K-Means clustering is an unsupervised machine learning algorithm that groups similar data points together and helps discover underlying patterns.
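
A tiny one-dimensional sketch of the K-Means loop: assign each point to its nearest centroid, then move each centroid to the mean of its points. The starting centroids are passed in by hand here, an assumption that keeps the example repeatable. Real implementations usually pick them randomly and work in many dimensions.

```python
def k_means_1d(points, centroids, iterations=10):
    """Cluster 1-D points around the given starting centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(cluster) / len(cluster) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters


centroids, clusters = k_means_1d([1, 2, 3, 10, 11, 12], centroids=[1, 10])
print(centroids)
print(clusters)
```

No labels are supplied anywhere, which is what makes this unsupervised: the two groups emerge from the distances alone.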

What gcloud switch is used to run a TensorFlow training job locally?
local
gcloud ml-engine local train

A sparse vector is what?
A vector made up mostly of zeros. One-hot vectors, which hold a single 1, are a common case:
[0,1]
[0,0,1,0,0]

In machine learning, a common technique to handle a feature that represents a category with a limited number of options is what?
One-hot encoding

How is one-hot encoding used to convert categories into a machine learning friendly format?
One-hot encoding converts each option in a category into a sparse vector. For example:
Red [1,0,0]
Blue [0,1,0]
Green [0,0,1]
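
The Red, Blue, Green mapping above can be reproduced with a few lines of plain Python. The function name one_hot is made up for this sketch; ML libraries ship their own encoders.

```python
def one_hot(value, categories):
    """Return a list with a 1 at the value's position and 0 everywhere else."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if category == value else 0 for category in categories]


colors = ["Red", "Blue", "Green"]
for color in colors:
    print(color, one_hot(color, colors))
```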

In machine learning models, feature values tend to break down into what two major types?
Continuous: Numbers in a range
Categorical: A group of possible values

What is feature engineering?
Feature engineering is the process of using domain knowledge to pick features that make machine learning algorithms work efficiently.

Name two feature engineering approaches
Bucketization or binning: converting a feature from a continuous string of values into several bucketed values, usually tied to range. Not every temp, but a group of temp ranges.

Crossing or cross feature columns: combine a group of features into a new feature. So come up with a single value that crosses age and weight, or a single value that combines latitude and longitude.
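
Both approaches can be sketched with the standard library. The bucket boundaries and the "_x_" separator below are hypothetical choices for illustration, not values from any GCP library.

```python
import bisect


def bucketize(value, boundaries):
    """Return the index of the bucket that a continuous value falls into.

    Boundaries must be sorted; n boundaries produce n + 1 buckets.
    """
    return bisect.bisect_right(boundaries, value)


def cross(*bucket_ids):
    """Combine several bucket ids into one categorical feature value."""
    return "_x_".join(str(b) for b in bucket_ids)


temperature_boundaries = [0, 15, 30]  # hypothetical ranges, in degrees Celsius
print(bucketize(22.5, temperature_boundaries))  # bucket 2 covers 15 <= t < 30

lat_bucket = bucketize(37.77, [-30, 0, 30, 60])
lon_bucket = bucketize(-122.42, [-90, 0, 90])
print(cross(lat_bucket, lon_bucket))
```

The crossed value can then be one-hot encoded like any other category, which lets a model learn something about a specific latitude and longitude cell rather than about each coordinate separately.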

It is possible to change a Google Cloud Storage bucket from Regional to Nearline to Coldline, and back. True/False
True

Which trigger does Dataflow not support? Count, size, time, or combination?
Size

It is possible to change a Google Cloud Storage bucket from Regional to Multi-Regional. True/False
False

A Nearline or Coldline bucket can also be Regional or Multi-Regional. True/False
True

It is possible to set unique Google Cloud Storage classes (Regional, Nearline, Coldline) at the file level. True/False
True

The Dataflow sink to BigQuery only supports streaming. True/False
False. It supports both batch and streaming loads from Dataflow into BigQuery

Dataflow connects to Bigtable using the ____.
Cloud Dataflow Connector

To run Java Dataflow jobs locally for testing, use the ____.
DirectRunner (named DirectPipelineRunner in the older Dataflow SDK 1.x)

What IAM role is required to run a Dataflow job?
dataflow.worker

The workflow through a typical Apache Beam (Dataflow) app contains what major steps?
Create the pipeline, Create or load the first PCollection of data, Apply PTransforms to each PCollection, Write the transformed PCollection to some sink.

In Cloud Dataflow, pipelines frequently share data. True/False
False. Pipelines don't share data, not directly with each other. That would impact Dataflow's ability to process large amounts of data in parallel.

In Cloud Dataproc, what can't be stored on preemptible workers?
Data

Does autoscaling have to be initially enabled for Cloud Dataflow, or is it enabled by default?
It's enabled by default

How is Cloud Dataflow autoscaling disabled/enabled?
By setting the autoscaling_algorithm option

Cloud Dataflow in autoscaling mode will allow a default maximum of how many Compute Engine Instances?
1000 per job (n1-standard-4, by default), or the max Compute Engine quota for the project, whichever is lower.

What Cloud Dataflow option can be used to change the Compute Engine instance type?
worker_machine_type

In Cloud Dataproc, what security role is needed to execute jobs?
dataproc.worker

What terminal command is needed to create a Dataproc cluster?
gcloud dataproc clusters create ....

Cloud Dataproc is essentially a GCP managed instance of what?
Hadoop and Spark

In Dataproc, what role does YARN play?
Yet Another Resource Negotiator (YARN) is the resource management and job scheduling technology at the heart of the Hadoop architecture.

How should a Dataproc's YARN site be accessed?
Use SOCKS through an SSH tunnel. When you're in Cloud Shell, that's the little "Web Preview" button, though you'll have to manually set the port correctly.

What are the two configuration options in Cloud Dataproc for the Master server?
1 master (default)
3 masters (Hadoop HA)

In Dataproc High Availability mode, what are some of the cluster changes?
3 masters
All masters participate in a ZooKeeper cluster
YARN configured for HA
HDFS configured for HA

What is Apache ZooKeeper?
It's a Hadoop service designed to share configurations, naming, and other group services across the Hadoop cluster.

How is Cloud Dataproc structured?
Project
Cluster
Master/Worker Nodes
Jobs

What can a dataproc.viewer see?
Details about the jobs and cluster

Cloud Dataproc is billed per ___
Minute

Bigtable requests run through a ______ before they hit a BT node.
Front End Server

When using gcloud to create a Dataproc cluster, how can property files be modified?
--properties 'fileAlias:cool.key=value'
--properties 'spark:spark.master=...'

To customize software in a Dataproc cluster:
Set initialization actions
Use --properties
SSH in and manage

To enable Bigtable replication ____.
Create multiple clusters in the same Bigtable instance. Replication starts automatically. Use a different Zone for each cluster, all in the same region.

Data can easily be transferred into Dataproc via ____
SSH

A Bigtable cluster is a Multi-Regional, Regional, Zonal resource?
Zonal

Bigtable is a Multiregional, Regional, Zonal resource?
Regional

In Bigtable, what are some recommendations on choosing a key?
Group keys containing like data together (data from sensor 1)
Keys should distribute evenly across the tablespace
Reasonably short
Contain data fields
Timestamps at end not beginning
Reverse domain names

If a Bigtable node fails, what happens to its data?
Nothing, the data is replicated and stored safe in tablets in Colossus

What is Borg?
Google's internal container management system.

What is Colossus?
Google's highly distributed, redundant, cluster-level file system

In Bigtable, what is hotspotting?
When a small group of keys (table section) are over utilized, causing Bigtable to overuse particular servers. For example, streaming data with keys that all start with the timestamp. All the writes will hit the same section of the cluster causing hotspotting.

What is typically, the single most effective way to avoid hotspotting?
Field promotion

In Bigtable, what is Field Promotion?
Adding one or more of the record's fields to the beginning of the key: sensorId#timestamp, region:center:timestamp, reverseUrl/timestamp
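
A small sketch of building promoted keys. Bigtable keeps rows sorted by row key, so putting the sensor id first keeps each sensor's readings together and stops simultaneous writes from all landing on the same key range. The function names and sample values are made up for illustration.

```python
def promoted_row_key(sensor_id, timestamp):
    """Put the sensor id first so writes spread across sensors, not across time."""
    return f"{sensor_id}#{timestamp}"


def reverse_domain(domain):
    """Turn 'maps.google.com' into 'com.google.maps' so related hosts sort together."""
    return ".".join(reversed(domain.split(".")))


readings = [("sensor-2", 1700000001), ("sensor-1", 1700000002), ("sensor-1", 1700000000)]
keys = sorted(promoted_row_key(s, t) for s, t in readings)
print(keys)
print(reverse_domain("maps.google.com") + "/" + "1700000000")
```

Sorting the keys mimics the stored order: both sensor-1 rows sit next to each other, ordered by time, ahead of the sensor-2 row.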

What are some Bigtable keys that should be avoided?
Keys that start with: Sequential numbering, Timestamps, Non reversed domain names
Keys that contain frequently updated fields
Hashed values

When testing performance in Bigtable, what steps should be taken?
Use a production (not dev) instance
Use at least 300 GB of data, 100 GB per node
Do a heavy 10min + pretest before the real test

In Bigtable, the HBase shell is ___?
A command line tool that can be used to perform administrative and data access tasks.

If an application needs extra support for mobile SDKs, but has a workload appropriate for Cloud Storage, what might be a better option?
Firebase

What is the Bigtable Key Visualizer?
A graphical tool that displays several usage metrics about Bigtable. It's a great way to spot key hotspotting, for example.

How can a Bigtable instance be switched from HDD to SSD drives?
It can't. What you can do is export the data, spin up a new instance, and reload the data

Google Cloud Storage is appropriate for what kind of data storage?
Binary/File

For structured/semi-structured data targeted at an analytics workload, what might be good storage options?
Cloud Bigtable
BigQuery

For semi-structured object/entity/JSON document types of workloads, what would be a possible data storage option?
Datastore/Firestore

Describe persistent disk storage and what it's good for.
Fully-managed block storage, used for Compute Engine VMs and Kubernetes Volumes.

What options should be considered for relational data storage?
Cloud SQL
Cloud Spanner

How is Cloud SQL scaled?
Vertically. Bigger machines, more chips, larger drives, more memory.

How is Cloud Spanner scaled?
Horizontally

Describe Google Cloud Storage and what it's good for.
Scalable, fully-managed, blob store for images, files, objects, unstructured data, etc.

Describe Google Cloud Datastore and what it's good for.
Scalable, fully-managed NoSQL document (think Entity/JSON/Object) database for semi-structured and hierarchical data

Describe Google Cloud Bigtable and what it's good for.
Scalable, fully-managed NoSQL wide-column database for low-latency read/write access, high-throughput analytics, and native time series support

Describe Google Cloud SQL and what it's good for.
Fully-managed MySQL or PostgreSQL for web frameworks, structured/relational data, and OLTP workloads

Describe Google BigQuery and what it's good for.
Scalable, fully-managed, Enterprise Data Warehouse (EDW) with SQL support and fast response times over massive data for OLAP workloads up to petabyte-scale, Big Data exploration and processing, and reporting via Business Intelligence (BI) tools.

Describe Google Cloud Spanner and what it's good for.
Scalable, Fully-managed, global scale relational database for Mission-critical applications, high transactions, scale and Consistency requirements

Name some common workloads for Persistent Disks.
Virtual machine drives
Read-only data across multiple virtual machines
Durable backups of running virtual machines

Name some common workloads for Google BigQuery.
Analytical reporting on large data
Data Science and advanced analyses
Big Data processing using SQL

Name some common workloads for Google Cloud Storage.
Storing and streaming multimedia
Storage for static web application files
Storage for custom data analytics pipelines
Archive, backup, and disaster recovery

Name some common workloads for Google Cloud Bigtable
IoT, finance, adtech
Personalization, recommendations
Monitoring
Geospatial datasets
Graphs

Name some common workloads for Google Cloud Datastore.
User profiles
Product catalogs
Game state

Name some common workloads for Google Cloud SQL.
Websites, blogs, and content management systems (CMS)
BI applications
ERP, CRM, and eCommerce applications
Geospatial applications

Name some common workloads for Google Cloud Spanner.
Adtech
Financial services
Global supply chain
Retail

Talk about data consistency in Cloud Storage.
Strongly consistent:
Read-after-write
Read-after-metadata-update
Read-after-delete
Bucket listing
Object listing
Granting access to resources

Eventually consistent:
Revoking access from resources

Compare and contrast Eventually Consistent and Strongly Consistent.
Eventual consistency means that an updated piece of data will eventually yield reads that return the new updated value, and conversely, that for an unspecified but hopefully short amount of time, different reads might result in a mix of the old and new value. Think DNS servers. You update your DNS, there might be a lag before all browsers point at the same location.

Strong or immediate consistency, on the other hand, tends to link back to the more traditional ACID concept in relational databases. Data read after an update will always return the same answer. Think your bank balance when you check it after a deposit.

Talk about transactions and Cloud Storage.
Writes and updates are transactional, but there's no concept of a multi step, "update these three files" transaction.

Talk about transactions and Cloud Bigtable
Reads and writes are atomic at the row level. Multiple row transactions are not supported. If there's a single cluster, or if a replicated cluster's application profile is in single-cluster routing mode, then single record Read-modify-write and Check-and-mutate operations are transactional.

Talk about data consistency in Google Cloud Bigtable.
By default Bigtable is eventually consistent with change replication taking seconds and occasionally minutes.

In a Bigtable instance with two clusters, replication happens automatically. True/False
True

In a Bigtable instance with two replicated clusters, if one cluster goes down will the other automatically pick up all the new queries?
Only if the application profile routing policy is set to Multi-cluster routing. If it's set to Single-cluster routing then the switch will have to be made manually.

Talk about data consistency in Cloud Datastore.
Ancestor queries are strongly consistent by default. To improve performance, you can set a query's read policy so that the results are eventually consistent instead.

Global queries (those that do not execute against an entity group) are always eventually consistent.

Talk about transactions in Cloud Datastore.
Transactions are optional and depend on how the statement, or group of statements, is executed. If a second transaction attempts to modify records which are already part of a transaction, the changes will fail for the second transaction.

What is the structure of data objects stored in Datastore?
Data objects stored in Datastore are entities. Entities contain properties. An entity group consists of a root entity and all of its descendants.

How do ancestors and descendants work in Datastore?
When an entity is created, another entity can optionally be assigned as its parent. An entity with no parent is a root entity. The path from an entity, through its ancestors to the root is called the ancestor path. The path from a parent down through children is descendant.

Can a Datastore entity be moved from one parent to another?
No

In Cloud Datastore, what is a Kind?
An entity's kind (type? class? schema?) is used to categorize the entity for the purpose of queries. Person, Task, Product might be examples of kinds.

How do Datastore keys work?
A Datastore key consists of:

The entity namespace
The entity's kind
A key-name string or numeric ID

In Cloud Datastore what is the function of namespaces?
Datastore namespaces are used in multitenancy configurations. Multitenancy allows a single project's Datastore data to be segmented into partitions. The kinds and kind logical structures can be the same for each tenant, but the data is split into unique partitions. Think of a set of Tasks split by operating unit.

Can multitenancy partitions be used for security in Datastore?
No, multitenancy partitions split data by tenant, but they offer no kind of security for the splits. Nothing stops tenant1 from accessing data from tenant2. You'd have to enforce that in your application.

Talk about transactions in Google Cloud SQL and Google Cloud Spanner.
Both Spanner and MySQL support standard ACID transactions.

Talk about data consistency in Google Cloud SQL and Google Cloud Spanner.
Both Google Cloud Spanner and MySQL are strongly consistent.

Talk about data consistency in Google BigQuery.
BigQuery is immediately consistent for most operations. insertIds should be used on streaming inserts if there's a chance of message duplication. If provided, BigQuery will automatically deduplicate any messages with the same insertId, provided they arrive within a minute of each other.

Talk about transactions in Google BigQuery.
BigQuery doesn't support transactions and should not be used for OLTP applications.

What is Dremel?
The internal Google system behind BigQuery

Which of the following Google services are Multi-Regional, Regional, Zonal, or some mix:
Persistent disks
BigQuery
Datastore
Bigtable
Dataproc
Machine Learning
Dataflow
Cloud Storage
Cloud SQL
Cloud Spanner
Persistent disks: zonal by default; regional disks are replicated across 2 zones.

BigQuery: dataset storage is regional or multi-regional. Query, load, and export jobs run in the same region as the dataset.

Datastore: regional or multi-regional.

Bigtable: zonal. Replication can spread over two zones.

Dataproc: the Compute Engine instances are zonal. If you pick a region, you can pick a zone or let GCP auto-select one for you. If you choose global, you must pick the zone.

Machine Learning: regional.

Dataflow: zonal. You can specify the region and zone, or just the region and let it auto-select the zone.

Cloud Storage: regional, dual-regional, or multi-regional.

Cloud SQL: second-generation MySQL is zonal by default, but with the HA option it can replicate across two zones.

Cloud Spanner: the instance is regional or multi-regional.

The Cloud Spanner hierarchy is ____?
Project, instance, node.

Each Cloud Spanner node can store how much data?
2TiB

Do Cloud Spanner nodes help with replication?
No, they help with data load and processing power but not replication.

What replication options does Cloud Spanner offer?
For regional there will be three read-write replicas spread over multiple zones. For multiregional there are multiple replicas in multiple zones in multiple regions, based on configuration. This provides faster reads but slows down the writes. There's a record update voting algorithm that requires a quorum between the replicas, and the added network latency slows that down.

One machine learning method that helps when a wide and deep neural network is overfitting training data is?
Dropout method, that is, ignoring neurons. It helps remove some of the mutual dependencies that neurons develop. It helps with overfitting because it forces the neurons to work in different ways.
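
As a rough illustration of "ignoring neurons", here is a minimal sketch of the common inverted-dropout variant in plain Python. The function name, the 0.5 rate, and the sample activations are all made up for the example; real frameworks provide this as a built-in layer.

```python
import random


def dropout(activations, rate, rng):
    """Inverted dropout: zero each activation with probability `rate`,
    and scale the survivors so the expected sum stays the same."""
    keep_probability = 1.0 - rate
    return [
        value / keep_probability if rng.random() < keep_probability else 0.0
        for value in activations
    ]


rng = random.Random(42)  # fixed seed so the sketch is repeatable
layer_output = [0.2, 0.9, 0.5, 0.7]
# Applied during training only; at prediction time the layer output is used as is.
training_output = dropout(layer_output, rate=0.5, rng=rng)
print(training_output)
```

Because a different random subset is silenced on every training step, no neuron can rely on a specific neighbor always being present.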

The recommended minimum number of Cloud Spanner nodes is?
3

A Cloud Spanner node can perform queries at about what rate?
For 1kb of data 10,000 QPS of reads or 2,000 QPS of writes

In Machine Learning prediction jobs that need to deal with slowly changing labels, like a user's changing movie preferences, how best can we handle model retraining?
By continually retraining on a mix of new and historical data.

In BigQuery, what's the difference between Sharding and Partitioning?
Partitioning is done by date, either daily or at some interval configured by the user. The data is in a single logical table, but it is stored in "partitions" and has pseudo keys that allow querying by timespan. Sharding is a manual splitting of data into multiple tables based on some criteria: stores, regions, date ranges, etc. Queries over sharded tables require UNIONs or table wildcards. Given enough tables, shard queries can hit the 1000 table limit and fail. But they can also be very fast if only a handful of small tables are queried.

Does Google Data Studio cache data?
It does and it typically refreshes every 12 hours. There's a lightning bolt in the UI that lets you know the data is cached. The cache can be disabled if report data needs to be refreshed more frequently.

When pulling CSV files from Google Cloud Storage into BigQuery, if the file might contain bad rows, how might you automate the preprocessing?
Pull the data into a Dataflow processing pipe, filter out the bad data to a secondary storage location, then load the scrubbed data into BigQuery

A GCE application runs a regular query against the database. The database quits answering. What's a common approach to a connection failure of this sort?
Requery with an exponential backoff.
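
A minimal sketch of retrying with exponential backoff, in plain Python. The helper name, the delays, and the use of `ConnectionError` are assumptions for illustration; a real client would catch whatever exception its database driver raises.

```python
import random
import time


def query_with_backoff(run_query, max_attempts=5, base_delay=0.5, max_delay=32.0,
                       sleep=time.sleep):
    """Call run_query(), retrying on ConnectionError with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return run_query()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the error to the caller
            # Delay doubles each attempt (0.5s, 1s, 2s, ...), capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Random jitter keeps many clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 10))


# Usage: a fake query that fails twice before succeeding.
calls = {"count": 0}


def flaky_query():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("database not answering")
    return "rows"


waits = []  # record the waits instead of actually sleeping
result = query_with_backoff(flaky_query, sleep=waits.append)
print(result, len(waits))
```

Passing `sleep` as a parameter is just a convenience for testing; in normal use the default `time.sleep` applies.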

What common machine learning algorithm might be a good fit to help predict movie prices?
Linear regression

What is a classic machine learning algorithm for classification?
Logistic regression, based on the logistic or sigmoid function.
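
To make the sigmoid connection concrete, here is a small sketch of how a trained logistic regression model turns features into a class. The weights, bias, and feature values are hand-picked for the example; a real model would learn them from data.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(features, weights, bias, threshold=0.5):
    """Return (probability, label) for one example of a binary classifier."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    probability = sigmoid(z)
    return probability, int(probability >= threshold)


# Hypothetical weights and inputs, chosen only to show the mechanics.
prob, label = predict(features=[2.0, 1.0], weights=[1.5, -0.5], bias=-1.0)
print(round(prob, 3), label)
```

The weighted sum is the same as in linear regression; the sigmoid squashes it into a probability, and the threshold turns that probability into a class.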

Contrast a Recurrent vs a feedforward neural network
In a feedforward network data flows only one way from input, through hidden, to output neurons. In a recurrent or feedback network data can loop and flow both ways through some neurons.

What does the following BigQuery query do?
SELECT * EXCEPT(row_number)
FROM (SELECT *, ROW_NUMBER()
OVER (PARTITION BY ID_COLUMN) row_number
FROM `TABLE_NAME`)
WHERE
row_number = 1
Ensures that only one row per ID_COLUMN value is returned. This can be useful for deduplicating streamed inserts, though supplying an insertId at insert time would be better.

The proper syntax for a BigQuery wildcard table name is?
FROM `bigquery-public-data.noaa_gsod.gsod*`
(Backticks! Don't fall for single quotes. Also, in legacy SQL it was square brackets, so a question containing an error related to [ in the table name is a legacy vs standard thing)

In a sentence, what is Machine Learning?
Tom M. Mitchell: "Machine learning is the study of computer algorithms that improve automatically through experience." I might add that it's a subset of the much larger topic of AI

You want to set up a Dataflow (Beam) app to process real-time sensor data. You need to track activity and react to a sensor that's had no activity in the last 30 min. How might Beam handle this problem?
Set up a session window with a gap time of 30 min.
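
The grouping a session window performs can be imitated in a few lines of plain Python. This is a single-machine sketch of the idea only, not the Beam API; the `sessionize` helper and the sample readings are invented for the example.

```python
def sessionize(timestamps, gap_seconds=30 * 60):
    """Group event times (in seconds) into sessions; a quiet period of
    gap_seconds or more closes the current session."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] < gap_seconds:
            sessions[-1].append(ts)  # still inside the gap: same session
        else:
            sessions.append([ts])    # gap reached: a new session starts
    return sessions


# Readings from one hypothetical sensor, as seconds since some start time.
readings = [0, 600, 1500, 5400, 5700]
sessions = sessionize(readings)
print(sessions)
```

Each closed session marks a point where the sensor went quiet for at least 30 minutes, which is the event the pipeline wants to react to.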

Dataflow pipes are written using which languages and use which api?
Written in Java, Python, or Go using the Apache Beam API

Which GCP data storage options would work best with OLTP e-commerce type data?
Cloud SQL and Spanner

When using machine learning for unsupervised anomaly detection, what should be true about the data being tested?
The rate of anomalies to normal data should be low.

You're dealing with streaming sensor data with intake that peaks north of 250,000 messages per second. What might be a typical GCP intake flow, and what storage option might work well for the data?
Pub/Sub to Dataflow to Bigtable

You need to discover everyone using BigQuery and what they are doing with it. How might you accomplish this?
Export or Stream the Data access Stackdriver Cloud Audit log into BigQuery and analyze it there.

What are the three Stackdriver Cloud Audit logs generated for each project, folder, and organization?
Admin Activity for configuration changes
System Events for compute engine system events
Data Access for read, write, modify events

Which Stackdriver Cloud Audit logs are free? Pay? Enabled by default?
Admin and System event logs are enabled by default and always free. BigQuery Data Access is also enabled by default (can't be disabled) and free. The rest of the Data access logs need to be enabled and may incur spend.

If Stackdriver Data access logs are enabled, what type of data access is still not logged?
Anonymous access data where the user doesn't have an account.

You are helping with a lift and shift job for a company's Hadoop cluster. What will you do with the Hadoop related data?
Move it into GCS so the Dataproc cluster can be disposable.

Why can Hadoop in Dataproc work so well with data stored in Google Cloud Storage?
Because GCS is a Hadoop Compatible File System (HCFS).

What is a key mindset change related to the way Hadoop works and how Hadoop data is stored, for people moving from onsite Hadoop to Hadoop in Dataproc?
They need to think of the data as a more general Google Cloud Storage thing, and not just a Hadoop HDFS thing. Also, they need to think of Hadoop as an ephemeral data processing tool they can spin up when needed, and shut down when not actively in use.

Would Datastore work well in an online order processing sort of application?
Not especially, it really isn't designed for that kind of data processing and OLTP. Cloud SQL or Spanner would be much better choices.

In Datastore, what is an entity group?
An entity group is a root entity and all its descendants.

In Datastore, if you had an entity group of kind Orders, what would be the performance of single write? Batch write?
About one per second, doesn't matter if it's single or batch.

What's the difference between Cloud Audit Logging and Stackdriver Logging?
Cloud Audit Logs are the Admin, System Event, and Data Access logs captured by Google for each Org, Folder, and Project. Stackdriver Logging is the part of the Stackdriver which allows you to store, search, analyze, monitor, and alert on log data from GCP or AWS, including the Cloud Audit Logs.

In most situations, should Dataproc be set up to store its data on persistent disks or in Cloud Storage?
Dataproc with data stored in Cloud Storage. The Dataproc hadoop cluster should be up and going only while a job process needs to run. After, kill it and keep the data only.

In machine learning, would fraud detection most likely be a regression or classification problem.
Classification: fraudulent, legit

Which of the following might be used when working on machine learning fraud detection?
Unsupervised learning?
K-Means Clustering?
Linear Regression?
Supervised learning?
Unsupervised learning: yes, classic anomaly detection problem

K-Means Clustering? Yes, could be used with unsupervised categorization

Linear Regression? Probably not

Supervised learning? Possibly, starting with a supervised model for initial training and then switching to unsupervised for refinement has been done before

What application protocol does Pub/Sub operate over?
HTTPS (the REST API; the client libraries can also use gRPC, which runs over HTTP/2)

What is Cassandra?
An Apache NoSQL database. Bigtable competitor

You have a large amount of data loaded into BigQuery and you need to manually update the data type of a column. How should you proceed?
For ease of use and simplicity, use a query to overwrite the existing table or create a new table if you need to preserve existing data. If you are more concerned with cost of the move, export all the data to GCS and load it into a new table.

What's the cost of exporting BigQuery data to a Google Cloud Storage bucket? Importing?
No charge for the import or export, but you will pay GCS storage.

What would I need to create and use in Stackdriver to export a specific type of event to Pub/Sub
Using the GCP console or the Stackdriver API, create an advanced filter to find what you want, and then create and use a sink to Pub/Sub.

Stackdriver can create a log export sink to which other GCP products?
Pub/Sub, BigQuery, or Google Cloud Storage

When training a machine learning model, which would be preferable: Features with a high or a low correlation to the output labels?
High, generally speaking.

What are some common ways of dealing with training datasets with missing/null fields.
Estimate the missing value from the known ones (for example, the mean or median).
Replace it with a constant, which is simple but not great.
Drop the affected records, though that's often the worst choice: you lose a lot of data and can skew the overall result.
Dropping the whole feature might be just as bad.
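
As a small example of the estimation option, the sketch below fills missing values with the mean of the known ones. The `impute_with_mean` helper and the sample rows are hypothetical; mean imputation is just one of several possible estimators.

```python
from statistics import mean


def impute_with_mean(rows, field):
    """Return copies of rows with None in `field` replaced by the mean of known values."""
    known = [row[field] for row in rows if row[field] is not None]
    fill_value = mean(known)
    return [
        {**row, field: fill_value if row[field] is None else row[field]}
        for row in rows
    ]


training_rows = [
    {"age": 30, "label": 1},
    {"age": None, "label": 0},  # missing value
    {"age": 50, "label": 1},
]
imputed = impute_with_mean(training_rows, "age")
print(imputed[1])
```

No record is thrown away, so the label in the second row still contributes to training.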

When doing a BigQueryIO.Read, what's the difference between from(...) and fromQuery(...)?
from() reads the specified table.
fromQuery() executes the specified query first, then reads from the results.

Can also use BigQueryIO.read() more directly to generate TableRows which are easier to use but slower.

What's typically wrong with Bigtable keys like the following?
datetime
eventId (incrementing number)
They tend to hotspot one key range at a time, with all the updates coming into a single Bigtable node. Also, it might not be the easiest to access, assuming that you're not only accessing by date or eventId. Might be better to add some more meaning to the keys. sensorId/datetime, that sort of thing.
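
A quick sketch of the sensorId/datetime idea in plain Python. The `#` separator, the timestamp format, and the sensor names are assumptions for the example, not a Bigtable requirement.

```python
from datetime import datetime, timezone


def make_row_key(sensor_id, event_time):
    """Build a row key of the form '<sensorId>#<UTC timestamp>' (hypothetical layout)."""
    stamp = event_time.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{sensor_id}#{stamp}"


when = datetime(2020, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
# Three sensors reporting at the same instant land in three different key ranges.
keys = sorted(make_row_key(sensor, when) for sensor in ["sensor-b", "sensor-a", "sensor-c"])
print(keys)
```

With the sensor ID first, simultaneous writes spread across the key space, and all readings for one sensor sit next to each other for range scans.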

In Cloud SQL, MySQL Gen 2, what's High Availability mode?
A second copy of the MySQL primary server is created in another zone of the same region (must be same Region). The failover server is then replicated from the primary using MySQL semisynchronous replication. Users, data, settings, etc. will all be replicated.

In Cloud SQL, MySQL Gen 2, when would the failover server go live? Automatic switch or manual? Then what would happen?
The primary server writes a heartbeat to the failover server every second. If the heartbeats fail for approximately 60 seconds, then Cloud SQL will automatically switch to the failover replica, write a message to that effect in the operations log, wait to see if any updates arrive from the primary, make the failover the new primary, create a new failover instance, and start replicating to it. All IP addresses will be updated automatically.

In Cloud Dataflow can you update a running pipe?
Python doesn't yet support updating streaming jobs. Java does. When you deploy an update in Java, Dataflow runs a compatibility check of the new code against the existing pipe. You might have to provide transformation mappings for graph changes, and some changes are not supported, but bug fixes and the like should be easy. Once the new code passes the check, the old job is stopped, the new is started under the same job name, new transformations in the new code might be missed for inflight data, but all in all it picks right up where the old code left off.

The most common performance issue in Cloud Bigtable is?
Key hotspotting

What is Avro?
A compressed file format and serialization system which BigQuery can load and use very efficiently.

The default file format for loads into BigQuery is what?
CSV

The preferred compressed format to use for data loaded into BigQuery is?
Avro

What's the default file encoding that BigQuery expects for CSV files?
UTF-8

What will happen if you load a CSV file into BigQuery from Cloud Storage that isn't encoded with UTF-8, and you don't warn BigQuery about the encoding change?
BigQuery will attempt to dynamically detect the file encoding and load it anyway, but the load might not be byte for byte the same as the original.

What's the max amount of data that can be loaded into Google Sheets?
2 million cells

What's the only allowable encoding for JSON files imported into BigQuery?
UTF-8

In BigQuery, what's a good way to limit table data to exactly what a group of users needs, without copying the data into different tables or external systems.
Set up a view that shows the users exactly the set of data they need. They can even run queries over the view data.

You want Dataflow to scale as needed. What setting do you need to change to allow for scaling?
Autoscaling is enabled by default. Specify --maxNumWorkers to change the maximum number of worker VMs. Note, you can't change the max for a running system. Shut it down and redeploy with the new count.

In Datastore, what effect does excluding a field from the index have?
It decreases the index storage size because the index no longer needs entries involving that field. It also means that no filter operations involving said field will work.

What's an easy way to automate a daily Dataflow job?
Create a cron job in Google App Engine Cron Service and have it run the Dataflow job

Uploading lots of small files with gsutil will work fastest with what option enabled?
Use the -m switch to enable multithreaded, parallel uploads.
gsutil -m cp -r place gs://...

What is S3 storage?
It's the AWS equivalent to Cloud Storage

When should you move data with the Google Storage Transfer Service?
When the destination is Google Cloud Storage and the source is AWS S3, an HTTP(S) location, or another Cloud Storage bucket. It can be a one off transfer or you can schedule it. It's also smart enough to spot just the changed files if the transfer is set up on a periodic timer.

What is Redis?
It's an in memory, data store/Database.

What is the Google Managed version of Redis?
Memorystore

What is Memorystore/Redis good for?
Highly available, in memory caches with sub ms access speeds.

How much data can be stored in Memorystore?
300GB with up to 12Gbps network throughput. You pay for the memory you use, by the hour. Also, the throughput is directly related to the amount of data you are storing, with 12Gbps being the max.

What is HBase?
It is the Hadoop database that sits on top of the Hadoop File System (HDFS)

Does denormalizing data decrease or increase total DB storage size?
You frequently have repeated data so, increases. Normalized data tends to be more concise and smaller.

What file formats can BigQuery accept and using what compression formats.
CSV and JSON with GZIP, Avro with DEFLATE or SNAPPY. CSV won't work for repeated or nested data.

What are GCP primitive roles? What's the problem with them?
The original security roles created by Google in the early days of GCP: Owner, Editor, and Viewer. The problem with them is their lack of granularity.

BigQuery caches results by default for about ___ hours.
24

BigQuery caching is enabled / is not enabled by default.
is enabled

Does BigQuery charge for a query that returns its results from cache?
No, rerun queries that load from cache are free, but if the underlying data has been modified or new data has been streamed in, then the cache expires and the query is re-executed at normal fees.

When would BigQuery not cache a query's results?
If the results are sent into a new table.

In a basic query what do the following statement elements do, briefly? SELECT ____, WHERE ____, FROM ____, and LIMIT ____.
Select chooses the particular columns being returned (Projection), WHERE limits the rows returned based on a condition, FROM specifies the data source, and LIMIT will limit the number of rows returned, though not the number of rows processed by a given query.

In BigQuery, when storing data denormalized, how is nested data setup?
You set the datatype of the field to RECORD, set the mode to REPEATED, and then add the nested fields and their types.

Another name for a machine learning regression based estimator is?
Regressor

Provide some examples for the type of prediction a wide and deep neural network would handle well.
Language translation, self driving cars, image recognition, colorizing black and white photos, hand writing analysis, etc.

In machine learning, what's a categorical feature?
A feature that represents category data, as opposed to continuous numerical data. Examples: shirt size, movie rating, zip code

To help a neural network learn about the relationships between categories in a categorical feature, you might consider adding a what?
For small sets of categories, like shirt size, one-hot encoding the feature works well. If there are a lot of category choices, like all the words in English, you might use an embedding layer/column. Embedding creates an index for each choice, and then links the index to a fixed-length vector. So "dog" might have an index of 107 and might link to a vector of size 32 where each value is a weighted number. The training model will update these values as it learns. So over time, the vector for "dog" might grow very close in values to the vector for "dogs" or "canine".
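
The two encodings can be compared in a short plain-Python sketch. The `EmbeddingTable` class, the tiny vocabulary, and the vector size of 4 are made up for the example; in a real model the vectors would be updated by training rather than left at their random starting values.

```python
import random


def one_hot(category, vocabulary):
    """Encode a category as a vector with a single 1 at the category's index."""
    return [1 if item == category else 0 for item in vocabulary]


class EmbeddingTable:
    """Maps each category index to a fixed-length vector of trainable weights."""

    def __init__(self, vocabulary, dimension, seed=0):
        rng = random.Random(seed)
        self.index = {item: i for i, item in enumerate(vocabulary)}
        # Vectors start random; training would nudge related categories together.
        self.vectors = [[rng.uniform(-1, 1) for _ in range(dimension)]
                        for _ in vocabulary]

    def lookup(self, category):
        return self.vectors[self.index[category]]


sizes = ["S", "M", "L"]
encoded = one_hot("M", sizes)

words = EmbeddingTable(["dog", "canine", "hello"], dimension=4)
dog_vector = words.lookup("dog")
print(encoded, len(dog_vector))
```

A one-hot vector grows with the vocabulary, while the embedding stays at a fixed length no matter how many categories there are.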

In a neural network, what are hidden layers?
The neurons between the input layer and the output layer. They aren't really hidden as much as in between.

In Dataflow (Beam), what's a sink?
A sink represents an output location which Beam can write to

If you want to stop a Dataflow pipeline that's currently handling data, what are your two options?
Cancel, which kills the process and in flight data is not processed, and Drain, which stops data intake but processes data in flight.

Can a Dataflow pipeline be tested outside of Dataflow?
Yes, Dataflow uses the open source Beam framework. Use the DirectRunner (called DirectPipelineRunner in older SDK versions) if executing on a local machine.

In Dataflow, the Sink and Source APIs are for what?
Source is for creating custom data loader "read()" code. Sink is for writing custom output "write()" code.

In Dataflow, what's a ParDo used for?
The ParDo or Parallel Do looks at every element of the incoming PCollection, does something to it, and generates an output PCollection for the next step in the pipeline. This is a key part of most Beam transformations.
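
The element-by-element behavior can be imitated in plain Python. This is a single-process sketch, not the Beam API: `par_do` and `split_words` are hypothetical names, and a real runner would spread the elements across many workers.

```python
def par_do(elements, do_fn):
    """Apply do_fn to every element; each call may emit zero, one, or many outputs."""
    output = []
    for element in elements:
        output.extend(do_fn(element))
    return output


def split_words(line):
    """Example per-element function: emit each word of a line in lower case."""
    for word in line.split():
        yield word.lower()


lines = ["Hello Beam", "", "hello Dataflow"]
words = par_do(lines, split_words)
print(words)
```

Note that the empty line produces no output at all, which is how a ParDo can act as a filter as well as a mapper.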

The machines in a Dataproc cluster are actually created where?
A regional Compute Engine Instance Group (don't mess with it!)

When loading data through the BigQuery web UI what are the limits on the uploaded file size?
Less than 10 MB and 16,000 rows. Files also have to be loaded one at a time.

What is one key way to limit the number of rows processed by a BigQuery query?
Using sharded and/or partitioned tables

In machine learning, why is it important that you keep some of your labeled data back for testing?
Testing. Once you have your model trained, you need to run some data through it, see its predictions, then test the prediction accuracy against your acceptable range limits.

When configuring the GCP Machine Learning Engine, what are the three machine types you are altering? And what's the purpose of each type?
The Master Node, there can be only one, and it's responsible for controlling the cluster and coordinating all the parts of the job graph. The Worker Nodes, responsible for handling the various tasks in the job. The Parameter Server Nodes, responsible for storing the model parameters and coordinating shared state between the workers.

In a Dataproc cluster, where is YARN running?
On the master node.

What are some of the ways you can customize the software in a Dataproc cluster?
SSH into the master and make changes
Use --properties to modify config files
Set up initialization actions

An SSH connection automatically sends data encrypted. True/False
True, Secure SHell passes data through an encrypted channel.

To switch Bigtable from HDD to SSD drives, what steps need to be taken?
You can't switch a Bigtable instance drive type. You'd have to export out all the data and reload it into a new instance.
...................................................
GCP Data Engineer Exam

INGEST Services
App Engine - Compute Engine - Kubernetes Engine - Cloud Pub/Sub - Stackdriver Logging - Cloud Transfer Service - Transfer Appliance

STORE Services
Cloud Storage - Cloud SQL - Cloud Datastore - Cloud Bigtable - BigQuery - Cloud Storage for Firebase - Cloud Firestore - Cloud Spanner

PROCESS/ANALYTICS Services
Dataflow - Dataproc - BigQuery - Cloud ML - Cloud APIs - Dataprep

EXPLORE/VISUALIZE Services
Datalab - Data Studio - Google Sheets

What is Cloud Datastore?
No ops, highly scalable, TRANSACTIONAL, NoSQL document database

What should you use Cloud Datastore for?
Highly available, structured data, < 1 TB | E.g. Product Catalogs, Game Save States, User profiles

What should you NOT use Cloud Datastore for?
Analytics (Use BigQuery/Spanner) - Extreme scale (Use Bigtable) - Existing MySQL (Use Cloud SQL)

What's the difference between analytical and transactional databases?
Analytical databases are designed for higher scale with aggregating calculations. Transactional databases are optimized for finding individual rows in tables (e.g. based on ids).

Relational Database --> DataStore
Table --> Kind | Row --> Entity | Field --> Property | Primary Key --> Key

What do you query in DataStore?
Entities

How can you query Datastore?
Programmatic - Web Console - Google Query Language

How do you avoid bad indexes in Datastore?
Create your own custom indexes. Don't index properties that don't need to be indexed

What is data consistency in queries?
How up to date are these results?

What is Strong Consistency?
Changes are applied in order --> a query is guaranteed to return up-to-date results, but it will take longer. E.g. Financial transactions

What is Eventual Consistency?
Changes happen out of order --> faster query but can have "stale" results - E.g. Census population

What is Cloud BigTable?
Highly scalable, ANALYTICAL, NoSQL database - Ideal for large analytics workloads

What are some use-cases for BigTable?
Financial Data, IoT, Marketing data

What is a BigTable "instance"?
An "instance" is the top-level BigTable container inside a project; it holds the clusters

How is a BigTable instance structured?
Nodes are grouped into clusters. 1 or more clusters in an instance

What are the instance types in BigTable?
Development - low cost, single node  | Production - 3+ nodes per cluster

Can you change disk type (HDD->SSD) within an instance?
No, you need a new instance

How do you interact with BigTable?
Command line tool (cbt - preferred) or Hbase shell - You can also use BigQuery!

How is a table stored in BigTable?
It is sharded across tablets

How is a BigTable table organized?
Each row is indexed by a single row key | Columns are grouped into column families

How do you query BigTable tables?
Index on the row key --> requires good schema design!

Where should related entities be in BigTable?
They should be in adjacent rows

What are the three challenges of data streaming?
Volume (amount of data) - Velocity (speed of data transfer and analysis) - Variety (Types of data to process)

How should data be stored on BigTable nodes?
It should be spread over many nodes to prevent "hotspotting"

What are good row key practices?
1) Reverse domain names (com.website...) 2) String identifiers 3) Timestamps in REVERSE 
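
Two of these practices, reversed domain names and reversed timestamps, can be sketched in plain Python. The `MAX_MILLIS` bound, the `#` separator, and the helper names are assumptions for the example.

```python
MAX_MILLIS = 10 ** 13  # assumed upper bound, larger than any timestamp we expect


def reverse_domain(domain):
    """'maps.example.com' -> 'com.example.maps', so related hosts sort together."""
    return ".".join(reversed(domain.split(".")))


def reversed_timestamp(epoch_millis):
    """Zero-padded (MAX - t): newer events get smaller values and sort first."""
    return str(MAX_MILLIS - epoch_millis).zfill(len(str(MAX_MILLIS)))


def row_key(domain, epoch_millis):
    return f"{reverse_domain(domain)}#{reversed_timestamp(epoch_millis)}"


older = row_key("maps.example.com", 1_600_000_000_000)
newer = row_key("maps.example.com", 1_600_000_060_000)
print(older)
print(newer)
```

Because Bigtable stores rows in sorted key order, the newer event sorts ahead of the older one, so a scan for "the latest readings" can stop early.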

Why is it beneficial to separate compute and storage?
It enables autoscaling

How do you create a BigTable Cluster?
Use gcloud

What types of BigTable row keys should you avoid?
Domain names (some may be more active than others) - Sequential IDs (newer users could be more active) - Static, updated identifiers

Would it be better to have one 10 node cluster or two 5 node clusters?
One Cluster - Multiple clusters introduce latency b/c the cluster getting written to also has to process read functions

Should you make changes to BigTable immediately?
No, BigTable can learn how to best optimize your data structures

What is Cloud Spanner?
Highly scalable, RELATIONAL, database. - Similar in structure to BigTable

When should you use Cloud Spanner?
When you have bigger workloads than Cloud SQL can handle (>8000 queries/sec).  ACID Compliance

What does ACID compliance stand for?
A - Atomicity  C - Consistency I - Isolation D - Durability

How is data sharded in Cloud Spanner?
Within a zone

When would you use Cloud SQL instead of Cloud Spanner?
To replicate an existing on-premise relational database.   Spanner - Designed for use in the cloud

Is Cloud Spanner an easy replacement for MySQL?
No, work is required for migration, but it will enable higher scalability

What's the difference between horizontal and vertical scalability?
Horizontal - More nodes sharing the load (more consistent)  | Vertical - more compute on a single instance

What are Cloud Spanner tables called?
RDBMS

How are tables handled in Cloud Spanner?
They use table interleaving - combines what would be multiple tables into one table using "Parent/Child" tables

How do you ingest data to Cloud SQL?
Batch data imports

What is a tightly coupled system and what are the issues with them? (Pub/Sub)
Senders and receivers talk directly to each other. If one side goes down, data is lost

What are the benefits of a Loosely Coupled System?
Fault Tolerant - Scalable - Message Queuing

What is Cloud Pub/Sub?
Asynchronous messaging bus  - Decouples senders and receivers

Does Pub/Sub guarantee message delivery?
Yes, it guarantees delivery at least once

How does messaging flow in Pub/Sub?
Topics --> Messages --> Subscription <-- Subscribers

What's the difference between PUSH and PULL in Pub/Sub?
Push - lower latency | Pull - better for larger volumes; batch delivery

Does Pub/Sub manage the order messages are sent?
No, it doesn't manage message delivery order, so messages can arrive out of order

How can you deal with messages being out of order?
Have Dataflow handle it  - Have Pub/Sub include metadata that helps with ordering the messages
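
The metadata approach can be sketched in plain Python: the publisher attaches an event time, and the consumer sorts a buffered batch by it. The `event_time` attribute name and the message layout are assumptions for the example, not a fixed Pub/Sub schema.

```python
def in_event_order(messages):
    """Sort a buffered batch by the event_time attribute attached by the publisher."""
    # ISO 8601 UTC strings in one fixed format sort correctly as plain text.
    return sorted(messages, key=lambda m: m["attributes"]["event_time"])


# Hypothetical batch, received out of order.
received = [
    {"data": "third", "attributes": {"event_time": "2020-01-01T00:00:03Z"}},
    {"data": "first", "attributes": {"event_time": "2020-01-01T00:00:01Z"}},
    {"data": "second", "attributes": {"event_time": "2020-01-01T00:00:02Z"}},
]
ordered = [m["data"] for m in in_event_order(received)]
print(ordered)
```

This only orders what is already in the buffer; deciding how long to wait for stragglers is the harder part, and that is what Dataflow's windows and watermarks take care of.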

What happens when the subscriber receives a message?
They send a receipt acknowledging delivery, but this doesn't always get sent before a duplicate message is sent --> Use Dataflow

What are the three steps in data processing?
1) Read Data (Ingest) 2) Process (ETL) 3) Output

What has historically been the problem with having streaming and batch data? (3)
They had to come in through different pipelines. - Streaming was faster, but batch was more accurate  - It was hard to compare recent and historical data

What is Cloud Dataflow?
Built on Apache Beam - No ops, scalable, stream and batch data processing

How are Dataflow pipelines organized?
They are region based

Why use Dataflow over Dataproc? (4)
Less overhead (No-ops) - Unified Batch and Streaming - Pre-processing for ML - Serverless solution

Why use Dataproc over Dataflow? (3)
Familiar tools - Better for existing pipelines - Has iterative processing and SparkML

What are the differences in base packages between Dataflow and Dataproc?
Dataflow --> Apache Beam | Dataproc --> Hadoop/Spark

What is a Dataflow "Element"?
Single entry of data - row

What is a "PCollection"?
Distributed Dataset, data input, and output

What is a Dataflow "Transform"?
Data processing operation in a pipeline - Uses programming conditionals (for/while loops)

What is a ParDo?
Transform applied to individual elements - Filter out/extract data elements

What three things does Dataflow use to deal with late data?
Windows - Watermarks - Triggers

What is a Dataflow "window"?
Dataflow logically divides data into event-time-based groups

What are the three types of Dataflow windows?
Fixed - fixed period of time (8-9) | Sliding - windows overlap with each other; better for smoother changes in avgs | Sessions - users logging in for a session; good for "bursty" data, e.g. rain
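
How an element's timestamp is assigned to fixed and sliding windows can be worked out in plain Python. This is a sketch of the arithmetic only, not the Beam API; the helper names and the 60 s / 30 s settings are chosen for the example.

```python
def fixed_window(timestamp, size):
    """Return the [start, end) fixed window that contains timestamp."""
    start = timestamp - (timestamp % size)
    return (start, start + size)


def sliding_windows(timestamp, size, period):
    """Return every [start, end) sliding window that contains timestamp."""
    start = timestamp - (timestamp % period)  # latest window start at or before timestamp
    windows = []
    while start > timestamp - size:
        windows.append((start, start + size))
        start -= period
    return sorted(windows)


# Times in seconds: 60 s fixed windows; 60 s sliding windows starting every 30 s.
fixed = fixed_window(95, size=60)
sliding = sliding_windows(95, size=60, period=30)
print(fixed, sliding)
```

An element belongs to exactly one fixed window but to several sliding windows, which is why sliding windows give smoother running averages.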

What is a Dataflow "Watermark"?
Timestamp | Event time - when the data was generated | Processing time - when the data is processed in the pipeline | Pub/Sub or data-source can provide a watermark

What is a Dataflow "Trigger"?
Says when a result is emitted from Dataflow - Default is to trigger at the watermark

Why are Dataflow triggers useful?
They allow for re-aggregation of metrics with late arriving data

How long do messages stay in Pub/Sub?
Seven days

What would the steps be to set up a streaming data pipeline using BigQuery and Dataflow?
1) Create BigQuery Table for output 2) Create Cloud Storage bucket for Dataflow staging 3) Create Pub/Sub topic for streaming data 4) Create Dataflow pipeline to connect to Pub/Sub and deposit data into BigQuery Table

What is cloud Dataproc?
On-demand managed Hadoop clusters | NOT No ops - you need to configure clusters yourself | DOESN'T autoscale

Is Cloud Dataproc a no ops solution?
No, you need to manually configure clusters

What is the main use-case for Dataproc?
Migrating existing Hadoop structures to the cloud

What is MapReduce?
Map - Take big data and distribute to many workers (nodes) | Reduce - Combine the results of those pieces | Distributed/parallel computing
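
The classic word count shows the two phases in miniature. This plain-Python sketch runs everything in one process; the function names and sample documents are made up, and on a real Hadoop cluster the map and reduce calls would run on different workers.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


documents = ["big data big clusters", "data pipelines"]
# In a real cluster each document would be mapped on a different worker.
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

Each map call only needs its own document and each reduce call only needs one key's values, which is what lets the work be distributed.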

How do you create a Dataproc cluster?
Manually or using gcloud commands

What are preemptible VMs and what should you use them for?
Low cost worker nodes that can be lost with no warning if demand goes up - Use for processing only!

Is HDFS a good storage solution for Dataproc?
No, use Cloud Storage instead if needed - Cluster can be stateless - shut down at any time

What is the recommended ratio of Preemptible VMs to Permanent VMs?
50/50

Are you guaranteed Preemptible VMs?
No, they won't be available if the region is very busy

How do you access your Dataproc cluster?
SSH in via cloud console or gcloud

What are the two ways to access a Dataproc cluster from the web?
Firewall ports (8088, 9870) - SOCKS proxy

What is the benefit of using a SOCKS proxy?
It doesn't expose your firewall ports

What format is data in when it's migrated to Dataproc?
HDFS

How do we change the way we think about clusters when moving to Dataproc?
Clusters become ephemeral (temporary) entities instead of permanent ones

What are Dataproc migration best practices?
1) Move data first (generally to Cloud Storage) 2) Small scale experimentation

What is a storage benefit of migrating to Dataproc?
You can separate storage (Cloud Storage) and compute (Dataproc). This means you don't need your Dataproc clusters running all the time

On-Premise --> GCP
HDFS --> Cloud Storage | Hive --> BigQuery | HBase --> Bigtable

What is the term for optimizing on-premise structures to the cloud?
"Lift and Leverage"

What is BigQuery?
Fully managed data warehouse - Serverless, no ops

What are the benefits of BigQuery?
Serverless - Autoscales - Good for storage and analysis - Accepts batch and streaming - Durable, multi-regional

What data types are stored in BigQuery vs Cloud Storage?
BigQuery --> Tables | Cloud Storage --> Files

How is data stored in BigQuery?
Columnar instead of Record Oriented - Each value is stored on a different storage volume

Does BigQuery update existing records?
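Not as a transactional database would - It's built for analytics, not transactional (OLTP) workloads; DML updates exist but are meant for bulk changes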

What are the good and bad aspects of columnar data storage?
Fast read but slow write

What are BigQuery datasets and tables?
Dataset - collection of tables and views | Table - collection of columns

What do you pay for with BigQuery?
Storage, queries, and streaming data inserts

How can you interact with BigQuery? (4)
Web UI - Command line (CLI) - Programmatic (REST API) - Via queries

What is a BigQuery "View"?
Virtual table defined by a query

What are the benefits of caching BigQuery queries and how are cached queries stored?
If you make the same query repeatedly, you won't have to pay for it - Queries are cached at the user level

What are User Defined Functions (UDFs) and what are their benefits?
SQL code combined with JavaScript functions - Allows for more complex functions like loops and complex conditionals

What services can be used to ingest data to BigQuery?
Cloud Storage (Batch, Read) - Bigtable (Read) - Dataflow (Streaming) - Google Drive (Read)

How would you connect Dataproc job outputs to BigQuery?
Write Dataproc output to Cloud Storage, then load it from Cloud Storage into BigQuery

Where can BigQuery export data to and what are the max file sizes?
ONLY to Cloud Storage
Max 1 GB/file, but files can be split up

What are six best practices of BigQuery queries?
1) Avoid SELECT * (loads too much data, costs a lot) 2) Denormalize data 3) Filter early (minimizes later data processing) 4) Do the biggest JOINs first 5) LIMIT does not affect the cost!! (just the output you see; cost depends on the data read from the columns you select) 6) Partition data by date

Why does the LIMIT clause not affect the cost of a BigQuery query?
Queries are charged by the amount of data read from the columns they reference, and LIMIT doesn't change how much data is processed
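
A toy model of that billing rule, assuming a plain table with no partitioning or clustering. The column names and sizes are invented, and the function is not part of any BigQuery client library; it only shows that the bytes billed follow the columns selected, not the rows returned.

```python
# Hypothetical stored size of each column, in bytes.
COLUMN_SIZES = {
    "user_id": 4_000_000,
    "event_time": 8_000_000,
    "payload": 900_000_000,
}


def bytes_scanned(column_sizes, selected_columns, limit=None):
    """Estimate bytes processed by a query over a columnar table.

    `limit` is accepted but deliberately ignored: in this simplified
    model every selected column is read in full, and LIMIT only trims
    the rows that are returned afterwards.
    """
    return sum(column_sizes[name] for name in selected_columns)


one_column = bytes_scanned(COLUMN_SIZES, ["user_id"])
one_column_limited = bytes_scanned(COLUMN_SIZES, ["user_id"], limit=10)
select_star = bytes_scanned(COLUMN_SIZES, list(COLUMN_SIZES))
```

Here `one_column` and `one_column_limited` are equal, while `select_star` is far larger, which is also why avoiding SELECT * matters.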

What are the four steps (and their colors) to look at to best optimize query performance?
Look at the "Execution Plan" view: - Wait (Yellow) - Read (Purple) - Compute (Orange) - Write (Blue)

What are machine learning algorithms ultimately trying to do?
Make accurate predictions using completely new data based on historical data they've seen

What are the six steps in developing a machine learning model?
1) Collect Data 2) Organize Data 3) Develop model - use historical data 4) Test model - using different historical data 5) Train model - using new data 6) Deploy model - then keep training with new data!

What is an ML "feature"?
A variable - raw or calculated (engineered feature)

What is "Inference"?
Apply trained model to new examples

What is model overfitting?
When a model fits the training dataset too closely. It won't be able to make good predictions on new data.

What is Supervised Learning?
Create a function from labeled training data. - The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and the desired output value (also called the supervisory signal).

What are the two main types of Supervised Learning and what are they used for?
Regression - Continuous variables; predict stock prices | Classification - Categorical variables; yes/no, is this a cat?

What is unsupervised learning?
Uses unlabeled data and tries to find natural clusters and patterns.

What are the three types of ML "learning"?
Supervised - Unsupervised - Reinforcement

What is Reinforcement learning and what are some examples of it?
Use positive/negative reinforcement to complete a task - E.g. complete a maze or learn chess

What is a neural network model?
Model with multiple layers with connected units (neurons)

What is an ML neuron?
Node, combines input values and creates one output value

What is a neural network "input"?
Value fed to a neuron (e.g. cat pic)

What is a neural network "hidden layer"?
Set of neurons operating from the same input set

What is feature engineering?
Deciding which features to use in a dataset; which features can we calculate that might be predictive?

Why do we do feature engineering in ML?
Brings in human intuition, we can hypothesize variables that might be predictive

What is a neural network "feature"?
A transformation of inputs

What is an ML "epoch"?
A single pass through the training dataset - Each run through building the ML model

What are ML model weights?
The values each input is multiplied by

What is ML "bias"?
Value of output given a weight of zero
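
A minimal sketch of how a single neuron uses weights and a bias, assuming the simplest case with no activation function. The function name and numbers are illustrative only.

```python
def neuron(inputs, weights, bias):
    """Combine several input values into one output value.

    Each input is multiplied by its weight, the products are summed,
    and the bias is added on top.
    """
    return sum(x * w for x, w in zip(inputs, weights)) + bias


weighted = neuron([1.0, 2.0], [0.5, -1.0], bias=0.25)
# With every weight at zero, only the bias is left in the output.
bias_only = neuron([1.0, 2.0], [0.0, 0.0], bias=0.25)
```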

What is Gradient Descent?
We walk down an error surface trying to find the smallest error (MSE) - Iterative process

What are Small and Large steps in ML Learning rates?
Small - finds the smallest error but takes a long time | Large - runs faster but isn't as accurate; may never converge to the true minimum
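
The effect of step size can be seen on a toy error surface, error(w) = (w - 3)^2, whose minimum is at w = 3. This is a hand-rolled sketch for intuition, not how TensorFlow implements its optimizers.

```python
def gradient_descent(learning_rate, steps=50, start=0.0):
    """Walk down the toy error surface error(w) = (w - 3)^2."""
    w = start
    for _ in range(steps):
        gradient = 2 * (w - 3)            # derivative of (w - 3)^2
        w = w - learning_rate * gradient  # take one step downhill
    return w


small_steps = gradient_descent(learning_rate=0.1)  # settles very close to 3
large_steps = gradient_descent(learning_rate=1.0)  # overshoots on every step
```

With a learning rate of 1.0 each step jumps clean over the minimum, so the value bounces between 0 and 6 and never converges.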

How are images processed in ML models?
Each pixel is a numeric value (e.g. 1 or 0 for a black-and-white image) - The input to every neuron is a pixel

What are hyperparameters?
Variables used in the training process itself - Learning rate, etc

What are deep and wide neural networks?
Deep - has many hidden layers; good for generalization --> easier for neural networks (e.g. numeric data, pictures) | Wide - has many features; good for memorization; hard for neural networks (e.g. one-hot encoded) - The best models have deep and wide parts of the network!

Should you automatically throw out outliers?
No, see why outliers might be occurring in the dataset

What are two best practices for an ML dataset?
Make sure data covers all possible use cases - Include negative examples and near misses

What is the precision of a classifier model and when is it useful?
Accuracy when the classifier says "yes" | True Positive / (True Positive + False Positive) | Use when things you're trying to find are very common

What is "recall" in a classifier model and when is it useful?
Accuracy when the truth is "yes" | TP / (TP + FN) | Use when the things you're trying to find are very rare

What is "accuracy" in a classifier model and when is it useful?
Accuracy of all evaluated data | Cross-entropy for classifiers | Use when the things you're trying to find are balanced
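
The three metrics above can be computed from the counts of true/false positives and negatives. The counts below are made up to show why accuracy alone can mislead when positives are rare.

```python
def precision(tp, fp):
    """Of everything the classifier called "yes", how much really was."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Of everything that really was "yes", how much the classifier found."""
    return tp / (tp + fn)


def accuracy(tp, tn, fp, fn):
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical results on 100 examples, only 16 of which are truly "yes".
tp, fp, fn, tn = 8, 2, 8, 82
p = precision(tp, fp)
r = recall(tp, fn)
a = accuracy(tp, tn, fp, fn)
```

Here accuracy is 0.9 even though recall is only 0.5: half of the rare positives were missed.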

What is TensorFlow?
Software library for ML model creation - Pre-processing, feature creation, and model training

What should TensorFlow be used for and by whom?
Creating an ML model and initial training - Used by ML researchers

What is Cloud ML Engine?
TensorFlow library | Distributed training and prediction | Hyperparameter tuning

What are the two types of Cloud ML Engine predictions? What are their inputs and outputs?
Batch - Get inference on large amounts of data; Cloud Storage is the input and output | Online - Fast requests with minimal latency; Input - JSON strings, output - returned in response message

What frameworks does ML Engine support?
TensorFlow | scikit-learn | XGBoost

Should you read directly from BigQuery to ML Engine?
No, it's better to pre-process in Cloud Storage

What are the three types of regressors in TensorFlow?
Linear (regression) | Linear Classifier (logistic) | Deep Neural Network (DNN)

What is the benefit of training in Cloud ML Engine instead of TensorFlow?
Cloud ML Engine can distribute training across many machines, decreasing processing time and increasing stability

Why is TextLineReader an efficient way to read data into TensorFlow?
It reads data directly into the computation graph

What four characteristics make a good ML model feature?
1) It's related to the objective 2) Should be known at production time 3) Has to be numeric with meaningful magnitude 4) Needs a big enough sample size

What is feature crossing and why is it useful?
It concatenates multiple variables - Can simplify learning

How do you use categorical variables in ML models?
You need to one-hot encode them --> create sparse columns
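
A minimal one-hot encoder in plain Python, assuming the full vocabulary of categories is known up front. Libraries such as TensorFlow and scikit-learn provide their own encoders; this only shows the idea.

```python
def one_hot(value, vocabulary):
    """Turn one categorical value into a sparse 0/1 column per category."""
    return [1 if value == item else 0 for item in vocabulary]


colors = ["red", "green", "blue"]
encoded = one_hot("green", colors)
```

Each category gets its own column, and exactly one of them is set for any given example.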

Why do you bucketize/discretize continuous variables?
To not weight variables with specific values that are overly meaningful (e.g. latitude)
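
Bucketizing and feature crossing can be sketched together. The boundaries and the naming of the crossed feature are arbitrary choices for illustration, not values from any particular model.

```python
from bisect import bisect_right


def bucketize(value, boundaries):
    """Return the index of the bucket a continuous value falls into.

    `boundaries` must be sorted; a value equal to a boundary goes
    into the bucket that starts at that boundary.
    """
    return bisect_right(boundaries, value)


def cross(first_bucket, second_bucket):
    """Feature cross: concatenate two bucketized features into one category."""
    return f"{first_bucket}_x_{second_bucket}"


lat_bucket = bucketize(37.7, [30, 35, 40, 45])        # latitude 37.7
lon_bucket = bucketize(-122.4, [-125, -120, -115])    # longitude -122.4
grid_cell = cross(lat_bucket, lon_bucket)
```

The model now sees a grid cell rather than raw coordinates, so it can learn something about each region instead of treating 37.7 as "more" than 35.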

When would you use pre-trained APIs over a custom ML model?
I need it quickly - I can't make an ML model - It fits my use-case

How do you pay for pre-trained APIs?
You pay per API request

What is Datalab?
Interactive notebook for exploring and visualizing data - Based on Jupyter notebooks

What GCP services can you visualize with Datalab?
BigQuery, ML Engine, Cloud Storage, Compute Engine, and Stackdriver

How do you connect to Datalab?
SSH in Cloud console and create Datalab instance

What is the Datalab web preview port?
8081

What is Dataprep?
Intelligent data preparation - Managed, serverless, and web-based

How does Dataprep work?
It's backed by Dataflow - Has automatic options (remove outliers, dedupe), but you can add custom options as well

What is DataStudio?
For data viz and dashboards (e.g. Tableau) - Part of G Suite, NOT Google Cloud

What services can DataStudio connect to?
GCP - BigQuery, Spanner, Cloud SQL, GCS | G Suite - YouTube Analytics, Sheets, AdWords | Many third party integrations

Can you change the region of a BigQuery dataset?
No, a dataset's location is immutable once the dataset is created

What is the best metric to use for determining when to scale?
CPU utilization, NOT storage utilization

At what level is BigQuery data access controlled?
At the data-set level

What's the difference between Failover and Disaster Recovery?
Failover has a very short downtime - Disaster Recovery may incur delays before service is restored

What is a best practice for Identity and Access Management (IAM)?
Assign roles to groups and give groups access privileges

In ML Engine, should you monitor jobs or operations?
Jobs

Do you need to restore snapshots to use them?
No, you can use them right away

What is Apache Kafka?
An open-source distributed messaging and streaming platform - Roughly the self-managed (often on-premise) counterpart to Pub/Sub

What is Cloud Composer?
A workflow orchestration tool built on Apache Airflow
Works with cloud and on-premise servers

What is Cloud Memorystore?
Fully managed in-memory data store service for Redis

What is BigQuery ML?
ML model development in a SQL querying language.
Enables data analysts to build ML models without having to export data and re-import

What four model types does BigQuery ML support?
Linear regression - Binary Logistic regression - Multiclass logistic regression for classification - K-means clustering

What is "Prefetch" caching?
Caching ahead of time to predict what might be searched for. - Only possible on Owner credentials. - Can be turned off (unlike query caching)

How many standard nodes are required before you can use Preemptible nodes in a Dataproc cluster?
Two; you can't only have preemptible nodes

What are the two shutdown options for Dataflow?
Cancel - stops all processes immediately | Drain - stops ingesting and finishes processing current data

At what levels can you control IAM in Pub/Sub?
Project - Topic - Subscription

What IAM roles exist for Pub/Sub?
Admin, Editor, Publisher, Subscriber

What types of accounts are a best practice for IAM in Pub/Sub?
Service accounts

What are two partitioning methods in BigQuery?
Ingestion-time partitioned | Partition by specific timestamp/date column

What is the command to train models on ML Engine?
'gcloud ml-engine jobs submit training'

What is the command to deploy trained models on ML Engine?
'gcloud ml-engine models create' followed by 'gcloud ml-engine versions create'

What are the "Project and Model" IAM roles in ML Engine?
Admin - full control | Developer - Create jobs, request predictions | Viewer - Read only

What are the "Model Only" IAM roles in ML Engine?
Model Owner - Full access | Model User - Read models and use for prediction

What are the Machine scale tiers for ML Engine?
BASIC - single worker | STANDARD_1 - 1 master, 4 workers, 3 parameter servers | PREMIUM_1 - 1 master, 19 workers, 11 parameter servers | BASIC_GPU - 1 worker with GPU | CUSTOM

How is ML Engine priced?
Per hour of use

What is a deep neural network?
A neural network with at least three hidden layers

What benefits are there to training a model locally before deploying it to ML Engine?
Lower cost - Quick iterations

What IAM roles are required to use Datalab?
Compute Instance Admin - Service Account Actor

How is data shared in Datalab?
Via shared Cloud Source Repository - At the Project level

What languages does Datalab support?
Javascript, Python, and SQL

What three things does a Datalab instance run on?
On a Compute Engine instance, dedicated VPC, and Cloud Source repository

What are the supported file types for Dataprep?
CSV, JSON, txt, XLS, LOG, TSV, AVRO

What is an AVRO file?
Serialized data file for Apache Hadoop projects

What is Cloud TPU?
Tensor Processing Unit - custom ML training computer

What service should you use for semi-structured data < 1 TB in size?
Datastore

What are the four IAM levels for Dataflow?
Admin - Full pipeline access and machine type config | Developer - Full pipeline access, NO machine type config | Viewer - View permissions only | Worker - Only for service accounts

Can Pub/Sub stream to Dataproc?
Yes, then Dataproc can move data to the right place

How do you use Dataproc for real-time streaming data and analytics on that data?
Pub/Sub --> Dataproc | Dataproc --> Bigtable and Cloud Storage (analytics on both)

Can different environments be in the same project?
No, you need to create different projects for environments that must be fully isolated

Can someone with an IAM role for one service have the same role for others?
No, IAM roles are assigned by service. - E.g. a Dataflow Developer with no other IAM roles can't see the underlying data; they can just manage the pipelines

What are two benefits of denormalizing data for BigQuery?
Increased query performance - Decreased query complexity (you don't have to use JOIN clauses)

Does denormalization change the amount of data in BigQuery?
No, but performance is increased since you don't have to query against as many tables

What is the recommended amount of data to store in Bigtable?
1 TB

What is the maximum number of tables BigQuery can query at once?
1000

What data types can be imported to Cloud SQL?
CSV and SQL dumps

How is data shared between Datalab notebooks?
The Cloud Source Repository

What are the two options for creating team Datalab notebooks and what is the IAM difference between the two?
Team lead creates notebooks for users - Everyone accesses the same shared repository for notebooks - requires all users to be project editors

What are the three types of triggers in Dataflow?
Element count - Combinations of triggers - Timestamps

What are the four levels of Cloud Spanner IAM?
Admin - Full access | Database Admin - Create, edit databases; grant access to databases | Reader - read/execute database/schema | Viewer - cannot modify or read from database; can only view instances

What are the three levels of Dataproc IAM?
Editor - Full access | Viewer - view access only | Worker - service accounts (e.g. read/write to gcs)

What are the six levels of BigQuery IAM?
Admin - full access | Data Owner - full dataset access | Data Editor - edit dataset tables | Data Viewer - view datasets and tables | Job User - run jobs | User - run queries and create datasets

What are the two levels of Dataprep IAM?
User - Run Dataprep in a project | Service Agent - necessary for cross-project access; gives Trifacta the access it needs

How can you use BigQuery and Dataproc together?
Use the BigQuery connector in Dataproc (uses cloud storage for staging)

How do you convert images, video, etc. for use with the APIs?
Use Cloud Storage URI - Encode in base64 format
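
Both options can be sketched as a small request builder. The JSON shape below is modeled on the Vision API's image annotation request body and should be treated as an assumption to check against the current API reference; the bucket name and feature type are placeholders.

```python
import base64


def build_image_request(image_bytes=None, gcs_uri=None):
    """Build a request body that passes an image by URI or as base64 text."""
    if gcs_uri is not None:
        # Option 1: point the API at a file already in Cloud Storage.
        image = {"source": {"imageUri": gcs_uri}}
    elif image_bytes is not None:
        # Option 2: send the raw bytes inline, base64-encoded as ASCII text.
        image = {"content": base64.b64encode(image_bytes).decode("ascii")}
    else:
        raise ValueError("Provide either image_bytes or gcs_uri")
    return {
        "requests": [
            {"image": image, "features": [{"type": "LABEL_DETECTION"}]}
        ]
    }


by_uri = build_image_request(gcs_uri="gs://my-bucket/cat.jpg")
inline = build_image_request(image_bytes=b"fake-image-bytes")
```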

What language does Cloud Composer use?
Python

What are the Cloud Composer charts called?
Directed Acyclic Graphs
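
What "directed acyclic graph" means for a workflow can be shown with the Python standard library (Python 3.9+). This is not Airflow or Cloud Composer code, and the task names are hypothetical; it only shows that each task runs after the tasks it depends on.

```python
from graphlib import TopologicalSorter

# Map each task to the set of tasks that must finish before it starts.
workflow = {
    "extract": set(),
    "transform": {"extract"},
    "load_to_bigquery": {"transform"},
}

# A valid run order exists only because the graph has no cycles;
# a cycle would raise graphlib.CycleError here.
run_order = list(TopologicalSorter(workflow).static_order())
```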

What are the two main components of Cloud IoT Core?
Device Manager - Manages devices programmatically or via CLI | Protocol Bridge - load balancing, publishes device telemetry to Pub/Sub

What's the difference between hybrid cloud and multi-cloud setups?
Hybrid Cloud has a public cloud and a private datacenter - Multi-Cloud has at least two public clouds in the system, but can also have a private datacenter

What is Google VPC?
Google Virtual Private Cloud Network - has a VPN, allows you to control networking, firewalls, ports, etc

What are the three Hybrid connectivity products?
Cloud Interconnect - Connects on-premise devices to GCP | Cloud VPN | Peering - Direct and Carrier; Doesn't require GCP

What are the two types of Cloud Interconnect?
Dedicated Interconnect - Connect directly to GCP VPC | Partner Interconnect - Connect to a service partner which connects to GCP; doesn't require equipment maintenance

What service can you use to EASILY build semi-custom ML models?
AutoML: Natural Language - Tables - Vision - Translation - Video Intelligence

What is Dialogflow?
A service to build conversational interfaces: chatbots, assistants, etc.
Use Cases: Customer Service, Commerce,

What is the purpose of the Data Loss Prevention API?
To find sensitive data

What does the COPPA legislation protect?
The rights of data for children under 13

What is Datastore a replacement for?
Cassandra

What parameters does an ML algorithm adjust?
Weights and biases

How many concurrent interactive queries can you run at once?
50

How many concurrent queries can you run against Bigtable?
4

How many concurrent queries can you run that use Legacy SQL and UDFs?
6

How does denormalizing data help queries run faster?
It enables the data processing to be done in parallel

What are two use cases for streaming insert data?
Non-transactional data - Aggregate analysis

What is Key Visualizer?
A tool to help you understand Bigtable usage - Helps you find where hotspotting is occurring - Which rows have too much data? - Are access patterns balanced?

How do you create a Side Input in a Dataflow pipeline?
Turn a PCollection into a view; call the ParDo with a side input.

Side Inputs are useful for when the ParDo needs additional data for its operations, but the data is pulled at runtime (not hard coded)

What command do you use to continuously sync between on-prem and GCP?
rsync

Why use wildcard tables in BigQuery and how do you use them?
They allow you to query multiple tables at once

Add an asterisk to the end of the table name

What BigQuery keyword do you use to select from multiple wildcard tables by their suffixes?
_TABLE_SUFFIX

Put _TABLE_SUFFIX criteria in the WHERE clause of the query
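
A sketch of what such a query looks like, built as a string in Python. The project, dataset, and table prefix are hypothetical, and the string formatting is for illustration only; real code should use query parameters for any user-supplied values.

```python
def wildcard_query(table_prefix, start_suffix, end_suffix):
    """Build a Standard SQL query over every table that shares a prefix.

    The asterisk after the prefix makes it a wildcard table, and
    _TABLE_SUFFIX in the WHERE clause narrows which tables are scanned.
    """
    return (
        f"SELECT COUNT(*) AS row_count FROM `{table_prefix}*` "
        f"WHERE _TABLE_SUFFIX BETWEEN '{start_suffix}' AND '{end_suffix}'"
    )


# Matches tables such as events_20240101 through events_20240131.
sql = wildcard_query("my-project.my_dataset.events_", "20240101", "20240131")
```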

Is streaming or batch loading more cost effective for BigQuery?
Batch loading

What are three use cases for BigQuery external tables?
ETL Operations on data - Frequently changed data - Data ingested periodically

What is an external data source in BigQuery?
A datasource you can query directly even though it's not stored in BigQuery

When is table clustering useful in BigQuery?
It speeds up queries; Use for queries with aggregations or WHERE clause filtering

What kinds of tables can you cluster in BigQuery?
Ingestion-time and date-time partitioned tables

What's the difference between Legacy and Standard SQL when using a project qualified table name?
Standard SQL puts a period after the project: `project.dataset.table` | Legacy SQL uses a colon: [project:dataset.table]

What is the recommended authentication account for Cloud Composer?
Service Account

What are BigQuery template tables and how are they made?
They're tables partitioned by a non-date variable (e.g. user-id). You make them by specifying a <templateSuffix> in the table insert request

BigQuery streaming inserts: max row size? max rows/second? max rows/request?
1 MB - 100,000 rows - 10,000 (GCP Recommends 500)
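
One way to stay near the recommended 500 rows per request is to split rows into batches before sending them. This sketch stops short of calling any client library; the send step is left out on purpose.

```python
def chunk_rows(rows, rows_per_request=500):
    """Yield successive batches of rows, each small enough for one request."""
    for start in range(0, len(rows), rows_per_request):
        yield rows[start:start + rows_per_request]


# 1,200 placeholder rows become batches of 500, 500, and 200.
rows = [{"id": i} for i in range(1200)]
batch_sizes = [len(batch) for batch in chunk_rows(rows)]
```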

What is the F1 value for model builds?
Harmonic mean of precision and recall
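
F1 is usually computed as the harmonic mean of precision and recall, which pulls the score toward the lower of the two. A minimal sketch with made-up inputs:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


score = f1_score(precision=0.8, recall=0.5)
simple_average = (0.8 + 0.5) / 2
```

The F1 value comes out below the simple average of 0.65, because the weaker recall drags it down.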

What are Sigmoids and Softmaxes?
Sigmoid - maps a number to a probability between 0 and 1; used in logistic regression | Softmax - like sigmoid, but for multiple classes; the outputs sum to 1
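
Both functions are short enough to write out. This is a plain-Python sketch for intuition; ML libraries ship their own vectorized versions.

```python
import math


def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(scores):
    """Turn a list of scores into probabilities that sum to 1."""
    largest = max(scores)
    # Subtracting the largest score avoids overflow in exp() and
    # does not change the result.
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


midpoint = sigmoid(0)              # a score of 0 maps to 0.5
class_probs = softmax([1.0, 2.0, 3.0])
```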

What is regularization?
Approach to reducing overfitting (dropout, penalizing large weights, etc.)

What are Estimators in modeling?
A TensorFlow high-level representation of a complete model

How does Cloud Spanner distribute data across nodes?
Load-based splitting

What do you use to update, insert, or delete data in Cloud Spanner tables?
Data-Manipulation Language (DML)

When should you use Dataproc autoscaling?
When using external storage solutions (GCS, BigQuery) - Clusters that process many different jobs - Scale up single-job clusters

When should you NOT use Dataproc autoscaling?
HDFS - YARN Node labels - Spark Structured Streaming - Idle Clusters

Which external data sources can you query directly from BigQuery?
Bigtable - Cloud Storage - Google Drive

Note: Not as fast as querying directly from BigQuery

Can you change a Spanner instance configuration after you make it?
No

Does Spanner autoscale the number of nodes when workloads increase?
No

What are two approaches to ETL of external data to BigQuery?
BigQuery UI - Dataflow

What is Kubernetes Engine for?
Deploying containerized applications

What is a logpoint snapshot in Stackdriver Debugger?
It's a debug snapshot generated while the program is running. You put a line of code into the program to create the snapshot without stopping the program itself

What is Sqoop used for?
Transfer data between relational databases and Hadoop

What is DirectPipelineRunner used for?
Running Dataflow pipeline operations locally

What does Stackdriver Error Reporting tell you?
It aggregates crashes in your cloud services

What is Stackdriver Trace used for?
Finding bottlenecks and latency in your data processing structures

What kind of data sources does BigQuery Data Transfer Service support?
SaaS services (from Google and others, e.g. Amazon S3)

How should you handle invalid inputs in a Dataflow pipeline?
Send the invalid inputs to a "deadletter" side output and re-process the data later
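
The dead-letter idea can be sketched without Beam: good records continue down the main path, and records that fail to parse are set aside instead of crashing the job. In an actual Dataflow pipeline the second list would typically be a separate tagged output of the parsing step; the function name and JSON input format here are assumptions for the example.

```python
import json


def split_valid_invalid(raw_messages):
    """Parse each message, routing failures to a dead-letter list."""
    valid, dead_letter = [], []
    for raw in raw_messages:
        try:
            valid.append(json.loads(raw))
        except json.JSONDecodeError:
            # Keep the original payload so it can be inspected and replayed.
            dead_letter.append(raw)
    return valid, dead_letter


good, bad = split_valid_invalid(['{"id": 1}', "not json", '{"id": 2}'])
```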


