By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.
Bigtable's security can be controlled at what points? Folder Project Instance
BigQuery's security can be controlled at what points? Folder Project Dataset
A project can contain a Bigtable ____, which contains one or two _____, each with 3 to many ____ Instance Clusters Nodes
If configured with SSDs, a Bigtable node can handle ____ QPS read or write, with a latency of ____ms, and can hold ____ total data. 10,000 6ms 2.5TB
Remember, with the 3 node minimum in a production cluster, that means a base production cluster can handle 30,000 QPS and will hold 7.5 TB.
If configured with HDDs, a Bigtable node can handle ____ QPS read with a latency of ____, ____ QPS writes, with a latency of ____, and can store ____ of data. 500 200ms 10,000 50ms 8TB
Remember, with the 3 node minimum in a production cluster, that means a base production cluster can hold 24 TB of data and handle 1,500 read QPS and 30,000 write QPS.
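The per-node arithmetic in the Bigtable cards above can be sketched in a few lines of Python. This is only a study aid: the numbers are copied from the cards (not from live quotas), linear scaling by node count is assumed, and `NodeProfile` and `cluster_capacity` are made-up names for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeProfile:
    """Per-node figures as quoted in the cards (study-guide numbers only)."""
    read_qps: int
    write_qps: int
    storage_tb: float


# Values copied from the cards above; hypothetical constants for illustration.
SSD_NODE = NodeProfile(read_qps=10_000, write_qps=10_000, storage_tb=2.5)
HDD_NODE = NodeProfile(read_qps=500, write_qps=10_000, storage_tb=8.0)


def cluster_capacity(profile: NodeProfile, nodes: int = 3) -> dict:
    """Scale the per-node figures linearly by the node count (an assumption)."""
    if nodes < 1:
        raise ValueError("a cluster needs at least one node")
    return {
        "read_qps": profile.read_qps * nodes,
        "write_qps": profile.write_qps * nodes,
        "storage_tb": profile.storage_tb * nodes,
    }


print(cluster_capacity(SSD_NODE))  # 3-node SSD production minimum
print(cluster_capacity(HDD_NODE))  # 3-node HDD production minimum
```

Passing a different `nodes` value shows how the same figures grow as a cluster is resized.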
An Apache Beam ____ aggregates data, which is then emitted by a ____. The emitted data is known as a ____. Window Trigger Pane
The four types of Dataflow (Apache Beam) windows are? Fixed Time Sliding Time Per-Session Single Global
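To make the window types a little more concrete, here is a rough plain-Python sketch of how an event timestamp might be assigned to fixed, sliding, and per-session windows. It does not use the Beam SDK; the function names and the integer-seconds timestamps are assumptions made for illustration, and the single global window is omitted because it simply holds everything.

```python
from typing import List, Tuple


def fixed_window(event_ts: int, size: int) -> Tuple[int, int]:
    """Return the [start, end) fixed window that contains event_ts."""
    start = event_ts - (event_ts % size)
    return (start, start + size)


def sliding_windows(event_ts: int, size: int, period: int) -> List[Tuple[int, int]]:
    """Return every [start, end) sliding window that contains event_ts."""
    windows = []
    # Latest period-aligned window start that is not after the event.
    start = event_ts - (event_ts % period)
    while start > event_ts - size:
        windows.append((start, start + size))
        start -= period
    return sorted(windows)


def session_windows(timestamps: List[int], gap: int) -> List[List[int]]:
    """Group timestamps into sessions separated by at least `gap` of silence."""
    sessions: List[List[int]] = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] < gap:
            sessions[-1].append(ts)  # still inside the current session
        else:
            sessions.append([ts])    # gap exceeded: start a new session
    return sessions


print(fixed_window(125, 60))
print(sliding_windows(125, 60, 30))
print(session_windows([1, 2, 3, 20, 21, 50], 10))
```

Note that one event lands in exactly one fixed window but can land in several sliding windows, which is the main practical difference between the two.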
The four types of Dataflow (Apache Beam) triggers are? Event Time Processing Time Data-Driven Composite
Currently, Dataflow only supports the ____ data-driven trigger. .elementCountAtLeast()
In Cloud Dataflow the default window is type ____. Global
In Cloud Dataflow the default trigger for a PCollection is based on ____. Event Time
In Dataflow, what's a PCollection? A distributed set of data that your Dataflow pipeline operates on. It's usually initially created by a read operation on an external data source. Each PTransform in the pipeline then starts with a PCollection, does something to each element in it, and generates 1+ new PCollections.
In Cloud Dataflow if using the default window together with the default trigger, the trigger fires ____ time(s) and late data is ____. 1 Discarded
In Cloud Dataflow data is guaranteed to be processed in a pipeline in the order it was sent? True/False False
The Cloud Dataflow notion of when all data in a certain window can be expected to have arrived in the pipeline is known as the ____ Watermark. The delay between when an event happens, and when it gets processed at any point in the pipeline. That time difference ebbs and flows, thus the watermark name.
In most cases, BigQuery can automatically deduplicate streaming message inserts, true or false. True
In order for BigQuery to deduplicate streaming inserts, all inserted records must provide an ____ and the duplicate messages must arrive within ____ minute(s) of each other. insertId 1
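The deduplication rule in the card above can be modelled with a toy Python class. This is a simulation of the idea only, not BigQuery's actual implementation; `StreamingDeduplicator`, its 60-second default, and the float timestamps are all assumptions made for illustration.

```python
class StreamingDeduplicator:
    """Toy model: drop a row when its insert_id was accepted within the window."""

    def __init__(self, window_seconds: float = 60.0):
        self.window_seconds = window_seconds
        self._accepted_at = {}  # insert_id -> arrival time of the accepted row

    def accept(self, insert_id: str, arrival_time: float) -> bool:
        """Return True if the row is kept, False if it is treated as a duplicate."""
        previous = self._accepted_at.get(insert_id)
        if previous is not None and arrival_time - previous <= self.window_seconds:
            return False  # same insert_id inside the window: drop it
        self._accepted_at[insert_id] = arrival_time
        return True


dedup = StreamingDeduplicator()
print(dedup.accept("row-1", 0.0))   # first sighting is kept
print(dedup.accept("row-1", 30.0))  # repeat inside the window is dropped
print(dedup.accept("row-1", 90.0))  # repeat outside the window slips through
```

The last call is the case to remember for the exam: a duplicate that shows up after the window has passed is not caught.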
Google Cloud Machine Learning can train and serve ____, ____, ____, and ____ models Classification Regression Clustering Dimensionality Reduction
In machine learning, linear regression models are used primarily to: Estimate real values based on continuous variables
Examples of using Machine Learning Linear Regression models include: Total Sales Housing Prices Retirement Age
In machine learning classification models are used primarily to: Group items into known categories
Examples of using Machine Learning Classification models include: Spam, not spam Good movie, Bad movie Authorized, Fraudulent Good wine, bad wine Picture of: Cat, Dog, Goat,... All the words in English
In machine learning, Clustering models are used primarily to: Gain insight into sets of data by using unsupervised learning to see what groups the data points are falling into
Three major types of machine learning are ____, ____, and ____ learning. Supervised Unsupervised Reinforcement
K-Means is an example of an ____ learning algorithm used to spot ____. Unsupervised Clusters
In Machine Learning, what is Reinforcement learning? Reinforcement learning is an area of machine learning concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward. Good puppy, you did well. That idea. Almost a Pavlovian model.
In Machine Learning, what is Supervised learning? Supervised learning trains the Machine Learning algorithm from a dataset where we know and have labeled the correct answers. The model makes predictions on the training data, and learns from how far off it is from the correct answer. The process continues iteratively until the guess is within a desired error margin.
In Machine Learning, what is Unsupervised learning? Unsupervised learning learns from data that has not been labeled, classified, or categorized, by identifying commonalities and differences in the data.
In machine learning, what is Dimension Reduction? Dimension Reduction is the process of reducing the number of features.
Two major types of Dimensional Reduction are: Feature Extraction and Feature Selection
Feature Selection is a form of dimensional reduction which works by? Figuring out which features may be safely removed, leaving the rest.
Feature Extraction is a form of dimensional reduction which works by? Replacing a group of features with a new feature
In Pub/Sub, large volume message flows should use Push or Pull subscriptions? Pull. Push delivers one message at a time. Pull can pull batches
Pub/Sub adds what two pieces of data to each message? messageId and publishTime
In Pub/Sub the messageId is guaranteed to be unique within the ____. Topic
In Pub/Sub by default, if a recipient doesn't acknowledge a message within ____ seconds the message will be resent. 10
Pub/Sub can store messages for ____, after which time the message will be deleted. 7 days
In Pub/Sub the maximum time a subscriber can wait before acknowledging the receipt of a message is configurable. True/False True. gcloud pubsub subscriptions modify-message-ack-deadline .... FYI, default is 10sec
It's easy to switch a Pub/Sub subscriber from push to pull. True/False True
To determine which user has been accessing what in a project, examine the ____ log. Cloud Audit Logging Data Access
Google Cloud Audit Logging maintains what three log files for every project Data Access Admin Activity System Events
A common backup format for MySQL databases is? mysqldump
mysqldump backup files use a basic ____ file format. SQL, with both the content and structure specified.
What's the difference between a Bigtable Developer instance, and a Bigtable Production instance? A Developer instance has a single node and is designed for low cost testing and dev work. No SLA, guaranteed response time, etc. Upgradable at any point to Production.
A Production instance is exactly that. It has 1-2 clusters, each with a minimum of 3 nodes. Yes to SLA, etc. You cannot downgrade from Production to Developer
MapReduce is a what? MapReduce is a massively parallel big data processing technique and programming model for distributed computing; the Hadoop implementation is written in Java.
Apache Pig is a what? Apache Pig is a high level data analysis language (Pig Latin) designed to greatly simplify a developer's interaction with Hadoop MapReduce. Remember though, it uses MapReduce behind the scenes so it is no faster, just easier to code.
Apache Hive does what? Apache Hive facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Like Pig, it greatly simplifies an analyst's interaction with Hadoop and MapReduce. Unlike Pig, it supports SQL statements.
Apache Spark is a? Unified analytics engine for big data processing, with built-in modules for streaming, SQL, machine learning, and graph processing.
Apache Spark uses memory to achieve high performance gains over classic MapReduce. True/False True. It can be up to 100 times faster than MapReduce for in-memory workloads.
Apache Spark has programming APIs (Application Programming Interfaces) for which languages? Java Scala Python R
Apache Pig is a Java API. True/False False Apache Pig allows users to interact and do data analysis with Hadoop and MapReduce using its own script syntax: Pig Latin
Apache Hive is written using a combination of ___ and ___. SQL Java
Apache Pig is much faster than Java/MapReduce. True/False False Pig uses its own scripting language, Pig Latin, but uses MapReduce to do all its work.
The technique that can be used to provide secure access to a single BigQuery table or view is called? Authorized View
What steps are used to setup a BigQuery authorized view? Create table T1 in dataset DS1 Secure DS1 Create dataset DS2 Allow access from DS2 to DS1 Create a view in DS2 of T1 Provide access to DS2 to the user
Google Cloud Datastore is being upgraded and renamed to? Firestore
By default, Firestore (Datastore) automatically predefines an index for each property of each entity kind. True/False In Datastore mode, True. In Firestore by default, False; you have to manually configure the indexes.
By default, Bigtable automatically predefines an index for each property of each entity kind. True/False False. The only index in Bigtable is the key
In Bigtable, the unique identifier for each record is known as the ____? Row Key
In Cloud Datastore, what is index explosion and when can it happen? Cloud Datastore creates an entry in a predefined index for every indexed property of every entity. If a property has multiple values (a movie's actors), it creates an index entry for each value, and an index that covers several such properties needs an entry for every possible combination of their values. Multiple multi-value properties (a movie's actors, tags, and genres) therefore have a combinatorial effect on the index entry count, and it can explode past the index limits.
In Cloud Datastore, the Movie entity has a list property for actors, a list property for genres, and a single value for title. How could a custom index be created to avoid index explosion?
indexes:
- kind: Movie
  properties:
  - name: actors
  - name: title
- kind: Movie
  properties:
  - name: genres
  - name: title
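The combinatorial effect, and why splitting the index helps, can be shown with a little arithmetic. The sketch below assumes that a composite index needs one entry per combination of values across the properties it covers; the `index_entries` helper and the sample movie are hypothetical, made up for illustration.

```python
from math import prod


def index_entries(entity: dict, indexed_properties: list) -> int:
    """Count the entries one composite index needs for a single entity.

    Assumes one entry per combination of values across the indexed properties.
    """
    counts = []
    for name in indexed_properties:
        value = entity[name]
        counts.append(len(value) if isinstance(value, list) else 1)
    return prod(counts)


# Hypothetical movie with 10 actors and 5 genres.
movie = {
    "title": "Example Movie",
    "actors": [f"actor-{i}" for i in range(10)],
    "genres": [f"genre-{i}" for i in range(5)],
}

# One index over both list properties multiplies the counts.
combined = index_entries(movie, ["actors", "genres", "title"])

# Two separate indexes, as in the card above, only add them.
split = index_entries(movie, ["actors", "title"]) + index_entries(movie, ["genres", "title"])

print(combined, split)
```

The gap between the two totals widens quickly as more list properties, or longer lists, are added to a single index.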
In Cloud Datastore, the maximum size for an entity is? 1MiB
MongoDB is what kind of database? JSON document store
MongoDB has a max database size of? 32TB
Apache Cassandra is what? A highly scalable, highly available NoSQL database
Apache Kafka is most similar to what GCP product? Pub/Sub
Cloud SQL is a managed version of what? MySQL or PostgreSQL database
In Cloud SQL, the maximum amount of data that can be stored in MySQL is? 10.23TB
In BigQuery, what is the proper way to reference a field in a repeated nested column (like a customers column which has a nested country)? UNNEST the nested column, then work with the field. Like: SELECT customer FROM `cool.example.table`, UNNEST(customers) AS customer WHERE customer.country = "USA" (the comma before UNNEST is an implicit CROSS JOIN).
BigQuery can export data in what formats? JSON Avro CSV
The max size of any single file in a BigQuery export is? 1 GB
When BigQuery is exporting more than 1 GB of data, use what format for the export file name? gs://[YOUR_BUCKET]/file-name-*.json So it can create a series of files of up to 1 GB each
How does BigQuery charge? By data storage (Bucket pricing) By slot (a measure of processing resources) By Gig of data streamed in
BigQuery doesn't charge for data batch loaded from, or exported to, a bucket in the same region. True/False True, but it does charge for streaming inserts
BigQuery should be able to handle streaming inserts up to ___ rows per second, per project. 100,000
BigQuery can load files from what sources? File Upload Google Cloud Storage Google Drive Bigtable
When a BigQuery table is set up to partition, the partitions are separated based on ____. Time. Daily by default, but they can be configured to use other granularities (hourly, monthly, or yearly).
What steps should be taken to change a BigQuery standard table to a partitioned table? The partition type of a BigQuery table can't be changed. Would have to export to a new table.
What two types of BigQuery partitioned tables exist? Ingestion Time Partitioned Tables Partitioned Tables
In BigQuery ingestion time partitioned tables, what two pseudo columns are added to the tables? _PARTITIONTIME _PARTITIONDATE
In BigQuery ingestion time partitioned tables, how are the partitions created? BigQuery automatically loads data into daily, date-based partitions that reflect the data's ingestion or arrival date.
It's possible to control the time frames used by BigQuery to create partitions. True/False True. Partitioned tables allow you to bind the partitioning scheme to a specific TIMESTAMP or DATE column.
Creating a BigQuery table with SQL such that each partition expires seven days after its partition date would require what option? partition_expiration_days=7 (this controls how long partitions are kept, not how wide they are)
CREATE TABLE cooldataset.coolnewtable (neatfield INT64, transaction_date DATE) PARTITION BY transaction_date OPTIONS( partition_expiration_days=7, description="a table partitioned by transaction_date" )
In many cases it's more efficient to denormalize data and load it into a single big BigQuery table because? BigQuery is a columnar store without traditional indexes, so joins across large tables are relatively inefficient. Yes, denormalization uses more disk space, but that's cheap so...
The best way to denormalize data in BigQuery is to take advantage of its ____. Native support for nested and repeated structures.
In a BigQuery query, what does the LIMIT do? What doesn't it do? LIMIT does limit the number of records in the result, but it does not change the number of records processed by the query
When a machine learning model is training, it adjusts what values? Weights and Bias
In Machine Learning, a neuron does what? It accepts a group of weighted inputs, applies an activation function, and returns an output
In Machine Learning, each neuron accepts what? Features from a training set or outputs from a previous layer of neurons.
In Machine Learning, what does a Bias term represent? It represents a constant value added to the input of a neuron. So if a neuron calculation comes to 0, the bias can overcome that so the actual output has a value.
In Machine Learning, what is Weight? The input value to a neuron is the sum of the outputs from the previous neurons, each with a weight value attached (multiplied). You can think of that as a value adding extra importance to the decisions from certain neurons. So the sum of Wi*Xi. Then you'd add the bias. So Sum Of (Wi*Xi) + bias
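The "Sum Of (Wi*Xi) + bias" calculation from the cards above, followed by an activation function, can be sketched like this. The sigmoid activation is just one common choice picked for illustration; `neuron_output` and its parameters are hypothetical names, not part of any GCP library.

```python
import math


def sigmoid(z: float) -> float:
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights, bias, activation=sigmoid):
    """Compute activation(sum(Wi * Xi) + bias) for a single neuron."""
    if len(inputs) != len(weights):
        raise ValueError("each input needs exactly one weight")
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(weighted_sum)


# All-zero inputs: the weighted sum is 0, so only the bias moves the output.
print(neuron_output([0.0, 0.0], [0.5, -0.5], bias=0.0))
print(neuron_output([0.0, 0.0], [0.5, -0.5], bias=2.0))
```

During training, the weights and the bias are the values that get adjusted; the inputs and the choice of activation function stay fixed.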
Machine Learning code in GCP can be created with which languages? Java and Python
Machine Learning code in GCP can be created with which libraries? TensorFlow for Java Scikit and XGBoost for Python
In a nutshell, Machine Learning breaks down into which major steps? Data Preparation Code the Model Train the Model Evaluate the accuracy of the Model Tune Hyperparameters Deploy Handle prediction requests Monitor/Evaluate
In GCP Machine Learning, what are some fundamental differences between web and batch prediction requests? Web requests need to be optimized for handling single requests in a reasonable amount of time (person's waiting for the response to load). Batch requests can handle larger sets of requests and predictions, that both originate and end up in a Google Cloud Storage bucket.
In Machine Learning, what are hyperparameters? Hyperparameters contain the data that controls the training process and include: input neuron count, network layers (how deep the network is), neurons in each layer, output neuron count.
What are the three layer types in a deep neural network Input layer, output layer, and hidden layers
How would you move a Machine Learning model? Package it, export the serialized model to a staging bucket, redeploy it to its new home
What gcloud command submits a training model? gcloud ml-engine jobs submit training specify job, package, details, region, machine type or scale
How do Machine Learning scale tiers work? There are several standard scales which specify the type for the master, and the number and type for the worker, and parameter servers. So for example STANDARD_1: One master: n1-highcpu-8, four workers: n1-highcpu-8, three parameter servers: n1-standard-4
In the Machine Learning custom tier, what can you specify? CUSTOM allows you to control the type of machine for your master node, the number and type of workers, and the number and type of parameter servers
In a Wide and Deep neural network, what does wide mean? What does deep mean? Wide refers to the number of neurons in the input layer. Deep refers to the number of hidden layers, the layers between the input and output tiers.
In a neural network, how does it being Wide help you? Wide refers to the number of neurons in the input tier and it tends to help with exact matching and memorization learning.
In a neural network, how does it being Deep help you? Deep refers to the number of hidden layers in a network and it tends to help generalize learning. "You liked X, so you might also like..."
K-Means clustering is what? K-Means clustering is an unsupervised machine learning algorithm that groups similar data points together and helps discover underlying patterns.
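For a feel of how K-Means finds groups without labels, here is a minimal one-dimensional sketch in plain Python. It assumes the caller supplies the starting centroids and uses absolute distance; `k_means_1d` is a made-up helper for illustration, not a library function, and real work would normally use a library implementation.

```python
def k_means_1d(points, centroids, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        new_centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # nothing moved, so the clustering has settled
        centroids = new_centroids
    return centroids, clusters


centroids, clusters = k_means_1d([1, 2, 3, 10, 11, 12], centroids=[1, 12])
print(centroids)
print(clusters)
```

No labels are passed in anywhere, which is what makes this unsupervised: the two groups emerge purely from how close the points are to each other.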
What gcloud switch is used to run a TensorFlow training job locally? local gcloud ml-engine local train
A sparse vector is what? A vector whose elements are mostly zero. A one-hot vector, with a single 1, is a common example: [0,1] [0,0,1,0,0]
In machine learning, a common technique to handle a feature that represents a category with a limited number of options is what? One-hot encoding
How is one-hot encoding used to convert categories into a machine learning friendly format? One-hot encoding converts each option in a category into a sparse vector. For example: Red [1,0,0] Blue [0,1,0] Green [0,0,1]
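The Red/Blue/Green example above can be reproduced with a tiny Python helper. This is a minimal sketch; `one_hot` and the `COLORS` list are names made up for illustration, and the category order is an assumption that simply matches the card.

```python
def one_hot(value, categories):
    """Return a list with a single 1 at the position of value in categories."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if category == value else 0 for category in categories]


# Category order taken from the card above.
COLORS = ["Red", "Blue", "Green"]

for color in COLORS:
    print(color, one_hot(color, COLORS))
```

The vector length always equals the number of categories, which is why this technique suits features with a limited number of options.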
In machine learning models, feature values tend to break down into what two major types? Continuous: Numbers in a range Categorical: A group of possible values
What is feature engineering? Feature engineering is the process of using domain knowledge to pick features that make machine learning algorithms work efficiently.
Name two feature engineering approaches Bucketization or binning: converting a feature from a continuous string of values into several bucketed values, usually tied to range. Not every temp, but a group of temp ranges.
Crossing or cross feature columns: combine a group of features into a new feature. So come up with a single value that crosses age and weight, or a single value that combines latitude and longitude.
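Both approaches can be sketched in plain Python. The temperature boundaries, the `bucketize` and `cross` helpers, and the `_x_` separator are all assumptions chosen for illustration; ML libraries provide their own versions of these operations.

```python
from bisect import bisect_right


def bucketize(value, boundaries):
    """Return the index of the bucket that value falls into.

    boundaries must be sorted; bucket 0 is everything below boundaries[0].
    """
    return bisect_right(boundaries, value)


def cross(*bucket_labels):
    """Combine several bucketed features into one categorical value."""
    return "_x_".join(str(label) for label in bucket_labels)


# Hypothetical temperature ranges: <32, 32-59, 60-79, 80+.
TEMP_BOUNDARIES = [32, 60, 80]

print(bucketize(10, TEMP_BOUNDARIES))
print(bucketize(72, TEMP_BOUNDARIES))
print(cross(bucketize(72, TEMP_BOUNDARIES), bucketize(95, TEMP_BOUNDARIES)))
```

A crossed value such as a latitude bucket combined with a longitude bucket lets a model learn about a specific grid cell rather than about each coordinate separately.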
It is possible to change a Google Cloud Storage bucket from Regional to Nearline to Coldline, and back. True/False True
Which trigger does Dataflow not support? Count, size, time, or combination? Size
It is possible to change a Google Cloud Storage bucket from Regional to Multi-Regional. True/False False
A Nearline or Coldline bucket can also be Regional or Multi-Regional. True/False True
It is possible to set unique Google Cloud Storage classes (Regional, Nearline, Coldline) at the file level. True/False True
The Dataflow sink to BigQuery only supports streaming. True/False False. It supports both batch and streaming loads from Dataflow into BigQuery
Dataflow connects to Bigtable using the ____. Cloud Dataflow Connector
To run Java Dataflow jobs locally for testing, use the ____. DirectRunner (called DirectPipelineRunner in the older Dataflow SDK 1.x)
What IAM role is required to run a Dataflow job? dataflow.worker
The workflow through a typical Apache Beam (Dataflow) app contains what major steps? Create the pipeline, Create or load the first PCollection of data, Apply PTransforms to each PCollection, Write the transformed PCollection to some sink.
In Cloud Dataflow, pipelines frequently share data. True/False False. Pipelines don't share data, not directly with each other. That would impact Dataflow's ability to process large amounts of data in parallel.
In Cloud Dataproc, what can't be stored on preemptible workers? Data
Does autoscaling have to be initially enabled for Cloud Dataflow, or is it enabled by default? It's enabled by default
How is Cloud Dataflow autoscaling disabled/enabled? By setting the autoscaling_algorithm option
Cloud Dataflow in autoscaling mode will allow a default maximum of how many Compute Engine Instances? 1000 per job (n1-standard-4, by default), or the max compute engine quota for the project, whichever is lowest.
What Cloud Dataflow option can be used to change the Compute Engine instance type? worker_machine_type
In Cloud Dataproc, what security role is needed to execute jobs? dataproc.worker
What terminal command is needed to create a Dataproc cluster? gcloud dataproc clusters create ....
Cloud Dataproc is essentially a GCP managed instance of what? Hadoop and Spark
In Dataproc, what role does YARN play? Yet Another Resource Negotiator (YARN) is the resource management and job scheduling technology at the heart of the Hadoop architecture.
How should a Dataproc's YARN site be accessed? Use SOCKS through an SSH tunnel. When you're in Cloud Shell, that's the little "Web Preview" button, though you'll have to manually set the port correctly.
What are the two configuration options in Cloud Dataproc for the Master server? 1 master (default) 3 masters (Hadoop HA)
In Dataproc High Availability mode, what are some of the cluster changes? 3 masters All masters participate in a ZooKeeper cluster YARN configured for HA HDFS configured for HA
What is Apache ZooKeeper? It's a Hadoop ecosystem service designed to share configuration, naming, and other group services across the Hadoop cluster.
How is Cloud Dataproc structured? Project Cluster Master/Worker Nodes Jobs
What can a dataproc.viewer see? Details about the jobs and cluster
Cloud Dataproc is billed per ___ Minute
Bigtable requests run through a ______ before they hit a BT node. Front End Server
When using gcloud to create a Dataproc cluster, how can property files be modified? --properties 'fileAlias:cool.key=value' --properties 'spark:spark.master=...'
To customize software in a Dataproc cluster: Set initialization actions Use --properties SSH in and manage
To enable Bigtable replication ____. Create multiple clusters in the same Bigtable instance. Replication starts automatically. Use a different zone for each cluster, all in the same region.
Data can easily be transferred into Dataproc via ____ SSH
A Bigtable cluster is a Multi-Regional, Regional, Zonal resource? Zonal
Bigtable is a Multiregional, Regional, Zonal resource? Regional
In Bigtable, what are some recommendations on choosing a key? Group keys containing like data together (data from sensor 1) Keys should distribute evenly across the tablespace Reasonably short Contain data fields Timestamps at end not beginning Reverse domain names
If a Bigtable node fails, what happens to its data? Nothing, the data is replicated and stored safe in tablets in Colossus
What is Borg? Google's internal container management system.
What is Colossus? Google's highly distributed, redundant, cluster level file system
In Bigtable, what is hotspotting? When a small group of keys (table section) are over utilized, causing Bigtable to overuse particular servers. For example, streaming data with keys that all start with the timestamp. All the writes will hit the same section of the cluster causing hotspotting.
What is typically, the single most effective way to avoid hotspotting? Field promotion
In Bigtable, what is Field Promotion? Adding one or more of the record's fields to the beginning of the key: sensorId#timestamp, region:center:timestamp, reverseUrl/timestamp
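Here is a small sketch of what field promotion looks like when building row keys in Python. The `#` separator follows the sensorId#timestamp example in the card; the helper names and sample values are hypothetical and chosen only for illustration.

```python
def promoted_row_key(sensor_id: str, timestamp: int) -> str:
    """Put the sensor ID in front of the timestamp so writes spread by sensor."""
    return f"{sensor_id}#{timestamp}"


def reverse_domain(domain: str) -> str:
    """Reverse a domain name so related hosts sort next to each other."""
    return ".".join(reversed(domain.split(".")))


# Hypothetical readings as (sensor_id, timestamp) pairs.
readings = [("sensor2", 100), ("sensor1", 200), ("sensor1", 100)]

# Bigtable stores rows sorted by key, so sorting shows how rows would group.
for key in sorted(promoted_row_key(s, t) for s, t in readings):
    print(key)

print(reverse_domain("maps.google.com"))
```

With the sensor ID promoted, each sensor's rows sit together and simultaneous writes from different sensors land in different parts of the key space instead of piling onto the newest timestamp.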
What are some Bigtable keys that should be avoided? Keys that start with: Sequential numbering, Timestamps, Non reversed domain names Keys that contain frequently updated fields Hashed values
When testing performance in Bigtable, what steps should be taken? Use a production (not dev) instance. Use at least 300 GB of data, 100 GB per node. Do a heavy pretest of 10 minutes or more before the real test.
In Bigtable, the HBase shell is ___? A command line tool that can be used to perform administrative and data access tasks.
If an application needs extra support for mobile SDKs, but has a workload appropriate for Cloud Storage, what might be a better option? Firebase
What is the Bigtable Key Visualizer A graphical tool that displays several usage metrics about Bigtable. It's a great way to spot key hotspotting for example
How can a Bigtable instance be switched from HDD to SSD drives? It can't. What you can do is export the data, spin up a new instance, and reload the data
Google Cloud Storage is appropriate for what kind of data storage? Binary/File
For Structured/Semi-structured data targeted at an analytics workload, what might be good storage options? Cloud Bigtable BigQuery
For semi-structured object/entity/JSON document types of workloads, what would be a possible data storage option? Datastore/Firestore
Describe persistent disk storage and what it's good for. Fully-managed block storage, used for Compute Engine VMs and Kubernetes Volumes.
What options should be considered for relational data storage? Cloud SQL Cloud Spanner
How is Cloud SQL scaled? Vertically. Bigger machines, more chips, larger drives, more memory.
How is Cloud Spanner scaled? Horizontally
Describe Google Cloud Storage and what it's good for. Scalable, fully-managed, blob store for images, files, objects, unstructured data, etc.
Describe Google Cloud Datastore and what it's good for. Scalable, fully-managed NoSQL document (think Entity/JSON/Object) database for semi-structured and hierarchical data
Describe Google Cloud Bigtable and what it's good for. Scalable, fully-managed NoSQL wide-column database for low-latency read/write access, high-throughput analytics, and native time series support
Describe Google Cloud SQL and what it's good for. Fully-managed MySQL or PostgreSQL for web frameworks, structured/relational data, and OLTP workloads
Describe Google BigQuery and what it's good for. Scalable, fully-managed, Enterprise Data Warehouse (EDW) with SQL support and fast response times over massive data for OLAP workloads up to petabyte-scale, Big Data exploration and processing, and reporting via Business Intelligence (BI) tools.
Describe Google Cloud Spanner and what it's good for. Scalable, Fully-managed, global scale relational database for Mission-critical applications, high transactions, scale and Consistency requirements
Name some common workloads for Persistent Disks. Virtual machines drives Read-only data across multiple virtual machines Durable backups of running virtual machines
Name some common workloads for Google BigQuery. Analytical reporting on large data Data Science and advanced analyses Big Data processing using SQL
Name some common workloads for Google Cloud Storage. Storing and streaming multimedia Storage for static web application files Storage for custom data analytics pipelines Archive, backup, and disaster recovery
Name some common workloads for Google Cloud Bigtable IoT, finance, adtech Personalization, recommendations Monitoring Geospatial datasets Graphs
Name some common workloads for Google Cloud Datastore. User profiles Product catalogs Game state
Name some common workloads for Google Cloud SQL. Websites, blogs, and (CMS) BI applications ERP, CRM, and eCommerce applications Geospatial applications
Name some common workloads for Google Cloud Spanner. Adtech Financial services Global supply chain Retail
Talk about data consistency in Cloud Storage. Strongly consistent: Read-after-write Read-after-metadata-update Read-after-delete Bucket listing Object listing Granting access to resources
Eventually consistent: Revoking access from resources
Compare and contrast Eventually Consistent and Strongly Consistent. Eventual consistency means that an updated piece of data will eventually yield reads that return the new updated value, and conversely, that for an unspecified but hopefully short amount of time, different reads might result in a mix of the old and new value. Think DNS servers. You update your DNS, there might be a lag before all browsers point at the same location.
Strong or immediate consistency, on the other hand, tends to link back to the more traditional ACID concept in relational databases. Data read after an update will always return the same answer. Think your bank balance when you check it after a deposit.
Talk about transactions and Cloud Storage. Writes and updates are transactional, but there's no concept of a multi step, "update these three files" transaction.
Talk about transactions and Cloud Bigtable Reads and writes are atomic at the row level. Multiple row transactions are not supported. If there's a single cluster, or if a replicated cluster's application profile is in single-cluster routing mode, then single record Read-modify-write and Check-and-mutate operations are transactional.
Talk about data consistency in Google Cloud Bigtable. By default Bigtable is eventually consistent with change replication taking seconds and occasionally minutes.
In a Bigtable instance with two clusters, replication happens automatically. True/False True
In a Bigtable instance with two replicated clusters, if one cluster goes down will the other automatically pick up all the new queries? Only if the application profile routing policy is set to Multi-cluster routing. If it's set to Single-cluster routing then the switch will have to be made manually.
Talk about data consistency in Cloud Datastore. Ancestor queries are strongly consistent by default. To improve performance, you can set a query's read policy so that the results are eventually consistent instead.
Global queries (those that do not execute against an entity group) are always eventually consistent.
Talk about transactions in Cloud Datastore. Transactions are optional and depend on how the statement or group of statements, are executed. If a second transaction attempts to modify records which are already part of a transaction, the changes will fail for the second transaction.
What is the structure of data objects stored in Datastore? Data objects stored in Datastore are entities. Entities contain properties. An entity group consists of a root entity and all of its descendants.
How do ancestors and descendants work in Datastore. When an entity is created, another entity can optionally be assigned as its parent. An entity with no parent is a root entity. The path from an entity, through its ancestors to the root is called the ancestor path. The path from a parent down through children is descendant.
Can a Datastore entity be moved from one parent to another? No
In Cloud Datastore, what is a Kind? An entity's kind (type? class? schema?) is used to categorize the entity for the purpose of queries. Person, Task, Product might be examples of kinds.
How do Datastore keys work? A Datastore key consists of:
The entity namespace The entity's kind A key-name string or numeric ID
In Cloud Datastore what is the function of namespaces? Datastore namespaces are used in multitenancy configurations. Multitenancy allows a single project's Datastore data to be segmented into partitions. The kinds and kind logical structures can be the same for each tenant, but the data is split into unique partitions. Think a set of Tasks split by operating unit.
Can multitenancy partitions be used for security in Datastore? No, multitenancy partitions split data by tenant, but they offer no kind of security for the splits. Nothing's to stop tenant1 from accessing data from tenant2. You'd have to do that with your application.
Talk about transactions in Google Cloud SQL and Google Cloud Spanner. Both Spanner and MySQL support standard ACID transactions.
Talk about data consistency in Google Cloud SQL and Google Cloud Spanner. Both Google Cloud Spanner and MySQL are strongly consistent.
Talk about data consistency in Google BigQuery. BigQuery is immediately consistent for most operations. insertIds should be used on streaming inserts if there's a chance of message duplication. If provided, BigQuery will automatically deduplicate any messages with the same insertId, provided they arrive within a minute of each other.
Talk about transactions in Google BigQuery. BigQuery doesn't support transactions and should not be used for OLTP applications.
What is Dremel? The internal Google system behind BigQuery
Which of the following Google services are Multi-Regional, Regional, Zonal, or some mix: Persistent disks BigQuery Datastore Bigtable Dataproc Machine Learning Dataflow Cloud Storage Cloud SQL Cloud Spanner Persistent disks zonal standard, regional replicated across 2 zones
BigQuery dataset storage regional or multi regional. Query, load, and export jobs run in region with the dataset.
Datastore regional or multi regional
Bigtable zonal. Replication can spread over two zones.
Dataproc's Compute Engine instances are zonal. If you pick a region you can pick a zone or let GCP auto zone it for you. If you choose global then you must pick the zone.
Machine Learning regional
Dataflow is zonal. You can specify the region and zone, or just the region and it will autozone.
Cloud Storage Regional, multiregional, or dual regional.
Cloud SQL second gen MySQL is zonal by default but with the HA option it can replicate across two zones.
Cloud Spanner the instance is regional or multiregional.
The Cloud Spanner hierarchy is ____? Project, instance, node.
Each Cloud Spanner node can store how much data? 2TiB
Do Cloud Spanner nodes help with replication? No, they help with data load and processing power but not replication.
What replication options does Cloud Spanner offer? For regional there will be three read-write replicas spread over multiple zones. For multiregional there are multiple replicas in multiple zones in multiple regions, based on configuration. This provides faster reads but slows down the writes. There's a record update voting algorithm that requires a quorum between the replicas, and the added network latency slows that down.
One machine learning method that helps when a wide and deep neural network is overfitting training data is? Dropout method, that is, ignoring neurons. It helps remove some of the mutual dependencies that neurons develop. It helps with overfitting because it forces the neurons to work in different ways.
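A minimal sketch of the dropout idea from the card above, in plain Python rather than any ML framework. It assumes the common "inverted dropout" variant (survivors are scaled up so the expected activation is unchanged); the function name and the sample values are hypothetical.

```python
import random


def dropout(activations, rate, rng=None):
    """Zero each activation with probability `rate` (training time only).

    Survivors are divided by (1 - rate) so the expected value of each
    activation stays the same as without dropout.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if rng is None:
        rng = random.Random()
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


# Hypothetical layer output; a seeded generator keeps the demo repeatable.
layer_output = [0.5, 1.2, -0.3, 0.8]
dropped = dropout(layer_output, rate=0.5, rng=random.Random(0))
```

Because a different random subset of neurons is silenced on each training step, no neuron can rely on a specific partner always being present.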
The recommended minimum number of Cloud Spanner nodes is? 3
A Cloud Spanner node can perform queries at about what rate? For 1 KB of data, about 10,000 QPS of reads or 2,000 QPS of writes.
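The Spanner cards above (2 TiB per node, 3 node minimum, roughly 10,000 read or 2,000 write QPS per node) can be combined into a quick sizing calculation. This is only arithmetic on the figures quoted in these cards, which may not match current product limits; the function name is hypothetical.

```python
# Per-node figures as quoted in the flashcards (assumed, not verified here).
NODE_STORAGE_TIB = 2
NODE_READ_QPS = 10_000
NODE_WRITE_QPS = 2_000


def spanner_capacity(nodes):
    """Scale the per-node figures linearly by the node count."""
    return {
        "storage_tib": nodes * NODE_STORAGE_TIB,
        "read_qps": nodes * NODE_READ_QPS,
        "write_qps": nodes * NODE_WRITE_QPS,
    }


# The recommended 3 node minimum.
baseline = spanner_capacity(3)
```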
In Machine Learning prediction jobs that need to deal with slowly changing labels, like a user's changing movie preferences, how best can we handle model retraining? By continually retraining on a mix of new and historical data.
In BigQuery, what's the difference between Sharding and Partitioning? Partitioning is done by date, either daily or at some interval configured by the user. The data is in a single logical table, but it is stored in "partitions" and has pseudo keys that allow querying by timespan. Sharding is a manual splitting of data into multiple tables based on some criteria: stores, regions, date ranges, etc. Queries over sharded tables require UNIONs or table wildcards. Given enough tables, shard queries can hit the 1000 table limit and fail. But they can also be very fast if only a handful of small tables are queried.
Does Google Data Studio cache data? It does and it typically refreshes every 12 hours. There's a lightning bolt in the UI that lets you know the data is cached. The cache can be disabled if report data needs to be refreshed more frequently.
When pulling CSV files from Google Cloud Storage into BigQuery, if the file might contain bad rows, how might you automate the preprocessing? Pull the data into a Dataflow processing pipe, filter out the bad data to a secondary storage location, then load the scrubbed data into BigQuery
A GCE application runs a regular query against the database. The database quits answering. What's a common approach to a connection failure of this sort? Requery with an exponential backoff.
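One way the exponential backoff answer above might look in code. This is a sketch under assumptions: the operation signals failure by raising `ConnectionError`, and the names `retry_with_backoff` and `flaky_query` are hypothetical. The `sleep` parameter exists only so the demo can record waits instead of actually pausing.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.5,
                       max_delay=30.0, sleep=time.sleep):
    """Call `operation`, doubling the wait after each connection failure."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Add jitter so many clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 2))


# Demo: a stand-in query that fails twice, then succeeds.
attempts = {"count": 0}
waits = []


def flaky_query():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("database not answering")
    return "rows"


result = retry_with_backoff(flaky_query, sleep=waits.append)
```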
What common machine learning algorithm might be a good fit to help predict movie prices? Linear regression
What is a classic machine learning algorithm for classification? Logistic regression, based on the logistic or sigmoid function.
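For reference, the sigmoid function mentioned above squashes any real number into the range (0, 1), which logistic regression then reads as a class probability. A small sketch follows; the `predict` helper, its weights, and the 0.5 threshold are illustrative assumptions.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + e^-z), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(features, weights, bias, threshold=0.5):
    """Return (probability, label) for one example."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    p = sigmoid(z)
    return p, p >= threshold
```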
Contrast a Recurrent vs a feedforward neural network In a feedforward network data flows only one way from input, through hidden, to output neurons. In a recurrent or feedback network data can loop and flow both ways through some neurons.
What does the following BigQuery query do? SELECT * EXCEPT(row_number) FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY ID_COLUMN) row_number FROM `TABLE_NAME`) WHERE row_number = 1 Ensure that there are no duplicates in the returned records. Might be useful deduplicating streamed inserts. Though insertId would be better
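The same keep-one-row-per-ID logic, sketched in plain Python to show what the query is doing. Note one difference: the query has no ORDER BY inside the window, so which duplicate it keeps is arbitrary, while this sketch always keeps the first row seen. The function name and row shape are hypothetical.

```python
def dedupe_by_key(rows, key):
    """Keep only the first row seen for each value of `key`."""
    seen = set()
    unique = []
    for row in rows:
        value = row[key]
        if value not in seen:
            seen.add(value)
            unique.append(row)
    return unique


rows = [
    {"id": 1, "reading": 20.5},
    {"id": 2, "reading": 21.0},
    {"id": 1, "reading": 20.5},  # duplicate streamed insert
]
deduped = dedupe_by_key(rows, "id")
```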
The proper syntax for a BigQuery wildcard table name is? FROM `bigquery-public-data.noaa_gsod.gsod*` (Backticks! Don't fall for single quotes. Also, in legacy SQL it was square brackets, so a question containing an error related to [ in the table name is a legacy vs standard thing)
In a sentence, what is Machine Learning? Tom M. Mitchell: "Machine learning is the study of computer algorithms that improve automatically through experience." I might add that it's a subset of the much larger topic of AI
You want to set up a Dataflow (Beam) app to process real-time sensor data. You need to track activity and react to a sensor that has had no activity in the last 30 min. How might Beam handle this problem? Set up a session window with a gap time of 30 min.
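To see what a 30 minute gap means, here is a plain-Python sketch of session grouping for a single sensor. It is an analogy for how session windows behave, not the Beam API; the function name and sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def session_windows(timestamps, gap=timedelta(minutes=30)):
    """Group event times into sessions separated by at least `gap` of silence."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] < gap:
            sessions[-1].append(ts)  # still inside the current session
        else:
            sessions.append([ts])  # gap exceeded, start a new session
    return sessions


start = datetime(2024, 1, 1, 12, 0)
events = [start + timedelta(minutes=m) for m in (0, 10, 45, 50)]
sessions = session_windows(events)
```

The 35 minutes of silence between the second and third events closes the first session, which is the point at which the pipeline could react to the inactive sensor.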
Dataflow pipes are written using which languages and use which api? Written with Java, Python, or Go using the Apache Beam api
Which GCP data storage options would work best with OLTP e-commerce type data? Cloud SQL and Spanner
When using machine learning for unsupervised anomaly detection, what should be true about the data being tested? The rate of anomalies to normal data should be low.
You're dealing with streaming sensor data with intake that peaks north of 250,000 messages per second. What might be a typical GCP intake flow, and what storage option might work well for the data? Pub/Sub to Dataflow to Bigtable
You need to discover everyone using BigQuery and what they are doing with it. How might you accomplish this? Export or Stream the Data access Stackdriver Cloud Audit log into BigQuery and analyze it there.
What are the three Stackdriver Cloud Audit logs generated for each project, folder, and organization? Admin Activity, for configuration changes; System Events, for Compute Engine system events; Data Access, for read, write, and modify events.
Which Stackdriver Cloud Audit logs are free? Pay? Enabled by default? Admin and System event logs are enabled by default and always free. BigQuery Data Access is also enabled by default (can't be disabled) and free. The rest of the Data access logs need to be enabled and may incur spend.
If Stackdriver Data access logs are enabled, what type of data access is still not logged? Anonymous access data where the user doesn't have an account.
You are helping with a lift and shift job for a company's Hadoop's cluster. What will you do with the Hadoop related data? Move it into GCS so the Dataproc cluster can be disposable.
Why can Hadoop in Dataproc work so well with data stored in Google Cloud Storage? Because GCS is a Hadoop Compliant Filesystem. HCFS
What is a key mindset change related to the way Hadoop works and how Hadoop data is stored, for people moving from onsite Hadoop to Hadoop in Dataproc? They need to think of the data as a more general Google Cloud Storage thing, and not just a Hadoop HDFS thing. Also, they need to think of Hadoop as an ephemeral data processing tool they can spin up when needed, and shut down when not actively in use.
Would Datastore work well in an online order processing sort of application? Not especially, it really isn't designed for that kind of data processing and OLTP. Cloud SQL or Spanner would be much better choices.
In Datastore, what is an entity group? An entity group is a root entity and all its descendants.
In Datastore, if you had an entity group of kind Orders, what would be the performance of single write? Batch write? About one per second, doesn't matter if it's single or batch.
What's the difference between Cloud Audit Logging and Stackdriver Logging? Cloud Audit Logs are the Admin, System Event, and Data Access logs captured by Google for each Org, Folder, and Project. Stackdriver Logging is the part of the Stackdriver which allows you to store, search, analyze, monitor, and alert on log data from GCP or AWS, including the Cloud Audit Logs.
In most situations, should Dataproc be set up to store its data on Persistent Disks or in Cloud Storage? In Cloud Storage. The Dataproc Hadoop cluster should be up and running only while a job needs to run. Afterwards, kill the cluster and keep only the data.
In machine learning, would fraud detection most likely be a regression or classification problem. Classification: fraudulent, legit
Which of the following might be used when working on machine learning fraud detection? Unsupervised learning? K-Means Clustering? Linear Regression? Supervised learning?
Unsupervised learning? Yes, classic anomaly detection problem
K-Means Clustering? Yes, could be used with unsupervised categorization
Linear Regression? Probably not
Supervised learning? Possibly, starting with a supervised model for initial training and then switching to unsupervised for refinement has been done before
What application protocol does Pub/Sub operate over? HTTP
What is Cassandra? An Apache NoSQL database. Bigtable competitor
You have a large amount of data loaded into BigQuery and you need to manually update the data type of a column. How should you proceed? For ease of use and simplicity, use a query to overwrite the existing table or create a new table if you need to preserve existing data. If you are more concerned with cost of the move, export all the data to GCS and load it into a new table.
What's the cost of exporting BigQuery data to a Google Cloud Storage bucket? Importing? No charge for the import or export, but you will pay GCS storage.
What would I need to create and use in Stackdriver to export a specific type of event to Pub/Sub Using the GCP console or the Stackdriver API, create an advanced filter to find what you want, and then create and use a sink to Pub/Sub.
Stackdriver can create a log export sink to which other GCP products? Pub/Sub, BigQuery, or Google Cloud Storage
When training a machine learning model, which would be preferable: Features with a high or a low correlation to the output labels? High, generally speaking.
What are some common ways of dealing with training datasets with missing/null fields? Use a form of estimation. Replace with a constant value, which is not great. You could drop the records, but that's often the worst choice: you lose lots of data and could skew the overall result. Dropping the feature might be just as bad.
When doing a BigQueryIO.Read what's the difference between from(...) and fromQuery(...)? from() reads the specified table. fromQuery() executes the specified query first, then reads from the results.
Can also use BigQueryIO.read() more directly to generate TableRows which are easier to use but slower.
What's typically wrong with Bigtable keys like the following? datetime, eventId (incrementing number) They tend to hotspot one key range at a time, with all the updates coming into a single Bigtable node. Also, they might not be the easiest to access, assuming that you're not only accessing by date or eventId. It might be better to add some more meaning to the keys: sensorId/datetime, that sort of thing.
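A small sketch of the sensorId/datetime suggestion above. The `#` separator, the zero-padded millisecond timestamp, and the function name are all assumptions chosen for illustration, not a Bigtable requirement.

```python
def sensor_row_key(sensor_id, event_time_ms):
    """Build a row key that leads with the sensor, then the event time.

    Leading with sensor_id spreads concurrent writes across key ranges;
    zero-padding keeps one sensor's rows sorted by time.
    """
    return f"{sensor_id}#{event_time_ms:013d}"


# Three writes arriving at nearly the same moment from different sensors.
keys = sorted(
    sensor_row_key(sensor, ts)
    for sensor, ts in [("sensor-b", 1700000000001),
                       ("sensor-a", 1700000000000),
                       ("sensor-a", 1700000000002)]
)
```

With timestamp-first keys those three writes would land next to each other at the end of the table; here each sensor's rows sit in its own contiguous range.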
In Cloud SQL, MySQL Gen 2, what's High Availability mode? A second copy of the MySQL primary server is created in another zone of the same region (must be same Region). The failover server is then replicated from the primary using MySQL semisynchronous replication. Users, data, settings, etc. will all be replicated.
In Cloud SQL, MySQL Gen 2, when would the failover server go live? Automatic switch or manual? Then what would happen? The primary server writes a heartbeat to the failover server every second. If the heartbeats fail for approximately 60 seconds, then Cloud SQL will automatically switch to the failover replica, write a message to that effect in the operations log, wait to see if any updates arrive from the primary, make the failover the primary, create a new failover instance, and start to replicate there. All IP addresses will be updated automatically.
In Cloud Dataflow can you update a running pipe? Python doesn't yet support updating streaming jobs. Java does. When you deploy an update in Java, Dataflow runs a compatibility check of the new code against the existing pipe. You might have to provide transformation mappings for graph changes, and some changes are not supported, but bug fixes and the like should be easy. Once the new code passes the check, the old job is stopped, the new is started under the same job name, new transformations in the new code might be missed for inflight data, but all in all it picks right up where the old code left off.
The most common performance issue in Cloud Bigtable is? Key hotspotting
What is Avro? A compressed file format and serialization system which BigQuery can load and use very efficiently.
The default file format for loads into BigQuery is what? CSV
The preferred compressed format to use for data loaded into BigQuery is? Avro
What's the default file encoding that BigQuery expects for CSV files? UTF-8
What will happen if you load a CSV file into BigQuery from Cloud Storage that isn't encoded with UTF-8, and you don't warn BigQuery about the encoding change? BigQuery will attempt to dynamically detect the file encoding and load it anyway, but the load might not be byte for byte the same as the original.
What's the max amount of data that can be loaded into Google Sheets 2mil cells
What's the only allowable encoding for JSON files imported into BigQuery? UTF-8
In BigQuery, what's a good way to limit table data to exactly what a group of users needs, without copying the data into different tables or external systems. Set up a view that shows the users exactly the set of data they need. They can even run queries over the view data.
You want Dataflow to scale as needed. What setting do you need to change to allow for scaling? Autoscaling is enabled by default. Specify --maxNumWorkers to change the max scale vm count. Note, you can't change the max for a running system. Shut it down and redeploy with the new count.
In Datastore, what effect does excluding a field from the index have? It decreases the index storage size because the index no longer needs entries involving that field. It also means that no filter operations involving said field will work.
What's an easy way to automate a daily Dataflow job? Create a cron job in Google App Engine Cron Service and have it run the Dataflow job
Uploading lots of small files with gsutil will work fastest with what option enabled? Use the -m switch to enable multithreaded, parallel uploads. gsutil -m cp -r place gs://...
What is S3 storage? It's the AWS equivalent to Cloud Storage
When should you move data with the Google Storage Transfer Service? When the destination is Google Cloud Storage and the source is AWS S3, an HTTP(S) location, or another Cloud Storage bucket. It can be a one off transfer or you can schedule it. It's also smart enough to spot just the changed files if the transfer is set up on a periodic timer.
What is Redis? It's an in memory, data store/Database.
What is the Google Managed version of Redis? Memorystore
What is Memorystore/Redis good for? Highly available, in memory caches with sub ms access speeds.
How much data can be stored in Memorystore? 300GB with up to 12 Gbps network throughput. You pay for the memory you use, by the hour. Also, the throughput is directly related to the amount of data you are storing, with 12 Gbps being the max.
What is HBase? It is the Hadoop database that sits on top of the Hadoop File System (HDFS)
Does denormalizing data decrease or increase total DB storage size? You frequently have repeated data so, increases. Normalized data tends to be more concise and smaller.
What file formats can BigQuery accept and using what compression formats. CSV and JSON with GZIP, Avro with DEFLATE or SNAPPY. CSV won't work for repeated or nested data.
What are GCP primitive roles? What's the problem with them? The original security roles created by Google in the early days of GCP: Owner, Editor, and Viewer. The problem with them is their lack of granularity.
BigQuery caches results by default for about ___ hours. 24
BigQuery caching is enabled / is not enabled by default. is enabled
Does BigQuery charge for a query that returns its results from cache? No, rerun queries that load from cache are free, but if the underlying data has been modified or new data has been streamed in, then the cache expires and the query is re-executed at normal fees.
When would a BigQuery not cache a query's results? If the results are sent into a new table.
In a basic query what do the following statement elements do, briefly? SELECT ____, WHERE ____, FROM ____, and LIMIT ____. Select chooses the particular columns being returned (Projection), WHERE limits the rows returned based on a condition, FROM specifies the data source, and LIMIT will limit the number of rows returned, though not the number of rows processed by a given query.
In BigQuery, when storing data denormalized, how is nested data setup? You set the datatype of the field to RECORD, set the mode to REPEATED, and then add the nested fields and their types.
Another name for a machine learning regression based estimator is? Regressor
Provide some examples for the type of prediction a wide and deep neural network would handle well. Language translation, self driving cars, image recognition, colorizing black and white photos, hand writing analysis, etc.
In machine learning, what's a categorical feature? A feature that represents category data, as opposed to continuous numerical data. Examples: shirt size, movie rating, zip code
To help a neural network learn about the relationships between categories in a categorical feature, you might consider adding a what? For small sets of categories, like shirt size, one hot encoding the feature works well. If there are a lot of category choices, like all the words in English, you might use an embedding layer/column. Embedding creates an index for each choice, and then links the index to a fixed length vector. So "dog" might have an index 107 and might link to a vector of size 32 where each value is a weighted number. The training model will update these values as it learns. So over time, the vector for "dog" might grow very close in values to the vector for "dogs" or "canine"
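The two encodings described above can be sketched in plain Python. This is illustrative only: the class name, vocabulary, and random initial values are hypothetical, and in a real model a training loop (not shown) would update the embedding vectors.

```python
import random


def one_hot(value, vocabulary):
    """Encode a category as a vector with a single 1 at its position."""
    return [1 if value == item else 0 for item in vocabulary]


class EmbeddingTable:
    """Maps each category to an index, and each index to a dense vector."""

    def __init__(self, vocabulary, dim, seed=0):
        rng = random.Random(seed)
        self.index = {word: i for i, word in enumerate(vocabulary)}
        # Small random starting values; training would adjust these.
        self.vectors = [[rng.uniform(-0.05, 0.05) for _ in range(dim)]
                        for _ in vocabulary]

    def lookup(self, word):
        return self.vectors[self.index[word]]


shirt = one_hot("M", ["S", "M", "L"])
table = EmbeddingTable(["dog", "dogs", "canine", "hello"], dim=32)
dog_vector = table.lookup("dog")
```

The one hot vector grows with the number of categories, while the embedding stays at a fixed length (32 here) no matter how large the vocabulary gets.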
In a neural network, what are hidden layers? The neurons between the input layer and the output layer. They aren't really hidden as much as in between.
In Dataflow (Beam), what's a sink? A sink represents an output location which Beam can write to
If you want to stop a Dataflow pipeline that's currently handling data, what are your two options? Cancel, which kills the process and in flight data is not processed, and Drain, which stops data intake but processes data in flight.
Can a Dataflow pipeline be tested outside of Dataflow? Yes, Dataflow uses the open source Beam framework. Use the DirectRunner (called DirectPipelineRunner in the older Dataflow SDK) if executing on a local machine.
In Dataflow, the Sink and Source APIs are for what? Source is for creating custom data loader "read()" code. Sink is for writing custom output "write()" code.
In Dataflow, what's a ParDo used for? The ParDo or Parallel Do looks at every element of the incoming PCollection, does something to it, and generates an output PCollection for the next step in the pipeline. This is a key part of most Beam transformations.
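A plain-Python analogy for the ParDo description above. This is not the Beam API: `par_do` and `split_words` are hypothetical names, and a list stands in for a PCollection. The point is only the shape: a function is applied to every element and may emit zero, one, or many outputs.

```python
def par_do(elements, do_fn):
    """Apply do_fn to each element and flatten whatever it emits."""
    for element in elements:
        yield from do_fn(element)


def split_words(line):
    """Emit one output per word; empty lines emit nothing."""
    for word in line.split():
        yield word


lines = ["pub sub", "", "dataflow"]
words = list(par_do(lines, split_words))
```

In a real pipeline the runner distributes the elements across workers, which is why the per-element function should not depend on shared mutable state.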
The machines in a Dataproc cluster are actually created where? A regional Compute Engine Instance Group (don't mess with it!)
When loading data through the BigQuery web UI what are the limits on the uploaded file size? Less than 10mb and 16,000 rows. They also have to be loaded one at a time.
What is one key way to limit the number of rows processed by a BigQuery query? Using sharded and/or partitioned tables
In machine learning why is it important that you keep some of your labeled data back for testing? Once you have your model trained, you need to run some data through it, see its predictions, then test the prediction accuracy against your acceptable range limits.
When configuring the GCP Machine Learning Engine, what are the three machine types you are altering? And what's the purpose of each type? The Master Node, there can be only one, and it's responsible for controlling the cluster and coordinating all the parts of the job graph. The Worker Nodes, responsible for handling the various tasks in the job. The Parameter Server Nodes, responsible for storing
In a Dataproc cluster, where is YARN running? On the master node.
What are some of the ways you can customize the software in a Dataproc cluster? SSH into the master and make changes. Use --properties to modify config files. Set up initialization actions.
An SSH connection automatically sends data encrypted. True/False True, Secure SHell passes data through an encrypted channel.
To switch Bigtable from HDD to SSD drives, what steps need to be taken? You can't switch a Bigtable instance drive type. You'd have to export out all the data and reload it into a new instance.
GCP Data Engineer Exam
INGEST Services App Engine - Compute Engine - Kubernetes Engine - Cloud Pub/Sub - Stackdriver Logging - Cloud Transfer Service - Transfer Appliance
STORE Services Cloud Storage - Cloud SQL - Cloud Datastore - Cloud Bigtable - BigQuery - Cloud Store for Firebase - Cloud FireStore - Cloud Spanner
PROCESS/ANALYTICS Services Dataflow - Dataproc - BigQuery - Cloud ML - Cloud APIs - Dataprep
EXPLORE/VISUALIZE Services Datalab - Data Studio - Google Sheets
What is Cloud Datastore? No ops, highly scalable, TRANSACTIONAL, NoSQL document database
What should you use Cloud Datastore for? Highly available, structured data, < 1 TB | E.g. Product Catalogs, Game Save States, User profiles
What should you NOT use Cloud Datastore for? Analytics (Use BigQuery/Spanner) - Extreme scale (Use Bigtable) - Existing MySQL (Use Cloud SQL)
What's the difference between analytical and transactional databases? Analytical databases are designed for higher scale with aggregating calculations. Transactional databases are optimized for finding individual rows in tables (e.g. based on ids).
Relational Database --> Datastore: Table --> ? | Row --> ? | Field --> ? | Primary Key --> ? Table --> Kind | Row --> Entity | Field --> Property | Primary Key --> Key
What do you query in DataStore? Entities
How can you query Datastore? Programmatic - Web Console - Google Query Language
How do you avoid bad indexes in Datastore? Create your own custom indexes. Don't index properties that don't need to be indexed
What is data consistency in queries? How up to date are these results?
What is Strong Consistency? Changes happen in order --> query results are guaranteed to be up to date, but the query will take longer. E.g. Financial transactions
What is Eventual Consistency? Changes happen out of order --> faster query but can have "stale" results - E.g. Census population
What is Cloud BigTable? Highly scalable, ANALYTICAL, NoSQL database - Ideal for large analytics workloads
What are some use-cases for BigTable? Financial Data, IoT, Marketing data
What is a BigTable "instance"? The container, within a project, that holds your BigTable clusters and tables
How is a BigTable instance structured? Nodes are grouped into clusters. 1 or more clusters in an instance
What are the instance types in BigTable? Development - low cost, single node | Production - 3+ nodes per cluster
Can you change disk type (HDD->SSD) within an instance? No, you need a new instance
How do you interact with BigTable? Command line tool (cbt - preferred) or Hbase shell - You can also use BigQuery!
How is a table stored in BigTable? It is sharded across tablets
How is a BigTable table organized? Each row is indexed by a single row key | Columns are grouped into families
How do you query BigTable tables? Index on the row key --> requires good schema design!
Where should related entities be in BigTable? They should be in adjacent rows
What are the three challenges of data streaming? Volume (amount of data) - Velocity (speed of data transfer and analysis) - Variety (Types of data to process)
How should data be stored on BigTable nodes? It should be spread over many nodes to prevent "hotspotting"
What are good row key practices? 1) Reverse domain names (com.website...) 2) String identifiers 3) Timestamps in REVERSE
Why is it beneficial to separate compute and storage? It enables autoscaling
How do you create a BigTable Cluster? Use gcloud
What types of BigTable row keys should you avoid? Domain names (some may be more active than others) - Sequential IDs (newer users could be more active) - Static, repeatedly updated identifiers
Would it be better to have one 10 node cluster or two 5 node clusters? One Cluster - Multiple clusters introduces latency b/c the cluster getting written to also has to process read functions
Should you make changes to BigTable immediately? No, BigTable can learn how to best optimize your data structures
What is Cloud Spanner? Highly scalable, RELATIONAL, database. - Similar in structure to BigTable
When should you use Cloud Spanner? When you have bigger workloads than Cloud SQL can handle (>8000 queries/sec). ACID Compliance
What does ACID compliance stand for? A - Atomicity C - Consistency I - Isolation D - Durability
How is data sharded in Cloud Spanner? Within a zone
When would you use Cloud SQL instead of Cloud Spanner? To replicate an existing on-premise relational database. Spanner - Designed for use in the cloud
Is Cloud Spanner an easy replacement for MySQL? No, work is required for migration, but it will enable higher scalability
What's the difference between horizontal and vertical scalability? Horizontal - More nodes sharing the load (more consistent) | Vertical - more compute on a single instance
What are Cloud Spanner tables called? RDBMS
How are tables handled in Cloud Spanner? They use table interleaving - combines what would be multiple tables into one table using "Parent/Child" tables
How do you ingest data to Cloud SQL? Batch data imports
What is a tightly coupled system and what are the issues with them? (Pub/Sub) Senders and receivers talk directly to each other. If one side goes down, data is lost
What are the benefits of a Loosely Coupled System? Fault Tolerant - Scalable - Message Queuing
What is Cloud Pub/Sub? Asynchronous messaging bus - Decouples senders and receivers
Does Pub/Sub guarantee message delivery? Yes, it guarantees delivery at least once
How does messaging flow in Pub/Sub? Topics --> Messages --> Subscription <-- Subscribers
What's the difference between PUSH and PULL in Pub/Sub? Push - lower latency | Pull - better for larger volumes; batch delivery
Does Pub/Sub manage the order messages are sent? No, it doesn't manage message delivery order, so messages can arrive out of order
How can you deal with messages being out of order? Have Dataflow handle it - Have Pub/Sub include metadata that helps with ordering the messages
What happens when the subscriber receives a message? It sends a receipt acknowledging delivery, but this doesn't always get sent before a duplicate message is sent --> Use Dataflow
What are the three steps in data processing? 1) Read Data (Ingest) 2) Process (ETL) 3) Output
What has historically been the problem with having streaming and batch data? (3) They had to come in through different pipelines. - Streaming was faster, but batch was more accurate - It was hard to compare recent and historical data
What is Cloud Dataflow? Built on Apache Beam - No ops, scalable, stream and batch data processing
How are Dataflow pipelines organized? They are region based
Why use Dataflow over Dataproc? (4) Less overhead (No-ops) - Unified Batch and Streaming - Pre-processing for ML - Serverless solution
Why use Dataproc over Dataflow? (3) Familiar tools - Better for existing pipelines - Has iterative processing and SparkML
What are the differences in base packages between Dataflow and Dataproc? Dataflow --> Apache Beam | Dataproc --> Hadoop/Spark
What is a Dataflow "Element"? Single entry of data - row
What is a "PCollection"? Distributed Dataset, data input, and output
What is a Dataflow "Transform"? Data processing operation in a pipeline - Uses programming conditionals (for/while loops)
What is a ParDo? Transform applied to individual elements - Filter out/extract data elements
What three things does Dataflow use to deal with late data? Windows - Watermarks - Triggers
What is a Dataflow "window"? Dataflow logically divides data into event-time-based groups
What are the three types of Dataflow windows? Fixed - fixed period of time (8-9) | Sliding - windows overlap with each other; better for smoother changes in avgs | Sessions - users logging in for a session; good for "bursty" data, e.g. rain
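To make the fixed and sliding window types concrete, here is a plain-Python sketch of which windows a single event timestamp falls into. It is an analogy, not the Beam API; timestamps are plain integer seconds and the function names are hypothetical. Windows are half-open, written as (start, end).

```python
def fixed_window(ts, size):
    """Return the single fixed window containing timestamp `ts`."""
    start = ts - ts % size
    return (start, start + size)


def sliding_windows(ts, size, period):
    """Return every sliding window of length `size`, started every
    `period` seconds, that contains timestamp `ts`."""
    windows = []
    start = ts - ts % period  # latest window start at or before ts
    while start > ts - size:
        windows.append((start, start + size))
        start -= period
    return sorted(windows)


# An event at second 125 with 60 second windows.
in_fixed = fixed_window(125, size=60)
in_sliding = sliding_windows(125, size=60, period=30)
```

The event lands in exactly one fixed window but in two overlapping sliding windows, which is why sliding windows give smoother running averages.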
What is a Dataflow "Watermark"? Timestamp | Event time - when the data was generated | Processing time - when the data is processed in the pipeline | Pub/Sub or data-source can provide a watermark
What is a Dataflow "Trigger"? Says when a result is emitted from Dataflow - Default is to trigger at the watermark
Why are Dataflow triggers useful? They allow for re-aggregation of metrics with late arriving data
How long do messages stay in Pub/Sub? Seven days
What would the steps be to set up a streaming data pipeline using BigQuery and Dataflow? 1) Create BigQuery Table for output 2) Create Cloud Storage bucket for Dataflow staging 3) Create Pub/Sub topic for streaming data 4) Create Dataflow pipeline to connect to Pub/Sub and deposit data into BigQuery Table
What is cloud Dataproc? On-demand managed Hadoop clusters | NOT No ops - you need to configure clusters yourself | DOESN'T autoscale
Is Cloud Dataproc a no ops solution? No, you need to manually configure clusters
What is the main use-case for Dataproc? Migrating existing Hadoop structures to the cloud
What is MapReduce? Map - Take big data and distribute to many workers (nodes) | Reduce - Combine the results of those pieces | Distributed/parallel computing
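The Map and Reduce steps above are often illustrated with a word count. The sketch below runs everything in a single process purely to show the data flow; in Hadoop each phase would be spread across many workers. All function names are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values collected for one key."""
    return key, sum(values)


def word_count(documents):
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values)
                for key, values in shuffle(pairs).items())


counts = word_count(["big data big", "Data flow"])
```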
How do you create a Dataproc cluster? Manually or using gcloud commands
What are preemptible VMs and what should you use them for? Low cost worker nodes that can be lost with no warning if demand goes up - Use for processing only!
Is HDFS a good storage solution for Dataproc? No, use Cloud Storage instead if needed - Cluster can be stateless - shut down at any time
What is the recommended ratio of Preemptible VMs to Permanent VMs? 50/50
Are you guaranteed Preemptible VMs? No, they won't be available if the region is very busy
How do you access your Dataproc cluster? SSH in via cloud console or gcloud
What are the two ways to access a Dataproc cluster from the web? Firewall ports (8088, 9870) - SOCKS proxy
What is the benefit of using a SOCKS proxy? It doesn't expose your firewall ports
What format is data in when it's migrated to Dataproc? HDFS
How do we change the way we think about clusters when moving to Dataproc? Clusters become ephemeral (temporary) entities instead of permanent ones
What are Dataproc migration best practices? 1) Move data first (generally to Cloud Storage) 2) Small scale experimentation
What is a storage benefit of migrating to Dataproc? You can separate storage (Cloud Storage) and compute (Dataproc). This means you don't need your Dataproc clusters running all the time
On-Premise --> GCP: HDFS --> ? | Hive --> ? | HBase --> ? HDFS --> Cloud Storage | Hive --> BigQuery | HBase --> Bigtable
What is the term for optimizing on-premise structures to the cloud? "Lift and Leverage"
What is BigQuery? Fully managed data warehouse - Serverless, no ops
What are the benefits of BigQuery? Serverless - Autoscales - Good for storage and analysis - Accepts batch and streaming - Durable, multi-regional
What data types are stored in BigQuery vs Cloud Storage? BigQuery --> Tables | Cloud Storage --> Files
How is data stored in BigQuery? Columnar instead of Record Oriented - Each value is stored on a different storage volume
Does BigQuery update existing records? NO - It's not transactional
What are the good and bad aspects of columnar data storage? Fast read but slow write
What are BigQuery datasets and tables? Dataset - collection of tables and views | Table - collection of columns
What do you pay for with BigQuery? Storage, queries, and streaming data inserts
How can you interact with BigQuery? (4) Web UI - Command line (CLI) - Programmatic (REST API) - Via queries
What is a BigQuery "View"? Virtual table defined by a query
What are the benefits of caching BigQuery queries and how are cached queries stored? If you make the same query repeatedly, you won't have to pay for it - Queries are cached at the user level
What are User Defined Functions (UDFs) and what are their benefits? SQL code combined with JavaScript functions - Allows for more complex functions like loops and complex conditionals
What services can be used to ingest data to BigQuery? Cloud Storage (Batch, Read) - Bigtable (Read) - Dataflow (Streaming) - Google Drive (Read)
How would you connect Dataproc job outputs to BigQuery? Write Dataproc output to Cloud Storage then have GCS write to BigQuery
Where can BigQuery export data to and what are the max file sizes? ONLY to Cloud Storage Max 1 GB/file, but files can be split up
What are six best practices of BigQuery queries? 1) Avoid SELECT * (loads too much data, costs a lot) 2) Denormalize data 3) Filter early (minimizes later data processing) 4) Do the biggest JOINs first 5) LIMIT does not affect the cost!! (just the output you see; cost is per column) 6) Partition Data by date
Why does the LIMIT clause not affect the cost of a BigQuery query? Queries charge per column and LIMIT doesn't change how much data is processed
What are the four steps (and their colors) to look at to best optimize query performance? Look at the "Execution Plan" view: - Wait (Yellow) - Read (Purple) - Compute (Orange) - Write (Blue)
What are machine learning algorithms ultimately trying to do? Make accurate predictions using completely new data based on historical data they've seen
What are the six steps in developing a machine learning model? 1) Collect Data 2) Organize Data 3) Develop model - use historical data 4) Test model - using different historical data 5) Train model - using new data 6) Deploy model - then keep training with new data!
What is an ML "feature"? A variable - raw or calculated (engineered feature)
What is "Inference"? Apply trained model to new examples
What is model overfitting? When a model fits its training dataset too closely. It won't be able to make good predictions on new data.
What is Supervised Learning? Create a function from labeled training data. - The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and the desired output value (also called the supervisory signal).
What are the two main types of Supervised Learning and what are they used for? Regression - Continuous variables; predict stock prices | Classification - Categorical variables; yes/no, is this a cat?
What is unsupervised learning? Uses unlabeled data and tries to find natural clusters and patterns.
What are the three types of ML "learning"? Supervised - Unsupervised - Reinforcement
What is Reinforcement learning and what are some examples of it? Use positive/negative reinforcement to complete a task - E.g. complete a maze or learn chess
What is a neural network model? Model with multiple layers with connected units (neurons)
What is an ML neuron? Node, combines input values and creates one output value
What is a neural network "input"? Value fed to a neuron (e.g. cat pic)
What is a neural network "hidden layer"? Set of neurons operating from the same input set
What is feature engineering? Deciding which features to use in a dataset; which features can we calculate that might be predictive?
Why do we do feature engineering in ML? Brings in human intuition, we can hypothesize variables that might be predictive
What is a neural network "feature"? A transformation of inputs
What is an ML "epoch"? A single pass through the training dataset - Each run through building the ML model
What are ML model weights? Values that each input is multiplied by; the model adjusts them during training
What is ML "bias"? Value of output given a weight of zero
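The neuron, weight, and bias cards above can be tied together in a minimal sketch. The function name `neuron_output` and the sample numbers are illustrative assumptions, not part of any library, and no activation function is applied.

```python
def neuron_output(inputs, weights, bias):
    """Combine several input values into one output: weighted sum plus bias."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    return sum(x * w for x, w in zip(inputs, weights)) + bias


# Hypothetical values chosen only for illustration.
inputs = [1.0, 2.0, 3.0]
weights = [0.5, -1.0, 2.0]
bias = 0.25

# Weighted sum of the inputs, shifted by the bias.
print(neuron_output(inputs, weights, bias))

# With every weight set to zero, the output is just the bias.
print(neuron_output(inputs, [0.0, 0.0, 0.0], bias))
```

The second call mirrors the "bias" card: when the weights are zero, the inputs no longer matter and the bias alone determines the output.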
What is Gradient Descent? We walk down an error surface trying to find the smallest error (MSE) - Iterative process
What are Small and Large steps in ML Learning rates? Small - finds the smallest error but takes a long time | Large - runs faster but isn't as accurate; may never converge to the true minimum
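A minimal sketch of gradient descent and the learning-rate trade-off, assuming a toy one-dimensional error surface f(w) = (w - 3)^2 chosen only for illustration. The function name and the three learning rates are hypothetical.

```python
def gradient_descent(learning_rate, steps, start=0.0):
    """Walk down the error surface f(w) = (w - 3)^2, whose minimum is at w = 3."""
    w = start
    for _ in range(steps):
        gradient = 2.0 * (w - 3.0)      # derivative of (w - 3)^2 at the current w
        w -= learning_rate * gradient   # step in the downhill direction
    return w


small = gradient_descent(learning_rate=0.01, steps=50)     # slow: still far from 3
moderate = gradient_descent(learning_rate=0.1, steps=50)   # ends very close to 3
too_large = gradient_descent(learning_rate=1.1, steps=50)  # overshoots and diverges

for name, w in [("small", small), ("moderate", moderate), ("too_large", too_large)]:
    print(f"{name}: w = {w:.4f}, distance from minimum = {abs(w - 3.0):.4f}")
```

With the same number of steps, the small rate has not arrived yet, while the oversized rate jumps past the minimum on every step and never converges.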
How are images processed in ML models? Each pixel is a 1 or 0 value - The input to every neuron is a pixel
What are hyperparameters? Variables used in the training process itself - Learning rate, etc
What are deep and wide neural networks? Deep - has many hidden layers; good for generalization --> easier for neural networks (e.g. numeric data, pictures) | Wide - has many features; good for memorization; hard for neural networks (e.g. one-hot encoded) - The best models have deep and wide parts of the network!
Should you automatically throw out outliers? No, see why outliers might be occurring in the dataset
What are two best practices for ML dataset? Make sure data covers all possible use cases - Include negative examples and near misses
What is the precision of a classifier model and when is it useful? Accuracy when the classifier says "yes" | True Positive / (True Positive + False Positive) | Use when things you're trying to find are very common
What is "recall" in a classifier model and when is it useful? Accuracy when the truth is "yes" | TP / (TP + FN) | Use when the things you're trying to find are very rare
What is "accuracy" in a classifier model and when is it useful? Accuracy of all evaluated data | Cross-entropy for classifiers | Use when the things you're trying to find are balanced
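The precision and recall formulas above can be checked with a small sketch. The function name and the confusion-matrix counts are made up for illustration, and accuracy here uses the common "fraction of all predictions that were correct" definition.

```python
def classifier_metrics(tp, fp, fn, tn):
    """Return (precision, recall, accuracy) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # accuracy when the classifier says "yes"
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # accuracy when the truth is "yes"
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0    # correct predictions over all predictions
    return precision, recall, accuracy


# Hypothetical counts: 8 true positives, 2 false positives,
# 8 false negatives, 82 true negatives.
precision, recall, accuracy = classifier_metrics(tp=8, fp=2, fn=8, tn=82)
print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
```

In this example accuracy looks high even though half of the real positives were missed, which is why recall matters when the thing you are looking for is rare.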
What is TensorFlow? Software library for ML model creation - Pre-processing, feature creation, and model training
What should TensorFlow be used for and by whom? Creating an ML model and initial training - Used by ML researchers
What is Cloud ML Engine? TensorFlow library | Distributed training and prediction | Hyperparameter tuning
What are the two types of Cloud ML Engine predictions? What are their inputs and outputs? Batch - Get inference on large amounts of data; Cloud Storage is the input and output | Online - Fast requests with minimal latency; Input - JSON strings, output - returned in response message
What frameworks does ML Engine support? TensorFlow | scikit-learn | XGBoost
Should you read directly from BigQuery to ML Engine? No, it's better to pre-process in Cloud Storage
What are the three types of regressors in TensorFlow? Linear (regression) | Linear Classifier (logistic) | Deep Neural Network (DNN)
What is the benefit of training in Cloud ML Engine instead of TensorFlow? Cloud ML Engine can distribute training across many machines, decreasing processing time and increasing stability
Why is TextLineReader an efficient way to read data in TensorFlow? It reads data directly into the computation graph
What four characteristics make a good ML model feature? 1) It's related to the objective 2) Should be known at production time 3) Has to be numeric with meaningful magnitude 4) Needs a big enough sample size
What is feature crossing and why is it useful? It concatenates multiple variables - Can simplify learning
How do you use categorical variables in ML models? You need to one-hot encode them --> create sparse columns
Why do you bucketize/discretize continuous variables? To not weight variables with specific values that are overly meaningful (e.g. latitude)
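The three feature transformations above (one-hot encoding, bucketizing, and feature crossing) can be sketched in plain Python. This is an illustration of the ideas only, not TensorFlow's feature-column API; the helper names, vocabulary, and bucket boundaries are assumptions.

```python
import bisect


def one_hot(value, vocabulary):
    """Encode a categorical value as a sparse 0/1 vector over a known vocabulary."""
    return [1 if value == item else 0 for item in vocabulary]


def bucketize(value, boundaries):
    """Return the index of the bucket a continuous value falls into.

    boundaries must be sorted; n boundaries produce n + 1 buckets.
    """
    return bisect.bisect_right(boundaries, value)


def cross(*parts):
    """Concatenate several categorical values into a single crossed category."""
    return "_x_".join(str(part) for part in parts)


# Hypothetical inputs for illustration.
colors = ["red", "green", "blue"]
lat_boundaries = [30.0, 35.0, 40.0]

print(one_hot("blue", colors))            # sparse column for a categorical variable
lat_bucket = bucketize(37.7, lat_boundaries)
print(lat_bucket)                         # latitude becomes a bucket index, not a magnitude
print(cross(lat_bucket, "weekday"))       # crossed feature: latitude bucket x day type
```

Bucketizing first and crossing afterwards keeps the crossed vocabulary small, since the cross is built from bucket indices rather than raw continuous values.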
When would you use pre-trained APIs over a custom ML model? I need it quickly - I can't make an ML model - It fits my use-case
How do you pay for pre-trained APIs? You pay per API request
What is Datalab? Interactive notebook for exploring and visualizing data - Based on Jupyter notebooks
What GCP services can you visualize with Datalab? BigQuery, ML Engine, Cloud Storage, Compute Engine, and Stackdriver
How do you connect to Datalab? SSH in Cloud console and create Datalab instance
What is the Datalab web preview port? 8081
What is Dataprep? Intelligent data preparation - Managed, serverless, and web-based
How does Dataprep work? It's backed by Dataflow - Has automatic options (remove outliers, dedupe), but you can add custom options as well
What is DataStudio? For data viz and dashboards (e.g. Tableau) - Part of G Suite NOT Google Cloud
What services can DataStudio connect to? GCP - BigQuery, Spanner, Cloud SQL, GCS | G Suite - YouTube Analytics, Sheets, AdWords | Many third party integrations
Can you change the region of a BigQuery dataset? No, datasets are immutable
What is the best metric to use for determining when to scale? CPU utilization, NOT storage utilization
At what level is BigQuery data access controlled? At the data-set level
What's the difference between Failover and Disaster Recovery? Failover has a very short downtime - Disaster Recovery may incur delays before service is restored
What is a best practice for Identity and Access Management (IAM)? Assign roles to groups and give groups access privileges
In ML Engine, should you monitor jobs or operations? Jobs
Do you need to restore snapshots to use them? No, you can use them right away
What is Apache Kafka? It's an on-premise version of Pub/Sub
What is Cloud Composer? A workflow orchestration tool built on Apache Airflow - Works with cloud and on-premise servers
What is Cloud Memorystore? Fully managed in-memory data store service for Redis
What is BigQuery ML? ML model development in a SQL querying language. Enables data analysts to build ML models without having to export data and re-import
What four model types does BigQuery ML support? Linear regression - Binary Logistic regression - Multiclass logistic regression for classification - K-means clustering
What is "Prefetch" caching? Caching ahead of time to predict what might be searched for. - Only possible on Owner credentials. - Can be turned off (unlike query caching)
How many standard nodes are required before you can use Preemptible nodes in a Dataproc cluster? Two; you can't only have preemptible nodes
What are the two shutdown options for Dataflow? Cancel - stops all processes immediately | Drain - stops ingesting and finishes processing current data
At what levels can you control IAM in Pub/Sub? Project - Topic - Subscription
What IAM roles exist for Pub/Sub? Admin, Editor, Publisher, Subscriber
What types of accounts are a best practice for IAM in Pub/Sub? Service accounts
What are two partitioning methods in BigQuery? Ingestion-time partitioned | Partition by specific timestamp/date column
What is the command to train models on ML Engine? 'gcloud ml-engine jobs submit training'
What is the command to deploy trained models on ML Engine? 'gcloud ml-engine versions create' (after creating the model with 'gcloud ml-engine models create')
What are the "Project and Model" IAM roles in ML Engine? Admin - full control | Developer - Create jobs, request predictions | Viewer - Read only
What are the "Model Only" IAM roles in ML Engine? Model Owner - Full access | Model User - Read models and use for prediction
What are the Machine scale tiers for ML Engine? BASIC - single worker | STANDARD_1 - 1 master, 4 workers, 3 parameter servers | PREMIUM_1 - 1 master, 19 workers, 11 parameter servers | BASIC_GPU - 1 worker with GPU | CUSTOM
How is ML Engine priced? Per hour of use
What is a deep neural network? A neural network with at least three hidden layers
What benefits are there to training a model locally before deploying it to ML Engine? Lower cost - Quick iterations
What IAM roles are required to use Datalab? Compute Instance Admin - Service Account Actor
How is data shared in Datalab? Via shared Cloud Source Repository - At the Project level
What languages does Datalab support? JavaScript, Python, and SQL
What three things does a Datalab instance run on? On a Compute Engine instance, dedicated VPC, and Cloud Source repository
What are the supported file types for Dataprep? CSV, JSON, txt, XLS, LOG, TSV, AVRO
What is an AVRO file? Serialized data file for Apache Hadoop projects
What is Cloud TPU? Tensor Processing Unit - custom ML training computer
What service should you use for semi-structured data < 1 TB in size? Datastore
What are the four IAM levels for Dataflow? Admin - Full pipeline access and machine type config | Developer - Full pipeline access, NO machine type config | Viewer - Permissions only | Worker - Only for service accounts
Can Pub/Sub stream to Dataproc? Yes, then Dataproc can move data to the right place
How do you use Dataproc for real-time streaming data and analytics on that data? Pub/Sub --> Dataproc | Dataproc --> Bigtable and Cloud Storage (analytics on both)
Can different environments be in the same project? No, you need to create different projects for environments that must be fully isolated
Can someone with an IAM role for one service have the same role for others? No, IAM roles are assigned by service. - E.g. a Dataflow Developer with no other IAM roles can't see the underlying data; they can just manage the pipelines
What are two benefits of denormalizing data for BigQuery? Increased query performance - Decreased query complexity (you don't have to use JOIN clauses)
Does denormalization change the amount of data in BigQuery? No, but performance is increased since you don't have to query against as many tables
What is the recommended amount of data to store in Bigtable? 1 TB
What is the maximum number of tables BigQuery can query at once? 1000
What data types can be imported to Cloud SQL? CSV and SQL dumps
How is data shared between Datalab notebooks? The Cloud Source Repository
What are the two options for creating team Datalab notebooks and what is the IAM difference between the two? Team lead creates notebooks for users - Everyone accesses the same shared repository for notebooks - requires all users to be project editors
What are the three types of triggers in Dataflow? Element count - Combinations of triggers - Timestamps
What are the four levels of Cloud Spanner IAM? Admin - Full access | Database Admin - Create, edit databases; grant access to databases | Reader - read/execute database/schema | Viewer - cannot modify or read from database; can only view instances
What are the three levels of Dataproc IAM? Editor - Full access | Viewer - view access only | Worker - service accounts (e.g. read/write to gcs)
What are the six levels of BigQuery IAM? Admin - full access | Data Owner - full dataset access | Data Editor - edit dataset tables | Data Viewer - view datasets and tables | Job User - run jobs | User - run queries and create datasets
What are the two levels of Dataprep IAM? User - Run Dataprep in a project | Service Agent - necessary for cross-project access; gives Trifacta the necessary access
How can you use BigQuery and Dataproc together? Use the BigQuery connector in Dataproc (uses cloud storage for staging)
How do you convert images, video, etc. for use with the APIs? Use Cloud Storage URI - Encode in base64 format
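A minimal sketch of the two options, using only the standard library. The helper name `build_image_payload` is hypothetical, and the field names `content` and `source.imageUri` are assumed to match the JSON image object the Vision API accepts; check the current API reference before relying on them.

```python
import base64
import json


def build_image_payload(image_bytes=None, gcs_uri=None):
    """Describe an image either by Cloud Storage URI or by base64-encoded content."""
    if (image_bytes is None) == (gcs_uri is None):
        raise ValueError("provide exactly one of image_bytes or gcs_uri")
    if gcs_uri is not None:
        # Option 1: point the API at a file already in Cloud Storage.
        return {"source": {"imageUri": gcs_uri}}
    # Option 2: inline the raw bytes as base64 text so they fit in a JSON body.
    return {"content": base64.b64encode(image_bytes).decode("ascii")}


# Hypothetical inputs: a made-up bucket path and a few placeholder bytes.
by_uri = build_image_payload(gcs_uri="gs://my-bucket/cat.png")
inline = build_image_payload(image_bytes=b"\x89PNG placeholder bytes")

print(json.dumps(by_uri))
print(json.dumps(inline))
```

The URI form avoids sending the file in the request body, which is usually preferable for large images and for video.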
What language does Cloud Composer use? Python
What are the Cloud Composer charts called? Directed Acyclic Graphs
What are the two main components of Cloud IoT Core? Device Manager - Manages devices programmatically or via CLI | Protocol Bridge - load balancing, publishes device telemetry to Pub/Sub
What's the difference between hybrid cloud and multi-cloud setups? Hybrid Cloud has a public cloud and a private datacenter - Multi-Cloud has at least two public clouds in the system, but can also have a private datacenter
What is Google VPC? Google Virtual Private Cloud Network - has a VPN, allows you to control networking, firewalls, ports, etc
What are the three Hybrid connectivity products? Cloud Interconnect - Connects on-premise devices to the GCP | Cloud VPN | Peering - Direct and Carrier; Doesn't require GCP
What are the two types of Cloud Interconnect? Dedicated Interconnect - Connect directly to GCP VPC | Partner Interconnect - Connect to a service partner which connects to GCP; doesn't require equipment maintenance
What service can you use to EASILY build semi-custom ML models? AutoML: Natural Language - Tables - Vision - Translation - Video Intelligence
What is Dialogflow? A service to build conversational interfaces: chatbots, assistants, etc. Use Cases: Customer Service, Commerce,
What is the purpose of the Data Loss Prevention API? To find sensitive data
What does the COPPA legislation protect? The rights of data for children under 13
What is Datastore a replacement for? Cassandra
What parameters does an ML algorithm adjust? Weights and biases
How many concurrent interactive queries can you run at once? 50
How many concurrent queries can you run against Bigtable? 4
How many concurrent queries can you run that use Legacy SQL and UDFs? 6
How does denormalizing data help queries run faster? It enables the data processing to be done in parallel
What are two use cases for streaming insert data? Non-transactional data - Aggregate analysis
What is Key Visualizer? A tool to help you understand Bigtable usage - Helps you find where hotspotting is occurring - Which rows have too much data? - Are access patterns balanced?
How do you create a Side Input in a Dataflow pipeline? Turn a PCollection into a view; call the ParDo with a side input.
Side Inputs are useful for when the ParDo needs additional data for its operations, but the data is pulled at runtime (not hard coded)
What command do you use to continuously sync between on-prem and GCP? rsync
Why use wildcard tables in BigQuery and how do you use them? They allow you to query multiple tables at once
Add an asterisk to the end of the table name
What BigQuery keyword do you use to select from multiple wildcard tables by their suffixes? _TABLE_SUFFIX
Put _TABLE_SUFFIX criteria in the WHERE clause of the query
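A minimal sketch of a wildcard-table query with a `_TABLE_SUFFIX` filter, built as a Standard SQL string in Python. The project, dataset, and table prefix are hypothetical, and the helper only builds the query text; it does not call BigQuery.

```python
def wildcard_query(table_prefix, start_suffix, end_suffix):
    """Build a Standard SQL query over all tables sharing a name prefix.

    The asterisk after the prefix makes it a wildcard table, and the
    _TABLE_SUFFIX filter in WHERE limits which matching tables are scanned.
    """
    for suffix in (start_suffix, end_suffix):
        if not suffix.isalnum():
            # Reject anything that is not a plain suffix so that
            # nothing unexpected ends up inside the SQL text.
            raise ValueError(f"unexpected suffix: {suffix!r}")
    return (
        "SELECT COUNT(*) AS row_count\n"
        f"FROM `{table_prefix}*`\n"
        f"WHERE _TABLE_SUFFIX BETWEEN '{start_suffix}' AND '{end_suffix}'"
    )


# Hypothetical date-sharded tables: events_20240101, events_20240102, ...
query = wildcard_query("my-project.my_dataset.events_", "20240101", "20240131")
print(query)
```

Because the suffix filter decides which tables are read, a narrower range also reduces the amount of data the query processes.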
Is streaming or batch loading more cost effective for BigQuery? Batch loading
What are three use cases for BigQuery external tables? ETL Operations on data - Frequently changed data - Data ingested periodically
What is an external data source in BigQuery? A datasource you can query directly even though it's not stored in BigQuery
When is table clustering useful in BigQuery? It speeds up queries; Use for queries with aggregations or WHERE clause filtering
What kinds of tables can you cluster in BigQuery? Ingestion-time and date-time partitioned tables
What's the difference between Legacy and Standard SQL when using a project qualified table name? Standard SQL uses a period between the project and dataset (project.dataset.table); Legacy SQL uses a colon (project:dataset.table)
What is the recommended authentication account for Cloud Composer? Service Account
What are BigQuery template tables and how are they made? They're tables partitioned by a non-date variable (e.g. user-id). You make them by specifying a <templateSuffix> in the table insert request
BigQuery streaming inserts: max row size? max rows/second? max rows/request? 1 MB - 100,000 rows - 10,000 (GCP Recommends 500)
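A minimal sketch of splitting rows into request-sized batches before streaming them, using the recommended 500 rows per request from the card above. The helper name `batch_rows` and the sample rows are hypothetical; the actual insert call is left out.

```python
def batch_rows(rows, batch_size=500):
    """Yield consecutive slices of rows, each at most batch_size long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


# Hypothetical rows; in practice each batch would go into one streaming insert request.
rows = [{"user_id": i, "event": "click"} for i in range(1200)]

for number, batch in enumerate(batch_rows(rows), start=1):
    print(f"request {number}: {len(batch)} rows")
```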
What is the F1 value for model builds? Harmonic mean of precision and recall: 2 * (precision * recall) / (precision + recall)
What are Sigmoids and Softmaxes? Sigmoid - useful for mapping numbers to probability in log reg | Softmax - like sigmoid, but for multiple inputs
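A minimal sketch of both functions using only the standard library. The sample inputs are arbitrary; the max-subtraction in `softmax` is a common numerical-stability trick and does not change the result.

```python
import math


def sigmoid(x):
    """Map any real number to a value between 0 and 1."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # Equivalent form that avoids overflow for large negative x.
    z = math.exp(x)
    return z / (1.0 + z)


def softmax(values):
    """Map a list of scores to probabilities that sum to 1."""
    shift = max(values)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


print(sigmoid(0.0))          # a score of 0 maps to the midpoint probability
probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))     # larger scores get larger probabilities
```

Sigmoid suits a single yes/no output, while softmax spreads probability across several mutually exclusive classes.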
What is regularization? Approach to overfitting (drop layers, add weights, etc..)
What are Estimators in modeling? A TensorFlow high-level representation of a complete model
How does Cloud Spanner distribute data across nodes? Load-based splitting
What do you use to update, insert, or delete data in Cloud Spanner tables? Data-Manipulation Language (DML)
When should you use Dataproc autoscaling? When using external storage solutions (GCS, BigQuery) - Clusters that process many different jobs - Scale up single-job clusters
When should you NOT use Dataproc autoscaling? HDFS - YARN Node labels - Spark Structured Streaming - Idle Clusters
Which external data sources can you query directly from BigQuery? Bigtable - Cloud Storage - Google Drive
Note: Not as fast as querying directly from BigQuery
Can you change a Spanner instance configuration after you make it? No
Does Spanner autoscale the number of nodes when workloads increase? No
What are two approaches to ETL of external data to BigQuery? BigQuery UI - Dataflow
What is Kubernetes Engine for? Deploying containerized applications
What is a logpoint snapshot in Stackdriver Debugger? It's a debug snapshot generated while the program is running. You put a line of code into the program to create the snapshot without stopping the program itself
What is Sqoop used for? Transfer data between relational databases and Hadoop
What is DirectPipelineRunner used for? Running Dataflow pipeline operations locally
What does Stackdriver Error Reporting tell you? It aggregates crashes in your cloud services
What is Stackdriver Trace used for? Finding bottlenecks and latency in your data processing structures
What kind of data sources does BigQuery Data Transfer Service support? SaaS services (from Google and others, e.g. Amazon S3)
How should you handle invalid inputs in a Dataflow pipeline? Create a "deadletter" output that receives the invalid inputs as a side output, then re-process the data later
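A minimal sketch of the dead-letter idea in plain Python rather than in an actual Beam pipeline: valid records continue on the main path, while anything that fails parsing or validation is set aside for later re-processing. The function name, the JSON input format, and the required `id` field are assumptions made for illustration.

```python
import json


def route_records(raw_records):
    """Split raw JSON strings into (valid, dead_letter) lists."""
    valid, dead_letter = [], []
    for raw in raw_records:
        try:
            record = json.loads(raw)
            if not isinstance(record, dict) or "id" not in record:
                raise ValueError("missing required field: id")
            valid.append(record)
        except ValueError:
            # json.JSONDecodeError is a subclass of ValueError, so both
            # malformed JSON and failed validation end up here.
            dead_letter.append(raw)
    return valid, dead_letter


# Hypothetical inputs: one good record, one malformed, one missing the id field.
raw_inputs = ['{"id": 1, "value": 10}', "not json at all", '{"value": 7}']
valid, dead_letter = route_records(raw_inputs)

print("valid:", valid)
print("dead letter:", dead_letter)
```

Keeping the original raw string in the dead-letter list, rather than a partially parsed version, makes it possible to re-run the same input once the parsing logic or the data is fixed.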