Google Cloud Data Engineer Exam - Dataproc Review Notes

Dataproc Summary

Dataproc is a managed Spark and Hadoop service that lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning.

How to load data?

Dataproc connects to BigQuery

Option 1:

BigQuery does not natively know how to work with the Hadoop file system.

Cloud Storage can act as an intermediary between BigQuery and Dataproc.

You would export the data from BigQuery into Cloud Storage as sharded data.

Then the worker nodes in Dataproc would read the sharded data.

Symmetrically, if the Dataproc job produces output, it can be written to Cloud Storage in a format that BigQuery can import.

This approach is appropriate for periodic or infrequent transfers.
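
As a minimal sketch of this pattern (the bucket, table, and column names below are placeholders, not from the notes), a PySpark job on Dataproc can read the sharded export from Cloud Storage and write its output back to Cloud Storage for loading into BigQuery:

```python
# Minimal sketch of Option 1; all names are placeholders.
# Assumes the table was already exported to Cloud Storage as sharded
# newline-delimited JSON, e.g. with:
#   bq extract --destination_format NEWLINE_DELIMITED_JSON \
#       mydataset.mytable gs://my-bucket/exports/mytable-*.json
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-via-gcs").getOrCreate()

# The worker nodes read the sharded export in parallel.
df = spark.read.json("gs://my-bucket/exports/mytable-*.json")

# ...any Spark processing...
result = df.groupBy("some_column").count()

# Write the output back to Cloud Storage in a format BigQuery can load
# (newline-delimited JSON), then load it with `bq load`.
result.write.mode("overwrite").json("gs://my-bucket/output/result")
```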

Option 2:

Another option is to set up the BigQuery connector on the Dataproc cluster. The connector is a Java library that enables read/write access from Spark and Hadoop directly to BigQuery.

You need to save the BigQuery query result as a table first, because the connector reads tables rather than query results.

![Screen Shot 2018-07-15 at 12.48.01 am.png](https://upload-images.jianshu.io/upload_images/9976001-6fcaa78c38c1d404.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240) ![Screen Shot 2018-07-15 at 12.50.02 am.png](https://upload-images.jianshu.io/upload_images/9976001-9a1b2c9c68b70469.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
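
As a hedged sketch of reading a table through the connector from PySpark (the project, dataset, table, and bucket names are placeholders; the configuration keys and class names follow the connector's documented PySpark example and should be checked against the version installed on the cluster):

```python
# Sketch of Option 2: reading a BigQuery table via the Hadoop
# BigQuery connector installed on the Dataproc cluster.
# All project/dataset/table/bucket names are placeholders.
import json
import pyspark

sc = pyspark.SparkContext()

conf = {
    # Staging area the connector uses in Cloud Storage.
    'mapred.bq.project.id': 'my-project',
    'mapred.bq.gcs.bucket': 'my-staging-bucket',
    'mapred.bq.temp.gcs.path': 'gs://my-staging-bucket/tmp/bq_input',
    # The table to read. A query result has to be saved as a table
    # first, because the connector reads tables, not queries.
    'mapred.bq.input.project.id': 'my-project',
    'mapred.bq.input.dataset.id': 'my_dataset',
    'mapred.bq.input.table.id': 'my_table',
}

# Each record arrives as (row id, JSON string of the row).
table_data = sc.newAPIHadoopRDD(
    'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'com.google.gson.JsonObject',
    conf=conf)

rows = table_data.map(lambda record: json.loads(record[1]))
print(rows.take(5))
```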

Option 3:

When you want to process data in memory for speed, use a Pandas DataFrame.

In-memory processing is fast, but the data is limited to what fits in memory.
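
A small sketch of this option (the Cloud Storage path is a placeholder): pull a small Spark result onto the driver as a Pandas DataFrame and continue in memory.

```python
# Sketch of Option 3: only sensible when the data fits in the
# driver's memory. The GCS path is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pandas-option").getOrCreate()

small_df = spark.read.json("gs://my-bucket/exports/small-*.json").limit(10000)

# toPandas() collects every row onto the driver as an in-memory
# Pandas DataFrame -- fast to work with, but limited in size.
pdf = small_df.toPandas()

print(pdf.describe())
```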

Creating a Dataproc cluster

Ways:
A Deployment Manager template (Deployment Manager is an infrastructure automation service in Google Cloud)
CLI commands (gcloud)
Google Cloud Console

Keys:

0 Create a cluster specifically for one job

1 Match your data location to the compute location
-> better performance
-> also lets you shut down the cluster when it is not processing jobs

2 Use Cloud Storage instead of HDFS, and shut down the cluster when it is not actually processing data
-> This reduces the complexity of disk provisioning and enables you to shut down your cluster when it is not processing a job.

3 Use custom machine types to closely manage the resources that the job requires

4 On non-critical jobs requiring huge clusters, use preemptible VMs to hasten results and cut costs at the same time
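
The notes list Deployment Manager, the gcloud CLI, and the Cloud Console as the ways to create a cluster. As a hedged sketch of the same keys applied programmatically with the google-cloud-dataproc Python client (the project, region, cluster name, and machine types are placeholders; check the field names against the current API):

```python
# Hedged sketch: create a short-lived, job-specific Dataproc cluster
# with the google-cloud-dataproc client library. All names, the
# region, and the machine types are placeholders.
from google.cloud import dataproc_v1 as dataproc

project_id = "my-project"
region = "us-central1"  # choose the region where the data lives

cluster_client = dataproc.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "one-job-cluster",  # created for one job, deleted afterwards
    "config": {
        # machine_type_uri can also be a custom machine type
        # (e.g. "custom-6-23040") to match the job's resource needs.
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
        # Secondary workers are preemptible by default -- useful for
        # non-critical jobs that need a large cluster.
        "secondary_worker_config": {"num_instances": 4},
    },
}

operation = cluster_client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
print(operation.result().cluster_name)
```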
