Introduction

Google Cloud Dataproc is Google’s managed implementation of the Hadoop ecosystem, which includes the Hadoop Distributed File System (HDFS) and the MapReduce processing framework. In addition, Google Cloud Dataproc includes a number of applications built on top of Hadoop, such as Hive, Mahout, Pig, Spark, and Hue.

Apache Spark is a processing framework that operates on top of HDFS (as well as other data stores). It features interactive shells that can launch distributed processing jobs across a cluster. Spark also supports Spark SQL, an implementation of a subset of the Structured Query Language (SQL).
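
For example, once a cluster is running, Spark SQL statements such as the following can be issued from one of Spark's interactive shells. The bucket path, table name, and column names below are hypothetical placeholders.

    -- Register a CSV file stored in Cloud Storage as a Spark SQL table
    CREATE TABLE sales
    USING csv
    OPTIONS (path 'gs://my-bucket/sales.csv', header 'true', inferSchema 'true');

    -- Query the table with ordinary SQL
    SELECT region, SUM(amount) AS total_amount
    FROM sales
    GROUP BY region;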

Prerequisites

Before starting this tutorial, the following tutorials and notes should be reviewed.

Basic familiarity with the Hadoop ecosystem is required to get the most out of this tutorial. My Introduction to Hadoop (Including Spark) notes can help with a review of these concepts.

This tutorial assumes you already have a Google Cloud Services account set up and funded. If you do not, please follow these instructions first.

Topics Outline

Processing a data set using Spark on Google Cloud Dataproc requires the following main steps (a command-line sketch of the workflow follows the list):

  1. Enable the Google Cloud Compute Engine API
  2. Create a Google Cloud Storage bucket to hold data inputs, data outputs, and logs
  3. Create, configure, and launch a Google Cloud Dataproc cluster
  4. Log in to the Hadoop cluster master node
  5. Run the command-line interface and issue Spark SQL commands to create tables and run queries
  6. Shut down the cluster and remove any temporary resources
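
A minimal sketch of this workflow using the gcloud and gsutil command-line tools might look like the following. It assumes the Cloud SDK is installed and a project is already selected; the bucket name, cluster name, region, and zone are placeholders, and worker counts and machine types should be chosen to match the actual workload.

    # 1. Enable the required APIs
    gcloud services enable compute.googleapis.com dataproc.googleapis.com

    # 2. Create a Cloud Storage bucket for inputs, outputs, and logs
    gsutil mb gs://my-dataproc-bucket

    # 3. Create and launch a small Dataproc cluster
    gcloud dataproc clusters create my-cluster \
        --region=us-central1 --zone=us-central1-a --num-workers=2

    # 4. Log in to the cluster master node (named after the cluster, with an "-m" suffix)
    gcloud compute ssh my-cluster-m --zone=us-central1-a

    # 5. On the master node, start the Spark SQL command-line interface
    spark-sql

    # 6. When finished, delete the cluster and remove temporary resources
    gcloud dataproc clusters delete my-cluster --region=us-central1
    gsutil rm -r gs://my-dataproc-bucket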

