Introduction

Google Cloud Dataproc is Google’s managed implementation of the Hadoop ecosystem, which includes the Hadoop Distributed File System (HDFS) and the MapReduce processing framework. In addition, Google Cloud Dataproc includes a number of applications built on top of Hadoop, such as Hive, Mahout, Pig, Spark, and Hue.

Apache Spark is a processing framework that operates on top of HDFS (as well as other data stores). It features interactive shells that can launch distributed processing jobs across a cluster. Spark also supports Spark SQL, an implementation of a subset of the Structured Query Language (SQL).
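
For example, once a cluster is running, Spark SQL statements such as the following can be issued from one of Spark's interactive shells. The bucket path, table name, and column names below are hypothetical placeholders.

    -- Register a CSV file stored in Cloud Storage as a Spark SQL table
    CREATE TABLE sales
    USING csv
    OPTIONS (path 'gs://my-bucket/sales.csv', header 'true', inferSchema 'true');

    -- Query the table with ordinary SQL
    SELECT region, SUM(amount) AS total_amount
    FROM sales
    GROUP BY region;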

Prerequisites

Before starting this tutorial, the following tutorials and notes should be reviewed.

Basic familiarity with the Hadoop ecosystem is required to get the most out of this tutorial. My Introduction to Hadoop (Including Spark) notes can help with a review of these concepts.

This tutorial assumes you already have a Google Cloud Services account set up and funded. If you do not, please follow these instructions first.

Topics Outline

Processing a data set using Spark on Google Cloud Dataproc requires the following main steps (a command-line sketch of the workflow follows the list):

  1. Enable the Google Cloud Compute Engine API
  2. Create a Google Cloud Storage bucket to hold data inputs, data outputs, and logs
  3. Create, configure, and launch a Google Cloud Dataproc cluster
  4. Log in to the Hadoop cluster master node
  5. Run the command-line interface and issue Spark SQL commands to create tables and run queries
  6. Shut down the cluster and remove any temporary resources
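
A minimal sketch of this workflow using the gcloud and gsutil command-line tools might look like the following. It assumes the Cloud SDK is installed and a project is already selected; the bucket name, cluster name, region, and zone are placeholders, and worker counts and machine types should be chosen to match the actual workload.

    # 1. Enable the required APIs
    gcloud services enable compute.googleapis.com dataproc.googleapis.com

    # 2. Create a Cloud Storage bucket for inputs, outputs, and logs
    gsutil mb gs://my-dataproc-bucket

    # 3. Create and launch a small Dataproc cluster
    gcloud dataproc clusters create my-cluster \
        --region=us-central1 --zone=us-central1-a --num-workers=2

    # 4. Log in to the cluster master node (named after the cluster, with an "-m" suffix)
    gcloud compute ssh my-cluster-m --zone=us-central1-a

    # 5. On the master node, start the Spark SQL command-line interface
    spark-sql

    # 6. When finished, delete the cluster and remove temporary resources
    gcloud dataproc clusters delete my-cluster --region=us-central1
    gsutil rm -r gs://my-dataproc-bucket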

