The goal of this series is to help you get started with Apache Spark’s MLlib library. Together we will explore how to solve various interesting machine learning use cases in a well-structured way. By the end, you will be able to use MLlib with confidence and implement an organized, easy-to-maintain workflow for your future projects.
In the first part of the series we will focus on the very basics of MLlib. We will cover the steps needed to create a regression model that predicts housing prices. More advanced MLlib features and functions will be covered in future posts of the series.
Before going further let us start with some definitions.
Apache Spark is an open-source cluster-computing framework. Originally developed at the University of California, Berkeley’s AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
Apache Spark MLlib is Spark’s machine learning library. It consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as the underlying optimization primitives.
Moving to the Big Data era requires heavy iterative computations on very large datasets. Standard implementations of machine learning algorithms need very powerful machines to run. Depending on high-end machines is a disadvantage because they are expensive and costly to scale up. The idea behind distributed computing engines is to spread the calculations across multiple low-end machines (commodity hardware) instead of a single high-end one. This can speed up the learning phase considerably and makes it practical to train on larger datasets, which often leads to better models.
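The idea of splitting one calculation across several machines can be sketched in plain Python. This is only an illustration with made-up numbers, not Spark code: the function names (partition, summarize, combine) are hypothetical, and the "machines" are simply chunks of a list. Each chunk produces a small summary on its own, and only those summaries are merged at the end.

```python
# Plain-Python sketch: each "worker" summarizes its own partition,
# and only the small summaries are combined at the end.
def partition(values, num_partitions):
    """Split a list into num_partitions chunks (round-robin)."""
    return [values[i::num_partitions] for i in range(num_partitions)]

def summarize(chunk):
    """Work done independently on one machine: a partial sum and a count."""
    return sum(chunk), len(chunk)

def combine(summaries):
    """Merge the partial results into a global mean."""
    total = sum(s for s, _ in summaries)
    count = sum(n for _, n in summaries)
    return total / count

prices = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0]  # made-up values
summaries = [summarize(chunk) for chunk in partition(prices, 3)]
distributed_mean = combine(summaries)
print(distributed_mean)
```

The combined result matches the mean computed on the whole list, even though no single step needed to see all the data at once.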
To follow along with this tutorial, you have to install the following:
- Apache Spark
- findspark library
- NumPy
- Jupyter Notebook
Installing Apache Spark is simple: download the package from the official website.
To test your installation:
- uncompress the file
- go to bin directory
- run the following command
% ./pyspark --version
The command should print the Spark banner along with the installed version number.
To make Apache Spark easier to reach from Python, we will use findspark. It is a very simple library that sets up the development environment so that the Apache Spark library can be imported.
To install findspark, run the following in your shell:
% pip install findspark
NumPy is a widely used numeric computation library for Python. MLlib uses it internally for its computations.
Install it with the following command:
% pip install numpy
The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.
To install Jupyter:
% pip install jupyter
The Boston Housing dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts. It was obtained from the StatLib archive and has been used extensively throughout the literature to benchmark algorithms.
The dataset is small, with only 506 cases. It contains 14 attributes (13 features plus the target), described as follows:
- CRIM: per capita crime rate by town
- ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS: proportion of non-retail business acres per town.
- CHAS: Charles River dummy variable (1 if tract bounds river; 0 otherwise)
- NOX: nitric oxides concentration (parts per 10 million)
- RM: average number of rooms per dwelling
- AGE: proportion of owner-occupied units built prior to 1940
- DIS: weighted distances to five Boston employment centres
- RAD: index of accessibility to radial highways
- TAX: full-value property-tax rate per $10,000
- PTRATIO: pupil-teacher ratio by town
- B: 1000(Bk - 0.63)² where Bk is the proportion of blacks by town
- LSTAT: % lower status of the population
- MEDV: Median value of owner-occupied homes in $1000’s
The goal is to use the 13 features to predict the value of MEDV (which represents the housing price).
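As a quick sanity check, the attribute list above can be written down and split into features and target in plain Python. This sketch assumes the columns appear in the order listed, with MEDV last; the actual header names and order in your CSV file may differ.

```python
# Attribute names as listed above; the target (MEDV) is assumed to come last.
columns = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
           "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]

feature_names = columns[:-1]   # the 13 predictors
label_name = columns[-1]       # the value we want to predict

print(len(feature_names), "features ->", label_name)
```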
It is time to get your hands dirty. Let us jump into Spark and MLlib.
Setting Up Apache Spark
To get your development environment ready, launch Jupyter and create a new notebook.
% jupyter notebook
We start by importing the findspark library and initializing it by passing in the path to the Apache Spark folder.
import findspark
findspark.init('/opt/spark')
Every Spark application requires a SparkSession.
To create a SparkSession we write:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
Loading the Data
data = spark.read.csv('./boston_housing.csv', header=True, inferSchema=True)
- header=True signals that the first line contains the header
- inferSchema=True enables automatic detection of the underlying data schema
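To get a feel for what these two options do, here is a rough plain-Python analogue using the standard csv module. It is only an illustration: the sample data is made up, the infer helper is hypothetical, and Spark's real schema inference works per column and is more sophisticated than this per-value guess.

```python
import csv
import io

# Hypothetical two-row sample; the values are made up for illustration.
sample = io.StringIO("crim,chas,medv\n0.5,0,20.0\n1.5,1,30.5\n")

def infer(value):
    """Naive type inference: try int, then float, else keep the string."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value

reader = csv.reader(sample)
header = next(reader)  # like header=True: the first line holds the column names
rows = [dict(zip(header, map(infer, r))) for r in reader]  # like inferSchema=True
print(rows[0])
```

Without the inference step every value would stay a string, which is why the option matters before doing any numeric work on the columns.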
To display the data, call data.show().
Setting up the Features
Now for the fun part… MLlib’s algorithms expect the data to be represented in two columns: features and labels. The features column holds, for each data point, a vector of all the feature values used for prediction. The labels column holds the output label for each data point.
In our example, the features are columns 1 through 13, and the label is the MEDV column, which contains the price.
The goal is to predict the label from the features.
Creating a features array is straightforward. You just have to import the VectorAssembler class and pass in a list of the feature column names.
feature_columns = data.columns[:-1] # here we omit the final column
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
- outputCol="features" defines the name of the output vector that combines all the values
Now we use the assembler to create the features column:
data_2 = assembler.transform(data)
That is it! If you print the values of data_2, you will notice a new column named “features” that contains all the feature values combined into a single vector.
As in any machine learning workflow, we split the data into training and test sets. Here we split it into 70% training examples and 30% testing examples.
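Conceptually, the assembling step does something like the following plain-Python sketch. It is not Spark code: the assemble function, the column names, and the values are all made up, and Spark stores the result as a vector type rather than a Python list.

```python
# Plain-Python illustration of the assembling step (not Spark code).
rows = [
    {"rm": 6.0, "age": 65.0, "medv": 24.0},   # made-up values
    {"rm": 7.0, "age": 45.0, "medv": 33.0},
]
feature_columns = ["rm", "age"]

def assemble(row, input_cols, output_col="features"):
    """Return a copy of the row with the input columns packed into one list."""
    assembled = dict(row)
    assembled[output_col] = [row[c] for c in input_cols]
    return assembled

rows_2 = [assemble(r, feature_columns) for r in rows]
print(rows_2[0])
```

Note that the original columns are kept and the new column is added alongside them, which mirrors how data_2 still contains every column of data.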
train, test = data_2.randomSplit([0.7, 0.3])
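One thing worth knowing is that a weighted random split assigns each row independently, so the resulting sizes are only approximately 70/30, and the split changes on every run unless you fix a seed (randomSplit accepts an optional seed argument for that). The plain-Python sketch below illustrates the idea; random_split is a hypothetical helper, not Spark's implementation.

```python
import random

def random_split(rows, train_weight, seed):
    """Assign each row to train or test independently, like a biased coin flip."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test

rows = list(range(506))                      # stand-ins for the 506 cases
train, test = random_split(rows, 0.7, seed=42)
print(len(train), len(test))                 # roughly 70/30, not exactly

# The same seed reproduces the same split.
assert random_split(rows, 0.7, seed=42) == (train, test)
```

Fixing the seed is useful when you want the evaluation numbers later in the workflow to be comparable between runs.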
Training the Machine Learning Algorithm
We now move to another interesting part: training a simple LinearRegression model on our data. First, we import the necessary class.
from pyspark.ml.regression import LinearRegression
Next we define the algorithm variable. We need to specify the name of the features column and the labels column.
algo = LinearRegression(featuresCol="features", labelCol="medv")
Time for training… We call the fit method to start training our model on the train set.
model = algo.fit(train)
Voila! You have trained your first model using MLlib!
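To make "fitting" less of a black box, here is what it means for the simplest possible case: one feature and ordinary least squares, solved in closed form. This is a sketch under stated assumptions, with made-up data and a hypothetical fit_line helper. MLlib handles many features and uses its own solvers, so this is not how the library does it internally.

```python
def fit_line(xs, ys):
    """Closed-form least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

rooms = [4.0, 5.0, 6.0, 7.0, 8.0]          # made-up feature values
prices = [11.0, 16.0, 21.0, 26.0, 31.0]    # made-up labels: price = 5 * rooms - 9
slope, intercept = fit_line(rooms, prices)
print(slope, intercept)
```

Because the made-up labels lie exactly on a line, the fit recovers the slope of 5 and the intercept of -9 that generated them.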
Evaluating Model Performance
Completing the training phase is not enough. We have to calculate how good our model is. Luckily the model object has an evaluate method:
evaluation_summary = model.evaluate(test)
Use the evaluation_summary object to access a range of metrics:
evaluation_summary.meanAbsoluteError # Output: 3.39
evaluation_summary.rootMeanSquaredError # Output: 5.16
evaluation_summary.r2 # Output: 0.58
Well, not bad for a simple model. Your exact numbers will vary from run to run because the train/test split is random.
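If you are wondering what these three metrics measure, the standard definitions can be computed by hand. The sketch below uses made-up labels and predictions and a hypothetical regression_metrics helper; it follows the textbook formulas and is not taken from MLlib's source.

```python
import math

def regression_metrics(labels, predictions):
    """Return (MAE, RMSE, R^2) for paired labels and predictions."""
    n = len(labels)
    errors = [y - p for y, p in zip(labels, predictions)]
    mae = sum(abs(e) for e in errors) / n             # mean absolute error
    ss_res = sum(e * e for e in errors)               # residual sum of squares
    rmse = math.sqrt(ss_res / n)                      # root mean squared error
    mean_y = sum(labels) / n
    ss_tot = sum((y - mean_y) ** 2 for y in labels)   # total sum of squares
    r2 = 1 - ss_res / ss_tot                          # share of variance explained
    return mae, rmse, r2

labels = [20.0, 25.0, 30.0, 35.0]          # made-up values
predictions = [22.0, 24.0, 29.0, 37.0]     # made-up values
mae, rmse, r2 = regression_metrics(labels, predictions)
print(mae, rmse, r2)
```

MAE and RMSE are in the same unit as the label (here, thousands of dollars), while r2 is unitless, with values closer to 1 indicating a better fit.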
To predict outputs for unlabeled data, call the model.transform method and pass in your DataFrame. The DataFrame needs the same features column the model was trained on.
For example, let us predict values from the test set:
predictions = model.transform(test)
predictions is a DataFrame that contains the original columns, the features column, and a prediction column generated by the model.
predictions.select(predictions.columns[13:]).show() # here I am filtering out some columns just for the figure to fit
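For a linear regression model, the value in the prediction column is just the intercept plus a weighted sum of the feature values. A fitted model exposes these weights through its coefficients and intercept attributes. The plain-Python sketch below shows the arithmetic for one row; the predict helper and all the numbers are made up for illustration.

```python
def predict(features, coefficients, intercept):
    """Linear model prediction: intercept plus the weighted sum of the features."""
    return intercept + sum(w * x for w, x in zip(coefficients, features))

coefficients = [5.0, -0.25]     # made-up weights, one per feature
intercept = 2.0                 # made-up intercept
row_features = [6.0, 8.0]       # made-up feature vector for one row
prediction = predict(row_features, coefficients, intercept)
print(prediction)
```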
It was a long article, I know, but I hope it was worth your time. We introduced Apache Spark and its MLlib library, and we used MLlib on a regression problem to predict housing prices. Next, I will cover more of MLlib’s features with other use cases. Stay tuned…