Spark MLlib Tutorial – Linear Regression


The goal of this series is to help you get started with Apache Spark’s MLlib library. Together we will explore how to solve various interesting machine learning use-cases in a well structured way. By the end, you will be able to use MLlib with high confidence and learn to implement an organized and easy to maintain workflow for your future projects

In the first part of the series we will focus on the very basics of MLlib. We will cover the necessary steps to create a regression model to predict housing prices. More complicated MLlib features and functions are to be published in future posts of the series.

Before going further let us start with some definitions.


Apache Spark

Apache Spark is an open-source cluster-computing framework. Originally developed at the University of California, Berkeley’s AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.


Apache Spark MLlib is the machine learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, and underlying optimization primitives.

Why MLlib?

Moving to the Big Data Era requires heavy iterative computations on very big datasets. Standard implementations of machine learning algorithms require very powerful machines to be able to run. Depending on high-end machines is not advantageous due to their high price and improper costs of scaling up. The idea of using distributed computing engines is to distribute the calculations to multiple low-end machines (commodity hardware) instead of a single high-end one. This definitely speeds up the learning phase and allows us to create better models.

Software Requirements

In order to follow up with this tutorial you have to install the following:

  • Python
  • Apache Spark
  • findspark library
  • Numpy
  • Jupyter

Apache Spark

Installing Apache Spark is so simple. You just have to download the package from the official website.

To test your implementation:

  1. uncompress the file
  2. go to bin directory
  3. run the following command
% ./pyspark --version

The output should look like the following:

Testing Apache Spark version

findspark Library

To make reaching Apache Spark easier, we will use findspark. It is a very simple library that automatically sets up the development environment to import Apache Spark library.

To install findspark, run the following in your shell:

% pip install findspark


Numpy is a famous numeric computation library in Python. MLlib uses it internally for its computations.

Install it with the following command:

% pip install numpy


The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.

To install Jupyter:

% pip install jupyter

Problem Definition

The first problem in this series is Regression. We are going to train a model to predict the famous Boston Housing dataset (download from here).

This dataset contains information collected by the U.S Census Service concerning housing in the area of Boston Mass. It was obtained from the StatLib archive, and has been used extensively throughout the literature to benchmark algorithms.

The dataset is small in size with only 506 cases. It contains 14 features described as follows:

  1. CRIM: per capita crime rate by town
  2. ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
  3. INDUS: proportion of non-retail business acres per town.
  4. CHAS: Charles River dummy variable (1 if tract bounds river; 0 otherwise)
  5. NOX: nitric oxides concentration (parts per 10 million)
  6. RM: average number of rooms per dwelling
  7. AGE: proportion of owner-occupied units built prior to 1940
  8. DIS: weighted distances to five Boston employment centres
  9. RAD: index of accessibility to radial highways
  10. TAX: full-value property-tax rate per $10,000
  11. PTRATIO: pupil-teacher ratio by town
  12. B: 1000(Bk — 0.63)² where Bk is the proportion of blacks by town
  13. LSTAT: % lower status of the population
  14. MEDV: Median value of owner-occupied homes in $1000’s

The goal is to use the 13 features to predict the value of MEDV (which represents the housing price).

It is time to get your hands dirty. Let us jump into Spark and MLlib.


Setting Up Apache Spark

To get your development environment ready lunch Jupyter and create a new notebook.

% jupyter notebook

We start by importing the findspark library and initializing it by passing in the path to Apache Spark folder.

import findspark

Every Spark application requires a SparkSession.

To create a SparkSession we write:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

Loading the Data

data ='./boston_housing.csv', header=True, inferSchema=True)
  • header=True signals that the first line contains the header
  • inferSchema=True enables automatic detection of the underlying data schema

To display the data:
Top 20 rows of the data

Setting up the Features

Now for the fun part… MLlib’s algorithms expect the data to be represented in two columns: Features and Labels. Features is an array of data points of all the features to be used for prediction. Labels contain the output label for each data point.

In our example, the features are the columns from 1 → 13, the labels is the MEDV column that contains the price.

The goal is to predict the label from the features.

Creating a features array is straight forward. You just have to import the VectorAssembler class and pass in a list of the feature column names.

feature_columns = data.columns[:-1] # here we omit the final column
from import VectorAssembler
assembler = VectorAssembler(inputCols=feature_columns,outputCol="features")
  • outputCol=”features” defines the name of the output vector that combines all the values

Now we use the assembler to create the features column:

data_2 = assembler.transform(data)

That is it! If you print the values of data_2 you will notice a new column named “features” that contains all the values combined in one list:
Data after VectorAssembler

Train\Test Split

As in any machine learning work flow, we split the data into train and test set. Here we split it to 70% training examples and 30% testing examples.

train, test = data_2.randomSplit([0.7, 0.3])

Training the Machine Learning Algorithm

We move to another interesting part, let us train a simple LinearRegressionmodel on our data. First, we import the necessary class.

from import LinearRegression

Next we define the algorithm variable. We need to specify the name of the features column and the labels column.

algo = LinearRegression(featuresCol="features", labelCol="medv")

Time for training… We call the fit method to start training our model on the train set.

model =

Voila! You have trained your first model using MLlib!

Evaluating Model Performance

Completing the training phase is not enough. We have to calculate how good our model is. Luckily the model object has an evaluate method:

evaluation_summary = model.evaluate(test)

Use the evaluation_summary object to access a vast amount of metrics:

# Output: 3.39
# Output: 5.16
# Output: 0.58

Well, not bad for a simple model.

Predicting Values

To predict outputs for unlabeled data you call model.transform function while passing your DataFrame.

For example, let us predict values from the test set:

predictions = model.transform(test)

predictions is a DataFrame that contains: the original columns, the features column and predictions column generated by the model.[13:]).show() # here I am filtering out some columns just for the figure to fit

Final Thoughts

It was a long article I know, but I hope it was worth your time. We introduced Apache Spark and its amazing MLlib library. We used MLlib on a regression problem to predict housing prices. Next I will cover more of MLlib features with other use-cases. Stay tuned…

Partagez ce post sur vos réseaux

Laisser un commentaire

dix-sept + vingt =

Fermer le menu