Jupyter Notebook is an interactive notebook environment, and it supports Spark.
1. Jupyter Notebook installation
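For example, Jupyter can be installed with pip (a minimal sketch, assuming Python and pip are already on the machine; an Anaconda distribution works just as well):
$sudo pip install jupyter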
2. Scala and Spark installation
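A prebuilt Spark binary already bundles the Scala runtime it needs, so downloading and unpacking Spark is usually sufficient (a sketch, assuming the spark-2.0.1-bin-hadoop2.6 build used in the steps below; the download mirror may differ):
$wget https://archive.apache.org/dist/spark/spark-2.0.1/spark-2.0.1-bin-hadoop2.6.tgz
$mkdir -p ~/data && tar -xzf spark-2.0.1-bin-hadoop2.6.tgz -C ~/data/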
3. Toree installation
Apache Toree is a tool that configures Jupyter Notebook to run with Spark.
$sudo pip install -i http://pypi.anaconda.org/hyoon/simple toree
4. Configure the different Apache Toree kernels
For Scala:
$jupyter toree install --spark_home=~/data/spark-2.0.1-bin-hadoop2.6/
For PySpark:
$jupyter toree install --spark_home=~/data/spark-2.0.1-bin-hadoop2.6/ --interpreters=PySpark
or even for SparkR and SQL:
$jupyter toree install --interpreters=SparkR,SQL
5. You can check the installation with:
$jupyter kernelspec list
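The Toree kernels should appear alongside the default Python kernel, something like the following (names and paths are illustrative and vary by installation):
apache_toree_scala      /usr/local/share/jupyter/kernels/apache_toree_scala
apache_toree_pyspark    /usr/local/share/jupyter/kernels/apache_toree_pyspark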
6. Add 3rd-party libraries
- List all packages you will use.
SPARK_PKGS=$(cat << END | xargs echo | sed 's/ /,/g'
neo4j-contrib:neo4j-spark-connector:2.0.0-M2
graphframes:graphframes:0.5.0-spark2.0-s_2.11
END
)
- Define SPARK_OPTS and SPARK_HOME.
SPARK_OPTS="--packages=$SPARK_PKGS"
SPARK_HOME=~/data/spark-2.0.1-bin-hadoop2.6/
- Configure Toree to use these packages.
sudo jupyter toree install \
  --spark_home=$SPARK_HOME \
  --spark_opts="$SPARK_OPTS"
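Once everything is installed, start the notebook server and pick one of the Toree kernels from the New menu:
$jupyter notebook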
Adding remote packages: you can use Apache Toree's AddDeps magic to add dependencies from Maven Central. You must specify the Maven group ID (company name), artifact ID, and version. To also resolve transitive dependencies, you must explicitly pass the --transitive flag.
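For example, in a notebook cell (joda-time is only an illustrative artifact here):
%AddDeps org.joda joda-time 2.9.4 --transitive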
References:
- https://toree.incubator.apache.org/documentation/user/installation.html
- http://stackoverflow.com/questions/39149541/integrate-pyspark-with-jupyter-notebook