Jupyter Notebook is an interactive notebook environment, and it supports Spark.
1. Jupyter Notebook installation
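For example, Jupyter can be installed with pip (a minimal sketch, assuming Python and pip are already on the machine; an Anaconda distribution works just as well):
$sudo pip install jupyter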
2. Scala and Spark installation
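A prebuilt Spark binary already bundles the Scala runtime it needs, so downloading and unpacking Spark is usually sufficient (a sketch, assuming the spark-2.0.1-bin-hadoop2.6 build used in the steps below; the download mirror may differ):
$wget https://archive.apache.org/dist/spark/spark-2.0.1/spark-2.0.1-bin-hadoop2.6.tgz
$mkdir -p ~/data && tar -xzf spark-2.0.1-bin-hadoop2.6.tgz -C ~/data/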
3. Toree installation
Apache Toree is a tool that configures Jupyter Notebook to run with Spark.
$sudo pip install -i http://pypi.anaconda.org/hyoon/simple toree
4. Configure the different Apache Toree kernels
For Scala:
$jupyter toree install --spark_home=~/data/spark-2.0.1-bin-hadoop2.6/
For PySpark:
$jupyter toree install --spark_home=~/data/spark-2.0.1-bin-hadoop2.6/ --interpreters=PySpark
or even for SparkR and SQL:
$jupyter toree install --interpreters=SparkR,SQL
5. You can check the installation with:
$jupyter kernelspec list
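The Toree kernels should appear alongside the default Python kernel, something like the following (names and paths are illustrative and vary by installation):
apache_toree_scala      /usr/local/share/jupyter/kernels/apache_toree_scala
apache_toree_pyspark    /usr/local/share/jupyter/kernels/apache_toree_pyspark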
6. Add 3rd-party libraries
- List all packages you will use.
SPARK_PKGS=$(cat << END | xargs echo | sed 's/ /,/g'
neo4j-contrib:neo4j-spark-connector:2.0.0-M2
graphframes:graphframes:0.5.0-spark2.0-s_2.11
END
)
- Define SPARK_OPTS and SPARK_HOME.
SPARK_OPTS="--packages=$SPARK_PKGS"
SPARK_HOME=~/data/spark-2.0.1-bin-hadoop2.6/
- Configure Toree to use these packages.
sudo jupyter toree install \
  --spark_home=$SPARK_HOME \
  --spark_opts="$SPARK_OPTS"
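Once everything is installed, start the notebook server and pick one of the Toree kernels from the New menu:
$jupyter notebook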
Adding remote packages: you can use Apache Toree's AddDeps magic to add dependencies from Maven Central. You must specify the Maven group ID (company name), artifact ID, and version. To also resolve transitive dependencies, you must explicitly pass the --transitive flag.
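For example, in a notebook cell (joda-time is only an illustrative artifact here):
%AddDeps org.joda joda-time 2.9.4 --transitive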
References:
- https://toree.incubator.apache.org/documentation/user/installation.html
- http://stackoverflow.com/questions/39149541/integrate-pyspark-with-jupyter-notebook