Program your favorite data science pipeline in Spark.

Keywords

Authors

Abstract

The Faculty of Mathematics and Computer Science at Friedrich Schiller University Jena, Germany, offers a course on distributed processing of massive datasets, designed for the master's degree program in ``Computational and Data Science.'' Within that course, students complete a three-week programming project in which they learn to design, construct, and improve data analysis and machine learning pipelines using Hadoop, MapReduce, and Spark on the university’s central compute cluster. This short note sketches the main idea of the programming project, gives an example of a project instance, and reports on classroom experiences.