Exploring data science workflows: A practice-oriented approach to teaching processing of massive datasets.

Keywords

Abstract

Massive datasets are typically processed by a sequence of different stages, comprising data acquisition and preparation, data processing, data analysis, result validation, and visualization. In conjunction, these stages form a data science workflow, a key element enabling the solution of data-intensive problems. The complexity and heterogeneity of these stages require a diverse set of techniques and skills. This article discusses a hands-on practice-oriented approach aiming to enable and motivate graduate students to engage with realistic data science workflows. A major goal of the approach is to bridge the gap between academia and industry by integrating programming assignments that implement different data workflows with real-world data. In consecutive assignments, students are exposed to the methodology of solving problems using big data frameworks and are required to implement different data workflows of varying complexity. This practice-oriented approach is well received by students, as confirmed by different surveys.