# Modeling

## Introduction

The second swimlane in Fig. 1 illustrates the Modeling work stream. In this work stream, the data scientist explores the data sets prepared in the Pipelining work stream, experiments with various ML algorithms and models on those data sets, optimizes the hyperparameters to get the desired results, and finally codes the selected and tuned ML model.

Activity | Description | Inputs | Outputs
---|---|---|---
Explore Data | The available data sets are explored and the suitable ones are prepared (in the Pipelining work stream) for modeling the ML problem. | Raw or cleaned data | Suitable data set
Experiment Models | Various ML algorithms and models (e.g., neural network architectures) are experimented with on the data to select an effective model for the ML problem. | Suitable data set | Selected ML model
Optimize Hyperparameters | Hyperparameters are tuned to test the selected model so that ML training can be efficiently performed. | Selected ML model | Finalized model & hyperparameters
Code Model | The finalized ML model is programmed for training and inference. Application development may be involved. | Finalized model & hyperparameters | ML model coded for training & inference

## Explore Data

In this activity, the data scientist explores the data sets prepared in the Pipelining work stream and selects the ones suitable for training and inference. Usually, the data scientist uses interactive data science tools (e.g., RStudio, Jupyter Notebook, and Apache Zeppelin) to explore the data sets. If the suitable data sets are unavailable, incomplete, malformed, or unclean, further data acquisition, cleansing, or preprocessing work in the Pipelining work stream will be required. The data scientist can also suggest how the suitable data sets should be prepared in the Pipelining work stream.
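A minimal sketch of this kind of exploration, using pandas; the small inline frame below is a hypothetical stand-in for a data set prepared in the Pipelining work stream, and the column names are illustrative:

```python
import pandas as pd

# Stand-in for a data set handed over from the Pipelining work stream.
df = pd.DataFrame({
    "age": [34, 45, None, 23, 52],
    "income": [48000, 72000, 51000, None, 90000],
    "churned": [0, 1, 0, 0, 1],
})

# Shape, schema, and summary statistics are a typical first look.
print(df.shape)       # number of rows and columns
print(df.dtypes)      # column data types
print(df.describe())  # basic statistics per numeric column

# Missing values signal that further cleansing or preprocessing
# in the Pipelining work stream is required.
missing = df.isna().sum()
print(missing[missing > 0])
```

In a notebook environment such as Jupyter, these same calls are typically run interactively, one cell at a time, alongside quick plots of each column's distribution.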

## Experiment Models

In this activity, the data scientist designs the ML model using the selected data sets. For example, he or she experiments with various ML models based on different algorithms (e.g., neural network architectures), trains the models with the training data set, validates the models with the testing data set, and then selects the suitable model based on inference accuracy measures such as precision and recall.
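The loop described above can be sketched with scikit-learn; the two candidate models, the built-in data set, and the choice of F1 as the selection criterion are all illustrative assumptions:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Stand-in for the suitable data set from the Explore Data activity.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Candidate models based on different algorithms.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=5000),
    "random_forest": RandomForestClassifier(random_state=42),
}

# Train each candidate, validate it, and record its scores.
results = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(name,
          "precision:", precision_score(y_test, pred),
          "recall:", recall_score(y_test, pred))
    results[name] = f1_score(y_test, pred)

# Select the candidate with the best precision/recall trade-off.
best = max(results, key=results.get)
print("selected model:", best)
```

In practice the candidate set is usually larger and the comparison uses cross-validation rather than a single split, but the experiment-measure-select structure is the same.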

## Optimize Hyperparameters

In this activity, the data scientist tunes the hyperparameters so that the model can be further optimized, for example, in terms of training speed and inference accuracy. Different ML algorithms may involve different sets of hyperparameters (e.g., learning rate, model size, number of passes, regularization). The data scientist may need to go back and forth between the Experiment Models activity and the Optimize Hyperparameters activity in order to optimize the ML model.
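One common way to tune hyperparameters is an exhaustive grid search with cross-validation; the sketch below uses scikit-learn's `GridSearchCV`, and the search space and scoring metric are illustrative assumptions rather than recommended settings:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Stand-in for the selected model's training data.
X, y = load_breast_cancer(return_X_y=True)

# Hypothetical search space; real grids depend on the selected model.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],
}

# Evaluate every hyperparameter combination with 3-fold cross-validation.
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="f1",
    cv=3,
)
search.fit(X, y)

print("best hyperparameters:", search.best_params_)
print("best cross-validated F1:", search.best_score_)
```

When grids grow large, randomized or Bayesian search is often substituted for the exhaustive grid; the fit-score-select cycle stays the same.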

## Code Model

After the ML model and its hyperparameters are defined, the model is coded using ML libraries, e.g., TensorFlow, CNTK, and PyTorch. The coded model is then incorporated into the inference application.
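As a minimal sketch of coding a model for both training and inference, the PyTorch fragment below defines a small binary classifier; the feature count, layer sizes, and random stand-in data are all illustrative assumptions:

```python
import torch
from torch import nn

# A hypothetical finalized model: a small binary classifier.
class Classifier(nn.Module):
    def __init__(self, n_features: int = 30):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 16),
            nn.ReLU(),
            nn.Linear(16, 1),
        )

    def forward(self, x):
        return self.net(x)

model = Classifier()
loss_fn = nn.BCEWithLogitsLoss()
# Learning rate is one of the hyperparameters fixed in the previous activity.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One training step on random stand-in data.
x = torch.randn(8, 30)
y = torch.randint(0, 2, (8, 1)).float()
model.train()
optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()

# Inference: switch to eval mode and disable gradient tracking.
model.eval()
with torch.no_grad():
    probs = torch.sigmoid(model(x))
```

The training path and the inference path share the same coded model; the inference application would load trained weights (e.g., via `torch.load`) and run only the eval-mode path.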