Algorithmic Chains and Pipelines: Building a Pipeline

For many machine learning algorithms, the particular representation of the data you provide is very important.

  • Typically, the data is first rescaled, then features are combined by hand, and additional features are learned using unsupervised machine learning.
  • Consequently, most machine learning applications require not only applying a single algorithm, but also chaining together many different processing steps and machine learning models.

Let's look at an example that illustrates the importance of chaining models.
We know that the performance of a kernel SVM on the cancer dataset can be greatly improved by preprocessing with MinMaxScaler.
The following code splits the data, computes the minimum and maximum values, scales the data, and trains the SVM:

  from sklearn.datasets import load_breast_cancer
  from sklearn.svm import SVC
  from sklearn.preprocessing import MinMaxScaler
  from sklearn.model_selection import train_test_split

  # Load and split the data
  cancer = load_breast_cancer()
  X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)

  # Scale the data
  scaler = MinMaxScaler()
  scaler.fit(X_train)
  X_train_scaled = scaler.transform(X_train)

  # Learn an SVM on the scaled training data
  svc = SVC().fit(X_train_scaled, y_train)

  # Scale the test data and evaluate the SVM
  X_test_scaled = scaler.transform(X_test)
  print("Test score: {}".format(svc.score(X_test_scaled, y_test)))

  '''
  Test score: 0.972027972027972
  '''

1. Parameter selection with preprocessing

Now, suppose we want to use GridSearchCV to find better SVC parameters. What should we do? A simple way might look like this:

  from sklearn.model_selection import GridSearchCV

  # Parameter grid to search over
  param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100],
                'gamma': [0.001, 0.01, 0.1, 1, 10, 100]}

  # Create a GridSearchCV instance with 5-fold cross-validation
  grid = GridSearchCV(SVC(), param_grid, cv=5)

  # Fit on the (already scaled) training data
  grid.fit(X_train_scaled, y_train)

  # Print the best parameters and scores
  print("Best parameters: {}".format(grid.best_params_))
  print("Best cross-validation accuracy: {:.3f}".format(grid.best_score_))
  print("Test score: {:.3f}".format(grid.score(X_test_scaled, y_test)))


  '''
  Best parameters: {'C': 1, 'gamma': 1}
  Best cross-validation accuracy: 0.981
  Test score: 0.972
  '''

📣
Here we use the scaled data to perform a grid search on the SVC parameters. However, there is a subtle pitfall in the code above.

When scaling the data, we used all of the training set to find how to scale it, that is, to compute the minimum and maximum of each feature. We then used the scaled training data to run a grid search with cross-validation.

For each split in the cross-validation, part of the original training set becomes the training part of the split, and the rest becomes the test part. The test part is used to measure how a model trained on the training part performs on new data. However, we already used the information contained in the test part when scaling the data.

⭐ Remember: the test part of each cross-validation split is part of the training set, and we used information from the entire training set to find the correct scaling of the data.
To the model, this data therefore looks very different from genuinely new data.
If we look at new data (say, the data in the test set), that data was not used to scale the training data, and its minimum and maximum values may differ from those of the training data.
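As a quick concrete check, here is a small sketch (our own addition, reusing the X_test_scaled computed above) showing that the test set, once scaled with the training set's minimum and maximum, need not lie within [0, 1]:

  # Sketch: test data scaled with the training set's min/max can fall
  # outside [0, 1], unlike the scaled training data itself
  print("Scaled test set min (first 5 features): {}".format(X_test_scaled.min(axis=0)[:5]))
  print("Scaled test set max (first 5 features): {}".format(X_test_scaled.max(axis=0)[:5]))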
The following example shows the difference in data processing between cross-validation and final evaluation:

  import mglearn

  mglearn.plots.plot_improper_processing()

📣
Therefore, the splits used in cross-validation no longer correctly reflect how new data will appear to the modeling process.

  • We have leaked information about this part of the data into the modeling process.
  • This will lead to overly optimistic results during cross-validation and may lead to suboptimal parameter selection.

To solve this problem, during cross-validation, the partitioning of the dataset should be done before any preprocessing.

Any processing that extracts information from the dataset should only be applied to the training portion of the dataset, so any cross-validation should be in the "outermost loop" of the processing.

In scikit-learn, to achieve this with the cross_val_score function and the GridSearchCV function, you can use the Pipeline class (a short sketch follows the list below).

  • The Pipeline class can combine (glue) multiple processing steps into a single scikit-learn estimator.
  • The Pipeline class itself has fit, predict and score methods that behave like other models in scikit-learn.
  • The most common use case for the Pipeline class is to link preprocessing steps (such as data scaling) with a supervised model (such as a classifier).
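To make the leakage concrete, here is a minimal sketch (our own illustration, not from the original text) comparing cross-validation on the pre-scaled training data with cross-validation on a pipeline that rescales inside every split; the pipeline previews the one built in the next section:

  from sklearn.model_selection import cross_val_score
  from sklearn.pipeline import Pipeline

  # Leaky: the scaler has already seen the entire training set, including
  # the rows that become the "test" part of each cross-validation split
  print("CV on pre-scaled data: {:.3f}".format(
      cross_val_score(SVC(), X_train_scaled, y_train, cv=5).mean()))

  # Correct: the pipeline refits MinMaxScaler on the training part of each split
  print("CV with pipeline: {:.3f}".format(
      cross_val_score(Pipeline([('scaler', MinMaxScaler()), ('svm', SVC())]),
                      X_train, y_train, cv=5).mean()))

Any gap between the two numbers is caused entirely by the scaler having seen the test part of each split.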

2. Build the pipeline

Use the Pipeline class to represent the workflow of training an SVM after scaling the data with MinMaxScaler (without grid search for now).

  from sklearn.pipeline import Pipeline

  # Build a pipeline object from a list of steps.
  # Each step is a tuple containing a name (any string of your choice)
  # and an estimator instance.
  pipe = Pipeline([('scaler', MinMaxScaler()), ('svm', SVC())])

  '''
  Here we create two steps: the first, named "scaler", is an instance of
  MinMaxScaler; the second, named "svm", is an instance of SVC.
  Now we can fit this pipeline like any other scikit-learn estimator:
  '''
  # fit first calls fit on the first step (the scaler), then uses the scaler
  # to transform the training data, and finally fits the SVM on the scaled data
  pipe.fit(X_train, y_train)

  # To evaluate on the test data, we simply call pipe.score
  print("Test score: {:.2f}".format(pipe.score(X_test, y_test)))

  '''
  Test score: 0.97
  '''
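As a side note, scikit-learn also offers make_pipeline, which builds the same kind of pipeline without naming the steps by hand; the names are derived from the lowercased class names. A minimal sketch:

  from sklearn.pipeline import make_pipeline

  # make_pipeline names the steps automatically ("minmaxscaler", "svc")
  pipe_short = make_pipeline(MinMaxScaler(), SVC())
  print("Pipeline steps: {}".format([name for name, estimator in pipe_short.steps]))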

3. Using Pipelines in Grid Search

Using pipelines in grid search works the same as using any other estimator. We define a grid of parameters to search and build a GridSearchCV with the pipeline and the grid of parameters.

However, there is a slight change in how the parameter grid is specified.

  • We need to specify, for each parameter, which step of the pipeline it belongs to.
  • The two parameters we want to adjust, C and gamma, are both parameters of SVC, which is the second step.
    • We gave this step the name "svm".
    • The syntax for defining a parameter grid for a pipeline is to prefix each parameter name with the step name followed by __ (a double underscore).
    • Therefore, to search over the C parameter of the SVC, we must use "svm__C" as the key in the parameter grid dictionary, and likewise "svm__gamma" for the gamma parameter:

  param_grid = {'svm__C': [0.001, 0.01, 0.1, 1, 10, 100],
                'svm__gamma': [0.001, 0.01, 0.1, 1, 10, 100]}

  # With this parameter grid, we can use GridSearchCV as usual.
  # Note that the model passed in is now the pipeline.
  grid = GridSearchCV(pipe, param_grid, cv=5)
  grid.fit(X_train, y_train)

  # Print the best parameters and scores
  print("Best parameters: {}".format(grid.best_params_))
  print("Best cross-validation accuracy: {:.3f}".format(grid.best_score_))
  print("Test score: {:.3f}".format(grid.score(X_test, y_test)))

  '''
  Best parameters: {'svm__C': 1, 'svm__gamma': 1}
  Best cross-validation accuracy: 0.981
  Test score: 0.972
  '''
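Once the search has finished, it can be useful to inspect the fitted pipeline. A small sketch (our own addition): grid.best_estimator_ is the pipeline refit on the whole training set with the best parameters, and its named_steps attribute reaches the fitted SVC via the step name "svm":

  # Sketch: inspect the pipeline refit with the best parameters found
  print("Best estimator:\n{}".format(grid.best_estimator_))
  print("SVM step:\n{}".format(grid.best_estimator_.named_steps['svm']))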

  # Unlike the grid search we ran earlier, now for each split of the
  # cross-validation only the training part is used to fit the MinMaxScaler,
  # so no information from the test part leaks into the parameter search.

  mglearn.plots.plot_proper_processing()

In cross-validation, the magnitude of the impact of information leakage depends on the nature of the preprocessing step. Estimating the scale of the data using the test part usually has no dire effect, but using the test part for feature extraction or feature selection can lead to significant differences in results.
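To see how dramatic this can be, here is a hedged sketch (our own construction on purely random data, loosely following the book's feature-selection example): selecting features on the full dataset before cross-validation makes pure noise look informative, while doing the selection inside a pipeline does not:

  import numpy as np
  from sklearn.feature_selection import SelectPercentile, f_regression
  from sklearn.linear_model import Ridge
  from sklearn.model_selection import cross_val_score
  from sklearn.pipeline import Pipeline

  # Random features and a random target: there is no real signal to find
  rnd = np.random.RandomState(seed=0)
  X = rnd.normal(size=(100, 10000))
  y = rnd.normal(size=(100,))

  # Leaky: select features using ALL rows, then cross-validate
  select = SelectPercentile(score_func=f_regression, percentile=5).fit(X, y)
  X_selected = select.transform(X)
  print("Leaky CV R^2: {:.2f}".format(
      np.mean(cross_val_score(Ridge(), X_selected, y, cv=5))))

  # Correct: selection happens inside the pipeline, per cross-validation split
  pipe_fs = Pipeline([('select', SelectPercentile(score_func=f_regression, percentile=5)),
                      ('ridge', Ridge())])
  print("Pipeline CV R^2: {:.2f}".format(np.mean(cross_val_score(pipe_fs, X, y, cv=5))))

In the leaky version, the selected features correlate with y by chance across the full dataset, so the cross-validated R^2 typically looks deceptively good; the pipeline version stays near (or below) chance level.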

4. References

"python Machine Learning Basic Tutorial"
