Python: The difference between fit(), transform(), and fit_transform() in sklearn data preprocessing

1 Overview

Note that these are the methods of the transformers used in scikit-learn data preprocessing:

  • fit(): calculates the parameters (for example, the mean μ and standard deviation σ) and saves them as attributes of the object.

Explanation: Simply put, fit() finds properties of the training set X such as its mean, variance, maximum, and minimum. It can be understood as a training process.

  • transform(): applies the transformation to a particular dataset using these calculated parameters.

Explanation: Building on fit(), transform() performs standardization, dimensionality reduction, normalization, or another operation, depending on which tool is used (PCA, StandardScaler, etc.).

  • fit_transform(): combines fit() and transform(), fitting on a dataset and transforming it in one call.

Explanation: fit_transform() is a combination of fit() and transform(), so it includes both training and transformation.
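
To make the three methods concrete, here is a minimal sketch using StandardScaler. The toy array and the variable names are made up for illustration; mean_ and scale_ are the attributes in which StandardScaler stores the learned mean and standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy training data: 4 samples, 2 features.
X_train = np.array([[1.0, 10.0],
                    [2.0, 20.0],
                    [3.0, 30.0],
                    [4.0, 40.0]])

# fit() only learns the parameters and returns the fitted scaler itself.
scaler = StandardScaler().fit(X_train)
learned_mean = scaler.mean_    # per-feature mean (mu)
learned_scale = scaler.scale_  # per-feature standard deviation (sigma)

# transform() applies (x - mu) / sigma using the stored parameters.
X_two_steps = scaler.transform(X_train)

# fit_transform() performs both steps in one call on a fresh scaler.
X_one_step = StandardScaler().fit_transform(X_train)

# Both routes should give the same scaled data.
same_result = np.allclose(X_two_steps, X_one_step)
print(learned_mean, learned_scale, same_result)
```

Other transformers store different parameters (PCA keeps its principal components, for example), but the split between learning in fit() and applying in transform() is the same.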

Both transform() and fit_transform() apply some kind of uniform processing to the data, such as standardizing it to zero mean and unit variance, scaling (mapping) it to a fixed interval, or normalizing it.
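
As a rough illustration of two of these kinds of processing, the sketch below standardizes a made-up single-feature column with StandardScaler and maps the same column to the interval [0, 1] with MinMaxScaler. The data and variable names are assumptions chosen only for the example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single-feature data.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Standardization: the result has mean 0 and standard deviation 1 per feature.
X_std = StandardScaler().fit_transform(X)
std_mean = X_std.mean(axis=0)
std_dev = X_std.std(axis=0)

# Scaling to a fixed interval: the smallest value maps to 0, the largest to 1.
X_minmax = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

print(std_mean, std_dev)
print(X_minmax.ravel())
```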

fit_transform(trainData) first fits on trainData, computing its overall statistics such as the mean, variance, maximum, and minimum (depending on the specific transformation), and then transforms trainData with them, for example to standardize or normalize it.

transform(testData) then applies the statistics learned from trainData (the same mean, variance, maximum, and minimum) to the remaining data, testData, which ensures that the training and test sets are processed in the same way. So it is generally used like this:

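  from sklearn.preprocessing import StandardScaler
  sc = StandardScaler()
  # X_train and X_test are assumed to be existing feature arrays
  X_train_std = sc.fit_transform(X_train)  # learn mean and std from the training set, then scale it
  X_test_std = sc.transform(X_test)        # reuse the training mean and std on the test set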


  • Call fit_transform(trainData) first (or fit(trainData) followed by transform(trainData)), then transform(testData).
  • If you call transform(testData) on a transformer that has never been fitted, scikit-learn raises a NotFittedError.
  • If you call fit_transform(testData) instead of transform(testData) after fit_transform(trainData), the test data is still scaled, but with its own statistics, so the two results are not under the same "standard" and differ noticeably. Avoid this.
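
The last point can be seen in a small sketch. The train and test arrays below are made up for illustration, with the test values deliberately far from the training range so the difference is obvious.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: the test values lie well outside the training range.
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
X_test = np.array([[10.0], [20.0], [30.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Correct: reuse the mean and standard deviation learned from X_train.
X_test_ok = scaler.transform(X_test)

# Mistake: refitting on X_test scales it with its own mean and std.
X_test_refit = StandardScaler().fit_transform(X_test)

print(X_test_ok.ravel())     # large positive values relative to the training data
print(X_test_refit.ravel())  # centered on 0, as if the test set were typical
```

With the refit, the test set looks centered and ordinary even though every test value is larger than anything seen in training, so a model trained on X_train_scaled would receive inputs on a different scale.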


2 Examples

For example, preprocessing with PCA:

  import numpy as np
  import pandas as pd
  from sklearn.decomposition import PCA
  from sklearn.exceptions import NotFittedError

  # ===== Case 1: fit_transform(), then transform() on the same data =====
  X1 = pd.DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'b', 'c'],
                    columns=['one', 'two', 'three'])
  pca = PCA(n_components=1)
  newData1 = pca.fit_transform(X1)  # fit on X1 and project it
  newData2 = pca.transform(X1)      # project X1 again with the already fitted PCA
  # newData1 and newData2 are consistent (equal up to floating-point error)

  # ===== Case 2: transform() without fitting first =====
  a = [[1, 2, 3], [5, 6, 7], [4, 5, 8]]
  X2 = pd.DataFrame(np.array(a), index=['a', 'b', 'c'],
                    columns=['one', 'two', 'three'])
  pca_new = PCA(n_components=1)
  try:
      pca_new.transform(X2)
  except NotFittedError as err:
      # No fit, direct transform. Error:
      # NotFittedError: This PCA instance is not fitted yet. Call 'fit' with
      # appropriate arguments before using this method.
      print(err)



Posted by tomboy78 on Tue, 31 May 2022 06:00:23 +0530