
Standardization, or mean removal and variance scaling

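Standardizing a dataset is a common requirement for many machine learning estimators: a model may perform poorly if individual features do not look more or less like standard normally distributed data (Gaussian with zero mean and unit variance).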

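In practice we often ignore the shape of the distribution and simply center the data by subtracting the mean of each feature, then scale it by dividing each non-constant feature by its standard deviation.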
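
The two steps can be sketched directly in NumPy. This is a minimal illustration, not the library's implementation: the names `safe_std` and `X_manual` are made up for this sketch, and the guard for constant features (leaving them unscaled) is an assumption about how one might avoid dividing by zero.

```python
import numpy as np

# Same toy matrix as the one used in this notebook:
# rows are samples, columns are features.
X = np.array([[1., -1.,  2.],
              [2.,  0.,  0.],
              [0.,  1., -1.]])

mean = X.mean(axis=0)  # per-feature mean
std = X.std(axis=0)    # per-feature (population) standard deviation

# Assumption for this sketch: a constant feature has std == 0,
# so replace that divisor with 1 and leave the feature unscaled.
safe_std = np.where(std == 0, 1.0, std)

# Center by subtracting the mean, then scale by the standard deviation.
X_manual = (X - mean) / safe_std
```

Each column of `X_manual` then has zero mean and unit standard deviation, which is the property the rest of this notebook checks with `StandardScaler`.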

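Many elements of the objective functions used by learning algorithms (such as the RBF kernel of support vector machines, or the l1 and l2 regularizers of linear models) assume that all features are centered around zero and have variances of the same order of magnitude. If one feature has a variance several orders of magnitude larger than the others, it may dominate the objective function and prevent the estimator from learning from the other features as expected.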
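
To make the "dominating feature" effect concrete, the sketch below looks at the squared Euclidean distance, which is the quantity an RBF kernel exp(-gamma * ||x - x'||^2) depends on. The data and the helper `sq_dist_share` are hypothetical, chosen only so that one feature lives on a scale of thousands while the other lives on a unit scale.

```python
import numpy as np

# Hypothetical data: feature 0 is on a unit scale, feature 1 is in the thousands.
X = np.array([[0.0, 1000.0],
              [1.0, 1010.0],
              [0.0, 3000.0],
              [1.0, 2990.0]])

def sq_dist_share(X, i, j):
    """Fraction of the squared distance between rows i and j due to each feature."""
    d2 = (X[i] - X[j]) ** 2
    return d2 / d2.sum()

# On the raw data, almost all of the distance between rows 0 and 1
# comes from feature 1, simply because its numbers are larger.
raw_share = sq_dist_share(X, 0, 1)

# After standardization, the distance between the same two rows
# is driven by feature 0, where they actually differ relative to its spread.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
std_share = sq_dist_share(X_std, 0, 1)

print('Raw share:', raw_share)
print('Standardized share:', std_share)
```

Before scaling, the second feature accounts for about 99% of the squared distance between the first two rows; after scaling, the first feature does. Any kernel or penalty built on such distances inherits this sensitivity to units.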

from sklearn import preprocessing
import numpy as np

Before the transformation

X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])
scaler = preprocessing.StandardScaler().fit(X_train)

print('Mean:', X_train.mean(axis=0))
print('Std:', X_train.std(axis=0))

Mean: [1.         0.         0.33333333]
Std: [0.81649658 0.81649658 1.24721913]

After the transformation

scaler = preprocessing.StandardScaler().fit(X_train)
X_scaled = scaler.transform(X_train)
X_scaled

array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])

print('Mean:', X_scaled.mean(axis=0))
print('Std:', X_scaled.std(axis=0))

Mean: [0. 0. 0.]
Std: [1. 1. 1.]
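
As a complement, the fitted scaler stores the statistics it learned in its `mean_` and `scale_` attributes, so the same transformation can be reused on data it has not seen. The sketch below assumes a made-up new sample `X_new`; it is only meant to show that `transform` amounts to subtracting `mean_` and dividing by `scale_`, and that `inverse_transform` undoes it.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])
scaler = StandardScaler().fit(X_train)

# Statistics learned from the training data.
print('mean_:', scaler.mean_)
print('scale_:', scaler.scale_)

# Hypothetical new sample, not part of the original notebook.
X_new = np.array([[-1., 1., 0.]])

# transform() reuses the training statistics; it does not refit.
X_new_scaled = scaler.transform(X_new)

# The same result computed by hand from the stored attributes.
X_new_manual = (X_new - scaler.mean_) / scaler.scale_

# inverse_transform() maps scaled values back to the original units.
X_new_back = scaler.inverse_transform(X_new_scaled)
```

Note that the scaled new sample does not need to have zero mean or unit variance: those properties hold only for the data the scaler was fitted on.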

Example

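In sklearn, make_pipeline chains the scaler and the model so that the transformation fitted on the training set is applied in exactly the same way to the test set.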

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Create a synthetic dataset for the experiment
X, y = make_classification(random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Build the pipeline with make_pipeline
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)  # apply scaling on training data

# Apply it to the test set
pipe.score(X_test, y_test)  # scale the test data with the statistics learned from the training data only

0.96
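
For comparison, here is a sketch of what the pipeline does on our behalf, written out by hand. It assumes the same synthetic dataset and default estimator settings as above; the names `clf`, `manual_score`, and `pipe_score` are introduced only for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Manual version: fit the scaler on the training split only ...
scaler = StandardScaler().fit(X_train)
clf = LogisticRegression().fit(scaler.transform(X_train), y_train)

# ... and reuse those training statistics to scale the test split.
manual_score = clf.score(scaler.transform(X_test), y_test)

# Pipeline version: the same two steps, bundled into one estimator.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)
pipe_score = pipe.score(X_test, y_test)

print(manual_score, pipe_score)
```

Both scores should agree. The pipeline is mainly a convenience and a safeguard: it becomes hard to accidentally fit the scaler on the test data, which matters even more once cross-validation is involved.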
