Data Preprocessing | Day 1

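A new series I started to practice reading English documentation and to review ML concepts.
Updated irregularly.
Written in English as much as possible.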

Step 1: Import the required Libraries

These two libraries are essential; we import them every time.

NumPy: a library that provides mathematical functions.

Pandas: a library used to import and manage data sets.

Step 2: Importing the Data Set

Data sets are generally available in .csv format. A CSV file stores tabular data in plain text. Each line of the file is a data record. We use the read_csv method of the pandas library to read a local CSV file as a dataframe. Then we build a separate matrix of independent variables and a vector of the dependent variable from the dataframe.
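A minimal sketch of this step, assuming a tiny in-memory CSV in place of Data.csv. The column names and the Purchased values here are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for Data.csv; column names and Purchased values are assumed.
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
"""

# read_csv accepts a path or any file-like object; StringIO avoids touching disk.
dataset = pd.read_csv(io.StringIO(csv_text))

X = dataset.iloc[:, :-1].values  # every column except the last -> feature matrix
y = dataset.iloc[:, -1].values   # last column only -> target vector

print(X.shape, y.shape)
```

Using -1 for the last column instead of a fixed index keeps the split valid if more feature columns are added later.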

Step 3: Handling the Missing Data

The data we get is rarely homogeneous. Data can be missing for various reasons and needs to be handled so that it does not reduce the performance of our machine learning model. We can replace the missing data with the mean or median of the entire column. We use the Imputer class of sklearn.preprocessing for this task.
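To make the idea concrete without any sklearn class, here is a small sketch of column-wise mean and median replacement with pandas. The frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data with one missing value per column.
df = pd.DataFrame({
    "Age": [44.0, 27.0, np.nan, 38.0],
    "Salary": [72000.0, np.nan, 54000.0, 61000.0],
})

# mean() and median() skip NaN by default, so only observed values are used.
filled_mean = df.fillna(df.mean())
filled_median = df.fillna(df.median())

print(filled_mean)
print(filled_median)
```

The median is less sensitive to outliers than the mean, which can matter for skewed columns such as salaries.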

Step 4: Encoding Categorical Data

Categorical data are variables that contain label values rather than numeric values. The number of possible values is often limited to a fixed set. Example values such as "Yes" and "No" cannot be used in the mathematical equations of the model, so we need to encode these variables as numbers. To achieve this, we import the LabelEncoder class from the sklearn.preprocessing library.
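A short sketch of what LabelEncoder does with "Yes"/"No" values. The sample list is invented for illustration.

```python
from sklearn.preprocessing import LabelEncoder

answers = ["Yes", "No", "No", "Yes"]  # illustrative labels

encoder = LabelEncoder()
codes = encoder.fit_transform(answers)

# classes_ holds the sorted unique labels; a label's code is its index there.
print(list(encoder.classes_))
print(list(codes))

# inverse_transform maps the integer codes back to the original labels.
print(list(encoder.inverse_transform(codes)))
```

Because the codes follow sorted order, they imply an ordering that usually has no meaning for the model, which is why one-hot encoding is applied to feature columns afterwards.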

Step 5: Splitting the dataset into test set and training set

We split the data set into two partitions: the training set, used to train the model, and the test set, used to test the performance of the trained model. The split is generally 80/20. We import the train_test_split() function from the sklearn.model_selection library.
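As a quick check of the 80/20 split, this sketch uses ten made-up samples and confirms that eight go to training and two to testing.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 illustrative samples, 2 features each
y = np.arange(10)

# test_size=0.2 holds out 20% of the rows; random_state makes the shuffle repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)
```

Every sample lands in exactly one of the two partitions, so the model is never tested on rows it was trained on.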

Step 6: Feature Scaling

Many machine learning algorithms use the Euclidean distance between two data points in their computations, so features that vary widely in magnitude, units, and range cause problems: high-magnitude features weigh more in the distance calculations than low-magnitude features. This is addressed by feature standardization, also called Z-score normalization. We import StandardScaler from sklearn.preprocessing.
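The sketch below illustrates both points with made-up age and salary values: the raw Euclidean distance is dominated by the salary column, and StandardScaler gives the same result as computing the z-score by hand.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: age (tens) and salary (tens of thousands); values are illustrative.
X = np.array([[44.0, 72000.0], [27.0, 48000.0], [30.0, 54000.0], [38.0, 61000.0]])

# Unscaled distance between the first two rows: the age gap of 17 barely matters.
raw_distance = np.linalg.norm(X[0] - X[1])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Same result by hand: z = (x - mean) / std, using the population std (ddof=0).
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(raw_distance)
print(X_scaled)
```

After scaling, each column has mean 0 and standard deviation 1, so both features contribute on a comparable scale.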

Code

import numpy as np
import pandas as pd

dataset = pd.read_csv("D:\\APP\\DataSet\\100-Days-Of-ML-Code-master\\datasets\\Data.csv")
X = dataset.iloc[ : , :-1].values
Y = dataset.iloc[ : , 3].values

from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = "NaN", strategy = "mean", axis = 0)
imputer = imputer.fit(X[ : , 1:3])
X[ : , 1:3] = imputer.transform(X[ : , 1:3])

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[ : , 0] = labelencoder_X.fit_transform(X[ : , 0])

onehotencoder = OneHotEncoder(categorical_features = [0])
X = onehotencoder.fit_transform(X).toarray()
labelencoder_Y = LabelEncoder()
Y =  labelencoder_Y.fit_transform(Y)

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split( X , Y , test_size = 0.2, random_state = 0)

from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

Analysis:

Code walkthrough

import numpy as np
import pandas as pd

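First, we use a simple data set. Every data set has two parts: independent variables and a dependent variable. The goal of machine learning is to predict the dependent variable from the independent variables.
Independent variables are not affected by other variables, while the dependent variable may be affected by the independent variables.

In this data set, Age and Salary (together with the country column) are independent variables, and we use them to predict whether a purchase is made. So Purchased is the dependent variable.

np is an alias for numpy, so afterwards we can call its functions directly through np.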

==>

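NumPy is an open-source numerical computing extension for Python.
It can store and process large matrices far more efficiently than Python's own nested lists.
As a rule of thumb, anything involving matrices goes through the numpy library.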

==>

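pandas was created for data analysis tasks. It builds on a number of libraries and standard data models and provides the tools needed to work efficiently with large data sets.
pandas offers many functions and methods that make data handling fast and convenient. It is one of the main reasons Python is a powerful and efficient environment for data analysis.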

==>

Reading files with pandas:

Path separator issues

file_path1 = 'D:/0Raw_data/ftm_p.csv'
file_path2 = 'D:\\0Raw_data\\ftm_p.csv'
file_path3 = r'D:\0Raw_data\ftm_p.csv'
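A small sketch showing why these three spellings are interchangeable and what goes wrong without them. PureWindowsPath is used only to compare the paths, so no file needs to exist; the broken variable is a deliberate counterexample.

```python
from pathlib import PureWindowsPath

file_path1 = 'D:/0Raw_data/ftm_p.csv'      # forward slashes
file_path2 = 'D:\\0Raw_data\\ftm_p.csv'    # escaped backslashes
file_path3 = r'D:\0Raw_data\ftm_p.csv'     # raw string, no escaping needed

# The escaped and raw spellings are the same string.
print(file_path2 == file_path3)

# Windows treats / and \ as the same separator, so the paths compare equal.
print(PureWindowsPath(file_path1) == PureWindowsPath(file_path2))

# Counterexample: without escaping, \0 becomes a NUL character and \f a form feed.
broken = 'D:\0Raw_data\ftm_p.csv'
print(repr(broken))
```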

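Chinese characters in the path

An error like the following is usually caused by Chinese characters in the path.

OSError: Initializing from file failed

The fix is to open the file with open() first and pass the file object to read_csv. Opening the file with open() first often resolves encoding problems directly as well:

file_path = 'D:/0Raw_data/zhaoyang_charge_sta/京AW7531'
with open(file_path) as f:  # the with block closes the file afterwards
    data = pd.read_csv(f)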

Encoding issues

Error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb9 in position 0: invalid start byte

Solution 1:

import pandas as pd
file_path = 'D:/0Raw_data/zhaoyang_charge_sta/京AW7531'
# assumption: the file is GBK-encoded; opening it as utf-8 would raise the same error
with open(file_path, encoding='gbk') as f:
    data = pd.read_csv(f)

Solution 2:

import pandas as pd
file_path = 'D:/0Raw_data/zhaoyang_charge_sta/京AW7531'
data = pd.read_csv(file_path, encoding='gbk')  # pass the file's actual encoding to read_csv
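To reproduce the encoding problem in isolation, this sketch writes a small GBK-encoded CSV to a temporary file, shows that decoding it as UTF-8 fails, and then reads it with the matching encoding. The file contents and column names are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Write a small GBK-encoded CSV to a temporary file (contents are made up).
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", encoding="gbk", delete=False
) as tmp:
    tmp.write("plate,charge\n京AW7531,42\n")
    path = tmp.name

try:
    # Decoding the GBK bytes as UTF-8 fails with UnicodeDecodeError.
    try:
        with open(path, encoding="utf-8") as f:
            f.read()
        utf8_failed = False
    except UnicodeDecodeError:
        utf8_failed = True

    # Passing the file's real encoding to read_csv works.
    data = pd.read_csv(path, encoding="gbk")
finally:
    os.remove(path)  # clean up the temporary file

print(utf8_failed)
print(data)
```

The fix is always to name the encoding the file was actually written in; GBK is only the assumed case here because it is common for files produced on Chinese-locale Windows systems.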

# create the independent-variable matrix
X = dataset.iloc[:, :-1].values  # the first colon selects all rows; the second selects all columns except the last one (Purchased)

# create the dependent-variable vector
Y = dataset.iloc[:, 3].values  # take only the last column (index 3) as the dependent variable

# handle missing data
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
imputer = imputer.fit(X[:, 1:3])   # ':' before the comma selects all rows; 1:3 after it selects columns [1, 3), i.e. columns 1 and 2
X[:, 1:3] = imputer.transform(X[:, 1:3])  # apply the imputer to the data

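sklearn.preprocessing.Imputer explained:

sklearn.preprocessing.Imputer(missing_values='NaN', strategy='mean', axis=0, verbose=0, copy=True)

missing_values: the placeholder for missing values. It can be an integer or NaN (the missing value numpy.nan is written as the string 'NaN'). The default is 'NaN'.

strategy: the replacement strategy, a string. The default is 'mean'.

(1) mean: replace with the mean of the feature column

(2) median: replace with the median of the feature column

(3) most_frequent: replace with the most frequent value (mode) of the feature column

axis: the axis to impute along. The default axis=0 imputes along columns; axis=1 imputes along rows.

copy: True means the original data set is not modified; False means the data is modified in place. In the following cases the data is not modified in place even when copy=False:

(1) X is not an array of floating-point values

(2) X is sparse and missing_values=0

(3) axis=0 and X is a CSR matrix

(4) axis=1 and X is a CSC matrix

statistics_ attribute: when axis=0, the array of fill values for each feature; when axis=1, accessing it raises an error because the attribute does not exist.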

Before imputation:

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, nan],
       ['France', 35.0, 58000.0],
       ['Spain', nan, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

After imputation:

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)
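The Imputer class was removed from sklearn.preprocessing in scikit-learn 0.22. As a sketch for newer versions, the same fill values can be reproduced with SimpleImputer from sklearn.impute, which always works column by column and has no axis parameter. The array below holds only the two numeric columns from the output above.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and Salary columns from the arrays above (the country column is left out).
numeric = np.array([
    [44.0, 72000.0], [27.0, 48000.0], [30.0, 54000.0], [38.0, 61000.0],
    [40.0, np.nan], [35.0, 58000.0], [np.nan, 52000.0], [48.0, 79000.0],
    [50.0, 83000.0], [37.0, 67000.0],
])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputed = imputer.fit_transform(numeric)

# statistics_ holds the per-column means that were used as fill values.
print(imputer.statistics_)
print(imputed[4], imputed[6])
```

The two filled cells match the values shown above: 63777.78 for the missing salary and 38.78 for the missing age.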

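The difference between fit(), transform(), and fit_transform() in sklearn preprocessing

fit(): computes the parameters (for example μ and σ) and stores them in the object's internal state.

The Imputer class defines the rule, and the imputer instance is fitted on the chosen range of data. The "models" here are very simple: nothing more than statistics such as the mean and the variance.

transform(): applies the transformation to a particular data set using the parameters computed by fit().

As I understand it, the relationship between fit and transform resembles that between training a model and applying it to data: transform is the applying step.

fit_transform(): combines fit() and transform() in a single call.

It computes the parameters and transforms the data in one step.

Note

Call fit_transform(trainData) first, and then transform(testData).
Calling transform(testData) directly, without fitting first, raises an error.
If you call fit_transform(testData) instead of transform(testData) after fit_transform(trainData), the test data is still scaled, but the two results are not on the same "standard" and differ noticeably. (Be sure to avoid this.)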
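The three cases in the note can be sketched with StandardScaler on made-up one-column data: transforming before fitting fails, transform() reuses the training statistics, and refitting on the test set produces values on a different scale.

```python
import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0], [4.0]])  # illustrative training data
test = np.array([[10.0], [20.0]])               # illustrative test data

# Case 1: transform() before fit() raises NotFittedError.
try:
    StandardScaler().transform(test)
    raised = False
except NotFittedError:
    raised = True

# Case 2: fit on the training data, then reuse those statistics on the test data.
scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)
test_scaled = scaler.transform(test)

# Case 3 (to avoid): refitting on the test set uses the test set's own mean and std.
test_refit = StandardScaler().fit_transform(test)

print(raised)
print(test_scaled.ravel())
print(test_refit.ravel())
```

In case 3 the test values come out as -1 and 1 regardless of how far they sit from the training data, which is exactly the mismatch of "standards" described in the note.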

==>

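What is one-hot encoding?

A one-hot code is an encoding with as many bits as there are states, in which exactly one bit is 1 and all the others are 0. For example:

Suppose there is a color feature with three values: red, yellow, and blue. Machine learning algorithms generally need the data in vector or numeric form, so you might set red=1, yellow=2, blue=3. This is label encoding: each category gets a label. However, it means the machine may learn that "red < yellow < blue", which is not what we intend. We only want the machine to tell the colors apart, not to compare their magnitudes.

So label encoding is not enough here, and a further conversion is needed. Because there are three color states, there are 3 bits: red is 1 0 0, yellow is 0 1 0, and blue is 0 0 1.

This way the distance between any two vectors is √2. All categories are equally far apart in the vector space, so no ordering is introduced, and algorithms based on vector-space metrics are largely unaffected.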
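The color example can be checked numerically. This sketch builds the three one-hot vectors from an identity matrix and computes every pairwise Euclidean distance.

```python
import numpy as np

colors = ["red", "yellow", "blue"]

# Row i of the identity matrix is the one-hot vector for colors[i].
one_hot = np.eye(len(colors), dtype=int)
encoding = dict(zip(colors, one_hot))

# Distance between every pair of distinct one-hot vectors.
distances = [
    np.linalg.norm(one_hot[i] - one_hot[j])
    for i in range(len(colors))
    for j in range(i + 1, len(colors))
]

print(encoding)
print(distances)
```

Two distinct one-hot vectors differ in exactly two positions, so the distance is always the square root of 1² + 1².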

OneHotEncoder and LabelEncoder: one-hot encoding and label encoding

First, understand the two kinds of features in machine learning: continuous features and discrete features.

Each raw feature must be normalized separately. For example, if feature A ranges over [-1000, 1000] and feature B ranges over [-1, 1], then in logistic regression with w1x1+w2x2, x1 is so large that x2 has almost no effect. So features must be normalized, and each feature is normalized on its own.

For continuous features:

Rescale bounded continuous features: rescale every bounded continuous input linearly to [-1, 1] through x = (2x - max - min)/(max - min).
Standardize all continuous features: for every continuous feature, compute its mean (u) and standard deviation (s) and apply x = (x - u)/s, so that the feature has mean 0 and variance 1.
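Both formulas are one line each in NumPy. The sketch below applies them to a made-up feature with the range [-1000, 1000].

```python
import numpy as np

x = np.array([-1000.0, -500.0, 0.0, 250.0, 1000.0])  # illustrative feature values

# Linear rescale to [-1, 1]: (2x - max - min) / (max - min)
x_rescaled = (2 * x - x.max() - x.min()) / (x.max() - x.min())

# Standardize: (x - u) / s, using the population standard deviation
x_standardized = (x - x.mean()) / x.std()

print(x_rescaled)
print(x_standardized.mean(), x_standardized.std())
```

The rescale maps the minimum to -1 and the maximum to 1, while standardization makes no promise about the range, only about the mean and variance.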

For discrete features:

Binarize categorical/discrete features: discrete features are generally one-hot encoded, using as many dimensions as the feature has distinct values.
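A sketch of the same idea with sklearn's OneHotEncoder, assuming scikit-learn 0.20 or later, where the encoder accepts string columns directly. The categorical_features argument used in the code above was removed in 0.22, so this example encodes a single column on its own; the country values are taken from the sample data.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One categorical column with three distinct values.
countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense.
encoded = encoder.fit_transform(countries).toarray()

print(encoder.categories_[0])  # the categories, in sorted order
print(encoded)
```

Three distinct values give three output columns, one per category in sorted order.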

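1. Variance is the mean of the squared differences between each data point and the mean of the data, denoted D. In probability theory and mathematical statistics, variance measures how far a random variable deviates from its mathematical expectation (that is, its mean). In many practical problems it is important to study this deviation.

2. The difference of two squares is an algebraic identity used both as a multiplication formula and for factorization. It expresses one square number minus another square number as a product: a²-b²=(a+b)(a-b)

3. Standard deviation is also commonly called the mean square deviation in Chinese-language texts, but it is different from the mean squared error. The mean squared error is the mean of the squared distances between each data point and the true value, that is, the mean of the squared errors. Its formula resembles the variance, and its square root is the root mean squared error, which is the quantity that resembles the standard deviation in form. The standard deviation is the square root of the mean of the squared deviations from the mean, denoted σ. Given a set of values X1, X2, X3, ..., XN (all real numbers) with arithmetic mean μ, the formula is σ = sqrt((1/N) · Σ (Xi - μ)²), with the sum taken over i = 1, ..., N.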
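The distinction between these quantities is easy to see in code. In this sketch both arrays are made up: x is the data, and truth plays the role of the true values that x is compared against for the mean squared error.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # illustrative data

mu = x.mean()
variance = np.mean((x - mu) ** 2)   # mean of squared deviations from the mean
sigma = np.sqrt(variance)           # standard deviation

# MSE compares against true values, not against the mean of the data.
truth = np.array([2.0, 4.0, 5.0, 4.0, 5.0, 6.0, 7.0, 8.0])  # hypothetical true values
mse = np.mean((x - truth) ** 2)
rmse = np.sqrt(mse)

print(mu, variance, sigma)
print(mse, rmse)
```

np.var(x) and np.std(x) compute the same variance and standard deviation, since they divide by N by default.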

I still need to catch up on probability theory, bit by bit...

Reference:

https://blog.csdn.net/appleyuchi/article/details/73503282

scikit-learn user guide