数据清洗 (data cleaning) 是机器学习和深度学习进入算法步前的一项重要任务,我平时比较习惯使用的 7 个步骤,总结如下:
今天使用泰坦尼克数据集,完整介绍以上 7 步的具体操作过程。
这不废话吗,第一步就是读入数据。
data_raw = pd.read_csv('../input/titanicdataset-traincsv/train.csv')
data_raw
结果:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q
891 rows × 12 columns
data_raw.info()
data_raw.describe(include='all')
结果:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
count 891.000000 891.000000 891.000000 891 891 714.000000 891.000000 891.000000 891 891.000000 204 889
unique NaN NaN NaN 891 2 NaN NaN NaN 681 NaN 147 3
top NaN NaN NaN Hakkarainen, Mr. Pekka Pietari male NaN NaN NaN 1601 NaN G6 S
freq NaN NaN NaN 1 577 NaN NaN NaN 7 NaN 4 644
mean 446.000000 0.383838 2.308642 NaN NaN 29.699118 0.523008 0.381594 NaN 32.204208 NaN NaN
std 257.353842 0.486592 0.836071 NaN NaN 14.526497 1.102743 0.806057 NaN 49.693429 NaN NaN
min 1.000000 0.000000 1.000000 NaN NaN 0.420000 0.000000 0.000000 NaN 0.000000 NaN NaN
25% 223.500000 0.000000 2.000000 NaN NaN 20.125000 0.000000 0.000000 NaN 7.910400 NaN NaN
50% 446.000000 0.000000 3.000000 NaN NaN 28.000000 0.000000 0.000000 NaN 14.454200 NaN NaN
75% 668.500000 1.000000 3.000000 NaN NaN 38.000000 1.000000 0.000000 NaN 31.000000 NaN NaN
max 891.000000 1.000000 3.000000 NaN NaN 80.000000 8.000000 6.000000 NaN 512.329200 NaN N
data1 = data_raw.copy(deep=True)
data1.isnull().sum()
结果:
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
Age 列 177 个空值,Cabin 687 个空值,一共才 891 行,估计没啥价值了!Embarked 2 个。
data1['Age'].fillna(data1['Age'].median(), inplace = True)
data1['Embarked'].fillna(data1['Embarked'].mode()[0], inplace = True)
data1.isnull().sum()
补全操作check:
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 0
dtype: int64
drop_column = ['PassengerId','Cabin', 'Ticket']
data1.drop(drop_column, axis=1, inplace = True)
增加一列FamilySize
data1['FamilySize'] = data1 ['SibSp'] + data1['Parch'] + 1
data1
打印结果:
Survived Pclass Name Sex Age SibSp Parch Fare Embarked FamilySize
0 0 3 Braund, Mr. Owen Harris male 22.0 1 0 7.2500 S 2
1 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 71.2833 C 2
2 1 3 Heikkinen, Miss. Laina female 26.0 0 0 7.9250 S 1
3 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 53.1000 S 2
4 0 3 Allen, Mr. William Henry male 35.0 0 0 8.0500 S 1
... ... ... ... ... ... ... ... ... ... ...
886 0 2 Montvila, Rev. Juozas male 27.0 0 0 13.0000 S 1
887 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 30.0000 S 1
888 0 3 Johnston, Miss. Catherine Helen "Carrie" female 28.0 1 2 23.4500 S 4
889 1 1 Behr, Mr. Karl Howell male 26.0 0 0 30.0000 C 1
890 0 3 Dooley, Mr. Patrick male 32.0 0 0 7.7500 Q 1
891 rows × 10 columns
再创建一列:
data1['IsAlone'] = np.where(data1['FamilySize'] > 1,0,1)
再创建一列:
data1['Title'] = data1['Name'].str.split(", ", expand=True)[1].str.split(".", expand=True)[0]
data1
结果:
Survived Pclass Name Sex Age SibSp Parch Fare Embarked FamilySize IsAlone Title
0 0 3 Braund, Mr. Owen Harris male 22.0 1 0 7.2500 S 2 0 Mr
1 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 71.2833 C 2 0 Mrs
2 1 3 Heikkinen, Miss. Laina female 26.0 0 0 7.9250 S 1 1 Miss
3 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 53.1000 S 2 0 Mrs
4 0 3 Allen, Mr. William Henry male 35.0 0 0 8.0500 S 1 1 Mr
... ... ... ... ... ... ... ... ... ... ... ... ...
886 0 2 Montvila, Rev. Juozas male 27.0 0 0 13.0000 S 1 1 Rev
887 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 30.0000 S 1 1 Miss
888 0 3 Johnston, Miss. Catherine Helen "Carrie" female 28.0 1 2 23.4500 S 4 0 Miss
889 1 1 Behr, Mr. Karl Howell male 26.0 0 0 30.0000 C 1 1 Mr
890 0 3 Dooley, Mr. Patrick male 32.0 0 0 7.7500 Q 1 1 Mr
891 rows × 12 columns
data1['FareCut'] = pd.qcut(data1['Fare'], 4)
data1['AgeCut'] = pd.cut(data1['Age'].astype(int), 6)
data1
结果:
Survived Pclass Name Sex Age SibSp Parch Fare Embarked FamilySize IsAlone Title FareCut AgeCut
0 0 3 Braund, Mr. Owen Harris male 22.0 1 0 7.2500 S 2 0 Mr (-0.001, 7.91] (13.333, 26.667]
1 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 71.2833 C 2 0 Mrs (31.0, 512.329] (26.667, 40.0]
2 1 3 Heikkinen, Miss. Laina female 26.0 0 0 7.9250 S 1 1 Miss (7.91, 14.454] (13.333, 26.667]
3 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 53.1000 S 2 0 Mrs (31.0, 512.329] (26.667, 40.0]
4 0 3 Allen, Mr. William Henry male 35.0 0 0 8.0500 S 1 1 Mr (7.91, 14.454] (26.667, 40.0]
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
886 0 2 Montvila, Rev. Juozas male 27.0 0 0 13.0000 S 1 1 Rev (7.91, 14.454] (26.667, 40.0]
887 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 30.0000 S 1 1 Miss (14.454, 31.0] (13.333, 26.667]
888 0 3 Johnston, Miss. Catherine Helen "Carrie" female 28.0 1 2 23.4500 S 4 0 Miss (14.454, 31.0] (26.667, 40.0]
889 1 1 Behr, Mr. Karl Howell male 26.0 0 0 30.0000 C 1 1 Mr (14.454, 31.0] (13.333, 26.667]
890 0 3 Dooley, Mr. Patrick male 32.0 0 0 7.7500 Q 1 1 Mr (-0.001, 7.91] (26.667, 40.0]
891 rows × 14 columns
使用 Sklearn 的 LabelEncoder
from sklearn.preprocessing import LabelEncoder
label = LabelEncoder()
data1['Sex_Code'] = label.fit_transform(data1['Sex'])
data1['Embarked_Code'] = label.fit_transform(data1['Embarked'])
data1['Title_Code'] = label.fit_transform(data1['Title'])
data1['AgeBin_Code'] = label.fit_transform(data1['AgeCut'])
data1['FareBin_Code'] = label.fit_transform(data1['FareCut'])
data1
结果 data1 选取某些列,算法模型终于能认出它们了,多不容易!
get_dummies 将长 DataFrame 变为宽 DataFrame:
pd.get_dummies(data1['Sex'])
结果:
female male
0 0 1
1 1 0
2 1 0
3 1 0
4 0 1
... ... ...
886 0 1
887 1 0
888 1 0
889 0 1
890 0 1
891 rows × 2 columns
而 LabelEncoder 编码后,仅仅是把 Female 编码为 0, male 编码为 1.
label.fit_transform(data1['Sex'])
0 1
1 0
2 0
3 0
4 1
..
886 1
887 0
888 0
889 1
890 1
Name: Sex_Code, Length: 891, dtype: int64
# Step 7: data cleaning check
data1[data1_x_alg].info()
print('-'*50)
data1_dummy.info()
结果:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
Sex_Code 891 non-null int64
Pclass 891 non-null int64
Embarked_Code 891 non-null int64
Title_Code 891 non-null int64
SibSp 891 non-null int64
Parch 891 non-null int64
Age 891 non-null float64
Fare 891 non-null float64
dtypes: float64(2), int64(6)
memory usage: 55.8 KB
--------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 29 columns):
Pclass 891 non-null int64
SibSp 891 non-null int64
Parch 891 non-null int64
Age 891 non-null float64
Fare 891 non-null float64
FamilySize 891 non-null int64
IsAlone 891 non-null int64
Sex_female 891 non-null uint8
Sex_male 891 non-null uint8
Embarked_C 891 non-null uint8
Embarked_Q 891 non-null uint8
Embarked_S 891 non-null uint8
Title_Capt 891 non-null uint8
Title_Col 891 non-null uint8
Title_Don 891 non-null uint8
Title_Dr 891 non-null uint8
Title_Jonkheer 891 non-null uint8
Title_Lady 891 non-null uint8
Title_Major 891 non-null uint8
Title_Master 891 non-null uint8
Title_Miss 891 non-null uint8
Title_Mlle 891 non-null uint8
Title_Mme 891 non-null uint8
Title_Mr 891 non-null uint8
Title_Mrs 891 non-null uint8
Title_Ms 891 non-null uint8
Title_Rev 891 non-null uint8
Title_Sir 891 non-null uint8
Title_the Countess 891 non-null uint8
dtypes: float64(2), int64(5), uint8(22)
memory usage: 68.0 KB