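Dataset description: the dataset contains 699 breast tumor records from the University of Wisconsin Hospitals. Each record has the following fields: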

  1. Sample code number: id number
  2. Clump Thickness: 1 - 10
  3. Uniformity of Cell Size: 1 - 10
  4. Uniformity of Cell Shape: 1 - 10
  5. Marginal Adhesion: 1 - 10
  6. Single Epithelial Cell Size: 1 - 10
  7. Bare Nuclei: 1 - 10
  8. Bland Chromatin: 1 - 10
  9. Normal Nucleoli: 1 - 10
  10. Mitoses: 1 - 10
  11. Class: 2 for benign, 4 for malignant
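
To make the field layout concrete, here is a minimal sketch that parses records in this format using only the standard library. The short names in `COLUMNS`, the helper `parse_record`, and the two sample rows are assumptions of this sketch: they are illustrative and do not replace the real data file.

```python
import csv
import io

# Hypothetical short names for the 11 fields listed above (an assumption of this sketch).
COLUMNS = [
    "id", "clump_thickness", "uniform_cell_size", "uniform_cell_shape",
    "marginal_adhesion", "single_epi_cell_size", "bare_nuclei",
    "bland_chromatin", "normal_nucleoli", "mitoses", "class",
]
CLASS_LABELS = {2: "benign", 4: "malignant"}


def parse_record(row):
    """Turn one CSV row into a dict; unknown values ('?') become None."""
    if len(row) != len(COLUMNS):
        raise ValueError("expected %d fields, got %d" % (len(COLUMNS), len(row)))
    record = {}
    for name, value in zip(COLUMNS, row):
        record[name] = None if value == "?" else int(value)
    record["label"] = CLASS_LABELS[record["class"]]
    return record


# Illustrative rows in the dataset's format, not a substitute for the real file.
SAMPLE = "1000025,5,1,1,1,2,1,3,1,1,2\n1057013,8,4,5,1,2,?,7,3,1,4\n"
records = [parse_record(row) for row in csv.reader(io.StringIO(SAMPLE))]
for r in records:
    print(r["id"], r["label"], r["bare_nuclei"])
```

Mapping the class codes to labels up front keeps the `2`/`4` encoding from being mistaken for a numeric feature later on.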

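Program description: predicts whether a tumor is benign or malignant using the KNN algorithm and a support vector machine (SVM), implemented in Python.<br>
For the theory behind the algorithms, see: KNN algorithm, Support Vector Machine (SVM)<br>
Ipynb demo file: Ipynb file<br>
Python code: Python code<br>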
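
As a rough illustration of what KNN does, the sketch below classifies a sample by majority vote among its k nearest training points under Euclidean distance. It is a simplified stand-in, not scikit-learn's implementation. The function `knn_predict` and the toy training points are hypothetical; the default `k=5` mirrors the default `n_neighbors` of `KNeighborsClassifier`.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, sample, k=5):
    """Majority vote among the k training points closest to sample."""
    # Pair each training point's distance to the sample with its label,
    # then sort so the nearest points come first.
    distances = sorted(
        (math.dist(x, sample), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy 2-feature points (made up): low values -> class 2, high values -> class 4.
train_X = [[1, 1], [2, 1], [1, 2], [8, 9], [9, 8], [10, 10]]
train_y = [2, 2, 2, 4, 4, 4]

print(knn_predict(train_X, train_y, [2, 2], k=3))  # nearest three are all class 2
print(knn_predict(train_X, train_y, [9, 9], k=3))  # nearest three are all class 4
```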

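'''Load and preprocess the raw dataset'''
import pandas as pd
# The code below relies on the file having a header row with 'id' and 'class' columns.
df = pd.read_csv('breast-cancer-wisconsin.data.txt')

# Replace every empty or unknown ('?') value with -99999.
df.replace('?', -99999, inplace=True)
df.fillna(-99999, inplace=True)

# Drop the id column: it has no bearing on whether a tumor is benign or malignant,
# and keeping it would severely distort the classification results.
df.drop(['id'], axis=1, inplace=True)

df.head()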
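
One consequence of the `-99999` placeholder is worth seeing in numbers: distance-based methods such as KNN treat a row containing it as an extreme outlier. The sketch below uses only the standard library, and all three feature rows are made up for illustration.

```python
import math

MISSING = -99999  # the placeholder used in the preprocessing step above

# Two made-up 9-feature rows that differ only in the sixth feature (Bare Nuclei).
complete_row = [5, 1, 1, 1, 2, 1, 3, 1, 1]
row_with_gap = [5, 1, 1, 1, 2, MISSING, 3, 1, 1]
# A made-up row that differs from complete_row in every feature.
different_row = [10, 10, 10, 10, 10, 10, 10, 10, 10]

d_gap = math.dist(complete_row, row_with_gap)
d_diff = math.dist(complete_row, different_row)

print("distance to row with placeholder:", d_gap)
print("distance to very different row:", round(d_diff, 2))
```

Because a row with the placeholder sits far from every real sample, it is unlikely to appear among another sample's nearest neighbors. That is one common rationale for a sentinel value; dropping or imputing such rows is an alternative.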
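'''Split the dataset into a training set and a test set'''
import numpy as np
X = np.array(df.drop(['class'], axis=1))  # features: every column except the label
y = np.array(df['class'])                 # labels: 2 = benign, 4 = malignant

from sklearn.model_selection import train_test_split
# Hold out 30% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)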
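
Conceptually, a 70/30 split shuffles the row indices and sets aside 30% of them for testing. The following standard-library sketch illustrates that idea only; it is not how `train_test_split` is implemented, and `simple_split` with its `seed` parameter is a hypothetical helper.

```python
import math
import random


def simple_split(n_samples, test_size=0.3, seed=None):
    """Return (train_indices, test_indices) after shuffling 0..n_samples-1."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_size * n_samples)  # round the test share up
    return indices[n_test:], indices[:n_test]


# 699 rows, as in the dataset described above.
train_idx, test_idx = simple_split(699, test_size=0.3, seed=0)
print(len(train_idx), len(test_idx))  # 489 210
```

With scikit-learn itself, passing `random_state` to `train_test_split` makes the split reproducible between runs, which helps when comparing accuracy figures.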
'''Select, train, and test the algorithms'''
import pickle

from sklearn import svm, neighbors
cls_dict = {
    'SVM-SVC': svm.SVC(),
    'KNN': neighbors.KNeighborsClassifier()
}

# Train and test each algorithm. To retune a model, manually delete its serialized .pickle file.
for name, cls in cls_dict.items():
    try:
        # Reuse a previously trained model if one has been saved.
        with open('%s.pickle' % name, 'rb') as f:
            cls = pickle.load(f)
    except Exception as e:
        # Train the algorithm
        cls.fit(X_train, y_train)
        print(e)

        # Serialize the trained model
        with open('%s.pickle' % name, 'wb') as f:
            pickle.dump(cls, f)

    # Test the algorithm
    print("%s Algorithm Accuracy: %s" % (name, cls.score(X_test, y_test)))

    # Predict the class of two new samples (9 features each)
    samples = np.array([[4, 2, 1, 1, 1, 2, 3, 2, 1], [4, 2, 1, 2, 2, 2, 3, 2, 1]])
    samples = samples.reshape(len(samples), -1)
    prediction = cls.predict(samples)
    print("%s Algorithm prediction: %s\n" % (name, prediction))
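
The load-or-train caching pattern used in the loop above can be isolated as follows. This is a sketch under assumptions: `train` and `load_or_train` are hypothetical helpers, a plain dict stands in for a real classifier, and a temporary directory is used so that nothing is written next to the script.

```python
import os
import pickle
import tempfile


def train(values):
    """Stand-in for model training: the 'model' is just a dict holding the mean."""
    return {'mean': sum(values) / len(values)}


def load_or_train(path, values):
    """Load a pickled model if the file exists; otherwise train and save one."""
    try:
        with open(path, 'rb') as f:
            return pickle.load(f), 'loaded'
    except FileNotFoundError:
        model = train(values)
        with open(path, 'wb') as f:
            pickle.dump(model, f)
        return model, 'trained'


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'mean.pickle')
    first, how_first = load_or_train(path, [1, 2, 3])
    # The second call ignores the new data because the cached file already exists.
    second, how_second = load_or_train(path, [10, 20, 30])

print(how_first, first['mean'])
print(how_second, second['mean'])
```

This is why the loop's comment asks you to delete the pickle file before retuning: as long as the file exists, new training data or parameters are ignored. Catching `FileNotFoundError` rather than every `Exception` is a choice made in this sketch so that unrelated errors are not silently treated as a cache miss.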