Predicting Benign vs. Malignant Breast Tumors

Dataset: 699 breast tumor records from a hospital in Wisconsin. Each record contains the following fields:

1. Sample code number: id number
2. Clump Thickness: 1 - 10
3. Uniformity of Cell Size: 1 - 10
4. Uniformity of Cell Shape: 1 - 10
5. Marginal Adhesion: 1 - 10
6. Single Epithelial Cell Size: 1 - 10
7. Bare Nuclei: 1 - 10
8. Bland Chromatin: 1 - 10
9. Normal Nucleoli: 1 - 10
10. Mitoses: 1 - 10
11. Class: 2 for benign, 4 for malignant
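
To make the record layout concrete, here is a small dependency-free sketch that parses rows in the field order listed above. It assumes the rows are comma-separated with no header line. Apart from `id` and `class`, which the code in this README relies on, the column names are hypothetical, and the two sample rows are made up for illustration.

```python
import csv
import io

# Column names are assumptions for illustration; only `id` and `class`
# are names the code in this README actually relies on.
COLUMNS = [
    "id", "clump_thickness", "uniform_cell_size", "uniform_cell_shape",
    "marginal_adhesion", "single_epithelial_size", "bare_nuclei",
    "bland_chromatin", "normal_nucleoli", "mitoses", "class",
]


def parse_rows(text):
    """Parse headerless comma-separated rows into dicts keyed by COLUMNS.

    Unknown values marked '?' become None; every other value becomes an int.
    """
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(COLUMNS):
            raise ValueError(
                "expected %d fields, got %d" % (len(COLUMNS), len(fields))
            )
        rows.append({
            name: (None if value.strip() == "?" else int(value))
            for name, value in zip(COLUMNS, fields)
        })
    return rows


# Two made-up rows in the same layout (not real records).
sample = "1000001,5,1,1,1,2,1,3,1,1,2\n1000002,8,10,10,8,7,?,9,7,1,4\n"
rows = parse_rows(sample)
print(rows[1]["bare_nuclei"], rows[1]["class"])
```

The second row shows how an unknown value (`?`) could be kept as an explicit missing marker before deciding how to handle it.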

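Program: predicts whether a tumor is benign or malignant using the KNN algorithm and a support vector machine (SVM), implemented in Python.</br>
For the theory behind the algorithms, see: KNN algorithm, support vector machine (SVM)</br>
Ipynb demo: Ipynb file</br>
Python code: Python code</br>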
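
For readers who want an intuition for KNN before the scikit-learn code, the following is a minimal from-scratch sketch, not the implementation this project uses. It assumes Euclidean distance and a simple majority vote. The function name `knn_predict` and the toy two-feature points are hypothetical; the labels follow the dataset's convention of 2 for benign and 4 for malignant.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=5):
    """Predict the label of `query` by majority vote among its k nearest
    training points, measured by Euclidean distance."""
    if k < 1 or k > len(train_X):
        raise ValueError("k must be between 1 and the number of training points")
    # Pair every training point's distance to the query with its label,
    # then sort so the closest points come first.
    distances = sorted(
        (math.dist(row, query), label) for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy two-feature points (illustrative only, not taken from the dataset).
train_X = [[1, 1], [2, 1], [1, 2], [8, 9], [9, 8], [10, 10]]
train_y = [2, 2, 2, 4, 4, 4]

print(knn_predict(train_X, train_y, [2, 2], k=3))  # → 2
print(knn_predict(train_X, train_y, [9, 9], k=3))  # → 4
```

The default of `k=5` mirrors the default `n_neighbors` of scikit-learn's `KNeighborsClassifier`, which the code below uses.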

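'''Load and preprocess the raw dataset'''
import pandas as pd
# The file is expected to start with a header row naming the columns,
# including `id` and `class`, which the code below refers to.
df = pd.read_csv('breast-cancer-wisconsin.data.txt')

# Replace empty or unknown ('?') values in every column with -99999.
df.replace('?', -99999, inplace=True)
df.fillna(-99999, inplace=True)

# Drop the id column: it carries no information about whether a tumor is
# benign or malignant, and keeping it would badly distort the classification.
df.drop(['id'], axis=1, inplace=True)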

df.head()

'''Split the dataset into a training set and a test set'''
import numpy as np
X = np.array(df.drop(['class'], axis=1))
y = np.array(df['class'])

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

'''Choose, train, and test the algorithms'''
from sklearn import svm, neighbors
cls_dict={
    'SVM-SVC':svm.SVC(),
    'KNN':neighbors.KNeighborsClassifier()
}

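# Train and test each algorithm. To retune an algorithm, delete its
# serialized model file by hand so that it is trained again.
import pickle

for name, cls in cls_dict.items():
    try:
        # Reuse a previously serialized model if one exists.
        with open('%s.pickle' % name, 'rb') as f:
            cls = pickle.load(f)
    except Exception as e:
        # No usable serialized model: train the algorithm.
        cls.fit(X_train, y_train)
        print(e)

        # Serialize the trained model.
        with open('%s.pickle' % name, 'wb') as f:
            pickle.dump(cls, f)

    # Test the algorithm.
    print("%s Algorithm Accuracy: %s" % (name, cls.score(X_test, y_test)))

    # Predict two new samples (nine feature values each, id and class excluded).
    samples = np.array([[4, 2, 1, 1, 1, 2, 3, 2, 1], [4, 2, 1, 2, 2, 2, 3, 2, 1]])
    samples = samples.reshape(len(samples), -1)
    prediction = cls.predict(samples)
    print("%s Algorithm prediction: %s\n" % (name, prediction))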
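
The load-or-train step in the loop above can be isolated into a small helper. The sketch below shows one way to do that with only the standard library. The names `load_or_train`, `train_v1`, and `train_v2` are hypothetical, and plain dicts stand in for trained models so that the example runs without scikit-learn.

```python
import os
import pickle
import tempfile


def load_or_train(path, train):
    """Return the object pickled at `path`; if the file does not exist,
    call `train()`, pickle its result to `path`, and return it."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        model = train()
        with open(path, "wb") as f:
            pickle.dump(model, f)
        return model


calls = []  # records which training functions actually ran


def train_v1():
    calls.append("v1")
    return {"k": 5}  # stand-in for a trained model


def train_v2():
    calls.append("v2")
    return {"k": 3}  # stand-in for a retuned model


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.pickle")
    first = load_or_train(path, train_v1)   # no file yet: trains and saves
    second = load_or_train(path, train_v2)  # file exists: train_v2 is skipped
    os.remove(path)                         # "delete the model file to retune"
    third = load_or_train(path, train_v2)   # file gone: trains again

print(first, second, third, calls)
```

With a real estimator, `train` could be something like `lambda: svm.SVC().fit(X_train, y_train)`, since scikit-learn's `fit` returns the estimator itself. One caveat of caching: `train_test_split` shuffles differently on each run unless `random_state` is set, so a model loaded from disk may have been trained on rows that now sit in the test set, which can inflate the reported accuracy.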