Dataset description: the dataset contains 699 breast tumor records from a hospital in Wisconsin. Each record has the following fields:
- Sample code number: id number
- Clump Thickness: 1 - 10
- Uniformity of Cell Size: 1 - 10
- Uniformity of Cell Shape: 1 - 10
- Marginal Adhesion: 1 - 10
- Single Epithelial Cell Size: 1 - 10
- Bare Nuclei: 1 - 10
- Bland Chromatin: 1 - 10
- Normal Nucleoli: 1 - 10
- Mitoses: 1 - 10
- Class: 2 for benign, 4 for malignant
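
As a rough sketch of how these eleven fields map onto a table, the snippet below parses a few rows in the same comma-separated layout with pandas. The names `id` and `class` match the columns used by the program in this post; every other column name, and the three sample rows, are assumptions made for illustration rather than content taken from the original file.

```python
import io
import pandas as pd

# Hypothetical column names, one per field in the list above.
COLUMNS = [
    "id", "clump_thickness", "uniform_cell_size", "uniform_cell_shape",
    "marginal_adhesion", "single_epithelial_size", "bare_nuclei",
    "bland_chromatin", "normal_nucleoli", "mitoses", "class",
]

# Illustrative rows only (not real records); '?' marks a missing value.
RAW = """\
1,5,1,1,1,2,1,3,1,1,2
2,8,4,5,1,2,?,7,3,1,4
3,4,2,1,1,1,2,3,2,1,2
"""

# na_values="?" turns the placeholder into NaN while parsing.
df_demo = pd.read_csv(io.StringIO(RAW), names=COLUMNS, na_values="?")

# Map the numeric class codes to readable labels.
df_demo["label"] = df_demo["class"].map({2: "benign", 4: "malignant"})
print(df_demo[["id", "bare_nuclei", "class", "label"]])
```

Passing `names=` assumes the file has no header row; if it does have one, that argument should be dropped.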
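Program description: predicts whether a tumor is benign or malignant using the KNN algorithm and a support vector machine (SVM), implemented in Python.<br>
For the theory behind the algorithms, see: KNN algorithm, support vector machine (SVM)<br>
Ipynb demo file: Ipynb file<br>
Python code: Python code<br>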
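
    # Load and preprocess the raw dataset
    import pandas as pd

    df = pd.read_csv('breast-cancer-wisconsin.data.txt')
    # Missing values are marked with '?'; replace them with a sentinel value
    df.replace('?', -99999, inplace=True)
    df.fillna(-99999, inplace=True)
    # The sample id carries no predictive information, so drop it
    df.drop(columns=['id'], inplace=True)
    df.head()
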
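
The sentinel value -99999 keeps every row, but it places the affected samples far from all others in feature space, which distance-based models such as KNN may treat as extreme outliers. One possible alternative, sketched below on a small made-up frame, is to turn '?' into NaN and then either drop those rows or fill them with the column median. The frame `demo`, its column names, and its values are hypothetical.

```python
import pandas as pd

# Hypothetical miniature frame; 'bare_nuclei' contains a '?' placeholder.
demo = pd.DataFrame({
    "clump_thickness": [5, 8, 4, 6],
    "bare_nuclei": ["1", "?", "2", "10"],
    "class": [2, 4, 2, 4],
})

# errors="coerce" converts anything non-numeric (here '?') to NaN.
demo["bare_nuclei"] = pd.to_numeric(demo["bare_nuclei"], errors="coerce")

# Option 1: drop incomplete rows.
dropped = demo.dropna()

# Option 2: fill the gaps with the column median.
imputed = demo.fillna({"bare_nuclei": demo["bare_nuclei"].median()})

print(len(dropped), imputed["bare_nuclei"].tolist())
```

Which option suits the real data better depends on how many rows are affected; this sketch does not measure that.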

    # Split the dataset into a training set and a test set
    import numpy as np
    from sklearn.model_selection import train_test_split

    X = np.array(df.drop(columns=['class']))  # features
    y = np.array(df['class'])                 # labels: 2 = benign, 4 = malignant

    # Hold out 30% of the samples for testing
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

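
Because `train_test_split` shuffles randomly, the accuracy reported by the program will vary from run to run. A hedged sketch of a reproducible, class-balanced split follows; the synthetic `X_demo` and `y_demo` arrays and the seed 42 are placeholders, not part of the original program.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples, 9 features scored 1-10.
rng = np.random.default_rng(0)
X_demo = rng.integers(1, 11, size=(100, 9))
# 70 benign (2) and 30 malignant (4) labels.
y_demo = np.array([2] * 70 + [4] * 30)

# stratify keeps the class ratio the same in both parts;
# random_state fixes the shuffle so the split is repeatable.
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

print(Xd_train.shape, Xd_test.shape)
print((yd_test == 4).mean())
```

With a fixed seed, two runs of the script evaluate the classifiers on the same held-out samples, which makes the scores comparable.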

    # Choose, train, and test the algorithms
    import pickle
    from sklearn import svm, neighbors

    cls_dict = {
        'SVM-SVC': svm.SVC(),
        'KNN': neighbors.KNeighborsClassifier(),
    }

    for name, cls in cls_dict.items():
        try:
            # Reuse a previously saved model if one exists
            with open('%s.pickle' % name, 'rb') as f:
                cls = pickle.load(f)
        except Exception as e:
            # No usable saved model: train a new one and cache it
            cls.fit(X_train, y_train)
            print(e)
            with open('%s.pickle' % name, 'wb') as f:
                pickle.dump(cls, f)

        print("%s Algorithm Accuracy: %s" % (name, cls.score(X_test, y_test)))

        # Predict two new samples (nine feature values each)
        samples = np.array([[4, 2, 1, 1, 1, 2, 3, 2, 1], [4, 2, 1, 2, 2, 2, 3, 2, 1]])
        samples = samples.reshape(len(samples), -1)
        prediction = cls.predict(samples)
        print("%s Algorithm prediction: %s\n" % (name, prediction))

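
For intuition about what `KNeighborsClassifier` does at prediction time, here is a minimal k-nearest-neighbours sketch in NumPy: measure the Euclidean distance from a query point to every training sample, then take a majority vote among the k closest. The function `knn_predict` and the toy data are hypothetical, and scikit-learn's own implementation is considerably more elaborate (it can use tree-based indexing, for instance).

```python
from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    # Euclidean distance from the query to each training row.
    distances = np.linalg.norm(X_train - query, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Majority vote over the labels of those neighbours.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data: low scores labelled benign (2), high scores malignant (4).
X_toy = np.array([[1, 1], [2, 1], [1, 2], [8, 9], [9, 8], [10, 10]])
y_toy = np.array([2, 2, 2, 4, 4, 4])

print(knn_predict(X_toy, y_toy, np.array([2, 2])))
print(knn_predict(X_toy, y_toy, np.array([9, 9])))
```

An odd `k` avoids tied votes when there are only two classes, which is why 3 is used here rather than an even number.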