机器学习算法 Python 实现-FinClip官网

机器学习算法 Python 实现

网友投稿 988 2022-08-21

机器学习算法 Python 实现

机器学习算法Python实现

逻辑回归_手写数字识别_OneVsAll

六、PCA主成分分析（降维）

3、主成分分析PCA与线性回归的区别

6、主成分个数的选择（即要降的维度）

9、使用scikit-learn库中的PCA实现降维

七、异常检测 Anomaly Detection

1、高斯分布（正态分布）

3、评价的好坏，以及的选取

4、选择使用什么样的feature（单元高斯分布）

6、单元和多元高斯分布特点

一、线性回归

全部代码

1、代价函数

其中：

下面就是要求出theta，使代价最小，即代表我们拟合出来的方程距离真实值最近

共有m条数据，其中代表我们要拟合出来的方程到真实值距离的平方，平方的原因是因为可能有负值，正负可能会抵消

前面有系数 2 的原因是下面求梯度是对每个变量求偏导， 2 可以消去

实现代码：

# 计算代价函数

def computerCost(X,y,theta):

m = len(y)

J = 0

J = (np.transpose(X*theta-y))*(X*theta-y)/(2*m) #计算代价J

return J

注意这里的X是真实数据前加了一列1，因为有theta(0)

2、梯度下降算法

代价函数对求偏导得到：

所以对theta的更新可以写为：

其中为学习速率，控制梯度下降的速度，一般取 0.01,0.03,0.1,0.3.....

为什么梯度下降可以逐步减小代价函数

假设函数 f(x)

泰勒展开： f(x+-x)=f(x)+f'(x)*-x+o(-x)

令： -x=-α*f'(x) ,即负梯度方向乘以一个很小的步长 α

将 -x 代入泰勒展开式中： f(x+x)=f(x)-α*[f'(x)]²+o(-x)

可以看出， α 是取得很小的正数， [f'(x)]² 也是正数，所以可以得出： f(x+-x)<=f(x)

所以沿着负梯度放下，函数在减小，多维情况一样。

实现代码

# 梯度下降算法

def gradientDescent(X,y,theta,alpha,num_iters):

m = len(y)

n = len(theta)

temp = np.matrix(np.zeros((n,num_iters))) # 暂存每次迭代计算的theta，转化为矩阵形式

J_history = np.zeros((num_iters,1)) #记录每次迭代计算的代价值

for i in range(num_iters): # 遍历迭代次数

h = np.dot(X,theta) # 计算内积，matrix可以直接乘

temp[:,i] = theta - ((alpha/m)*(np.dot(np.transpose(X),h-y))) #梯度的计算

theta = temp[:,i]

J_history[i] = computerCost(X,y,theta) #调用计算代价函数

print '.',

return theta,J_history

3、均值归一化

目的是使数据都缩放到一个范围内，便于使用梯度下降算法

其中为所有此feture数据的平均值

可以是最大值-最小值，也可以是这个feature对应的数据的标准差

实现代码：

# 归一化feature

def featureNormaliza(X):

X_norm = np.array(X) #将X转化为numpy数组对象，才可以进行矩阵的运算

#定义所需变量

mu = np.zeros((1,X.shape[1]))

sigma = np.zeros((1,X.shape[1]))

mu = np.mean(X_norm,0) # 求每一列的平均值（0指定为列，1代表行）

sigma = np.std(X_norm,0) # 求每一列的标准差

for i in range(X.shape[1]): # 遍历列

X_norm[:,i] = (X_norm[:,i]-mu[i])/sigma[i] # 归一化

return X_norm,mu,sigma

注意预测的时候也需要均值归一化数据

4、最终运行结果

5、使用scikit-learn库中的线性模型实现

导入包

from sklearn import linear_model

from sklearn.preprocessing import StandardScaler #引入缩放的包

归一化

# 归一化操作

scaler = StandardScaler()

scaler.fit(X)

x_train = scaler.transform(X)

x_test = scaler.transform(np.array([1650,3]))

线性模型拟合

# 线性模型拟合

model = linear_model.LinearRegression()

model.fit(x_train, y)

预测

#预测结果

result = model.predict(x_test)

二、逻辑回归

全部代码

1、代价函数

可以看出，当趋于 1 ， y=1 ,与预测值一致，此时付出的代价 cost 趋于 0 ，若趋于 0 ， y=1 ,此时的代价 cost 值非常大，我们最终的目的是最小化代价值

2、梯度

3、正则化

目的是为了防止过拟合

在代价函数中加上一项

注意j是重1开始的，因为theta(0)为一个常数项，X中最前面一列会加上1列1，所以乘积还是theta(0),feature没有关系，没有必要正则化

正则化后的代价：

# 代价函数

def costFunction(initial_theta,X,y,inital_lambda):

m = len(y)

J = 0

h = sigmoid(np.dot(X,initial_theta)) # 计算h(z)

theta1 = initial_theta.copy() # 因为正则化j=1从1开始，不包含0，所以复制一份，前theta(0)值为0

theta1[0] = 0

temp = np.dot(np.transpose(theta1),theta1)

J = (-np.dot(np.transpose(y),np.log(h))-np.dot(np.transpose(1-y),np.log(1-h))+temp*inital_lambda/2)/m # 正则化的代价方程

return J

正则化后的代价的梯度

# 计算梯度

def gradient(initial_theta,X,y,inital_lambda):

m = len(y)

grad = np.zeros((initial_theta.shape[0]))

h = sigmoid(np.dot(X,initial_theta))# 计算h(z)

theta1 = initial_theta.copy()

theta1[0] = 0

grad = np.dot(np.transpose(X),h-y)/m+inital_lambda/m*theta1 #正则化的梯度

return grad

4、S型函数（即）

实现代码：

# S型函数

def sigmoid(z):

h = np.zeros((len(z),1)) # 初始化，与z的长度一置

h = 1.0/(1.0+np.exp(-z))

return h

5、映射为多项式

因为数据的feture可能很少，导致偏差大，所以创造出一些feture结合

eg:映射为2次方的形式:

实现代码：

# 映射为多项式

def mapFeature(X1,X2):

degree = 3; # 映射的最高次方

out = np.ones((X1.shape[0],1)) # 映射后的结果数组（取代X）

'''

这里以degree=2为例，映射为1,x1,x2,x1^2,x1,x2,x2^2

'''

for i in np.arange(1,degree+1):

for j in range(i+1):

temp = X1**(i-j)*(X2**j) #矩阵直接乘相当于matlab中的点乘.*

out = np.hstack((out, temp.reshape(-1,1)))

return out

6、使用 scipy 的优化方法

梯度下降使用 scipy 中 optimize 中的 fmin_bfgs 函数

调用scipy中的优化算法fmin_bfgs（拟牛顿法Broyden-Fletcher-Goldfarb-Shanno

costFunction是自己实现的一个求代价的函数，

initial_theta表示初始化的值,

fprime指定costFunction的梯度

args是其余测参数，以元组的形式传入，最后会将最小化costFunction的theta返回

result = optimize.fmin_bfgs(costFunction, initial_theta, fprime=gradient, args=(X,y,initial_lambda))

7、运行结果

8、使用scikit-learn库中的逻辑回归模型实现

导入包

from sklearn.linear_model import LogisticRegression

from sklearn.preprocessing import StandardScaler

from sklearn.cross_validation import train_test_split

import numpy as np

划分训练集和测试集

# 划分为训练集和测试集

x_train,x_test,y_train,y_test = train_test_split(X,y,test_size=0.2)

归一化

# 归一化

scaler = StandardScaler()

scaler.fit(x_train)

x_train = scaler.fit_transform(x_train)

x_test = scaler.fit_transform(x_test)

逻辑回归

#逻辑回归

model = LogisticRegression()

model.fit(x_train,y_train)

预测

# 预测

predict = model.predict(x_test)

right = sum(predict == y_test)

predict = np.hstack((predict.reshape(-1,1),y_test.reshape(-1,1))) # 将预测值和真实值放在一块，好观察

print predict

print ('测试集准确率：%f%%'%(right*100.0/predict.shape[0])) #计算在测试集上的准确度

逻辑回归_手写数字识别_OneVsAll

全部代码

1、随机显示100个数字

# 显示100个数字

def display_data(imgData):

sum = 0

'''

显示100个数（若是一个一个绘制将会非常慢，可以将要画的数字整理好，放到一个矩阵中，显示这个矩阵即可）

- 初始化一个二维数组

- 将每行的数据调整成图像的矩阵，放进二维数组

- 显示即可

'''

pad = 1

display_array = -np.ones((pad+10*(20+pad),pad+10*(20+pad)))

for i in range(10):

for j in range(10):

display_array[pad+i*(20+pad):pad+i*(20+pad)+20,pad+j*(20+pad):pad+j*(20+pad)+20] = (imgData[sum,:].reshape(20,20,order="F")) # order=F指定以列优先，在matlab中是这样的，python中需要指定，默认以行

sum += 1

plt.imshow(display_array,cmap='gray') #显示灰度图像

plt.axis('off')

plt.show()

2、OneVsAll

3、手写数字识别

# 求每个分类的theta，最后返回所有的all_theta

def oneVsAll(X,y,num_labels,Lambda):

# 初始化变量

m,n = X.shape

all_theta = np.zeros((n+1,num_labels)) # 每一列对应相应分类的theta,共10列

X = np.hstack((np.ones((m,1)),X)) # X前补上一列1的偏置bias

class_y = np.zeros((m,num_labels)) # 数据的y对应0-9，需要映射为0/1的关系

initial_theta = np.zeros((n+1,1)) # 初始化一个分类的theta

# 映射y

for i in range(num_labels):

class_y[:,i] = np.int32(y==i).reshape(1,-1) # 注意reshape(1,-1)才可以赋值

#np.savetxt("class_y.csv", class_y[0:600,:], delimiter=',')

'''遍历每个分类，计算对应的theta值'''

for i in range(num_labels):

result = optimize.fmin_bfgs(costFunction, initial_theta, fprime=gradient, args=(X,class_y[:,i],Lambda)) # 调用梯度下降的优化方法

all_theta[:,i] = result.reshape(1,-1) # 放入all_theta中

all_theta = np.transpose(all_theta)

return all_theta

4、预测

之前说过，预测的结果是一个概率值，利用学习出来的 theta 代入预测的 S型函数中，每行的最大值就是是某个数字的最大概率，所在的列号就是预测的数字的真实值,因为在分类时，所有为 0 的将 y 映射在第一列，为1的映射在第二列，依次类推

实现代码：

# 预测

def predict_oneVsAll(all_theta,X):

m = X.shape[0]

num_labels = all_theta.shape[0]

p = np.zeros((m,1))

X = np.hstack((np.ones((m,1)),X)) #在X最前面加一列1

h = sigmoid(np.dot(X,np.transpose(all_theta))) #预测

'''

返回h中每一行最大值所在的列号

- np.max(h, axis=1)返回h中每一行的最大值（是某个数字的最大概率）

- 最后where找到的最大概率所在的列号（列号即是对应的数字）

'''

p = np.array(np.where(h[0,:] == np.max(h, axis=1)[0]))

for i in np.arange(1, m):

t = np.array(np.where(h[i,:] == np.max(h, axis=1)[i]))

p = np.vstack((p,t))

return p

5、运行结果

6、使用scikit-learn库中的逻辑回归模型实现

1、导入包

from scipy import io as spio

import numpy as np

from sklearn import svm

from sklearn.linear_model import LogisticRegression

2、加载数据

data = loadmat_data("data_digits.mat")

X = data['X'] # 获取X数据，每一行对应一个数字20x20px

y = data['y'] # 这里读取mat文件y的shape=(5000, 1)

y = np.ravel(y) # 调用sklearn需要转化成一维的(5000,)

3、拟合模型

model = LogisticRegression()

model.fit(X, y) # 拟合

4、预测

predict = model.predict(X) #预测

print u"预测准确度为：%f%%"%np.mean(np.float64(predict == y)*100)

三、BP神经网络

全部代码

1、神经网络model

2、代价函数

假设最后输出的，即代表输出层有K个单元

其中，代表第 i 个单元输出

与逻辑回归的代价函数差不多，就是累加上每个输出（共有K个输出）

3、正则化

# 代价函数

def nnCostFunction(nn_params,input_layer_size,hidden_layer_size,num_labels,X,y,Lambda):

length = nn_params.shape[0] # theta的中长度

# 还原theta1和theta2

Theta1 = nn_params[0:hidden_layer_size*(input_layer_size+1)].reshape(hidden_layer_size,input_layer_size+1)

Theta2 = nn_params[hidden_layer_size*(input_layer_size+1):length].reshape(num_labels,hidden_layer_size+1)

# np.savetxt("Theta1.csv",Theta1,delimiter=',')

m = X.shape[0]

class_y = np.zeros((m,num_labels)) # 数据的y对应0-9，需要映射为0/1的关系

# 映射y

for i in range(num_labels):

class_y[:,i] = np.int32(y==i).reshape(1,-1) # 注意reshape(1,-1)才可以赋值

'''去掉theta1和theta2的第一列，因为正则化时从1开始'''

Theta1_colCount = Theta1.shape[1]

Theta1_x = Theta1[:,1:Theta1_colCount]

Theta2_colCount = Theta2.shape[1]

Theta2_x = Theta2[:,1:Theta2_colCount]

# 正则化向theta^2

term = np.dot(np.transpose(np.vstack((Theta1_x.reshape(-1,1),Theta2_x.reshape(-1,1)))),np.vstack((Theta1_x.reshape(-1,1),Theta2_x.reshape(-1,1))))

'''正向传播,每次需要补上一列1的偏置bias'''

a1 = np.hstack((np.ones((m,1)),X))

z2 = np.dot(a1,np.transpose(Theta1))

a2 = sigmoid(z2)

a2 = np.hstack((np.ones((m,1)),a2))

z3 = np.dot(a2,np.transpose(Theta2))

h = sigmoid(z3)

'''代价'''

J = -(np.dot(np.transpose(class_y.reshape(-1,1)),np.log(h.reshape(-1,1)))+np.dot(np.transpose(1-class_y.reshape(-1,1)),np.log(1-h.reshape(-1,1)))-Lambda*term/2)/m

return np.ravel(J)

4、反向传播BP

上面正向传播可以计算得到 J(θ) ,使用梯度下降法还需要求它的梯度

BP反向传播的目的就是求代价函数的梯度

假设4层的神经网络, 记为--> l 层第 j 个单元的误差

《===》（向量化）

没有，因为对于输入没有误差

因为S型函数的导数为：，所以上面的和可以在前向传播中计算出来

反向传播计算梯度的过程为：

（是大写的）

for i=1-m: - -正向传播计算（l=2,3,4...L） -反向计算、 ... ； - -

最后，即得到代价函数的梯度

实现代码：

# 梯度

def nnGradient(nn_params,input_layer_size,hidden_layer_size,num_labels,X,y,Lambda):

length = nn_params.shape[0]

Theta1 = nn_params[0:hidden_layer_size*(input_layer_size+1)].reshape(hidden_layer_size,input_layer_size+1).copy() # 这里使用copy函数，否则下面修改Theta的值，nn_params也会一起修改

Theta2 = nn_params[hidden_layer_size*(input_layer_size+1):length].reshape(num_labels,hidden_layer_size+1).copy()

m = X.shape[0]

class_y = np.zeros((m,num_labels)) # 数据的y对应0-9，需要映射为0/1的关系

# 映射y

for i in range(num_labels):

class_y[:,i] = np.int32(y==i).reshape(1,-1) # 注意reshape(1,-1)才可以赋值

'''去掉theta1和theta2的第一列，因为正则化时从1开始'''

Theta1_colCount = Theta1.shape[1]

Theta1_x = Theta1[:,1:Theta1_colCount]

Theta2_colCount = Theta2.shape[1]

Theta2_x = Theta2[:,1:Theta2_colCount]

Theta1_grad = np.zeros((Theta1.shape)) #第一层到第二层的权重

Theta2_grad = np.zeros((Theta2.shape)) #第二层到第三层的权重

'''正向传播，每次需要补上一列1的偏置bias'''

a1 = np.hstack((np.ones((m,1)),X))

z2 = np.dot(a1,np.transpose(Theta1))

a2 = sigmoid(z2)

a2 = np.hstack((np.ones((m,1)),a2))

z3 = np.dot(a2,np.transpose(Theta2))

h = sigmoid(z3)

'''反向传播，delta为误差，'''

delta3 = np.zeros((m,num_labels))

delta2 = np.zeros((m,hidden_layer_size))

for i in range(m):

#delta3[i,:] = (h[i,:]-class_y[i,:])*sigmoidGradient(z3[i,:]) # 均方误差的误差率

delta3[i,:] = h[i,:]-class_y[i,:] # 交叉熵误差率

Theta2_grad = Theta2_grad+np.dot(np.transpose(delta3[i,:].reshape(1,-1)),a2[i,:].reshape(1,-1))

delta2[i,:] = np.dot(delta3[i,:].reshape(1,-1),Theta2_x)*sigmoidGradient(z2[i,:])

Theta1_grad = Theta1_grad+np.dot(np.transpose(delta2[i,:].reshape(1,-1)),a1[i,:].reshape(1,-1))

Theta1[:,0] = 0

Theta2[:,0] = 0

'''梯度'''

grad = (np.vstack((Theta1_grad.reshape(-1,1),Theta2_grad.reshape(-1,1)))+Lambda*np.vstack((Theta1.reshape(-1,1),Theta2.reshape(-1,1))))/m

return np.ravel(grad)

5、BP可以求梯度的原因

6、梯度检查

检查利用 BP 求的梯度是否正确

利用导数的定义验证：

求出来的数值梯度应该与BP求出的梯度非常接近

验证BP正确后就不需要再执行验证梯度的算法了

实现代码：

# 检验梯度是否计算正确

def checkGradient(Lambda = 0):

'''构造一个小型的神经网络验证，因为数值法计算梯度很浪费时间，而且验证正确后之后就不再需要验证了'''

input_layer_size = 3

hidden_layer_size = 5

num_labels = 3

m = 5

initial_Theta1 = debugInitializeWeights(input_layer_size,hidden_layer_size);

initial_Theta2 = debugInitializeWeights(hidden_layer_size,num_labels)

X = debugInitializeWeights(input_layer_size-1,m)

y = 1+np.transpose(np.mod(np.arange(1,m+1), num_labels))# 初始化y

y = y.reshape(-1,1)

nn_params = np.vstack((initial_Theta1.reshape(-1,1),initial_Theta2.reshape(-1,1))) #展开theta

'''BP求出梯度'''

grad = nnGradient(nn_params, input_layer_size, hidden_layer_size,

num_labels, X, y, Lambda)

'''使用数值法计算梯度'''

num_grad = np.zeros((nn_params.shape[0]))

step = np.zeros((nn_params.shape[0]))

e = 1e-4

for i in range(nn_params.shape[0]):

step[i] = e

loss1 = nnCostFunction(nn_params-step.reshape(-1,1), input_layer_size, hidden_layer_size,

num_labels, X, y,

Lambda)

loss2 = nnCostFunction(nn_params+step.reshape(-1,1), input_layer_size, hidden_layer_size,

num_labels, X, y,

Lambda)

num_grad[i] = (loss2-loss1)/(2*e)

step[i]=0

# 显示两列比较

res = np.hstack((num_grad.reshape(-1,1),grad.reshape(-1,1)))

print res

7、权重的随机初始化

神经网络不能像逻辑回归那样初始化 theta 为 0 ,因为若是每条边的权重都为0，每个神经元都是相同的输出，在反向传播中也会得到同样的梯度，最终只会预测一种结果。

所以应该初始化为接近0的数

实现代码

# 随机初始化权重theta

def randInitializeWeights(L_in,L_out):

W = np.zeros((L_out,1+L_in)) # 对应theta的权重

epsilon_init = (6.0/(L_out+L_in))**0.5

W = np.random.rand(L_out,1+L_in)*2*epsilon_init-epsilon_init # np.random.rand(L_out,1+L_in)产生L_out*(1+L_in)大小的随机矩阵

return W

8、预测

正向传播预测结果

实现代码

# 预测

def predict(Theta1,Theta2,X):

m = X.shape[0]

num_labels = Theta2.shape[0]

#p = np.zeros((m,1))

'''正向传播，预测结果'''

X = np.hstack((np.ones((m,1)),X))

h1 = sigmoid(np.dot(X,np.transpose(Theta1)))

h1 = np.hstack((np.ones((m,1)),h1))

h2 = sigmoid(np.dot(h1,np.transpose(Theta2)))

'''

返回h中每一行最大值所在的列号

- np.max(h, axis=1)返回h中每一行的最大值（是某个数字的最大概率）

- 最后where找到的最大概率所在的列号（列号即是对应的数字）

'''

#np.savetxt("h2.csv",h2,delimiter=',')

p = np.array(np.where(h2[0,:] == np.max(h2, axis=1)[0]))

for i in np.arange(1, m):

t = np.array(np.where(h2[i,:] == np.max(h2, axis=1)[i]))

p = np.vstack((p,t))

return p

9、输出结果

四、SVM支持向量机

1、代价函数

2、Large Margin

3、SVM Kernel（核函数）

4、使用 scikit-learn 中的 SVM 模型代码

全部代码

线性可分的,指定核函数为 linear ：

'''data1——线性分类'''

data1 = spio.loadmat('data1.mat')

X = data1['X']

y = data1['y']

y = np.ravel(y)

plot_data(X,y)

model = svm.SVC(C=1.0,kernel='linear').fit(X,y) # 指定核函数为线性核函数

非线性可分的，默认核函数为 rbf

'''data2——非线性分类'''

data2 = spio.loadmat('data2.mat')

X = data2['X']

y = data2['y']

y = np.ravel(y)

plt = plot_data(X,y)

plt.show()

model = svm.SVC(gamma=100).fit(X,y) # gamma为核函数的系数，值越大拟合的越好

5、运行结果

五、K-Means聚类算法

全部代码

1、聚类过程

# 找到每条数据距离哪个类中心最近

def findClosestCentroids(X,initial_centroids):

m = X.shape[0] # 数据条数

K = initial_centroids.shape[0] # 类的总数

dis = np.zeros((m,K)) # 存储计算每个点分别到K个类的距离

idx = np.zeros((m,1)) # 要返回的每条数据属于哪个类

'''计算每个点到每个类中心的距离'''

for i in range(m):

for j in range(K):

dis[i,j] = np.dot((X[i,:]-initial_centroids[j,:]).reshape(1,-1),(X[i,:]-initial_centroids[j,:]).reshape(-1,1))

'''返回dis每一行的最小值对应的列号，即为对应的类别

- np.min(dis, axis=1)返回每一行的最小值

- np.where(dis == np.min(dis, axis=1).reshape(-1,1)) 返回对应最小值的坐标

- 注意：可能最小值对应的坐标有多个，where都会找出来，所以返回时返回前m个需要的即可（因为对于多个最小值，属于哪个类别都可以）

'''

dummy,idx = np.where(dis == np.min(dis, axis=1).reshape(-1,1))

return idx[0:dis.shape[0]] # 注意截取一下

计算类中心实现代码：

# 计算类中心

def computerCentroids(X,idx,K):

n = X.shape[1]

centroids = np.zeros((K,n))

for i in range(K):

centroids[i,:] = np.mean(X[np.ravel(idx==i),:], axis=0).reshape(1,-1) # 索引要是一维的,axis=0为每一列，idx==i一次找出属于哪一类的，然后计算均值

return centroids

2、目标函数

3、聚类中心的选择

随机初始化，从给定的数据中随机抽取K个作为聚类中心

随机一次的结果可能不好，可以随机多次，最后取使代价函数最小的作为中心

实现代码：(这里随机一次)

# 初始化类中心--随机取K个点作为聚类中心

def kMeansInitCentroids(X,K):

m = X.shape[0]

m_arr = np.arange(0,m) # 生成0-m-1

centroids = np.zeros((K,X.shape[1]))

np.random.shuffle(m_arr) # 打乱m_arr顺序

rand_indices = m_arr[:K] # 取前K个

centroids = X[rand_indices,:]

return centroids

4、聚类个数K的选择

5、应用——图片压缩

将图片的像素分为若干类，然后用这个类代替原来的像素值

执行聚类的算法代码：

# 聚类算法

def runKMeans(X,initial_centroids,max_iters,plot_process):

m,n = X.shape # 数据条数和维度

K = initial_centroids.shape[0] # 类数

centroids = initial_centroids # 记录当前类中心

previous_centroids = centroids # 记录上一次类中心

idx = np.zeros((m,1)) # 每条数据属于哪个类

for i in range(max_iters): # 迭代次数

print u'迭代计算次数：%d'%(i+1)

idx = findClosestCentroids(X, centroids)

if plot_process: # 如果绘制图像

plt = plotProcessKMeans(X,centroids,previous_centroids) # 画聚类中心的移动过程

previous_centroids = centroids # 重置

centroids = computerCentroids(X, idx, K) # 重新计算类中心

if plot_process: # 显示最终的绘制结果

plt.show()

return centroids,idx # 返回聚类中心和数据属于哪个类

6、使用scikit-learn库中的线性模型实现聚类

导入包

from sklearn.cluster import KMeans

使用模型拟合数据

model = KMeans(n_clusters=3).fit(X) # n_clusters指定3类，拟合数据

聚类中心

centroids = model.cluster_centers_ # 聚类中心

7、运行结果

六、PCA主成分分析（降维）

全部代码

1、用处

数据压缩（Data Compression）,使程序运行更快

可视化数据，例如 3D-->2D 等

......

2、2D-->1D，nD-->kD

3、主成分分析PCA与线性回归的区别

线性回归是找 x 与 y 的关系，然后用于预测 y

PCA 是找一个投影面，最小化data到这个投影面的投影误差

4、PCA降维过程

数据预处理（均值归一化）

公式：

就是减去对应feature的均值，然后除以对应特征的标准差（也可以是最大值-最小值）

实现代码：

# 归一化数据

def featureNormalize(X):

'''（每一个数据-当前列的均值）/当前列的标准差'''

n = X.shape[1]

mu = np.zeros((1,n));

sigma = np.zeros((1,n))

mu = np.mean(X,axis=0)

sigma = np.std(X,axis=0)

for i in range(n):

X[:,i] = (X[:,i]-mu[i])/sigma[i]

return X,mu,sigma

计算协方差矩阵Σ （Covariance Matrix）：

注意这里的 Σ 和求和符号不同

协方差矩阵对称正定（不理解正定的看看线代）

大小为 nxn , n 为 feature 的维度

实现代码：

Sigma = np.dot(np.transpose(X_norm),X_norm)/m # 求Sigma

# 映射数据

def projectData(X_norm,U,K):

Z = np.zeros((X_norm.shape[0],K))

U_reduce = U[:,0:K] # 取前K个

Z = np.dot(X_norm,U_reduce)

return Z

过程总结：

Sigma = X'*X/m

U,S,V = svd(Sigma)

Ureduce = U[:,0:k]

Z = Ureduce'*x

5、数据恢复

因为：

所以：（注意这里是X的近似值）

又因为 Ureduce 为正定矩阵，【正定矩阵满足：，所以：】，所以这里：

实现代码：

# 恢复数据

def recoverData(Z,U,K):

X_rec = np.zeros((Z.shape[0],U.shape[0]))

U_recude = U[:,0:K]

X_rec = np.dot(Z,np.transpose(U_recude)) # 还原数据（近似）

return X_rec

6、主成分个数的选择（即要降的维度）

如何选择

投影误差（project error）：

总变差（total variation）:

若误差率（error ratio）：，则称 99% 保留差异性

误差率一般取 1%，5%，10% 等

如何实现

若是一个个试的话代价太大

之前 U,S,V = svd(Sigma) ,我们得到了 S ，这里误差率error ratio:

可以一点点增加 K 尝试。

7、使用建议

不要使用PCA去解决过拟合问题 Overfitting ，还是使用正则化的方法（如果保留了很高的差异性还是可以的）

只有在原数据上有好的结果，但是运行很慢，才考虑使用PCA

8、运行结果

9、使用scikit-learn库中的PCA实现降维

导入需要的包：

#-*- coding: utf-8 -*-

# Author:bob

# Date:2016.12.22

import numpy as np

from matplotlib import pyplot as plt

from scipy import io as spio

from sklearn.decomposition import pca

from sklearn.preprocessing import StandardScaler

归一化数据

'''归一化数据并作图'''

scaler = StandardScaler()

scaler.fit(X)

x_train = scaler.transform(X)

使用PCA模型拟合数据，并降维

n_components 对应要将的维度

'''拟合数据'''

K=1 # 要降的维度

model = pca.PCA(n_components=K).fit(x_train) # 拟合数据，n_components定义要降的维度

Z = model.transform(x_train) # transform就会执行降维操作

数据恢复

model.components_ 会得到降维使用的 U 矩阵

'''数据恢复并作图'''

Ureduce = model.components_ # 得到降维用的Ureduce

x_rec = np.dot(Z,Ureduce) # 数据恢复

七、异常检测 Anomaly Detection

全部代码

1、高斯分布（正态分布） Gaussian distribution

2、异常检测算法

例子

训练集： ,其中

假设相互独立，建立model模型：

过程

选择具有代表异常的 feature :xi

参数估计：

计算 p(x) ,若是 P(x)<ε 则认为异常，其中 ε 为我们要求的概率的临界值 threshold

这里只是单元高斯分布，假设了 feature 之间是独立的，下面会讲到多元高斯分布，会自动捕捉到 feature 之间的关系

参数估计实现代码

# 参数估计函数（就是求均值和方差）

def estimateGaussian(X):

m,n = X.shape

mu = np.zeros((n,1))

sigma2 = np.zeros((n,1))

mu = np.mean(X, axis=0) # axis=0表示列，每列的均值

sigma2 = np.var(X,axis=0) # 求每列的方差

return mu,sigma2

3、评价 p(x) 的好坏，以及 ε 的选取

# 选择最优的epsilon，即：使F1Score最大

def selectThreshold(yval,pval):

'''初始化所需变量'''

bestEpsilon = 0.

bestF1 = 0.

F1 = 0.

step = (np.max(pval)-np.min(pval))/1000

'''计算'''

for epsilon in np.arange(np.min(pval),np.max(pval),step):

cvPrecision = pval

tp = np.sum((cvPrecision == 1) & (yval == 1)).astype(float) # sum求和是int型的，需要转为float

fp = np.sum((cvPrecision == 1) & (yval == 0)).astype(float)

fn = np.sum((cvPrecision == 1) & (yval == 0)).astype(float)

precision = tp/(tp+fp) # 精准度

recision = tp/(tp+fn) # 召回率

F1 = (2*precision*recision)/(precision+recision) # F1Score计算公式

if F1 > bestF1: # 修改最优的F1 Score

bestF1 = F1

bestEpsilon = epsilon

return bestEpsilon,bestF1

4、选择使用什么样的feature（单元高斯分布）

如果一些数据不是满足高斯分布的，可以变化一下数据，例如 log(x+C),x^(1/2) 等

如果 p(x) 的值无论异常与否都很大，可以尝试组合多个 feature ,(因为feature之间可能是有关系的)

5、多元高斯分布

# 多元高斯分布函数

def multivariateGaussian(X,mu,Sigma2):

k = len(mu)

if (Sigma2.shape[0]>1):

Sigma2 = np.diag(Sigma2)

'''多元高斯分布函数'''

X = X-mu

argu = (2*np.pi)**(-k/2)*np.linalg.det(Sigma2)**(-0.5)

p = argu*np.exp(-0.5*np.sum(np.dot(X,np.linalg.inv(Sigma2))*X,axis=1)) # axis表示每行

return p

6、单元和多元高斯分布特点

单元高斯分布

人为可以捕捉到 feature 之间的关系时可以使用

计算量小

多元高斯分布

自动捕捉到相关的feature

计算量大，因为：

m>n 或 Σ 可逆时可以使用。（若不可逆，可能有冗余的x，因为线性相关，不可逆，或者就是m

7、程序运行结果

tp = np.sum((cvPrecision == 1) & (yval == 1)).astype(float) # sum求和是int型的，需要转为float

fp = np.sum((cvPrecision == 1) & (yval == 0)).astype(float)

fn = np.sum((cvPrecision == 1) & (yval == 0)).astype(float)

precision = tp/(tp+fp) # 精准度

recision = tp/(tp+fn) # 召回率

F1 = (2*precision*recision)/(precision+recision) # F1Score计算公式

if F1 > bestF1: # 修改最优的F1 Score

bestF1 = F1

bestEpsilon = epsilon

return bestEpsilon,bestF1

4、选择使用什么样的feature（单元高斯分布）

如果一些数据不是满足高斯分布的，可以变化一下数据，例如 log(x+C),x^(1/2) 等

如果 p(x) 的值无论异常与否都很大，可以尝试组合多个 feature ,(因为feature之间可能是有关系的)

5、多元高斯分布

# 多元高斯分布函数

def multivariateGaussian(X,mu,Sigma2):

k = len(mu)

if (Sigma2.shape[0]>1):

Sigma2 = np.diag(Sigma2)

'''多元高斯分布函数'''

X = X-mu

argu = (2*np.pi)**(-k/2)*np.linalg.det(Sigma2)**(-0.5)

p = argu*np.exp(-0.5*np.sum(np.dot(X,np.linalg.inv(Sigma2))*X,axis=1)) # axis表示每行

return p

6、单元和多元高斯分布特点

单元高斯分布

人为可以捕捉到 feature 之间的关系时可以使用

计算量小

多元高斯分布

自动捕捉到相关的feature

计算量大，因为：

m>n 或 Σ 可逆时可以使用。（若不可逆，可能有冗余的x，因为线性相关，不可逆，或者就是m

7、程序运行结果

python基础代码让你了解Scipy（python代码介绍）

988 2022-08-21

机器学习算法 Python 实现

Python怎么读取图片文件？（Python读取图片）

python基础代码让你了解Scipy（python代码介绍）

最近发表

更多内容

小程序SDK

Finclip技术文档

小程序开发

小程序容器

小程序框架

Finclip小程序平台

Finclip用户投稿

车联网

推荐文章

小程序SDK是什么意思？小程序sdk和插件有什么区别？

小程序支付功能怎么实现？

企业app开发流程是什么？

app运营模式有哪些？

小程序多端引流怎么做？

小程序生态分析的机会和威胁

Flutter入门这一篇效率文章就够了

原生与跨平台解决方案分析,跨平台软件开发技术方案

热更新技术：让软件更新变得更加轻松快速

解决方案

银行解决方案

证券解决方案

互联网解决方案

政企OA解决方案

科技解决方案

loT解决方案

信任解决方案

热评文章

AppCan:基于混合模式的移动应用开发,移动混合模

Hybrid App混合模式开发的了解

小程序容器技术助力券商数字营销突围，小程序容器化的意

用mpvue开发微信小程序基础知识（vue.js开发

小程序多端框架全面测评对比，强烈推荐！

券商app架构 - 解析券商应用程序的构建与设计