Ashing's Blog: Machine Learning (4): Data Standardization and Stochastic Gradient Descent

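This post follows the previous one on the adaptive linear neuron and gradient descent. It covers stochastic gradient descent (SGD) and data standardization.
For background on the adaptive linear neuron and gradient descent, read the post linked below first:
Machine Learning (3): Adaline Neuron and Gradient Descent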

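First, standardization is a feature scaling method. After standardization, every feature has a mean of 0 and a standard deviation of 1, the same mean and standard deviation as a standard normal distribution. The shape of the distribution is not changed, so the data does not itself become normally distributed. For example, to standardize the j-th feature of the samples x, subtract that feature's mean μ_j from each sample and divide by its standard deviation σ_j:

$$x'_j = \frac{x_j - \mu_j}{\sigma_j}$$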

NumPy's mean and std methods make standardization quick and simple:
X_std = np.copy(X)
X_std[:, 0] = (X[:, 0] - X[:, 0].mean()) / X[:, 0].std()
X_std[:, 1] = (X[:, 1] - X[:, 1].mean()) / X[:, 1].std()
print("Standardized feature 0: X_std[:, 0]", X_std[:, 0])
print("Standardized feature 1: X_std[:, 1]", X_std[:, 1])

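The result is shown in Figure 1 below. Feature 0 and feature 1 are the sepal length and petal length of the iris data used in the previous post. The raw data is in centimeters, and its values are much larger than 0.0 or 1.0, whereas the standardized values mostly fall within about ±2. Intuitively, values on this scale are easier to train on and compute with than the raw data, so standardizing the features generally gives better training performance.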

<Figure 1> Data before and after standardization
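
The same calculation can be applied to every feature column at once. The sketch below is only an illustration: `standardize` is a hypothetical helper name, and the array holds a few made-up rows on roughly the same scale as the iris features. It computes (value - mean) / standard deviation per column with `axis=0`.

```python
import numpy as np

def standardize(X):
    """Return a copy of X with each column scaled to mean 0 and std 1."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)      # per-feature mean
    sigma = X.std(axis=0)    # per-feature (population) standard deviation
    return (X - mu) / sigma

# Made-up rows in centimeters, for illustration only
X_demo = np.array([[5.0, 1.3],
                   [4.6, 1.6],
                   [6.3, 4.4],
                   [5.8, 4.1]])
X_demo_std = standardize(X_demo)
print(X_demo_std.mean(axis=0))  # close to 0 for every column
print(X_demo_std.std(axis=0))   # close to 1 for every column
```

When new samples arrive later, they should be scaled with the mean and standard deviation computed from the training data, not with their own.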

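Figure 2 below compares convergence speed with and without standardization, using the Adaline with gradient descent from the previous post. The upper plot uses standardized data: with a learning rate of 0.01, the cost converges to its minimum after 15 epochs. The lower plot uses unstandardized data: with a learning rate of 0.0001, it takes 150 epochs for the cost to converge. Why not use 0.01 for the unstandardized data as well? In this example, with unstandardized data and a learning rate of 0.01 the cost function fails to converge. As explained in the previous post, when the learning rate is too large each update overshoots the global minimum. Standardized data combined with a suitable learning rate therefore gives good training performance.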

<Figure 2> Convergence speed of training on unstandardized vs. standardized data
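
The effect can be reproduced in a few lines. This is a minimal sketch under stated assumptions: it uses synthetic two-class data on roughly the scale of the iris features rather than the real dataset, and `batch_gd_costs` is a hypothetical helper, so the exact numbers will differ from the figure. The pattern should be the same: with eta=0.01 the cost shrinks on standardized data and blows up on raw data, while the raw data needs a much smaller learning rate.

```python
import numpy as np

def batch_gd_costs(X, y, eta, n_iter):
    """Run batch gradient descent for Adaline and return the cost of each epoch."""
    w = np.zeros(1 + X.shape[1])           # w[0] is the bias weight
    costs = []
    for _ in range(n_iter):
        errors = y - (X @ w[1:] + w[0])    # errors over the whole batch
        w[1:] += eta * (X.T @ errors)
        w[0] += eta * errors.sum()
        costs.append(0.5 * (errors ** 2).sum())
    return costs

# Synthetic stand-in for the two iris features (an assumption, not the real data)
rng = np.random.default_rng(0)
X_raw = np.vstack([rng.normal([5.0, 1.5], [0.35, 0.2], size=(50, 2)),
                   rng.normal([5.9, 4.3], [0.5, 0.5], size=(50, 2))])
y_demo = np.array([-1] * 50 + [1] * 50)
X_scaled = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

costs_scaled = batch_gd_costs(X_scaled, y_demo, eta=0.01, n_iter=15)
costs_raw_big = batch_gd_costs(X_raw, y_demo, eta=0.01, n_iter=15)
costs_raw_small = batch_gd_costs(X_raw, y_demo, eta=0.0001, n_iter=150)

print("standardized, eta=0.01:  ", costs_scaled[0], "->", costs_scaled[-1])
print("raw data,     eta=0.01:  ", costs_raw_big[0], "->", costs_raw_big[-1])
print("raw data,     eta=0.0001:", costs_raw_small[0], "->", costs_raw_small[-1])
```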

Stochastic Gradient Descent (SGD)

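The gradient descent (GD) method from the previous post computes each update from the whole batch of samples: every step toward the global minimum requires another pass over all the samples, which becomes computationally expensive when the dataset is very large. An alternative is stochastic gradient descent (SGD).
SGD does not update the weights from the error accumulated over all samples, as the original rule does:

$$\Delta \mathbf{w} = \eta \sum_i \left(y^{(i)} - \phi(z^{(i)})\right) \mathbf{x}^{(i)}$$

Instead, it updates the weights incrementally, once for each sample:

$$\Delta \mathbf{w} = \eta \left(y^{(i)} - \phi(z^{(i)})\right) \mathbf{x}^{(i)}$$

Here η is the learning rate, y^(i) is the target of sample i, and φ(z^(i)) is the linear activation output for that sample.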
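
To make the difference concrete, here is a small sketch of the two update schemes side by side. The function names and the toy data are assumptions made for illustration: `batch_update` applies one update computed from all samples, while `sgd_pass` applies one update per sample, so later samples already see the weights changed by earlier ones.

```python
import numpy as np

def batch_update(w, X, y, eta):
    """One batch step: sum the errors over all samples, then update once."""
    w = w.copy()
    errors = y - (X @ w[1:] + w[0])
    w[1:] += eta * (X.T @ errors)
    w[0] += eta * errors.sum()
    return w

def sgd_pass(w, X, y, eta):
    """One pass of SGD: update the weights after every single sample."""
    w = w.copy()
    for xi, target in zip(X, y):
        error = target - (xi @ w[1:] + w[0])
        w[1:] += eta * error * xi
        w[0] += eta * error
    return w

# Toy data, chosen only for illustration
X_toy = np.array([[1.0, 2.0], [-1.0, 0.5], [0.5, -1.5]])
y_toy = np.array([1.0, -1.0, -1.0])
w0 = np.zeros(3)                                    # w0[0] is the bias weight

w_batch = batch_update(w0, X_toy, y_toy, eta=0.1)   # 1 update from 3 samples
w_sgd = sgd_pass(w0, X_toy, y_toy, eta=0.1)         # 3 updates, one per sample
print("batch:", w_batch)
print("sgd:  ", w_sgd)
```

With a single sample the two functions give the same result; they diverge as soon as a second sample is processed.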

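Because the weights are updated much more often, SGD usually converges faster than batch gradient descent, and the noise in its per-sample updates can help it escape shallow local minima on the way to the global minimum. To get good results from SGD, however, the data must be presented in random order rather than sorted order. This is why we reshuffle the training data at the start of every epoch, to avoid cycles.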
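
The key detail when shuffling is that the samples and their targets must be reordered with the same permutation, so each sample keeps its own label. A minimal sketch with made-up arrays (the variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)              # seeded only to make the demo repeatable
X_sorted = np.arange(10).reshape(5, 2)      # rows: [0 1], [2 3], [4 5], ...
y_sorted = np.array([0, 1, 2, 3, 4])        # label i belongs to row i

for epoch in range(3):
    order = rng.permutation(len(y_sorted))  # a fresh random order every epoch
    X_shuf, y_shuf = X_sorted[order], y_sorted[order]
    print(epoch, y_shuf)                    # the order changes, the pairing does not
```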
Another advantage of SGD is that it can be used for online learning: in a networked setting, the model can be trained on the fly as soon as new data arrives.
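
A rough sketch of what online learning looks like in code follows. `OnlineAdaline` is a hypothetical, stripped-down class written for this illustration, and the "stream" is simulated from two synthetic clusters. The point is only that each arriving sample triggers one weight update without restarting training; the `partial_fit` method of the `AdalineSGD` class in the complete program serves the same purpose.

```python
import numpy as np

class OnlineAdaline:
    """Tiny Adaline that learns from one sample at a time (illustrative helper)."""
    def __init__(self, n_features, eta=0.01):
        self.eta = eta
        self.w_ = np.zeros(1 + n_features)   # w_[0] is the bias weight

    def partial_fit(self, xi, target):
        """Update the weights from a single new sample, keeping what was learned so far."""
        error = target - (xi @ self.w_[1:] + self.w_[0])
        self.w_[1:] += self.eta * error * xi
        self.w_[0] += self.eta * error
        return self

    def predict(self, X):
        """Return class labels (-1 or 1) after the unit step."""
        return np.where(X @ self.w_[1:] + self.w_[0] >= 0.0, 1, -1)

# Simulated stream: two well-separated synthetic clusters arriving in random order
rng = np.random.default_rng(0)
X_stream = np.vstack([rng.normal(-1.0, 0.3, size=(50, 2)),
                      rng.normal(1.0, 0.3, size=(50, 2))])
y_stream = np.array([-1] * 50 + [1] * 50)
order = rng.permutation(len(y_stream))

model = OnlineAdaline(n_features=2)
for xi, target in zip(X_stream[order], y_stream[order]):
    model.partial_fit(xi, target)            # train as each sample "arrives"

accuracy = (model.predict(X_stream) == y_stream).mean()
print("accuracy after one pass over the stream:", accuracy)
```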
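Between batch gradient descent (GD) and stochastic gradient descent (SGD) there is a compromise called mini-batch learning: for example, each update is computed from a batch of 20 samples. Because it updates the weights more often, it converges faster than batch gradient descent, and it is the approach most commonly used in deep learning today.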
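
As a sketch of the mini-batch idea, the function below shuffles the data and then applies one Adaline update per group of `batch_size` samples. The function name and the synthetic data are assumptions for this illustration; the batch size of 20 follows the example in the text.

```python
import numpy as np

def minibatch_epoch(w, X, y, eta, batch_size, rng):
    """One epoch of mini-batch updates over a freshly shuffled dataset."""
    w = w.copy()
    order = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]      # indices of this mini-batch
        errors = y[idx] - (X[idx] @ w[1:] + w[0])
        w[1:] += eta * (X[idx].T @ errors)         # one update per mini-batch
        w[0] += eta * errors.sum()
    return w

# Synthetic, roughly standardized two-class data (an assumption for the demo)
rng = np.random.default_rng(0)
X_mb = np.vstack([rng.normal(-1.0, 0.5, size=(50, 2)),
                  rng.normal(1.0, 0.5, size=(50, 2))])
y_mb = np.array([-1.0] * 50 + [1.0] * 50)

w_mb = np.zeros(3)                                 # w_mb[0] is the bias weight
for epoch in range(10):
    w_mb = minibatch_epoch(w_mb, X_mb, y_mb, eta=0.01, batch_size=20, rng=rng)

cost_start = 0.5 * (y_mb ** 2).sum()               # cost with all-zero weights
cost_end = 0.5 * ((y_mb - (X_mb @ w_mb[1:] + w_mb[0])) ** 2).sum()
print(cost_start, "->", cost_end)
```

Setting `batch_size` to the number of samples turns this into batch gradient descent, and setting it to 1 turns it into SGD.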

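Figure 3 below again trains on the iris dataset, using standardized data, the same learning rate of 0.01, and the default shuffle=True. In this example the cost of each epoch is defined as the average cost over the samples. The cost converges to its minimum by around epoch 6 or 7.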

<Figure 3> Classifying the iris dataset with stochastic gradient descent

<Complete example program for stochastic gradient descent:>

from numpy.random import seed
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import ListedColormap

# Plot the decision regions
def plot_decision_regions(X, y, classifier, resolution=0.02):

 # setup marker generator and color map
 markers = ('s', 'x', 'o', '^', 'v')
 colors = ('red', 'green', 'lightgreen', 'gray', 'cyan')
 #np.unique =>Find the unique elements of an array
 cmap = ListedColormap(colors[:len(np.unique(y))])
 # plot the decision surface
 x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1 #feature 1
 x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1 #feature 2

 xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
         np.arange(x2_min, x2_max, resolution))
 Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
 Z = Z.reshape(xx1.shape)

 plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
 plt.xlim(xx1.min(), xx1.max())
 plt.ylim(xx2.min(), xx2.max())
 # plot class samples
 for idx, cl in enumerate(np.unique(y)):
  #idx=0,1  ;cl=-1,1
  plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1],
     alpha=0.8, color=cmap(idx),
     marker=markers[idx], label=cl)
     
# Adaptive linear neuron (Adaline) with stochastic gradient descent
class AdalineSGD(object):
 """ADAptive LInear NEuron classifier.

 Parameters
 ------------
 eta : float
  Learning rate (between 0.0 and 1.0)
 n_iter : int
  Passes over the training dataset.
 shuffle : bool (default: True)
  Shuffles training data every epoch if True to prevent cycles.
 random_state : int (default: None)
  Set random state for shuffling.

 Attributes
 -----------
 w_ : 1d-array
  Weights after fitting.
 cost_ : list
  Sum-of-squares cost function value averaged over all
  training samples in each epoch.

 """
 def __init__(self, eta=0.01, n_iter=10, shuffle=True, random_state=None):
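  self.eta = eta                # learning rate
  self.n_iter = n_iter          # number of epochs
  self.w_initialized = False    # whether self.w_ has been initialized
  self.shuffle = shuffle        # whether to shuffle the data every epoch
  if random_state is not None:  # "is not None" so that a seed of 0 is honored too
   seed(random_state)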
  
 def fit(self, X, y):
  """ Fit training data.
  Parameters
  ----------
  X : {array-like}, shape = [n_samples, n_features]
   Training vectors, where n_samples is the number of samples and
   n_features is the number of features.
  y : array-like, shape = [n_samples]
   Target values.
  Returns
  -------
  self : object
  """
  self._initialize_weights(X.shape[1])    # weights start at 0; X.shape[1] is the number of features (columns) of X
  self.cost_ = []            # per-epoch cost history
  for i in range(self.n_iter):     # run n_iter epochs
   if self.shuffle:          # if enabled, shuffle the samples X together with their targets y
    X, y = self._shuffle(X, y)
   cost = []
   for xi, target in zip(X, y):
    cost.append(self._update_weights(xi, target))  # update the weights and record this sample's cost
   avg_cost = sum(cost) / len(y)   # average cost for this epoch
   self.cost_.append(avg_cost)   # append it to the cost history
  return self

 def partial_fit(self, X, y):
  """Fit training data without reinitializing the weights"""
  if not self.w_initialized:
   self._initialize_weights(X.shape[1])
  if y.ravel().shape[0] > 1:
   for xi, target in zip(X, y):
    self._update_weights(xi, target)
  else:
   self._update_weights(X, y)
  return self

 def _shuffle(self, X, y):
  """Shuffle training data"""
  r = np.random.permutation(len(y))  # random permutation of the indices 0..len(y)-1
  return X[r], y[r]        # X and y reordered by the same permutation
 
 def _initialize_weights(self, m):
  """Initialize weights to zeros"""
  self.w_ = np.zeros(1 + m)     # the extra 1 is the bias weight
  self.w_initialized = True
  
 def _update_weights(self, xi, target):
  """Apply Adaline learning rule to update the weights"""
  output = self.net_input(xi)
  error = (target - output)
  #print("xi=",xi,"target=",target)
  self.w_[1:] += self.eta * xi.dot(error)
  self.w_[0] += self.eta * error    # self.w_[0] is the bias weight; its input is fixed at 1
  cost = 0.5 * error**2
  return cost
 
 def net_input(self, X):
  """Calculate net input"""
  return np.dot(X, self.w_[1:]) + self.w_[0]

 def activation(self, X):
  """Compute linear activation"""
  return self.net_input(X)

 def predict(self, X):
  """Return class label after unit step"""
  return np.where(self.activation(X) >= 0.0, 1, -1)

# Load the iris data
df = pd.read_csv('iris.data', header=None)
# Column 4 is the iris species: the first 50 rows are Iris-setosa, the next 50 Iris-versicolor, the last 50 Iris-virginica
# Use only the first two classes (rows 0-99) for this exercise; y.shape=(100,)
y = df.iloc[0:100, 4].values
# Label Iris-setosa as -1 and Iris-versicolor as 1
y = np.where(y == 'Iris-setosa', -1, 1)
# Take columns 0 and 2 (sepal length and petal length) as features 0 and 1 of the input X
# X.shape=(100,2)
X = df.iloc[0:100, [0, 2]].values
print("Raw feature 0:", X[:, 0])
print("Raw feature 1:", X[:, 1])
# Standardize the data to speed up training: standardized = (raw - mean) / standard deviation
X_std = np.copy(X)
X_std[:, 0] = (X[:, 0] - X[:, 0].mean()) / X[:, 0].std()
X_std[:, 1] = (X[:, 1] - X[:, 1].mean()) / X[:, 1].std() 
print("Standardized feature 0: X_std[:, 0]", X_std[:, 0])
print("Standardized feature 1: X_std[:, 1]", X_std[:, 1])

# Use the AdalineSGD classifier with n_iter epochs, learning rate eta=0.01, and random seed random_state
ada = AdalineSGD(n_iter=20, eta=0.01, shuffle=True,random_state=1)
# Fit on the standardized samples X_std and the target outputs y
ada.fit(X_std, y)

# Plot the decision regions
plt.subplot(211)
plot_decision_regions(X_std, y, classifier=ada)
plt.title('Adaline - Stochastic Gradient Descent')
plt.xlabel('sepal length [standardized]')
plt.ylabel('petal length [standardized]')
plt.legend(loc='upper left')
plt.tight_layout()

# Plot the average cost per epoch
plt.subplot(212)
plt.plot(range(1, len(ada.cost_) + 1), ada.cost_, marker='o')
plt.xlabel('Epochs')
plt.ylabel('Average Cost')
plt.tight_layout()
plt.show()

<References> Book: Python Machine Learning, by Sebastian Raschka

https://github.com/rasbt/python-machine-learning-book


Wednesday, February 15, 2017
