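The most practical way to truly master an algorithm is to write it out completely by hand.
LSTM (Long Short-Term Memory) is a special kind of recurrent neural network whose neurons retain a memory of past inputs. It addresses a limitation of statistical NLP methods, which can only consider the most recent n words and ignore everything earlier. Applications include word representation (embedding, i.e. word vectors), sequence-to-sequence learning (predicting a sentence from an input sentence), machine translation, and speech recognition.
A binary adder in a little over 100 lines of plain Python: https://iamtrask.github.io/20... , translation at http://blog.csdn.net/zzukun/a... . Note that this first example has a recurrent hidden layer but no gates, so strictly speaking it is a plain recurrent network; the complete LSTM follows later.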
import copy, numpy as np
np.random.seed(0)
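Start by importing numpy for matrix operations; copy is used later to snapshot the hidden layer.
def sigmoid(x):
    output = 1/(1+np.exp(-x))
    return output
This declares the sigmoid activation function, a basic building block of neural networks. Common activation functions include sigmoid, tanh, and relu; sigmoid maps to (0, 1) and tanh maps to (-1, 1). Here x is a vector, and the returned output is a vector too.
def sigmoid_output_to_derivative(output):
    return output*(1-output)
This declares the derivative of sigmoid, expressed in terms of the sigmoid's output and not its input.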
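To see why the derivative can be written from the output alone, it may help to compare it with a finite-difference estimate. The snippet below is a minimal sketch: the test points in x and the step eps are arbitrary choices made for this illustration.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_output_to_derivative(output):
    # Takes sigmoid(x), not x, and returns d sigmoid / dx at that point.
    return output * (1 - output)

x = np.linspace(-3.0, 3.0, 7)   # arbitrary test points
eps = 1e-6                      # finite-difference step

# Central difference approximation of the derivative.
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
analytic = sigmoid_output_to_derivative(sigmoid(x))

max_gap = np.max(np.abs(numeric - analytic))
print(max_gap)
```

The two estimates should agree to many decimal places, which is what lets the training loop reuse the stored layer outputs during backpropagation.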
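The idea behind the adder: binary addition adds the numbers bit by bit and carries a 1 whenever a column reaches two. Training draws random samples with c = a + b; feeding in a and b and producing c is the whole prediction process. What training learns are the transformation matrices and weights that map the bits of a and b to the bits of c, that is, the neural network itself.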
int2binary = {}
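Declare a dictionary that maps integers to their binary representation. Building it once up front is faster than converting on every lookup.
binary_dim = 8
largest_number = pow(2,binary_dim)
Declare the number of binary digits, 8. Eight bits can represent 2^8 = 256 distinct values (0 to 255); that count is largest_number.
binary = np.unpackbits(
    np.array([range(largest_number)],dtype=np.uint8).T,axis=1)
for i in range(largest_number):
    int2binary[i] = binary[i]
Precompute and store the integer-to-binary lookup table.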
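The lookup table can be inspected on its own. The sketch below rebuilds it and adds binary2int, a hypothetical inverse helper that is not part of the original code, to show that each row stores the bits most significant first.

```python
import numpy as np

binary_dim = 8
largest_number = 2 ** binary_dim

# Same table as in the article: row i holds the 8 bits of i, most significant first.
binary = np.unpackbits(
    np.arange(largest_number, dtype=np.uint8).reshape(-1, 1), axis=1)
int2binary = {i: binary[i] for i in range(largest_number)}

def binary2int(bits):
    """Hypothetical inverse helper: fold a most-significant-first bit array into an int."""
    value = 0
    for bit in bits:
        value = value * 2 + int(bit)
    return value

five = int2binary[5]
print(five)              # bit pattern stored for 5
print(binary2int(five))  # back to the integer
```

Because the most significant bit comes first, the training loop indexes with binary_dim - position - 1 to walk from the least significant bit upwards, the order in which carries propagate.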
alpha = 0.1
input_dim = 2
hidden_dim = 16
output_dim = 1
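Set the hyperparameters. alpha is the learning rate. input_dim is the input layer dimension; each time step feeds in one bit of a and one bit of b, so it is 2. hidden_dim is the hidden layer dimension, i.e. the number of hidden neurons. output_dim is the output layer dimension; each step outputs one bit of c, so it is 1. The input-to-hidden weight matrix is 2×16, the hidden-to-output weight matrix is 16×1, and the hidden-to-hidden weight matrix is 16×16: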
synapse_0 = 2*np.random.random((input_dim,hidden_dim)) - 1
synapse_1 = 2*np.random.random((hidden_dim,output_dim)) - 1
synapse_h = 2*np.random.random((hidden_dim,hidden_dim)) - 1
np.random.random draws floats uniformly from [0, 1), so 2x-1 rescales them to [-1, 1).
synapse_0_update = np.zeros_like(synapse_0)
synapse_1_update = np.zeros_like(synapse_1)
synapse_h_update = np.zeros_like(synapse_h)
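Declare three matrices that accumulate the weight updates (the deltas).
for j in range(10000):
Run 10000 training iterations.
    a_int = np.random.randint(largest_number//2)
    a = int2binary[a_int]
    b_int = np.random.randint(largest_number//2)
    b = int2binary[b_int]
    c_int = a_int + b_int
    c = int2binary[c_int]
Generate a random sample consisting of binary a, b, and c with c = a + b; a_int, b_int, and c_int are the integer forms of a, b, and c. Both operands are drawn below largest_number/2 so that the sum still fits in 8 bits.
    d = np.zeros_like(c)
d stores the model's prediction of c.
    overallError = 0
The overall error, used to observe how well the model is doing.
    layer_2_deltas = list()
Stores the layer 2 (output layer) deltas. For the derivation of the output-layer delta formula, see http://deeplearning.stanford.... .
    layer_1_values = list()
    layer_1_values.append(np.zeros(hidden_dim))
Stores the layer 1 (hidden layer) outputs; a zero vector is appended first to serve as the value of the "previous" time step.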
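    for position in range(binary_dim):
Iterate over every bit of the binary numbers.
        X = np.array([[a[binary_dim - position - 1],b[binary_dim - position - 1]]])
        y = np.array([[c[binary_dim - position - 1]]]).T
X and y are the input and output bits at index position, counting from the least significant bit. For each sample, X holds two values: the bits of a and b at that position. The sample is split into individual bits for training; because binary addition has to carry, it is a good fit for a network with memory, and the 8 bits of each sample form one time series.
        layer_1 = sigmoid(np.dot(X,synapse_0) + np.dot(layer_1_values[-1],synapse_h))
Formula: C_t = sigma(W0·X_t + Wh·C_(t-1))
        layer_2 = sigmoid(np.dot(layer_1,synapse_1))
The formula used here is C2 = sigma(W1·C1).
        layer_2_error = y - layer_2
Compute the error between the prediction and the true value.
        layer_2_deltas.append((layer_2_error)*sigmoid_output_to_derivative(layer_2))
Backpropagation: compute the delta and append it to the list layer_2_deltas.
        overallError += np.abs(layer_2_error[0])
Accumulate the total error, for display and inspection only.
        d[binary_dim - position - 1] = np.round(layer_2[0][0])
Store the predicted output bit for this position.
        layer_1_values.append(copy.deepcopy(layer_1))
Store the hidden layer values produced along the way.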
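    future_layer_1_delta = np.zeros(hidden_dim)
Holds the hidden layer delta of the next time step; it starts out as zeros.
    for position in range(binary_dim):
Iterate over every bit again.
        X = np.array([[a[position],b[position]]])
Fetch X, this time starting from the most significant bit: backpropagation walks the time steps in reverse order, one step at a time.
        layer_1 = layer_1_values[-position-1]
Fetch the hidden layer output for this bit.
        prev_layer_1 = layer_1_values[-position-2]
Fetch the hidden layer output of the previous time step for this bit.
        layer_2_delta = layer_2_deltas[-position-1]
Fetch the output layer delta for this bit.
        layer_1_delta = (future_layer_1_delta.dot(synapse_h.T) + layer_2_delta.dot(synapse_1.T)) * sigmoid_output_to_derivative(layer_1)
The backpropagation formula: the error arriving from the output layer plus the error arriving from the hidden layer of the next time step, scaled by the sigmoid derivative.
        synapse_1_update += np.atleast_2d(layer_1).T.dot(layer_2_delta)
Accumulate the weight matrix update: the partial derivative with respect to a weight matrix is the product of this layer's output and the next layer's delta.
        synapse_h_update += np.atleast_2d(prev_layer_1).T.dot(layer_1_delta)
The update for the hidden-to-hidden weight matrix: the product of the previous time step's hidden output and this time step's delta.
        synapse_0_update += X.T.dot(layer_1_delta)
The update for the input layer weight matrix.
        future_layer_1_delta = layer_1_delta
Remember this time step's hidden layer delta for the next iteration.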
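One way to convince yourself that the backward loop is right is a gradient check. The sketch below assumes the objective is E = 0.5 * sum over t of (y_t - out_t)^2, which is consistent with the deltas used above but is not stated in the original; under that assumption synapse_h_update should equal minus dE/dsynapse_h. The small sizes (4 hidden units, 3 bits) and the function names are choices made for this illustration.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def forward(a_bits, b_bits, w0, wh, w1):
    """Run the recurrent adder over the bits, least significant first."""
    hidden = [np.zeros((1, wh.shape[0]))]
    outputs = []
    for t in range(len(a_bits)):
        x = np.array([[a_bits[-t - 1], b_bits[-t - 1]]], dtype=float)
        h = sigmoid(x @ w0 + hidden[-1] @ wh)
        hidden.append(h)
        outputs.append(sigmoid(h @ w1))
    return hidden, outputs

def loss(a_bits, b_bits, c_bits, w0, wh, w1):
    """Assumed objective: E = 0.5 * sum_t (y_t - out_t)^2."""
    _, outputs = forward(a_bits, b_bits, w0, wh, w1)
    return 0.5 * sum((c_bits[-t - 1] - out[0, 0]) ** 2
                     for t, out in enumerate(outputs))

def bptt_update_h(a_bits, b_bits, c_bits, w0, wh, w1):
    """Accumulate the hidden-to-hidden update the same way the backward loop does."""
    hidden, outputs = forward(a_bits, b_bits, w0, wh, w1)
    update_h = np.zeros_like(wh)
    future_delta = np.zeros((1, wh.shape[0]))
    for t in reversed(range(len(a_bits))):
        out = outputs[t]
        h, prev_h = hidden[t + 1], hidden[t]
        delta_2 = (c_bits[-t - 1] - out) * out * (1 - out)
        delta_1 = (future_delta @ wh.T + delta_2 @ w1.T) * h * (1 - h)
        update_h += prev_h.T @ delta_1
        future_delta = delta_1
    return update_h

rng = np.random.default_rng(0)
w0 = rng.uniform(-1, 1, (2, 4))
wh = rng.uniform(-1, 1, (4, 4))
w1 = rng.uniform(-1, 1, (4, 1))
a_bits, b_bits, c_bits = [0, 1, 1], [0, 0, 1], [1, 0, 0]   # 3 + 1 = 4

analytic = bptt_update_h(a_bits, b_bits, c_bits, w0, wh, w1)

# Numerical estimate of -dE/dwh by central differences.
numeric = np.zeros_like(wh)
eps = 1e-6
for i in range(wh.shape[0]):
    for j in range(wh.shape[1]):
        plus, minus = wh.copy(), wh.copy()
        plus[i, j] += eps
        minus[i, j] -= eps
        numeric[i, j] = -(loss(a_bits, b_bits, c_bits, w0, plus, w1)
                          - loss(a_bits, b_bits, c_bits, w0, minus, w1)) / (2 * eps)

max_gap = np.max(np.abs(analytic - numeric))
print(max_gap)
```

The sign flip explains why the training loop adds the update to the weights: the accumulated value already points downhill.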
    synapse_0 += synapse_0_update * alpha
    synapse_1 += synapse_1_update * alpha
    synapse_h += synapse_h_update * alpha
Apply the accumulated updates to the weight matrices.
    synapse_0_update *= 0
    synapse_1_update *= 0
    synapse_h_update *= 0
Reset the update accumulators to zero.
    if(j % 1000 == 0):
        print("Error:" + str(overallError))
        print("Pred:" + str(d))
        print("True:" + str(c))
        out = 0
        for index,x in enumerate(reversed(d)):
            out += x*pow(2,index)
        print(str(a_int) + " + " + str(b_int) + " = " + str(out))
        print("------------")
Every 1000 training samples, print the overall error so that convergence can be watched at run time.
This is the simplest possible implementation: it has no bias terms and only two input neurons.
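A complete LSTM implementation in Python. It closely follows the paper referred to as the "great intro paper"; the code comes from https://github.com/nicodjimen... , the author's explanation is at http://nicodjimenez.github.io... , and the diagrams at http://colah.github.io/posts/... illustrate the process in detail.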
import random
import numpy as np
import math
def sigmoid(x):
    return 1. / (1 + np.exp(-x))
Declare the sigmoid function.
def rand_arr(a, b, *args):
    np.random.seed(0)
    return np.random.rand(*args) * (b - a) + a
Generate a random matrix with values in [a, b); the shape is given by args.
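One consequence of reseeding inside rand_arr is worth seeing directly: two calls with the same shape return the same values. The sketch below uses a small 3×5 shape chosen only for illustration.

```python
import numpy as np

def rand_arr(a, b, *args):
    # Same body as in the article: the seed is reset on every call.
    np.random.seed(0)
    return np.random.rand(*args) * (b - a) + a

wg = rand_arr(-0.1, 0.1, 3, 5)
wi = rand_arr(-0.1, 0.1, 3, 5)

same_start = np.array_equal(wg, wi)                  # identical because of the reseed
in_range = bool(np.all((wg >= -0.1) & (wg < 0.1)))   # values stay inside [a, b)
print(same_start, in_range)
```

So matrices of equal shape start from the same values, which makes runs reproducible. If independent initial values are wanted, seeding once outside the function is the usual alternative.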
class LstmParam:
    def __init__(self, mem_cell_ct, x_dim):
        self.mem_cell_ct = mem_cell_ct
        self.x_dim = x_dim
        concat_len = x_dim + mem_cell_ct
        # weight matrices
        self.wg = rand_arr(-0.1, 0.1, mem_cell_ct, concat_len)
        self.wi = rand_arr(-0.1, 0.1, mem_cell_ct, concat_len)
        self.wf = rand_arr(-0.1, 0.1, mem_cell_ct, concat_len)
        self.wo = rand_arr(-0.1, 0.1, mem_cell_ct, concat_len)
        # bias terms
        self.bg = rand_arr(-0.1, 0.1, mem_cell_ct)
        self.bi = rand_arr(-0.1, 0.1, mem_cell_ct)
        self.bf = rand_arr(-0.1, 0.1, mem_cell_ct)
        self.bo = rand_arr(-0.1, 0.1, mem_cell_ct)
        # diffs (derivative of loss function w.r.t. all parameters)
        self.wg_diff = np.zeros((mem_cell_ct, concat_len))
        self.wi_diff = np.zeros((mem_cell_ct, concat_len))
        self.wf_diff = np.zeros((mem_cell_ct, concat_len))
        self.wo_diff = np.zeros((mem_cell_ct, concat_len))
        self.bg_diff = np.zeros(mem_cell_ct)
        self.bi_diff = np.zeros(mem_cell_ct)
        self.bf_diff = np.zeros(mem_cell_ct)
        self.bo_diff = np.zeros(mem_cell_ct)
The LstmParam class holds the parameters. mem_cell_ct is the number of LSTM memory cells, x_dim is the input dimension, and concat_len is the sum of mem_cell_ct and x_dim. wg is the weight matrix of the input node, wi of the input gate, wf of the forget gate, and wo of the output gate; bg, bi, bf, and bo are the corresponding biases. wg_diff, wi_diff, wf_diff, and wo_diff hold the gradients of the loss with respect to those weight matrices, and bg_diff, bi_diff, bf_diff, and bo_diff hold the gradients with respect to the biases. Everything is initialized to the right shape, with the gradient matrices set to zero.
    def apply_diff(self, lr = 1):
        self.wg -= lr * self.wg_diff
        self.wi -= lr * self.wi_diff
        self.wf -= lr * self.wf_diff
        self.wo -= lr * self.wo_diff
        self.bg -= lr * self.bg_diff
        self.bi -= lr * self.bi_diff
        self.bf -= lr * self.bf_diff
        self.bo -= lr * self.bo_diff
        # reset diffs to zero
        self.wg_diff = np.zeros_like(self.wg)
        self.wi_diff = np.zeros_like(self.wi)
        self.wf_diff = np.zeros_like(self.wf)
        self.wo_diff = np.zeros_like(self.wo)
        self.bg_diff = np.zeros_like(self.bg)
        self.bi_diff = np.zeros_like(self.bi)
        self.bf_diff = np.zeros_like(self.bf)
        self.bo_diff = np.zeros_like(self.bo)
This defines the weight update: subtract the gradients scaled by the learning rate, then reset the gradient matrices to zero.
class LstmState:
    def __init__(self, mem_cell_ct, x_dim):
        self.g = np.zeros(mem_cell_ct)
        self.i = np.zeros(mem_cell_ct)
        self.f = np.zeros(mem_cell_ct)
        self.o = np.zeros(mem_cell_ct)
        self.s = np.zeros(mem_cell_ct)
        self.h = np.zeros(mem_cell_ct)
        self.bottom_diff_h = np.zeros_like(self.h)
        self.bottom_diff_s = np.zeros_like(self.s)
        self.bottom_diff_x = np.zeros(x_dim)
LstmState stores the state of the LSTM cells: g, i, f, o, s, and h. s is the internal state (the memory) and h is the output of the hidden layer neurons.
class LstmNode:
    def __init__(self, lstm_param, lstm_state):
        # store reference to parameters and to activations
        self.state = lstm_state
        self.param = lstm_param
        # non-recurrent input to node
        self.x = None
        # non-recurrent input concatenated with recurrent input
        self.xc = None
LstmNode corresponds to one input sample: x is the input sample, and xc is x concatenated with the recurrent input via hstack (hstack joins arrays horizontally, vstack vertically).
    def bottom_data_is(self, x, s_prev = None, h_prev = None):
        # if this is the first lstm node in the network
        if s_prev is None: s_prev = np.zeros_like(self.state.s)
        if h_prev is None: h_prev = np.zeros_like(self.state.h)
        # save data for use in backprop
        self.s_prev = s_prev
        self.h_prev = h_prev
        # concatenate x(t) and h(t-1)
        xc = np.hstack((x, h_prev))
        self.state.g = np.tanh(np.dot(self.param.wg, xc) + self.param.bg)
        self.state.i = sigmoid(np.dot(self.param.wi, xc) + self.param.bi)
        self.state.f = sigmoid(np.dot(self.param.wf, xc) + self.param.bf)
        self.state.o = sigmoid(np.dot(self.param.wo, xc) + self.param.bo)
        self.state.s = self.state.g * self.state.i + s_prev * self.state.f
        self.state.h = self.state.s * self.state.o
        self.x = x
        self.xc = xc
bottom and top are the two directions: samples enter from the bottom, and backpropagation runs from the top down. bottom_data_is is the input step: it concatenates x with the previous output h, computes g, i, f, and o with the formula wx+b, and applies the tanh and sigmoid activation functions.
Each time step has four neural network layers (activation functions). The leftmost is the forget gate, which acts directly on the memory C. The second is the input gate, which depends on the input sample and affects the memory C in a certain "proportion"; that proportion comes from the third layer (tanh), whose range of [-1, 1] lets it push the memory up or down. The last is the output gate: the output at each time step depends not only on the input sample x and the previous step's output but also on the memory C. The design imitates the memory function of biological neurons.
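The four layers can also be written as one pure function, which makes the data flow easier to follow. This is a minimal sketch, not the author's class: lstm_step and the dict layout of w and b are names assumed for this illustration, and the sizes are arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, s_prev, h_prev, w, b):
    """One forward step; w and b are dicts keyed by 'g', 'i', 'f', 'o' (assumed layout)."""
    xc = np.hstack((x, h_prev))            # concatenate x(t) and h(t-1)
    g = np.tanh(w["g"] @ xc + b["g"])      # input node, values in (-1, 1)
    i = sigmoid(w["i"] @ xc + b["i"])      # input gate, values in (0, 1)
    f = sigmoid(w["f"] @ xc + b["f"])      # forget gate
    o = sigmoid(w["o"] @ xc + b["o"])      # output gate
    s = g * i + s_prev * f                 # new memory
    h = s * o                              # output, as in the article's code
    return s, h

rng = np.random.default_rng(0)
x_dim, mem_cell_ct = 3, 4
w = {k: rng.uniform(-0.1, 0.1, (mem_cell_ct, x_dim + mem_cell_ct)) for k in "gifo"}
b = {k: rng.uniform(-0.1, 0.1, mem_cell_ct) for k in "gifo"}

s, h = np.zeros(mem_cell_ct), np.zeros(mem_cell_ct)
for x in rng.random((5, x_dim)):           # a 5-step input sequence
    s, h = lstm_step(x, s, h, w, b)
print(s.shape, h.shape)
```

Carrying s and h from one call to the next plays the role of s_prev and h_prev in bottom_data_is.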
    def top_diff_is(self, top_diff_h, top_diff_s):
        # notice that top_diff_s is carried along the constant error carousel
        ds = self.state.o * top_diff_h + top_diff_s
        do = self.state.s * top_diff_h
        di = self.state.g * ds
        dg = self.state.i * ds
        df = self.s_prev * ds
        # diffs w.r.t. vector inside sigma / tanh function
        di_input = (1. - self.state.i) * self.state.i * di
        df_input = (1. - self.state.f) * self.state.f * df
        do_input = (1. - self.state.o) * self.state.o * do
        dg_input = (1. - self.state.g ** 2) * dg
        # diffs w.r.t. inputs
        self.param.wi_diff += np.outer(di_input, self.xc)
        self.param.wf_diff += np.outer(df_input, self.xc)
        self.param.wo_diff += np.outer(do_input, self.xc)
        self.param.wg_diff += np.outer(dg_input, self.xc)
        self.param.bi_diff += di_input
        self.param.bf_diff += df_input
        self.param.bo_diff += do_input
        self.param.bg_diff += dg_input
        # compute bottom diff
        dxc = np.zeros_like(self.xc)
        dxc += np.dot(self.param.wi.T, di_input)
        dxc += np.dot(self.param.wf.T, df_input)
        dxc += np.dot(self.param.wo.T, do_input)
        dxc += np.dot(self.param.wg.T, dg_input)
        # save bottom diffs
        self.state.bottom_diff_s = ds * self.state.f
        self.state.bottom_diff_x = dxc[:self.param.x_dim]
        self.state.bottom_diff_h = dxc[self.param.x_dim:]
Backpropagation is the core of the whole training process. Suppose that at time t the LSTM outputs the prediction h(t) while the actual target is y(t); the difference between them is the loss. Take the loss function l(t) = f(h(t), y(t)) = ||h(t) - y(t)||^2, the squared Euclidean distance. The overall loss is L = ∑l(t) for t from 1 to T, where T is the length of the whole sequence. The goal is to minimize L by gradient descent: find weights w such that small changes in w no longer change L. That is a local optimum, where the gradient of L with respect to w is 0.
dL/dw says how much L changes per unit change in w; dh(t)/dw says how much h(t) changes per unit change in w; dL/dh(t) says how much L changes per unit change in h(t). For the i-th memory cell at time step t, (dL/dh(t)) * (dh(t)/dw) is the change in L per unit change in w through that cell; summing over all i from 1 to M and all t from 1 to T gives the overall dL/dw.
For the i-th memory cell, dL/dh(t) is the total change, per unit change in h(t), of all local losses l over time steps 1 to T. But h(t) only affects the local losses from t to T.
So let L(t) denote the sum of the losses from t to T: L(t) = ∑l(s) for s from t to T. Then dL/dh(t) = dL(t)/dh(t).
The other factor is the derivative of h(t) with respect to w.
Since L(t) = l(t) + L(t+1), we get dL(t)/dh(t) = dl(t)/dh(t) + dL(t+1)/dh(t): the derivative at the current time step is obtained from the derivative at the next one. Following this recurrence, compute the derivative at time T and work backwards; at time T, dL(T)/dh(T) = dl(T)/dh(T).
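Collected in one place, the quantities defined above can be written as follows. This is a restatement of the surrounding text in LaTeX and adds no new result; the convention L(T+1) = 0 is an assumption introduced here so that the recurrence also covers the last step.

```latex
\begin{aligned}
L &= \sum_{t=1}^{T} l(t), \qquad l(t) = \lVert h(t) - y(t) \rVert^{2} \\
L(t) &= \sum_{s=t}^{T} l(s) = l(t) + L(t+1), \qquad L(T+1) = 0 \\
\frac{dL}{dw} &= \sum_{t=1}^{T} \sum_{i=1}^{M} \frac{dL(t)}{dh_i(t)} \, \frac{dh_i(t)}{dw} \\
\frac{dL(t)}{dh(t)} &= \frac{dl(t)}{dh(t)} + \frac{dL(t+1)}{dh(t)}, \qquad
\frac{dL(T)}{dh(T)} = \frac{dl(T)}{dh(T)}
\end{aligned}
```

Here h_i(t) is the output of the i-th of the M memory cells at time step t.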
class LstmNetwork():
    def __init__(self, lstm_param):
        self.lstm_param = lstm_param
        self.lstm_node_list = []
        # input sequence
        self.x_list = []
    def y_list_is(self, y_list, loss_layer):
        """
        Updates diffs by setting target sequence with corresponding loss layer.
        Will *NOT* update parameters. To update parameters,
        call self.lstm_param.apply_diff()
        """
        assert len(y_list) == len(self.x_list)
        idx = len(self.x_list) - 1
        # first node only gets diffs from label ...
        loss = loss_layer.loss(self.lstm_node_list[idx].state.h, y_list[idx])
        diff_h = loss_layer.bottom_diff(self.lstm_node_list[idx].state.h, y_list[idx])
        # here s is not affecting loss due to h(t+1), hence we set equal to zero
        diff_s = np.zeros(self.lstm_param.mem_cell_ct)
        self.lstm_node_list[idx].top_diff_is(diff_h, diff_s)
        idx -= 1
        ### ... following nodes also get diffs from next nodes, hence we add diffs to diff_h
        ### we also propagate error along constant error carousel using diff_s
        while idx >= 0:
            loss += loss_layer.loss(self.lstm_node_list[idx].state.h, y_list[idx])
            diff_h = loss_layer.bottom_diff(self.lstm_node_list[idx].state.h, y_list[idx])
            diff_h += self.lstm_node_list[idx + 1].state.bottom_diff_h
            diff_s = self.lstm_node_list[idx + 1].state.bottom_diff_s
            self.lstm_node_list[idx].top_diff_is(diff_h, diff_s)
            idx -= 1
        return loss
diff_h is the numerical value of dL(t)/dh(t), i.e. how much the loss L changes per unit change in the prediction. idx runs backwards from T to 1; at each step diff_h is the sum of loss_layer.bottom_diff and the bottom_diff_h of the next time step (the first pass, at T, has no bottom_diff_h to add).
loss_layer.bottom_diff:
    def bottom_diff(self, pred, label):
        diff = np.zeros_like(pred)
        diff[0] = 2 * (pred[0] - label)
        return diff
This is the derivative of l(t) = f(h(t), y(t)) = ||h(t) - y(t)||^2, namely l'(t) = 2 * (h(t) - y(t)).
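The derivative can be checked against a finite-difference estimate. The sketch below repeats the loss layer so that it runs on its own; the values of pred and label are made up for the check.

```python
import numpy as np

class ToyLossLayer:
    """Square loss on the first element of the hidden vector, as in the article."""
    @classmethod
    def loss(cls, pred, label):
        return (pred[0] - label) ** 2

    @classmethod
    def bottom_diff(cls, pred, label):
        diff = np.zeros_like(pred)
        diff[0] = 2 * (pred[0] - label)
        return diff

pred = np.array([0.3, -0.2, 0.7])   # made-up hidden vector
label = -0.5
eps = 1e-6

# Central differences, one coordinate of pred at a time.
numeric = np.zeros_like(pred)
for k in range(len(pred)):
    bump = np.zeros_like(pred)
    bump[k] = eps
    numeric[k] = (ToyLossLayer.loss(pred + bump, label)
                  - ToyLossLayer.loss(pred - bump, label)) / (2 * eps)

analytic = ToyLossLayer.bottom_diff(pred, label)
print(analytic, numeric)
```

Only the first coordinate gets a nonzero gradient, because the loss looks at pred[0] alone.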
When s(t) changes, L(t) changes because s(t) affects both h(t) and h(t+1), and through them L(t).
h(t+1) does not affect l(t).
The first term is (dL(t)/dh(t)) * (dh(t)/ds(t)); the second term arrives from time step t+1, so dL(t)/ds(t) = (dL(t)/dh(t)) * (dh(t)/ds(t)) + dL(t+1)/ds(t) is derived step by step backwards from t+1 to t.
In the node, self.state.h = self.state.s * self.state.o, i.e. h(t) = s(t) * o(t), so dh(t)/ds(t) = o(t); dL(t)/dh(t) is top_diff_h.
About the name top_diff_is: bottom means the input to a layer and top means the output of a layer, the same terminology Caffe uses.
    def top_diff_is(self, top_diff_h, top_diff_s):
top_diff_h is dL(t)/dh(t) at the current time step t; top_diff_s is the memory-cell term passed back from time step t+1, dL(t+1)/ds(t).
        ds = self.state.o * top_diff_h + top_diff_s
        do = self.state.s * top_diff_h
        di = self.state.g * ds
        dg = self.state.i * ds
        df = self.s_prev * ds
The prefix d denotes the derivative of the loss L with respect to that quantity.
ds computes dL(t)/ds(t) for the current time step t, using the formula for dL(t)/ds(t).
do computes dL(t)/do(t): h(t) = s(t) * o(t), so dh(t)/do(t) = s(t) and dL(t)/do(t) = (dL(t)/dh(t)) * (dh(t)/do(t)) = top_diff_h * s(t).
di computes dL(t)/di(t): s(t) = f(t) * s(t-1) + i(t) * g(t), so dL(t)/di(t) = (dL(t)/ds(t)) * (ds(t)/di(t)) = ds * g(t).
dg computes dL(t)/dg(t): dL(t)/dg(t) = (dL(t)/ds(t)) * (ds(t)/dg(t)) = ds * i(t).
df computes dL(t)/df(t): dL(t)/df(t) = (dL(t)/ds(t)) * (ds(t)/df(t)) = ds * s(t-1).
        di_input = (1. - self.state.i) * self.state.i * di
        df_input = (1. - self.state.f) * self.state.f * df
        do_input = (1. - self.state.o) * self.state.o * do
        dg_input = (1. - self.state.g ** 2) * dg
These use the derivatives of sigmoid and tanh. Take di_input: (1. - self.state.i) * self.state.i is the sigmoid derivative, i.e. how much the output of the i neuron changes per unit change in its input; multiplying by di gives how much the loss L(t) changes per unit change in the input of the i neuron, dL(t)/d i_input(t).
        self.param.wi_diff += np.outer(di_input, self.xc)
        self.param.wf_diff += np.outer(df_input, self.xc)
        self.param.wo_diff += np.outer(do_input, self.xc)
        self.param.wg_diff += np.outer(dg_input, self.xc)
        self.param.bi_diff += di_input
        self.param.bf_diff += df_input
        self.param.bo_diff += do_input
        self.param.bg_diff += dg_input
w_diff is the gradient for a weight matrix and b_diff the gradient for a bias; both are used in the update.
        dxc = np.zeros_like(self.xc)
        dxc += np.dot(self.param.wi.T, di_input)
        dxc += np.dot(self.param.wf.T, df_input)
        dxc += np.dot(self.param.wo.T, do_input)
        dxc += np.dot(self.param.wg.T, dg_input)
Accumulate the diff of the input xc: xc is used in four places, and the four diffs are summed.
        self.state.bottom_diff_s = ds * self.state.f
        self.state.bottom_diff_x = dxc[:self.param.x_dim]
        self.state.bottom_diff_h = dxc[self.param.x_dim:]
bottom_diff_s relates the change in s at time step t-1 to the change in s at time step t by a factor of f. dxc covers the horizontal concatenation of x and h, so it is sliced into its two parts, bottom_diff_x and bottom_diff_h.
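The chain of derivatives above can be checked numerically for a single node. The sketch below is a simplified stand-in for the class: biases are dropped to keep it short, the loss is the square loss on h[0], and the sizes, seed, and variable names are choices made for this illustration. It compares the wi gradient computed the way top_diff_is does with a finite-difference estimate.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(wi, wf, wo, wg, xc, s_prev):
    """Single node forward pass without biases (dropped to keep the sketch short)."""
    i, f, o = sigmoid(wi @ xc), sigmoid(wf @ xc), sigmoid(wo @ xc)
    g = np.tanh(wg @ xc)
    s = g * i + s_prev * f
    return i, f, o, g, s, s * o

def loss(h, y):
    return (h[0] - y) ** 2

rng = np.random.default_rng(1)
n, m = 3, 5                          # memory cells, length of xc
wi, wf, wo, wg = (rng.uniform(-0.5, 0.5, (n, m)) for _ in range(4))
xc, s_prev, y = rng.random(m), rng.random(n), 0.25

# Analytic gradient of the loss w.r.t. wi, following the steps of top_diff_is.
i, f, o, g, s, h = forward(wi, wf, wo, wg, xc, s_prev)
top_diff_h = np.zeros(n)
top_diff_h[0] = 2 * (h[0] - y)
top_diff_s = np.zeros(n)             # last node: nothing arrives from t+1
ds = o * top_diff_h + top_diff_s
di = g * ds
di_input = (1.0 - i) * i * di
wi_diff = np.outer(di_input, xc)

# Numerical gradient by central differences.
eps = 1e-6
numeric = np.zeros_like(wi)
for r in range(n):
    for c in range(m):
        plus, minus = wi.copy(), wi.copy()
        plus[r, c] += eps
        minus[r, c] -= eps
        numeric[r, c] = (loss(forward(plus, wf, wo, wg, xc, s_prev)[5], y)
                         - loss(forward(minus, wf, wo, wg, xc, s_prev)[5], y)) / (2 * eps)

max_gap = np.max(np.abs(wi_diff - numeric))
print(max_gap)
```

The same comparison can be repeated for wf, wo, and wg by swapping in df_input, do_input, and dg_input.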
    def x_list_clear(self):
        self.x_list = []
    def x_list_add(self, x):
        self.x_list.append(x)
        if len(self.x_list) > len(self.lstm_node_list):
            # need to add new lstm node, create new state mem
            lstm_state = LstmState(self.lstm_param.mem_cell_ct, self.lstm_param.x_dim)
            self.lstm_node_list.append(LstmNode(self.lstm_param, lstm_state))
        # get index of most recent x input
        idx = len(self.x_list) - 1
        if idx == 0:
            # no recurrent inputs yet
            self.lstm_node_list[idx].bottom_data_is(x)
        else:
            s_prev = self.lstm_node_list[idx - 1].state.s
            h_prev = self.lstm_node_list[idx - 1].state.h
            self.lstm_node_list[idx].bottom_data_is(x, s_prev, h_prev)
x_list_add appends a training sample, i.e. feeds in the input x; both methods belong to LstmNetwork.
def example_0():
    # learns to repeat simple sequence from random inputs
    np.random.seed(0)
    # parameters for input data dimension and lstm cell count
    mem_cell_ct = 100
    x_dim = 50
    concat_len = x_dim + mem_cell_ct
    lstm_param = LstmParam(mem_cell_ct, x_dim)
    lstm_net = LstmNetwork(lstm_param)
    y_list = [-0.5,0.2,0.1, -0.5]
    input_val_arr = [np.random.random(x_dim) for _ in y_list]
    for cur_iter in range(100):
        print("cur iter: ", cur_iter)
        for ind in range(len(y_list)):
            lstm_net.x_list_add(input_val_arr[ind])
            print("y_pred[%d] : %f" % (ind, lstm_net.lstm_node_list[ind].state.h[0]))
        loss = lstm_net.y_list_is(y_list, ToyLossLayer)
        print("loss: ", loss)
        lstm_param.apply_diff(lr=0.1)
        lstm_net.x_list_clear()
Initialize LstmParam with 100 memory cells and an input dimension of 50. Initialize the LstmNetwork to be trained, generate 4 groups of 50 random numbers each, and train with [-0.5, 0.2, 0.1, -0.5] as the y values, feeding 50 random numbers and one y value per step, for 100 iterations.
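An LSTM can take a run of consecutive primes as input and estimate the next one. A small test: generate the primes below 100, cyclically take sequences of 50 primes as x with the 51st prime as y, and train on 10 such samples for 10,000 iterations. The mean squared error goes from 0.17973 down to 1.05172e-06, which is almost exactly right: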
import numpy as np
import sys
from lstm import LstmParam, LstmNetwork
class ToyLossLayer:
    """
    Computes square loss with first element of hidden layer array.
    """
    @classmethod
    def loss(cls, pred, label):
        return (pred[0] - label) ** 2
    @classmethod
    def bottom_diff(cls, pred, label):
        diff = np.zeros_like(pred)
        diff[0] = 2 * (pred[0] - label)
        return diff
class Primes:
    def __init__(self):
        self.primes = list()
        for i in range(2, 100):
            is_prime = True
            for j in range(2, i-1):
                if i % j == 0:
                    is_prime = False
            if is_prime:
                self.primes.append(i)
        self.primes_count = len(self.primes)
    def get_sample(self, x_dim, y_dim, index):
        result = np.zeros((x_dim+y_dim))
        for i in range(index, index + x_dim + y_dim):
            result[i-index] = self.primes[i%self.primes_count]/100.0
        return result
def example_0():
    mem_cell_ct = 100
    x_dim = 50
    concat_len = x_dim + mem_cell_ct
    lstm_param = LstmParam(mem_cell_ct, x_dim)
    lstm_net = LstmNetwork(lstm_param)
    primes = Primes()
    x_list = []
    y_list = []
    for i in range(0, 10):
        sample = primes.get_sample(x_dim, 1, i)
        x = sample[0:x_dim]
        y = sample[x_dim:x_dim+1].tolist()[0]
        x_list.append(x)
        y_list.append(y)
    for cur_iter in range(10000):
        if cur_iter % 1000 == 0:
            print("y_list=", y_list)
        for ind in range(len(y_list)):
            lstm_net.x_list_add(x_list[ind])
            if cur_iter % 1000 == 0:
                print("y_pred[%d] : %f" % (ind, lstm_net.lstm_node_list[ind].state.h[0]))
        loss = lstm_net.y_list_is(y_list, ToyLossLayer)
        if cur_iter % 1000 == 0:
            print("loss: ", loss)
        lstm_param.apply_diff(lr=0.01)
        lstm_net.x_list_clear()
if __name__ == "__main__":
    example_0()
All primes in the list are divided by 100, because the training data for this code must be values less than 1.
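The scaling step and its inverse can be kept together so that predictions are easy to map back to primes. This is a small sketch: normalize, denormalize, and SCALE are names introduced here, and the list of primes is a short hand-picked sample.

```python
import numpy as np

SCALE = 100.0   # assumed upper bound on the raw values, as in the article

def normalize(values):
    """Map raw values into the range the network is trained on."""
    return np.asarray(values, dtype=float) / SCALE

def denormalize(values):
    """Map network outputs back to the original scale."""
    return np.asarray(values, dtype=float) * SCALE

primes = [2, 3, 5, 7, 11, 97]
scaled = normalize(primes)
restored = np.rint(denormalize(scaled)).astype(int)
print(scaled.max(), restored.tolist())
```

Rounding after denormalizing turns a prediction such as 0.2299 back into the integer 23.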
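torch is a deep learning framework. The main options are: 1) tensorflow, backed by Google and currently the most popular, suitable for both small experiments and large-scale computation, Python-based; its drawbacks are a relatively steep learning curve and average speed. 2) torch, backed by Facebook, used for small experiments, with many open-source applications, Lua-based, quick to pick up, and well documented online; its drawback is that Lua is a relatively niche language. 3) mxnet, backed by Amazon, mainly for large-scale computation, based on Python and R; its drawback is that there are few open-source projects online. 4) caffe, originally developed at UC Berkeley, used for large-scale computation, based on C++ and Python; its drawback is that development is not very convenient. 5) theano, average speed, Python-based, very well regarded.
There are quite a few LSTM implementations for torch on GitHub.
Installing torch on a Mac: https://github.com/torch/torc... .
git clone https://github.com/torch/dist... ~/torch --recursive
cd ~/torch; bash install-deps;
./install.sh
If qt fails to install, install it separately:
brew install cartr/qt4/qt
After installation, add the following line to ~/.bash_profile by hand:
. ~/torch/install/bin/torch-activate
Run source ~/.bash_profile, then run th to start using torch.
To install itorch, install its dependencies first:
brew install zeromq
brew install openssl
luarocks install luacrypto OPENSSL_DIR=/usr/local/opt/openssl/
git clone https://github.com/facebook/i...
cd iTorch
luarocks make
Image recognition with a convolutional neural network.
Create pattern_recognition.lua: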
require "nn"
require "paths"
if (not paths.filep("cifar10torchsmall.zip")) then
    os.execute("wget -c https://s3.amazonaws.com/torch7/data/cifar10torchsmall.zip")
    os.execute("unzip cifar10torchsmall.zip")
end
trainset = torch.load("cifar10-train.t7")
testset = torch.load("cifar10-test.t7")
classes = {"airplane", "automobile", "bird", "cat",
           "deer", "dog", "frog", "horse", "ship", "truck"}
setmetatable(trainset,
    {__index = function(t, i)
        return {t.data[i], t.label[i]}
    end}
);
trainset.data = trainset.data:double() -- convert the data from a ByteTensor to a DoubleTensor.
function trainset:size()
    return self.data:size(1)
end
mean = {} -- store the mean, to normalize the test set in the future
stdv = {} -- store the standard-deviation for the future
for i=1,3 do -- over each image channel
    mean[i] = trainset.data[{ {}, {i}, {}, {} }]:mean() -- mean estimation
    print("Channel " .. i .. ", Mean: " .. mean[i])
    trainset.data[{ {}, {i}, {}, {} }]:add(-mean[i]) -- mean subtraction
    stdv[i] = trainset.data[{ {}, {i}, {}, {} }]:std() -- std estimation
    print("Channel " .. i .. ", Standard Deviation: " .. stdv[i])
    trainset.data[{ {}, {i}, {}, {} }]:div(stdv[i]) -- std scaling
end
net = nn.Sequential()
net:add(nn.SpatialConvolution(3, 6, 5, 5)) -- 3 input image channels, 6 output channels, 5x5 convolution kernel
net:add(nn.ReLU()) -- non-linearity
net:add(nn.SpatialMaxPooling(2,2,2,2)) -- A max-pooling operation that looks at 2x2 windows and finds the max.
net:add(nn.SpatialConvolution(6, 16, 5, 5))
net:add(nn.ReLU()) -- non-linearity
net:add(nn.SpatialMaxPooling(2,2,2,2))
net:add(nn.View(16*5*5)) -- reshapes from a 3D tensor of 16x5x5 into 1D tensor of 16*5*5
net:add(nn.Linear(16*5*5, 120)) -- fully connected layer (matrix multiplication between input and weights)
net:add(nn.ReLU()) -- non-linearity
net:add(nn.Linear(120, 84))
net:add(nn.ReLU()) -- non-linearity
net:add(nn.Linear(84, 10)) -- 10 is the number of outputs of the network (in this case, 10 classes)
net:add(nn.LogSoftMax()) -- converts the output to a log-probability. Useful for classification problems
criterion = nn.ClassNLLCriterion()
trainer = nn.StochasticGradient(net, criterion)
trainer.learningRate = 0.001
trainer.maxIteration = 5
trainer:train(trainset)
testset.data = testset.data:double() -- convert from Byte tensor to Double tensor
for i=1,3 do -- over each image channel
    testset.data[{ {}, {i}, {}, {} }]:add(-mean[i]) -- mean subtraction
    testset.data[{ {}, {i}, {}, {} }]:div(stdv[i]) -- std scaling
end
predicted = net:forward(testset.data[100])
print(classes[testset.label[100]])
print(predicted:exp())
for i=1,predicted:size(1) do
    print(classes[i], predicted[i])
end
correct = 0
for i=1,10000 do
    local groundtruth = testset.label[i]
    local prediction = net:forward(testset.data[i])
    local confidences, indices = torch.sort(prediction, true) -- true means sort in descending order
    if groundtruth == indices[1] then
        correct = correct + 1
    end
end
print(correct, 100*correct/10000 .. " % ")
class_performance = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
for i=1,10000 do
    local groundtruth = testset.label[i]
    local prediction = net:forward(testset.data[i])
    local confidences, indices = torch.sort(prediction, true) -- true means sort in descending order
    if groundtruth == indices[1] then
        class_performance[groundtruth] = class_performance[groundtruth] + 1
    end
end
for i=1,#classes do
    print(classes[i], 100*class_performance[i]/1000 .. " %")
end
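Run th pattern_recognition.lua.
The script first downloads the cifar10torchsmall.zip sample set, which has 50000 training images and 10000 test images, all labeled with one of 10 classes such as airplane and automobile. It attaches an __index metamethod and a size method to trainset so that it can be used with nn (the StochasticGradient trainer expects both); for how such bindings work, see the Lua tutorial at http://tylerneylon.com/a/lear... . The trainset data is then normalized: it is converted to a double tensor with mean 0 and standard deviation 1 per channel. Next the convolutional network is built, with two convolution layers, two pooling layers, three fully connected layers, and a softmax layer, and trained with a learning rate of 0.001 for 5 iterations. Once trained, the model predicts image number 100 of the test set, then prints the overall accuracy and the accuracy for each class. Source: https://github.com/soumith/cv... .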
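The size passed to nn.View follows from the layer arithmetic: the network flattens a 16x5x5 tensor. The sketch below traces the spatial size through the layers in Python, assuming 32x32 input images and no padding; conv_out is a helper named for this illustration.

```python
def conv_out(size, kernel, stride=1):
    """Output width of a convolution or pooling window with no padding."""
    return (size - kernel) // stride + 1

size = 32                             # CIFAR-10 images are 32x32
size = conv_out(size, 5)              # SpatialConvolution(3, 6, 5, 5)
size = conv_out(size, 2, stride=2)    # SpatialMaxPooling(2, 2, 2, 2)
size = conv_out(size, 5)              # SpatialConvolution(6, 16, 5, 5)
size = conv_out(size, 2, stride=2)    # SpatialMaxPooling(2, 2, 2, 2)

flat = 16 * size * size               # 16 output channels after the second convolution
print(size, flat)
```

The spatial size goes 32, 28, 14, 10, 5, so the first fully connected layer receives 16*5*5 = 400 inputs.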
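torch supports GPU computation conveniently, although the code needs some changes.
Popular seq2seq models are almost all encoder-decoder models built from LSTMs, and most open-source implementations use one-hot embeddings, which carry less information than word vectors. What follows is a seq2seq-style model on top of word2vec word vectors: a bot with only a single LSTM unit.
Download the full text of the novel 《甄环传》 (Zhen Huan Zhuan). Search Baidu for "甄环传 txt" and download a copy, convert the file to UTF-8, and replace all Windows line endings with \n to simplify later processing.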
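The conversion step can be scripted. The sketch below is one possible way to do it: to_utf8_unix is a hypothetical helper, and gb18030 is only an assumed source encoding (common for Chinese text files from Windows); check the real encoding of the downloaded file before relying on it.

```python
import os
import tempfile

def to_utf8_unix(src_path, dst_path, src_encoding="gb18030"):
    """Re-encode a text file as UTF-8 and turn CRLF / CR line endings into LF."""
    # newline="" disables newline translation so the replacements below see the raw endings.
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)

# Demonstration on a tiny made-up file with Windows line endings.
workdir = tempfile.mkdtemp()
raw_path = os.path.join(workdir, "raw.txt")
out_path = os.path.join(workdir, "zhenhuanzhuan.txt")
with open(raw_path, "wb") as f:
    f.write("第一行\r\n第二行\r\n".encode("gb18030"))

to_utf8_unix(raw_path, out_path)
with open(out_path, "rb") as f:
    converted = f.read()
print(converted.decode("utf-8"))
```

The output file can then be passed to the segmentation step unchanged.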
Segment the novel into words. The segmentation tool word_segment.py can be downloaded from GitHub at https://github.com/warmheartl... .
python ./word_segment.py zhenhuanzhuan.txt zhenhuanzhuan.segment
Generate word vectors with word2vec; the source is at https://github.com/warmheartl... . Build it with make and it is ready to run.
./word2vec -train ./zhenhuanzhuan.segment -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15
This produces vectors.bin, a word-vector file generated from the text of the novel.
The training code.
# -*- coding: utf-8 -*-
import sys
import math
import tflearn
import chardet
import numpy as np
import struct
seq = []
max_w = 50
float_size = 4
word_vector_dict = {}
def load_vectors(input):
    """Load word vectors from vectors.bin into word_vector_dict; the key is the word, the value a 200-dimensional vector."""
    print("begin load vectors")
    input_file = open(input, "rb")
    # read the vocabulary size and the vector dimension
    words_and_size = input_file.readline()
    words_and_size = words_and_size.strip()
    words = int(words_and_size.split(b" ")[0])
    size = int(words_and_size.split(b" ")[1])
    print("words =", words)
    print("size =", size)
    for b in range(0, words):
        a = 0
        word = b""
        # read one word, byte by byte, up to the separating space
        while True:
            c = input_file.read(1)
            word = word + c
            if not c or c == b" ":
                break
            if a < max_w and c != b"\n":
                a = a + 1
        word = word.strip()
        vector = []
        for index in range(0, size):
            m = input_file.read(float_size)
            (weight,) = struct.unpack("f", m)
            vector.append(weight)
        # store the word and its vector in the dict
        word_vector_dict[word.decode("utf-8")] = vector
    input_file.close()
    print("load vectors finish")
def init_seq():
    """Read the segmented text file and load the full word sequence."""
    file_object = open("zhenhuanzhuan.segment", "r", encoding="utf-8")
    while True:
        line = file_object.readline()
        if line:
            for word in line.split(" "):
                if word in word_vector_dict:
                    seq.append(word_vector_dict[word])
        else:
            break
    file_object.close()
def vector_sqrtlen(vector):
    length = 0
    for item in vector:
        length += item * item
    length = math.sqrt(length)
    return length
def vector_cosine(v1, v2):
    if len(v1) != len(v2):
        sys.exit(1)
    sqrtlen1 = vector_sqrtlen(v1)
    sqrtlen2 = vector_sqrtlen(v2)
    value = 0
    for item1, item2 in zip(v1, v2):
        value += item1 * item2
    return value / (sqrtlen1*sqrtlen2)
def vector2word(vector):
    max_cos = -10000
    match_word = ""
    for word in word_vector_dict:
        v = word_vector_dict[word]
        cosine = vector_cosine(vector, v)
        if cosine > max_cos:
            max_cos = cosine
            match_word = word
    return (match_word, max_cos)
def main():
    load_vectors("./vectors.bin")
    init_seq()
    xlist = []
    ylist = []
    test_X = None
    #for i in range(len(seq)-100):
    for i in range(10):
        sequence = seq[i:i+20]
        xlist.append(sequence)
        ylist.append(seq[i+20])
        if test_X is None:
            test_X = np.array(sequence)
            (match_word, max_cos) = vector2word(seq[i+20])
            print("right answer=", match_word, max_cos)
    X = np.array(xlist)
    Y = np.array(ylist)
    net = tflearn.input_data([None, 20, 200])
    net = tflearn.lstm(net, 200)
    net = tflearn.fully_connected(net, 200, activation="linear")
    net = tflearn.regression(net, optimizer="sgd", learning_rate=0.1, loss="mean_square")
    model = tflearn.DNN(net)
    model.fit(X, Y, n_epoch=500, batch_size=10,snapshot_epoch=False,show_metric=True)
    model.save("model")
    predict = model.predict([test_X])
    #print(predict)
    #for v in test_X:
    #    print(vector2word(v))
    (match_word, max_cos) = vector2word(predict[0])
    print("predict=", match_word, max_cos)
main()
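load_vectors loads the word vectors from vectors.bin, init_seq loads the segmented text of the novel into one sequence, and vector2word finds the word closest to a given vector. The model has only a single LSTM unit.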
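The nearest-word lookup loops over the whole vocabulary in pure Python, which gets slow once it is called repeatedly. A vectorized variant with numpy is sketched below. It is an alternative, not the author's code: build_index is a hypothetical helper, this vector2word takes two extra arguments, and the three toy vectors are made up for the demonstration.

```python
import numpy as np

def build_index(word_vector_dict):
    """Stack the vectors into one matrix and scale each row to unit length."""
    words = list(word_vector_dict)
    matrix = np.array([word_vector_dict[w] for w in words], dtype=float)
    matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)
    return words, matrix

def vector2word(vector, words, matrix):
    """Return the word whose vector has the highest cosine similarity to `vector`."""
    vector = np.asarray(vector, dtype=float)
    cosines = matrix @ (vector / np.linalg.norm(vector))
    best = int(np.argmax(cosines))
    return words[best], float(cosines[best])

# Toy 3-dimensional vectors, made up for the demonstration.
toy = {"north": [0.0, 1.0, 0.0], "east": [1.0, 0.0, 0.0], "up": [0.0, 0.0, 1.0]}
words, matrix = build_index(toy)
match_word, max_cos = vector2word([0.9, 0.1, 0.0], words, matrix)
print(match_word, max_cos)
```

Normalizing the rows once means each lookup is a single matrix-vector product followed by an argmax.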
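After 500 epochs of training, the mean squared loss drops to 0.33673, and the next word is predicted with a cosine similarity of 0.941794432002.
With a powerful GPU and some parameter tuning, you could train on the whole text and modify the predict part of the code to keep emitting the next word, automatically producing prose in the style of the novel. This is built on tflearn. The seq2seq example in the official tflearn docs calls tensorflow/python/ops/seq2seq.py from tensorflow directly and is based on one-hot embedding, which the author expects to work less well than word vectors.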