AI Course Chatbot Assignment

July 06, 2018

After eight AI classes,
I have finally gone from Pokémon trainer to AI trainer (just kidding, of course XD)

This chatbot is based on https://github.com/Ethan0423/Chatbot-Stanford-TF-Imp/tree/master/assignments/chatbot

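I ran the model from that repo directly; you can also find their introductory slides online. The pity is that this chatbot has no accompanying paper, so there is no guidance on when training should count as finished, or on how to improve or tune it. After talking it over with the teacher, I decided to treat training as complete once the loss dropped to 1.5.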

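Even though I was only running and tweaking someone else's chatbot, just following the whole flow was hard work, because about 85% of the eight classes covered image recognition and only about 15% covered NLP. For a lot of it I had to google furiously for answers. Honestly, I am not sure whether the code or Jupyter was to blame, but this error kept showing up: 'module' object has no attribute xxx => It turned out the cause was probably a stale .pyc, with the kernel still holding the old version of the module. Restarting the kernel and rerunning fixed it.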
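
The stale-module problem can sometimes be handled without restarting the whole kernel. Here is a minimal sketch, assuming the edited file is an importable module (for example data.py or model.py from the repo). The helper name reload_module is my own and is not part of the repo.

```python
import importlib
import sys


def reload_module(name):
    """Re-import a module so the notebook sees the latest source on disk.

    `name` is the module name as a string, e.g. "data" or "model" (assumed).
    """
    if name in sys.modules:
        # Already imported: re-execute the module's source in place.
        return importlib.reload(sys.modules[name])
    # Not imported yet: a plain import is enough.
    return importlib.import_module(name)
```

In a notebook cell this would be used as `data = reload_module("data")`. Objects created before the reload still point at the old classes, so restarting the kernel remains the most reliable fix.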

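After finishing this assignment, I thought back to what the teacher said: an AI engineer does not need to write code.
I take this to mean that the code structure of a basic AI model is fairly fixed, and that in practice you only add pieces for engineering reasons. In this assignment, for example, a decoder mask was added on top of the seq2seq model to support bucketing.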

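Putting together the training dataset and the testing dataset turned out to be a real art. I studied that part for a long time and can only say that I roughly know what they did, but not why they did it that way.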

The assignment is due on 7/6, so I am posting it here first, before handing it in:
Demo slides:
https://docs.google.com/presentation/d/11_pc1a9qfvfNMOG8ZHjwFwhFj4ObEJzr_utHPypn_NE/edit#slide=id.g3cd62c6dfc_0_51

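Below is a rough walkthrough of the whole flow (I actually wrote these notes inside chatbot.ipynb so it is clear which step is which, but here I only list the flow):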

"""
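step 1:
a. With argparse you can write professional-looking command-line interfaces in Python.
As far as I could tell, that does not carry over directly to a Jupyter notebook, so the settings are hard-coded here.
b. Call data.py to prepare the raw data into a train set & test set.
This mainly loads the movie lines and conversations data.
The prepared train set & test set are then saved to the auto-generated processed directory.
c. Prepare the data to be model-ready: run basic_tokenizer(line) on the saved datasets, i.e. tokenize the text into tokens.
Next call build_vocab(): it writes vocab.dec and vocab.enc, which include the special tokens <pad>, <unk>, <s>, <\s>. It also writes the ENC_VOCAB and DEC_VOCAB sizes into config.py.
Finally call token2id(): convert all tokens in the data into their corresponding index in the vocabulary. It processes train.enc, train.dec,
test.enc and test.dec together with vocab.enc and vocab.dec, and outputs train_ids.enc, train_ids.dec, test_ids.enc and test_ids.dec.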
"""

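To make step 1c more concrete, here is a small sketch of the tokenize, build vocabulary, convert to ids sequence. It borrows the function names from the notes above, but the bodies are my own simplified versions. The regular expression, the min_count parameter and the in-memory dict (instead of the vocab.enc / vocab.dec files) are assumptions, not the repo's actual code.

```python
import re
from collections import Counter

# Special tokens listed in the walkthrough; their order fixes ids 0..3.
SPECIAL_TOKENS = ["<pad>", "<unk>", "<s>", "<\\s>"]


def basic_tokenizer(line):
    """Lowercase a line and split it into word and punctuation tokens."""
    return re.findall(r"[a-z0-9']+|[^\sa-z0-9']", line.lower())


def build_vocab(lines, min_count=1):
    """Map each token to an id: special tokens first, then words by frequency."""
    counts = Counter(tok for line in lines for tok in basic_tokenizer(line))
    words = [w for w, c in counts.most_common() if c >= min_count]
    return {tok: i for i, tok in enumerate(SPECIAL_TOKENS + words)}


def token2id(line, vocab):
    """Convert a line to vocabulary ids, using <unk> for unseen tokens."""
    unk = vocab["<unk>"]
    return [vocab.get(tok, unk) for tok in basic_tokenizer(line)]
```

With a vocabulary built from "hi there" and "hi you", the sentence "hi stranger" becomes the id of "hi" followed by the id of <unk>, because "stranger" never appeared in the data.
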
"""
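step 2:
a. If there is no checkpoints directory, it is created automatically. (During training the folder looked empty, but the checkpoints were in fact being saved.)
b. The default is train mode (when run from the command line).
c. train bot:
- Call _get_buckets(): load the dataset into buckets based on their lengths. A random bucket is then chosen.
- Train mode creates the backward path.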
- Go to model.py and instantiate a ChatBotModel:
1. Call model.build_graph().
2. Call _create_placeholders(): feeds for inputs. It's a list of placeholders.
These are encoder_inputs, decoder_inputs, decoder_masks (not part of the plain seq2seq model; added to support bucketing) and targets.
3. Call _inference(): if we use sampled softmax, we need an output projection. (Sampled softmax only makes sense
if we sample fewer classes than the vocabulary size.) In short, tf.nn.sampled_softmax_loss(...) estimates the loss from a random sample of the vocabulary
instead of the full vocabulary, which keeps each training step cheap when the vocabulary is large.
4. Call _create_loss():
It calls tf.contrib.legacy_seq2seq.model_with_buckets(...) => create a sequence-to-sequence model with support
for bucketing.
That function in turn calls tf.contrib.legacy_seq2seq.embedding_attention_seq2seq(...) => embedding sequence-to-sequence
model with attention.
5. Call _create_optimizer(): create the optimizer, which might take a couple of minutes depending on how many buckets you have.
It calls tf.clip_by_global_norm(tf.gradients(...)): gradient clipping is there to deal with exploding gradients (it does not help with vanishing
gradients). When the weights are updated too aggressively in one iteration, the loss can easily diverge. Intuitively, gradient clipping keeps the weight updates
within a reasonable range.
- Back in chatbot.ipynb, start running the session:
1. Call _get_random_bucket(...): to choose a training sample. (My understanding: sampling at random makes it less likely to keep drawing outliers and more likely to draw typical data.)
2. Call data.get_batch(...): return one batch to feed into the model.
3. Call run_step(...): run one step in training. It behaves differently in train mode and in chat mode.
4. Call _assert_lengths(...): assert that the encoder inputs, decoder inputs & decoder masks are of the expected lengths.
5. input_feed: encoder inputs, decoder inputs, target_weights, as provided.
6. output_feed: depends on whether we do a backward step or not.

-- Training complete --
"""

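The bucketing and decoder-mask idea from step 2 can be sketched in plain Python as follows. This is a simplified illustration, not the repo's data.get_batch: the bucket sizes are made up, PAD_ID = 0 is assumed from the position of <pad> in the vocabulary, and choosing a bucket with probability proportional to its size is my assumption about what "random bucket" means.

```python
import random

PAD_ID = 0  # assumed id of <pad>
# (max encoder length, max decoder length) per bucket; values are made up.
BUCKETS = [(4, 6), (8, 10), (16, 19)]


def find_bucket(enc_ids, dec_ids, buckets=BUCKETS):
    """Return the index of the smallest bucket that fits the pair, or None."""
    for i, (enc_max, dec_max) in enumerate(buckets):
        if len(enc_ids) <= enc_max and len(dec_ids) <= dec_max:
            return i
    return None


def choose_random_bucket(bucket_sizes, rng=random):
    """Pick a bucket index with probability proportional to its size."""
    return rng.choices(range(len(bucket_sizes)), weights=bucket_sizes, k=1)[0]


def pad_pair(enc_ids, dec_ids, bucket):
    """Pad one pair to the bucket's lengths and build the decoder mask."""
    enc_max, dec_max = bucket
    enc = enc_ids + [PAD_ID] * (enc_max - len(enc_ids))
    dec = dec_ids + [PAD_ID] * (dec_max - len(dec_ids))
    # 1.0 marks real tokens, 0.0 marks padding that the loss should ignore.
    mask = [1.0] * len(dec_ids) + [0.0] * (dec_max - len(dec_ids))
    return enc, dec, mask
```

Because every pair in a bucket is padded to the same length, the model needs the mask to know which decoder positions are real tokens and which are only padding.
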
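The effect of clipping by global norm can also be shown in a few lines of plain Python. This sketch only follows the documented formula behind tf.clip_by_global_norm (every gradient is scaled by clip_norm / max(global_norm, clip_norm)). It works on nested lists of floats rather than tensors, and the function here is my own stand-in, not the TensorFlow implementation.

```python
import math


def clip_by_global_norm(grads, clip_norm):
    """Scale all gradients together so their joint L2 norm is at most clip_norm.

    `grads` is a list of flat lists of floats, one per parameter tensor.
    `clip_norm` must be positive. Returns (clipped_grads, global_norm).
    """
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # The scale is exactly 1.0 when the norm is already within the limit.
    scale = clip_norm / max(global_norm, clip_norm)
    clipped = [[g * scale for g in grad] for grad in grads]
    return clipped, global_norm
```

For example, the gradients [[3.0], [4.0]] have a global norm of 5. With clip_norm set to 1 they are all scaled by 1/5, so their direction is kept and only the step size shrinks.
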
"""
-- Once training is complete, you can start chatting with the chatbot --

a. In test mode, don't create the backward path.
b. The chat flow is similar to the train flow:
1. Call _get_user_input(): the input is transformed into encoder input later.
2. Call data.sentence2id(...)
3. Call _find_right_bucket(...): find the right bucket for an encoder input based on its length.
4. Call _get_batch()
5. Call run_step(...)
6. Call _construct_response(...):
- output_logits (the output of the last layer): the outputs from the sequence-to-sequence wrapper.
- This is a greedy decoder: outputs are just argmaxes of output_logits.

-- Chat complete --
"""

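The greedy decoding in the last step can be sketched as below. It assumes output_logits is a list with one list of vocabulary scores per decoder step, and that the end-of-sentence token has id 3 (its position in the special-token list above). The function name greedy_decode is mine, and the repo's _construct_response may differ in detail.

```python
EOS_ID = 3  # assumed id of the end-of-sentence token


def greedy_decode(output_logits, id_to_token, eos_id=EOS_ID):
    """Pick the highest-scoring token at each step and stop at end-of-sentence.

    `output_logits` holds one list of vocabulary scores per decoder step.
    `id_to_token` maps a vocabulary id (list index) to its token string.
    """
    token_ids = []
    for step_scores in output_logits:
        # argmax over the vocabulary for this decoder step
        best = max(range(len(step_scores)), key=step_scores.__getitem__)
        if best == eos_id:
            break
        token_ids.append(best)
    return " ".join(id_to_token[i] for i in token_ids)
```

Greedy decoding commits to the single best token at each step and never revisits earlier choices, which is simple but can produce bland or repetitive replies.
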
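Later I found that my guess was wrong: the results show that 3,000 iterations are not enough to train the chatbot well.
By the way, conversations with the chatbot drift off into silliness very easily; it even came out with "pregnant". This is probably because the training data is movie dialogue, which makes the chatbot's replies rather dramatic XD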

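I can only call this training run a small success, because the chatbot dodged a lot of questions. It does seem to have been designed with a mechanism for not answering personal questions, but perhaps part of the problem comes from the training itself?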

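All I can say is that, from reviewing AI in my spare time in early June through to writing the assignment, the whole process was quite an ordeal.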
