
Deploy Nikola Org Mode on Travis

Recently I have been enjoying Spacemacs, so I decided to switch from Markdown to Org files for writing my blog. After several attempts, I managed to get Travis to convert Org files to HTML. Here are the steps.

Install Org Mode plugin

First, install the Org Mode plugin on your computer by following the official guide: Nikola orgmode plugin.

Edit conf.el

Org files are converted to HTML so they can be displayed on Nikola, and the Org Mode plugin calls Emacs to do this job. When I ran nikola build, it showed this message: Please install htmlize from https://github.com/hniksic/emacs-htmlize. Since I'm using Spacemacs, the htmlize package is already downloaded when the org layer is enabled; I just need to add the htmlize folder to load-path. So here is the code:
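(A minimal sketch; the versioned htmlize path below is an assumption, so point it at wherever Spacemacs actually downloaded the package on your machine.)

;; conf.el: let the Emacs process started by the plugin find htmlize.
;; NOTE: the exact directory name is machine-specific -- adjust it to the
;; htmlize folder under your ~/.emacs.d/elpa/.
(add-to-list 'load-path "~/.emacs.d/elpa/htmlize-20210825.2150")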

Using Chinese Characters in Matplotlib

After searching on Google, here is the easiest solution. It should also work for other languages:

import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import matplotlib.font_manager as fm

# Point matplotlib at a font that actually contains CJK glyphs
# (PingFang ships with macOS).
f = "/System/Library/Fonts/PingFang.ttc"
prop = fm.FontProperties(fname=f)

# Pass the font explicitly wherever text is drawn.
plt.title("你好", fontproperties=prop)
plt.show()

Output: the figure title 你好 is rendered correctly instead of as empty boxes.

LSTM and GRU

LSTM

To avoid the vanishing and exploding gradient problems of the vanilla RNN, the LSTM was introduced; it can remember information for longer periods of time.

Here is the structure of LSTM:

The calculation procedure is:

\[\begin{aligned} f_t&=\sigma(W_f\cdot[h_{t-1},x_t]+b_f)\\ i_t&=\sigma(W_i\cdot[h_{t-1},x_t]+b_i)\\ o_t&=\sigma(W_o\cdot[h_{t-1},x_t]+b_o)\\ \tilde{C}_t&=\tanh(W_C\cdot[h_{t-1},x_t]+b_C)\\ C_t&=f_t\ast C_{t-1}+i_t\ast \tilde{C}_t\\ h_t&=o_t \ast \tanh(C_t) \end{aligned}\]

\(f_t\), \(i_t\), and \(o_t\) are the forget gate, input gate, and output gate respectively. \(\tilde{C}_t\) is the new candidate memory content, \(C_t\) is the cell state, and \(h_t\) is the output (hidden state).
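To make the procedure concrete, here is a minimal NumPy sketch of a single LSTM step following the equations above (the weight shapes and function names are illustrative, not taken from any particular library):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, C_prev, W_f, W_i, W_o, W_C, b_f, b_i, b_o, b_C):
    # Each W_* has shape (hidden, hidden + input); each b_* has shape (hidden,).
    z = np.concatenate([h_prev, x_t])    # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)         # forget gate
    i_t = sigmoid(W_i @ z + b_i)         # input gate
    o_t = sigmoid(W_o @ z + b_o)         # output gate
    C_tilde = np.tanh(W_C @ z + b_C)     # new candidate memory content
    C_t = f_t * C_prev + i_t * C_tilde   # update the cell state
    h_t = o_t * np.tanh(C_t)             # output / hidden state
    return h_t, C_t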

Models and Architectures in Word2vec

Generally, word2vec is a language model that predicts word probabilities based on the context. While building the model, it creates a word embedding for each word, and these word embeddings are widely used in many NLP tasks.

Models

CBOW (Continuous Bag of Words)

CBOW uses the context to predict the probability of the current word. (In the picture, each word is one-hot encoded, \(W_{V\times N}\) is the word embedding matrix, and \(W_{V\times N}^{'}\), the output weight matrix of the hidden layer, is the same as \(\hat{\upsilon}\) in the following equations.)
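As a rough illustration (the dimensions and variable names below are made up for the example), a CBOW forward pass simply averages the context embeddings and applies a softmax over the vocabulary:

import numpy as np

V, N = 10000, 300                        # vocabulary size and embedding size (illustrative)
W_in = 0.01 * np.random.randn(V, N)      # input word embedding matrix
W_out = 0.01 * np.random.randn(N, V)     # output weight matrix

def cbow_forward(context_ids):
    h = W_in[context_ids].mean(axis=0)   # hidden layer: average of the context embeddings
    scores = h @ W_out                    # one score per vocabulary word
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()                # softmax: probability of each word being the center word

probs = cbow_forward([3, 17, 42, 99])    # hypothetical context word ids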

Semi-supervised text classification using doc2vec and label spreading

Here is a simple way to classify text without much human effort while still getting impressive performance.

It can be divided into two steps:

  1. Get training data by using keyword-based classification
  2. Generate a more accurate classification model by using doc2vec and label spreading (sketched just below)
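For step 2, a minimal sketch might combine gensim's Doc2Vec with scikit-learn's LabelSpreading. Here, texts (a list of token lists) and labels (an integer array with -1 marking unlabeled documents, seeded by the keyword step) are assumed inputs:

import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.semi_supervised import LabelSpreading

# Train document vectors on the whole corpus, labeled and unlabeled alike.
tagged = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(texts)]
d2v = Doc2Vec(tagged, vector_size=100, min_count=2, epochs=20)
X = np.array([d2v.dv[i] for i in range(len(texts))])

# Spread the keyword-derived labels to the unlabeled documents.
ls = LabelSpreading(kernel="knn", n_neighbors=7)
ls.fit(X, labels)               # -1 entries are treated as unlabeled
predicted = ls.transduction_    # propagated label for every document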

Keyword-based Classification

Keyword-based classification is a simple but effective method. Extracting the target keywords is monotonous work, so I use this method to automatically extract keyword candidates.
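As a minimal sketch of the labeling step (the keyword lists below are made up for illustration):

# Assign a document to the first class whose keyword it contains; None means unlabeled.
KEYWORDS = {
    "sports": ["football", "basketball", "tournament"],
    "finance": ["stock", "interest rate", "earnings"],
}

def keyword_label(text):
    lowered = text.lower()
    for label, words in KEYWORDS.items():
        if any(w in lowered for w in words):
            return label
    return None   # left for label spreading to fill in later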

Parameters in doc2vec

Here are some parameters in gensim's Doc2Vec class.
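For context, here is a minimal usage sketch showing where these parameters go (the corpus and values are made up; parameter names follow recent gensim versions):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus; tags identify each document.
docs = [TaggedDocument(words=["the", "cat", "sat", "down"], tags=["doc0"]),
        TaggedDocument(words=["the", "dog", "barked", "loudly"], tags=["doc1"])]

model = Doc2Vec(docs,
                vector_size=50,   # dimensionality of the document vectors
                window=5,         # max distance between predicted word and context words
                min_count=1,      # ignore words appearing fewer times than this
                sample=1e-5,      # threshold for downsampling high-frequency words
                epochs=20)

vec = model.infer_vector(["the", "cat", "barked"])   # vector for a new, unseen document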

window

window is the maximum distance between the predicted word and context words used for prediction within a document. It will look behind and ahead.

In the skip-gram model, if the window size is 2, the training samples will look like this (the blue word is the input word):
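As a small worked example (the sentence is made up), here is how the (input word, context word) pairs would be enumerated with window = 2:

# Enumerate skip-gram training pairs for a toy sentence with window = 2.
sentence = ["the", "quick", "brown", "fox", "jumps"]
window = 2
pairs = [(sentence[i], sentence[j])
         for i in range(len(sentence))
         for j in range(max(0, i - window), min(len(sentence), i + window + 1))
         if j != i]
print(pairs[:4])   # ('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')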

min_count

If a word appears fewer times than this value, it will be skipped.

sample

High-frequency words like the are not very useful for training. sample is a threshold for randomly downsampling these higher-frequency words. The probability of keeping the word \(w_i\) is:
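(Assuming the formula from the original word2vec C implementation, which gensim's downsampling follows:)

\[P(w_i)=\left(\sqrt{\frac{z(w_i)}{s}}+1\right)\cdot\frac{s}{z(w_i)}\]

where \(z(w_i)\) is the fraction of the total corpus occupied by \(w_i\) and \(s\) is the value of sample.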