Machine Learning on KK's Blog (fromkk)

Near-duplicate with SimHash

bebound@gmail.com (KK) — Wed, 04 Dec 2019 00:16:00 +0800

Before talking about SimHash, let’s review some other methods which can also identify duplication.

Longest Common Subsequence(LCS)

This is the algorithm used by the diff command. It is equivalent to edit distance when insertion and deletion are the only two edit operations.

This works well for short strings. However, the algorithm’s time complexity is $O(m*n)$ for two strings of lengths $m$ and $n$, so it is not suitable for a large corpus. Also, if two documents consist of the same paragraphs in a different order, LCS treats them as different documents, which is not what we expect.
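As a minimal sketch of the idea, here is the textbook dynamic-programming version of LCS in Python. The function names `lcs_length` and `indel_distance` are my own for illustration, and this is not the exact algorithm that diff implementations ship. The double loop is where the $O(m*n)$ cost comes from.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    m, n = len(a), len(b)
    # prev[j] is the LCS length of a[:i-1] and b[:j]; only two rows are kept.
    prev = [0] * (n + 1)
    for i in range(1, m + 1):
        curr = [0] * (n + 1)
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                # Matching elements extend the best subsequence by one.
                curr[j] = prev[j - 1] + 1
            else:
                # Otherwise drop one element from either side.
                curr[j] = max(prev[j], curr[j - 1])
        prev = curr
    return prev[n]


def indel_distance(a, b):
    """Edit distance when only insertions and deletions are allowed."""
    return len(a) + len(b) - 2 * lcs_length(a, b)
```

Because the functions work on any sequence, passing two lists of paragraphs shows the ordering problem: `lcs_length(["p1", "p2"], ["p2", "p1"])` finds only one common paragraph even though both documents contain the same two.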

The Annotated The Annotated Transformer

bebound@gmail.com (KK) — Sun, 01 Sep 2019 16:00:00 +0800

Thanks to the articles listed at the end of this post, I understand how transformers work. These posts are comprehensive, but some points confused me.

First, this is the figure referenced by almost every post about the Transformer.

The Transformer consists of these parts: Input, Encoder*N, Output Input, Decoder*N, Output. I’ll explain them step by step.

Input

Each input word is mapped to a 512-dimensional vector. A Positional Encoding (PE) is then generated and added to the original embeddings.
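To make this step concrete, here is a small sketch in plain Python. It assumes the sinusoidal encoding from the original Transformer paper, which the excerpt above does not spell out, and the function names are hypothetical.

```python
import math


def positional_encoding(position, d_model=512):
    """Sinusoidal encoding for one position (assumed formula, see lead-in)."""
    pe = [0.0] * d_model
    for i in range(0, d_model, 2):
        # Each even/odd pair of dimensions shares one wavelength.
        angle = position / (10000 ** (i / d_model))
        pe[i] = math.sin(angle)
        if i + 1 < d_model:
            pe[i + 1] = math.cos(angle)
    return pe


def add_positional_encoding(embeddings):
    """Add the encoding element-wise to a list of word embedding vectors."""
    d_model = len(embeddings[0])
    return [
        [e + p for e, p in zip(vec, positional_encoding(pos, d_model))]
        for pos, vec in enumerate(embeddings)
    ]
```

The encoding has the same dimension as the embedding, so the two can simply be summed.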

Different types of Attention

bebound@gmail.com (KK) — Mon, 15 Jul 2019 00:16:00 +0800

$s_t$ is the target hidden state and $h_i$ are the source hidden states; the shape of each is (n,1). $c_t$ is the final context vector, and $\alpha_{t,i}$ is the alignment weight (the normalized alignment score).

\[\begin{aligned} c_t&=\sum_{i=1}^n \alpha_{t,i}h_i \\ \alpha_{t,i}&= \frac{\exp(score(s_t,h_i))}{\sum_{j=1}^n \exp(score(s_t,h_j))} \end{aligned}\]
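The two equations translate almost line by line into code. The sketch below assumes a dot-product `score` function, which is only one of several possible choices, and uses plain Python lists as vectors. The names are mine.

```python
import math


def dot_score(s_t, h_i):
    """Dot-product score (an assumed choice of score function)."""
    return sum(a * b for a, b in zip(s_t, h_i))


def attention_context(s_t, hidden_states, score=dot_score):
    """Return the context vector c_t and the softmax weights over h_i."""
    scores = [score(s_t, h) for h in hidden_states]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(scores)
    exps = [math.exp(x - top) for x in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(hidden_states[0])
    # c_t is the weighted sum of the hidden states h_i.
    context = [
        sum(w * h[k] for w, h in zip(weights, hidden_states))
        for k in range(dim)
    ]
    return context, weights
```

The weights always sum to 1, so $c_t$ is a weighted average of the hidden states $h_i$.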

Global(Soft) VS Local(Hard)

Global attention takes all source hidden states into account, while local attention uses only a subset of them.
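A rough way to see the difference is to compare how the weights are normalized. In this sketch, `center` and `half_width` are hypothetical parameters that pick a fixed window of source positions. How the window is actually chosen depends on the model and is not covered here.

```python
import math


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def global_weights(scores):
    """Normalize over every source position."""
    return softmax(scores)


def local_weights(scores, center, half_width):
    """Normalize only over positions within half_width of center."""
    lo = max(0, center - half_width)
    hi = min(len(scores), center + half_width + 1)
    window = softmax(scores[lo:hi])
    # Positions outside the window get zero weight.
    return [0.0] * lo + window + [0.0] * (len(scores) - hi)
```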

Content-based VS Location-based

Content-based attention uses both the source and target hidden states, whereas location-based attention (in Luong et al.’s formulation) uses only the target hidden state.

Using Dueling DQN to Play Flappy Bird

bebound@gmail.com (KK) — Sun, 14 Apr 2019 17:10:00 +0800

PyTorch provides a simple DQN implementation to solve the cartpole game. However, the code is incorrect: it diverges after training (it has been discussed here).

The official code’s training data is below; its high score is about 50, and it finally diverges.

There are many reasons that lead to divergence.

First, the tutorial uses the difference of two frames as input. Not only does this lose the cart’s absolute position (which is useful, as the game terminates if the cart moves too far from the centre), it also confuses the agent when the difference is the same but the underlying states differ.
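A toy example shows the ambiguity. It deliberately simplifies each frame to a single number, the cart’s x position, which is not how the tutorial represents frames; the values are made up for illustration.

```python
def difference_input(prev_x, curr_x):
    """What the agent sees when only the frame difference is used."""
    return curr_x - prev_x


# Cart near the centre, moving right.
near_centre = difference_input(0.0, 0.5)
# Cart close to the edge, moving right by the same amount.
near_edge = difference_input(2.0, 2.5)

# Both situations produce the same input, although the second is
# much closer to ending the episode.
same_input = near_centre == near_edge
```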

TextCNN with PyTorch and Torchtext on Colab

bebound@gmail.com (KK) — Mon, 03 Dec 2018 15:47:00 +0800

PyTorch is a really powerful framework for building machine learning models. Although some features are missing compared with TensorFlow (for example, early stopping and History for drawing plots), its code style is more intuitive.

Torchtext is an NLP package that is also made by the PyTorch team. It provides a way to read, process, and iterate over texts.

Google Colab is a Jupyter notebook environment hosted by Google, where you can use a free GPU or TPU to run your model.

LSTM and GRU

bebound@gmail.com (KK) — Sun, 22 Apr 2018 14:39:00 +0800

LSTM

To avoid the vanishing and exploding gradient problems of the vanilla RNN, LSTM was proposed; it can remember information for longer periods of time.

Here is the structure of LSTM:

The calculation procedure is:

\[\begin{aligned} f_t&=\sigma(W_f\cdot[h_{t-1},x_t]+b_f)\\ i_t&=\sigma(W_i\cdot[h_{t-1},x_t]+b_i)\\ o_t&=\sigma(W_o\cdot[h_{t-1},x_t]+b_o)\\ \tilde{C_t}&=\tanh(W_C\cdot[h_{t-1},x_t]+b_C)\\ C_t&=f_t\ast C_{t-1}+i_t\ast \tilde{C_t}\\ h_t&=o_t \ast \tanh(C_t) \end{aligned}\]

$f_t$, $i_t$, and $o_t$ are the forget gate, input gate, and output gate respectively. $\tilde{C_t}$ is the new memory content, $C_t$ is the cell state, and $h_t$ is the output.
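The six equations can be followed one by one in a single time step written with plain Python lists. This is only a sketch for reading alongside the formulas: `lstm_step`, `affine`, and the `params` dictionary layout are hypothetical names, and a real model would use a tensor library instead.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, b, v):
    """Return W·v + b for a matrix W (list of rows), bias b and vector v."""
    return [sum(w * x for w, x in zip(row, v)) + b_k for row, b_k in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; params maps 'f', 'i', 'o', 'C' to (W, b) pairs."""
    z = h_prev + x_t  # list concatenation, i.e. [h_{t-1}, x_t]
    f_t = [sigmoid(a) for a in affine(*params["f"], z)]      # forget gate
    i_t = [sigmoid(a) for a in affine(*params["i"], z)]      # input gate
    o_t = [sigmoid(a) for a in affine(*params["o"], z)]      # output gate
    c_new = [math.tanh(a) for a in affine(*params["C"], z)]  # new memory content
    # The gates act element-wise on the old cell state and the new content.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_new)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t
```

With all weights and biases set to zero, every gate outputs 0.5, so the cell state is simply halved at each step. That makes a convenient sanity check.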

Models and Architectures in Word2vec

bebound@gmail.com (KK) — Fri, 05 Jan 2018 15:14:00 +0800

Generally, word2vec is a language model that predicts the probability of a word based on its context. While building the model, it creates a word embedding for each word, and these embeddings are widely used in many NLP tasks.

Models

CBOW (Continuous Bag of Words)

Use the context to predict the probability of the current word. (In the picture, each word is one-hot encoded, $W_{V*N}$ is the word embedding matrix, and $W_{V*N}^{'}$, the output weight matrix of the hidden layer, is the same as $\hat{\upsilon}$ in the following equations.)
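A forward pass of CBOW can be sketched in a few lines. This version assumes the hidden layer is the average of the context embeddings and that the output is a full softmax over the vocabulary; both matrices are stored with one row per word, and the function name is hypothetical.

```python
import math


def cbow_forward(context_ids, W_in, W_out):
    """Probability of every vocabulary word given the context word ids.

    W_in and W_out are both V x N: one input (embedding) row and one
    output row per vocabulary word.
    """
    n_dims = len(W_in[0])
    # A one-hot vector times W_in just selects a row, so index directly
    # and average the context embeddings to get the hidden layer.
    hidden = [
        sum(W_in[c][k] for c in context_ids) / len(context_ids)
        for k in range(n_dims)
    ]
    # Score each word by the dot product of its output row with the hidden layer.
    scores = [sum(h * w for h, w in zip(hidden, row)) for row in W_out]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```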

Semi-supervised text classification using doc2vec and label spreading

bebound@gmail.com (KK) — Sun, 10 Sep 2017 15:29:00 +0800

Here is a simple way to classify text without much human effort and still get impressive performance.

It can be divided into two steps:

Get training data by using keyword classification
Generate a more accurate classification model by using doc2vec and label spreading

Keyword-based Classification

Keyword-based classification is a simple but effective method. Extracting the target keywords is monotonous work, so I use this method to automatically extract keyword candidates.

Parameters in doc2vec

bebound@gmail.com (KK) — Thu, 03 Aug 2017 15:20:00 +0800

Here are some parameters of gensim’s doc2vec class.

window

window is the maximum distance between the predicted word and context words used for prediction within a document. It will look behind and ahead.

In the skip-gram model, if the window size is 2, the training samples look like this (the blue word is the input word):
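The same samples can be generated with a short function. This is a sketch that always uses the full window on both sides; `skipgram_pairs` is a name I made up, not a gensim function.

```python
def skipgram_pairs(tokens, window=2):
    """Return (input word, context word) pairs for a list of tokens."""
    pairs = []
    for i, center in enumerate(tokens):
        # Look up to `window` positions behind and ahead of the input word.
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs
```

For example, `skipgram_pairs(["the", "quick", "brown", "fox"])` pairs "the" with "quick" and "brown", but not with "fox", which is three positions away.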

min_count

If a word appears fewer times than this value, it is skipped.
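The effect is easy to reproduce by hand. The helper below is a hypothetical stand-in that counts words over the whole corpus and drops the rare ones; it only mimics what the parameter does and is not gensim’s code.

```python
from collections import Counter


def apply_min_count(documents, min_count):
    """Drop every word whose total corpus frequency is below min_count."""
    counts = Counter(token for doc in documents for token in doc)
    return [[t for t in doc if counts[t] >= min_count] for doc in documents]
```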

sample

High-frequency words such as “the” are of little use for training. sample is a threshold for dropping these high-frequency words. The probability of keeping the word $w_i$ is:

Brief Introduction of Label Propagation Algorithm

bebound@gmail.com (KK) — Sun, 16 Jul 2017 21:45:00 +0800

As I said before, I’m working on a text classification project. I use doc2vec to convert text into vectors, then I use LPA to classify the vectors.

LPA is a simple, effective semi-supervised algorithm. It can use the density of unlabeled data to find a hyperplane to split the data.

Here are the main steps of the algorithm:

Let $(x_1,y_1) \dots (x_l,y_l)$ be labeled data, where $Y_L = \{y_1 \dots y_l\}$ are the class labels. Let $(x_{l+1},y_{l+1}) \dots (x_{l+u},y_{l+u})$ be unlabeled data, where $Y_U = \{y_{l+1} \dots y_{l+u}\}$ are unobserved; usually $l \ll u$. Let $X=\{x_1 \dots x_{l+u}\}$ where $x_i\in R^D$. The problem is to estimate $Y_U$ from $X$ and $Y_L$.
Calculate the similarity between the data points. The simplest metric is Euclidean distance. Use a parameter $\sigma$ to control the weights.

\[w_{ij}= \exp\left(-\frac{d^2_{ij}}{\sigma^2}\right)=\exp\left(-\frac{\sum^D_{d=1}(x^d_i-x^d_j)^2}{\sigma^2}\right)\]
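The weight formula is straightforward to compute directly. Here is a small sketch with points as plain Python lists; the function names are mine.

```python
import math


def rbf_weight(x_i, x_j, sigma):
    """w_ij = exp(-d_ij^2 / sigma^2), with d_ij the Euclidean distance."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-squared_distance / sigma ** 2)


def weight_matrix(points, sigma):
    """Pairwise weights for all labeled and unlabeled points."""
    return [[rbf_weight(p, q, sigma) for q in points] for p in points]
```

A point has weight 1 with itself, and the weight decays towards 0 as the distance grows. A smaller $\sigma$ makes the decay faster.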