<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Doc2vec on KK's Blog (fromkk)</title><link>https://fromkk.com/tags/doc2vec/</link><description>Recent content in Doc2vec on KK's Blog (fromkk)</description><generator>Hugo</generator><language>en</language><managingEditor>bebound@gmail.com (KK)</managingEditor><webMaster>bebound@gmail.com (KK)</webMaster><lastBuildDate>Sun, 10 Aug 2025 18:44:05 +0800</lastBuildDate><atom:link href="https://fromkk.com/tags/doc2vec/index.xml" rel="self" type="application/rss+xml"/><item><title>Semi-supervised text classification using doc2vec and label spreading</title><link>https://fromkk.com/posts/semi-supervised-text-classification-using-doc2vec-and-label-spreading/</link><pubDate>Sun, 10 Sep 2017 15:29:00 +0800</pubDate><author>bebound@gmail.com (KK)</author><guid>https://fromkk.com/posts/semi-supervised-text-classification-using-doc2vec-and-label-spreading/</guid><description>&lt;p&gt;Here is a simple way to classify text with little human effort and still get impressive performance.&lt;/p&gt;
&lt;p&gt;It can be divided into two steps:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Get training data by using keyword classification&lt;/li&gt;
&lt;li&gt;Generate a more accurate classification model by using doc2vec and label spreading&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id="keyword-based-classification"&gt;Keyword-based Classification&lt;/h2&gt;
&lt;p&gt;Keyword-based classification is a simple but effective method. Extracting the target keywords by hand is monotonous work, so I use this method to extract keyword candidates automatically.&lt;/p&gt;</description></item><item><title>Parameters in doc2vec</title><link>https://fromkk.com/posts/parameters-in-dov2vec/</link><pubDate>Thu, 03 Aug 2017 15:20:00 +0800</pubDate><author>bebound@gmail.com (KK)</author><guid>https://fromkk.com/posts/parameters-in-dov2vec/</guid><description>&lt;p&gt;Here are some parameters in &lt;code&gt;gensim&lt;/code&gt;&amp;rsquo;s &lt;code&gt;doc2vec&lt;/code&gt; class.&lt;/p&gt;
&lt;h3 id="window"&gt;window&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;window&lt;/code&gt; is the maximum distance between the predicted word and the context words used for prediction within a document. It looks both behind and ahead of the predicted word.&lt;/p&gt;
&lt;p&gt;In the &lt;code&gt;skip-gram&lt;/code&gt; model, if the window size is 2, the training samples look like this (the blue word is the input word):&lt;/p&gt;
&lt;figure class="image-size-s"&gt;&lt;img src="https://fromkk.com/images/doc2vec_window.png"&gt;
&lt;/figure&gt;
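&lt;p&gt;The pairing described above can be sketched in plain Python. This is only an illustration, not gensim's implementation: the helper name &lt;code&gt;skipgram_pairs&lt;/code&gt; and the example sentence are made up here, and the window is treated as a fixed size for simplicity.&lt;/p&gt;

```python
def skipgram_pairs(tokens, window):
    """Return (input_word, context_word) pairs for a fixed window size.

    Hypothetical helper for illustration; it is not part of gensim.
    """
    pairs = []
    for i, center in enumerate(tokens):
        # Look up to `window` positions behind and ahead of the input word.
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


sentence = "the quick brown fox jumps".split()
pairs = skipgram_pairs(sentence, window=2)
for input_word, context_word in pairs:
    print(input_word, context_word)
```

&lt;p&gt;Words near the start or end of the sentence get fewer pairs because part of their window falls outside the sentence.&lt;/p&gt;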

&lt;h3 id="min-count"&gt;min_count&lt;/h3&gt;
&lt;p&gt;Words that appear fewer than this many times in the corpus are ignored.&lt;/p&gt;
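&lt;p&gt;A minimal sketch of the effect of &lt;code&gt;min_count&lt;/code&gt;, assuming the corpus is a list of tokenized documents. The helper &lt;code&gt;filter_rare_words&lt;/code&gt; and the toy corpus are hypothetical and only mimic the vocabulary pruning; gensim does this internally when it builds the vocabulary.&lt;/p&gt;

```python
from collections import Counter


def filter_rare_words(documents, min_count):
    """Drop words whose total corpus frequency is below min_count.

    Hypothetical helper for illustration; it is not part of gensim.
    """
    # Count every word across the whole corpus, not per document.
    counts = Counter(word for doc in documents for word in doc)
    return [[w for w in doc if counts[w] >= min_count] for doc in documents]


docs = [["good", "movie"], ["good", "plot"], ["bad", "movie"]]
filtered = filter_rare_words(docs, min_count=2)
print(filtered)
```

&lt;p&gt;With &lt;code&gt;min_count=2&lt;/code&gt;, the words that occur only once in this toy corpus (&lt;code&gt;plot&lt;/code&gt; and &lt;code&gt;bad&lt;/code&gt;) are removed.&lt;/p&gt;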
&lt;h3 id="sample"&gt;sample&lt;/h3&gt;
&lt;p&gt;High-frequency words like &lt;code&gt;the&lt;/code&gt; contribute little to training. &lt;code&gt;sample&lt;/code&gt; is a threshold for randomly dropping these high-frequency words. The probability of keeping the word \(w_i\) is:&lt;/p&gt;</description></item></channel></rss>