<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Torchtext on KK's Blog (fromkk)</title><link>https://fromkk.com/tags/torchtext/</link><description>Recent content in Torchtext on KK's Blog (fromkk)</description><generator>Hugo</generator><language>en</language><managingEditor>bebound@gmail.com (KK)</managingEditor><webMaster>bebound@gmail.com (KK)</webMaster><lastBuildDate>Sun, 10 Aug 2025 18:44:05 +0800</lastBuildDate><atom:link href="https://fromkk.com/tags/torchtext/index.xml" rel="self" type="application/rss+xml"/><item><title>Torchtext snippets</title><link>https://fromkk.com/posts/torchtext-snippets/</link><pubDate>Mon, 01 Jul 2019 21:28:00 +0800</pubDate><author>bebound@gmail.com (KK)</author><guid>https://fromkk.com/posts/torchtext-snippets/</guid><description>&lt;h2 id="load-separate-files"&gt;Load separate files&lt;/h2&gt;
&lt;p&gt;The parameters of &lt;code&gt;data.Field&lt;/code&gt; are documented &lt;a href="https://torchtext.readthedocs.io/en/latest/data.html#torchtext.data.Field" target="_blank" rel="noopener noreffer "&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;When you call &lt;code&gt;build_vocab&lt;/code&gt;, torchtext adds &lt;code&gt;&amp;lt;unk&amp;gt;&lt;/code&gt; to the vocabulary. Set &lt;code&gt;unk_token=None&lt;/code&gt; if you want to leave it out. If &lt;code&gt;sequential=True&lt;/code&gt; (the default), it also adds &lt;code&gt;&amp;lt;pad&amp;gt;&lt;/code&gt;. By default, &lt;code&gt;&amp;lt;unk&amp;gt;&lt;/code&gt; and &lt;code&gt;&amp;lt;pad&amp;gt;&lt;/code&gt; are placed at the beginning of the vocabulary list.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;LabelField&lt;/code&gt; is similar to &lt;code&gt;Field&lt;/code&gt;, but it sets &lt;code&gt;sequential=False&lt;/code&gt;, &lt;code&gt;unk_token=None&lt;/code&gt; and &lt;code&gt;is_target=True&lt;/code&gt;.&lt;/p&gt;
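&lt;p&gt;The ordering rules above can be sketched in plain Python. This is only an illustration, not torchtext's implementation: the helper name &lt;code&gt;sketch_itos&lt;/code&gt; and its arguments are hypothetical, and it assumes that the remaining tokens are ordered by descending frequency with ties broken alphabetically.&lt;/p&gt;

```python
from collections import Counter


def sketch_itos(token_lists, unk_token, pad_token, sequential=True):
    """Return an index-to-token list with special tokens placed first.

    Hypothetical helper for illustration; it does not call torchtext.
    """
    # Assumption: the pad token is only kept for sequential fields.
    specials = [unk_token, pad_token if sequential else None]
    specials = [tok for tok in specials if tok is not None]

    # Count every token across all examples.
    counts = Counter(tok for tokens in token_lists for tok in tokens)

    # Assumption: most frequent first, ties broken alphabetically.
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return specials + [tok for tok in ordered if tok not in specials]
```

&lt;p&gt;Called with &lt;code&gt;&amp;lt;unk&amp;gt;&lt;/code&gt; and &lt;code&gt;&amp;lt;pad&amp;gt;&lt;/code&gt; as the special tokens, the two specials take indices 0 and 1. With &lt;code&gt;unk_token=None&lt;/code&gt; and &lt;code&gt;sequential=False&lt;/code&gt;, as in &lt;code&gt;LabelField&lt;/code&gt;, the list contains only the labels.&lt;/p&gt;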
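&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;from&lt;/span&gt; torchtext &lt;span style="color:#f92672"&gt;import&lt;/span&gt; data  &lt;span style="color:#75715e"&gt;# torchtext.legacy in torchtext 0.9 to 0.11&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;INPUT &lt;span style="color:#f92672"&gt;=&lt;/span&gt; data&lt;span style="color:#f92672"&gt;.&lt;/span&gt;Field(lower&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#66d9ef"&gt;True&lt;/span&gt;, batch_first&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#66d9ef"&gt;True&lt;/span&gt;)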
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;TAG &lt;span style="color:#f92672"&gt;=&lt;/span&gt; data&lt;span style="color:#f92672"&gt;.&lt;/span&gt;LabelField()
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;train, val, test &lt;span style="color:#f92672"&gt;=&lt;/span&gt; data&lt;span style="color:#f92672"&gt;.&lt;/span&gt;TabularDataset&lt;span style="color:#f92672"&gt;.&lt;/span&gt;splits(path&lt;span style="color:#f92672"&gt;=&lt;/span&gt;base_dir&lt;span style="color:#f92672"&gt;.&lt;/span&gt;as_posix(), train&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#e6db74"&gt;&amp;#39;train_data.csv&amp;#39;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; validation&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#e6db74"&gt;&amp;#39;val_data.csv&amp;#39;&lt;/span&gt;, test&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#e6db74"&gt;&amp;#39;test_data.csv&amp;#39;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; format&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#e6db74"&gt;&amp;#39;tsv&amp;#39;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; fields&lt;span style="color:#f92672"&gt;=&lt;/span&gt;[(&lt;span style="color:#66d9ef"&gt;None&lt;/span&gt;, &lt;span style="color:#66d9ef"&gt;None&lt;/span&gt;), (&lt;span style="color:#e6db74"&gt;&amp;#39;input&amp;#39;&lt;/span&gt;, INPUT), (&lt;span style="color:#e6db74"&gt;&amp;#39;tag&amp;#39;&lt;/span&gt;, TAG)])
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id="load-single-file"&gt;Load single file&lt;/h2&gt;
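&lt;p&gt;A note on the &lt;code&gt;split&lt;/code&gt; call in this section: according to the torchtext documentation, a list passed as &lt;code&gt;split_ratio&lt;/code&gt; gives the relative sizes of train, test and valid in that order, while the returned datasets come back as train, validation, test. The sketch below imitates that ordering in plain Python so the effect is easy to see. &lt;code&gt;sketch_split&lt;/code&gt; is a hypothetical helper, not torchtext code, and its rounding is an assumption.&lt;/p&gt;

```python
import random


def sketch_split(examples, ratios, seed=0):
    """Split examples using ratios given as [train, test, valid].

    Hypothetical helper; it returns (train, valid, test) to mirror the
    documented return order of Dataset.split.
    """
    # Normalize so the ratios do not have to sum to exactly 1.
    total = sum(ratios)
    train_ratio, test_ratio, _valid_ratio = (r / total for r in ratios)

    # Shuffle a copy with a fixed seed so the result is reproducible.
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)

    train_len = int(round(train_ratio * len(shuffled)))
    test_len = int(round(test_ratio * len(shuffled)))

    train = shuffled[:train_len]
    test = shuffled[train_len:train_len + test_len]
    valid = shuffled[train_len + test_len:]  # whatever is left over
    return train, valid, test


train, val, test = sketch_split(range(10), [0.7, 0.2, 0.1])
print(len(train), len(val), len(test))
```

&lt;p&gt;With ratios &lt;code&gt;[0.7, 0.2, 0.1]&lt;/code&gt;, the variable named &lt;code&gt;val&lt;/code&gt; ends up with the 0.1 share and &lt;code&gt;test&lt;/code&gt; with the 0.2 share, so swap the last two ratios if you want the larger share for validation.&lt;/p&gt;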
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;all_data &lt;span style="color:#f92672"&gt;=&lt;/span&gt; data&lt;span style="color:#f92672"&gt;.&lt;/span&gt;TabularDataset(path&lt;span style="color:#f92672"&gt;=&lt;/span&gt;base_dir &lt;span style="color:#f92672"&gt;/&lt;/span&gt; &lt;span style="color:#e6db74"&gt;&amp;#39;gossip_train_data.csv&amp;#39;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; format&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#e6db74"&gt;&amp;#39;tsv&amp;#39;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; fields&lt;span style="color:#f92672"&gt;=&lt;/span&gt;[(&lt;span style="color:#e6db74"&gt;&amp;#39;text&amp;#39;&lt;/span&gt;, TEXT), (&lt;span style="color:#e6db74"&gt;&amp;#39;category&amp;#39;&lt;/span&gt;, CATEGORY)])
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;train, val, test &lt;span style="color:#f92672"&gt;=&lt;/span&gt; all_data&lt;span style="color:#f92672"&gt;.&lt;/span&gt;split([&lt;span style="color:#ae81ff"&gt;0.7&lt;/span&gt;, &lt;span style="color:#ae81ff"&gt;0.2&lt;/span&gt;, &lt;span style="color:#ae81ff"&gt;0.1&lt;/span&gt;])
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id="create-iterator"&gt;Create iterator&lt;/h2&gt;
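&lt;p&gt;&lt;code&gt;sort_key&lt;/code&gt; exists so that the iterator can batch examples of similar length together and keep padding small. The following plain-Python sketch shows the effect on a toy dataset. &lt;code&gt;padding_needed&lt;/code&gt; is a hypothetical helper, and it sorts the whole list at once, which is a simplification of what &lt;code&gt;BucketIterator&lt;/code&gt; does.&lt;/p&gt;

```python
def padding_needed(sequences, batch_size, sort_key=None):
    """Count the pad positions needed to batch the sequences.

    Hypothetical helper for illustration; it does not call torchtext.
    """
    if sort_key is not None:
        sequences = sorted(sequences, key=sort_key)

    total = 0
    for start in range(0, len(sequences), batch_size):
        batch = sequences[start:start + batch_size]
        longest = max(len(seq) for seq in batch)
        # Every shorter sequence is padded up to the longest one.
        total += sum(longest - len(seq) for seq in batch)
    return total


# Toy data: six token lists with lengths 1, 9, 2, 8, 3 and 7.
sequences = [["w"] * n for n in (1, 9, 2, 8, 3, 7)]
unsorted_pad = padding_needed(sequences, batch_size=2)
sorted_pad = padding_needed(sequences, batch_size=2, sort_key=len)
print(unsorted_pad, sorted_pad)
```

&lt;p&gt;Sorting by length before batching cuts the padding in this toy case from 18 positions to 6, which is why a length-based &lt;code&gt;sort_key&lt;/code&gt; is the usual choice.&lt;/p&gt;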
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;train_iter, val_iter, test_iter &lt;span style="color:#f92672"&gt;=&lt;/span&gt; data&lt;span style="color:#f92672"&gt;.&lt;/span&gt;BucketIterator&lt;span style="color:#f92672"&gt;.&lt;/span&gt;splits(
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; (train, val, test), batch_sizes&lt;span style="color:#f92672"&gt;=&lt;/span&gt;(&lt;span style="color:#ae81ff"&gt;32&lt;/span&gt;, &lt;span style="color:#ae81ff"&gt;256&lt;/span&gt;, &lt;span style="color:#ae81ff"&gt;256&lt;/span&gt;), shuffle&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#66d9ef"&gt;True&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; sort_key&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#66d9ef"&gt;lambda&lt;/span&gt; x: len(x&lt;span style="color:#f92672"&gt;.&lt;/span&gt;input))
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id="load-pretrained-vector"&gt;Load pretrained vector&lt;/h2&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;from&lt;/span&gt; torchtext.vocab &lt;span style="color:#f92672"&gt;import&lt;/span&gt; Vectors
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;vectors &lt;span style="color:#f92672"&gt;=&lt;/span&gt; Vectors(name&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#e6db74"&gt;&amp;#39;cc.zh.300.vec&amp;#39;&lt;/span&gt;, cache&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#e6db74"&gt;&amp;#39;./&amp;#39;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;INPUT&lt;span style="color:#f92672"&gt;.&lt;/span&gt;build_vocab(train, vectors&lt;span style="color:#f92672"&gt;=&lt;/span&gt;vectors)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;TAG&lt;span style="color:#f92672"&gt;.&lt;/span&gt;build_vocab(train, val, test)
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id="check-vocab-sizes"&gt;Check vocab sizes&lt;/h2&gt;
&lt;p&gt;You can inspect the vocabulary's index-to-token list with &lt;code&gt;vocab.itos&lt;/code&gt;; its length is the vocabulary size.&lt;/p&gt;</description></item></channel></rss>