<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>SimHash on KK's Blog (fromkk)</title><link>https://fromkk.com/tags/simhash/</link><description>Recent content in SimHash on KK's Blog (fromkk)</description><generator>Hugo</generator><language>en</language><managingEditor>bebound@gmail.com (KK)</managingEditor><webMaster>bebound@gmail.com (KK)</webMaster><lastBuildDate>Fri, 17 Apr 2026 17:11:11 +0800</lastBuildDate><atom:link href="https://fromkk.com/tags/simhash/index.xml" rel="self" type="application/rss+xml"/><item><title>Near-duplicate with SimHash</title><link>https://fromkk.com/posts/near-duplicate-with-simhash/</link><pubDate>Wed, 04 Dec 2019 00:16:00 +0800</pubDate><author>bebound@gmail.com (KK)</author><guid>https://fromkk.com/posts/near-duplicate-with-simhash/</guid><description>&lt;p&gt;Before talking about &lt;strong&gt;SimHash&lt;/strong&gt;, let&amp;rsquo;s review some other methods which can also identify duplication.&lt;/p&gt;
&lt;h2 id="longest-common-subsequence--lcs"&gt;Longest Common Subsequence(LCS)&lt;/h2&gt;
&lt;p&gt;This is the algorithm used by &lt;code&gt;diff&lt;/code&gt; command. It is also &lt;strong&gt;edit distance&lt;/strong&gt; with insertion and deletion as the only two edit operations.&lt;/p&gt;
&lt;p&gt;This works good for short strings. However, the algorithm&amp;rsquo;s time complexity is \(O(m*n)\), if two strings&amp;rsquo; lengths are \(m\) and \(n\) respectively. So it&amp;rsquo;s not suitable for large corpus. Also, if two corpus consists of same paragraph but the order is not same. LCS treat them as different corpus, and that&amp;rsquo;s not we expected.&lt;/p&gt;</description></item></channel></rss>