Near-duplicate with SimHash

bebound@gmail.com (KK) — Wed, 04 Dec 2019 00:16:00 +0800

Before talking about SimHash, let’s review some other methods that can also identify duplicates.

Longest Common Subsequence (LCS)

This is the algorithm behind the diff command. It is also equivalent to edit distance when insertion and deletion are the only two edit operations allowed.

This works well for short strings. However, the algorithm’s time complexity is \(O(m*n)\) for two strings of lengths \(m\) and \(n\), so it is not suitable for a large corpus. Also, if two documents consist of the same paragraphs in a different order, LCS treats them as different documents, which is not what we expect.
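To make both points concrete, here is a minimal sketch of the textbook dynamic-programming LCS. It is not the optimized variant that diff actually uses, and the function names `lcs_length` and `indel_distance` are my own for illustration. The nested loop is where the \(O(m*n)\) cost comes from.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b.

    Textbook dynamic programming: O(len(a) * len(b)) time,
    O(len(b)) extra space because only the previous row is kept.
    """
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the common subsequence ending just before both chars.
                curr.append(prev[j - 1] + 1)
            else:
                # Skip one character from either a or b.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_distance(a: str, b: str) -> int:
    """Edit distance when only insertion and deletion are allowed.

    Every character outside the LCS must be deleted from a or inserted from b.
    """
    return len(a) + len(b) - 2 * lcs_length(a, b)


print(lcs_length("ABCBDAB", "BDCABA"))      # 4
print(indel_distance("kitten", "sitting"))  # 5

# Two toy "documents" with the same two "paragraphs" in swapped order:
# only one of the two paragraphs can be matched.
print(lcs_length("aaaa" + "bbbb", "bbbb" + "aaaa"))  # 4 out of 8
```

The last call shows the reordering problem: the two inputs contain exactly the same content, yet LCS can match only half of it.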
