I had never used Lucene before, so I searched for and read some of its
documentation. At first, I thought I could put all the short messages into a
Lucene index, search by each message's content, set a hit-score threshold,
collect all the hit docs, and remove every doc whose score is above the
threshold. But when it came to implementation, it wasn't easy to extract the
important words from a message's content to use as keywords for that
"search by each message content" step.
Then I came across MoreLikeThis. It's exactly what I was looking for, and it
worked well once I tried it.
@Test
public void testRemoveDuplicates() {
    String[] contents = {
            "I love Emacs, since Emacs is awesome",
            "Emacs is awesome, I love it",
            "Oh my way home. Long day ahead",
            "It's a long day, I have to admit it",
            "Good artists copy, great artists steal",
            "Something interesting, Good artists copy, great artists steal",
            "Great artists steal, Good artists copy",
            "I really think Emacs is awesome, and love it"
    };
    List<String> contentList = Arrays.asList(contents);

    Deduplicator deDuplicator = new Deduplicator(contentList, 0.5f);
    List<String> results = deDuplicator.dedup();
}
private IndexWriter getIndexWriter() throws IOException {
    IndexWriterConfig indexWriterConfig = new IndexWriterConfig(
            Version.LUCENE_35, new StandardAnalyzer(Version.LUCENE_35));
    return new IndexWriter(idxDir, indexWriterConfig);
}
private Document addContentToDoc(String content) {
    Document doc = new Document();
    // We could also store the message id in the index,
    // with Field.Index.NO, for later reference
    doc.add(new Field("contents", content,
            Field.Store.YES, Field.Index.ANALYZED));
    return doc;
}
public List<String> dedup() {
    List<String> results = new ArrayList<String>();
    IndexReader indexReader = null;
    try {
        // Open the index; the false flag makes it writable,
        // since we need to delete similar docs
        indexReader = IndexReader.open(idxDir, false);
        int maxDocNum = indexReader.maxDoc();

        // Iterate over all docs, find similar ones and remove them
        for (int docNum = 0; docNum < maxDocNum; docNum++) {
            // If the doc is already deleted, skip it
            if (indexReader.isDeleted(docNum))
                continue;

            // Save the doc to the result list
            Document doc = indexReader.document(docNum);
            String content = getContentFromDoc(doc);
            results.add(content);

            // Initialize MoreLikeThis with the indexReader
            MoreLikeThis moreLikeThis = new MoreLikeThis(indexReader);
            // Lower the minimum term and doc frequencies,
            // since message content is short
            moreLikeThis.setMinTermFreq(1);
            moreLikeThis.setMinDocFreq(1);

            // Put the message content into a StringReader
            Reader reader = new StringReader(content);
            // and pass it to the like() method; the field name "contents"
            // must be set. MoreLikeThis returns a query for searching,
            // built from the important words I mentioned before.
            Query query = moreLikeThis.like(reader, "contents");

            // Then use the normal search functionality, with scoreThreshold,
            // to find all the similar docs and remove them
            IndexSearcher searcher = new IndexSearcher(indexReader);
            TopDocs topDocs = searcher.search(query, 100);
            for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
                if (scoreDoc.score > scoreThreshold) {
                    // Delete all similar docs
                    indexReader.deleteDocument(scoreDoc.doc);
                }
            }
            closeSearcher(searcher);
        }
        closeIndexReader(indexReader);
    } catch (CorruptIndexException e) {
        e.printStackTrace();
        closeIndexReader(indexReader);
    } catch (IOException e) {
        e.printStackTrace();
        closeIndexReader(indexReader);
    }
    return results;
}
Simple and easy, isn't it?
The accuracy is better when there are more docs in the index, because of
tf-idf, so deleting docs from the index while removing duplicates is not a
good approach: each deletion shrinks the corpus the scores are computed
against. It would be better to find another solution; I delete them here only
to keep the example short and easy to understand.
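Since the post leaves that better solution open, here is a minimal sketch of
one option: instead of deleting docs from the index, mark duplicates in a
removed-flags array and simply skip them, so the index (and its document
frequencies) stays untouched. This is plain Java with no Lucene; the class
name DedupSketch is made up, and the Jaccard token overlap in similarity() is
only a stand-in for Lucene's tf-idf score, not what MoreLikeThis computes.

```java
import java.util.*;

public class DedupSketch {

    // Hypothetical similarity: Jaccard overlap of lowercase token sets,
    // standing in for the Lucene hit score used in the post
    static double similarity(String a, String b) {
        Set<String> ta = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\W+")));
        Set<String> tb = new HashSet<>(Arrays.asList(b.toLowerCase().split("\\W+")));
        Set<String> inter = new HashSet<>(ta);
        inter.retainAll(tb);
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    // Mark near-duplicates in a flags array instead of deleting them,
    // so every comparison sees the full, unmodified collection
    static List<String> dedup(List<String> docs, double threshold) {
        boolean[] removed = new boolean[docs.size()];
        List<String> results = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            if (removed[i]) continue;
            results.add(docs.get(i));
            for (int j = i + 1; j < docs.size(); j++) {
                if (!removed[j] && similarity(docs.get(i), docs.get(j)) > threshold) {
                    removed[j] = true;
                }
            }
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "Good artists copy, great artists steal",
                "Great artists steal, Good artists copy",
                "Oh my way home. Long day ahead");
        // Prints the first message and the unrelated one; the reordered
        // near-duplicate is skipped
        System.out.println(dedup(docs, 0.5));
    }
}
```

The same flag-and-skip idea carries over to the Lucene version: collect the
similar doc ids into a set during the loop and filter them at the end, rather
than calling deleteDocument() mid-iteration.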