I had never used Lucene before, so I searched for and read some of its
documentation. At first, I thought I could put all the short messages into a
Lucene index, search by each message's content, set a hit-score threshold,
collect all the hit docs, and remove every doc whose score is above the
threshold. But when it came to implementation, it wasn't easy to extract the
important words from a message's content to use as keywords for that
"search by each message content" step.
Then I came across MoreLikeThis. It's exactly what I was looking for, and it
worked well once I tried it.
@Test
public void testRemoveDuplicates() {
    String[] contents = {
            "I love Emacs, since Emacs is awesome",
            "Emacs is awesome, I love it",
            "Oh my way home. Long day ahead",
            "It's a long day, I have to admit it",
            "Good artists copy, great artists steal",
            "Something interesting, Good artists copy, great artists steal",
            "Great artists steal, Good artists copy",
            "I really think Emacs is awesome, and love it"
    };
    List<String> contentList = Arrays.asList(contents);

    Deduplicator deDuplicator = new Deduplicator(contentList, 0.5f);
    List<String> results = deDuplicator.dedup();
}
private IndexWriter getIndexWriter() throws IOException {
    IndexWriterConfig indexWriterConfig = new IndexWriterConfig(
            Version.LUCENE_35, new StandardAnalyzer(Version.LUCENE_35));
    return new IndexWriter(idxDir, indexWriterConfig);
}
private Document addContentToDoc(String content) {
    Document doc = new Document();
    // We could also store the message id in the index,
    // with Field.Index.NO, for later reference
    doc.add(new Field("contents", content,
            Field.Store.YES, Field.Index.ANALYZED));
    return doc;
}
public List<String> dedup() {
    List<String> results = new ArrayList<String>();
    IndexReader indexReader = null;
    try {
        // Open the index; the false flag makes it writable,
        // since we need to delete similar docs
        indexReader = IndexReader.open(idxDir, false);
        int maxDocNum = indexReader.maxDoc();

        // Iterate over all docs, find similar ones and remove them
        for (int docNum = 0; docNum < maxDocNum; docNum++) {
            // If the doc is already deleted, skip it
            if (indexReader.isDeleted(docNum))
                continue;

            // Save the doc to the result list
            Document doc = indexReader.document(docNum);
            String content = getContentFromDoc(doc);
            results.add(content);

            // Initialize MoreLikeThis with the indexReader
            MoreLikeThis moreLikeThis = new MoreLikeThis(indexReader);
            // Lower the minimum term and doc frequencies,
            // since message content is short
            moreLikeThis.setMinTermFreq(1);
            moreLikeThis.setMinDocFreq(1);

            // Put the message content into a StringReader
            Reader reader = new StringReader(content);
            // and pass it to the like() method; the field name "contents"
            // must be set. MoreLikeThis returns a query for searching,
            // built from the important words I mentioned before.
            Query query = moreLikeThis.like(reader, "contents");

            // Then use the normal search functionality, with scoreThreshold,
            // to find all the similar docs and remove them
            IndexSearcher searcher = new IndexSearcher(indexReader);
            TopDocs topDocs = searcher.search(query, 100);
            for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
                if (scoreDoc.score > scoreThreshold) {
                    // Delete all similar docs
                    indexReader.deleteDocument(scoreDoc.doc);
                }
            }
            closeSearcher(searcher);
        }
        closeIndexReader(indexReader);
    } catch (CorruptIndexException e) {
        e.printStackTrace();
        closeIndexReader(indexReader);
    } catch (IOException e) {
        e.printStackTrace();
        closeIndexReader(indexReader);
    }
    return results;
}
Simple and easy, isn't it?
The accuracy is better when there are more docs in the index, because of
tf-idf, so deleting docs from the index while removing duplicates is not a
good approach: each deletion shrinks the corpus the scores are computed
against. It would be better to find another solution; I delete them here only
to keep the example short and easy to understand.
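Since the post leaves that better solution open, here is a minimal sketch of
one option: instead of deleting docs from the index, mark duplicates in a
removed-flags array and simply skip them, so the index (and its document
frequencies) stays untouched. This is plain Java with no Lucene; the class
name DedupSketch is made up, and the Jaccard token overlap in similarity() is
only a stand-in for Lucene's tf-idf score, not what MoreLikeThis computes.

```java
import java.util.*;

public class DedupSketch {

    // Hypothetical similarity: Jaccard overlap of lowercase token sets,
    // standing in for the Lucene hit score used in the post
    static double similarity(String a, String b) {
        Set<String> ta = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\W+")));
        Set<String> tb = new HashSet<>(Arrays.asList(b.toLowerCase().split("\\W+")));
        Set<String> inter = new HashSet<>(ta);
        inter.retainAll(tb);
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    // Mark near-duplicates in a flags array instead of deleting them,
    // so every comparison sees the full, unmodified collection
    static List<String> dedup(List<String> docs, double threshold) {
        boolean[] removed = new boolean[docs.size()];
        List<String> results = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            if (removed[i]) continue;
            results.add(docs.get(i));
            for (int j = i + 1; j < docs.size(); j++) {
                if (!removed[j] && similarity(docs.get(i), docs.get(j)) > threshold) {
                    removed[j] = true;
                }
            }
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "Good artists copy, great artists steal",
                "Great artists steal, Good artists copy",
                "Oh my way home. Long day ahead");
        // Prints the first message and the unrelated one; the reordered
        // near-duplicate is skipped
        System.out.println(dedup(docs, 0.5));
    }
}
```

The same flag-and-skip idea carries over to the Lucene version: collect the
similar doc ids into a set during the loop and filter them at the end, rather
than calling deleteDocument() mid-iteration.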