Introduction to Document Similarity with Elasticsearch

If you're new to the concept of document similarity, here's a quick overview.

In a text analytics context, document similarity relies on reimagining texts as points in space that can be close (similar) or far apart (different). However, it is not always straightforward to decide which document features should be encoded into a similarity measure (words/phrases? document length/structure?). Moreover, in practice it can be hard to find a quick, efficient way of retrieving similar documents given some input document. In this post I'll explore some of the similarity tools implemented in Elasticsearch, which can let us improve search speed without sacrificing too much in the way of nuance.

Document Distance and Similarity

In this post I'll focus mostly on getting started with Elasticsearch and comparing the built-in similarity measures currently implemented in ES.

Essentially, to represent the distance between documents, we need two things: first, a way of encoding text as vectors, and second, a way of measuring the distance between those vectors.

  1. The bag-of-words (BOW) model lets us represent document similarity with respect to vocabulary, and it is easy to implement. Some common options for BOW encoding include one-hot encoding, frequency encoding, TF-IDF, and distributed representations.
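To make the two ingredients concrete, here is a minimal sketch in plain Python of three of the BOW encodings mentioned above (one-hot, frequency, and TF-IDF), paired with cosine similarity as the comparison measure. The function names, the whitespace tokenizer, the toy documents, and the smoothed IDF formula are all assumptions chosen for illustration. This is not how Elasticsearch analyzes or scores text internally.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase + whitespace tokenizer (an assumption for this sketch;
    # a real analyzer would also handle punctuation, stemming, stop words).
    return text.lower().split()


def build_vocabulary(docs):
    # Sorted list of every distinct token in the corpus; its order fixes
    # the meaning of each vector dimension.
    return sorted({tok for doc in docs for tok in tokenize(doc)})


def one_hot_vector(doc, vocab):
    # 1 if the term occurs in the document at all, otherwise 0.
    present = set(tokenize(doc))
    return [1 if term in present else 0 for term in vocab]


def frequency_vector(doc, vocab):
    # Raw count of each vocabulary term in the document.
    counts = Counter(tokenize(doc))
    return [counts[term] for term in vocab]


def tfidf_vectors(docs, vocab):
    # Term frequency weighted by a smoothed inverse document frequency.
    # This is one common variant; other IDF formulas exist.
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(tokenize(doc)))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    return [
        [tf * idf[t] for tf, t in zip(frequency_vector(doc, vocab), vocab)]
        for doc in docs
    ]


def cosine_similarity(u, v):
    # Cosine of the angle between two vectors; 0.0 if either is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical toy corpus, used only for illustration.
docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock prices fell sharply",
]
vocab = build_vocabulary(docs)
freq = [frequency_vector(d, vocab) for d in docs]
tfidf = tfidf_vectors(docs, vocab)

print(cosine_similarity(freq[0], freq[1]))    # two documents sharing vocabulary
print(cosine_similarity(freq[0], freq[2]))    # two documents sharing no terms
print(cosine_similarity(tfidf[0], tfidf[1]))  # same pair under TF-IDF weighting
```

Under frequency encoding, the two cat sentences score high largely because both repeat "the". TF-IDF down-weights terms that appear in many documents, so the same pair scores somewhat lower while the unrelated pair stays at zero.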