Using Elasticsearch as a Vector Database: Dive into “dense_vector” and “script_score”

Chen Jun Ming
7 min read · Oct 13, 2023


Elasticsearch is an incredibly powerful and flexible search and analytics engine. While its primary use case revolves around full-text search, it is versatile enough to be employed for various other functions. One such function that has caught the attention of many developers and data scientists is using Elasticsearch as a vector database. With the advent of the dense_vector datatype and the ability to leverage the script_score function, Elasticsearch's capabilities have expanded to facilitate vector similarity searches.

The Importance of Vector Search for Semantic Search

Vector search has revolutionized the way we understand and conduct search operations, specifically when it comes to semantic search. But before delving into its significance, it’s essential to grasp the difference between syntactical and semantic searches.

Syntactical vs. Semantic Search

Imagine conducting a search query for “apple alcoholic beverage.” In a syntactical search, the engine would look for documents containing that exact phrase. If a document doesn’t have the words “apple”, “alcoholic”, and “beverage” in close proximity or in that specific order, it may not be ranked high or even shown in the results. This method is limited because it’s tied strictly to the syntax of the query and can miss out on contextually relevant documents.

Enter semantic search, powered by vector search. Here, instead of looking at the exact phrase, the search engine tries to understand the meaning or the intent behind the query. In the realm of semantic search, querying for “apple alcoholic beverage” wouldn’t just give you documents containing that exact phrase. It would understand the essence of your query and fetch documents related to “appletini”, “apple brandy”, “apple bourbon”, and more.

Why is Vector Search Crucial for Semantic Search?

Vector search plays an instrumental role in achieving this semantic understanding. Words, phrases, or even entire sentences can be represented as vectors in a high-dimensional space using various embedding techniques (like Word2Vec, BERT, or FastText). In this vector space, the “distance” between vectors indicates semantic similarity. Words or phrases with similar meanings will have vectors closer to each other.

When you search for “apple alcoholic beverage”, its vector representation might be close to vectors of “appletini”, “apple brandy”, or “apple bourbon”. A vector search then fetches these semantically similar terms, thus achieving a semantic understanding of the user’s intent.
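To make this concrete, here is a minimal sketch of scoring candidate phrases against a query by cosine similarity. It assumes the sentence-transformers package and the small all-MiniLM-L6-v2 model; both are illustrative choices, and any embedding model works the same way.

# Embed a query and candidate phrases, then rank candidates by cosine
# similarity. The library and model name are assumptions for this sketch.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional embeddings

query = "apple alcoholic beverage"
candidates = ["appletini", "apple brandy", "apple pie recipe"]

vectors = model.encode([query] + candidates)
query_vec, doc_vecs = vectors[0], vectors[1:]

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of the
    # vector lengths; 1.0 means identical direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

for text, vec in zip(candidates, doc_vecs):
    print(f"{text}: {cosine(query_vec, vec):.3f}")

On a reasonable model, the drink-related candidates score noticeably higher than “apple pie recipe”, even though none of them repeats the exact query phrase.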

Vector Spaces in the Context of Embedding Models

At a high level, a vector space is a mathematical construct where vectors exist, and operations like addition and scalar multiplication can be performed. In the context of embedding models and natural language processing, vector spaces are used to map words, sentences, or even entire documents into numerical vectors.

  1. Dimensionality: Each dimension in this space can be thought of as a feature or characteristic of the data. For words or sentences, these dimensions could capture syntactic roles, semantic meanings, context, or various abstract linguistic properties. The more dimensions, the more nuanced and detailed the representation, but it also demands more computational resources.
  2. Distance & Similarity: The primary reason for transforming words or sentences into vectors is to measure similarity. In these vector spaces, the "distance" (often using metrics like cosine similarity or Euclidean distance) between any two vectors can indicate how similar those two items are. The closer the vectors, the more similar they are. For instance, in a well-trained embedding model, the vector for "king" minus the vector for "man" plus the vector for "woman" might be close to the vector for "queen", capturing relational semantics (see the sketch after this list).
  3. Training & Context: Embedding models, like Word2Vec or BERT, generate these vectors by training on vast amounts of text data. During this training, the models learn to represent words or sentences in a way that contextual similarities (how words are used in relation to other words) are captured in the vector space. This is why synonyms or thematically related words end up having vectors close to each other in the space.
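The relational-semantics example in point 2 can be reproduced directly. The sketch below assumes gensim and its downloadable pretrained GloVe vectors (glove-wiki-gigaword-100); both are illustrative choices rather than anything the embedding idea depends on.

# Reproduce the king - man + woman ≈ queen analogy with pretrained
# word vectors. gensim's downloader fetches the model on first use.
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")

# most_similar performs the vector arithmetic king - man + woman and
# returns the words nearest to the resulting point in the space.
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# In a well-trained space, "queen" should appear at or near the top.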

The Mechanics of Vector Search

Once you have a set of vectors (be it words, sentences, or documents), conducting a vector search involves the following steps (a brute-force sketch follows below):

  1. Query Transformation: Convert the search query into its vector representation using the same embedding model.
  2. Distance Computation: For each item in the database (or a subset, depending on optimizations), compute the distance (or similarity score) between the query vector and the item's vector.
  3. Ranking: Rank items based on their distance or similarity from the query vector. Items with vectors closest to the query vector are deemed most relevant and are returned as top results.

[Figure: Vector search representation in a 3-D space for the word “kitten”.]
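Stripped of any search engine, the three steps above amount to a few lines of linear algebra. Here is a brute-force sketch with numpy; the toy 4-dimensional vectors stand in for real embeddings, which typically have hundreds of dimensions.

# Brute-force vector search: score every document vector against the
# query vector and return the top-k by cosine similarity.
import numpy as np

def search(query_vec, doc_vecs, top_k=3):
    # Step 1 (query transformation) happened upstream: query_vec is
    # already the embedding of the user's query.
    # Step 2: cosine similarity between the query and every document.
    norms = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    scores = doc_vecs @ query_vec / norms
    # Step 3: rank documents by score, highest first.
    top = np.argsort(scores)[::-1][:top_k]
    return [(int(i), float(scores[i])) for i in top]

doc_vecs = np.array([[0.1, 0.9, 0.0, 0.2],
                     [0.8, 0.1, 0.3, 0.0],
                     [0.2, 0.8, 0.1, 0.1]])
query_vec = np.array([0.15, 0.85, 0.05, 0.15])
print(search(query_vec, doc_vecs))  # indices and scores of the top matches

This linear scan is essentially what Elasticsearch's script_score does, which is worth keeping in mind for the performance discussion later.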

The Rise of the dense_vector Datatype

Elasticsearch’s dense_vector datatype is designed to store vectors of float values. These vectors are often employed in machine learning, especially for embeddings where items are represented as vectors in high-dimensional space. For example, word embeddings from models like Word2Vec or sentence embeddings from models like BERT can be stored using the dense_vector datatype.

To store a vector, you can define a mapping like:

{
  "properties": {
    "text-vector": {
      "type": "dense_vector",
      "dims": 512
    }
  }
}

Here, dims denotes the number of dimensions in the vector.
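To see the mapping in action, the sketch below creates an index with this mapping and stores a single document through the plain REST API. The index name articles, the localhost address, and the placeholder vector are all assumptions for illustration; real vectors come from your embedding model.

# Create an index with a dense_vector field and store one document.
import requests

ES = "http://localhost:9200"

requests.put(f"{ES}/articles", json={
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "text-vector": {"type": "dense_vector", "dims": 512}
        }
    }
}).raise_for_status()

# The vector must have exactly the number of dimensions declared in dims.
doc = {"text": "appletini", "text-vector": [0.0] * 512}  # placeholder vector
requests.post(f"{ES}/articles/_doc", json=doc).raise_for_status()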

Harnessing the Power of script_score for Vector Similarity

To perform vector similarity searches, we need to measure how close a given vector is to other vectors in the database. A common method for this is to compute the dot product between vectors. The script_score function in Elasticsearch allows us to compute custom scores for documents based on a script. By employing this functionality, we can compute the dot product between our query vector and the vectors stored in our database.

The dotProduct function can be utilized within the script_score as follows:

{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "dotProduct(params.queryVector, 'text-vector') + 1.0",
        "params": {
          "queryVector": [...]
        }
      }
    }
  }
}

Here, params.queryVector is the vector you're searching with, and 'text-vector' refers to the field in which the vectors are stored.
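Issuing the query from code looks like the sketch below, again via the REST API; the placeholder vector and index name are assumptions carried over from the earlier sketch. (Painless also exposes cosineSimilarity, l1norm, and l2norm if you prefer a different metric.)

# Run the script_score query and print the top hits.
import requests

query_vector = [0.0] * 512  # in practice: the embedding of the query text

resp = requests.post("http://localhost:9200/articles/_search", json={
    "query": {
        "script_score": {
            "query": {"match_all": {}},
            "script": {
                "source": "dotProduct(params.queryVector, 'text-vector') + 1.0",
                "params": {"queryVector": query_vector}
            }
        }
    }
})
for hit in resp.json()["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["text"])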

Why the “+ 1.0”?

An astute observer might wonder about the addition of + 1.0 outside of the dotProduct function. This addition is necessary because Elasticsearch requires document scores to be non-negative: a script_score that returns a negative value raises an error. Since the dot product of two vectors can be negative, the query shifts it upward; with normalized embeddings the dot product lies in [-1, 1], so adding 1.0 maps scores into [0, 2].

It's worth remembering that this shift leaves the ranking untouched (every score moves by the same constant), but it does change the absolute values. If precise similarity values are required, say for thresholding or downstream comparison, subtract the 1.0 from each returned score to recover the original dot products.
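Continuing the earlier sketch, that post-processing is a one-liner per hit:

# Recover raw dot products from the shifted scores (resp is the search
# response from the previous sketch).
hits = resp.json()["hits"]["hits"]
for hit in hits:
    raw_dot_product = hit["_score"] - 1.0  # undo the + 1.0 shift
    print(f"{hit['_source']['text']}: {raw_dot_product:.3f}")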

Advantages of Elasticsearch Over Other Vector Search Libraries

One of the undeniable advantages of using Elasticsearch as a vector database is its built-in capability to filter your query on specific subsets of data. This feature is immensely useful when you want to narrow down your search space or when your application demands context-aware vector searches.

In contrast, while other specialized vector search libraries, like ChromaDB and Faiss, offer impeccable speed and efficiency for pure vector searches, they lack the full-featured query capabilities present in Elasticsearch. For instance, ChromaDB does allow querying on metadata, but it’s constrained to exact matches on strings. This limitation can sometimes hinder the flexibility and granularity required in complex search scenarios.

Incorporating Elasticsearch’s rich querying environment with vector similarity searches means users get the best of both worlds: precise vector-based results, augmented by the capability to layer those searches with nuanced, context-aware filters. This amalgamation makes Elasticsearch a compelling choice for developers needing both depth and breadth in their search capabilities.
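As a sketch of what this layering looks like, replacing the match_all from the earlier query with a bool filter restricts scoring to a metadata subset. The category field here is a hypothetical example of such metadata.

# script_score over a filtered subset: only documents whose hypothetical
# "category" field matches are scored against the query vector.
import requests

query_vector = [0.0] * 512  # placeholder; use a real embedding in practice

resp = requests.post("http://localhost:9200/articles/_search", json={
    "query": {
        "script_score": {
            "query": {
                "bool": {
                    "filter": [{"term": {"category": "cocktails"}}]
                }
            },
            "script": {
                "source": "dotProduct(params.queryVector, 'text-vector') + 1.0",
                "params": {"queryVector": query_vector}
            }
        }
    }
})

Because the filter runs before scoring, the relatively expensive script is only evaluated on the filtered subset, which also helps performance, a point revisited below.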

Disadvantages of Elasticsearch vs. Specialized Vector Search Libraries

Elasticsearch, with its expansive toolset and adaptability, has carved a niche for itself in the world of search engines. However, when pitted against specialized vector search libraries like Faiss and ChromaDB, there are areas where it reveals its limitations.

Approximate Nearest Neighbors & Hierarchical Navigable Small Worlds: At the forefront of these limitations, specifically in Elasticsearch 7.x (the version from which OpenSearch was forked), is the lack of built-in support for approximate nearest neighbor (ANN) search, including graph-based techniques such as Hierarchical Navigable Small World (HNSW). The script_score approach shown earlier is an exact, brute-force search: it scores every document matched by the inner query, so query latency grows linearly with the size of the index. ANN methods accelerate searches over expansive datasets considerably, with only a tiny trade-off in accuracy.

Faiss and ChromaDB, having been crafted with these strategies at their core, demonstrate an innate ability to navigate vast vector spaces swiftly. Their prowess in this domain lends them an edge, particularly in use cases that revolve around extensive datasets and demand rapid results.

Mitigating with Filters: However, all is not bleak for Elasticsearch aficionados. The system’s inherent flexibility provides a countermeasure. By adeptly applying filters to queries, one can curtail the search space in Elasticsearch. This reduction can, to a degree, balance out the efficiency gap introduced by the lack of ANN and HNSW support. Such filters allow Elasticsearch to home in on relevant data subsets, making the search process more manageable and faster.

Elasticsearch Version 8’s Advancements: It’s worth noting that while version 7 exhibits these limitations, Elasticsearch 8.x adds native approximate kNN search backed by HNSW graphs (built on Apache Lucene), substantially advancing its capabilities in the realm of vector search.
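For illustration, an approximate kNN query in 8.x looks roughly like the sketch below. The exact request shape has varied across 8.x minor releases, and the dense_vector field must be mapped with "index": true and a "similarity" setting for this to work, so treat this as a sketch rather than a version-exact recipe.

# Approximate kNN search in Elasticsearch 8.x (HNSW-backed).
import requests

resp = requests.post("http://localhost:9200/articles/_search", json={
    "knn": {
        "field": "text-vector",
        "query_vector": [0.0] * 512,  # placeholder query embedding
        "k": 10,                      # neighbors to return
        "num_candidates": 100         # candidates per shard; higher = more accurate, slower
    }
})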

While Elasticsearch remains a powerhouse in diverse search scenarios, its efficiency in large-scale vector searches is version-dependent. The tailored libraries like Faiss and ChromaDB might be the frontrunners for specific needs, but with Elasticsearch 8, the gap narrows significantly.

Final Thoughts

Elasticsearch’s journey into the domain of vector search highlights its versatility and adaptability. While initially not designed as a vector database, its capabilities have expanded with innovations such as the dense_vector datatype and script_score function. These advancements have positioned Elasticsearch as a viable tool for vector similarity searches, bridging the gap between traditional full-text searches and the nuanced realm of semantic understanding enabled by vector representations.

As with any tool, Elasticsearch exhibits strengths and weaknesses, especially when compared to specialized vector search libraries. Its flexibility and broad querying capabilities render it invaluable in multifaceted search scenarios, but one must be aware of its limitations, particularly in older versions. The recent improvements in version 8, however, signal a promising direction for its evolution in the vector search space.

For developers and data scientists navigating this domain, the choice boils down to the specific needs of their projects and the trade-offs they're willing to make. Whether you choose Elasticsearch, Faiss, ChromaDB, or another tool entirely, it's clear that the world of search is richer and more dynamic than ever before.
