Vector databases store embedding data and search it efficiently at scale. There are quite a few open source vector databases, each with its own benefits. We take a quick look at five of them.
In natural language processing (NLP), an embedding is a representation of text in the form of vectors. The goal of an embedding is to capture the semantic meaning of words or documents in a way that can be understood by a machine learning model.
A vector database (or an embedding database) in NLP is a specialised database designed to efficiently store, retrieve, and perform operations on high-dimensional vector data (such as the embeddings mentioned above). Vector databases are optimised to perform nearest neighbour search operations efficiently, which is a common requirement in NLP applications. They provide a way of organising and searching through large amounts of embedding data, which can be beneficial in various tasks like information retrieval, document similarity, clustering, and others.
As an example, let’s say you’ve embedded a large number of documents using a Doc2Vec model. Now, given a new document, you want to find the most similar documents in your database. To do this, you would:
- First, embed the new document into the same high-dimensional space.
- Next, search the vector database for the vectors closest to the new document’s vector. This is the nearest neighbour search.
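The two steps above can be sketched in a few lines of Python. This is a minimal brute-force illustration, not how a vector database is implemented internally. The three-dimensional vectors, the document IDs, and the helper names cosine_similarity and nearest_neighbours are all made up for the example, and the query vector stands in for the output of an embedding model such as Doc2Vec.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbours(query, stored, k=2):
    """Return the k (doc_id, similarity) pairs most similar to `query`.

    Every stored vector is compared with the query, so the cost grows
    linearly with the number of stored documents.
    """
    scored = (
        (doc_id, cosine_similarity(query, vec)) for doc_id, vec in stored.items()
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Hypothetical embeddings; real ones typically have hundreds of dimensions.
stored_vectors = {
    "doc-a": [0.9, 0.1, 0.0],
    "doc-b": [0.0, 1.0, 0.2],
    "doc-c": [0.8, 0.3, 0.1],
}

# Step 1 (assumed to happen elsewhere): the new document has been embedded.
new_document_vector = [1.0, 0.2, 0.0]

# Step 2: nearest neighbour search over the stored vectors.
results = nearest_neighbours(new_document_vector, stored_vectors, k=2)
print([doc_id for doc_id, _ in results])
```

Here doc-a and doc-c point in almost the same direction as the new document, so they are returned ahead of doc-b. Comparing the query with every stored vector like this is the simplest possible approach, and it gives exact results.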
Because the data is high-dimensional, this search can be computationally intensive. Vector databases therefore rely on specialised indexing and querying algorithms to speed it up, such as k-d trees and ball trees, hashing techniques like locality-sensitive hashing, and graph-based indexes such as HNSW. Well-known implementations of such search include FAISS, developed by Facebook AI, and Annoy, developed by Spotify; strictly speaking, both are similarity search libraries rather than full databases.
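As one illustration of the hashing techniques mentioned above, here is a minimal sketch of random-hyperplane locality-sensitive hashing using only the Python standard library. It is not how FAISS or Annoy work internally. The function names, the number of bits, and the sample vectors are assumptions chosen for the example.

```python
import random


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw `num_bits` random hyperplanes, each given by its normal vector."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def lsh_signature(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return tuple(
        1 if sum(v * h for v, h in zip(vector, plane)) >= 0 else 0
        for plane in hyperplanes
    )


def build_buckets(vectors, hyperplanes):
    """Group vector IDs by signature so a query only has to scan one bucket."""
    buckets = {}
    for vec_id, vec in vectors.items():
        buckets.setdefault(lsh_signature(vec, hyperplanes), []).append(vec_id)
    return buckets


planes = make_hyperplanes(num_bits=8, dim=3)
vectors = {
    "a": [1.0, 2.0, 3.0],
    "a-scaled": [2.0, 4.0, 6.0],        # same direction as "a"
    "a-opposite": [-1.0, -2.0, -3.0],   # opposite direction
}
buckets = build_buckets(vectors, planes)

# The query points in the same direction as "a", so it hashes to a's bucket.
candidates = buckets.get(lsh_signature([0.5, 1.0, 1.5], planes), [])
print(candidates)
```

Vectors pointing in the same direction always share a signature, while the opposite vector lands in a different bucket. Vectors that are merely close usually, but not always, share a bucket, which is why this kind of search is approximate. In practice, the candidates from the matching bucket would then be ranked with an exact distance calculation.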
Open source vector databases
Weaviate is an open source vector database that stores data objects and retrieves them by their semantic properties using vector indexing.
- It supports several media types, including text and images, and offers features such as semantic search, question-answer extraction, classification, and customisable models.
- It provides a GraphQL API for straightforward data access and is optimised for high performance, as demonstrated by open source benchmarks.
- Key characteristics include quick queries, support for multiple media types via modules, combined vector and scalar search, real-time and persistent data access, horizontal scalability, and graph-like object connections.
- Weaviate is recommended when seeking improved search quality, performing similarity search with machine learning (ML) models, efficiently combining vector and scalar search, scaling ML models for production, and performing rapid classification tasks.
- It has applications in semantic search, image search, anomaly detection, recommendation engines, e-commerce search, data classification, and cyberthreat analysis, among others.
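To make the idea of combined vector and scalar search concrete, here is a library-free sketch in Python. It does not use Weaviate's API or its GraphQL syntax. The object layout, the property names category and in_stock, and the function filtered_vector_search are hypothetical, and a production engine would normally integrate the filter with its vector index rather than scan every object.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def filtered_vector_search(objects, query_vector, where, limit=2):
    """Keep objects whose scalar properties match `where`, then rank by distance."""
    matches = [
        obj
        for obj in objects
        if all(obj["properties"].get(key) == value for key, value in where.items())
    ]
    matches.sort(key=lambda obj: l2_distance(obj["vector"], query_vector))
    return matches[:limit]


# Made-up catalogue entries with two scalar properties and a 2-D vector each.
objects = [
    {"id": 1, "properties": {"category": "shoes", "in_stock": True}, "vector": [0.1, 0.9]},
    {"id": 2, "properties": {"category": "shoes", "in_stock": False}, "vector": [0.2, 0.8]},
    {"id": 3, "properties": {"category": "hats", "in_stock": True}, "vector": [0.2, 0.85]},
    {"id": 4, "properties": {"category": "shoes", "in_stock": True}, "vector": [0.9, 0.1]},
]

hits = filtered_vector_search(
    objects,
    query_vector=[0.2, 0.8],
    where={"category": "shoes", "in_stock": True},
)
print([obj["id"] for obj in hits])
```

Object 2 has the closest vector but is excluded because it is out of stock, and object 3 is excluded because it is in the wrong category. The scalar conditions decide which objects are eligible, and the vector distance decides their order.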
pgvector is an open source PostgreSQL extension that adds vector similarity search to the database. It allows efficient storage and querying of high-dimensional vectors, making it appropriate for applications such as recommendation systems, natural language processing, and image analysis.
- pgvector offers functions and operators for vector similarity searches, such as locating nearest neighbours based on vector distances or performing similarity joins between vectors. By default the search is exact; it can be accelerated with approximate indexes, namely IVFFlat and, in more recent releases, HNSW.
- With pgvector, vector data can be directly stored in PostgreSQL tables, making it easy to integrate vector similarity search into existing database workflows. It also provides indexing and querying support for multiple vector fields within a table.
- The extension is implemented in C. Its core vector type stores single-precision floating-point elements, and more recent releases add half-precision (halfvec), binary (bit), and sparse (sparsevec) vector types. It offers a simple SQL interface for vector operations and can be integrated seamlessly with other PostgreSQL features and extensions.
- pgvector expands PostgreSQL’s capabilities by adding vector similarity search functionality, enabling developers to conduct efficient and scalable similarity searches on high-dimensional vector data directly within the database.
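The workflow described above might look roughly like the following from Python. This is a sketch under assumptions: the table name items, the three-dimensional column, and the helper to_vector_literal are invented for the example, the %s placeholders assume a driver such as psycopg, and the HNSW index requires a pgvector release that includes it. The block only builds the statements; connecting to a database and executing them is left out.

```python
def to_vector_literal(values):
    """Format a Python sequence as a pgvector text literal such as '[1.0,2.0,3.0]'."""
    return "[" + ",".join(str(float(v)) for v in values) + "]"


# Enable the extension and create a table with a 3-dimensional vector column.
SETUP_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));
"""

# Optional approximate index; without an index, pgvector performs an exact scan.
INDEX_SQL = "CREATE INDEX ON items USING hnsw (embedding vector_l2_ops);"

# The parameter is sent as text and cast to the vector type by PostgreSQL.
INSERT_SQL = "INSERT INTO items (embedding) VALUES (%s::vector);"

# `<->` is pgvector's Euclidean (L2) distance operator.
NEAREST_SQL = "SELECT id FROM items ORDER BY embedding <-> %s::vector LIMIT 5;"

query_param = to_vector_literal([3, 1, 2])
print(NEAREST_SQL, (query_param,))
```

Because the nearest neighbour query is ordinary SQL, it can be combined with WHERE clauses and joins on other tables in the same database.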
Chroma is an open source embedding database. It facilitates the development of LLM applications by allowing knowledge, facts, and skills to be plugged into LLMs.
Chroma provides the means to:
- Store embeddings and their associated metadata
- Embed documents and queries
- Search embedded content
Milvus was developed in 2019 to store, index, and manage massive embedding vectors generated by deep neural networks and other ML models.
- It is designed to process queries over vector inputs and is capable of indexing vectors on a trillion scale. Milvus can handle embedding vectors converted from unstructured data, unlike relational databases which deal with structured data.
- Unstructured data such as emails, papers, IoT sensor data, photos, and protein structures has become increasingly prevalent on the internet. Once such data has been converted into embedding vectors, Milvus stores and indexes those vectors, allowing computers to work with the underlying unstructured data.
- Milvus analyses the correlation between two vectors by calculating their similarity distance. If two embedding vectors are very similar, the original data sources are likely to be similar as well.
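The notion of a similarity distance between two vectors can be made concrete with three measures commonly offered by vector databases: Euclidean (L2) distance, inner product, and cosine similarity. The sketch below uses plain Python rather than the Milvus API, and the sample vectors and variable names are made up for the example.

```python
import math


def euclidean_distance(a, b):
    """Smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def inner_product(a, b):
    """Larger means more similar; sensitive to vector length."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Larger means more similar; depends only on direction."""
    norms = math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    return inner_product(a, b) / norms


query = [1.0, 0.0]
same_direction_longer = [3.0, 0.0]   # same direction as the query, but longer
nearby_point = [1.0, 0.5]            # close to the query, slightly rotated

for name, vec in [
    ("same_direction_longer", same_direction_longer),
    ("nearby_point", nearby_point),
]:
    print(
        name,
        euclidean_distance(query, vec),
        inner_product(query, vec),
        cosine_similarity(query, vec),
    )
```

The measures do not always agree. Euclidean distance ranks nearby_point first, whereas inner product and cosine similarity both rank same_direction_longer first. Which measure is appropriate depends on how the embedding model was trained.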
Qdrant is an open source vector database designed for fast and scalable storage and retrieval of high-dimensional data.
- It utilises advanced indexing techniques, including approximate nearest neighbour (ANN) search and product quantisation, to enable efficient search and retrieval of data.
- Qdrant runs on CPUs, and recent releases can optionally use a GPU to accelerate index building, so it adapts to different hardware configurations.
- The database is highly scalable and capable of handling large-scale data and high user concurrency.
- Qdrant can also store geographic coordinates alongside vectors and filter search results by them, for example by radius or bounding box, making it well suited to location-based applications.
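Product quantisation, mentioned in the list above, compresses each vector into a few small integers. The sketch below shows the general idea in plain Python and is not Qdrant's implementation. The function names are hypothetical, and the codebooks are hand-picked for the example, whereas in practice they are learned from the data, typically with k-means in each subspace.

```python
def split(vector, num_subspaces):
    """Cut a vector into equally sized, contiguous sub-vectors."""
    step = len(vector) // num_subspaces
    return [vector[i * step:(i + 1) * step] for i in range(num_subspaces)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def encode(vector, codebooks):
    """Replace each sub-vector with the index of its nearest centroid."""
    return tuple(
        min(range(len(codebook)), key=lambda k: squared_distance(sub, codebook[k]))
        for sub, codebook in zip(split(vector, len(codebooks)), codebooks)
    )


def approximate_squared_distance(query, code, codebooks):
    """Estimate the distance from an uncompressed query to a compressed vector."""
    return sum(
        squared_distance(sub, codebook[k])
        for sub, codebook, k in zip(split(query, len(codebooks)), codebooks, code)
    )


# One codebook per subspace, each holding two 2-D centroids (assumed, not learned).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]

stored = [0.9, 1.1, 0.1, 0.8]
code = encode(stored, codebooks)     # two small integers instead of four floats
query = [1.0, 1.0, 0.0, 1.0]

print(code)
print(approximate_squared_distance(query, code, codebooks))
print(squared_distance(query, stored))
```

The stored vector is reduced to one centroid index per subspace, and distances are then estimated against the centroids instead of the original values. The estimate differs slightly from the true distance, which is the price paid for the smaller memory footprint.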
Each of these vector databases has its own strengths and applications. Weaviate stands out with its semantic search capabilities and diverse media type support. pgvector integrates seamlessly with PostgreSQL and provides efficient vector similarity searches. Chroma focuses on embedding storage and retrieval for LLM applications. Milvus is specialised for handling massive embedding vectors and unstructured data. Qdrant excels in fast and scalable storage and retrieval of high-dimensional data, with support for filtering by geospatial data. The choice of database depends on the specific requirements and use cases of the application.