Picking a vector database: a comparison and guide for 2023

9bow · October 9, 2023, 9:57 AM


Author: Emil Fröberg, co-founder of Vectorview

Introduction

In an era where semantic search and retrieval-augmented generation (RAG) are redefining our online interactions, the backbone supporting these advancements is often overlooked: vector databases. If you're building applications on large language models, RAG, or any platform that leverages semantic search, you're in the right place.

Picking a vector database can be hard. Scalability, latency, cost, and even compliance hinge on this choice. To help with that decision, I've compared the leading vector databases of 2023: Pinecone, Weaviate, Milvus, Qdrant, Chroma, Elasticsearch and PGvector. The data behind the comparison comes from ANN Benchmarks, the docs and internal benchmarks of each vector database, and from digging through open-source GitHub repos.

A comparison of leading vector databases

	Pinecone	Weaviate	Milvus	Qdrant	Chroma	Elasticsearch	PGvector
Is open source
Self-host
Cloud management							()
Purpose-built for vectors
Developer experience
Community	Community page & events	8k☆ github, 4k slack	23k☆ github, 4k slack	13k☆ github, 3k discord	9k☆ github, 6k discord	23k slack	6k☆ github
Queries per second (using text nytimes-256-angular)	150 *for p2, but more pods can be added	791	2406	326	?	700-100 *from various reports	141
Latency, ms (Recall/Percentile 95 (millis), nytimes-256-angular)	1 *batched search, 0.99 recall, 200k SBERT	2	1	4	?	?	8
Supported index types	?	HNSW	Multiple (11 total)	HNSW	HNSW	HNSW	HNSW/IVFFlat
Hybrid search (i.e. scalar filtering)
Disk index support
Role-based access control
Dynamic segment placement vs. static data sharding	?	Static sharding	Dynamic segment placement	Static sharding	Dynamic segment placement	Static sharding	-
Free hosted tier				(free self-hosted)	(free self-hosted)	(free self-hosted)	(varies)
Pricing (50k vectors @1536)	$70	fr. $25	fr. $65	est. $9	Varies	$95	Varies
Pricing (20M vectors, 20M req. @768)	$227 ($2074 for high performance)	$1536	fr. $309 ($2291 for high performance)	fr. $281 ($820 for high performance)	Varies	est. $1225	Varies

Summary

The vector database landscape in 2023 offers a diverse array of options, each catering to different needs. The comparison table gives the full picture; here is a short summary to aid your decision:

Open source and hosted cloud: If you lean towards open-source solutions, Weaviate, Milvus, and Chroma emerge as top contenders. Pinecone, although not open source, shines with its developer experience and a robust fully hosted solution.

Performance: When it comes to raw performance in queries per second (QPS), Milvus takes the lead, followed by Weaviate and Qdrant. In terms of latency, however, Pinecone and Milvus both offer impressive sub-2ms results. Adding more pods to Pinecone can push its QPS much higher.

Community strength: Milvus boasts the largest community presence, followed by Weaviate and Elasticsearch. A strong community often translates to better support, enhancements, and bug fixes.

Scalability, advanced features and security: Role-based access control (RBAC), a feature crucial for many enterprise applications, is found in Pinecone, Milvus, and Elasticsearch. On the scaling front, dynamic segment placement is offered by Milvus and Chroma, making them suitable for ever-evolving datasets. If you need a database with a wide array of index types, Milvus' support for 11 different types is unmatched. While hybrid search is well supported across the board, Elasticsearch falls short in terms of disk index support.

Pricing: For startups or projects on a budget, Qdrant's estimated $9 pricing for 50k vectors is hard to beat. At the other end of the spectrum, for larger projects requiring high performance, Pinecone and Milvus offer competitive pricing tiers.

Conclusion

In conclusion, there's no one-size-fits-all vector database. The ideal choice varies with specific project needs, budget constraints, and personal preferences. This guide offers a comprehensive view of the top vector databases of 2023, in the hope of simplifying the decision-making process for developers. My choice? I'm testing out Pinecone and Milvus in the wild, mostly because of their high performance, Milvus's strong community, and their price flexibility at scale.

Sources

GitHub and docs for each vector database

Appendix 1: explanation of comparison parameters

Is open source: Indicates if the software's source code is freely available to the public, allowing developers to review, modify, and distribute the software.

Self-host: Specifies if the database can be hosted on a user's own infrastructure rather than depending on a third-party cloud service.

Cloud management: Indicates if the database offers an interface for cloud management.

Purpose-built for vectors: Means the database was specifically designed with vector storage and retrieval in mind, rather than being a general database with added vector capabilities.

Developer experience: Evaluates how user-friendly and intuitive it is for developers to work with the database, considering aspects like documentation, SDKs, and API design.

Community: Assesses the size and activity of the developer community around the database. A strong community often indicates good support, contributions, and the potential for continued development.

Queries per second: How many queries the database can handle per second on a specific benchmarking dataset (in this case, the nytimes-256-angular dataset).

Latency: The delay (in milliseconds) between initiating a request and receiving a response. The figure reported is the 95th percentile: 95% of query latencies on the nytimes-256-angular dataset fall under the specified time.
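To make the two benchmark metrics concrete, here is a minimal Python sketch of how a 95th-percentile latency and a QPS figure could be derived from a list of measured query times. The sample values are hypothetical, not taken from any benchmark, and the QPS calculation assumes a single client issuing queries one after another; real benchmarks typically run many clients in parallel.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def sequential_qps(latencies_ms):
    """Queries per second for one client sending queries back to back."""
    total_seconds = sum(latencies_ms) / 1000
    return len(latencies_ms) / total_seconds


# Hypothetical per-query latencies in milliseconds (illustrative only).
samples_ms = [1.0] * 18 + [4.0, 10.0]

p95_ms = percentile(samples_ms, 95)
qps = sequential_qps(samples_ms)
print(f"p95 latency: {p95_ms} ms, sequential QPS: {qps:.0f}")
```

Note how the single 10 ms outlier does not affect the p95 value, which is why percentile latency is usually reported alongside throughput rather than a plain average.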

Supported index types: Refers to the various indexing techniques the database supports, which can influence search speed and accuracy. Some vector databases support multiple index types, such as HNSW and IVF.
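The speed/accuracy trade-off behind index types can be illustrated with a toy inverted-file (IVF-style) index in plain Python. This is only a sketch under simplifying assumptions: the centroids are hand-picked rather than learned by clustering, the data is made up, and the function names (`build_ivf`, `ivf_search`, `nprobe`) are illustrative rather than any database's API.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_ivf(vectors, centroids):
    """Assign every vector id to the list of its nearest centroid."""
    lists = {i: [] for i in range(len(centroids))}
    for vid, vec in vectors.items():
        nearest = min(range(len(centroids)),
                      key=lambda i: squared_distance(vec, centroids[i]))
        lists[nearest].append(vid)
    return lists


def ivf_search(query, vectors, centroids, lists, nprobe=1, top_k=1):
    """Scan only the nprobe lists whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)),
                   key=lambda i: squared_distance(query, centroids[i]))[:nprobe]
    candidates = [vid for i in probe for vid in lists[i]]
    return sorted(candidates,
                  key=lambda vid: squared_distance(query, vectors[vid]))[:top_k]


def exact_search(query, vectors, top_k=1):
    """Brute-force baseline that scans every vector."""
    return sorted(vectors,
                  key=lambda vid: squared_distance(query, vectors[vid]))[:top_k]


# Toy data and hand-picked centroids (hypothetical, for illustration).
vectors = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [10.0, 10.0],
           "d": [11.0, 10.0], "e": [4.9, 5.0]}
centroids = [[0.5, 0.0], [10.5, 10.0]]
lists = build_ivf(vectors, centroids)

query = [6.0, 6.0]
fast = ivf_search(query, vectors, centroids, lists, nprobe=1)
thorough = ivf_search(query, vectors, centroids, lists, nprobe=2)
exact = exact_search(query, vectors)
print(fast, thorough, exact)
```

With `nprobe=1` the search scans fewer vectors but misses the true nearest neighbor, which sits in the other list; with `nprobe=2` it matches the exact result at the cost of scanning everything. Graph-based indexes such as HNSW make a similar trade through different parameters.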

Hybrid search: Determines if the database allows combining traditional (scalar) queries with vector queries. This can be crucial for applications that need to filter results on non-vector criteria.
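As a rough illustration of scalar filtering combined with vector search, the following Python sketch filters records by metadata first and then ranks the survivors by cosine similarity. It is a brute-force sketch with made-up records and a hypothetical `hybrid_search` helper; production databases apply the filter inside an approximate index rather than scanning every record.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def hybrid_search(records, query_vector, predicate, top_k=2):
    """Apply the scalar filter, then rank remaining records by similarity."""
    candidates = [r for r in records if predicate(r["metadata"])]
    ranked = sorted(candidates,
                    key=lambda r: cosine_similarity(r["vector"], query_vector),
                    reverse=True)
    return [r["id"] for r in ranked[:top_k]]


# Hypothetical records: an id, an embedding, and scalar metadata.
records = [
    {"id": "a", "vector": [1.0, 0.0], "metadata": {"year": 2021}},
    {"id": "b", "vector": [0.9, 0.1], "metadata": {"year": 2023}},
    {"id": "c", "vector": [0.0, 1.0], "metadata": {"year": 2023}},
    {"id": "d", "vector": [0.7, 0.7], "metadata": {"year": 2023}},
]

result = hybrid_search(records, [1.0, 0.0], lambda m: m["year"] == 2023)
print(result)
```

Record "a" is the closest vector to the query but is excluded by the `year` filter, which is exactly the behavior a pure vector search cannot provide.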

Disk index support: Indicates if the database supports storing indexes on disk. This is essential for handling large datasets that cannot fit into memory.

Role-based access control: Checks if the database has security mechanisms that allow permissions to be granted to specific roles or users, enhancing data security.
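The idea of role-based access control can be sketched in a few lines of Python: permissions attach to roles, and users are granted roles rather than individual permissions. The role names, actions, and the `is_allowed` helper below are all hypothetical and do not correspond to any particular database's permission model.

```python
# Hypothetical role-to-permission mapping (illustrative names only).
ROLE_PERMISSIONS = {
    "reader": {"search"},
    "writer": {"search", "upsert"},
    "admin": {"search", "upsert", "delete_collection"},
}


def is_allowed(user_roles, action):
    """Return True if any of the user's roles grants the requested action."""
    return any(action in ROLE_PERMISSIONS.get(role, set())
               for role in user_roles)


print(is_allowed(["reader"], "search"))             # reader may query
print(is_allowed(["reader"], "upsert"))             # reader may not write
print(is_allowed(["writer", "reader"], "upsert"))   # any granting role suffices
```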

Dynamic segment placement vs. static data sharding: Refers to how the database manages data distribution and scaling. Dynamic segment placement allows for more flexible data distribution based on real-time needs, while static data sharding divides data into predetermined segments.
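The difference between the two approaches can be sketched as follows. This is a generic illustration, not how any of the compared databases implements placement: `static_shard` maps a key to a fixed shard by hashing, while the hypothetical `DynamicPlacer` class assigns each new segment to whichever node currently holds the least data, so newly added nodes pick up load immediately.

```python
import zlib


def static_shard(key, num_shards):
    """Static sharding: the shard is a fixed function of the key."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


class DynamicPlacer:
    """Dynamic placement: each new segment goes to the least-loaded node."""

    def __init__(self, nodes):
        self.load = {node: 0 for node in nodes}
        self.assignment = {}

    def add_node(self, node):
        # A new node starts empty, so it attracts the next segments.
        self.load.setdefault(node, 0)

    def place(self, segment_id, size):
        # Ties are broken by node name to keep the result deterministic.
        node = min(self.load, key=lambda n: (self.load[n], n))
        self.load[node] += size
        self.assignment[segment_id] = node
        return node


placer = DynamicPlacer(["n1", "n2"])
placements = [placer.place("seg1", 10), placer.place("seg2", 5),
              placer.place("seg3", 5)]
placer.add_node("n3")
placements.append(placer.place("seg4", 5))
print(placements)
print(static_shard("doc-42", 4) == static_shard("doc-42", 4))
```

With static sharding, changing `num_shards` changes where most keys map, so growing the cluster generally means moving existing data; the dynamic placer only decides where new segments go.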

Free hosted tier: Specifies if the database provider offers a free cloud-hosted version, allowing users to test or use the database without initial investment.

Pricing (50k vectors @1536) and Pricing (20M vectors, 20M req. @768): Provides the cost of storing and querying a specific amount of data, where the number after @ is the vector dimension. This gives an insight into the database's cost-effectiveness for both small and large-scale use cases.
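To get a feel for the two pricing scenarios, the raw size of the vectors can be estimated with simple arithmetic. The sketch below assumes 32-bit floats (4 bytes per dimension), which is an assumption and not stated in the comparison, and it counts only the raw vectors, not index structures, metadata, or replication.

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors are stored as 32-bit floats


def raw_vector_bytes(num_vectors, dimensions, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw storage for the vectors alone, excluding index and metadata."""
    return num_vectors * dimensions * bytes_per_value


small = raw_vector_bytes(50_000, 1536)       # 50k vectors @1536
large = raw_vector_bytes(20_000_000, 768)    # 20M vectors @768

print(f"50k @1536: {small / 1e6:.1f} MB")
print(f"20M @768:  {large / 1e9:.2f} GB")
```

Under these assumptions the small scenario is roughly 0.3 GB of raw vectors and the large one roughly 61 GB, which helps explain why the second pricing row differs so much between in-memory and disk-backed offerings.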