DeepEval
DeepEval by Confident AI is an open-source framework for testing large language model systems. It works much like Pytest, but is designed for LLM outputs, evaluating them with metrics such as G-Eval, hallucination, and answer relevancy.
DeepEval can be integrated with SurrealDB to evaluate RAG pipelines, ensuring your LLM applications return relevant, grounded, and faithful responses based on retrieved vector-search context.
SurrealDB's native vector engine lets you store vectors, documents, and metadata in the same database that already holds the rest of your application data.
Install & run
pip install deepeval surrealdb openai
docker run -p 8000:8000 surrealdb/surrealdb:latest \
start --user root --pass secret file:/data/db
SurrealDB ≥ v1.5 ships HNSW indexes for sub-millisecond approximate k-NN search.
Set up a vector table & index (one-time)
DEFINE TABLE rag_docs SCHEMALESS;
DEFINE FIELD text ON rag_docs TYPE string;
DEFINE FIELD source ON rag_docs TYPE string;
DEFINE FIELD embedding ON rag_docs TYPE array<float>;
DEFINE INDEX IF NOT EXISTS rag_docs_vec
    ON rag_docs FIELDS embedding
    HNSW DIMENSION 1536 DIST COSINE;
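Once you have applied the statements above (the helper below also applies them automatically), you can confirm from Python that the table and HNSW index exist. This is a minimal sketch using the same root credentials and the `rag` / `demo` namespace and database used throughout this page:

import asyncio
from surrealdb import AsyncSurreal

async def check_setup() -> None:
    async with AsyncSurreal("ws://localhost:8000/rpc") as db:
        await db.signin({"username": "root", "password": "secret"})
        await db.use("rag", "demo")
        # INFO FOR TABLE lists the fields and indexes defined on rag_docs,
        # so rag_docs_vec should appear under "indexes".
        print(await db.query("INFO FOR TABLE rag_docs;"))

if __name__ == "__main__":
    asyncio.run(check_setup())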
Python helper: Surreal client with add/query
surreal_rag.py
from typing import Any, Dict, List
import hashlib
import os

from openai import OpenAI
from surrealdb import AsyncSurreal

_EMBED_DIM = 1536

# The OpenAI client reads OPENAI_API_KEY from the environment by default;
# passing it explicitly keeps the dependency visible.
_openai = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))


def embed(text: str) -> List[float]:
    """Embed a single string with text-embedding-3-small."""
    resp = _openai.embeddings.create(
        model="text-embedding-3-small",
        input=[text],
        dimensions=_EMBED_DIM,
    )
    return resp.data[0].embedding


class SurrealRAG:
    def __init__(
        self,
        url: str = "ws://localhost:8000/rpc",
        namespace: str = "rag",
        database: str = "demo",
        user: str = "root",
        password: str = "secret",
    ):
        self.url = url
        self.namespace = namespace
        self.database = database
        self.user = user
        self.password = password

    async def _ensure_table(self):
        async with AsyncSurreal(self.url) as db:
            await db.signin({"username": self.user, "password": self.password})
            await db.use(self.namespace, self.database)
            # DEFINE statements do not take bound parameters, so the dimension
            # is interpolated from the module-level constant.
            await db.query(
                f"""
                DEFINE TABLE rag_docs SCHEMALESS;
                DEFINE FIELD text ON rag_docs TYPE string;
                DEFINE FIELD source ON rag_docs TYPE string;
                DEFINE FIELD embedding ON rag_docs TYPE array<float>;
                DEFINE INDEX IF NOT EXISTS rag_docs_vec
                    ON rag_docs FIELDS embedding
                    HNSW DIMENSION {_EMBED_DIM} DIST COSINE;
                """
            )

    async def add(self, docs: List[Dict[str, str]]):
        """Embed and insert documents of the form {"text": ..., "source": ...}."""
        await self._ensure_table()
        async with AsyncSurreal(self.url) as db:
            await db.signin({"username": self.user, "password": self.password})
            await db.use(self.namespace, self.database)
            for d in docs:
                rec = {
                    # A content hash as the record id keeps re-ingested docs deduplicated.
                    "id": hashlib.sha1(d["text"].encode()).hexdigest(),
                    "text": d["text"],
                    "source": d["source"],
                    "embedding": embed(d["text"]),
                }
                await db.create("rag_docs", rec)

    async def query(self, text: str, k: int = 4) -> List[Dict[str, Any]]:
        """Return the k nearest documents to `text` with their cosine distances."""
        await self._ensure_table()
        vec = embed(text)
        async with AsyncSurreal(self.url) as db:
            await db.signin({"username": self.user, "password": self.password})
            await db.use(self.namespace, self.database)
            # The KNN operator <|k|> expects a literal, so k is inlined;
            # the query vector is passed as a bound parameter.
            result = await db.query(
                f"""
                SELECT text, source,
                       vector::distance::cosine(embedding, $vec) AS score
                FROM rag_docs
                WHERE embedding <|{int(k)}|> $vec
                ORDER BY score ASC
                """,
                {"vec": vec},
            )
            # Newer SDK versions return the rows directly; older ones wrap them
            # as [{"result": [...], "status": "OK"}], so handle both shapes.
            rows = result
            if rows and isinstance(rows[0], dict) and "result" in rows[0]:
                rows = rows[0]["result"]
            return [
                {"context": r["text"], "source": r["source"], "score": r["score"]}
                for r in rows
            ]
End-to-end DeepEval example
We’ll test whether an LLM correctly answers “Which fruit is botanically a berry but commonly mistaken for a vegetable?” using information fetched from SurrealDB.
import asyncio

from openai import OpenAI

from deepeval import evaluate
from deepeval.metrics import (
    AnswerRelevancyMetric,
    ContextualPrecisionMetric,
    FaithfulnessMetric,
)
from deepeval.test_case import LLMTestCase

from surreal_rag import SurrealRAG

client = OpenAI()


async def main():
    rag = SurrealRAG()
    await rag.add([
        {
            "text": "The tomato is botanically classified as a berry because it "
                    "develops from a single ovary and contains seeds.",
            "source": "https://en.wikipedia.org/wiki/Tomato",
        },
        {
            "text": "A cucumber is a pepo, a type of berry with a hard rind.",
            "source": "https://en.wikipedia.org/wiki/Cucumber",
        },
        {
            "text": "Strawberries are accessory fruits; their 'seeds' are achenes.",
            "source": "https://en.wikipedia.org/wiki/Strawberry",
        },
    ])

    query = "Which fruit is a berry but people think it's a vegetable?"
    hits = await rag.query(query, k=3)
    context_texts = [h["context"] for h in hits]

    prompt = (
        "Answer the question.\n\n"
        "Context:\n" + "\n".join(context_texts) + f"\n\nQ: {query}\nA:"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    answer = resp.choices[0].message.content.strip()

    test_case = LLMTestCase(
        input=query,
        actual_output=answer,
        expected_output="tomato",
        retrieval_context=context_texts,  # DeepEval expects a list of strings
    )

    evaluate(
        test_cases=[test_case],
        metrics=[
            AnswerRelevancyMetric(threshold=0.7),
            FaithfulnessMetric(threshold=0.7),
            ContextualPrecisionMetric(threshold=0.7),
        ],
    )


if __name__ == "__main__":
    asyncio.run(main())
Running this script prints a local score report and uploads the run to the Confident AI dashboard for historical tracking (after you have logged in with `deepeval login`).
Because each retrieved hit also carries its `score` and `source`, you can trace an answer back to the exact SurrealDB rows that justified it.
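For example, a couple of extra lines inside `main()` (purely illustrative) print that provenance next to the generated answer:

print(answer)
for hit in hits:
    print(f"{hit['score']:.4f}  {hit['source']}")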
Scaling up
- Replace the quick `rag.add()` list with a real corpus (CSVs, PDFs, etc.).
- Encapsulate the embed + insert logic inside a Dagster or Airflow asset if you already orchestrate ETL.
- Use SurrealDB's metadata fields and SurrealQL predicates (e.g. `WHERE metadata.topic = 'law'`) to test retrieval recall for specific slices of your knowledge base.
- Evaluate hundreds of examples by looping through a Hugging Face dataset and appending each `LLMTestCase` to a list before calling `evaluate()` (see the sketch below).
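A rough sketch of that last point; the dataset name and its `question` / `answer` columns are placeholders for whatever QA corpus you actually use:

# batch_eval.py -- hypothetical batch evaluation over a Hugging Face dataset
import asyncio

from datasets import load_dataset  # pip install datasets
from openai import OpenAI

from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

from surreal_rag import SurrealRAG

client = OpenAI()


def answer_with_context(question: str, contexts: list) -> str:
    prompt = "Answer from the context.\n\n" + "\n".join(contexts) + f"\n\nQ: {question}\nA:"
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()


async def build_cases(limit: int = 100) -> list:
    rag = SurrealRAG()
    # Placeholder dataset and column names -- swap in your own QA corpus.
    ds = load_dataset("your_org/your_qa_dataset", split="test")
    cases = []
    for row in ds.select(range(min(limit, len(ds)))):
        hits = await rag.query(row["question"], k=3)
        contexts = [h["context"] for h in hits]
        cases.append(LLMTestCase(
            input=row["question"],
            actual_output=answer_with_context(row["question"], contexts),
            expected_output=row["answer"],
            retrieval_context=contexts,
        ))
    return cases


if __name__ == "__main__":
    evaluate(
        test_cases=asyncio.run(build_cases()),
        metrics=[AnswerRelevancyMetric(threshold=0.7), FaithfulnessMetric(threshold=0.7)],
    )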
Why SurrealDB + DeepEval?
| Benefit | Why it matters |
|---|---|
| Single data plane | Store documents, vectors and relational metadata together – fewer moving parts. |
| Built-in ANN index | Define HNSW with one DDL statement; no external vector service to deploy. |
| SurrealQL | Flexible `SELECT … WHERE … <\|K\|>` queries mix Boolean filters with vector similarity. |
| DeepEval dashboards | Track how retrieval quality + answer faithfulness change as you tweak prompts or embeddings. |
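To illustrate the SurrealQL row above, here is a hedged sketch of a query that mixes an ordinary predicate with the KNN operator, reusing `embed()` from `surreal_rag.py` (the filter value is arbitrary):

import asyncio
from surrealdb import AsyncSurreal
from surreal_rag import embed

async def filtered_search() -> None:
    async with AsyncSurreal("ws://localhost:8000/rpc") as db:
        await db.signin({"username": "root", "password": "secret"})
        await db.use("rag", "demo")
        rows = await db.query(
            """
            SELECT text, source,
                   vector::distance::cosine(embedding, $vec) AS score
            FROM rag_docs
            WHERE source = $source AND embedding <|4|> $vec
            ORDER BY score ASC
            """,
            {
                "vec": embed("a berry that is mistaken for a vegetable"),
                "source": "https://en.wikipedia.org/wiki/Tomato",  # arbitrary metadata filter
            },
        )
        print(rows)

if __name__ == "__main__":
    asyncio.run(filtered_search())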
With these snippets you can drop SurrealDB into any DeepEval-based RAG test harness and keep the rest of your metric logic unchanged. Happy evaluating!
Resources