Large Language Models (LLM)


From Zero to Hero: Implementing RAG using PostgreSQL

Large Language Models (LLMs) are powerful, but they operate without real-time access to your data. They struggle with private, changing, or domain-specific information, and they may hallucinate when their training data is incomplete. Retrieval-Augmented Generation (RAG) addresses this by pulling in relevant context from an external data store at query time and feeding it into the model.

To enable fast and context-aware retrieval given a user’s query, most systems rely on semantic search over embeddings. You chunk your documents, embed them into a high-dimensional vector space, and do the same for the user’s query. Similarity search then finds the closest matches. However, semantic search alone has gaps. Rare terms, exact phrases or code snippets are often better handled through keyword-based search. This is why many production setups use a hybrid approach that combines both methods.
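At its core, semantic search is just nearest-neighbor comparison between vectors. As a toy illustration in plain Python (tiny 3-dimensional vectors standing in for real embeddings), finding the best match boils down to ranking documents by a similarity metric such as cosine similarity:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: dot product divided by the product of vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dim "embeddings"; a real model would produce hundreds or thousands of dims.
docs = {"cats": [0.9, 0.1, 0.0], "finance": [0.0, 0.2, 0.9]}
query = [0.8, 0.2, 0.1]

# The document whose vector points in the most similar direction wins.
best = max(docs, key=lambda d: cosine_similarity(docs[d], query))
```

The mechanics are identical at scale; pgvector simply performs this comparison inside the database, with indexes to avoid scanning every vector.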

A common choice is to use managed vector databases like Pinecone, Weaviate, or Chroma. These work well, but they also add operational overhead: more infrastructure, more APIs to manage, and more cost. For many projects, there is a simpler option. PostgreSQL, together with the pgvector extension, can act as a vector store and handle semantic search directly—while also supporting keyword search via its built-in Full Text Search (FTS) capabilities. Since PostgreSQL is already the primary database in many backends, this approach makes it convenient to keep everything in one place and still support hybrid search efficiently.


This post is a hands-on walkthrough covering three parts:

  1. Vector search with pgvector
  2. Keyword search with Full Text Search
  3. Row-Level Security (RLS) to restrict access to private data

By the end, you will have a working PostgreSQL-based retrieval setup that can power a RAG system without adding another piece of infrastructure.

– – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – –

Vector Search with pgvector

pgvector adds native vector types and nearest-neighbor search to PostgreSQL, so you can store embeddings next to your relational data and query them with SQL. To use pgvector, enable it once per database:

CREATE EXTENSION IF NOT EXISTS vector;

pgvector supports several vector types. The limits below are the maximum dimensionalities its indexes support:

  • VECTOR – up to 2,000 dims
  • HALFVEC – up to 4,000 dims
  • BIT – up to 64,000 dims
  • SPARSEVEC – up to 1,000 non-zero elements 

For any embedding model with dimensionality below 2,000 (e.g., OpenAI’s text-embedding-3-small), VECTOR is the default choice. For models that exceed 2,000 dims (e.g., OpenAI’s text-embedding-3-large), HALFVEC is a straightforward choice that also cuts storage roughly in half.

Setting Up a VECTOR Column for Embeddings

For any table, you can add one (or more) vector columns. If you know the embedding
dimensionality, it’s best to specify it, which enables early checks and better planning. For example:

CREATE TABLE documents (
    id BIGSERIAL PRIMARY KEY,
    title TEXT,
    content TEXT,
    embedding VECTOR(1536)
);

PostgreSQL does not create embeddings for you. You must compute them using an embedding model and store them in the database (along with the existing data).

Executing a Vector Search

When querying the database, you first calculate an embedding for your query using the same model you used when storing document embeddings. This allows you to compare the query vector against the vectors in your database and retrieve the most relevant matches.


To measure how close two vectors are in this space, several similarity metrics are commonly
used:

  • L2 distance: <->
  • Inner product (negative, for ASC ordering): <#>
  • Cosine distance: <=>
  • L1 distance: <+>
  • Bit vectors: Hamming <~> and Jaccard <%>

Cosine distance is a good default because it measures directional similarity rather than raw magnitude. Most embedding models produce vectors where direction encodes meaning, and many providers normalize vectors to unit length. Under normalization, cosine, dot product, and L2 distance produce identical rankings, but cosine and dot product are usually faster to compute and less sensitive to differences in vector scale.
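This equivalence is easy to verify: for unit-length vectors, ||a − b||² = 2 − 2·(a·b), so ordering by L2 distance and ordering by cosine distance must agree. A small self-contained check with random toy vectors (no external libraries, dimensions and counts arbitrary):

```python
import math
import random

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def l2_dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cos_dist(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1 - sum(x * y for x, y in zip(a, b)) / (na * nb)

random.seed(0)
docs = [normalize([random.gauss(0, 1) for _ in range(8)]) for _ in range(5)]
q = normalize([random.gauss(0, 1) for _ in range(8)])

# Both metrics must produce the same ranking for normalized vectors.
rank_l2 = sorted(range(5), key=lambda i: l2_dist(docs[i], q))
rank_cos = sorted(range(5), key=lambda i: cos_dist(docs[i], q))
```

If your embedding provider does not normalize vectors, the metrics can diverge, and the choice of operator starts to matter.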


Example query:

SELECT id, title
FROM documents
ORDER BY embedding <=> $1  -- $1 is a 1536-dim vector literal like '[...]'
LIMIT 10;

Indexing for performance

Exact vector search provides perfect recall but does not scale well. With large datasets, latency increases because every vector must be scanned. To reduce latency, add an approximate nearest neighbor (ANN) index. pgvector supports two ANN index types: IVFFlat (inverted file with flat compression) and HNSW (hierarchical navigable small world).


IVFFlat partitions the vector space into clusters (so-called lists). At query time, only the most similar lists are searched. It builds quickly and uses modest memory, but recall is lower than with HNSW. Build the index after loading representative data, choose the number of lists when creating the index, and tune ivfflat.probes (how many of those clusters are searched) per query to balance recall and speed.

CREATE INDEX ON documents
USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
SET ivfflat.probes = 10;  -- per-query recall/speed trade-off

Rules of thumb: start with lists = rows / 1000 up to 1M rows and sqrt(rows) above that; set probes ≈ sqrt(lists).
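These rules of thumb are simple enough to encode directly. The helper below is a sketch of that guidance (the function name and exact rounding are my own choices, not part of pgvector):

```python
import math

def ivfflat_params(row_count):
    """Heuristic starting points for IVFFlat tuning:
    lists = rows / 1000 up to 1M rows, sqrt(rows) above that;
    probes ~ sqrt(lists)."""
    if row_count <= 1_000_000:
        lists = max(1, row_count // 1000)
    else:
        lists = int(math.sqrt(row_count))
    probes = max(1, int(math.sqrt(lists)))
    return lists, probes
```

For example, 100k rows would start at 100 lists with 10 probes; from there, measure recall on your own queries and adjust.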


HNSW creates a multi-layer graph structure where each vector is connected to several neighbors. Queries navigate from coarse upper layers down to more detailed lower layers, which makes lookup fast while keeping recall high. It generally performs well for read-heavy workloads but uses more memory and takes longer to build than IVFFlat. It does not require a separate training step.

You control index quality and resource use with two build-time parameters:

  • m: how many connections each node keeps. Higher values improve recall but increase memory.
  • ef_construction: how broadly the index searches for neighbors during build. Higher values improve graph quality but slow indexing.

At query time, you tune:

  • hnsw.ef_search: how many candidate neighbors are explored. Higher values increase recall but slow queries.

CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

Which to choose? Use HNSW if you want the best recall and lowest latency and have memory to spare. Use IVFFlat if you need faster builds or tighter memory usage. With both index types, keep ORDER BY … LIMIT in your query so the planner can use the index.

– – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – –

Keyword search with Full Text Search

With vector search in place, we can add keyword-based retrieval using PostgreSQL’s native Full Text Search (FTS). FTS is well-suited for exact terms, rare words, or code snippets, making it a useful complement to embeddings in a hybrid search setup.

Setting Up a TSVECTOR Column for Searchable Text

PostgreSQL stores searchable text in a TSVECTOR, which contains normalized lexemes and optional positional data. For example:

“Learn how to use tsvector and tsquery” → 'learn':1 'use':4 'tsvector':5 'tsquery':7

The functions to produce and query these vectors are built in. Precompute the TSVECTOR
and store it in the table to avoid recomputing on every query. A generated stored column
keeps it in sync.

ALTER TABLE articles
ADD COLUMN search_vector tsvector
GENERATED ALWAYS AS (
    to_tsvector('simple', coalesce(title, '') || ' ' || coalesce(content, ''))
) STORED;

Tokenization configuration

The example above uses the 'simple' text search configuration (no stemming or stopword removal) to generate the TSVECTOR. PostgreSQL also offers language-specific configurations (english, german, etc.) that handle stemming and stopword removal for that specific language.

To list all supported languages execute:

SELECT cfgname FROM pg_ts_config; 

For multilingual data, you can also store a per-row language and call to_tsvector(ts_config, text).

Weighting

PostgreSQL also supports weighting fields differently (e.g., title > body) using weighted vectors:

ALTER TABLE films
ADD COLUMN search_vector tsvector
GENERATED ALWAYS AS (
    setweight(to_tsvector('english', title), 'A') ||
    setweight(to_tsvector('english', description), 'B')
) STORED;

The default weight array used by PostgreSQL is {D=0.1, C=0.2, B=0.4, A=1.0}, but it can be changed at query time.

Indexing

PostgreSQL supports two index types for Full Text Search: GIN (Generalized Inverted Index) and GiST (Generalized Search Tree).


GIN stores a true inverted index: for each lexeme, it keeps a list of document references (and optional positions). This makes containment checks (which documents contain a given keyword) fast and exact. The trade-offs are a larger index on disk and slower writes, because updates must maintain these posting lists. This is usually acceptable for read-heavy workloads, and GIN remains the standard choice for most text search scenarios.

GiST works differently. Instead of storing full posting lists, it compacts each tsvector into a small bitmask called a signature. These signatures summarize which lexemes appear in a document, but they are not perfectly precise. Because different documents can produce the same bit pattern, GiST may return rows that only might match, and PostgreSQL must recheck the underlying tsvector. The benefit is a smaller index that is faster to build and cheaper to update. The cost is occasional false positives and slower complex queries.


Which to choose? Choose GIN when you want fast, accurate text search and can afford slightly slower index maintenance. Choose GiST when you have many writes, frequent updates, or limited storage and can tolerate a recheck step.


Example:

CREATE INDEX idx_products_search_vector_gin ON products USING GIN (search_vector);
CREATE INDEX idx_products_search_vector_gist ON products USING GIST (search_vector gist_tsvector_ops); 

Executing a Full-Text Search

To run a full-text search in PostgreSQL, you first convert your raw text query into the TSQUERY format. A TSQUERY is the query representation used by PostgreSQL to match against TSVECTOR columns, like how an embedding vector is used to query a vector index. The TSQUERY defines the search terms and logic, and PostgreSQL uses it to check which documents match.


Converting input to a TSQUERY is more involved than vector search. For embeddings, you compute a vector once and search with it. For keyword search, you must decide how to turn a natural language query into a meaningful TSQUERY. Raw queries from users or from an LLM in a RAG system are usually not optimized for keyword search.


The simplest conversion is plainto_tsquery. It normalizes the text, removes stopwords, and stems terms, but it always produces an AND query. All terms must appear in the document. This is often too strict in practice.

SELECT plainto_tsquery('english', 'deep learning models');  -- 'deep' & 'learn' & 'model'

If you need broader recall, there are two alternatives:

  1. Query preprocessing: Improve the query before calling FTS. You can apply your own heuristics or let an LLM shape the query (e.g., if used as a tool). PostgreSQL provides websearch_to_tsquery, which accepts a Google-style syntax and never errors on malformed input. This is useful when the query is already processed.
  2. Convert the AND query to OR: Generate the TSQUERY with plainto_tsquery (as shown above) to benefit from parsing and normalization, then replace & with |. This is a small “hack” to widen the match set with no extra latency, no LLM calls, and no custom heuristics, but should be used with caution. OR queries can become overly broad, especially if the input contains terms that are not removed as stopwords. In the worst case, almost all documents match, reducing the value of the filter step. It is a quick fix to increase recall, but it can weaken filtering significantly.
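The AND-to-OR rewrite from option 2 can be done with a trivial string replacement on the text form of the TSQUERY. A sketch (in practice you would apply it to the output of plainto_tsquery cast to text, then cast back with to_tsquery):

```python
def widen_tsquery(and_query: str) -> str:
    """Rewrite an AND tsquery (as produced by plainto_tsquery) into an OR query.
    Caution: this widens recall but can make the filter step very broad."""
    return and_query.replace(" & ", " | ")
```

For example, widen_tsquery("'deep' & 'learn' & 'model'") yields a query that matches documents containing any of the three lexemes rather than all of them.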

Given the computed TSQUERY, a full-text search runs in two steps. First, you filter rows using the TSQUERY and the @@ operator, which benefits from the FTS index. This step determines which documents match at all. Second, the matching rows are ranked by relevance. Ranking is slower than indexed filtering, so even though the first step is technically optional, it is recommended to keep the match set small to minimize latency during ranking.

PostgreSQL provides two functions to compute a relevance score: ts_rank and ts_rank_cd. Both take a TSVECTOR (document) and a TSQUERY (query) and return a numeric score, where higher means a better match.

ts_rank scores documents based on term frequency and optional weights. More
occurrences of query terms increase the score, but with diminishing returns. Very frequent
terms are discounted, so they do not dominate.

ts_rank_cd includes everything from ts_rank but adds cover density. It considers how close the query terms appear to each other. Documents where terms occur in a tight span receive a higher score than those where terms are far apart. This usually improves ranking for multi-term queries, because clustering signals a more relevant context. The extra logic makes ts_rank_cd slower, as it must examine positional data. For short documents or one-word queries, the difference is small. For longer documents or multi-term queries, ts_rank_cd typically yields better ranking.

Both functions support a normalization bitmask to adjust raw scores. Normalization can reduce bias towards long documents and scale scores into comparable ranges. Bit 1 divides by 1 + log(document_length) and bit 2 divides by document length, both reducing length bias. Bits 4 and 8 consider the number of unique terms for scaling. Bit 32 is common in practice and maps scores to score / (score + 1), producing values between 0 and 1.
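The length-bias and bit-32 mappings are worth internalizing, because bit 32 in particular makes scores from different queries comparable. Expressed as plain Python (the function names are my own):

```python
import math

def apply_bit1(score, doc_length):
    """Normalization bit 1: divide the raw score by 1 + log(document length),
    reducing the advantage of long documents."""
    return score / (1 + math.log(doc_length))

def apply_bit32(score):
    """Normalization bit 32: map any non-negative score into (0, 1)
    via score / (score + 1)."""
    return score / (score + 1)
```

Bit 32 is a monotone transformation, so it never changes the ranking order; it only rescales scores, which is convenient when fusing FTS scores with other signals.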

To summarize, a typical full-text search combines filtering with @@ and ranking with ts_rank or ts_rank_cd. The filter uses the FTS index to select candidates quickly, and ranking then orders only those matches. An example might look as follows:

SELECT id, title, ts_rank_cd(search_vector, query, 32) AS score
FROM products, plainto_tsquery('english', 'wireless headphones') AS query  -- example input
WHERE search_vector @@ query
ORDER BY score DESC
LIMIT 10;

Hybrid Retrieval: Combining pgvector and FTS

Now that you have both vector search and keyword-based search in place, you can combine them to improve result quality. Hybrid search fuses semantic matches from pgvector with keyword matches from Full Text Search. You run two independent queries—one vector search and one FTS search—and then merge the results. Both queries can be executed in parallel.

Reciprocal Rank Fusion (RRF) combines the two result lists by assigning a score based on each document’s rank in the individual lists and summing them. It is popular in many retrieval setups because it is simple, rank-based, and robust to outliers.

For a document d, the fused score is:

score(d) = Σ_{r ∈ R_d} 1 / (k + r)

Where:

  • r = rank position of document d in a result list (1 = best)
  • R_d = set of ranks for d across all lists
  • k = constant to smooth the contribution of lower-ranked results.

The value of k is commonly set to 60 based on empirical evaluation in the information retrieval literature, where it produced stable performance across benchmarks without tuning. Smaller values of k let top-ranked documents dominate the fused score, while larger values flatten the differences between ranks and give lower-ranked results relatively more influence.

Overall, RRF avoids the need to calibrate or normalize scores from different search methods. It is a practical default when combining vector and keyword search.
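The fusion step itself is only a few lines. A minimal sketch in Python, assuming each search already returned an ordered list of document ids (best first) and using k = 60 as discussed:

```python
def rrf_fuse(result_lists, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank),
    where rank is 1-based. Returns doc ids sorted by fused score, best first."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical outputs of the two independent searches:
vector_hits = ["d1", "d2", "d3"]
keyword_hits = ["d3", "d1", "d4"]
fused = rrf_fuse([vector_hits, keyword_hits])
```

Documents appearing in both lists ("d1", "d3") accumulate score from each and float to the top, without any need to compare cosine distances against ts_rank values directly.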

– – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – –

Row-Level Security (RLS)

When powering a RAG system from PostgreSQL, you often store user-specific data alongside shared content. A vector search alone does not guarantee isolation: without additional rules, embeddings and text rows from different users can surface in results. To prevent cross-user leakage and ensure each query only returns data the caller is allowed to see, you enforce access control at the table level.

Row-Level Security (RLS) is a feature in PostgreSQL that allows fine-grained control over which rows each user or role can access or modify. RLS enforces restrictions within a table, ensuring each query only sees the rows it’s permitted to see.

One of the primary advantages of RLS is that access rules are enforced directly in the database, rather than scattered throughout application code. This centralization means there’s one place (the database table policy) to define and maintain your security logic, reducing the chance of an application bug exposing data (for example, a missing filter in an API endpoint). If RLS is properly configured on a table, every query (SELECT, UPDATE, DELETE) executed by non-privileged roles will automatically include the security filter, so you cannot bypass it with a badly written query.

Enabling and Applying Row-Level Security Policies

RLS is implemented through policies defined on tables. By default, tables have no RLS policy, so if a user has SELECT access, they can see all rows. Once RLS is enabled, PostgreSQL requires each row to pass a policy check before it can be returned or modified. If RLS is enabled and no policy exists, PostgreSQL defaults to “deny all,” meaning no rows are visible or editable until a policy is added.


Enabling RLS on a table is done with the following command:

ALTER TABLE my_table ENABLE ROW LEVEL SECURITY; 

After enabling it, you create one or more policies using CREATE POLICY. A policy defines a boolean expression that filters rows for specific operations (SELECT, INSERT, UPDATE, DELETE) and can apply to particular roles or to PUBLIC (all roles). For example, to restrict selects on a documents table to rows where owner_id matches the current user’s ID, you might write:

CREATE POLICY docs_select_own ON documents 
FOR SELECT 
USING (owner_id = current_setting('app.user_id')::int); 

For write operations, consider WITH CHECK clauses on INSERT and UPDATE policies to ensure users cannot create rows that would later be invisible to them.


Important: RLS works alongside PostgreSQL’s regular permission system. It does not replace it. A user must still have the appropriate table privileges (granted via GRANT) to perform operations. RLS then adds an additional filtering layer on top. For example, you might GRANT SELECT ON documents TO app_user_role, giving that role general read access, but RLS policies will determine which rows are actually visible. If a user has no SELECT privilege, RLS does not grant it; they simply cannot query the table.

Privilege Behavior Under RLS

Superusers and roles with the BYPASSRLS attribute skip row-level security checks entirely. The owner of a table is also exempt by default, since owners are typically administrative roles. As a result, if your application connects as the table owner or as a superuser, RLS policies will not be enforced for that session.


To ensure RLS applies, use a dedicated non-owner role for your application to connect. For example, create an app_user role with limited privileges and have the application authenticate as that role. Grant access to the required tables, but do not grant ownership. In that configuration, RLS policies will be evaluated as intended.

Permissive vs. Restrictive Policies

PostgreSQL policies are permissive by default. Multiple permissive policies on a table are combined with logical OR — if any policy allows access, the row is accessible. This makes permissive policies suitable for broad, general rules.

For example, you might define a permissive policy covering all operations (*) that allows each employee to read and update documents within their own organization:

CREATE POLICY docs_org_policy ON documents 
FOR ALL 
USING (org_id = current_setting('app.org_id')::int); 

In some cases, you need stricter rules for certain operations. Restrictive policies address this. Restrictive policies are combined with logical AND — all restrictive conditions must pass, in addition to any permissive policy.


For example, to ensure only the document owner can delete their own document, you can define a restrictive policy for DELETE:

CREATE POLICY docs_delete_own ON documents 
AS RESTRICTIVE 
FOR DELETE 
USING (owner_id = current_setting('app.user_id')::int); 

This combination allows layered access control: permissive policies define the baseline, while restrictive policies tighten constraints on sensitive operations.

– – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – –

Conclusion

PostgreSQL can serve as a practical vector store for RAG systems by combining pgvector for semantic retrieval with Full Text Search for keywords. Simple fusion methods such as Reciprocal Rank Fusion improve result quality without complex score calibration, and both searches can run in parallel. Row-Level Security then prevents cross-user leakage by enforcing access rules inside the database engine. Together, these features let you build a retrieval pipeline on familiar infrastructure with tunable performance, centralized authorization, and minimal operational overhead.

Author


Arne Grobrügge
Data Scientist at scieneers GmbH

arne.grobrügge@scieneers.de

PyData 2025

PyData Berlin 2025 at the Berlin Congress Center was three days full of talks, tutorials, and tech community spirit. The focus was on open-source tools and agentic AI, as well as addressing the question: How can LLMs be used productively and in a controlled manner? We from scieneers gave a presentation on LiteLLM, titled “One API to Rule Them All? LiteLLM in Production”.

M3 2025

At this year’s Minds Mastering Machines (M3) conference in Karlsruhe, the focus was on best practices for GenAI, RAG systems, case studies from different industries, agent systems, and LLMs, as well as legal aspects of ML. We gave three talks about our projects.

How students can benefit from LLMs and chatbots

In modern higher education, the optimisation and personalisation of the learning process is extremely important. Technologies such as Large Language Models (LLMs) and Retrieval Augmented Generation (RAG) can play a supporting role, especially in complex courses such as law. A pilot project at the University of Leipzig, involving the university’s Computing Centre and the Faculty of Law, shows how these technologies can be successfully used in the form of an AI chatbot.

Background and Turing

In 1950, Alan Turing posed the revolutionary question in his essay “Computing Machinery and Intelligence”: Can machines think? He proposed the famous “imitation game”, now known as the Turing Test. In his view, a machine could be said to “think” if it could fool a human tester.

This idea forms the theoretical basis for many modern AI applications. We have come a long way since then, and new opportunities are opening up for students in particular to use AI tools such as LLMs to support their studies.

How does such a chatbot work for law studies?

The AI-based chatbot uses OpenAI’s advanced language models, which are based on the Transformer architecture. These systems, such as GPT-4, can be augmented with the Retrieval Augmented Generation (RAG) method to provide correct answers to more complex legal questions. The process consists of several steps:

1. Ask a question (Query): Students ask a legal question, for example, “What is the difference between a mortgage and a security mortgage?”

2. Processing the query (Embedding): The question is converted into a vector representation (an embedding) so that it can be compared against the stored course material.

3. Search in vector database: The retrieval system searches a vector database for relevant texts that match the question. These can be lecture notes, case solutions or lecture slides.

4. Answer generation: The LLM analyses the data found and provides a precise answer. The answer can be provided with references, e.g. the page in the script or the corresponding slide in the lecture.

This is a powerful tool for law students, as they not only get quick answers to very individual questions, but also have direct links to the relevant teaching materials. This makes it easier to understand complex legal concepts and encourages independent learning.

Benefits for students and professors

Chatbots offer several benefits for teaching and learning in universities. For students, this means

  • Personalised learning support: Students can ask individual questions and receive tailor-made answers.
  • Adaptation to different subjects: You can easily adapt the chatbot to different areas of law, such as civil, criminal or public law. It can also explain more difficult legal concepts or help with exam preparation.
  • Flexibility and cost transparency: Whether at home or on the move, the chatbot is always available and provides access to key information – via a Learning Management System (LMS) such as Moodle or directly as an app. In addition, monthly token budgets ensure clear cost control.

The use of LLMs in combination with RAG also has advantages for teachers:

  • Planning support: AI tools can help to better structure courses.
  • Development of teaching materials: AI can support the creation of assignments, teaching materials, case studies or exam questions.

Challenges in using LLMs

Despite the many benefits and opportunities offered by chatbots and other AI-based learning systems, there are also challenges that need to be considered:

  • Resource-intensive: The operation of such systems requires a high level of computing power and costs.
  • Provider dependency: Currently, many such systems rely on interfaces to external providers such as Microsoft Azure or OpenAI, which can limit independence from universities.
  • Quality of answers: AI systems do not always produce correct results. “Hallucinations” (incorrect or nonsensical answers) can occur. Like all data-based systems, LLMs can be biased by the training data used. Therefore, both the accuracy of the answers and the avoidance of bias must be ensured.

The technical background: Azure and OpenAI

The chatbot above is built on the Microsoft Azure cloud infrastructure. Azure provides several services that enable secure and efficient computing. These include:

  • AI Search: A hybrid search that combines both vector and full-text search to quickly find relevant data.
  • Document Intelligence: Extracts information from PDF documents and provides direct access to lecture slides, scripts, or other educational materials.
  • OpenAI: Azure provides access to OpenAI’s powerful language models. For example, the implementation uses GPT-4 Turbo and the ada-002 model for text embeddings to efficiently generate correct answers.

Figure: Presentation of the data processing procedure

Conclusion

The pilot project with the University of Leipzig shows how the use of LLMs and RAGs can support higher education. These technologies not only make learning processes more efficient, but also more flexible and targeted.

The use of Microsoft Azure also ensures secure and GDPR-compliant data processing.

The combination of powerful language models and innovative search methods offers both students and teachers new and effective ways to improve learning and teaching. The future of learning will be personalized, scalable, and always available.

Authors


Florence Lopez, Data Scientist and Diversity Manager at scieneers GmbH
florence.lopez@scieneers.de


Hikaru Han, Working Student in Online Marketing at scieneers GmbH
shinchit.han@scieneers.de

Leveraging VideoRAG for Company Knowledge Transfer

The Knowledge Transfer Challenge

In many companies, the issue isn’t a lack of data but how to manage and make it accessible to employees. One particularly pressing challenge is the transfer of knowledge from senior employees to younger generations. This is no small task, as it’s not just about transferring what’s documented in manuals or process guides, but the implicit knowledge that exists “between the lines”—the insights and experience locked within the minds of long-serving employees.

This challenge has been present across industries for many years, and as technology evolves, so do the solutions. With the rapid advancement of Artificial Intelligence (AI), particularly Generative AI, new possibilities for preserving and sharing this valuable company knowledge are emerging.

The Rise of Generative AI

Generative AI, especially Large Language Models (LLMs) such as OpenAI’s GPT-4o, Anthropic’s Claude 3.5 Sonnet, or Meta’s Llama 3.2, offers new ways to process large amounts of unstructured data and make them accessible. These models enable users to interact with company data via chatbot applications, making knowledge transfer more dynamic and user-friendly.

But the question remains — how do we make the right data accessible to the chatbot in the first place? This is where Retrieval-Augmented Generation (RAG) comes into play.

Retrieval-Augmented Generation (RAG) for Textual Data

RAG has proven to be a reliable solution for handling textual data. The concept is straightforward: all available company data is chunked, transformed into numerical embeddings, and stored in (vector) databases. When a user makes a query, the system searches for relevant data chunks by comparing the query’s embedding with the stored embeddings.

With this method, there’s no need to fine-tune LLMs. Instead, relevant data is retrieved and appended to the user’s query in the prompt, ensuring that the chatbot’s responses are based on the company’s specific data. This approach works effectively for all types of textual data, including PDFs, webpages, and even image-embedded documents using multi-modal embeddings.
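The chunking step mentioned above can be as simple as a sliding window over characters. A deliberately minimal sketch (real pipelines usually split on tokens, sentences, or document structure, and the sizes here are arbitrary):

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping fixed-size character chunks.
    Overlap preserves context that would otherwise be cut at chunk borders."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

# A 500-character toy document yields three overlapping chunks.
text = "ab" * 250
chunks = chunk_text(text)
```

Each chunk is then embedded and stored; at query time only the best-matching chunks, not whole documents, are injected into the prompt.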

In this way, company knowledge stored in documents becomes easily accessible to employees, customers, or other stakeholders via AI-powered chatbots.

Extending RAG to Video Data

While RAG works well for text-based knowledge, it doesn’t fully address the challenge of more complex, process-based tasks that are often better demonstrated visually. For tasks like machine maintenance, where it’s difficult to capture everything through written instructions alone, video tutorials provide a practical solution without the need for time-consuming documentation-writing.

Videos offer a rich source of implicit knowledge, capturing processes step-by-step with commentary. However, unlike text, automatically describing a video is far from a straightforward task. Even humans approach this differently, often focusing on varying aspects of the same video based on their perspective, expertise, or goals. This variability highlights the challenge of extracting complete and consistent information from video data.

Breaking Down Video Data

To make knowledge captured in videos accessible to users via a chatbot, our goal must be to provide a structured process to convert videos into textual form that prioritizes extracting as much relevant information as possible. Videos consist of three primary components:

  • Metadata: The handling of metadata is typically straightforward, as it is often available in structured textual form.
  • Audio: Audio can be transcribed into text using speech-to-text (STT) models like OpenAI’s Whisper. For industry-specific contexts, it’s also possible to enhance accuracy by incorporating custom terminology into these models.
  • Frames (visuals): The real challenge lies in integrating the frames (visuals) with the audio transcription in a meaningful way. Both components are interdependent — frames often lack context without audio explanations, and vice versa.

Tackling the Challenges of Video Descriptions

Figure 1: Chunking Process of VideoRAG.

When working with video data, we encounter three primary challenges:

  1. Describing individual images (frames).
  2. Maintaining context, as not every frame is independently relevant.
  3. Integrating the audio transcription for a more complete understanding of the video content.

To address these, multi-modal models like GPT-4o, capable of processing both text and images, can be employed. By using both video frames and transcribed audio as inputs to these models, we can generate a comprehensive description of video segments.

However, maintaining context between individual frames is crucial, and this is where frame grouping (often also referred to as chunking) becomes important. There are two primary methods for grouping frames:

  • Fixed Time Intervals: A straightforward approach where consecutive frames are grouped based on predefined time spans. This method is easy to implement and works well for many use cases.
  • Semantic Chunking: A more sophisticated approach where frames are grouped based on their visual or contextual similarity, effectively organizing them into scenes. There are various ways to implement semantic chunking, such as using Convolutional Neural Networks (CNNs) to calculate frame similarity or leveraging multi-modal models like GPT-4o for pre-processing. By defining a threshold for similarity, you can group related frames to better capture the essence of each scene.
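
A minimal version of the semantic variant, assuming each frame has already been embedded by some vision encoder (the embeddings and the threshold value are assumptions, not a prescribed setup), could group consecutive frames like this:

```python
import numpy as np

def chunk_by_similarity(frame_embeddings: np.ndarray,
                        threshold: float = 0.9) -> list[list[int]]:
    """Group consecutive frames into scenes.

    A new scene starts whenever the cosine similarity between a frame and
    its predecessor drops below `threshold`.
    """
    norms = np.linalg.norm(frame_embeddings, axis=1, keepdims=True)
    unit = frame_embeddings / np.clip(norms, 1e-12, None)
    scenes, current = [], [0]
    for i in range(1, len(unit)):
        sim = float(unit[i] @ unit[i - 1])
        if sim < threshold:  # visual break -> close the current scene
            scenes.append(current)
            current = []
        current.append(i)
    scenes.append(current)
    return scenes
```

The threshold is exactly the kind of hyperparameter mentioned above that tends to need per-use-case tuning.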

Once frames are grouped, they can be combined into image grids. This technique allows the model to understand the relation and sequence between different frames, preserving the video’s narrative structure.
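
Assuming all frames in a group share the same resolution, such a grid can be tiled with NumPy along these lines (a sketch, not a fixed recipe):

```python
import numpy as np

def make_image_grid(frames: list[np.ndarray], cols: int = 3) -> np.ndarray:
    """Tile equally sized frames of shape (H, W, C) into one grid image,
    padding the last row with black frames so every row has `cols` entries."""
    h, w, c = frames[0].shape
    rows = -(-len(frames) // cols)  # ceiling division
    pad = [np.zeros((h, w, c), dtype=frames[0].dtype)] * (rows * cols - len(frames))
    padded = frames + pad
    row_imgs = [np.hstack(padded[r * cols:(r + 1) * cols]) for r in range(rows)]
    return np.vstack(row_imgs)
```

Reading order within the grid (left to right, top to bottom) is what lets the model recover the temporal sequence of the scene from a single image.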

The choice between fixed time intervals and semantic chunking depends on the specific requirements of the use case. In our experience, fixed intervals are often sufficient for most scenarios. Although semantic chunking better captures the underlying semantics of the video, it requires tuning several hyperparameters and can be more resource-intensive, as each use case may require a unique configuration.

With the growing capabilities of LLMs and increasing context windows, one might be tempted to pass all frames to the model in a single call. However, this approach should be used cautiously. Passing too much information at once can overwhelm the model, causing it to miss crucial details. Additionally, current LLMs are constrained by their output token limits (e.g., GPT-4o allows 4096 tokens), which further underlines the need for thoughtful chunking and processing strategies.

Building Video Descriptions with Multi-Modal Models

Figure 2: Ingestion Pipeline of VideoRAG.

Once the frames are grouped and paired with their corresponding audio transcription, the multi-modal model can be prompted to generate descriptions of these chunks of the video. To maintain continuity, descriptions from earlier parts of the video can be passed to later sections, creating a coherent flow as shown in Figure 2. At the end, you’ll have descriptions for each part of the video that can be stored in a knowledge base alongside timestamps for easy reference.
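
A sketch of how each chunk's request might be assembled, with the previous segment's description passed forward for continuity as in Figure 2. The message layout follows the OpenAI-style chat convention, and all names here are illustrative rather than taken from a specific SDK:

```python
def build_chunk_messages(transcript: str, frame_grid_urls: list[str],
                         previous_description: str = "") -> list[dict]:
    """Assemble a chat-style message list for a multi-modal model.

    `frame_grid_urls` would be data URLs (or hosted URLs) of the image
    grids for this chunk of the video.
    """
    system = ("You describe segments of an instructional video. "
              "Use the audio transcript and the frames together.")
    if previous_description:
        # Carrying the prior description forward keeps the narrative coherent.
        system += f"\nDescription of the previous segment:\n{previous_description}"
    content = [{"type": "text", "text": f"Audio transcript:\n{transcript}"}]
    content += [{"type": "image_url", "image_url": {"url": u}}
                for u in frame_grid_urls]
    return [{"role": "system", "content": system},
            {"role": "user", "content": content}]
```

The returned list would then be sent to the model of choice; iterating over chunks and feeding each result into the next call yields the rolling descriptions described above.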

Bringing VideoRAG to Life

Figure 3: Retrieval process of VideoRAG.

As shown in Figure 3, all scene descriptions from the videos stored in the knowledge base are converted into numerical embeddings. This allows user queries to be similarly embedded, enabling efficient retrieval of relevant video scenes through vector similarity (e.g., cosine similarity). Once the most relevant scenes are identified, their corresponding descriptions are added to the prompt, providing the LLM with context grounded in the actual video content. In addition to the generated response, the system retrieves the associated timestamps and video segments, enabling users to review and validate the information directly from the source material.
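
Stripped of the vector database layer, this retrieval step is a nearest-neighbour search over the stored description embeddings; an in-memory sketch with NumPy (embedding dimensions and `k` are illustrative):

```python
import numpy as np

def top_k_scenes(query_embedding: np.ndarray,
                 scene_embeddings: np.ndarray, k: int = 3) -> list[int]:
    """Return indices of the k scene descriptions most similar to the
    query, ranked by cosine similarity."""
    q = query_embedding / np.linalg.norm(query_embedding)
    s = scene_embeddings / np.linalg.norm(scene_embeddings, axis=1,
                                          keepdims=True)
    sims = s @ q                      # cosine similarity per scene
    return np.argsort(-sims)[:k].tolist()
```

The returned indices map back to scene descriptions, timestamps, and video segments in the knowledge base.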

By combining RAG techniques with video processing capabilities, companies can build a comprehensive knowledge base that includes both textual and video data. Employees, especially newer ones, can quickly access critical insights from older colleagues — whether documented or demonstrated on video — making knowledge transfer more efficient.

Lessons Learned

During the development of VideoRAG, we encountered several key insights that could benefit future projects in this domain. Here are some of the most important lessons learned:

1. Optimizing Prompts with the CO-STAR Framework

As is the case with most applications involving LLMs, prompt engineering proved to be a critical component of our success. Crafting precise, contextually aware prompts significantly impacts the model’s performance and output quality. We found that using the CO-STAR framework — a structure emphasizing Context, Objective, Style, Tone, Audience, and Response — provided a robust guide for prompt design.

By systematically addressing each element of CO-STAR, we ensured consistency in responses, especially in terms of description format. Prompting with this structure enabled us to deliver more reliable and tailored results, minimizing ambiguities in video descriptions.

2. Implementing Guardrail Checks to Prevent Hallucinations

One of the more challenging aspects of working with LLMs is managing their tendency to generate answers, even when no relevant information exists in the knowledge base. When a query falls outside of the available data, LLMs may resort to hallucinating or using their implicit knowledge—often resulting in inaccurate or incomplete responses.

To mitigate this risk, we introduced an additional verification step. Before answering a user query, we let the model evaluate the relevance of each retrieved chunk from the knowledge base. If none of the retrieved data can reasonably answer the query, the model is instructed not to proceed. This strategy acts as a guardrail, preventing unsupported or factually incorrect answers and ensuring that only relevant, grounded information is used. This method is particularly effective for maintaining the integrity of responses when the knowledge base lacks information on certain topics.
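
The shape of this guardrail can be sketched independently of any particular model; `judge` and `generate` below stand in for the two LLM calls and are purely illustrative:

```python
def answer_with_guardrail(query: str, retrieved_chunks: list[str],
                          judge, generate) -> str:
    """Only generate an answer from chunks the judge deems relevant.

    `judge(query, chunk) -> bool` decides per-chunk relevance;
    `generate(query, chunks) -> str` produces the grounded answer.
    """
    relevant = [c for c in retrieved_chunks if judge(query, c)]
    if not relevant:
        # Refuse rather than hallucinate from implicit model knowledge.
        return "I could not find this in the knowledge base."
    return generate(query, relevant)
```

The extra judge call adds latency and cost per query, but in our experience that trade-off is worth it for domains where a confidently wrong answer is worse than no answer.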

3. Handling Industry-Specific Terminology during Transcription

Another critical observation was the difficulty STT models have with industry-specific terms. These terms, which often include company names, technical jargon, machine specifications, and codes, are essential for accurate retrieval and transcription. Unfortunately, they are frequently misrecognized or transcribed incorrectly, which can lead to ineffective searches or responses.

To address this issue, we created a curated collection of industry-specific terms relevant to our use case. By incorporating these terms into the model’s prompts, we were able to significantly improve the transcription quality and the accuracy of responses. For instance, OpenAI’s Whisper model supports the inclusion of domain-specific terminology, allowing us to guide the transcription process more effectively and ensure that key technical details were preserved.
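
For the OpenAI Whisper API this guidance is passed via the `prompt` parameter (the open-source package calls it `initial_prompt`). A small helper that builds such a glossary prompt from a curated term list might look like this; the glossary format and length cap are assumptions, not Whisper requirements:

```python
def build_whisper_prompt(terms: list[str], max_chars: int = 800) -> str:
    """Concatenate domain terms into a biasing prompt for Whisper.

    Whisper conditions its decoder on this text, nudging it toward the
    listed spellings; it is truncated here because the model only attends
    to a limited prompt window.
    """
    # dict.fromkeys deduplicates while preserving the original order
    prompt = "Glossary: " + ", ".join(dict.fromkeys(terms))
    return prompt[:max_chars]
```

The resulting string would be supplied alongside the audio in the transcription call, so machine codes and product names come through with consistent spellings.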

Conclusion

VideoRAG represents the next step in leveraging generative AI for knowledge transfer, particularly in industries where hands-on tasks require more than just text to explain. By combining multi-modal models and RAG techniques, companies can preserve and share both explicit and implicit knowledge effectively across generations of employees.

Arne Grobrügge
Data Scientist at scieneers GmbH
arne.grobruegge@scieneers.de