
Using a local AI to search your own documents: RAG explained without jargon

Alin · Developer · 15 March 2026 · 8 min read
Stack of books representing a knowledge base

Every business wants an AI that knows their own policies, contracts, and docs. Here's what actually makes that work — and why it's simpler than most people assume.

Every business that starts using AI eventually asks the same question: "Can we make it actually know our stuff?" The policies, the contracts, the product documentation, the internal wiki that nobody can search.

The technique for doing this has a slightly intimidating name: Retrieval-Augmented Generation, or RAG. Behind the name is an idea so simple you've been doing it your whole life.

The idea in one sentence

Before the AI answers your question, quietly hand it the two or three pages from your documents that are most likely to contain the answer, and ask it to respond based on those.

That's it. That's RAG. It's not fine-tuning. It's not training a custom model. It's giving the AI a cheat sheet just in time.

Why not just fine-tune?

Fine-tuning is when you adjust the model's internal weights by training it on new material. It sounds like the "right" way to teach an AI about your business, but for almost every real use case it's the wrong tool:

  • It's expensive to do well.
  • The model still makes things up confidently — fine-tuning on a fact doesn't prevent the model from contradicting it later.
  • Updating is painful — every time a document changes, you have to retrain.
  • You can never verify where an answer came from.

RAG avoids all of this. The model stays untouched. The documents stay in your database. Updates are instant. And because the model is given specific source text to answer from, you can show your users exactly which document the answer came from.

The four moving parts

1. Your documents

Anything the AI should know about — PDFs, Word docs, web pages, wiki exports, contracts, manuals. In most small businesses we're talking a few hundred files at most.

2. Chunks

Those documents get split into smaller pieces (usually a few paragraphs each). Why? Because when it's time to find relevant content, a paragraph is a more useful unit than an entire 40-page contract.
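To make that concrete, here's a minimal Python sketch of paragraph-based chunking with a small overlap. Real pipelines often split on headings or token counts instead, but the idea is the same:

    # Split a document into chunks of a few paragraphs, with a one-paragraph
    # overlap so ideas that span a chunk boundary aren't lost.
    def chunk_text(text: str, paragraphs_per_chunk: int = 3, overlap: int = 1) -> list[str]:
        paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
        chunks = []
        step = max(1, paragraphs_per_chunk - overlap)
        for i in range(0, len(paragraphs), step):
            chunk = "\n\n".join(paragraphs[i:i + paragraphs_per_chunk])
            if chunk:
                chunks.append(chunk)
        return chunks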

3. Embeddings

Each chunk is converted into a mathematical fingerprint — a list of numbers that captures its meaning. Two chunks about the same topic end up with similar fingerprints, even if they use different words. A small embedding model (Ollama has good ones — nomic-embed-text, for example) does this in seconds.

All these fingerprints get stored in a specialised database called a vector database. Good small options: Chroma, LanceDB, or pgvector if you already use PostgreSQL.
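Here's a rough sketch of what indexing looks like in Python, assuming Ollama is running locally with nomic-embed-text pulled and Chroma is the vector database. The path, collection name, and helper name are just placeholders:

    # Index chunks into a local Chroma collection, using Ollama's
    # nomic-embed-text model to compute each chunk's fingerprint.
    import chromadb
    import ollama

    client = chromadb.PersistentClient(path="./rag_db")
    collection = client.get_or_create_collection("company_docs")

    def index_chunks(chunks: list[str], source: str) -> None:
        for i, chunk in enumerate(chunks):
            # One embedding call per chunk; the model runs locally via Ollama
            embedding = ollama.embeddings(model="nomic-embed-text", prompt=chunk)["embedding"]
            collection.add(
                ids=[f"{source}-{i}"],
                embeddings=[embedding],
                documents=[chunk],
                metadatas=[{"source": source}],
            )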

4. Retrieval + generation

When a user asks a question, the same embedding model converts their question into a fingerprint. The vector database finds the handful of chunks whose fingerprints are closest. Those chunks get pasted into the prompt that's sent to the AI, roughly like: "Here are some relevant excerpts from our documents: [chunks]. Based only on these, answer the user's question: [question]."

The AI answers, often with a citation. That's the whole thing.
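Here's a sketch of that retrieval-and-generation step, reusing the Chroma collection from the indexing sketch above. The prompt wording and the chat model are assumptions; swap in whatever small model you've pulled into Ollama:

    import ollama

    def answer(question: str, n_chunks: int = 3) -> str:
        # Fingerprint the question with the same embedding model used for the chunks
        q_embedding = ollama.embeddings(model="nomic-embed-text", prompt=question)["embedding"]

        # Ask the vector database for the closest chunks
        results = collection.query(query_embeddings=[q_embedding], n_results=n_chunks)
        excerpts = "\n\n".join(results["documents"][0])

        # Hand the model its cheat sheet and ask it to answer from that alone
        prompt = (
            "Here are some relevant excerpts from our documents:\n\n"
            f"{excerpts}\n\n"
            f"Based only on these excerpts, answer the user's question: {question}"
        )
        response = ollama.chat(
            model="llama3.1:8b",  # any small general model pulled into Ollama will do
            messages=[{"role": "user", "content": prompt}],
        )
        return response["message"]["content"]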

What RAG is genuinely great at

  • Answering questions that can be found in your documents.
  • Summarising across several related documents at once.
  • Customer support for products with complex documentation.
  • Internal Q&A — HR policies, onboarding questions, process manuals.
  • Legal and contract review, where exact wording matters.

What it's not so great at

  • Anything requiring reasoning across many documents at once (it only sees a handful at a time).
  • Questions whose answers aren't literally in the documents you gave it.
  • Tasks that require the model to do real maths or write long code.

If your question is "What does our leave policy say about study leave?" — perfect. If your question is "Given all our contracts, what's our average renewal rate?" — RAG alone won't cut it; you'd want a more structured approach.

A minimal setup you can actually run

The simplest working version of RAG in 2026:

  • Ollama running a small general model (Llama 3.1 8B, for example) and an embedding model (nomic-embed-text).
  • A local Chroma or LanceDB database to hold embeddings.
  • A few hundred lines of Python or TypeScript to tie it together.
  • A basic web interface with an input box and a chat window.

The entire stack runs on a single laptop and costs nothing to operate. A small team can build a working proof of concept in a few days.
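To give a feel for how little glue code is involved, here's how the sketches above might compose into a command-line proof of concept. It handles plain-text files only; PDFs and Word docs would need a text-extraction step first, and a real version would wrap this in the web interface mentioned above:

    from pathlib import Path

    # Index every plain-text file in a folder, then answer questions in a loop.
    for path in Path("docs").glob("*.txt"):
        index_chunks(chunk_text(path.read_text()), source=path.name)

    while True:
        question = input("Ask a question (or press Enter to quit): ").strip()
        if not question:
            break
        print(answer(question))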

Where it quietly pays for itself

The use case we see pay back fastest is internal support. Every business has a small set of questions that get asked constantly — leave policies, expense processes, "who do I ask about X?" A well-built RAG assistant deflects the majority of these and gives a consistent answer every time. The hours saved add up quickly.

Fine-tuning teaches the AI to sound like you. RAG teaches it to answer from what you actually wrote down. For almost every business, the second one is what you actually want.

If you've been waiting for the "right" way to make AI useful to your business, RAG is probably it. It's boring. It works. And it runs on a laptop.