Productivity · Apr 1, 2026
How I Built My Knowledge Base
A deep dive into the process of creating a personal knowledge base to organize and access information efficiently.
A Self-Maintaining Personal Knowledge Base
Most people's experience with LLMs and documents looks like RAG: upload files, retrieve chunks at query time, generate an answer. It works, but nothing accumulates. Ask a subtle question that requires synthesising five documents and the LLM re-derives the answer from scratch every time.
I wanted something different: a wiki that gets smarter with every source I add, without me doing any of the maintenance work. Here's how I built it.
The Core Idea
The system has three components:
- raw/: a drop zone for unprocessed sources (articles, notes, documentation, anything)
- wiki/: a structured, interlinked Markdown wiki maintained entirely by an LLM
- A compile pipeline: when a new source lands in raw/, a script hands it to a Claude agent that reads it, extracts knowledge, and integrates it into the wiki
The wiki is the artefact. It's not a search index or an embedding store. It's an actual document collection with cross-references, synthesis, and structure. Every time you add a source, the agent doesn't just index it. It reads the existing wiki, figures out what's new, and either updates existing articles or creates new ones.
Directory Structure
knowledge/personal/
├── raw/ # Drop sources here — never modify these
├── wiki/
│ ├── concepts/ # Atomic knowledge articles (one concept per file)
│ ├── connections/ # Cross-cutting synthesis (links 2+ concepts)
│ ├── index.md # Auto-maintained index of all articles
│ └── log.md # Chronological record of every operation
├── outputs/ # Generated reports and analyses (lint, queries)
├── scripts/
│ ├── compile.py # Integrates raw sources into the wiki
│ ├── lint.py # Health checks on the wiki
│ ├── log_writer.py # Shared logging utility
│ └── state.json # Per-source compile state (sha256 + timestamp)
├── CLAUDE.md # Schema and instructions for Claude
└── pyproject.toml # Python project (claude-agent-sdk)
The raw/ folder is the input queue. The wiki/ folder is the compiled output. The LLM writes all wiki articles.
Two Types of Wiki Articles
The wiki has a deliberate two-level structure that separates what things are from how they relate.
Concepts (wiki/concepts/)
One article per idea. Each file covers exactly one concept, explained clearly enough to stand alone.
---
title: "Spaced Repetition"
tags: [learning, memory]
sources:
- "raw/make-it-stick.md"
created: 2026-04-01
updated: 2026-04-03
---
# Spaced Repetition
Spaced repetition is a learning technique that schedules reviews at
increasing intervals based on how well you remember each item. It
exploits the spacing effect: memories are stronger when study sessions
are distributed over time rather than massed together.
## Key Points
- Review intervals expand as recall improves (hours → days → weeks → months)
- Forgetting slightly before a review strengthens long-term retention
- Implemented in tools like Anki, which uses the SM-2 algorithm
## Details
The theoretical basis is Ebbinghaus's forgetting curve (1885)...
## Related Concepts
- [[concepts/desirable-difficulty]] - Spaced repetition is one instance of this broader principle
- [[concepts/interleaving]] - Often combined with spaced repetition for stronger retention
## Related Connections
- [[connections/spaced-repetition-and-desirable-difficulty]]
## Sources
- [[raw/make-it-stick.md]] - Core explanation and research citations
The frontmatter fields are fixed: title, tags, sources, created, updated. created is set once and never changes. updated is refreshed on every edit. New sources are appended to the sources list. Existing entries are never removed.
Connections (wiki/connections/)
Cross-cutting synthesis articles that link two or more existing concepts and articulate a non-obvious relationship between them. The bar is deliberately high: a connection page should only exist when linking the concepts reveals something that neither page says alone.
---
title: "Connection: Spaced Repetition and Desirable Difficulty"
connects:
- "concepts/spaced-repetition"
- "concepts/desirable-difficulty"
created: 2026-04-01
updated: 2026-04-01
---
# Connection: Spaced Repetition and Desirable Difficulty
## The Connection
Spaced repetition is the canonical implementation of desirable difficulty —
it works precisely _because_ retrieval is hard.
## Key Insight
Most people practise at the point where retrieval feels easy, which
produces fluency without durability. Spaced repetition forces you to
practise at the edge of forgetting, where retrieval is difficult and
therefore maximally consolidating. The mechanism of spaced repetition
_is_ the mechanism of desirable difficulty.
## Evidence
Studies by Roediger and Karpicke (2006) showed that retrieval practice
under conditions of high difficulty produced 50% better long-term
retention than re-reading...
## Related Concepts
- [[concepts/spaced-repetition]]
- [[concepts/desirable-difficulty]]
Connections don't carry their own sources field. They inherit context from the concept pages they bridge.
How the Compile Script Works
scripts/compile.py uses the Claude Agent SDK to run an agent with Read, Write, and Glob tools scoped to the wiki. The schema from CLAUDE.md is injected verbatim into the system prompt, so the prompt and the schema can never drift apart.
For each unprocessed file in raw/, the script kicks off one agent loop (up to 20 turns) with a single instruction: ingest this source into the wiki. From there, the agent does its own work:
- Reads wiki/index.md to see what already exists
- Reads the raw source
- Pulls in whichever concept and connection pages it judges relevant
- Edits those pages: appends the source, integrates new claims, reconciles contradictions
- Creates new concept pages for genuinely new topics
- Creates new connection pages only when a non-obvious cross-concept relationship exists
- Regenerates wiki/index.md
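The "unprocessed" check itself is just a hash comparison against scripts/state.json. A sketch of how that selection might look (the state-file shape here is my assumption, based on the description of state.json below):

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def unprocessed_sources(raw_dir: Path, state_path: Path) -> list[Path]:
    """Return raw files that are new, or edited since their last compile."""
    state = json.loads(state_path.read_text()) if state_path.exists() else {}
    pending = []
    for src in sorted(raw_dir.glob("*.md")):
        recorded = state.get(src.name, {}).get("sha256")
        if recorded != sha256_of(src):  # never compiled, or changed since
            pending.append(src)
    return pending
```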
After the agent finishes, the script diffs a sha256 snapshot of the wiki taken before and after the run to determine which articles were created vs. updated, records the result in wiki/log.md, and marks the source as processed in scripts/state.json (with its sha256, so editing the raw file later marks it stale on the next lint).
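The created-vs-updated classification reduces to a dict diff over per-file hashes. Roughly (a sketch; the real script's naming may differ):

```python
import hashlib
from pathlib import Path

def snapshot(wiki_dir: Path) -> dict[str, str]:
    """Map every wiki article path to the sha256 of its contents."""
    return {
        str(p.relative_to(wiki_dir)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in wiki_dir.rglob("*.md")
    }

def diff_snapshots(before: dict[str, str], after: dict[str, str]) -> tuple[list[str], list[str]]:
    """Classify articles as created (new path) or updated (same path, new hash)."""
    created = sorted(p for p in after if p not in before)
    updated = sorted(p for p in after if p in before and after[p] != before[p])
    return created, updated
```

Diffing snapshots rather than parsing agent transcripts keeps the log honest: it records what actually changed on disk, not what the agent claims it did.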
Tool calls are restricted to paths inside wiki/concepts/, wiki/connections/, and wiki/index.md. The agent can't touch raw/, scripts/, or outputs/.
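Enforcing that scope comes down to one predicate over each tool call's target path. A hypothetical guard (the function name and constants are illustrative, not the SDK's API):

```python
from pathlib import PurePosixPath

ALLOWED_PREFIXES = ("wiki/concepts/", "wiki/connections/")
ALLOWED_FILES = ("wiki/index.md",)

def is_writable(path: str) -> bool:
    """True only for paths the compile agent is allowed to touch."""
    if ".." in PurePosixPath(path).parts:
        return False  # no escaping the sandbox via relative segments
    norm = PurePosixPath(path).as_posix()
    return norm in ALLOWED_FILES or norm.startswith(ALLOWED_PREFIXES)
```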
No API Key Required
The Agent SDK calls the locally installed claude CLI. If you already use Claude Code, you're already authenticated. No ANTHROPIC_API_KEY setup required.
The Lint System
scripts/lint.py runs six health checks on the wiki and writes a Markdown report to outputs/lint-YYYY-MM-DD.md.
| Severity | Check | What It Catches |
|---|---|---|
| Error | Broken links | [[wikilinks]] pointing to non-existent articles |
| Warning | Orphan pages | Articles with zero inbound links from other articles |
| Warning | Orphan sources | Raw files that haven't been compiled yet |
| Warning | Stale articles | Raw sources whose sha256 has changed since they were last compiled |
| Suggestion | Missing backlinks | A links to B, but B doesn't link back to A |
| Suggestion | Sparse articles | Below 200 words — likely incomplete |
The report uses error/warning/suggestion severity levels so you can triage at a glance. The script exits with code 1 if there are errors, making it CI-friendly.
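The broken-link check, the only error-level one, amounts to comparing extracted [[wikilinks]] against the set of articles that exist on disk. A simplified version (raw/ targets are skipped because sources live outside wiki/):

```python
import re
from pathlib import Path

WIKILINK = re.compile(r"\[\[([^\]|#]+)")

def broken_links(wiki_dir: Path) -> list[tuple[str, str]]:
    """Return (article, target) pairs whose link target doesn't exist."""
    articles = {p.relative_to(wiki_dir).with_suffix("").as_posix()
                for p in wiki_dir.rglob("*.md")}
    broken = []
    for page in wiki_dir.rglob("*.md"):
        for target in WIKILINK.findall(page.read_text()):
            t = target.strip().removesuffix(".md")
            if t not in articles and not t.startswith(("raw/", "outputs/")):
                broken.append((page.relative_to(wiki_dir).as_posix(), target))
    return broken
```

The CI-friendly exit code then falls out naturally: `sys.exit(1 if broken else 0)` after the report is written.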
Chronological Logging
Every compile, lint, and query operation appends an entry to wiki/log.md. This gives you a searchable record of how the wiki evolved.
## [2026-04-14T13:05:00] compile | how-does-claude-code-actually-work.md
- Source: raw/how-does-claude-code-actually-work.md
- Articles created: [[concepts/llm-tool-calling]], [[connections/harness-optimization-and-model-behavior]]
- Articles updated: [[concepts/ai-coding-harness]]
## [2026-04-14T14:42:44] lint | health check
- Errors: 0 | Warnings: 3 | Suggestions: 8
- Report: [[outputs/lint-2026-04-14]]
The log writer (scripts/log_writer.py) is a shared utility imported by both scripts. It also exposes append_query() for logging Q&A sessions, used by the query workflow described below.
Automation with Claude Code Hooks
Both scripts run automatically at the start of every Claude Code session. In .claude/settings.json:
{
"hooks": {
"SessionStart": [
{
"matcher": "startup",
"hooks": [
{
"type": "command",
"command": "cd /path/to/knowledge/personal && /opt/homebrew/bin/uv run python scripts/compile.py"
},
{
"type": "command",
"command": "cd /path/to/knowledge/personal && /opt/homebrew/bin/uv run python scripts/lint.py"
}
]
}
]
}
}
Hooks run with a minimal shell environment, so use the full path to uv (find it with which uv). Drop a file into raw/, open Claude Code, and the wiki is already updated and linted by the time the session loads.
CLAUDE.md: The Schema File
CLAUDE.md lives at the root of knowledge/personal/ and is the standing instruction for every compile call. It defines:
- The directory structure and what each folder is for
- The exact frontmatter and section format for concept and connection articles
- Rules for when to create vs. update articles
- Rules for when connection articles are warranted
- The query workflow (read index → synthesize → file good outputs back as new pages)
Ingesting Sources
I use Obsidian Web Clipper as the front door: a browser extension that saves any web page directly into the vault as a Markdown file. Since the vault root is knowledge/personal/, clipping a page lands a structured Markdown file in raw/.
The clipper template captures the page body alongside useful metadata:
{
"schemaVersion": "0.1.0",
"name": "Default",
"behavior": "create",
"noteContentFormat": "{{content}}",
"properties": [
{ "name": "title", "value": "{{title}}", "type": "text" },
{ "name": "source", "value": "{{url}}", "type": "text" },
{
"name": "author",
"value": "{{author|split:\", \"|wikilink|join}}",
"type": "multitext"
},
{ "name": "published", "value": "{{published}}", "type": "date" },
{ "name": "created", "value": "{{date}}", "type": "date" },
{ "name": "description", "value": "{{description}}", "type": "text" },
{ "name": "tags", "value": "clippings", "type": "multitext" }
],
"noteNameFormat": "{{title}}",
"path": "raw"
}
Setting path to raw puts clippings directly where the compile pipeline expects them. A clipped file ends up looking like:
---
title: "Spaced Repetition: How It Works"
source: "https://example.com/spaced-repetition"
author: [[Jane Smith]]
published: 2024-03-15
created: 2026-04-14
description: "An overview of spaced repetition systems and the science behind them."
tags:
- clippings
---
[Full article content in Markdown...]
The compile agent reads the entire file (frontmatter and body), so the source URL, author, and publication date are available when it writes the sources section of new concept articles. A clipped source becomes a proper citation rather than just a filename.
If you don't use Obsidian, anything that lands a .md or .txt file in raw/ works: drag-and-drop, curl, a different clipper, your own scripts.
The Day-to-Day Workflow
Adding knowledge. Clip or drop a file into raw/. Open Claude Code (or run uv run python scripts/compile.py manually). New concept and connection pages appear in wiki/, the index updates, and the log records what happened.
Querying. Open knowledge/personal/ as an Obsidian vault. The graph view shows the connection structure, backlinks show what references a concept, and full-text search finds specific claims. For deeper questions, ask Claude inside the vault. The schema instructs it to read wiki/index.md first, synthesize an answer with citations, and file the result back to the wiki as a new concept or connection page when the answer reveals something not already captured. That last step is what makes the wiki compound: every good question can leave behind a new article. Queries are logged via log_writer.append_query() alongside the pages consulted and any page filed.
Maintenance. The lint report flags what needs attention. Broken links are real errors. Orphan pages and missing backlinks are worth fixing when you have time. Sparse articles are candidates for adding more sources.
Try It Yourself
The design follows Andrej Karpathy's llm-wiki — his gist is the best starting point if you want to build your own version.
Install the tools first:
brew install claude # Claude Code — runs the compile agent
brew install uv # Python runner for the scripts
brew install --cask obsidian # Markdown editor for browsing your wiki
Install the Obsidian Web Clipper from your browser's extension store (Chrome · Firefox · Safari).
Then read the gist. It explains the folder structure to create, the CLAUDE.md schema file that tells the agent how to write articles, and the compile script that ties it together. The schema is the most important part — it's a plain text file you write once, and the agent follows it on every run.