Schema-First Extraction for LLM Wikis with GLiNER2
Learn how schema-first extraction and GLiNER2 turn messy RAG pipelines into a persistent LLM wiki and knowledge graph.

Introduction#
If your RAG stack keeps rediscovering the same entities, relations, and claims every time you query it, schema-first extraction is the missing ingestion layer. You are paying for the same work again and again without building a durable knowledge base.
You drop in documents, retrieve a few chunks, ask the model to synthesize them, and hope the answer is both grounded and complete. It works for lightweight retrieval. It breaks when the real task is accumulation. If the model has to rediscover the same concepts, entities, contradictions, and relationships on every query, you do not have a knowledge system. You have an expensive loop.
That is why the more interesting direction is not better query-time generation. It is schema-first extraction: compile long, messy text once into durable outputs such as entities, relations, classifications, and schema-bound facts, then keep extending those outputs as new sources arrive. This is the missing operational layer between raw documents and a persistent LLM wiki or knowledge graph.
Karpathy's recent LLM wiki direction is valuable because it reframes the goal.1 The target is not a prettier RAG stack. The target is a compounding artifact: a maintained corpus of pages that grows more useful every time new material is processed. The problem is that fully generative maintenance still leaves too much unstructured surface area. If you want a wiki that compounds cleanly, you need a stronger extraction layer underneath it.
Key Definitions#
- A persistent LLM wiki is a file-first knowledge base where pages, links, and structured claims are updated incrementally as new sources are ingested.1
- Schema-first extraction means defining the entities, relations, classifications, and claim fields up front, then extracting into that structure before generating prose.23
- Concept harvesting is the schema-first knowledge extraction layer that turns messy text into durable, reusable records for a wiki or knowledge graph.
- A staged harvester is an ingestion pipeline that combines typed chunking, extraction, canonical resolution, and graph maintenance instead of relying on one end-to-end model pass.
Why RAG-Only Retrieval Fails for LLM Wikis#
Traditional RAG is optimized for answering a question now. A persistent wiki or knowledge graph is optimized for knowing more later.
That difference matters.
In a standard retrieval pipeline, the model keeps doing the same cognitive labor again and again:
- identify the important entities
- infer the relations between them
- classify the passage or claim
- compress the answer into a transient response
Then the work disappears into chat history.
In a persistent wiki pattern, the valuable part is not only the answer. The valuable part is the compiled structure left behind after the answer: pages, links, evidence records, contradiction notes, and graph edges that do not need to be rediscovered next time.1
This is exactly where schema-first, schema-driven information extraction helps. Instead of asking a model to improvise a wiki from prose alone, you first ask a smaller extraction model a narrower question:
What can be harvested from this source in a form that remains useful across future ingests?
That changes the workflow from retrieval to compilation.
If you have already seen how ingestion quality shapes downstream behavior in RAG pipelines, this is the same lesson applied one level up. Blind chunking gives you unstable retrieval. Blind generation gives you unstable knowledge maintenance. The ingestion layer decides the ceiling.4
What Is Concept Harvesting in LLM Knowledge Systems?#
I prefer the term concept harvesting because it is more operational than "knowledge extraction."
You are not trying to squeeze a perfect ontology out of raw text in one shot. You are harvesting the durable parts of a source that should survive beyond the current prompt. In other words, concept harvesting is a schema-first knowledge extraction layer for a persistent LLM wiki:
- entities worth giving their own page
- relations worth turning into links or graph edges
- classifications that affect routing, ranking, or review priority
- structured fields that belong in a canonical record
- evidence snippets that justify why a page or edge exists
"The wiki should not rediscover structure on every question. It should inherit structure from ingestion."
That harvested layer becomes the bridge between documents and the wiki.
For example, imagine ingesting a long research note, PR discussion, or benchmark write-up. A generative LLM can summarize it. But a schema-first pipeline can also answer much more durable questions:
- Which models, libraries, and companies were mentioned?
- Which capability claims were made?
- Which implementation decisions were described?
- Which metrics belong to structured records rather than prose?
- Which relations connect tools, people, systems, and outcomes?
Those are not just summary details. They are the raw material of a maintained wiki and a living graph. They also survive schema evolution better than ad hoc summaries, which is one of the central lessons from schema-adaptable knowledge graph construction work.3
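To make that concrete, here is one minimal shape such harvested output could take. The `HarvestRecord` class and its field names are illustrative, not the schema the actual pipeline uses:

```python
from dataclasses import dataclass, field

@dataclass
class HarvestRecord:
    """Durable output of one ingest, independent of any later prompt."""
    source_id: str                                           # provenance: which document produced this
    entities: list[dict] = field(default_factory=list)       # e.g. {"text": "GLiNER2", "label": "model"}
    relations: list[dict] = field(default_factory=list)      # e.g. {"head": "...", "rel": "implements", "tail": "..."}
    classifications: dict = field(default_factory=dict)      # e.g. {"source_type": "paper"}
    claims: list[dict] = field(default_factory=list)         # claim / evidence / limitations fields
    evidence_spans: list[str] = field(default_factory=list)  # snippets justifying pages and edges

record = HarvestRecord(source_id="arxiv:2507.18546")
record.entities.append({"text": "GLiNER2", "label": "model"})
record.classifications["source_type"] = "paper"
```

The point of a typed record like this is that it can be appended to, diffed, and reviewed across ingests, while a prose summary can only be rewritten.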
What the Practical Implementation Looks Like Today#
The practical story matters more than the slogan.
In my current concept-harvester implementation, the system is already more than "run GLiNER on some text." It is a staged ingestion pipeline that treats chunk type, context, canonicalization, and graph hygiene as first-class concerns.
The current flow looks like this:
1. **Structured chunks come in first.** The upstream chunker emits typed chunks for prose, code, tables, and hierarchy, along with byte offsets, line ranges, breadcrumbs, section paths, AST symbols, comments, and headers.
2. **A context injector adds transient ghost context.** Generic terms like `system`, `model`, `schema`, or `config` get prefixed with a breadcrumb-style context string before extraction so they are less ambiguous at inference time.
3. **Extraction is polymorphic by chunk type.**
   - `text` chunks go through GLiNER on prose
   - `code` chunks use AST symbols directly and run GLiNER only on comments and docstrings
   - `table` chunks run extraction on headers
4. **A resolver canonicalizes the output.** It uses a layered strategy: in-memory cache first, exact Postgres match second, Qdrant similarity third, and new concept creation last.
5. **A graph gardener cleans up the graph later.** Synonyms are compacted, islands are pruned, and supernodes are demoted so the graph does not turn into junk.
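The ghost-context step can be sketched as a small function. The breadcrumb format and the term list here are illustrative stand-ins for the real injector:

```python
AMBIGUOUS_TERMS = {"system", "model", "schema", "config"}  # illustrative list, not the real one

def inject_context(text: str, breadcrumbs: list[str]) -> str:
    """Prefix a chunk with transient breadcrumb context when it contains generic terms."""
    words = {w.strip(".,:;()").lower() for w in text.split()}
    if words & AMBIGUOUS_TERMS:
        # Ghost context exists only at extraction time; it is never stored with the chunk.
        return f"[context: {' > '.join(breadcrumbs)}]\n{text}"
    return text

out = inject_context("The model reads each config file.", ["GLiNER2", "Serving", "Runtime"])
```

Because the prefix is transient, the stored chunk and its byte offsets stay untouched; only the extractor sees the disambiguating context.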
That architecture is materially more honest than claiming a single model "builds the wiki." What actually builds the wiki is the full ingestion loop: typed chunks, context-aware extraction, canonical resolution, and maintenance.
It also exposes the real product surface. The value is not only extraction quality. The value is that the system can keep a graph coherent as more sources land.
```python
# Staged concept-harvesting pipeline (simplified)
for chunk in typed_chunks(document):
    ctx_chunk = inject_context(chunk)
    if chunk.type == "code":
        entities = extract_from_ast(chunk)
    else:
        entities = gliner_extract(ctx_chunk, schema)
    resolved = resolver.canonicalize(entities)  # cache -> DB -> vector -> new
    wiki.update(resolved)
    graph.upsert(resolved)
gardener.run(graph)  # compact synonyms, prune islands, demote supernodes
```

GLiNER2 matters because it can consolidate more of this into a single schema-driven extraction pass, but the pipeline around it still stays the product: chunking, canonicalization, and graph maintenance do not disappear.
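The resolver's layered fallback can be sketched as a single chain of lookups. The lookup callables here are stand-ins for the real in-memory cache, Postgres exact match, and Qdrant similarity search:

```python
def canonicalize(mention: str, cache: dict, db_lookup, vector_lookup, create) -> str:
    """Resolve a raw mention to a canonical concept ID, cheapest strategy first."""
    if mention in cache:                       # 1. in-memory cache hit
        return cache[mention]
    concept_id = db_lookup(mention)            # 2. exact match (Postgres in the real system)
    if concept_id is None:
        concept_id = vector_lookup(mention)    # 3. similarity search (Qdrant in the real system)
    if concept_id is None:
        concept_id = create(mention)           # 4. last resort: mint a new concept
    cache[mention] = concept_id                # warm the cache for future chunks
    return concept_id

cache: dict[str, str] = {}
cid = canonicalize("GLiNER-2", cache,
                   db_lookup=lambda m: None,                # no exact match
                   vector_lookup=lambda m: "concept:gliner2",  # similarity hit
                   create=lambda m: "concept:new")
```

The ordering matters: the cheap strategies absorb most of the volume, so the vector search and concept creation only pay their cost on genuinely novel mentions.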
How GLiNER2 Enables Schema-First Extraction#
GLiNER2 matters because it provides a practical schema-driven extractor instead of a vague promise.25
The original GLiNER line was already useful because it moved entity extraction away from brittle fixed-label NER and toward promptable extraction.6 My current harvester still uses that older GLiNER family plus AST metadata and heuristics. GLiNER2 matters because it points to a cleaner next step: more of the extraction contract can move into one schema-driven runtime instead of being spread across several partial strategies.
GLiNER2 pushes the model layer much further. The model supports a schema interface that can combine:
- entity extraction
- relation extraction
- text classification
- hierarchical structured extraction
- mixed-task composition in one pass
That combination is exactly what a compounding wiki needs.
You do not want one pass for entity extraction, a second pass for labels, a third pass for relations, and a fourth pass for structured fields if all of them describe the same source. You want a single extraction contract that says, in effect:
"Read this text, identify the concepts that matter, classify what kind of source or claim this is, and return any structured fields that belong in the knowledge base."
That is a better fit for wiki maintenance than asking a large generative model to produce free-form prose and hoping the structure can be reconstructed later. It also maps directly to familiar developer tasks: named entity recognition, relation extraction, text classification, and structured JSON-like extraction in one pass.25
The important nuance is this: GLiNER2 does not eliminate the rest of the pipeline. You still need chunking, provenance, canonicalization, graph maintenance, and some strategy for ambiguity. What it improves is the extraction runtime itself. It gives the system a stronger contract for what should come out of each ingest.
A schema can stay small and still be useful. For an LLM wiki pipeline, a representative harvesting schema could look like this:
```python
from gliner2 import GLiNER2Extractor

extractor = GLiNER2Extractor("fastino/gliner2-large-v1")

schema = (
    extractor.create_schema()
    .entities({
        "model": "Model or runtime names",
        "library": "Software libraries or frameworks",
        "person": "People who authored, proposed, or implemented work",
        "concept": "Named concepts or methods worth tracking",
        "metric": "Named metrics or measurements",
    })
    .relations({
        "implements": "A system implements or enables a concept",
        "improves": "A change improves a metric, workflow, or capability",
        "depends_on": "A concept or system depends on another component",
        "compares_with": "A source explicitly compares one approach to another",
    })
    .classification("source_type", ["paper", "blog", "code", "benchmark", "commentary"])
    .classification("evidence_strength", ["weak", "moderate", "strong"])
    .structure("claim_record")
    .field("claim", dtype="str", description="Main factual or technical claim")
    .field("evidence", dtype="list", description="Supporting evidence or observations")
    .field("limitations", dtype="list", description="Stated caveats or constraints")
)

results = extractor.extract(text, schema)
```

The important point is not the exact schema. The point is that the output is already shaped for maintenance. An entity page can be updated. A claim record can be reviewed. A relation can become a graph edge. A classification can decide whether a human should inspect the source before merging it into the wiki.
Designing a Two-Layer LLM Wiki: Human-Readable Pages and a Structured Knowledge Graph#
A lot of "knowledge graph from text" work jumps straight to triples and databases. That is useful, but incomplete.
A practical system usually needs two durable layers:
- a human-readable wiki that an operator can browse, diff, and edit
- a graph-friendly structured layer that supports linking, retrieval, filtering, and downstream automation
Karpathy's file-first wiki pattern gets the first layer right.1 The wiki is inspectable. The pages are local. The artifact is versionable. That matters.
But the wiki becomes much more robust when the second layer exists beside it.
A maintained page about GLiNER2, for example, should not rely only on prose. It should be able to inherit structured facts from prior ingests:
- model family
- extraction capabilities
- schema shapes used in practice
- implementation notes
- cited sources
- related runtimes and tools
Likewise, a concept page for schema-first extraction should not only summarize the idea. It should accumulate linked evidence, examples, counterpoints, and supporting systems over time.
This is where a wiki starts to feel less like note-taking and more like a compiled knowledge surface.
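One way to keep the two layers side by side is to write each page as a Markdown file with a sibling JSON record. The layout below is a minimal sketch of that idea, not a prescribed format:

```python
import json
from pathlib import Path

def write_page(root: Path, slug: str, prose: str, facts: dict) -> None:
    """Persist the human-readable layer and the structured layer together."""
    page_dir = root / slug
    page_dir.mkdir(parents=True, exist_ok=True)
    # Layer 1: browsable, diffable, editable by an operator
    (page_dir / "page.md").write_text(f"# {slug}\n\n{prose}\n")
    # Layer 2: machine-usable records that future ingests can extend
    (page_dir / "facts.json").write_text(json.dumps(facts, indent=2))

write_page(Path("wiki"), "gliner2",
           "Schema-driven extractor for entities, relations, and structured fields.",
           {"model_family": "GLiNER",
            "capabilities": ["entities", "relations", "classification"]})
```

Because both files live in the same directory, version control sees every ingest as one reviewable diff across prose and structure at once.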
Why Graph Canonicalization Separates Real LLM Wikis from Demos#
A lot of concept-extraction demos stop too early.
They show that a model can pull out entities or triples from a page. Then they stop before the harder part begins:
- merging aliases and synonyms
- preventing graph explosion from generic concepts
- weighting edges by importance or position
- pruning low-value islands
- handling schema drift as the corpus evolves
That is exactly why the resolver and gardener layers matter so much in a practical system. Extraction without canonicalization gives you noise. Canonicalization without maintenance gives you slow graph rot. A credible LLM wiki needs both.
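As a toy illustration of those maintenance passes, here is a gardener that merges aliases, prunes isolated nodes, and flags supernodes. Adjacency is kept in plain dicts and the degree threshold is arbitrary:

```python
def garden(graph: dict[str, set[str]],
           aliases: dict[str, str],
           max_degree: int = 50) -> tuple[dict[str, set[str]], set[str]]:
    """Toy maintenance pass: compact synonyms, prune islands, flag supernodes."""
    merged: dict[str, set[str]] = {}
    for node, edges in graph.items():
        canon = aliases.get(node, node)                 # compact synonyms onto one canonical node
        merged.setdefault(canon, set()).update(aliases.get(e, e) for e in edges)
    for node in merged:
        merged[node].discard(node)                      # drop self-loops created by merging
    supernodes = {n for n, e in merged.items()
                  if len(e) > max_degree}               # generic hubs flagged for demotion/review
    merged = {n: e for n, e in merged.items() if e}     # prune zero-degree islands
    return merged, supernodes

graph = {"gliner2": {"vllm"}, "GLiNER-2": {"postgres"},
         "vllm": {"gliner2"}, "postgres": {"GLiNER-2"}, "orphan": set()}
clean, flagged = garden(graph, aliases={"GLiNER-2": "gliner2"})
```

The real gardener does more (edge weighting, scheduled runs), but the shape is the same: a pass over the whole graph that enforces invariants extraction alone cannot.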
This is also why I do not think the product is "a model that reads documents." The product is the maintenance loop that keeps the graph and wiki usable after the tenth ingest, the hundredth ingest, and the thousandth ingest.
Running Schema-First Extraction in Production with vLLM Pooling#
A blog post about concept harvesting without a serving path is just architecture fan fiction.
The practical reason this direction became interesting to me is that encoder-first extraction now has a credible runtime path. vLLM's pooling stack supports encoder workloads through pooling tasks such as classify, token_classify, and plugin, exposed both in offline APIs and online endpoints including /pooling.78
That matters because a persistent wiki is not built by a one-off notebook. It is built by repeated ingestion jobs:
- process new sources
- extract schema-bound outputs
- update pages and graph records
- re-run checks when schemas evolve
That workload needs a serving contract, not a demo script.
In my own work on the GLiNER2 path inside vllm-factory, the useful part was not merely "we got the model to run." The useful part was that schema-based extraction started to look like a production primitive instead of a lab curiosity. Once the request path is stable, mixed extraction workloads become cheap enough and clean enough to sit behind a real ingestion loop.9
The strongest evidence from the follow-up L4 run was not abstract. On the heaviest cached workload, the optimized path hit 7,692 request tokens/sec at 893 ms mean latency while costing roughly $0.03 per 1M request tokens on a single Modal L4 priced at $0.80/h.10 That is the threshold where the argument changes: a relatively small encoder model stops looking like a toy extractor and starts looking like a practical ingestion primitive for a compounding knowledge base.
And that changes the economics of the whole pattern. If the extractor is small, fast, and cheap enough to run repeatedly, then concept harvesting can happen continuously instead of only when someone is willing to pay the generative-model tax.
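The per-token cost claim is easy to sanity-check from the throughput and instance price alone:

```python
tokens_per_sec = 7_692     # measured request tokens/sec on the cached L4 workload
price_per_hour = 0.80      # cited Modal L4 price, USD/h

tokens_per_hour = tokens_per_sec * 3_600
cost_per_million = price_per_hour / tokens_per_hour * 1_000_000
print(f"${cost_per_million:.3f} per 1M request tokens")  # roughly $0.029
```

At roughly three cents per million request tokens, re-harvesting an entire corpus after a schema change stops being a budgeting decision and becomes a routine maintenance job.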
To be clear: the current system already runs with GLiNER, AST metadata, context injection, resolution, and graph maintenance. The GLiNER2 + vLLM path is the next-step upgrade, not the baseline.
A Step-by-Step Ingestion Loop for a Persistent LLM Wiki#
If I were building this stack from scratch, I would keep the loop disciplined:
1. Curate raw sources#
Keep the original document, article, transcript, or notebook immutable.
2. Run schema-first harvesting#
Extract entities, relations, classifications, and claim records from the raw source using a narrow schema.
3. Normalize before writing prose#
Resolve aliases, deduplicate entities, score evidence strength, and flag contradictions before updating the wiki.
4. Update the wiki and graph together#
Write or revise human-readable pages, but also append machine-usable records that preserve structured outputs.
5. Re-query the compiled layer first#
When answering later questions, search the maintained wiki and graph before falling back to raw documents.
That last step is the real payoff. You are no longer asking the model to rediscover everything from source text. You are asking it to reason over a knowledge surface that has already been harvested and organized.
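The five steps above can be sketched as a single ingestion function. Every name here (`harvest`, `normalize`, `wiki`, `graph`) is a placeholder for the corresponding subsystem, not a real API:

```python
def ingest(source_text: str, source_id: str, schema, harvest, normalize, wiki, graph) -> dict:
    """One disciplined pass: raw source in, maintained pages and records out."""
    raw = {"id": source_id, "text": source_text}   # 1. keep the original source immutable
    extracted = harvest(source_text, schema)       # 2. schema-first harvesting
    records = normalize(extracted)                 # 3. resolve aliases, dedupe, score evidence
    wiki.update(source_id, records)                # 4a. revise human-readable pages
    graph.upsert(source_id, records)               # 4b. append machine-usable graph records
    # 5. later queries hit wiki/graph first and fall back to `raw` only when needed
    return raw
```

Keeping the loop this small is deliberate: each stage can be swapped (a new extractor, a new resolver) without changing the contract the wiki depends on.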
For builders still operating in standard RAG mode, this is the same mindset shift we already apply to chunking and runtime isolation. Better systems emerge when you stop trusting the last stage to repair mistakes made upstream. If your ingestion is weak, your retrieval is weak. If your extraction is vague, your wiki will drift. That is as true for knowledge systems as it is for agent runtimes.411
What This Still Does Not Solve#
Knowing where schema-first extraction breaks down matters as much as knowing where it works. Here is an honest account of what the current system does not yet solve cleanly.
- The ontology still needs hand-tuned labels and thresholds.
- Context injection and disambiguation are useful heuristics, but they are still heuristics.
- Canonicalization can merge synonyms well enough to be useful, but it is not the same as deep semantic truth maintenance.
- Contradiction handling is still more of an architectural requirement than a finished subsystem.
- A human-readable wiki layer and a graph layer can stay synchronized only if provenance and update rules remain disciplined.
That is why I frame this as a practical path, not a finished end state. The implementation is already enough to prove the direction. It is not yet the full operating system for knowledge maintenance.
How Schema-First Extraction Compares to Prompt-Only Knowledge Graph Extraction#
| Approach | Runtime cost | Wiki layer | Graph quality | Scales with corpus? |
|---|---|---|---|---|
| Prompt-only LLM extraction | High | Prose only | Fragile, prone to format drift | Poorly |
| Graph-first triplet extraction | Medium | None | Strong edges, weak page layer | Partially |
| Generative wiki with no extraction layer | Medium | Yes | Weak structure | No |
| Schema-first extraction plus maintained pages | Low-to-medium | Yes | Stronger structure with provenance | Yes |
This is why I keep pushing on concept harvesting instead of pure triplet extraction. A persistent LLM wiki needs more than graph edges. It needs entity pages, claim records, evidence, classifications, and a readable artifact that humans can inspect. The graph should support the wiki, not replace it.
Common Pitfalls in Schema-First LLM Wikis#
The most common failure modes are predictable:
- Trying to model everything at once. Start with the fields that actually change downstream behavior.
- Treating extraction output as truth. It is evidence, not doctrine. Keep provenance and review paths.
- Letting prose become the only durable output. Human-readable pages are necessary, but not sufficient.
- Confusing graph construction with value creation. A graph full of low-signal edges is worse than a small graph with strong evidence.
- Using a generative model for routine structure. Save the larger model for synthesis, comparison, and editorial judgment.
The strongest pattern is division of labor: use a schema-first extractor for harvesting, then use a generative model to synthesize, question, and refine the maintained artifact.
Tradeoffs in Schema-First LLM Wiki Maintenance#
This approach is not free.
You have to define schemas. You have to maintain normalization logic. You have to accept that some concepts resist crisp structuring and still need prose. You also have to decide when a fact belongs in a graph, when it belongs in a page, and when it belongs nowhere until a human reviews it.
But those are healthy costs.
They replace a worse cost: pretending an end-to-end generative system will stay coherent as the corpus grows. It usually will not. Without structured harvesting, the wiki accumulates text faster than it accumulates clarity.
Schema-first extraction does not remove the need for generative models. It gives them better material to work with.
FAQ: How Is an LLM Wiki Different From RAG?#
An LLM wiki and RAG differ mainly in when structure is created. RAG retrieves raw chunks at query time and asks the model to synthesize an answer on demand, while a persistent LLM wiki compiles knowledge during ingestion so later questions can be answered from maintained pages and structured records instead of rediscovering everything from source text.1
In practice, that means a wiki can accumulate entity pages, linked claims, and contradiction notes over time. RAG can still sit underneath the system, but it should not be the only memory layer.
FAQ: When Should I Use GLiNER2 Instead of a Full LLM?#
Use GLiNER2 when the task can be defined by a schema: entity types, relation labels, classification labels, or structured fields. That is the right fit for routine structure work such as entity extraction, relation extraction, classification, or schema-driven field filling.
Use a larger generative model later for synthesis, editorial judgment, and cross-source reasoning. The extractor should harvest the durable structure first; the larger model should reason over that harvested layer.
FAQ: Do I Need a Graph Database to Start a Schema-First LLM Wiki?#
No. A plain file tree of Markdown pages plus structured JSON records is enough to get started with a schema-first LLM wiki.
A graph database becomes useful later when you need advanced traversal, ranking, or graph-native querying across a larger corpus. Start with the simplest durable artifact that keeps knowledge compounding cleanly.
Bringing Schema-First Extraction into Your LLM Wiki Stack#
The useful question is no longer "Can an LLM maintain a wiki?" It clearly can.
The useful question is what kind of ingestion layer lets that wiki compound without collapsing into soft memory and citation drift.
My answer is schema-first extraction. Harvest concepts once. Persist the durable outputs. Let the wiki inherit structure from ingestion. Then let larger models do what they are actually good at: synthesis, explanation, and judgment over a knowledge base that keeps getting stronger.
If you are already working on ingestion quality, RAG chunking visualizer in Rust (WebAssembly) is the adjacent problem worth fixing next. If you care about runtime discipline behind long-lived agent systems, Governed Code Mode for zero-trust tool execution is the right companion read. And if your end goal is a self-hosted operator stack rather than a hosted knowledge pipeline, Self-Hosted AI Agent on Oracle Cloud covers the operational side.
References#
Footnotes#
1. Karpathy, A. "LLM Wiki." GitHub Gist. https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f
2. Zaratiana, U. et al. "GLiNER2: An Efficient Multi-Task Information Extraction System with Schema-Driven Interface." arXiv. https://arxiv.org/abs/2507.18546
3. Ye, H. et al. "Schema-adaptable Knowledge Graph Construction." Findings of EMNLP 2023. https://aclanthology.org/2023.findings-emnlp.425/
4. VeriStamp Journal. "RAG Chunking Visualizer in Rust (WebAssembly)." https://veristamp.com/blog/chunkerlite-rag-chunk-visualizer
5. Fastino AI. "GLiNER2." GitHub. https://github.com/fastino-ai/GLiNER2
6. Zaratiana, U. et al. "GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer." arXiv. https://arxiv.org/abs/2311.08526
7. vLLM Project. "Pooling Models." vLLM Docs. https://docs.vllm.ai/en/stable/models/pooling_models/
8. vLLM Project. "Classification Usages." vLLM Docs. https://docs.vllm.ai/en/latest/models/pooling_models/classify/
9. Dickmann, D. "vllm-factory." GitHub. https://github.com/ddickmann/vllm-factory
10. Danguria, S. "Adds request-side caching to the deberta_gliner2." vllm-factory pull request #6. https://github.com/ddickmann/vllm-factory/pull/6
11. VeriStamp Journal. "Governed Code Mode: From Tool-Calling to Zero-Trust Execution." https://veristamp.com/blog/governed-code-mode