---
tags:
  - Writing
  - table-wrap
authors:
  - name: GPT-4.5
    url: https://chatgpt.com
    affiliation:
      - name: OpenAI
        url: https://openai.com
  - name: Nicole Dresselhaus
    affiliation:
      - name: Humboldt-Universität zu Berlin
        url: https://hu-berlin.de
    orcid: 0009-0008-8850-3679
date: 2025-05-05
categories:
  - Article
  - Case-study
  - ML
  - NER
lang: en
citation: true
fileClass: authored
title: "Case Study: Local LLM-Based NER with n8n and Ollama"
abstract: |
  Named Entity Recognition (NER) is a foundational task in text analysis,
  traditionally addressed by training NLP models on annotated data. However, a
  recent study – _“NER4All or Context is All You Need”_ – showed that
  out-of-the-box Large Language Models (LLMs) can **significantly outperform**
  classical NER pipelines (e.g. spaCy, Flair) on historical texts by using clever
  prompting, without any model retraining. This case study demonstrates how to
  implement the paper’s method using entirely local infrastructure: an **n8n**
  automation workflow (for orchestration) and an **Ollama** server running a
  14B-parameter LLM on an NVIDIA A100 GPU. The goal is to enable research
  engineers and tech-savvy historians to **reproduce and apply this method
  easily** on their own data, with a focus on usability and correct outputs rather
  than raw performance.

  We will walk through the end-to-end solution – from accepting a webhook input
  that defines entity types (e.g. Person, Organization, Location) to prompting a
  local LLM to extract those entities from a text. The solution covers setup
  instructions, required infrastructure (GPU, memory, software), model
  configuration, and workflow design in n8n. We also discuss potential limitations
  (like model accuracy and context length) and how to address them. By the end,
  you will have a clear blueprint for a **self-hosted NER pipeline** that
  leverages the knowledge encoded in LLMs (as advocated by the paper) while
  maintaining data privacy and reproducibility.
bibliography:
  - ner4all-case-study.bib
citation-style: springer-humanities-brackets
nocite: |
  @*
image: ../thumbs/writing_ner4all-case-study.png
---

## Background: LLM-Based NER Method Overview

The referenced study introduced a prompt-driven approach to NER, reframing it
“from a purely linguistic task into a humanities-focused task”. Instead of
training a specialized NER model for each corpus, the method leverages the fact
that large pretrained LLMs already contain vast world knowledge and language
understanding. The key idea is to **provide the model with contextual
definitions and instructions** so it can recognize entities in context. Notably,
the authors found that with proper prompts, a commercial LLM (ChatGPT-4) could
achieve **precision and recall on par with or better than** state-of-the-art NER
tools on a 1921 historical travel guide. This was achieved **zero-shot**, i.e.
without any fine-tuning or additional training data beyond the prompt itself.

**Prompt Strategy:** The success of this approach hinges on careful prompt
engineering. The final prompt used in the paper had multiple components:

- **Persona & Context:** A brief introduction framing the LLM as an _expert_
  reading a historical text, possibly including domain context (e.g. “This text
  is an early 20th-century travel guide; language is old-fashioned”). This
  primes the model with relevant background.
- **Task Instructions:** A clear description of the NER task, including the list
  of entity categories and how to mark them in text. For example: _“Identify all
  Person (PER), Location (LOC), and Organization (ORG) names in the text and
  mark each by enclosing it in tags.”_
- **Optional Examples:** A few examples of sentences with correct tagged output
  (few-shot learning) to guide the model. Interestingly, the study found that
  zero-shot prompting often **outperformed few-shot** until \~16 examples were
  provided. Given the cost of preparing examples and limited prompt length, our
  implementation will focus on zero-shot usage for simplicity.
- **Reiteration & Emphasis:** The prompt repeated key instructions in different
  words and emphasized compliance (e.g. _“Make sure you follow the tagging
  format exactly for every example.”_). This redundancy helps the model adhere
  to instructions.
- **Prompt Engineering Tricks:** They included creative cues to improve
  accuracy, such as offering a “monetary reward for each correct classification”
  and the phrase _“Take a deep breath and think step by step.”_. These tricks,
  drawn from prior work, encouraged the model to be thorough and careful.
- **Output Format:** Crucially, the model was asked to **repeat the original
  text exactly** but insert tags around entity mentions. The authors settled on
  a format like `<<PER ... /PER>>` to tag people, `<<LOC ... /LOC>>` for
  locations, etc., covering each full entity span. This inline tagging format
  leveraged the model’s familiarity with XML/HTML syntax (from its training
  data) and largely eliminated problems like unclosed tags or extra spaces. By
  instructing the model _not to alter any other text_, they ensured the output
  could be easily compared to the input and parsed for entities.

**Why Local LLMs?** The original experiments used a proprietary API (ChatGPT-4).
To make the method accessible to all (and avoid data governance issues of cloud
APIs), we implement it with **open-source LLMs running locally**. Recent openly
licensed models are rapidly improving and can handle such extraction tasks given
the right prompt. Running everything locally also aligns with the paper’s goal
of “democratizing access” to NER for diverse, low-resource texts – there are no
API costs or internet needed, and data stays on local hardware for privacy.

## Solution Architecture

Our solution consists of a **workflow in n8n** that orchestrates the NER
process, and a **local Ollama server** that hosts the LLM for text analysis. The
high-level workflow is as follows:

1. **Webhook Trigger (n8n):** A user initiates the process by sending an HTTP
   request to n8n’s webhook with two inputs: (a) a simple text defining the
   entity categories of interest (for example, `"PER, ORG, LOC"`), and (b) the
   text to analyze (either included in the request or accessible via a provided
   file URL). This trigger node captures the input and starts the automation.
2. **Prompt Construction (n8n):** The workflow builds a structured prompt for
   the LLM. Based on the webhook input, it prepares the system instructions
   listing each entity type and guidelines, then appends the user’s text.
   Essentially, n8n will merge the _entity definitions_ into a pre-defined
   prompt template (the one derived from the paper’s method). This can be done
   using a **Function node** or an **LLM Prompt node** in n8n to ensure the text
   and instructions are combined correctly.
3. **LLM Inference (Ollama + LLM):** n8n then passes the prompt to an **Ollama
   Chat Model node**, which communicates with the Ollama server’s API. The
   Ollama daemon hosts the selected 14B model on the local GPU and returns the
   model’s completion. In our case, the completion will be the original text
   with NER tags inserted around the entities (e.g.
   `<<PER John Doe /PER>> went to <<LOC Berlin /LOC>> ...`). This step harnesses
   the A100 GPU to generate results quickly, using the chosen model’s weights
   locally.
4. **Output Processing (n8n):** The tagged text output from the LLM can be
   handled in two ways. The simplest is to **return the tagged text directly**
   as the response to the webhook call – allowing the user to see their original
   text with all entities highlighted by tags. Alternatively, n8n can
   post-process the tags to extract a structured list of entities (e.g. a JSON
   array of `{"entity": "John Doe", "type": "PER"}`{.json} objects). This
   parsing can be done with a Regex or code node, but given our focus on
   correctness, we often trust the model’s tagging format to be consistent (the
   paper reported the format was reliably followed when instructed clearly).
   Finally, an **HTTP Response** node sends the results back to the user (or
   stores them), completing the workflow.

**Workflow Structure:** In n8n’s interface, the workflow might look like a
sequence of connected nodes: **Webhook → Function (build prompt) → AI Model
(Ollama) → Webhook Response**. If using n8n’s new AI Agent feature, some steps
(like prompt templating) can be configured within the AI nodes themselves. The
key is that the Ollama model node is configured to use the local server (usually
at `http://127.0.0.1:11434` by default) and the specific model name. We assume
the base pipeline (available on GitHub) already includes most of this structure
– our task is to **slot in the custom prompt and model configuration** for the
NER use case.

## Setup and Infrastructure Requirements

To reproduce this solution, you will need a machine with an **NVIDIA GPU** and
the following software components installed:

- **n8n (v1.x or later)** – the workflow automation tool. You can install
  n8n via npm, Docker, or use the desktop app. For a server environment, Docker
  is convenient. For example, to run n8n with Docker:

  ```bash
  docker run -it --rm \
    -p 5678:5678 \
    -v ~/.n8n:/home/node/.n8n \
    n8nio/n8n:latest
  ```

  This exposes n8n on `http://localhost:5678` for the web interface. (If you use
  Docker and plan to connect to a host-running Ollama, start the container with
  `--network=host` to allow access to the Ollama API on localhost.)

- **Ollama (v0.x)** – an LLM runtime that serves models via an HTTP API.
  Installing Ollama is straightforward: download the installer for your OS from
  the official site (Linux users can run the one-line script
  `curl -sSL https://ollama.com/install.sh | sh`). After installation, start the
  Ollama server (daemon) by running:

  ```bash
  ollama serve
  ```

  This will launch the service listening on port 11434. You can verify it’s
  running by opening `http://localhost:11434` in a browser – it should respond
  with “Ollama is running”. _Note:_ Ensure your system has recent NVIDIA drivers
  and CUDA support if using GPU. Ollama supports NVIDIA GPUs with compute
  capability ≥5.0 (the A100 is well above this). Use `nvidia-smi` to confirm
  your GPU is recognized. If everything is set up, Ollama will automatically use
  the GPU for model inference (falling back to CPU if none available). A small
  Python check against the Ollama API is sketched right after this list.

- **LLM Model (14B class):** Finally, download at least one large language model
  to use for NER. You have a few options here, and you can “pull” them via
  Ollama’s CLI:

  - _DeepSeek-R1 14B:_ A 14.8B-parameter model distilled from larger reasoning
    models (based on Qwen architecture). It’s optimized for reasoning tasks and
    compares to OpenAI’s models in quality. Pull it with:

    ```bash
    ollama pull deepseek-r1:14b
    ```

    This downloads \~9 GB of data (the quantized weights). If you have a very
    strong GPU (e.g. A100 80GB), you could even try `deepseek-r1:70b` (\~43 GB),
    but 14B is a good balance for our use-case. DeepSeek-R1 is licensed MIT and
    designed to run locally with no restrictions.

  - _Cogito 14B:_ A 14B “hybrid reasoning” model by Deep Cogito, known for
    excellent instruction-following and multilingual capability. Pull it with:

    ```bash
    ollama pull cogito:14b
    ```

    Cogito-14B is also \~9 GB (quantized) and supports an extended context
    window up to **128k tokens** – which is extremely useful if you plan to
    analyze very long documents without chunking. It’s trained in 30+ languages
    and tuned to follow complex instructions, which can help in structured
    output tasks like ours.

  - _Others:_ Ollama offers many models (LLaMA 2 variants, Mistral, etc.). For
    instance, `ollama pull llama2:13b` would get a LLaMA-2 13B model. These can
    work, but for best results in NER with no fine-tuning, we suggest using one
    of the above well-instructed models. If your hardware is limited, you could
    try a 7-8B model (e.g., `deepseek-r1:7b` or `cogito:8b`), which download
    faster and use \~4–5 GB VRAM, at the cost of some accuracy. In CPU-only
    scenarios, even a 1.5B model is available – it will run very slowly and
    likely miss more entities, but it proves the pipeline can work on minimal
    hardware.
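
To confirm the server and a pulled model are actually reachable before wiring up
n8n, a quick request against Ollama’s REST API helps. The following is a minimal
Python sketch using only the standard library; adjust the model name to whatever
you pulled.

```python
import json
import urllib.request

OLLAMA = "http://127.0.0.1:11434"  # default Ollama address

# List the models the server currently has installed.
with urllib.request.urlopen(f"{OLLAMA}/api/tags") as resp:
    print("Models:", [m["name"] for m in json.load(resp)["models"]])

# Run a tiny non-streaming generation to confirm the model loads (onto the GPU, if available).
payload = json.dumps({
    "model": "cogito:14b",  # or "deepseek-r1:14b"
    "prompt": "Reply with the single word: ready",
    "stream": False,
}).encode()
req = urllib.request.Request(f"{OLLAMA}/api/generate", data=payload,
                             headers={"Content-Type": "application/json"})
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["response"])
```

The first request should list your pulled models; the second should answer
within a few seconds once the model has been loaded into VRAM.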

**Hardware Requirements:** Our case assumes an NVIDIA A100 GPU (40 GB), which
comfortably hosts a 14B model in memory and accelerates inference. In practice,
any modern GPU with ≥10 GB memory can run a 13–14B model in 4-bit quantization.
For example, an RTX 3090 or 4090 (24 GB) could handle it, and even smaller GPUs
(or Apple Silicon with 16+ GB RAM) can run 7B models. Ensure you have sufficient
**system RAM** as well (at least as much as the model size, plus overhead for
n8n – 16 GB RAM is a safe minimum for 14B). Disk space of \~10 GB per model is
needed. If using Docker for n8n, allocate CPU and memory generously to avoid
bottlenecks when the LLM node processes large text.

## Building the n8n Workflow

With the environment ready, we now construct the n8n workflow that ties
everything together. We outline each component with instructions:

### 1. Webhook Input for Entities and Text

Start by creating a **Webhook trigger** node in n8n. This will provide a URL
(endpoint) that you can send a request to. Configure it to accept a POST request
containing the necessary inputs. For example, we expect the request JSON to look
like:

```json
{
  "entities": "PER, ORG, LOC",
  "text": "John Doe visited Berlin in 1921 and met with the Board of Acme Corp."
}
```

Here, `"entities"` is a simple comma-separated string of entity types (you could
also accept an array or a more detailed schema; for simplicity we use the format
used in the paper: PER for person, LOC for location, ORG for organization). The
`"text"` field contains the content to analyze. In a real scenario, the text
could be much longer or might be sent as a file. If it's a file, one approach is
to send it as form-data and use n8n’s **Read Binary File** + **Move Binary
Data** nodes to get it into text form. Alternatively, send a URL in the JSON and
use an HTTP Request node in the workflow to fetch the content. The key is that
by the end of this step, we have the raw text and the list of entity labels
available in the n8n workflow as variables.
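
For a quick test of this step you can call the webhook from any HTTP client. The
sketch below uses Python’s standard library; the URL path (`/webhook/ner`) is a
placeholder and must match whatever path you configure in your Webhook node.

```python
import json
import urllib.request

# Placeholder URL – replace the path with the one shown in your n8n Webhook node.
N8N_WEBHOOK = "http://localhost:5678/webhook/ner"

payload = json.dumps({
    "entities": "PER, ORG, LOC",
    "text": "John Doe visited Berlin in 1921 and met with the Board of Acme Corp.",
}).encode()

req = urllib.request.Request(N8N_WEBHOOK, data=payload,
                             headers={"Content-Type": "application/json"})
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode())  # tagged text (or JSON) returned by the workflow
```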

### 2. Constructing the LLM Prompt

Next, add a node to build the prompt that will be fed to the LLM. You can use a
**Function** node (JavaScript code) or the **“Set” node** to template a prompt
string. We will create two pieces of prompt content: a **system instruction**
(the role played by the system prompt in chat models) and the **user message**
(which will contain the text to be processed).

According to the method, our **system prompt** should incorporate the following:

- **Persona/Context:** e.g. _“You are a historian and archivist analyzing a
  historical document. The language may be old or have archaic spellings. You
  have extensive knowledge of people, places, and organizations relevant to the
  context.”_ This establishes domain expertise in the model.
- **Task Definition:** e.g. _“Your task is to perform Named Entity Recognition.
  Identify all occurrences of the specified entity types in the given text and
  annotate them with the corresponding tags.”_
- **Entity Definitions:** List the entity categories provided by the user, with
  a brief definition if needed. For example: _“The entity types are: PER
  (persons or fictional characters), ORG (organizations, companies,
  institutions), LOC (locations such as cities, countries, landmarks).”_ If the
  user already provided definitions in the webhook, include those; otherwise a
  generic definition as shown is fine.
- **Tagging Instructions:** Clearly explain the tagging format. We adopt the
  format from the paper: each entity should be wrapped in `<<TYPE ... /TYPE>>`.
  So instruct: _“Enclose each entity in double angle brackets with its type
  label. For example: <\<PER John Doe /PER>> for a person named John Doe. Do not
  alter any other text – only insert tags. Ensure every opening tag has a
  closing tag.”_ Also mention that tags can nest or overlap if necessary (though
  that’s rare).
- **Output Expectations:** Emphasize that the output should be the **exact
  original text, verbatim, with tags added** and nothing else. For example:
  _“Repeat the input text exactly, adding the tags around the entities. Do not
  add explanations or remove any content. The output should look like the
  original text with markup.”_ This is crucial to prevent the model from
  omitting or rephrasing text. The paper’s prompt literally had a line: “Repeat
  the given text exactly. Be very careful to ensure that nothing is added or
  removed apart from the annotations.”
- **Compliance & Thoughtfulness:** We can borrow the trick of telling the model
  to take its time and be precise. For instance: _“Before answering, take a deep
  breath and think step by step. Make sure you find **all** entities. You will
  be rewarded for each correct tag.”_ While the notion of reward is
  hypothetical, such phrasing has been observed to sharpen the model’s focus.
  This is optional but can be useful for complex texts.

Once this system prompt is assembled as a single string, it will be sent as the
system role content to the LLM. Now, for the **user prompt**, we simply supply
the text to be analyzed. In many chat-based LLMs, the user message would contain
the text on which the assistant should perform the task. We might prefix it with
something like “Text to analyze:\n” for clarity, or just include the raw text.
(Including a prefix is slightly safer to distinguish it from any instructions,
but since the system prompt already set the task, the user message can be just
the document text.)
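
In n8n this assembly is typically done in a Function node (JavaScript), but the
logic is plain string templating. The sketch below expresses the same steps in
Python so the structure is easy to see; the exact wording is our paraphrase of
the prompt components above, not a verbatim quote from the paper.

```python
def build_system_prompt(entities: str) -> str:
    """Assemble the system prompt from a comma-separated entity string, e.g. "PER, ORG, LOC"."""
    labels = [e.strip() for e in entities.split(",") if e.strip()]
    tag_examples = ", ".join(f"<<{label} ... /{label}>>" for label in labels)
    return (
        "You are a historian and archivist analyzing a historical document. "
        "The language may be old or have archaic spellings.\n\n"
        "Your task is to perform Named Entity Recognition. Identify all occurrences "
        f"of the following entity types in the text: {', '.join(labels)}.\n"
        f"Enclose each entity in double angle brackets with its type label, e.g. {tag_examples}.\n"
        "Repeat the given text exactly; do not add or remove anything apart from the tags, "
        "and make sure every opening tag has a closing tag.\n"
        "Take a deep breath and think step by step. Make sure you find all entities."
    )


def build_user_prompt(text: str) -> str:
    """The user message is simply the document, optionally with a short prefix."""
    return "Text to analyze:\n" + text
```

The output of `build_system_prompt("PER, ORG, LOC")` is what the workflow maps
into the system role; `build_user_prompt(...)` becomes the user message.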

In n8n, if using the **Basic LLM Chain** node, you can configure it to use a
custom system prompt. For example, connect the Function/Set node output into the
LLM node, and in the LLM node’s settings choose “Mode: Complete” or similar,
then under **System Instructions** put an expression that references the
constructed prompt text (e.g., `{{ $json["prompt"] }}` if the prompt was output
to that field). The **User Message** can similarly be fed from the input text
field (e.g., `{{ $json["text"] }}`). Essentially, we map our crafted instruction
into the system role, and the actual content into the user role.

### 3. Configuring the Local LLM (Ollama Model Node)

Now configure the LLM node to use the **Ollama** backend and your downloaded
model. n8n provides an “Ollama Chat Model” integration, which is a sub-node of
the AI Agent system. In the n8n editor, add or open the LLM node (if using the
AI Agent, this might be inside a larger agent node), and look for model
selection. Select **Ollama** as the provider. You’ll need to set up a credential
for Ollama API access – use `http://127.0.0.1:11434` as the host (instead of the
default localhost, to avoid any IPv6 binding issues). No API key is needed since
it’s local. Once connected, you should see a dropdown of available models (all
the ones you pulled). Choose the 14B model you downloaded, e.g.
`deepseek-r1:14b` or `cogito:14b`.

Double-check the **parameters** for generation. By default, Ollama models have
their own preset for max tokens and temperature. For an extraction task, we want
the model to stay **focused and deterministic**. It’s wise to set a relatively
low temperature (e.g. 0.2) to reduce randomness, and a high max tokens so it can
output the entire text with tags (set max tokens to at least the length of your
input in tokens plus 10-20% for tags). If using Cogito with its 128k context,
you can safely feed very long text; with other models (often \~4k context),
ensure your text isn’t longer than the model’s context limit or use a model
variant with extended context. If the model supports **“tools” or functions**,
you won’t need those here – this is a single-shot prompt, not a multi-step agent
requiring tool usage, so just the chat completion mode is sufficient.
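
If you want to sanity-check the same generation settings outside n8n, an
equivalent request against Ollama’s chat endpoint looks roughly like the sketch
below. The option names (`temperature`, `num_ctx`, `num_predict`) are Ollama’s;
the values are illustrative, not the only reasonable choices.

```python
import json
import urllib.request

system_prompt = "You are a historian ..."  # the system prompt assembled in step 2
text = "John Doe visited Berlin in 1921 and met with the Board of Acme Corp."

payload = json.dumps({
    "model": "cogito:14b",
    "messages": [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "Text to analyze:\n" + text},
    ],
    "stream": False,
    "options": {
        "temperature": 0.2,  # keep the extraction near-deterministic
        "num_ctx": 8192,     # context window; raise it for long inputs if the model allows
        "num_predict": -1,   # no hard cap, so the full tagged text can be reproduced
    },
}).encode()

req = urllib.request.Request("http://127.0.0.1:11434/api/chat", data=payload,
                             headers={"Content-Type": "application/json"})
with urllib.request.urlopen(req) as resp:
    tagged_text = json.load(resp)["message"]["content"]
print(tagged_text)
```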

At this point, when the workflow runs to this node, n8n will send the system and
user messages to Ollama and wait for the response. The heavy lifting is done by
the LLM on the GPU, which will generate the tagged text. On an A100, a 14B model
can process a few thousand tokens of input and output in just a handful of
seconds (exact time depends on the model and input size).

### 4. Returning the Results

After the LLM node, add a node to handle the output. If you want to present the
**tagged text** directly, you can pass the LLM’s output to the final Webhook
Response node (or if using the built-in n8n chat UI, you would see the answer in
the chat). The tagged text will look something like:

```plain
<<PER John Doe /PER>> visited <<LOC Berlin /LOC>> in 1921 and met with the Board
of <<ORG Acme Corp /ORG>>.
```

This format highlights each identified entity. It is immediately human-readable
with the tags, and trivial to post-process if needed. For example, one could use
a regex like `<<(\w+) (.*?) /\1>>` to extract all `type` and `entity` pairs from
the text. In n8n, a quick approach is to use a **Function** node to find all
matches of that pattern in `item.json["data"]` (assuming the LLM output is in
`data`). Then one could return a JSON array of entities. However, since our
focus is on correctness and ease, you might simply return the marked-up text and
perhaps document how to parse it externally if the user wants structured data.
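
As a minimal sketch of that external post-processing, the following Python
snippet applies the pattern above and emits the structured JSON form described
earlier; the same regex works inside an n8n Function node in JavaScript.

```python
import json
import re

# Pattern from above: <<TYPE entity /TYPE>>, where \1 requires the closing label to match.
TAG_RE = re.compile(r"<<(\w+) (.*?) /\1>>")

def extract_entities(tagged_text: str) -> list[dict]:
    """Collect all tagged spans as {"entity": ..., "type": ...} records."""
    return [{"entity": m.group(2), "type": m.group(1)}
            for m in TAG_RE.finditer(tagged_text)]

tagged = ("<<PER John Doe /PER>> visited <<LOC Berlin /LOC>> in 1921 "
          "and met with the Board of <<ORG Acme Corp /ORG>>.")
print(json.dumps(extract_entities(tagged), ensure_ascii=False, indent=2))
```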

Finally, use an **HTTP Response** node (if the workflow was triggered by a
Webhook) to send back the results. If the workflow was triggered via n8n’s chat
trigger (in the case of interactive usage), you would instead rely on the chat
UI output. For a pure API workflow, the HTTP response will contain either the
tagged text or a JSON of extracted entities, which the user’s script or
application can then use.

**Note:** If you plan to run multiple analyses or have an ongoing service, you
might want to **persist the Ollama server** (don’t shut it down between runs)
and perhaps keep the model loaded in VRAM for performance. Ollama will cache the
model in memory after the first request, so subsequent requests are faster. On
an A100, you could even load two models (if you plan to experiment with which
gives better results) but be mindful of VRAM usage if doing so concurrently.

## Model Selection Considerations

We provided two example 14B models (DeepSeek-R1 and Cogito) to use with this
pipeline. Both are good choices, but here are some considerations and
alternatives:

- **Accuracy vs. Speed:** Larger models (like 14B or 30B) generally produce more
  accurate and coherent results, especially for complex instructions, compared
  to 7B models. Since our aim is correctness of NER output, the A100 allows us
  to use a 14B model which offers a sweet spot. In preliminary tests, these
  models can correctly tag most obvious entities and even handle some tricky
  cases (e.g. person names with titles, organizations that sound like person
  names, etc.) thanks to their pretrained knowledge. If you find the model is
  making mistakes, you could try a bigger model (Cogito 32B or 70B, if resources
  permit). Conversely, if you need faster responses and are willing to trade
  some accuracy, a 7-8B model or running the 14B at a higher quantization (e.g.
  4-bit) on CPU might be acceptable for smaller texts.
- **Domain of the Text:** The paper dealt with historical travel guide text
  (1920s era). These open models have been trained on large internet corpora, so
  they likely have seen a lot of historical names and terms, but their coverage
  might not be as exhaustive as GPT-4. If your text is in a specific domain
  (say, ancient mythology or very obscure local history), the model might miss
  entities that it doesn’t recognize as famous. The prompt’s context can help
  (for example, adding a note like _“Note: Mythological characters should be
  considered PERSON entities.”_ as they did for Greek gods). For extremely
  domain-specific needs, one could fine-tune a model or use a specialized one,
  but that moves beyond the zero-shot philosophy.
- **Language:** If your texts are not in English, ensure the chosen model is
  multilingual. Cogito, for instance, was trained in over 30 languages, so it
  can handle many European languages (the paper also tested German prompts). If
  using a model that’s primarily English (like some LLaMA variants), you might
  get better results by writing the instructions in English but letting it
  output tags in the original text. The study found English prompts initially
  gave better recall even on German text, but with prompt tweaks the gap closed.
  For our pipeline, you can simply provide the definitions in English and the
  text in the foreign language – a capable model will still tag the foreign
  entities. For example, Cogito or DeepSeek should tag a German sentence’s
  _“Herr Schmidt”_ as `<<PER Herr Schmidt /PER>>`. Always test on a small sample
  if in doubt.
- **Extended Context:** If your input text is very long (tens of thousands of
  words), you should chunk it into smaller segments (e.g. paragraph by
  paragraph) and run the model on each, then merge the outputs (a simple
  chunking sketch follows this list). This is because most models (including
  DeepSeek 14B) have a context window of 2048–8192 tokens. However, Cogito’s
  128k context capability is a game-changer – in theory you could feed an
  entire book and get a single output. Keep in mind the time and memory usage
  will grow with very large inputs, and n8n might need increased timeout
  settings for such long runs. For typical use (a few pages of text at a time),
  the standard context is sufficient.
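
As referenced in the last point above, a simple paragraph-based chunker is often
enough when the context window is the bottleneck. The sketch below is a rough
illustration: the word count is only a crude stand-in for a real token count,
and the budget should be chosen well below the model’s context limit.

```python
def chunk_text(text: str, max_words: int = 1500) -> list[str]:
    """Group paragraphs into chunks that stay under a rough word budget."""
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for para in text.split("\n\n"):
        words = len(para.split())
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks

# Each chunk is sent through the LLM separately and the tagged outputs are
# concatenated afterwards; entities rarely span a paragraph boundary, so little
# is lost by splitting at blank lines.
```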

In our implementation, we encourage experimenting with both DeepSeek-R1 and
Cogito models. Both are **open-source and free for commercial use** (Cogito uses
an Apache 2.0 license, DeepSeek MIT). They represent some of the best 14B-class
models as of early 2025. You can cite these models in any academic context if
needed, or even switch to another model with minimal changes to the n8n workflow
(just pull the model and change the model name in the Ollama node).

## Example Run

Let’s run through a hypothetical example to illustrate the output. Suppose a
historian supplies the following via the webhook:

- **Entities:** `PER, ORG, LOC`
- **Text:** _"Baron Münchhausen was born in Bodenwerder and served in the
  Russian military under Empress Anna. Today, the Münchhausen Museum in
  Bodenwerder is operated by the town council."_

When the workflow executes, the LLM receives instructions to tag people (PER),
organizations (ORG), and locations (LOC). With the prompt techniques described,
the model’s output might look like:

```plain
<<PER Baron Münchhausen /PER>> was born in <<LOC Bodenwerder /LOC>> and served
in the Russian military under <<PER Empress Anna /PER>>. Today, the <<ORG
Münchhausen Museum /ORG>> in <<LOC Bodenwerder /LOC>> is operated by the town
council.
```

All person names (Baron Münchhausen, Empress Anna) are enclosed in `<<PER>>`
tags, the museum is marked as an organization, and the town Bodenwerder is
marked as a location (twice). The rest of the sentence remains unchanged. This
output can be returned as-is to the user. They can visually verify it or
programmatically parse out the tagged entities. The correctness of outputs is
high: each tag corresponds to a real entity mention in the text, and there are
no hallucinated tags. If the model were to make an error (say, tagging "Russian"
as LOC erroneously), the user could adjust the prompt (for example, clarify that
national adjectives are not entities) and re-run.

## Limitations and Solutions

While this pipeline makes NER easier to reproduce, it’s important to be aware of
its limitations and how to mitigate them:

- **Model Misclassifications:** A local 14B model may not match GPT-4’s level of
  understanding. It might occasionally tag something incorrectly or miss a
  subtle entity. For instance, in historical texts, titles or honorifics (e.g.
  _“Dr. John Smith”_) might confuse it, or a ship name might be tagged as ORG
  when it’s not in our categories. **Solution:** Refine the prompt with
  additional guidance. You can add a “Note” section in the instructions to
  handle known ambiguities (the paper did this with notes about Greek gods being
  persons, etc.). Also, a quick manual review or spot-check is recommended for
  important outputs. Since the output format is simple, a human or a simple
  script can catch obvious mistakes (e.g., if "Russian" was tagged LOC, a
  post-process could remove it knowing it's likely wrong). Over time, if you
  notice a pattern of mistakes, update the prompt instructions accordingly.

- **Text Reproduction Issues:** We instruct the model to output the original
  text verbatim with tags, but LLMs sometimes can’t resist minor changes. They
  may “correct” spelling or punctuation, or alter spacing. The paper noted this
  tendency and used fuzzy matching when evaluating. In our pipeline, minor
  format changes usually don’t harm the extraction, but if preserving text
  exactly is important (say for downstream alignment), this is a concern.
  **Solution:** Emphasize fidelity in the prompt (we already do). If needed, do
  a diff between the original text and the tagged text and flag differences (a
  small verification sketch follows this list). Usually differences will be
  small (e.g., changing an old spelling to modern). You can then either accept
  them or attempt a more rigid approach (like asking for a JSON list of entity
  offsets – though that introduces other complexities and was intentionally
  avoided by the authors). In practice, we found the tag insertion approach with
  strong instructions yields nearly identical text apart from the tags.

- **Long Inputs and Memory:** Very large documents may exceed the model’s input
  capacity or make the process slow. The A100 GPU can handle a lot, but n8n
  itself might have default timeouts for a single workflow execution.
  **Solution:** For long texts, break the input into smaller chunks (maybe one
  chapter or section at a time). n8n can loop through chunks using the Split In
  Batches node or simply by splitting the text in the Function node and feeding
  the LLM node multiple times. You’d then concatenate the outputs. If chunking,
  be aware that an entity spanning a chunk boundary might be missed – this is
  rare with well-chosen boundaries (paragraph or sentence breaks). Alternatively,
  use Cogito for its extended context to avoid chunking. Make sure to increase
  n8n’s execution timeout if needed (via environment variable
  `N8N_DEFAULT_TIMEOUT`{.bash} or in the workflow settings).

- **Concurrent Usage:** If multiple users or processes hit the webhook
  simultaneously, they would be sharing the single LLM instance. Ollama can
  queue requests, but the GPU will handle them one at a time (unless running
  separate instances with multiple GPUs). For a research setting with one user
  at a time, this is fine. If offering this as a service to others, consider
  queuing requests or scaling out (multiple replicas of this workflow on
  different GPU machines). The stateless design of the prompt makes each run
  independent.

- **n8n Learning Curve:** For historians new to n8n, setting up the workflow
  might be unfamiliar. However, n8n’s no-code interface is fairly intuitive with
  a bit of guidance. This case study provides the logic; one can also import
  pre-built workflows. In fact, the _n8n_ community has template workflows (for
  example, a template for chatting with local LLMs) that could be adapted. We
  assume the base pipeline from the paper’s authors is available on GitHub –
  using that as a starting point, one mostly needs to adjust nodes as described.
  If needed, one can refer to n8n’s official docs or community forum for help on
  creating a webhook or using function nodes. Once set up, running the workflow
  is as easy as sending an HTTP request or clicking “Execute Workflow” in n8n.

- **Output Verification:** Since we prioritize correctness, you may want to
  evaluate how well the model did, especially if you have ground truth
  annotations. While benchmarking is out of scope here, note that you can
  integrate evaluation into the pipeline too. For instance, if you had a small
  test set with known entities, you could compare the model output tags with
  expected tags using a Python script (n8n has an Execute Python node) or use an
  NER evaluation library like _nervaluate_ for precision/recall. This is exactly
  what the authors did to report performance, and you could mimic that to gauge
  your chosen model’s accuracy.
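
To support the points on text fidelity and verification above, here is a small
Python sketch that strips the tags from the model output and diffs the result
against the original input, flagging any unintended edits (standard library
only; the tag pattern matches the `<<TYPE ... /TYPE>>` format used throughout).

```python
import difflib
import re

TAG_RE = re.compile(r"<<(\w+) (.*?) /\1>>")

def strip_tags(tagged_text: str) -> str:
    """Remove the NER tags, keeping only the entity text itself."""
    return TAG_RE.sub(lambda m: m.group(2), tagged_text)

def check_fidelity(original: str, tagged: str) -> None:
    """Report whether the model changed anything besides inserting tags."""
    restored = strip_tags(tagged)
    if restored == original:
        print("OK: output is identical to the input apart from the tags.")
        return
    diff = difflib.unified_diff(original.splitlines(), restored.splitlines(),
                                fromfile="original", tofile="model output", lineterm="")
    print("\n".join(diff))

check_fidelity(
    "John Doe visited Berlin in 1921.",
    "<<PER John Doe /PER>> visited <<LOC Berlin /LOC>> in 1921.",
)
```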

## Conclusion

By following this guide, we implemented the **NER4All** paper’s methodology with
a local, reproducible setup. We used n8n to handle automation and prompt
assembly, and a local LLM (via Ollama) to perform the heavy-duty language
understanding. The result is a flexible NER pipeline that requires **no training
data or API access** – just a well-crafted prompt and a powerful pretrained
model. We demonstrated how a user can specify custom entity types and get their
text annotated in one click or API call. The approach leverages the strengths of
LLMs (vast knowledge and language proficiency) to adapt to historical or niche
texts, aligning with the paper’s finding that a bit of context and expert prompt
design can unlock high NER performance.

Importantly, this setup is **easy to reproduce**: all components are either
open-source or freely available (n8n, Ollama, and the models). A research
engineer or historian can run it on a single machine with sufficient resources,
and it can be shared as a workflow file for others to import. By removing the
need for extensive data preparation or model training, this lowers the barrier
to extracting structured information from large text archives.

Moving forward, users can extend this case study in various ways: adding more
entity types (just update the definitions input), switching to other LLMs as
they become available (perhaps a future 20B model with even better
understanding), or integrating the output with databases or search indexes for
further analysis. With the rapid advancements in local AI models, we anticipate
that such pipelines will become even more accurate and faster over time,
continually democratizing access to advanced NLP for all domains.

**Sources:** This implementation draws on insights from Ahmed et al. (2025) for
the prompt-based NER method, and uses tools like n8n and Ollama as documented in
their official guides. The chosen models (DeepSeek-R1 and Cogito) are described
in their respective releases. All software and models are utilized in accordance
with their licenses for a fully local deployment.

## About LLMs as 'authors' {.appendix}

The initial draft was created using "Deep-Research" from `gpt-4.5 (preview)`.
Final proofreading, content review and layout by Nicole Dresselhaus. Do not fear
that this is some LLM-BS to get views on the homepage. I read everything
multiple times and would have written it with this content - just in worse
words.
|