NER4all: Context is All You Need

Using LLMs for low-effort, high-performance NER on historical texts.
A humanities-informed approach

Authors
Torsten Hiltmann
Humboldt-Universität zu Berlin
Martin Dröge
Humboldt-Universität zu Berlin, AI-Skills
Nicole Dresselhaus
Humboldt-Universität zu Berlin, NFDI4Memory
Till Grallert
Humboldt-Universität zu Berlin, NFDI4Memory
Melanie Althage
Humboldt-Universität zu Berlin
Paul Bayer
Humboldt-Universität zu Berlin
Sophie Eckenstaler
Humboldt-Universität zu Berlin, Kompetenzwerkstatt Digital Humanities
Koray Mendi
Humboldt-Universität zu Berlin
Jascha Marijn Schmitz
Humboldt-Universität zu Berlin, NFDI4Memory
Philipp Schneider
Humboldt-Universität zu Berlin
Wiebke Sczeponik
Humboldt-Universität zu Berlin
Anica Skibba
Humboldt-Universität zu Berlin, NFDI4Memory
Occasion
deRSE25, Karlsruhe
Date
26.02.2025
Funded by the German Research Foundation (DFG) - project number 501609550.

Who am I?

Nicole Dresselhaus

Where?

  • Professur für Digital History, Humboldt-Universität zu Berlin
  • NFDI4Memory, Task Area 5 - Data Culture

How?

  • In the humanities “by happy accident”
    — they tend to have the trickiest data
  • M.Sc. in Informatics in the Natural Sciences
    from Universität Bielefeld, with a focus on ML
  • Gathered experience in private companies,
    often involving NLP or simulations

Current Focus?

  • Stay up to date on ML research
  • Apply those findings to the real world,
    outside of toy datasets
  • What are the limits of current Deep Neural Nets?

Slide-Structure

“I want slides that are useful to me, not just to the person giving the talk”

me

  • Default background = what you usually expect from slides
  • Bullet points explaining stuff…
  • This is a takeaway primarily for researchers
  • This is a takeaway aimed at
    Research Software Engineers

Challenges for NLP with Historical Materials

What’s the problem anyway?

High Diversity Challenges Traditional NLP

“While some domains operate with relatively homogeneous text data, historical research is characterized by texts defined by their diversity in form and content, presenting a significant challenge for NLP-tasks.”

NER4all-Paper

“Yet, due to the high linguistic and genre diversity of sources, only limited canonisation of spellings, the level of required historical domain knowledge, and the scarcity of annotated training data, established approaches to natural language processing (NLP) have been both extremely expensive and yielded only unsatisfactory results in terms of recall and precision.”

NER4all-Paper

  • Wide range of styles in historical corpora
  • Often no standardized spelling, or spelling shaped by dialects
  • Meanings of words changed over time
  • Even “simple” transcriptions are often
    neither trivial nor standardized
  • Standard tokenizers often fail
  • Input normalization sometimes loses meaning needed for downstream tasks
  • Reliable recognition can only be fuzzy
  • Worst of all:
    no “go-to” benchmark covering this variety

Historical NER Requires Expert Interpretation

“The process of detecting and classifying named entities in historical texts is often less straightforward than it appears as it involves a degree of interpretation by domain experts familiar with the specific subject, historical context, and time period.”

NER4all-Paper

  • Censorship:
    meaning can hide behind archaic forms
  • Change of labels:
    WWI was known as “the Great War” until WWII
  • Absence from modern catalogs:
    the entity may no longer exist or may be forgotten

Our hypothesis is that

  • LLMs encode a lot of knowledge during their training
    — it is “just” a matter of activating this knowledge via prompting
  • LLMs can infer entities via context clues
    even if they know nothing about the entity itself
  • No fine-tuning is necessary

Ground Truth and Matching

How do you even test for this?

Ground Truth from 1921 Baedeker Guide

“To test our approach, we created ground truth with manually annotated named entities from the 1921 Baedeker travel guide for Berlin[…].”

NER4all-Paper

Why this Resource?

  • No comparable benchmark for historical NER existed
  • Travel guides have a high density and variety of named entities
  • We have experts for annotation in-house
    (historians at a Berlin university)

Examples from the 1921 Baedeker Travel Guide for Berlin

  • Prose with few named entities
  • Tables
  • Plans and sketches
  • High-density listings of named entities

Using Lenient Span Matching

“We […] match the extracted spans. We used the most lenient criteria […], meaning it suffices to have an overlap with the annotation and having the correct entity-type annotated.”

“We [allowed] for up to 1 error every 5 generated characters,[…]”

NER4all-Paper

  • The source material forces a forgiving metric that suits messy historical data
  • Introduces a simple hyperparameter to trade precision against recall
  • In history, recall is often paramount
  • Matching is less susceptible to spelling or OCR mistakes
  • Imposing exact matching is a fool’s errand on historical data
  • Less impact from hallucinations/“corrections” by generating LLMs
  • Configurable thresholds for stricter or looser matching
  • Easily adjust defaults for different user groups
  • Advocate a “never without” approach:
    removing leniency is easy, adding it is hard
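
A minimal sketch of this lenient matching in Python, assuming character-offset spans; the third-party `regex` module’s fuzzy matching handles the one-error-per-five-characters tolerance. All names and data structures here are illustrative, not the paper’s actual code.

```python
# Sketch of lenient span matching; names and data structures are
# illustrative assumptions, not the paper's actual implementation.
from dataclasses import dataclass

import regex  # third-party 'regex' package with fuzzy-matching support


@dataclass
class Span:
    start: int   # character offset into the source text
    end: int
    label: str   # entity type, e.g. "PER", "LOC", "ORG"


def locate(generated: str, text: str, label: str) -> Span | None:
    """Find a generated entity string in the source text, tolerating up
    to 1 edit (insertion/deletion/substitution) per 5 generated characters."""
    max_errors = len(generated) // 5
    pattern = f"(?:{regex.escape(generated)}){{e<={max_errors}}}"
    m = regex.search(pattern, text)
    return Span(m.start(), m.end(), label) if m else None


def lenient_match(pred: Span, gold_spans: list[Span]) -> bool:
    """Most lenient criterion: any character overlap with a gold
    annotation of the same entity type counts as a hit."""
    return any(
        pred.start < g.end and g.start < pred.end and pred.label == g.label
        for g in gold_spans
    )
```

The error budget in `locate` is the hyperparameter mentioned above: raising it favors recall, lowering it favors precision.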

Performance and Evaluation Metrics

Is it any good?

LLMs Outperform spaCy & flair by up to 22%

“readily-available, state-of-the-art LLMs significantly outperform two leading NLP frameworks, spaCy and flair, for NER in historical documents by seven to twenty-two percent higher F1-Scores.”

NER4all-Paper

Selected representative results; all LLM variants are 0-shot.
“+ PE” means “including prompt engineering”.
“No Context” means instructions only (“Do NER by …”).
Impact is given relative to the plain “Specific Context” variant.

| Variant                | Recall (μ ± σ) | Impact   | Precision (μ ± σ) | Impact   | F1-Score (μ ± σ) | Impact   |
|------------------------|----------------|----------|-------------------|----------|------------------|----------|
| Specific Context + PE  | 0.84 ±0.10     | 3.71 %   | 0.91 ±0.08        | 3.99 %   | 0.87 ±0.08       | 3.80 %   |
| Specific Context       | 0.81 ±0.19     | 0.00 %   | 0.87 ±0.19        | 0.00 %   | 0.84 ±0.19       | 0.00 %   |
| Generic Context + PE   | 0.80 ±0.11     | -2.35 %  | 0.92 ±0.10        | 2.92 %   | 0.84 ±0.10       | -0.07 %  |
| No Context             | 0.75 ±0.15     | -7.43 %  | 0.90 ±0.09        | 2.31 %   | 0.81 ±0.11       | -3.64 %  |
| Baseline flair         | 0.76 ±0.13     | -6.65 %  | 0.89 ±0.10        | 1.46 %   | 0.81 ±0.11       | -3.86 %  |
| Baseline spaCy         | 0.71 ±0.13     | -12.79 % | 0.62 ±0.11        | -29.32 % | 0.66 ±0.10       | -21.66 % |
  • Easy to use in your research
  • No dependencies on specialized tooling needed
  • Still reproducible research
    using simple, flexible prompts for your domain
  • Adopt prompt-based NER
    yielding leaner code and faster dev cycles
    — try a few prompts and get a “good enough” first glance
  • Minimize preprocessing with direct LLM usage
    — your data doesn’t need to be perfect
  • Future-proof research tools
    — changing the underlying LLM to keep up is fast!

Zero-Shot Beats Low-Shot

“surprisingly and against our expectations, zero-shot approaches, […] perform better than few-shot approaches until the number of examples reaches 16 and more.”

NER4all-Paper

  • Good annotation takes time, so either
    commit to well-sized annotation sets
    or skip them altogether
  • Getting “the right” examples is non-trivial
  • Design your CI pipeline around zero-shot or high-shot modes
  • Avoid partial solutions that degrade reliability
  • Plan enough time if you do annotation-sprints

LLMs May Surpass Human-Level Performance

“Although the process is less straightforward for historians who rely on domain expertise, we discovered that the LLM-based approach can replicate or even exceed human-level performance for certain tasks.”

NER4all-Paper

  • Our data was annotated by at least 2 historians,
    with a third deciding on disagreements
  • We still got surprised when the LLM disagreed
    and it was right!
  • Storytime: “Borstells Lesezirkel”
  • Not systematically evaluated yet…

Contextual and Domain-Specific Prompting

What does this mean?

Context & Persona Modeling Are Critical

“[LLMs] outperform […] as soon as a bit of contextual information and persona modelling is included in the prompts.”

“Our ablation study shows how providing historical context to the task […] turns focus away from a purely linguistic approach [and] are core to a successful prompting strategy.”

NER4all-Paper

All variants are 0-shot unless noted otherwise.

| Variant                        | Recall (μ ± σ) | Precision (μ ± σ) | F1-Score (μ ± σ) |
|--------------------------------|----------------|-------------------|------------------|
| Specific Context + PE, 32-shot | 0.89 ±0.09     | 0.90 ±0.06        | 0.89 ±0.06       |
| Specific Context + PE          | 0.84 ±0.10     | 0.91 ±0.08        | 0.87 ±0.08       |
| Specific Context               | 0.81 ±0.19     | 0.87 ±0.19        | 0.84 ±0.19       |
| Generic Context + PE           | 0.80 ±0.11     | 0.92 ±0.10        | 0.84 ±0.10       |
| Generic Context                | 0.81 ±0.11     | 0.90 ±0.10        | 0.85 ±0.09       |
| No Context + PE                | 0.74 ±0.15     | 0.91 ±0.10        | 0.81 ±0.11       |
| No Context                     | 0.75 ±0.15     | 0.90 ±0.09        | 0.81 ±0.11       |
  • Add context cues
    to handle archaic or specialized texts
  • Customize prompts with persona-like insights
  • Use domain data to close the gap in complex corpora
  • Embed metadata or persona definitions in your pipeline
  • Keep your code flexible for quick domain changes
  • Reduce overhead by shifting to context-based LLM prompts
  • Eliminate the need for multiple specialized model versions
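
To make this concrete, a persona- and context-enriched prompt could be assembled as in the sketch below; the wording is our illustration, not the paper’s exact prompt.

```python
# Illustrative prompt assembly combining persona modelling with specific
# historical context; the wording is an assumption, not the paper's prompt.
PERSONA = (
    "You are a historian specializing in early 20th-century Berlin, "
    "fluent in the German of that period."
)
SPECIFIC_CONTEXT = (
    "The following text is an excerpt from the 1921 Baedeker travel guide "
    "for Berlin. Spellings, street names, and institutions reflect the "
    "city as it was in 1921, not as it is today."
)
TASK = (
    "Identify all named entities (persons, locations, organizations) in "
    "the text and return them as a list of (text, type) pairs."
)

def build_prompt(text: str) -> str:
    """Swap SPECIFIC_CONTEXT for a generic description (or drop it) to
    mimic the 'Generic Context' and 'No Context' variants above."""
    return f"{PERSONA}\n\n{SPECIFIC_CONTEXT}\n\n{TASK}\n\nText:\n{text}"
```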

Reconceptualizing NER as a Humanist Endeavor

“We argue that in order to do so, one has to reconceptualize NER from a purely linguistic task into a humanist endeavour that requires some level of domain expertise and aims at activating the vast body of information LLMs have ingested during their training.”

NER4all-Paper

  • LLMs understand historical and cultural context, not just text patterns
  • Traditional NER fails on non-standardized, historical, or domain-specific texts
  • Ignoring this shift means missing a powerful, low-effort tool for research
  • Recognize entities
    within historical nuance, not just text structure
  • Use domain expertise
    to correct errors in machine predictions
  • Leverage LLMs as
    adaptive research assistants, not rigid classifiers
  • Design systems that
    incorporate metadata, historical lexicons, and local knowledge
  • Move beyond traditional NLP pipelines
    to context-aware architectures
  • Build tools that let domain experts shape results,
    not just consume them

Paradigm-Shift from Linguistics to Content-Driven NLP

“We propose a paradigmatic shift in the use of LLMs for NLP tasks: the redefinition of these tasks from a purely linguistic dimension to a content-oriented humanities dimension.”

NER4all-Paper

  • Content knowledge is the real key to success
  • Embed cultural and social dimensions right into model prompts
  • Linguistic difficulties no longer inhibit research
  • Build prompts that highlight historical or cultural detail
  • Elevate beyond structural text parsing to deeper context
  • Save time not normalizing/correcting bad OCR
  • Less time “wasted” annotating
  • Less time “wasted” fine-tuning for niche cases
  • Ask questions previously thought impossible

Outlook & future work

What now?

Liberate NER Access for Historians

“NER [is made available] for all historians by removing the barrier of scripting languages and computational skills required for established NLP tools and instead leveraging natural language prompts and consumer-grade tools and frontends.”

NER4all-Paper

  • Enables advanced text processing for non-technical historians
  • Replace coding overhead with straightforward LLM prompts
  • Platforms are already available and only get better

You should try it out yourself!

  • All our experiments ran on “the old” GPT-4o
  • Total cost: less than $200!
  • It can even run on higher-end consumer-grade hardware!
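
A first experiment fits in a few lines, as in the hedged sketch below. It uses the OpenAI Python client; the model choice and prompt wording are merely illustrative, and any OpenAI-compatible endpoint or local runtime works the same way.

```python
# Minimal end-to-end sketch using the OpenAI Python client (v1 API);
# model choice and prompt wording are illustrative, not the paper's setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM = (
    "You are a historian specializing in early 20th-century Berlin. "
    "The user provides excerpts from the 1921 Baedeker guide for Berlin."
)

def ner(text: str) -> str:
    """Ask the model to tag entities inline, e.g. [LOC: Potsdamer Platz]."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content":
                "Mark every person, location, and organization inline as "
                "[TYPE: text] and return the annotated text.\n\n" + text},
        ],
    )
    return response.choices[0].message.content

print(ner("Vom Brandenburger Tor führt Unter den Linden nach Osten."))
```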

Extending to Older Linguistic Forms & Periods

“In future work, we plan to investigate how well our approach can handle earlier linguistic forms to determine its broader applicability across different historical periods and languages.”

NER4all-Paper

  • Preliminary tests already show promising results
  • They indicate LLM-based NER can scale across broader epochs
  • Provide custom settings for each time period or language
  • Bring in specialized lexicons to refine entity detection
    • e.g. Middle High German, Renaissance Latin, …
  • Use RAG or similar techniques to enhance performance even further

Open Data & Code for Full Reproducibility

All data and code will be made available shortly.

NER4all-Authors

  • Paper is currently under peer-review
  • Code & Results will be made available soon
  • Benchmark will be made available after final revisions
  • If you have NLP/NER projects, give it a try!

Thank you for your time

Questions?