NER4all: Context is All You Need

Using LLMs for low-effort, high-performance NER on historical texts.
A humanities-informed approach

Authors
Torsten Hiltmann
Humboldt-Universität zu Berlin
Martin Dröge
Humboldt-Universität zu Berlin, AI-Skills
Nicole Dresselhaus
Humboldt-Universität zu Berlin, NFDI4Memory
Till Grallert
Humboldt-Universität zu Berlin, NFDI4Memory
Melanie Althage
Humboldt-Universität zu Berlin
Paul Bayer
Humboldt-Universität zu Berlin
Sophie Eckenstaler
Humboldt-Universität zu Berlin, Kompetenzwerkstatt Digital Humanities
Koray Mendi
Humboldt-Universität zu Berlin
Jascha Marijn Schmitz
Humboldt-Universität zu Berlin, NFDI4Memory
Philipp Schneider
Humboldt-Universität zu Berlin
Wiebke Sczeponik
Humboldt-Universität zu Berlin
Anica Skibba
Humboldt-Universität zu Berlin, NFDI4Memory
Occasion
deRSE25, Karlsruhe
Date
26.02.2025
Funded by the German Research Foundation (DFG) - project number 501609550.

Who am I?

Nicole Dresselhaus

Where?

  • Professur für Digital History, Humboldt-Universität zu Berlin
  • NFDI4Memory, Task Area 5 - Data Culture

How?

  • In the humanities “by happy accident”
    — they tend to have the trickiest data
  • M.Sc. in Informatics in the Natural Sciences
    from Universität Bielefeld, with a focus on ML
  • Gathered experience in private companies,
    often involving NLP or simulations

Current Focus?

  • Stay up to date on ML research
  • Apply those findings to the real world,
    outside of toy datasets
  • What are the limits of current Deep Neural Nets?

Slide-Structure

“I want slides that are useful to me, not just to the person giving the talk”

me

  • Default background = what you usually expect from slides
  • Bullet points explaining stuff…
  • This is a takeaway primarily for researchers
  • This is a takeaway aimed at
    Research Software Engineers

Challenges for NLP with Historical Materials

What’s the problem anyway?

High Diversity Challenges Traditional NLP

“While some domains operate with relatively homogeneous text data, historical research is characterized by texts defined by their diversity in form and content, presenting a significant challenge for NLP-tasks.”

NER4all-Paper

“Yet, due to the high linguistic and genre diversity of sources, only limited canonisation of spellings, the level of required historical domain knowledge, and the scarcity of annotated training data, established approaches to natural language processing (NLP) have been both extremely expensive and yielded only unsatisfactory results in terms of recall and precision.”

NER4all-Paper

  • Wide range of styles in historical corpora
  • Often no standardized spelling, or spelling shaped by dialects
  • Meanings of words changed over time
  • Even “simple” transcriptions are often
    neither trivial nor standardized
  • Standard tokenizers often fail
  • Input normalization sometimes loses meaning needed for downstream tasks
  • Reliable recognition can only be fuzzy
  • Worst of all:
    no “go-to” benchmark covering this variety

Historical NER Requires Expert Interpretation

“The process of detecting and classifying named entities in historical texts is often less straightforward than it appears as it involves a degree of interpretation by domain experts familiar with the specific subject, historical context, and time period.”

NER4all-Paper

  • Censorship:
    meaning can hide behind archaic forms
  • Change of labels:
    WWI was known as “the Great War” until WWII
  • Absence from modern catalogs:
    the entity may no longer exist or may be forgotten

Our hypothesis is that

  • LLMs encode a lot of knowledge during their training
    — it is “just” a matter of activating this knowledge via prompting
  • LLMs can infer entities via context clues
    even if they know nothing about the entity itself
  • No fine-tuning is necessary

Ground Truth and Matching

How do you even test for this?

Ground Truth from 1921 Baedeker Guide

“To test our approach, we created ground truth with manually annotated named entities from the 1921 Baedeker travel guide for Berlin[…].”

NER4all-Paper

Why this Resource?

  • No comparable benchmark for historical NER existed
  • Travel guides have a high density and variety of named entities
  • We have experts for annotation in-house
    (historians at a Berlin university)

Examples from the 1921 Baedeker Travel Guide for Berlin

  • Prose with few named entities
  • Tables
  • Plans and sketches
  • High-density listings of named entities

Using Lenient Span Matching

“We […] match the extracted spans. We used the most lenient criteria […], meaning it suffices to have an overlap with the annotation and having the correct entity-type annotated.”

“We [allowed] for up to 1 error every 5 generated characters,[…]”

NER4all-Paper

  • The source material forces a forgiving metric that suits messy historical data
  • Introduces a simple hyperparameter to trade precision against recall
  • In history, recall is often paramount
  • Matching is less susceptible to spelling or OCR mistakes
  • Imposing exact matching is a fool’s errand on historical data
  • Less impact from hallucinations/“corrections” by generating LLMs
  • Configurable thresholds for stricter or looser matching
  • Easily adjust defaults for different user groups
  • Advocate a “never without” approach:
    removing leniency is easy, adding it is hard
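
A minimal sketch of this lenient matching in Python, assuming character-offset spans; the third-party `regex` module’s fuzzy matching handles the one-error-per-five-characters tolerance. All names and data structures here are illustrative, not the paper’s actual code.

```python
# Sketch of lenient span matching; names and data structures are
# illustrative assumptions, not the paper's actual implementation.
from dataclasses import dataclass

import regex  # third-party 'regex' package with fuzzy-matching support


@dataclass
class Span:
    start: int   # character offset into the source text
    end: int
    label: str   # entity type, e.g. "PER", "LOC", "ORG"


def locate(generated: str, text: str, label: str) -> Span | None:
    """Find a generated entity string in the source text, tolerating up
    to 1 edit (insertion/deletion/substitution) per 5 generated characters."""
    max_errors = len(generated) // 5
    pattern = f"(?:{regex.escape(generated)}){{e<={max_errors}}}"
    m = regex.search(pattern, text)
    return Span(m.start(), m.end(), label) if m else None


def lenient_match(pred: Span, gold_spans: list[Span]) -> bool:
    """Most lenient criterion: any character overlap with a gold
    annotation of the same entity type counts as a hit."""
    return any(
        pred.start < g.end and g.start < pred.end and pred.label == g.label
        for g in gold_spans
    )
```

The error budget in `locate` is the hyperparameter mentioned above: raising it favors recall, lowering it favors precision.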

Performance and Evaluation Metrics

Is it any good?

LLMs Outperform spaCy & flair by up to 22%

“readily-available, state-of-the-art LLMs significantly outperform two leading NLP frameworks, spaCy and flair, for NER in historical documents by seven to twenty-two percent higher F1-Scores.”

NER4all-Paper

Selected representative results; all LLM variants are 0-shot.
“+ PE” means “including prompt engineering”.
“No Context” means instructions only (“Do NER by …”).
Impact is given relative to the plain “Specific Context” variant.

| Variant                | Recall (μ ± σ) | Impact   | Precision (μ ± σ) | Impact   | F1-Score (μ ± σ) | Impact   |
|------------------------|----------------|----------|-------------------|----------|------------------|----------|
| Specific Context + PE  | 0.84 ±0.10     | 3.71 %   | 0.91 ±0.08        | 3.99 %   | 0.87 ±0.08       | 3.80 %   |
| Specific Context       | 0.81 ±0.19     | 0.00 %   | 0.87 ±0.19        | 0.00 %   | 0.84 ±0.19       | 0.00 %   |
| Generic Context + PE   | 0.80 ±0.11     | -2.35 %  | 0.92 ±0.10        | 2.92 %   | 0.84 ±0.10       | -0.07 %  |
| No Context             | 0.75 ±0.15     | -7.43 %  | 0.90 ±0.09        | 2.31 %   | 0.81 ±0.11       | -3.64 %  |
| Baseline flair         | 0.76 ±0.13     | -6.65 %  | 0.89 ±0.10        | 1.46 %   | 0.81 ±0.11       | -3.86 %  |
| Baseline spaCy         | 0.71 ±0.13     | -12.79 % | 0.62 ±0.11        | -29.32 % | 0.66 ±0.10       | -21.66 % |
  • Easy to use in your research
  • No dependencies on specialized tooling needed
  • Still reproducible research
    using simple, flexible prompts for your domain
  • Adopt prompt-based NER
    yielding leaner code and faster dev cycles
    — try a few prompts and get a “good enough” first glance
  • Minimize preprocessing with direct LLM usage
    — your data doesn’t need to be perfect
  • Future-proof research tools
    — changing the underlying LLM to keep up is fast!

Zero-Shot Beats Low-Shot

“surprisingly and against our expectations, zero-shot approaches, […] perform better than few-shot approaches until the number of examples reaches 16 and more.”

NER4all-Paper

  • Good annotation takes time, so either
    commit to well-sized annotation sets
    or skip them altogether
  • Getting “the right” examples is non-trivial
  • Design your CI pipeline around zero-shot or high-shot modes
  • Avoid partial solutions that degrade reliability
  • Plan enough time if you do annotation-sprints

LLMs May Surpass Human-Level Performance

“Although the process is less straightforward for historians who rely on domain expertise, we discovered that the LLM-based approach can replicate or even exceed human-level performance for certain tasks.”

NER4all-Paper

  • Our data was annotated by at least 2 historians,
    with a third deciding on disagreements
  • We still got surprised when the LLM disagreed
    and it was right!
  • Storytime: “Borstells Lesezirkel”
  • Not systematically evaluated yet…

Contextual and Domain-Specific Prompting

What does this mean?

Context & Persona Modeling Are Critical

“[LLMs] outperform […] as soon as a bit of contextual information and persona modelling is included in the prompts.”

“Our ablation study shows how providing historical context to the task […] turns focus away from a purely linguistic approach [and] are core to a successful prompting strategy.”

NER4all-Paper

All variants are 0-shot unless noted otherwise.

| Variant                        | Recall (μ ± σ) | Precision (μ ± σ) | F1-Score (μ ± σ) |
|--------------------------------|----------------|-------------------|------------------|
| Specific Context + PE, 32-shot | 0.89 ±0.09     | 0.90 ±0.06        | 0.89 ±0.06       |
| Specific Context + PE          | 0.84 ±0.10     | 0.91 ±0.08        | 0.87 ±0.08       |
| Specific Context               | 0.81 ±0.19     | 0.87 ±0.19        | 0.84 ±0.19       |
| Generic Context + PE           | 0.80 ±0.11     | 0.92 ±0.10        | 0.84 ±0.10       |
| Generic Context                | 0.81 ±0.11     | 0.90 ±0.10        | 0.85 ±0.09       |
| No Context + PE                | 0.74 ±0.15     | 0.91 ±0.10        | 0.81 ±0.11       |
| No Context                     | 0.75 ±0.15     | 0.90 ±0.09        | 0.81 ±0.11       |
  • Add context cues
    to handle archaic or specialized texts
  • Customize prompts with persona-like insights
  • Use domain data to close the gap in complex corpora
  • Embed metadata or persona definitions in your pipeline
  • Keep your code flexible for quick domain changes
  • Reduce overhead by shifting to context-based LLM prompts
  • Eliminate the need for multiple specialized model versions
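
To make this concrete, a persona- and context-enriched prompt could be assembled as in the sketch below; the wording is our illustration, not the paper’s exact prompt.

```python
# Illustrative prompt assembly combining persona modelling with specific
# historical context; the wording is an assumption, not the paper's prompt.
PERSONA = (
    "You are a historian specializing in early 20th-century Berlin, "
    "fluent in the German of that period."
)
SPECIFIC_CONTEXT = (
    "The following text is an excerpt from the 1921 Baedeker travel guide "
    "for Berlin. Spellings, street names, and institutions reflect the "
    "city as it was in 1921, not as it is today."
)
TASK = (
    "Identify all named entities (persons, locations, organizations) in "
    "the text and return them as a list of (text, type) pairs."
)

def build_prompt(text: str) -> str:
    """Swap SPECIFIC_CONTEXT for a generic description (or drop it) to
    mimic the 'Generic Context' and 'No Context' variants above."""
    return f"{PERSONA}\n\n{SPECIFIC_CONTEXT}\n\n{TASK}\n\nText:\n{text}"
```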

Reconceptualizing NER as a Humanist Endeavor

“We argue that in order to do so, one has to reconceptualize NER from a purely linguistic task into a humanist endeavour that requires some level of domain expertise and aims at activating the vast body of information LLMs have ingested during their training.”

NER4all-Paper

  • LLMs understand historical and cultural context, not just text patterns
  • Traditional NER fails on non-standardized, historical, or domain-specific texts
  • Ignoring this shift means missing a powerful, low-effort tool for research
  • Recognize entities
    within historical nuance, not just text structure
  • Use domain expertise
    to correct errors in machine predictions
  • Leverage LLMs as
    adaptive research assistants, not rigid classifiers
  • Design systems that
    incorporate metadata, historical lexicons, and local knowledge
  • Move beyond traditional NLP pipelines
    to context-aware architectures
  • Build tools that let domain experts shape results,
    not just consume them

Paradigm-Shift from Linguistics to Content-Driven NLP

“We propose a paradigmatic shift in the use of LLMs for NLP tasks: the redefinition of these tasks from a purely linguistic dimension to a content-oriented humanities dimension.”

NER4all-Paper

  • Content knowledge is the real key to success
  • Embed cultural and social dimensions right into model prompts
  • Linguistic difficulties no longer inhibit research
  • Build prompts that highlight historical or cultural detail
  • Elevate beyond structural text parsing to deeper context
  • Save time not normalizing/correcting bad OCR
  • Less time “wasted” annotating
  • Less time “wasted” fine-tuning for niche cases
  • Ask questions previously thought impossible

Outlook & future work

What now?

Liberate NER Access for Historians

“NER [is made available] for all historians by removing the barrier of scripting languages and computational skills required for established NLP tools and instead leveraging natural language prompts and consumer-grade tools and frontends.”

NER4all-Paper

  • Enables advanced text processing for non-technical historians
  • Replace coding overhead with straightforward LLM prompts
  • Platforms are already available and only get better

You should try it out yourself!

  • All our experiments ran on “the old” GPT-4o
  • Total cost: less than $200!
  • It can even run on higher-end consumer-grade hardware!
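
A first experiment fits in a few lines, as in the hedged sketch below. It uses the OpenAI Python client; the model choice and prompt wording are merely illustrative, and any OpenAI-compatible endpoint or local runtime works the same way.

```python
# Minimal end-to-end sketch using the OpenAI Python client (v1 API);
# model choice and prompt wording are illustrative, not the paper's setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM = (
    "You are a historian specializing in early 20th-century Berlin. "
    "The user provides excerpts from the 1921 Baedeker guide for Berlin."
)

def ner(text: str) -> str:
    """Ask the model to tag entities inline, e.g. [LOC: Potsdamer Platz]."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content":
                "Mark every person, location, and organization inline as "
                "[TYPE: text] and return the annotated text.\n\n" + text},
        ],
    )
    return response.choices[0].message.content

print(ner("Vom Brandenburger Tor führt Unter den Linden nach Osten."))
```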

Extending to Older Linguistic Forms & Periods

“In future work, we plan to investigate how well our approach can handle earlier linguistic forms to determine its broader applicability across different historical periods and languages.”

NER4all-Paper

  • Preliminary tests already show promising results
  • They indicate LLM-based NER can scale across broader epochs
  • Provide custom settings for each time period or language
  • Bring in specialized lexicons to refine entity detection
    • e.g. Middle High German, Renaissance Latin, …
  • Use RAG or similar techniques to enhance performance even further

Open Data & Code for Full Reproducibility

All data and code will be made available shortly.

NER4all-Authors

  • Paper is currently under peer-review
  • Code & Results will be made available soon
  • Benchmark will be made available after final revisions
  • If you have NLP/NER projects, give it a try!

Thank you for your time

Questions?