NER Module — Unstructured Text Anonymization

MaskMe’s NER module detects and replaces personally identifiable information in free-form text using Named Entity Recognition via spaCy. It operates independently from the structured-data engine and is invoked through the maskme ner CLI subcommand or the maskme.ner Python API.

Installation

pip install maskme[ner]
python -m spacy download en_core_web_lg
python -m spacy download fr_core_news_lg

Supported languages: fr (French) and en (English). Install only the models you need.

API Reference

maskme.ner

Unstructured text anonymization for MaskMe.

Detects and masks PII in free text (medical reports, legal documents, emails, etc.) using spaCy NER, with support for French and English.

The mask() function is the single entry point:

>>> from maskme.ner import mask
>>> result = mask("Alice habite à Paris.")
>>> result.output
'[PERSON] habite à [LOCATION].'

Batch:

>>> results = mask(["Alice habite à Paris.", "John lives in London."])
>>> [r.output for r in results]
['[PERSON] habite à [LOCATION].', '[PERSON] lives in [LOCATION].']

class maskme.ner.Detector(*args, **kwargs)

Bases: Protocol

Protocol that every entity detector must satisfy.

A detector receives a text string and returns a list of Entity objects. It must never modify the text or apply any masking.

The name attribute is used by the pipeline for logging, priority resolution, and conflict reporting.

Example implementation skeleton:

from typing import List

from maskme.ner import Entity


class MyDetector:
    name = "my_detector"
    priority = 50  # higher = wins span conflicts

    def detect(
        self, text: str, language: str = "fr", **kwargs,
    ) -> List[Entity]:
        entities = []
        # ... find spans ...
        return entities

detect(text: str, language: str = 'fr', **kwargs: Any) -> List[Entity]

Detect PII entities in a text string.

Parameters:
  • text – The raw text to analyse.

  • language – ISO 639-1 language code ("fr" or "en"). Detectors may use this to apply language-specific patterns or load the appropriate NLP model.

  • **kwargs – Detector-specific parameters.

Returns:

List of Entity objects found in the text. Must never return overlapping spans from the same detector. Returns an empty list if no entities are found.

name: str

Stable identifier used in registry keys and in the entity.detector field.

priority: int

Priority for span conflict resolution (higher wins). spaCy detectors default to 50 (contextual but probabilistic).
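As an illustration of the protocol, here is a minimal sketch of a regex-based detector for dates written as YYYY-MM-DD. The Entity and EntityLabel classes below are simplified stand-ins that mirror the documented fields, so the example runs without MaskMe installed. IsoDateDetector, its pattern, and its priority of 80 are hypothetical and not part of the library.

```python
import re
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, List


class EntityLabel(str, Enum):
    # Stand-in for maskme.ner.EntityLabel (only the member used here).
    DATE = "DATE"


@dataclass
class Entity:
    # Stand-in mirroring the documented maskme.ner.Entity fields.
    text: str
    label: EntityLabel
    start: int
    end: int
    score: float = 1.0
    detector: str = ""
    metadata: Dict[str, Any] = field(default_factory=dict)


class IsoDateDetector:
    """Hypothetical detector: finds dates written as YYYY-MM-DD."""

    name = "iso_date"
    priority = 80  # assumed value; higher wins span conflicts

    _pattern = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

    def detect(self, text: str, language: str = "fr", **kwargs: Any) -> List[Entity]:
        # finditer yields non-overlapping matches, so this detector never
        # returns overlapping spans, and the input text is left untouched.
        return [
            Entity(
                text=m.group(0),
                label=EntityLabel.DATE,
                start=m.start(),
                end=m.end(),
                score=1.0,  # regex detections are deterministic
                detector=self.name,
            )
            for m in self._pattern.finditer(text)
        ]


source = "Admitted 2024-03-01, discharged 2024-03-09."
found = IsoDateDetector().detect(source, language="en")
```

Because the class only needs a name, a priority, and a detect() method, it satisfies the protocol structurally without inheriting from Detector.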

class maskme.ner.Entity(text: str, label: EntityLabel, start: int, end: int, score: float = 1.0, detector: str = '', metadata: Dict[str, Any] = <factory>)

Bases: object

A single detected PII span within a text.

text

The exact substring that was detected.

Type: str

label

Standardised entity label (EntityLabel).

Type: maskme.ner.base.EntityLabel

start

Character offset of the start of the span (inclusive).

Type: int

end

Character offset of the end of the span (exclusive). text == source_text[start:end] must always hold.

Type: int

score

Confidence score in [0.0, 1.0]. Regex detections are always 1.0 (deterministic). NER detections carry the model’s confidence.

Type: float

detector

Name of the detector that produced this entity, for traceability and conflict resolution.

Type: str

metadata

Arbitrary extra data (e.g. regex group name, normalised value, spaCy entity type).

Type: Dict[str, Any]

contains(other: Entity) -> bool

Return True if this entity’s span fully contains another’s.

overlaps(other: Entity) -> bool

Return True if this entity’s span overlaps with another’s.
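The two predicates can be illustrated with half-open intervals. The following is a minimal sketch, not the library's implementation: Span is a hypothetical stand-in for Entity that carries only the offsets, and it assumes that overlaps() means the two ranges [start, end) share at least one character.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    # Hypothetical stand-in for Entity, reduced to its offsets.
    start: int  # inclusive
    end: int    # exclusive

    def overlaps(self, other: "Span") -> bool:
        # Half-open ranges intersect when each starts before the other ends.
        return self.start < other.end and other.start < self.end

    def contains(self, other: "Span") -> bool:
        # Full containment: other lies entirely within this span.
        return self.start <= other.start and other.end <= self.end


source = "Dr Alice Martin works in Paris."
full_name = Span(3, 15)   # "Alice Martin"
first_name = Span(3, 8)   # "Alice"
city = Span(25, 30)       # "Paris"
```

Under this reading, full_name both contains and overlaps first_name, while two spans that merely touch (one ends where the next starts) do not overlap.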

class maskme.ner.EntityLabel(value)

Bases: str, Enum

Entity labels produced by the spaCy NER detector.

Using an enum prevents label drift between detectors. All detectors must map their internal labels to these values.

DATE = 'DATE'
LOCATION = 'LOCATION'
ORGANISATION = 'ORGANISATION'
PERSON = 'PERSON'
TIME = 'TIME'
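spaCy models use their own label sets (for example PER in the French models, PERSON and GPE in the English ones), so a detector needs a mapping step before it can emit these values. A sketch of such a mapping follows. The dictionary SPACY_TO_LABEL and the helper to_entity_label are hypothetical, the enum is re-declared as a stand-in so the example is self-contained, and the mapping MaskMe actually uses may differ.

```python
from enum import Enum
from typing import Optional


class EntityLabel(str, Enum):
    # Stand-in re-declaring the documented members.
    DATE = "DATE"
    LOCATION = "LOCATION"
    ORGANISATION = "ORGANISATION"
    PERSON = "PERSON"
    TIME = "TIME"


# Assumed mapping from raw spaCy labels to the standard labels.
SPACY_TO_LABEL = {
    "PERSON": EntityLabel.PERSON,  # English models
    "PER": EntityLabel.PERSON,     # French models
    "GPE": EntityLabel.LOCATION,
    "LOC": EntityLabel.LOCATION,
    "ORG": EntityLabel.ORGANISATION,
    "DATE": EntityLabel.DATE,
    "TIME": EntityLabel.TIME,
}


def to_entity_label(spacy_label: str) -> Optional[EntityLabel]:
    # Labels with no standard equivalent (e.g. MISC, MONEY) map to None
    # so the caller can skip them.
    return SPACY_TO_LABEL.get(spacy_label)
```

Returning None for unmapped labels keeps unknown spaCy types out of the pipeline instead of inventing new label values.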

class maskme.ner.PipelineResult(output: str, input: str, entities: List[Entity], language: str, stats: Dict[str, Any] = <factory>)

Bases: object

Output of a single mask() call.

output

Text with all detected entities replaced by tags (e.g. [PERSON], [LOCATION]).

Type: str

input

Original input text.

Type: str

entities

Non-overlapping entities after span resolution, sorted by start offset.

Type: List[maskme.ner.base.Entity]

language

Detected or overridden language code ("fr" / "en").

Type: str

stats

Processing metadata.

Type: Dict[str, Any]

as_dict() -> Dict[str, Any]

property entity_count: int

property labels_found: List[str]

maskme.ner.mask(text: str | List[str], language: str | None = None) -> PipelineResult | List[PipelineResult]

Detect and replace PII in text with anonymized tags ([PERSON], etc.).

This is the single entry point to the NER module. Pass a string for single-text processing or a list of strings for batch processing.

Parameters:
  • text – Text string or list of text strings to mask.

  • language – Language override ("fr" / "en"). Auto-detected if None.

Returns:

PipelineResult if text is a single string. List[PipelineResult] if text is a list.

Simple usage:

>>> from maskme.ner import mask
>>> result = mask("Alice habite à Paris.")
>>> result.output
'[PERSON] habite à [LOCATION].'

Batch:

>>> results = mask(["Alice habite à Paris.", "John lives in London."])
>>> [r.output for r in results]
['[PERSON] habite à [LOCATION].', '[PERSON] lives in [LOCATION].']

Language hint:

>>> mask("John lives in London.", language="en").output
'[PERSON] lives in [LOCATION].'

maskme.ner.resolve_spans(entities: List[Entity]) -> List[Entity]

Resolve overlapping entity spans from multiple detectors.

Resolution rules (applied in order):
  1. Longer span wins over shorter span (more context = more precise).

  2. Higher detector priority wins when spans have equal length.

  3. Higher confidence score breaks remaining ties.

The result is a set of non-overlapping entities sorted by start offset, ready to be passed to masker.py for text reconstruction.

Parameters:

entities – Raw entity list, potentially with overlapping spans.

Returns:

Deduplicated, non-overlapping list of Entity objects sorted by their start position in the source text.
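The rules above can be sketched as a greedy selection: rank candidates by span length, then detector priority, then score, and keep each candidate that does not overlap one already kept. This is an illustrative sketch, not MaskMe's implementation. Candidate is a hypothetical stand-in for Entity, the "gazetteer" detector is invented for the example, and because Entity carries only the detector name, the priorities argument (detector name to priority) is an assumption about how priority is looked up.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Candidate:
    # Hypothetical stand-in for Entity with only the fields needed here.
    text: str
    start: int
    end: int
    score: float
    detector: str


def resolve_spans_sketch(
    entities: List[Candidate], priorities: Dict[str, int]
) -> List[Candidate]:
    # Rank: longer span first, then higher priority, then higher score.
    ranked = sorted(
        entities,
        key=lambda e: (e.end - e.start, priorities.get(e.detector, 0), e.score),
        reverse=True,
    )
    kept: List[Candidate] = []
    for cand in ranked:
        # Keep a candidate only if it overlaps nothing already accepted.
        if all(cand.end <= k.start or k.end <= cand.start for k in kept):
            kept.append(cand)
    # Return in reading order, as the documented function does.
    return sorted(kept, key=lambda e: e.start)


raw = [
    Candidate("Alice", 3, 8, 0.90, "spacy"),
    Candidate("Alice Martin", 3, 15, 0.80, "spacy"),
    Candidate("Paris", 25, 30, 0.70, "spacy"),
    Candidate("Paris", 25, 30, 1.00, "gazetteer"),
]
resolved = resolve_spans_sketch(raw, {"spacy": 50, "gazetteer": 60})
```

Here "Alice Martin" beats "Alice" on length despite its lower score, and the two identical "Paris" spans are settled by detector priority.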

CLI Usage

maskme ner [OPTIONS] [INPUT]

Arguments:

  • INPUT — Path to a text file. If omitted, reads from stdin.

Options:

Option               Description
-o, --output PATH    Output file path (default: stdout)
-l, --language TEXT  Language override (fr or en); auto-detected if omitted
--lines              Treat each line as a separate text (batch mode)
--verbose            Enable debug logging

Entity Labels

The following entity types are detected and replaced with bracketed tags:

Label         Tag             Description
PERSON        [PERSON]        People names
LOCATION      [LOCATION]      Cities, countries, addresses
ORGANISATION  [ORGANISATION]  Companies, hospitals, organizations
DATE          [DATE]          Absolute or relative dates
TIME          [TIME]          Times of day
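Replacement itself amounts to rebuilding the text around the resolved spans. Below is a minimal sketch, assuming the spans are already non-overlapping and sorted by start offset. apply_tags is a hypothetical helper (the library's own reconstruction code is not shown in this document), and spans are passed as plain (start, end, label) tuples for brevity.

```python
from typing import List, Tuple


def apply_tags(text: str, spans: List[Tuple[int, int, str]]) -> str:
    parts = []
    cursor = 0
    for start, end, label in spans:
        parts.append(text[cursor:start])  # untouched text before the span
        parts.append(f"[{label}]")        # bracketed tag replaces the span
        cursor = end
    parts.append(text[cursor:])           # trailing text after the last span
    return "".join(parts)


masked = apply_tags(
    "John lives in London.",
    [(0, 4, "PERSON"), (14, 20, "LOCATION")],
)
```

Walking the spans left to right with a cursor avoids recomputing offsets after each substitution, which would be needed if the tags were inserted into the string one at a time.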

Graceful Degradation

If spaCy is not installed, the module logs a warning and returns the original text unchanged. This allows code using the NER module to run without the optional dependency in environments where text anonymization is not required.
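This behaviour is typically implemented as a guarded optional import. The sketch below shows one way it could look. The names optional_import and mask_or_passthrough and the logger name are hypothetical, and the real module's log message and return type may differ (mask() is documented as returning a PipelineResult rather than a bare string).

```python
import importlib
import logging
from types import ModuleType
from typing import Optional

logger = logging.getLogger("maskme.ner")  # assumed logger name


def optional_import(name: str) -> Optional[ModuleType]:
    # Return the module if it can be imported, otherwise log and return None.
    try:
        return importlib.import_module(name)
    except ImportError:
        logger.warning("%s is not installed; text will not be anonymized", name)
        return None


def mask_or_passthrough(text: str, nlp_module: Optional[ModuleType]) -> str:
    if nlp_module is None:
        return text  # degrade gracefully: original text, unchanged
    raise NotImplementedError("real detection would run here")


# A module name assumed not to exist, to exercise the fallback path.
missing = optional_import("spacy_not_installed_example")
unchanged = mask_or_passthrough("Alice habite à Paris.", missing)
```

Callers that rely on anonymization for compliance should treat the warning as an error, since the fallback path returns unmasked text.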