NER Module — Unstructured Text Anonymization
MaskMe’s NER module detects and replaces personally identifiable information in
free-form text using Named Entity Recognition via spaCy. It operates independently
from the structured-data engine and is invoked through the maskme ner CLI subcommand
or the maskme.ner Python API.
Installation
pip install maskme[ner]
python -m spacy download en_core_web_lg
python -m spacy download fr_core_news_lg
Supported languages: fr (French) and en (English). Install only the models
you need.
API Reference
maskme.ner
Unstructured text anonymization for MaskMe.
Detects and masks PII in free text (medical reports, legal documents, emails, etc.) using spaCy NER, with support for French and English.
The mask() function is the single entry point:
>>> from maskme.ner import mask
>>> result = mask("Alice habite à Paris.")
>>> result.output
'[PERSON] habite à [LOCATION].'
Batch:
>>> results = mask(["Alice habite à Paris.", "John lives in London."])
>>> [r.output for r in results]
['[PERSON] habite à [LOCATION].', '[PERSON] lives in [LOCATION].']
- class maskme.ner.Detector(*args, **kwargs)
Bases: Protocol
Protocol that every entity detector must satisfy.
A detector receives a text string and returns a list of Entity objects. It must never modify the text or apply any masking.
The name attribute is used by the pipeline for logging, priority resolution, and conflict reporting.
Example implementation skeleton:
    from typing import List
    from maskme.ner import Entity

    class MyDetector:
        name = "my_detector"
        priority = 50  # higher value wins span conflicts

        def detect(self, text: str, language: str = "fr", **kwargs) -> List[Entity]:
            entities = []
            # ... find spans ...
            return entities
- detect(text: str, language: str = 'fr', **kwargs: Any) -> List[Entity]
Detect PII entities in a text string.
- Parameters:
text – The raw text to analyse.
language – ISO 639-1 language code (“fr” or “en”). Detectors may use this to apply language-specific patterns or load the appropriate NLP model.
**kwargs – Detector-specific parameters.
- Returns:
List of Entity objects found in the text. Must never return overlapping spans from the same detector. Returns an empty list if no entities are found.
- name: str
Stable identifier used in registry keys and in the entity.detector field.
- priority: int
Priority for span conflict resolution (higher wins). spaCy detectors default to 50 (contextual but probabilistic).
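To make the protocol concrete, here is a minimal sketch of a regex-based detector that satisfies it. Everything in it is illustrative: `IsoDateDetector`, its priority of 80, and the local `Entity` and `EntityLabel` stand-ins are assumptions so that the example runs without MaskMe installed. In real code the two types would be imported from maskme.ner.

```python
import re
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, List


# Local stand-ins mirroring the documented fields (assumption: real code
# would import Entity and EntityLabel from maskme.ner instead).
class EntityLabel(str, Enum):
    DATE = "DATE"


@dataclass
class Entity:
    text: str
    label: EntityLabel
    start: int
    end: int
    score: float = 1.0
    detector: str = ""
    metadata: Dict[str, Any] = field(default_factory=dict)


class IsoDateDetector:
    """Hypothetical detector that flags calendar dates written as YYYY-MM-DD."""

    name = "iso_date"
    priority = 80  # assumed value; the docs only state that spaCy defaults to 50

    _pattern = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

    def detect(self, text: str, language: str = "fr", **kwargs: Any) -> List[Entity]:
        # finditer yields non-overlapping matches, so the returned spans
        # never overlap, and the input text is never modified.
        return [
            Entity(
                text=m.group(0),
                label=EntityLabel.DATE,
                start=m.start(),
                end=m.end(),
                score=1.0,  # regex detections are deterministic
                detector=self.name,
            )
            for m in self._pattern.finditer(text)
        ]


found = IsoDateDetector().detect("Admitted 2024-03-01, discharged 2024-03-09.")
print([(e.text, e.start, e.end) for e in found])
```

Because Detector is a Protocol, the class does not need to inherit from anything; having `name`, `priority`, and a compatible `detect` method is enough.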
- class maskme.ner.Entity(text: str, label: EntityLabel, start: int, end: int, score: float = 1.0, detector: str = '', metadata: Dict[str, Any] = <factory>)
Bases: object
A single detected PII span within a text.
- text
The exact substring that was detected.
- Type:
str
- label
Standardised entity label (EntityLabel).
- start
Character offset of the start of the span (inclusive).
- Type:
int
- end
Character offset of the end of the span (exclusive). text == source_text[start:end] must always hold.
- Type:
int
- score
Confidence score in [0.0, 1.0]. Regex detections are always 1.0 (deterministic). NER detections carry the model’s confidence.
- Type:
float
- detector
Name of the detector that produced this entity, for traceability and conflict resolution.
- Type:
str
- metadata
Arbitrary extra data (e.g. regex group name, normalised value, spaCy entity type).
- Type:
Dict[str, Any]
- detector: str = ''
- end: int
- label: EntityLabel
- metadata: Dict[str, Any]
- score: float = 1.0
- start: int
- text: str
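The offset and score constraints described above can be checked mechanically. The following is a small sketch, not part of MaskMe: `Span` is a reduced stand-in for Entity, and `is_consistent` is a hypothetical helper. It also assumes spans are non-empty, which the documentation does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Stand-in for maskme.ner.Entity with only the fields checked here."""
    text: str
    start: int
    end: int
    score: float = 1.0


def is_consistent(span: Span, source_text: str) -> bool:
    """Return True if the span obeys the documented Entity constraints."""
    # start is inclusive, end is exclusive; non-empty spans are an assumption.
    in_bounds = 0 <= span.start < span.end <= len(source_text)
    score_ok = 0.0 <= span.score <= 1.0
    return in_bounds and score_ok and source_text[span.start:span.end] == span.text


source = "Alice habite à Paris."
print(is_consistent(Span("Alice", 0, 5), source))    # True
print(is_consistent(Span("Paris", 15, 20), source))  # True
print(is_consistent(Span("Paris", 15, 19), source))  # False: end is off by one
```

A check like this is cheap to run in a detector's unit tests and catches off-by-one offsets before they reach span resolution.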
- class maskme.ner.EntityLabel(value)
Bases: str, Enum
Entity labels produced by the spaCy NER detector.
Using an enum prevents label drift between detectors. All detectors must map their internal labels to these values.
- DATE = 'DATE'
- LOCATION = 'LOCATION'
- ORGANISATION = 'ORGANISATION'
- PERSON = 'PERSON'
- TIME = 'TIME'
- class maskme.ner.PipelineResult(output: str, input: str, entities: List[Entity], language: str, stats: Dict[str, Any] = <factory>)
Bases: object
Output of a single mask() call.
- output
Text with all detected entities replaced by tags (e.g. [PERSON], [LOCATION]).
- Type:
str
- input
Original input text.
- Type:
str
- entities
Non-overlapping entities after span resolution, sorted by start offset.
- Type:
List[maskme.ner.base.Entity]
- language
Detected or overridden language code ("fr"/"en").
- Type:
str
- stats
Processing metadata.
- Type:
Dict[str, Any]
- as_dict() -> Dict[str, Any]
- entities: List[Entity]
- property entity_count: int
- input: str
- property labels_found: List[str]
- language: str
- output: str
- stats: Dict[str, Any]
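The relationship between input, entities, and output can be pictured as a single left-to-right pass that swaps each span for its tag. This is a minimal sketch under the assumption that the spans are already non-overlapping. `apply_tags` and the (start, end, label) tuples are illustrative stand-ins, not MaskMe's own masker.

```python
from typing import List, Tuple

# (start, end, label) triples stand in for resolved Entity objects.
Span = Tuple[int, int, str]


def apply_tags(text: str, spans: List[Span]) -> str:
    """Replace each half-open [start, end) span with a bracketed tag."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])  # untouched text before the span
        pieces.append(f"[{label}]")        # the tag that replaces the span
        cursor = end
    pieces.append(text[cursor:])           # trailing text after the last span
    return "".join(pieces)


masked = apply_tags("Alice habite à Paris.", [(0, 5, "PERSON"), (15, 20, "LOCATION")])
print(masked)  # [PERSON] habite à [LOCATION].
```

Building the result from slices of the original string avoids the offset drift that occurs when replacements are applied in place from left to right.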
- maskme.ner.mask(text: str | List[str], language: str | None = None) -> PipelineResult | List[PipelineResult]
Detect and replace PII in text with anonymized tags ([PERSON], etc.).
This is the single entry point to the NER module. Pass a string for single-text processing or a list of strings for batch processing.
- Parameters:
text – Text string or list of text strings to mask.
language – Language override ("fr"/"en"). Auto-detected if None.
- Returns:
PipelineResult if text is a single string. List[PipelineResult] if text is a list.
Simple usage:
>>> from maskme.ner import mask
>>> result = mask("Alice habite à Paris.")
>>> result.output
'[PERSON] habite à [LOCATION].'
Batch:
>>> results = mask(["Alice habite à Paris.", "John lives in London."])
>>> [r.output for r in results]
['[PERSON] habite à [LOCATION].', '[PERSON] lives in [LOCATION].']
Language hint:
>>> mask("John lives in London.", language="en").output
'[PERSON] lives in [LOCATION].'
- maskme.ner.resolve_spans(entities: List[Entity]) -> List[Entity]
Resolve overlapping entity spans from multiple detectors.
- Resolution rules (applied in order):
1. Longer span wins over shorter span (more context = more precise).
2. Higher detector priority wins when spans have equal length.
3. Higher confidence score breaks remaining ties.
The result is a set of non-overlapping entities sorted by start offset, ready to be passed to masker.py for text reconstruction.
- Parameters:
entities – Raw entity list, potentially with overlapping spans.
- Returns:
Deduplicated, non-overlapping list of Entity objects sorted by their start position in the source text.
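One way to realise the three resolution rules is a greedy pass over candidates ranked by length, then priority, then score. The sketch below is an assumption about how such a resolver could work, not the library's implementation. `Span` is a reduced stand-in for Entity, and the `priorities` mapping from detector name to priority is a hypothetical parameter, since the documented function takes only the entity list.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Span:
    """Stand-in for maskme.ner.Entity, reduced to the fields the rules use."""
    start: int
    end: int
    label: str
    score: float = 1.0
    detector: str = ""


def resolve(spans: List[Span], priorities: Optional[Dict[str, int]] = None) -> List[Span]:
    """Greedy resolution: best-ranked span first, drop anything that overlaps it."""
    priorities = priorities or {}

    def rank(s: Span):
        # Longer first, then higher detector priority, then higher score.
        return (-(s.end - s.start), -priorities.get(s.detector, 0), -s.score)

    kept: List[Span] = []
    for cand in sorted(spans, key=rank):
        # Half-open spans are disjoint when one ends at or before the other starts.
        if all(cand.end <= k.start or cand.start >= k.end for k in kept):
            kept.append(cand)
    return sorted(kept, key=lambda s: s.start)


# "Dr. Marie Curie visited Paris": "Marie" is nested inside "Marie Curie".
candidates = [
    Span(4, 9, "PERSON", 0.9, "spacy"),
    Span(4, 15, "PERSON", 0.8, "spacy"),
    Span(24, 29, "LOCATION", 0.95, "spacy"),
]
print([(s.start, s.end, s.label) for s in resolve(candidates)])
# [(4, 15, 'PERSON'), (24, 29, 'LOCATION')]
```

Exact duplicates are removed by the same overlap test, so no separate deduplication step is needed in this sketch.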
CLI Usage
maskme ner [OPTIONS] [INPUT]
Arguments:
INPUT: Path to a text file. If omitted, reads from stdin.
Options:
| Option | Description |
|---|---|
| | Output file path (default: stdout) |
| | Language override ( |
| | Treat each line as a separate text (batch mode) |
| | Enable debug logging |
Entity Labels
The following entity types are detected and replaced with bracketed tags:
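| Label | Tag | Description |
|---|---|---|
| PERSON | [PERSON] | People names |
| LOCATION | [LOCATION] | Cities, countries, addresses |
| ORGANISATION | [ORGANISATION] | Companies, hospitals, organizations |
| DATE | [DATE] | Absolute or relative dates |
| TIME | [TIME] | Times of day |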
Graceful Degradation
If spaCy is not installed, the module logs a warning and returns the original text unchanged. This allows code using the NER module to run without the optional dependency in environments where text anonymization is not required.
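The fallback behaviour follows a common optional-dependency pattern: attempt the import, and on failure log a warning and pass the text through. The sketch below illustrates that pattern only. `mask_or_passthrough`, its `masker` callable, and the `module_name` parameter are hypothetical names chosen for the example and are not part of MaskMe's API.

```python
import importlib
import logging
from typing import Callable

logger = logging.getLogger("ner_sketch")


def mask_or_passthrough(
    text: str,
    masker: Callable[[str], str],
    module_name: str = "spacy",
) -> str:
    """Apply masker if the optional module imports; otherwise return text as-is."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        # ModuleNotFoundError is a subclass of ImportError, so it is caught here.
        logger.warning("%s is not installed; returning text unchanged", module_name)
        return text
    return masker(text)


# With a module name that cannot be imported, the text passes through untouched.
print(mask_or_passthrough("Alice habite à Paris.", str.upper, "module_that_does_not_exist"))
```

Callers that must guarantee anonymization should treat the warning as an error condition, since the pass-through result still contains the original PII.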