Getting Started with MaskMe
In this tutorial, you’ll learn how to anonymize datasets using MaskMe. Whether you prefer a command-line tool or a Python API for custom applications, you’ll see both approaches in action.
By the end, you’ll understand:
The two ways to use MaskMe (CLI vs Python API)
How each anonymization strategy works
How to choose the right strategy for your data
How to measure re-identification risk and data utility
Prerequisites
MaskMe requires Python 3.9+. Install it now:
pip install maskme
Verify the installation:
maskme --version
Part 1: Your First Anonymization
Let’s start with a simple example. Imagine you have customer data you want to anonymize:
Sample data (customers.csv):
id,name,email,phone,region,purchase_count
1,Alice Johnson,alice@example.com,555-0101,US-CA,42
2,Bob Smith,bob@example.com,555-0102,US-NY,15
3,Carol White,carol@example.com,555-0103,US-TX,87
Define a rules file that specifies which strategy to apply to each field:
Rules file (rules.json):
{
"id": "hash",
"name": "drop",
"email": "drop",
"phone": {"strategy": "redact", "char": "X", "keep_start": 0, "keep_end": 4},
"region": "keep",
"purchase_count": "keep"
}
Now run MaskMe from the command line:
maskme --rules rules.json --input customers.csv --output customers_masked.csv
Inspect the result:
cat customers_masked.csv
Output:
id,phone,region,purchase_count
6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b,XXXX0101,US-CA,42
d4735e3a265e16eee03f59718b9b5d03019c07d8b6c51f90da3a666eec13ab35,XXXX0102,US-NY,15
4e07408562bedb8b60ce05c1decfe3ad16b72230967de01f640b7e4729b49fce,XXXX0103,US-TX,87
What happened:
id: Hashed (unreadable, but consistent for the same input)
name: Removed entirely (drop)
email: Removed entirely (drop)
phone: Last 4 digits kept, rest redacted (keep_end: 4)
region and purchase_count: Left unchanged (keep)
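To make the four rules concrete, here is a minimal standard-library sketch that reproduces the masked rows by hand. It is not MaskMe's implementation: the `mask_row` helper is hypothetical, and it assumes the `hash` rule is an unsalted SHA-256 of the raw value, which is what the digests in the output above correspond to.

```python
import csv
import hashlib
import io

CSV_TEXT = """id,name,email,phone,region,purchase_count
1,Alice Johnson,alice@example.com,555-0101,US-CA,42
2,Bob Smith,bob@example.com,555-0102,US-NY,15
3,Carol White,carol@example.com,555-0103,US-TX,87
"""

def mask_row(row):
    """Apply the tutorial's rules by hand (illustration only, not MaskMe code)."""
    phone = row["phone"]
    return {
        # "hash": unsalted SHA-256 of the value, rendered as hex
        "id": hashlib.sha256(row["id"].encode("utf-8")).hexdigest(),
        # "drop": name and email are simply not copied over
        # "redact": every character except the last four becomes "X"
        "phone": "X" * (len(phone) - 4) + phone[-4:],
        # "keep": copied unchanged
        "region": row["region"],
        "purchase_count": row["purchase_count"],
    }

masked = [mask_row(r) for r in csv.DictReader(io.StringIO(CSV_TEXT))]
for row in masked:
    print(row["id"][:12], row["phone"], row["region"], row["purchase_count"])
```

Note that an unsalted hash of a short value such as `1` is easy to guess by brute force, which is one reason the Hash section recommends a salt.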
Developers building applications on top of MaskMe can use the Python API instead:
import csv
from maskme import MaskMe
# Define rules (same as the JSON file)
rules = {
"id": "hash",
"name": "drop",
"email": "drop",
"phone": {"strategy": "redact", "char": "X", "keep_start": 0, "keep_end": 4},
"region": "keep",
"purchase_count": "keep"
}
# Load data (newline="" is what the csv module expects for file objects)
with open("customers.csv", "r", newline="") as f:
    reader = csv.DictReader(f)
    records = list(reader)
# Initialize engine
engine = MaskMe(rules)
# Process records
masked_records = list(engine.mask(records))
# Save results
with open("customers_masked.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=masked_records[0].keys())
    writer.writeheader()
    writer.writerows(masked_records)
Key advantage: You can integrate MaskMe directly into your Python applications — pipelines, data processing scripts, ML workflows, etc.
Part 2: Understanding Strategies
MaskMe provides six strategies. Each strikes a different balance between privacy and utility. Let’s explore when to use each.
| Strategy | What It Does | Best For |
|---|---|---|
| Keep | Preserves original value unchanged | Non-sensitive fields, public data |
| Drop | Removes field entirely | Direct identifiers (PII) |
| Hash | One-way cryptographic digest | Consistent linking without revealing values |
| Redact | Replaces characters with placeholders | Partial visibility (e.g., last 4 digits) |
| Noise | Adds statistical noise to numeric values | Numeric data with preserved distributions |
| Generalize | Coarsens data into broader categories | Ages, dates, locations |
Keep
The Keep strategy keeps the original value unchanged.
Use when: The field is already public, non-sensitive, or represents a key analytical dimension.
Example: Geographic region, product category, or timestamp.
In rules file:
{
"region": "keep",
"category": "keep"
}
Input data:
{
"region": "US-West",
"category": "Electronics"
}
Output data:
{
"region": "US-West",
"category": "Electronics"
}
Drop
The Drop strategy removes the field entirely.
Use when: The field is a direct identifier (PII) that could lead to re-identification, or if the field is unnecessary for the final dataset.
Example: Names, email addresses, phone numbers, social security numbers.
In rules file:
{
"user_id": "drop",
"internal_ref": "drop"
}
Input data:
{
"user_id": "USR-12345",
"internal_ref": "REF-999",
"email": "john@example.com"
}
Output data:
{
"email": "john@example.com"
}
Hash
The Hash strategy converts the value into a fixed-length hexadecimal digest.
Use when: You need a consistent, one-way transformation (cannot be reversed).
Common use: Email addresses, usernames, customer IDs when consistency matters for record linking.
Parameters:
algo: Hashing algorithm, one of sha256 (default), sha512, blake2b, etc.
salt: Recommended. Salt string for extra security. Same input + same salt = same output.
In rules file:
{
"email": "hash",
"customer_id": {
"strategy": "hash",
"salt": "my-org-secret-2026"
},
"diagnosis_code": {
"strategy": "hash",
"algo": "sha512",
"salt": "healthcare-key"
}
}
Input data:
{
"email": "alice@example.com",
"customer_id": "CUST-5678",
"diagnosis_code": "E11.9"
}
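Output data (illustrative values, shortened; real sha256 and sha512 digests are 64 and 128 hexadecimal characters long):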
{
"email": "2a7f8e9c3b1d5f4a6e8c9b1d3f5a7e8c",
"customer_id": "f4d8b7a9c1e3f5d9b2a4c6e8f1a3d5b7",
"diagnosis_code": "9f8e7d6c5b4a3f2e1d0c9b8a7f6e5d4ca1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6"
}
Note
Same input + same salt = same output. This is useful for linking records across datasets.
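The effect of the salt can be sketched with Python's `hashlib`. This is an illustration under stated assumptions, not MaskMe's code: the `hash_value` helper is hypothetical, and the tutorial does not say how MaskMe combines the salt with the value, so simple prefixing is assumed here.

```python
import hashlib

def hash_value(value, algo="sha256", salt=""):
    """Hex digest of salt + value (salt placement is an assumption)."""
    h = hashlib.new(algo)
    h.update((salt + value).encode("utf-8"))
    return h.hexdigest()

a = hash_value("CUST-5678", salt="my-org-secret-2026")
b = hash_value("CUST-5678", salt="my-org-secret-2026")
c = hash_value("CUST-5678", salt="another-org")
d = hash_value("E11.9", algo="sha512", salt="healthcare-key")

print(a == b)          # True: same input + same salt
print(a == c)          # False: a different salt gives an unrelated digest
print(len(a), len(d))  # 64 128
```

Because the digest depends on the salt, two datasets can only be linked on a hashed field if both were produced with the same salt.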
Redact
The Redact strategy replaces characters with a placeholder, preserving length.
Use when: You need some visible information (like last 4 digits) while hiding the rest.
Common use: Phone numbers, credit cards, partially-visible IDs.
Parameters:
char: Placeholder character (default: *)
keep_start: Characters to show at the beginning (default: 0)
keep_end: Characters to show at the end (default: 0)
In rules file:
{
"phone": {
"strategy": "redact",
"keep_end": 4
},
"credit_card": {
"strategy": "redact",
"char": "X",
"keep_end": 4
},
"email": {
"strategy": "redact",
"keep_start": 1,
"keep_end": 3,
"char": "*"
},
"name": "redact"
}
Input data:
{
"phone": "555-0101",
"credit_card": "4532-1234-5678-7890",
"email": "alice@example.com",
"name": "John Doe"
}
Output data:
{
"phone": "****0101",
"credit_card": "XXXXXXXXXXXXXXX7890",
"email": "a*************com",
"name": "********"
}
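The length-preserving behaviour can be sketched in a few lines of plain Python. The `redact` function below is a hypothetical stand-in written for this tutorial, not MaskMe's implementation; in particular, hiding the whole value when the kept parts would cover it entirely is an assumption.

```python
def redact(value, char="*", keep_start=0, keep_end=0):
    """Replace all but the first keep_start and last keep_end characters."""
    text = str(value)
    hidden = len(text) - keep_start - keep_end
    if hidden <= 0:
        # The kept ends would cover the whole value; hide everything instead
        return char * len(text)
    return text[:keep_start] + char * hidden + text[len(text) - keep_end:]

print(redact("555-0101", keep_end=4))            # ****0101
print(redact("555-0101", char="X", keep_end=4))  # XXXX0101
print(redact("John Doe"))                        # ********
```

Separators such as the dash count as ordinary characters, so they are replaced too and the output always has the same length as the input.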
Noise
The Noise strategy adds statistical noise to numeric values.
Use when: You need to keep numbers but make individual values unrecognizable while preserving statistical distributions.
Common use: Ages, salaries, purchase amounts, measurements.
Two modes — which one for you?
| Mode | Pick this when… | You must provide |
|---|---|---|
| Direct sigma | You want simple, predictable noise. No formal privacy guarantees needed. | sigma |
| Differential Privacy | You need a formal, provable privacy guarantee (HIPAA, research, regulation). | epsilon, sensitivity (and optionally delta) |
Direct sigma — picking sigma
Adds Gaussian noise with standard deviation sigma. About 68% of outputs land within ±sigma of the true value, 95% within ±2×sigma.
Pick sigma relative to your data’s scale:
| Field | Typical range | Suggested sigma | Effect |
|---|---|---|---|
| Age | 0–100 | 2–5 | Most values shift ±2–5 years |
| Salary | $30k–$200k | 5k–10k | Most shift ±$5k–$10k |
| Rating (1–5) | 1–5 | 0.5–1 | Most shift ±0.5–1 stars |
| Purchase count | 0–1,000 | 50–100 | Most shift ±50–100 |
Start at the low end, check if the noise is sufficient for your use case, and increase if needed.
{
"age": {
"strategy": "noise",
"sigma": 2,
"seed": "reproducible-2026"
},
"salary": {
"strategy": "noise",
"sigma": 5000,
"precision": 0,
"min_val": 20000,
"max_val": 500000,
"seed": "reproducible-2026"
}
}
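As a rough picture of what direct sigma mode does, the sketch below adds seeded Gaussian noise with the standard library, then clips and rounds. The `add_noise` and `noisy_run` helpers are hypothetical, and the order of the steps (noise, clip, round) is an assumption rather than documented MaskMe behaviour.

```python
import random

def add_noise(value, sigma, rng, min_val=None, max_val=None, precision=None):
    """Add Gaussian noise, then clip and round (step order is an assumption)."""
    noisy = value + rng.gauss(0.0, sigma)
    if min_val is not None:
        noisy = max(min_val, noisy)
    if max_val is not None:
        noisy = min(max_val, noisy)
    if precision is not None:
        noisy = round(noisy, precision)
    return noisy

def noisy_run(values, seed):
    # One generator per run: the same seed replays the same noise sequence
    rng = random.Random(seed)
    return [add_noise(v, 5000, rng, 20000, 500000, 0) for v in values]

salaries = [30000, 95000, 200000]
run_a = noisy_run(salaries, "reproducible-2026")
run_b = noisy_run(salaries, "reproducible-2026")
print(run_a == run_b)  # True

# Empirical check of the "about 68% within ±sigma" rule for sigma = 2
rng = random.Random(0)
draws = [rng.gauss(0.0, 2) for _ in range(10_000)]
share = sum(abs(d) <= 2 for d in draws) / len(draws)
print(f"{share:.2f}")
```

A fixed seed is convenient for testing, but it also means anyone who knows the seed can regenerate the noise, so treat it like a secret.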
Differential Privacy — picking epsilon
Noise is calibrated from epsilon (privacy budget). Smaller epsilon = stronger privacy = more noise.
| Epsilon | Privacy | When to use |
|---|---|---|
| 0.1 – 0.5 | High | Medical records, financial data, strict compliance |
| 0.5 – 2.0 | Moderate | Most business & research datasets |
| 2.0 – 10 | Low | Non-sensitive analytics, aggregate stats |
| > 10 | Minimal | Rarely useful; little privacy left |
Picking sensitivity
sensitivity = the maximum possible change one person’s data can cause.
The safe formula: sensitivity = max_value − min_value for the field.
If you clamp salaries to $30k–$200k with min_val / max_val, set sensitivity to 170k. This way the DP noise is proportional to your actual data range.
Picking delta
Standard default: 1e-5 (roughly a 1 in 100,000 chance that the epsilon guarantee does not hold). For large datasets, a common rule of thumb is to keep delta well below 1 / number_of_records.
{
"salary": {
"strategy": "noise",
"epsilon": 1.0,
"sensitivity": 10000,
"delta": 1e-5,
"min_val": 20000,
"max_val": 500000,
"seed": "reproducible-2026"
}
}
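To get a feel for how epsilon, sensitivity and delta interact, here is the classic textbook calibration of the Gaussian mechanism, sigma = sensitivity × sqrt(2 ln(1.25 / delta)) / epsilon, which is stated for epsilon strictly between 0 and 1. The tutorial does not say which calibration MaskMe applies, so treat `gaussian_sigma` as a hypothetical helper for building intuition only.

```python
import math

def gaussian_sigma(epsilon, sensitivity, delta):
    """Classic Gaussian-mechanism noise scale, stated for 0 < epsilon < 1."""
    if not 0 < epsilon < 1:
        raise ValueError("this bound is stated for 0 < epsilon < 1")
    if not 0 < delta < 1:
        raise ValueError("delta must be in (0, 1)")
    return sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / epsilon

# Salary clamped to $30k-$200k, so sensitivity = 200000 - 30000
for eps in (0.1, 0.5, 0.9):
    sigma = gaussian_sigma(eps, sensitivity=170_000, delta=1e-5)
    print(f"epsilon={eps}: sigma is about {sigma:,.0f}")
```

The loop shows the trade-off from the epsilon table in numbers: halving epsilon doubles the noise, and the noise scales linearly with the sensitivity you declare.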
All parameters quick reference
| Parameter | Mode | What it does | How to choose |
|---|---|---|---|
| sigma | Direct | Std dev of noise | ~5–10% of data range |
| epsilon | DP | Privacy budget | 0.1–0.5 (high), 0.5–2 (moderate), 2–10 (low) |
| sensitivity | DP | Max possible change | max − min of your field |
| delta | DP | Failure probability | 1e-5 (default) |
| min_val | Both | Lower clip bound | Realistic minimum |
| max_val | Both | Upper clip bound | Realistic maximum |
| precision | Both | Decimal places | 0 for integers |
| seed | Both | Reproducible noise | Any string; same seed = same noise |
Example 1: Direct Sigma Mode
Rules:
{
"age": {
"strategy": "noise",
"sigma": 2,
"seed": "reproducible-2026"
}
}
Input data:
{
"age": 45,
"visitor_id": "V-001"
}
Output data (example):
{
"age": 42,
"visitor_id": "V-001"
}
Note
Without a seed, noise is random, so the same input produces different output on each run; with a seed (as in this example), the output is reproducible. Larger sigma = more privacy but more data distortion.
Example 2: With Clipping Bounds
Rules:
{
"salary": {
"strategy": "noise",
"sigma": 5000,
"min_val": 20000,
"max_val": 150000,
"seed": "reproducible-2026"
}
}
Input data:
{
"salary": 95000,
"dept": "Engineering"
}
Output data (example):
{
"salary": 98234,
"dept": "Engineering"
}
Generalize
The Generalize strategy coarsens data to broader categories.
Use when: You want to keep the type of information but remove specificity.
Common use: Dates (year only), locations (state instead of city), ages (brackets).
For numeric data (ages, scores, amounts):
{
"age": {
"strategy": "generalize",
"step": 10,
"method": "range"
},
"score": {
"strategy": "generalize",
"bins": [0, 50, 70, 90, 100],
"method": "range"
}
}
For dates:
{
"birth_date": {
"strategy": "generalize",
"method": "date_year"
},
"visit_month": {
"strategy": "generalize",
"method": "date_month"
}
}
For locations (comma-separated):
{
"full_address": {
"strategy": "generalize",
"depth": 1
}
}
Parameters:
step: Fixed bracket size (e.g., 10 for ages 0-10, 10-20, etc.)
bins: Custom brackets, e.g. [0, 18, 30, 50, 100] → “0-18”, “18-30”, etc.
method: "range" (shows bracket), "floor" (lower bound only), "date_year", "date_month"
depth: For locations, number of leading parts to remove
Example 1: Numeric with Step
Rules:
{
"age": {
"strategy": "generalize",
"step": 10,
"method": "range"
}
}
Input data:
{
"age": 27,
"name": "Alice"
}
Output data:
{
"age": "20-30",
"name": "Alice"
}
Example 2: Custom Bins
Rules:
{
"score": {
"strategy": "generalize",
"bins": [0, 50, 70, 90, 100],
"method": "range"
}
}
Input data:
{
"score": 87,
"student_id": "S-456"
}
Output data:
{
"score": "70-90",
"student_id": "S-456"
}
Example 3: Dates
Rules:
{
"birth_date": {
"strategy": "generalize",
"method": "date_year"
},
"visit_date": {
"strategy": "generalize",
"method": "date_month"
}
}
Input data:
{
"birth_date": "1995-06-15",
"visit_date": "2024-03-15"
}
Output data:
{
"birth_date": "1995",
"visit_date": "2024-03"
}
Example 4: Locations
Rules:
{
"full_address": {
"strategy": "generalize",
"depth": 1
}
}
Input data:
{
"full_address": "New York,USA,Home"
}
Output data:
{
"full_address": "USA,Home"
}
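The four examples above can be mirrored with a short standard-library sketch. These helper functions are hypothetical and only approximate the documented behaviour; treating each bracket as including its lower bound, and clamping out-of-range values to the edge brackets, are assumptions.

```python
import bisect

def generalize_step(value, step, method="range"):
    """Fixed-width brackets: 27 with step 10 falls into 20-30."""
    low = (value // step) * step
    return f"{low}-{low + step}" if method == "range" else str(low)

def generalize_bins(value, bins):
    """Custom brackets; each bracket includes its lower bound (assumption)."""
    i = bisect.bisect_right(bins, value) - 1
    i = max(0, min(i, len(bins) - 2))  # clamp to the first/last bracket
    return f"{bins[i]}-{bins[i + 1]}"

def generalize_date(iso_date, method):
    """ISO dates are YYYY-MM-DD, so slicing is enough."""
    return iso_date[:4] if method == "date_year" else iso_date[:7]

def generalize_location(value, depth):
    """Drop the first `depth` comma-separated parts."""
    return ",".join(value.split(",")[depth:])

print(generalize_step(27, 10))                        # 20-30
print(generalize_bins(87, [0, 50, 70, 90, 100]))      # 70-90
print(generalize_date("1995-06-15", "date_year"))     # 1995
print(generalize_date("2024-03-15", "date_month"))    # 2024-03
print(generalize_location("New York,USA,Home", 1))    # USA,Home
```

Wider steps and coarser bins give larger groups of identical values, which is what raises k-anonymity later in Part 3.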
Privacy regulations like HIPAA require a structured approach to data anonymization:
STEP 1: Remove Direct Identifiers (HIPAA Requirement)
├─ Is it a direct identifier? (name, SSN, medical record ID, etc.)
│ └─ YES → drop
│ └─ NO → go to STEP 2
STEP 2: Handle Quasi-Identifiers (Latanya Sweeney's Research)
│
│ Note: Deletion alone is not sufficient.
│ Quasi-identifiers can be re-linked to external data.
│ Example: Birthdate + gender + zipcode uniquely identify about 87% of the US population
│
├─ Is it a quasi-identifier? (age, zipcode, birthdate, etc.)
│ ├─ Age/Income/Numeric → generalize (brackets) or noise
│ ├─ Date → generalize (year/month only)
│ ├─ Location → generalize (depth)
│ └─ NO → go to STEP 3
STEP 3: Preserve Analytical Payloads
├─ Is it analysis-critical? (medical codes, procedures, measurements)
│ ├─ Sensitive + Need Linkage → hash (consistent transformation)
│ ├─ Sensitive + Need Pattern → redact (partial visibility)
│ ├─ Sensitive + Numeric → noise (with bounds)
│ ├─ Needed information → keep
│ └─ NO → go to STEP 4
STEP 4: Non-Sensitive Data
└─ Not sensitive, not quasi-identifier → keep
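If you classify many fields, the tree above can be written down as a small function. This is only a sketch: the `choose_strategy` name and the `kind` and `need` labels are hypothetical, and for quasi-identifiers it always returns `generalize` even though the tree also allows `noise` for numeric fields.

```python
def choose_strategy(kind, need=None):
    """kind: 'direct', 'quasi', 'payload' or 'other' (hypothetical labels)."""
    if kind == "direct":
        return "drop"        # STEP 1: direct identifiers are removed
    if kind == "quasi":
        return "generalize"  # STEP 2: noise is the numeric alternative
    if kind == "payload":    # STEP 3: depends on what the analysis needs
        by_need = {"linkage": "hash", "pattern": "redact", "numeric": "noise"}
        return by_need.get(need, "keep")
    return "keep"            # STEP 4: non-sensitive data

fields = {
    "patient_id": ("direct", None),
    "age": ("quasi", None),
    "diagnosis": ("payload", "linkage"),
    "medication": ("other", None),
}
plan = {name: choose_strategy(kind, need) for name, (kind, need) in fields.items()}
print(plan)
```

Keeping the classification in one place like this also gives you a record of why each strategy was chosen.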
Part 3: Measuring Privacy and Utility
Anonymization is not a one-shot decision — you need to measure the outcome to verify privacy is protected and data remains useful.
MaskMe provides two families of analytics:
Risk analytics — measure re-identification risk
Utility analytics — measure data quality loss
Both are available through the CLI and the Python API, and both can generate rich HTML reports.
# Re-identification risk
maskme analyze risk --input visits_masked.csv \
--qi age postal_code --sa diagnosis \
--report risk_report.html
# Data utility
maskme analyze utility --original visits.csv \
--anonymized visits_masked.csv \
--report utility_report.html
The --qi flag takes the quasi-identifier fields — attributes that could
be combined to re-identify someone (age + postal code, birthdate + gender, etc.).
The --sa flag specifies the sensitive attribute to protect.
Three metrics work together to measure re-identification risk.
k-Anonymity
Every record must be indistinguishable from at least k-1 other records based on quasi-identifiers. A low k_min means some records are easy to single out.
maskme analyze risk --input visits_masked.csv \
--qi age postal_code --sa diagnosis \
--k-threshold 5
Key metrics in the output:
| Metric | What it tells you |
|---|---|
| k_min | Smallest equivalence class size (worst case) |
| | Typical class size across the dataset |
| | Records in classes below the threshold |
| | Percentage of records at risk |
If k_min is 1, the dataset has unique records that can be directly re-identified. Solution: generalize or suppress more aggressively.
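The idea behind k_min can be sketched with a `Counter`: group records by their quasi-identifier values and look at the smallest group. This is plain Python written for illustration, not MaskMe's analytics code; `equivalence_classes` is a hypothetical helper and the fourth record is made up to create a unique class.

```python
from collections import Counter

def equivalence_classes(records, quasi_identifiers):
    """Count records per combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)

records = [
    {"age": "30-35", "postal_code": "75001", "diagnosis": "Flu"},
    {"age": "30-35", "postal_code": "75001", "diagnosis": "Diabetes"},
    {"age": "30-35", "postal_code": "75001", "diagnosis": "Flu"},
    {"age": "40-45", "postal_code": "75002", "diagnosis": "Asthma"},
]

classes = equivalence_classes(records, ["age", "postal_code"])
k_min = min(classes.values())
k_threshold = 5
at_risk = sum(size for size in classes.values() if size < k_threshold)

print(k_min)                  # 1
print(at_risk, len(records))  # 4 4
```

Here the single 40-45 / 75002 record is unique, so k_min is 1 no matter how well the other three records blend together.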
l-Diversity
Even when records are k-anonymous, if everyone in a class has the same
sensitive value (e.g. all have the same diagnosis), attribute disclosure
is possible. l-diversity requires at least l distinct sensitive values
per equivalence class.
maskme analyze risk --input visits_masked.csv \
--qi age postal_code --sa diagnosis \
--l-threshold 3
Key metrics:
| Metric | What it tells you |
|---|---|
| l_min | Fewest distinct sensitive values in any class |
| | Classes with fewer distinct values than the threshold |
Distinct l-diversity (the variant MaskMe uses) is a good baseline. An l_min of 1 means at least one class has only one sensitive value, which amounts to full attribute disclosure for those records.
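Distinct l-diversity is simple enough to sketch directly: collect the set of sensitive values per equivalence class and count them. The `distinct_l` helper and the sample records are illustrative, not part of MaskMe.

```python
from collections import defaultdict

def distinct_l(records, quasi_identifiers, sensitive_attr):
    """Number of distinct sensitive values in each equivalence class."""
    values = defaultdict(set)
    for r in records:
        key = tuple(r[q] for q in quasi_identifiers)
        values[key].add(r[sensitive_attr])
    return {key: len(v) for key, v in values.items()}

records = [
    {"age": "30-35", "postal_code": "75001", "diagnosis": "Flu"},
    {"age": "30-35", "postal_code": "75001", "diagnosis": "Diabetes"},
    {"age": "30-35", "postal_code": "75001", "diagnosis": "Flu"},
    {"age": "40-45", "postal_code": "75002", "diagnosis": "Asthma"},
    {"age": "40-45", "postal_code": "75002", "diagnosis": "Asthma"},
]

per_class = distinct_l(records, ["age", "postal_code"], "diagnosis")
l_min = min(per_class.values())
print(per_class[("30-35", "75001")])  # 2
print(l_min)                          # 1
```

The second class is 2-anonymous, yet both of its records share the diagnosis, so knowing someone is in that class reveals it.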
t-Closeness
Measures how much the distribution of the sensitive attribute within each class deviates from the global distribution. Uses Earth Mover’s Distance (EMD).
maskme analyze risk --input visits_masked.csv \
--qi age postal_code --sa diagnosis \
--t-threshold 0.15
Key metrics:
| Metric | What it tells you |
|---|---|
| t_max | Maximum EMD across all classes (worst case) |
| | Classes exceeding the threshold |
A t_max above 0.2 means some classes have a significantly different
sensitive distribution — an adversary could infer the sensitive value with
higher confidence than from the global distribution alone.
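For a categorical attribute where every pair of values is treated as equally far apart, the Earth Mover's Distance reduces to the total variation distance, half the sum of absolute differences between the two distributions. The sketch below uses that simplification; which ground distance MaskMe uses is not stated in this tutorial, so the helpers and numbers here are illustrative only.

```python
from collections import Counter, defaultdict

def distribution(values):
    """Relative frequency of each value."""
    counts = Counter(values)
    return {v: c / len(values) for v, c in counts.items()}

def tvd(p, q):
    """Total variation distance between two categorical distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def t_per_class(records, quasi_identifiers, sensitive_attr):
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[q] for q in quasi_identifiers)].append(r[sensitive_attr])
    overall = distribution([r[sensitive_attr] for r in records])
    return {key: tvd(distribution(vals), overall) for key, vals in groups.items()}

records = [
    {"age": "30-35", "diagnosis": "Flu"},
    {"age": "30-35", "diagnosis": "Flu"},
    {"age": "40-45", "diagnosis": "Flu"},
    {"age": "40-45", "diagnosis": "Diabetes"},
]

distances = t_per_class(records, ["age"], "diagnosis")
t_max = max(distances.values())
print(t_max)  # 0.25
```

Overall, 75% of records have Flu; the first class has 100% and the second 50%, so both sit 0.25 away from the global distribution.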
Threshold recommendations:
| Metric | Strict | Moderate | Relaxed |
|---|---|---|---|
| k-anonymity | 5–10 | 3–5 | 2 |
| l-diversity | 5+ | 3–5 | 2 |
| t-closeness | 0.05–0.10 | 0.10–0.20 | 0.20–0.50 |
from maskme.analytics.risk import run, report
records = [
    {"age": "30-35", "postal_code": "75001", "diagnosis": "Flu"},
    {"age": "30-35", "postal_code": "75001", "diagnosis": "Diabetes"},
    {"age": "30-35", "postal_code": "75001", "diagnosis": "Flu"},
    # ...
]
results = run(
    records=records,
    quasi_identifiers=["age", "postal_code"],
    sensitive_attr="diagnosis",
)
for r in results:
    status = "✓" if r.passed else "✗"
    print(f"{status} {r.name}: passed={r.passed}")
# HTML report
report.generate(
    results,
    output_path="risk_report.html",
    dataset_info={"records": len(records)},
)
Run individual metrics:
results = run(
records=records,
quasi_identifiers=["age", "postal_code"],
sensitive_attr="diagnosis",
analytics=["k_anonymity", "t_closeness"], # skip l-diversity
k_threshold=5,
t_threshold=0.1,
)
Anonymization always destroys some information. Utility metrics measure how much data quality is preserved.
Field Retention
Measures how many values remain unchanged per field.
maskme analyze utility --original visits.csv \
--anonymized visits_masked.csv
Key output:
✓ Field Retention score=0.85 passed
This means 85% of values across all fields were preserved identically. If a score is low, check which fields are being heavily modified — they may be over-anonymized.
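A retention score of this kind can be sketched as the share of values that are identical before and after masking. The `field_retention` helper below is hypothetical; counting dropped fields as changed and averaging the per-field scores into one overall number are assumptions about how such a metric could be aggregated, not a description of MaskMe's internals.

```python
def field_retention(original, anonymized):
    """Share of unchanged values per field, plus the mean over all fields."""
    per_field = {}
    for field in original[0].keys():
        unchanged = sum(
            1 for o, a in zip(original, anonymized)
            if field in a and a[field] == o[field]  # a dropped field counts as changed
        )
        per_field[field] = unchanged / len(original)
    overall = sum(per_field.values()) / len(per_field)
    return per_field, overall

original = [
    {"patient_id": "P1", "age": 32, "medication": "Metformin"},
    {"patient_id": "P2", "age": 47, "medication": "Ibuprofen"},
]
anonymized = [
    {"age": "30-35", "medication": "Metformin"},
    {"age": "45-50", "medication": "Ibuprofen"},
]

per_field, overall = field_retention(original, anonymized)
print(per_field)          # {'patient_id': 0.0, 'age': 0.0, 'medication': 1.0}
print(round(overall, 2))  # 0.33
```

The per-field view is usually more informative than the single score, since it shows exactly which rules are responsible for the loss.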
Statistical Fidelity
Compares statistical properties before and after anonymization: means, standard deviations, distributions. For categorical fields, uses Total Variation Distance; for numerical, uses normalised delta and Spearman rank correlation.
✓ Statistical Fidelity score=0.72 passed
A score of 0.72 means statistical patterns are reasonably preserved. Low-scoring fields are listed in the detailed output.
Information Loss
The Information Loss Index (ILI) measures how much each field deviates
from its original values. Utility score = 1 − ILI.
✓ Information Loss score=0.65 passed
Numerical fields use Normalised Mean Absolute Error — a value with 20% error contributes ILI of 0.2. Categorical fields count any change (hash, redact) as full loss (ILI = 1.0).
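The two rules above can be sketched as follows. How MaskMe normalises the error is not spelled out here, so this sketch assumes relative error against the original value, capped at 1.0; the `numeric_ili` and `categorical_ili` helpers are hypothetical.

```python
def numeric_ili(original, anonymized):
    """Mean relative absolute error, capped at 1.0 per value (assumed normalisation)."""
    errors = []
    for o, a in zip(original, anonymized):
        scale = abs(o) if o != 0 else 1.0  # avoid dividing by zero
        errors.append(min(1.0, abs(a - o) / scale))
    return sum(errors) / len(errors)

def categorical_ili(original, anonymized):
    """Any changed value counts as full loss."""
    changed = sum(1 for o, a in zip(original, anonymized) if o != a)
    return changed / len(original)

print(numeric_ili([100], [120]))                       # 0.2
print(categorical_ili(["Flu", "Flu"], ["Flu", "x9"]))  # 0.5
print(1 - numeric_ili([100], [120]))                   # utility score: 0.8
```

Under this reading, hashing or redacting a categorical field always yields ILI = 1.0 for that field, which is why heavily hashed datasets score low on this metric even when they remain linkable.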
Threshold recommendations:
| Metric | Good | Acceptable | Needs review |
|---|---|---|---|
| Field Retention | ≥ 0.80 | 0.50–0.80 | < 0.50 |
| Statistical Fidelity | ≥ 0.80 | 0.60–0.80 | < 0.60 |
| Information Loss (score) | ≥ 0.80 | 0.50–0.80 | < 0.50 |
from maskme.analytics.utility import run, report
results = run(
    original=original_records,
    anonymised=anonymised_records,
)
for r in results:
    print(f"{r.name}: score={r.score:.2f} passed={r.passed}")
# HTML report
report.generate(
    results,
    output_path="utility_report.html",
    dataset_info={"records": len(original_records)},
)
Explicitly declare field types for better accuracy:
results = run(
original=original_records,
anonymised=anonymised_records,
numerical_fields=["age", "purchase_count"],
categorical_fields=["region", "diagnosis"],
field_retention_threshold=0.6,
statistical_fidelity_threshold=0.7,
information_loss_threshold=0.6,
)
Both maskme analyze risk and maskme analyze utility generate
self-contained HTML reports when --report is specified:
maskme analyze risk --input visits_masked.csv \
--qi age postal_code --sa diagnosis \
--k-threshold 3 --report risk_report.html
# Opens in any browser → risk_report.html
maskme analyze utility --original visits.csv \
--anonymized visits_masked.csv \
--report utility_report.html
# → utility_report.html
Each report includes:
Executive summary with pass/fail badges
Per-metric section with score, threshold, and SVG charts
Detailed per-field / per-class breakdowns
Recommendations for improvement
There is no single “right” threshold. The balance depends on your use case:
| Scenario | Privacy priority | Utility priority | Typical thresholds |
|---|---|---|---|
| Medical research | High | High | k=5, l=3, t=0.1, utility ≥ 0.7 |
| Public dataset release | High | Medium | k=10, l=5, t=0.05, utility ≥ 0.5 |
| Internal analytics | Medium | High | k=3, l=2, t=0.2, utility ≥ 0.8 |
| Exploratory (masked sample) | Low | Maximum | k=2, l=2, t=0.5, utility ≥ 0.9 |
Rule of thumb: Run risk + utility together. If privacy passes but utility fails, your anonymization is too aggressive. If utility passes but privacy fails, you need stronger strategies. Adjust and re-measure.
Part 4: Anonymizing Unstructured Text with NER
So far we’ve worked with structured data (tabular records). But what about unstructured text — patient notes, emails, support tickets, or any free-form text containing names, locations, or dates?
MaskMe’s NER module detects and replaces personally identifiable information in free text using Named Entity Recognition (spaCy).
The NER module requires extra dependencies:
pip install maskme[ner]
python -m spacy download en_core_web_lg
python -m spacy download fr_core_news_lg
MaskMe auto-detects French (fr) and English (en). Install only the language(s) you need.
maskme ner document.txt -o document_anon.txt
Example (report.txt):
Le Dr. Martin a diagnostiqué Alice Johnson le 12 janvier 2024
à l'Hôpital Saint-Louis à Paris. Le traitement commence à 14h30.
Run:
maskme ner report.txt -l fr -o report_anon.txt
Output (report_anon.txt):
Le [PERSON] a diagnostiqué [PERSON] le [DATE]
à l'[ORGANISATION] à [LOCATION]. Le traitement commence à [TIME].
Key CLI options:
| Option | Description |
|---|---|
| -o | Output file path (stdout if omitted) |
| -l | Language override |
| --lines | Process each line as a separate text (batch mode) |
| | Enable debug logging |
Line-by-line mode is useful when each line is a separate record:
maskme ner patients.txt --lines -o patients_anon.txt
from maskme.ner import mask
# Single text → returns a PipelineResult
result = mask("Alice habite à Paris.")
print(result.output)
# → '[PERSON] habite à [LOCATION].'
print(result.entities)
# → [Entity(text='Alice', label=<EntityLabel.PERSON: 'PERSON'>, ...),
# Entity(text='Paris', label=<EntityLabel.LOCATION: 'LOCATION'>, ...)]
print(result.language)
# → 'fr'
# Batch processing → returns a list of PipelineResult
results = mask([
"Alice habite à Paris.",
"John lives in London.",
])
for r in results:
    print(f"[{r.language}] {r.output}")
# → [fr] [PERSON] habite à [LOCATION].
# → [en] [PERSON] lives in [LOCATION].
# Language hint (skip auto-detection)
result = mask("Dr. Smith works at General Hospital.", language="en")
print(result.output)
# → '[PERSON] works at [ORGANISATION].'
Detected entity types are replaced with bracketed tags:
| Entity Label | Tag | Example |
|---|---|---|
| PERSON | [PERSON] | Names, doctors, patients |
| LOCATION | [LOCATION] | Cities, countries, addresses |
| ORGANISATION | [ORGANISATION] | Hospitals, companies |
| DATE | [DATE] | Full dates, “12 janvier 2024” |
| TIME | [TIME] | Times, “14h30” |
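The tagging step itself is easy to picture once entities have been detected. The sketch below replaces character spans with bracketed tags; it does not run any NER model, the `apply_tags` helper is hypothetical, and the spans are located with `str.find` purely for the demo rather than taken from MaskMe's `Entity` objects.

```python
def apply_tags(text, spans):
    """Replace (start, end, label) spans with [LABEL]; spans must not overlap."""
    out = text
    # Work right to left so earlier offsets stay valid after each replacement
    for start, end, label in sorted(spans, reverse=True):
        out = out[:start] + f"[{label}]" + out[end:]
    return out

text = "Alice habite à Paris."
spans = []
for word, label in (("Alice", "PERSON"), ("Paris", "LOCATION")):
    start = text.find(word)
    spans.append((start, start + len(word), label))

print(apply_tags(text, spans))  # [PERSON] habite à [LOCATION].
```

Replacing from the end of the text backwards avoids having to recompute offsets after each substitution changes the string length.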
result = mask("Alice habite à Paris.")
result.output # Tagged text
result.input # Original text
result.entities # List of detected Entity objects
result.language # Detected language
result.entity_count # Number of entities masked
result.labels_found # Deduplicated, sorted entity labels
result.stats # Processing metadata (timing, detectors)
result.as_dict() # Serialize everything to a dict
Note
Without spaCy installed, the module degrades gracefully: mask() returns the original text unchanged and logs a warning with install instructions.
Part 5: Real-World Example
Let’s put it all together with a realistic scenario:
Scenario: A healthcare organization wants to share patient visit data for research while protecting privacy and meeting HIPAA compliance requirements.
Field classification (using the privacy-compliance decision tree):
patient_id: Direct identifier → drop (STEP 1: Remove direct identifiers per HIPAA)
age: Quasi-identifier → generalize (STEP 2: Protect quasi-identifiers as per Latanya Sweeney’s research)
postal_code: Quasi-identifier → generalize (STEP 2: Postal code + age can re-identify individuals)
diagnosis: Analytical payload → hash (STEP 3: Preserve for research while protecting identity)
visit_date: Quasi-identifier → generalize (STEP 2: Protect date information)
medication: Non-sensitive → keep (STEP 4: Analytical value without privacy risk)
Rules file (healthcare_rules.json):
{
"patient_id": "drop",
"age": {
"strategy": "generalize",
"step": 5,
"method": "range"
},
"postal_code": {
"strategy": "generalize",
"depth": 2
},
"diagnosis": {"strategy": "hash", "salt": "healthcare-2026"},
"visit_date": {
"strategy": "generalize",
"method": "date_year"
},
"medication": "keep"
}
Run anonymization:
maskme --rules healthcare_rules.json --input visits.csv --output visits_masked.csv
Next Steps
Now that you understand the basics:
Need to build a custom strategy? See How to Create a Custom Strategy
Want to know more about each strategy? See Anonymization Strategies
Building an application with MaskMe? See API Reference
Anonymizing unstructured text? See NER Module — Unstructured Text Anonymization
Curious about the architecture? See Architecture: The Anonymization Engine
Start with the least-destructive strategy: Keep > Generalize > Redact > Hash > Noise > Drop
Test with a small sample first: Verify behavior before running on production data
Always measure privacy: Don’t anonymize without understanding the privacy-utility trade-off
Use salt for hashing: Makes hashes unique to your organization
Document your choices: Track which strategies you chose and why for compliance