Getting Started with MaskMe

In this tutorial, you’ll learn how to anonymize datasets using MaskMe. Whether you prefer a command-line tool or a Python API for custom applications, you’ll see both approaches in action.

By the end, you’ll understand:

The two ways to use MaskMe (CLI vs Python API)
How each anonymization strategy works
How to choose the right strategy for your data
How to measure re-identification risk and data utility

Prerequisites

MaskMe requires Python 3.9 or later. Install the package with pip:

pip install maskme

Verify the installation:

maskme --version

Part 1: Your First Anonymization

Let’s start with a simple example. Imagine you have customer data you want to anonymize:

Sample data (customers.csv):

id,name,email,phone,region,purchase_count
1,Alice Johnson,alice@example.com,555-0101,US-CA,42
2,Bob Smith,bob@example.com,555-0102,US-NY,15
3,Carol White,carol@example.com,555-0103,US-TX,87

Define a rules file that specifies which strategy to apply to each field:

Rules file (rules.json):

{
  "id": "hash",
  "name": "drop",
  "email": "drop",
  "phone": {"strategy": "redact", "char": "X", "keep_start": 0, "keep_end": 4},
  "region": "keep",
  "purchase_count": "keep"
}

Now run MaskMe from the command line:

maskme --rules rules.json --input customers.csv --output customers_masked.csv

Inspect the result:

cat customers_masked.csv

Output:

id,phone,region,purchase_count
6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b,XXXX0101,US-CA,42
d4735e3a265e16eee03f59718b9b5d03019c07d8b6c51f90da3a666eec13ab35,XXXX0102,US-NY,15
4e07408562bedb8b60ce05c1decfe3ad16b72230967de01f640b7e4729b49fce,XXXX0103,US-TX,87

What happened:

id: Hashed (unreadable, but consistent for the same input)
name: Removed entirely (drop)
email: Removed entirely (drop)
phone: Last 4 digits kept, rest redacted (keep_end: 4)
region and purchase_count: Left unchanged (keep)

For developers building applications on top of MaskMe, use the Python API:

import csv
from maskme import MaskMe

# Define rules (same as the JSON file)
rules = {
    "id": "hash",
    "name": "drop",
    "email": "drop",
    "phone": {"strategy": "redact", "char": "X", "keep_start": 0, "keep_end": 4},
    "region": "keep",
    "purchase_count": "keep"
}

# Load data
with open("customers.csv", "r", newline="") as f:
    reader = csv.DictReader(f)
    records = list(reader)

# Initialize engine
engine = MaskMe(rules)

# Process records
masked_records = list(engine.mask(records))

# Save results
with open("customers_masked.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=masked_records[0].keys())
    writer.writeheader()
    writer.writerows(masked_records)

Key advantage: You can integrate MaskMe directly into your Python applications — pipelines, data processing scripts, ML workflows, etc.

Part 2: Understanding Strategies

MaskMe provides six strategies, each striking a different balance between privacy and utility. Let’s explore when to use each.

Strategy Overview
Strategy	What It Does	Best For
Keep	Preserves original value unchanged	Non-sensitive fields, public data
Drop	Removes field entirely	Direct identifiers (PII)
Hash	One-way cryptographic digest	Consistent linking without revealing values
Redact	Replaces characters with placeholders	Partial visibility (e.g., last 4 digits)
Noise	Adds statistical noise to numeric values	Numeric data with preserved distributions
Generalize	Coarsens data into broader categories	Ages, dates, locations

Keep

The Keep strategy keeps the original value unchanged.

Use when: The field is already public, non-sensitive, or represents a key analytical dimension.

Example: Geographic region, product category, or timestamp.

In rules file:

{
  "region": "keep",
  "category": "keep"
}

Input data:

{
  "region": "US-West",
  "category": "Electronics"
}

Output data:

{
  "region": "US-West",
  "category": "Electronics"
}

Drop

The Drop strategy removes the field entirely.

Use when: The field is a direct identifier (PII) that could lead to re-identification, or if the field is unnecessary for the final dataset.

Example: Names, email addresses, phone numbers, social security numbers.

In rules file:

{
  "user_id": "drop",
  "internal_ref": "drop"
}

Input data:

{
  "user_id": "USR-12345",
  "internal_ref": "REF-999",
  "email": "john@example.com"
}

Output data:

{
  "email": "john@example.com"
}

Hash

The Hash strategy converts the value into a fixed-length hexadecimal digest.

Use when: You need a consistent, one-way transformation (cannot be reversed).

Common use: Email addresses, usernames, customer IDs when consistency matters for record linking.

Parameters:

algo: Hashing algorithm — sha256 (default), sha512, blake2b, etc.
salt: Recommended. Salt string for extra security. Same input + same salt = same output.

In rules file:

{
  "email": "hash",
  "customer_id": {
    "strategy": "hash",
    "salt": "my-org-secret-2026"
  },
  "diagnosis_code": {
    "strategy": "hash",
    "algo": "sha512",
    "salt": "healthcare-key"
  }
}

Input data:

{
  "email": "alice@example.com",
  "customer_id": "CUST-5678",
  "diagnosis_code": "E11.9"
}

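Output data (illustrative placeholders; real SHA-256 digests are 64 hex characters and SHA-512 digests are 128):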

{
  "email": "2a7f8e9c3b1d5f4a6e8c9b1d3f5a7e8c",
  "customer_id": "f4d8b7a9c1e3f5d9b2a4c6e8f1a3d5b7",
  "diagnosis_code": "9f8e7d6c5b4a3f2e1d0c9b8a7f6e5d4ca1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6"
}

Note

Same input + same salt = same output. This is useful for linking records across datasets.
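To make the consistency property concrete, here is a minimal sketch using Python's standard hashlib. The helper name hash_value and the way the salt is combined with the value (simple prefixing) are assumptions for illustration only; MaskMe's internal scheme may differ.

```python
import hashlib


def hash_value(value: str, salt: str = "", algo: str = "sha256") -> str:
    """Return the hex digest of salt + value (hypothetical salting scheme)."""
    digest = hashlib.new(algo)
    digest.update((salt + value).encode("utf-8"))
    return digest.hexdigest()


# Same input + same salt always yields the same digest.
a = hash_value("CUST-5678", salt="my-org-secret-2026")
b = hash_value("CUST-5678", salt="my-org-secret-2026")
print(a == b)  # → True

# A different salt yields an unrelated digest, so datasets hashed with
# different salts cannot be joined on the hashed column.
c = hash_value("CUST-5678", salt="another-secret")
print(a == c)  # → False

# Digest length depends on the algorithm.
print(len(hash_value("E11.9")), len(hash_value("E11.9", algo="sha512")))  # → 64 128
```

With this sketch, hash_value("1") with no salt reproduces the first id digest shown in Part 1, which is plain SHA-256 of the string "1". That also shows why a secret salt matters: unsalted hashes of small or guessable value spaces can be recovered by hashing candidate inputs.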

Redact

The Redact strategy replaces characters with a placeholder, preserving length.

Use when: You need some visible information (like last 4 digits) while hiding the rest.

Common use: Phone numbers, credit cards, partially-visible IDs.

Parameters:

char: Placeholder character (default: *)
keep_start: Characters to show at the beginning (default: 0)
keep_end: Characters to show at the end (default: 0)

In rules file:

{
  "phone": {
    "strategy": "redact",
    "keep_end": 4
  },
  "credit_card": {
    "strategy": "redact",
    "char": "X",
    "keep_end": 4
  },
  "email": {
    "strategy": "redact",
    "keep_start": 1,
    "keep_end": 3,
    "char": "*"
  },
  "name": "redact"
}

Input data:

{
  "phone": "555-0101",
  "credit_card": "4532123456787890",
  "email": "alice@example.com",
  "name": "John Doe"
}

Output data:

{
  "phone": "****0101",
  "credit_card": "XXXXXXXXXXXX7890",
  "email": "a*************com",
  "name": "********"
}
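The length-preserving behaviour described above can be sketched in a few lines of plain Python. The function below is a hypothetical stand-in, not MaskMe's implementation; in particular, returning short values unchanged when keep_start + keep_end covers the whole string is an assumption.

```python
def redact(value: str, char: str = "*", keep_start: int = 0, keep_end: int = 0) -> str:
    """Replace the middle of value with char, keeping the string length."""
    n = len(value)
    if keep_start + keep_end >= n:
        # Nothing left to hide (assumed behaviour for short values).
        return value
    hidden = n - keep_start - keep_end
    return value[:keep_start] + char * hidden + value[n - keep_end:]


print(redact("555-0101", keep_end=4))  # → ****0101
print(redact("John Doe"))              # → ********
```

Note that separators such as the dash in the phone number are masked like any other character, which is why the output keeps the same length as the input.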

Noise

The Noise strategy adds statistical noise to numeric values.

Use when: You need to keep numbers but make individual values unrecognizable while preserving statistical distributions.

Common use: Ages, salaries, purchase amounts, measurements.

Two modes — which one for you?

Mode	Pick this when…	You must provide
Direct sigma	You want simple, predictable noise. No formal privacy guarantees needed.	`sigma`
Differential Privacy	You need a formal, provable privacy guarantee (HIPAA, research, regulation).	`epsilon` + `sensitivity`

Direct sigma — picking sigma

Adds Gaussian noise with standard deviation sigma. About 68% of outputs land within ±sigma of the true value, 95% within ±2×sigma.

Pick sigma relative to your data’s scale:

Rule of thumb
Field	Typical range	Suggested sigma	Effect
Age	0–100	2–5	Most values shift ±2–5 years
Salary	$30k–$200k	5k–10k	Most shift ±$5k–$10k
Rating (1–5)	1–5	0.5–1	Most shift ±0.5–1 stars
Purchase count	0–1,000	50–100	Most shift ±50–100

Start at the low end, check if the noise is sufficient for your use case, and increase if needed.

{
  "age": {
    "strategy": "noise",
    "sigma": 2,
    "seed": "reproducible-2026"
  },
  "salary": {
    "strategy": "noise",
    "sigma": 5000,
    "precision": 0,
    "min_val": 20000,
    "max_val": 500000,
    "seed": "reproducible-2026"
  }
}
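For intuition about what these parameters do, the following sketch adds Gaussian noise, then clips and rounds the result. It relies only on Python's random module; the function name add_noise and the order of operations (noise, clip, round) are assumptions, and MaskMe may derive per-value noise from the seed differently.

```python
import random


def add_noise(value, sigma, rng, min_val=None, max_val=None, precision=None):
    """Add Gaussian noise to value, then clip and round (illustrative only)."""
    noisy = value + rng.gauss(0.0, sigma)
    if min_val is not None:
        noisy = max(min_val, noisy)
    if max_val is not None:
        noisy = min(max_val, noisy)
    if precision is not None:
        noisy = round(noisy, precision)
        if precision == 0:
            noisy = int(noisy)
    return noisy


# A seeded generator makes the noise reproducible across runs.
rng = random.Random("reproducible-2026")
salaries = [95000, 42000, 18000]
masked = [
    add_noise(s, sigma=5000, rng=rng, min_val=20000, max_val=500000, precision=0)
    for s in salaries
]
print(masked)  # every value lies between 20000 and 500000
```

Clipping keeps outputs realistic, but values that started outside the bounds (such as 18000 here) are pulled to the boundary, so heavy clipping can distort the distribution near the edges.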

Differential Privacy — picking epsilon

Noise is calibrated from epsilon (privacy budget). Smaller epsilon = stronger privacy = more noise.

Epsilon cheat sheet
Epsilon	Privacy	When to use
0.1 – 0.5	High	Medical records, financial data, strict compliance
0.5 – 2.0	Moderate	Most business & research datasets
2.0 – 10	Low	Non-sensitive analytics, aggregate stats
> 10	Minimal	Rarely useful — little privacy left

Picking sensitivity

sensitivity = the maximum possible change one person’s data can cause.

The safe formula: sensitivity = max_value − min_value for the field.

If you clamp salaries to $30k–$200k with min_val / max_val, set sensitivity to 170k. This way the DP noise is proportional to your actual data range.

Picking delta

Standard default: 1e-5, meaning the epsilon guarantee may fail to hold with probability at most 1 in 100,000. For large datasets, choose a delta smaller than 1 / number_of_records.

{
  "salary": {
    "strategy": "noise",
    "epsilon": 1.0,
    "sensitivity": 10000,
    "delta": 1e-5,
    "min_val": 20000,
    "max_val": 500000,
    "seed": "reproducible-2026"
  }
}
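To get a feel for how much noise a given epsilon implies, the sketch below uses the classic Gaussian mechanism calibration, sigma = sensitivity × sqrt(2 ln(1.25 / delta)) / epsilon. Whether MaskMe uses exactly this formula is an assumption, and the classic bound is formally stated for epsilon below 1, so treat the numbers as an order-of-magnitude guide.

```python
import math


def gaussian_sigma(epsilon: float, sensitivity: float, delta: float) -> float:
    """Noise standard deviation under the classic Gaussian mechanism (assumed)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if not 0 < delta < 1:
        raise ValueError("delta must be between 0 and 1")
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


# Values from the rules example above.
print(round(gaussian_sigma(epsilon=1.0, sensitivity=10000, delta=1e-5)))

# Sensitivity set to the full clamped range ($30k to $200k) from the text.
print(round(gaussian_sigma(epsilon=1.0, sensitivity=170000, delta=1e-5)))

# Halving epsilon doubles the noise.
print(round(gaussian_sigma(epsilon=0.5, sensitivity=10000, delta=1e-5)))
```

Under this calibration the first case gives a sigma of roughly 48,000, and sigma scales linearly with sensitivity. Range-sized sensitivities therefore produce very large per-value noise, so check the utility metrics after choosing these parameters.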

All parameters quick reference

Parameter	Mode	What it does	How to choose
`sigma`	Direct	Std dev of noise	~5–10% of data range
`epsilon`	DP	Privacy budget	0.1–0.5 (high), 0.5–2 (moderate), 2–10 (low)
`sensitivity`	DP	Max possible change	max − min of your field
`delta`	DP	Failure probability	1e-5 (default)
`min_val`	Both	Lower clip bound	Realistic minimum
`max_val`	Both	Upper clip bound	Realistic maximum
`precision`	Both	Decimal places	0 for integers
`seed`	Both	Reproducible noise	Any string; same seed = same noise

Example 1: Direct Sigma Mode

Rules:

{
  "age": {
    "strategy": "noise",
    "sigma": 2,
    "seed": "reproducible-2026"
  }
}

Input data:

{
  "age": 45,
  "visitor_id": "V-001"
}

Output data (example):

{
  "age": 42,
  "visitor_id": "V-001"
}

Note

Without a seed, noise is random, so the same input produces different output each run. With a fixed seed, as in the rules above, the noise is reproducible. Larger sigma = more privacy but more data distortion.

Example 2: With Clipping Bounds

Rules:

{
  "salary": {
    "strategy": "noise",
    "sigma": 5000,
    "min_val": 20000,
    "max_val": 150000,
    "seed": "reproducible-2026"
  }
}

Input data:

{
  "salary": 95000,
  "dept": "Engineering"
}

Output data (example):

{
  "salary": 98234,
  "dept": "Engineering"
}

Generalize

The Generalize strategy coarsens data to broader categories.

Use when: You want to keep the type of information but remove specificity.

Common use: Dates (year only), locations (state instead of city), ages (brackets).

For numeric data (ages, scores, amounts):

{
  "age": {
    "strategy": "generalize",
    "step": 10,
    "method": "range"
  },
  "score": {
    "strategy": "generalize",
    "bins": [0, 50, 70, 90, 100],
    "method": "range"
  }
}

For dates:

{
  "birth_date": {
    "strategy": "generalize",
    "method": "date_year"
  },
  "visit_month": {
    "strategy": "generalize",
    "method": "date_month"
  }
}

For locations (comma-separated):

{
  "full_address": {
    "strategy": "generalize",
    "depth": 1
  }
}

Parameters:

step: Fixed bracket size (e.g., 10 for ages 0-10, 10-20, etc.)
bins: Custom brackets [0, 18, 30, 50, 100] → “0-18”, “18-30”, etc.
method: "range" (shows bracket), "floor" (lower bound only), "date_year", "date_month"
depth: For locations, number of leading parts to remove

Example 1: Numeric with Step

Rules:

{
  "age": {
    "strategy": "generalize",
    "step": 10,
    "method": "range"
  }
}

Input data:

{
  "age": 27,
  "name": "Alice"
}

Output data:

{
  "age": "20-30",
  "name": "Alice"
}

Example 2: Custom Bins

Rules:

{
  "score": {
    "strategy": "generalize",
    "bins": [0, 50, 70, 90, 100],
    "method": "range"
  }
}

Input data:

{
  "score": 87,
  "student_id": "S-456"
}

Output data:

{
  "score": "70-90",
  "student_id": "S-456"
}
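The two numeric examples above can be reproduced with a short sketch. The function names are hypothetical, and the handling of values that fall exactly on a bracket edge (assigned to the bracket that starts there, except for the top edge) is an assumption rather than documented MaskMe behaviour.

```python
import bisect


def generalize_step(value, step, method="range"):
    """Map value to a fixed-width bracket of size step."""
    lower = (value // step) * step
    if method == "floor":
        return lower
    return f"{lower}-{lower + step}"


def generalize_bins(value, bins):
    """Map value to the custom bracket that contains it."""
    if value < bins[0] or value > bins[-1]:
        raise ValueError(f"{value} is outside the configured bins")
    # Index of the bracket's lower edge; the top edge joins the last bracket.
    i = min(bisect.bisect_right(bins, value) - 1, len(bins) - 2)
    return f"{bins[i]}-{bins[i + 1]}"


print(generalize_step(27, 10))                      # → 20-30
print(generalize_step(27, 10, method="floor"))      # → 20
print(generalize_bins(87, [0, 50, 70, 90, 100]))    # → 70-90
```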

Example 3: Dates

Rules:

{
  "birth_date": {
    "strategy": "generalize",
    "method": "date_year"
  },
  "visit_date": {
    "strategy": "generalize",
    "method": "date_month"
  }
}

Input data:

{
  "birth_date": "1995-06-15",
  "visit_date": "2024-03-15"
}

Output data:

{
  "birth_date": "1995",
  "visit_date": "2024-03"
}

Example 4: Locations

Rules:

{
  "full_address": {
    "strategy": "generalize",
    "depth": 1
  }
}

Input data:

{
  "full_address": "New York,USA,Home"
}

Output data:

{
  "full_address": "USA,Home"
}
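The date and location examples follow the same idea: parse the value, then keep only the coarse part. The sketch below assumes ISO-formatted dates (YYYY-MM-DD) and comma-separated locations; both helper names are hypothetical and MaskMe may accept other formats.

```python
from datetime import date


def generalize_date(value: str, method: str) -> str:
    """Reduce an ISO date to its year or year-month."""
    parsed = date.fromisoformat(value)
    if method == "date_year":
        return f"{parsed.year:04d}"
    if method == "date_month":
        return f"{parsed.year:04d}-{parsed.month:02d}"
    raise ValueError(f"unknown method: {method}")


def generalize_location(value: str, depth: int) -> str:
    """Drop the first depth comma-separated parts of a location string."""
    parts = value.split(",")
    return ",".join(parts[depth:])


print(generalize_date("1995-06-15", "date_year"))    # → 1995
print(generalize_date("2024-03-15", "date_month"))   # → 2024-03
print(generalize_location("New York,USA,Home", 1))   # → USA,Home
```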

Privacy regulations like HIPAA require a structured approach to data anonymization:

STEP 1: Remove Direct Identifiers (HIPAA Requirement)
├─ Is it a direct identifier? (name, SSN, medical record ID, etc.)
│  └─ YES → drop
│  └─ NO → go to STEP 2

STEP 2: Handle Quasi-Identifiers (Latanya Sweeney's Research)
│
│  Note: Deletion alone is not sufficient.
│  Quasi-identifiers can be re-linked to external data.
│  Example: ZIP code + birthdate + gender uniquely identify about 87% of the US population
│
├─ Is it a quasi-identifier? (age, zipcode, birthdate, etc.)
│  ├─ Age/Income/Numeric → generalize (brackets) or noise
│  ├─ Date → generalize (year/month only)
│  ├─ Location → generalize (depth)
│  └─ NO → go to STEP 3

STEP 3: Preserve Analytical Payloads
├─ Is it analysis-critical? (medical codes, procedures, measurements)
│  ├─ Sensitive + Need Linkage → hash (consistent transformation)
│  ├─ Sensitive + Need Pattern → redact (partial visibility)
│  ├─ Sensitive + Numeric → noise (with bounds)
│  ├─ Needed information → keep
│  └─ NO → go to STEP 4

STEP 4: Non-Sensitive Data
└─ Not sensitive, not quasi-identifier → keep
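The decision tree above can be written down as a small helper, which is handy when classifying many fields. This is a simplified sketch: the role and need labels are hypothetical names, and quasi-identifiers always map to generalize here even though the tree also allows noise for numeric fields.

```python
def choose_strategy(role: str, need: str = "") -> str:
    """Pick a strategy name following the four steps of the decision tree."""
    if role == "direct_identifier":          # STEP 1
        return "drop"
    if role == "quasi_identifier":           # STEP 2
        return "generalize"
    if role == "analytical":                 # STEP 3
        by_need = {"linkage": "hash", "pattern": "redact", "numeric": "noise"}
        return by_need.get(need, "keep")
    return "keep"                            # STEP 4


fields = {
    "name": ("direct_identifier", ""),
    "birth_date": ("quasi_identifier", ""),
    "diagnosis_code": ("analytical", "linkage"),
    "lab_value": ("analytical", "numeric"),
    "product_category": ("other", ""),
}
for field, (role, need) in fields.items():
    print(f"{field}: {choose_strategy(role, need)}")
```

The output is only a starting point: the risk and utility measurements in Part 3 decide whether the chosen strategies are strong enough.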

Part 3: Measuring Privacy and Utility

Anonymization is not a one-shot decision — you need to measure the outcome to verify privacy is protected and data remains useful.

MaskMe provides two families of analytics:

Risk analytics — measure re-identification risk
Utility analytics — measure data quality loss

Both are available through the CLI and the Python API, and can generate rich HTML reports.

# Re-identification risk
maskme analyze risk --input visits_masked.csv \
    --qi age postal_code --sa diagnosis \
    --report risk_report.html

# Data utility
maskme analyze utility --original visits.csv \
    --anonymized visits_masked.csv \
    --report utility_report.html

The --qi flag takes the quasi-identifier fields — attributes that could be combined to re-identify someone (age + postal code, birthdate + gender, etc.). The --sa flag specifies the sensitive attribute to protect.

Three metrics work together to measure re-identification risk.

k-Anonymity

Every record must be indistinguishable from at least k-1 other records based on quasi-identifiers. A low k_min means some records are easy to single out.

maskme analyze risk --input visits_masked.csv \
    --qi age postal_code --sa diagnosis \
    --k-threshold 5

Key metrics in the output:

Metric	What it tells you
`k_min`	Smallest equivalence class size (worst case)
`k_mean` / `k_median`	Typical class size across the dataset
`at_risk_records`	Records in classes below the threshold
`pct_at_risk`	Percentage of records at risk

If k_min is 1, the dataset has unique records that can be directly re-identified. Solution: generalize or suppress more aggressively.
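To see where these numbers come from, here is a minimal sketch that groups records into equivalence classes and derives the same kind of metrics. It is an independent illustration in plain Python, not MaskMe's code; the result keys simply mirror the metric names in the table.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers, k_threshold=5):
    """Compute equivalence-class sizes over the quasi-identifier columns."""
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    sizes = list(classes.values())
    at_risk = sum(size for size in sizes if size < k_threshold)
    return {
        "k_min": min(sizes),
        "k_mean": sum(sizes) / len(sizes),
        "at_risk_records": at_risk,
        "pct_at_risk": 100.0 * at_risk / len(records),
    }


records = [
    {"age": "30-35", "postal_code": "75001", "diagnosis": "Flu"},
    {"age": "30-35", "postal_code": "75001", "diagnosis": "Diabetes"},
    {"age": "30-35", "postal_code": "75001", "diagnosis": "Flu"},
    {"age": "40-45", "postal_code": "75002", "diagnosis": "Asthma"},
]
result = k_anonymity(records, ["age", "postal_code"], k_threshold=2)
print(result["k_min"], result["at_risk_records"], result["pct_at_risk"])  # → 1 1 25.0
```

The single record in the 40-45 / 75002 class is unique on its quasi-identifiers, which is exactly the k_min = 1 situation described above.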

l-Diversity

Even when records are k-anonymous, if everyone in a class has the same sensitive value (e.g. all have the same diagnosis), attribute disclosure is possible. l-diversity requires at least l distinct sensitive values per equivalence class.

maskme analyze risk --input visits_masked.csv \
    --qi age postal_code --sa diagnosis \
    --l-threshold 3

Key metrics:

Metric	What it tells you
`l_min`	Fewest distinct sensitive values in any class
`at_risk_classes`	Classes with fewer distinct values than the threshold

Distinct l-diversity (the variant MaskMe uses) is a good baseline. An l_min of 1 means at least one class has only one sensitive value, which amounts to full attribute disclosure for those records.
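Distinct l-diversity is straightforward to compute by hand, as the sketch below shows. It is an illustration in plain Python with hypothetical names, not MaskMe's implementation.

```python
from collections import defaultdict


def l_diversity(records, quasi_identifiers, sensitive_attr, l_threshold=2):
    """Count distinct sensitive values in each equivalence class."""
    values = defaultdict(set)
    for r in records:
        key = tuple(r[q] for q in quasi_identifiers)
        values[key].add(r[sensitive_attr])
    counts = {key: len(vals) for key, vals in values.items()}
    return {
        "l_min": min(counts.values()),
        "at_risk_classes": [k for k, c in counts.items() if c < l_threshold],
    }


records = [
    {"age": "30-35", "postal_code": "75001", "diagnosis": "Flu"},
    {"age": "30-35", "postal_code": "75001", "diagnosis": "Diabetes"},
    {"age": "30-35", "postal_code": "75001", "diagnosis": "Flu"},
    {"age": "40-45", "postal_code": "75002", "diagnosis": "Asthma"},
    {"age": "40-45", "postal_code": "75002", "diagnosis": "Asthma"},
]
result = l_diversity(records, ["age", "postal_code"], "diagnosis", l_threshold=2)
print(result["l_min"])            # → 1
print(result["at_risk_classes"])  # → [('40-45', '75002')]
```

The second class is 2-anonymous, yet anyone known to be in it has Asthma. This is the attribute disclosure that k-anonymity alone does not catch.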

t-Closeness

Measures how much the distribution of the sensitive attribute within each class deviates from the global distribution. Uses Earth Mover’s Distance (EMD).

maskme analyze risk --input visits_masked.csv \
    --qi age postal_code --sa diagnosis \
    --t-threshold 0.15

Key metrics:

Metric	What it tells you
`t_max`	Maximum EMD across all classes (worst case)
`at_risk_classes`	Classes exceeding the threshold

A t_max above 0.2 means some classes have a significantly different sensitive distribution — an adversary could infer the sensitive value with higher confidence than from the global distribution alone.
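For a categorical sensitive attribute where all values are treated as equally distant, the Earth Mover's Distance reduces to half the sum of absolute differences between the two distributions. The sketch below uses that simplification; the ground distance MaskMe actually applies is not stated here, so treat this as an assumption.

```python
from collections import Counter, defaultdict


def distribution(values):
    """Relative frequency of each value."""
    counts = Counter(values)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}


def t_closeness(records, quasi_identifiers, sensitive_attr):
    """Return per-class EMD (equal ground distance) and the worst case."""
    overall = distribution(r[sensitive_attr] for r in records)
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[q] for q in quasi_identifiers)].append(r[sensitive_attr])
    distances = {}
    for key, vals in groups.items():
        local = distribution(vals)
        distances[key] = 0.5 * sum(
            abs(local.get(v, 0.0) - p) for v, p in overall.items()
        )
    return distances, max(distances.values())


records = [
    {"age": "30-35", "postal_code": "75001", "diagnosis": "Flu"},
    {"age": "30-35", "postal_code": "75001", "diagnosis": "Diabetes"},
    {"age": "30-35", "postal_code": "75001", "diagnosis": "Flu"},
    {"age": "40-45", "postal_code": "75002", "diagnosis": "Asthma"},
    {"age": "40-45", "postal_code": "75002", "diagnosis": "Asthma"},
]
distances, t_max = t_closeness(records, ["age", "postal_code"], "diagnosis")
print(round(t_max, 2))  # → 0.6
```

Here the all-Asthma class is far from the global distribution (40% Flu, 20% Diabetes, 40% Asthma), so it drives t_max well above any of the recommended thresholds.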

Threshold recommendations:

Metric	Strict	Moderate	Relaxed
k-anonymity	5–10	3–5	2
l-diversity	5+	3–5	2
t-closeness	0.05–0.10	0.10–0.20	0.20–0.50

from maskme.analytics.risk import run, report

records = [
    {"age": "30-35", "postal_code": "75001", "diagnosis": "Flu"},
    {"age": "30-35", "postal_code": "75001", "diagnosis": "Diabetes"},
    {"age": "30-35", "postal_code": "75001", "diagnosis": "Flu"},
    # ...
]

results = run(
    records=records,
    quasi_identifiers=["age", "postal_code"],
    sensitive_attr="diagnosis",
)

for r in results:
    status = "✓" if r.passed else "✗"
    print(f"{status}  {r.name}: passed={r.passed}")

# HTML report
report.generate(
    results,
    output_path="risk_report.html",
    dataset_info={"records": len(records)},
)

Run individual metrics:

results = run(
    records=records,
    quasi_identifiers=["age", "postal_code"],
    sensitive_attr="diagnosis",
    analytics=["k_anonymity", "t_closeness"],  # skip l-diversity
    k_threshold=5,
    t_threshold=0.1,
)

Anonymization always destroys some information. Utility metrics measure how much data quality is preserved.

Field Retention

Measures how many values remain unchanged per field.

maskme analyze utility --original visits.csv \
    --anonymized visits_masked.csv

Key output:

✓  Field Retention           score=0.85  passed

This means 85% of values across all fields were preserved identically. If a score is low, check which fields are being heavily modified — they may be over-anonymized.
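As a rough illustration of the idea, the sketch below computes the share of unchanged values per field and averages it. The assumptions are that records are aligned by position, that a dropped field counts as not retained, and that the overall score is the plain mean over fields. MaskMe may weight things differently.

```python
def field_retention(original, anonymised):
    """Fraction of values left unchanged, per field and overall."""
    per_field = {}
    for field in original[0].keys():
        unchanged = sum(
            1
            for before, after in zip(original, anonymised)
            if field in after and after[field] == before[field]
        )
        per_field[field] = unchanged / len(original)
    overall = sum(per_field.values()) / len(per_field)
    return per_field, overall


original = [
    {"name": "Alice", "region": "US-CA", "age": 27},
    {"name": "Bob", "region": "US-NY", "age": 41},
]
anonymised = [
    {"region": "US-CA", "age": "20-30"},
    {"region": "US-NY", "age": "40-50"},
]
per_field, overall = field_retention(original, anonymised)
print(per_field)  # → {'name': 0.0, 'region': 1.0, 'age': 0.0}
```

A per-field view like this makes it easy to spot which columns pull the overall score down.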

Statistical Fidelity

Compares statistical properties before and after anonymization: means, standard deviations, distributions. For categorical fields, uses Total Variation Distance; for numerical, uses normalised delta and Spearman rank correlation.

✓  Statistical Fidelity      score=0.72  passed

A score of 0.72 means statistical patterns are reasonably preserved. Low-scoring fields are listed in the detailed output.

Information Loss

The Information Loss Index (ILI) measures how much each field deviates from its original values. Utility score = 1 − ILI.

✓  Information Loss          score=0.65  passed

Numerical fields use Normalised Mean Absolute Error — a value with 20% error contributes ILI of 0.2. Categorical fields count any change (hash, redact) as full loss (ILI = 1.0).
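The two rules above can be sketched as follows. Normalising the mean absolute error by the original field's range is an assumption made for this example (the text does not say which normaliser MaskMe uses), and the function names are hypothetical.

```python
def numeric_ili(original, anonymised):
    """Mean absolute error divided by the original range, capped at 1.0."""
    span = max(original) - min(original)
    if span == 0:
        return 0.0 if list(original) == list(anonymised) else 1.0
    mae = sum(abs(o - a) for o, a in zip(original, anonymised)) / len(original)
    return min(1.0, mae / span)


def categorical_ili(original, anonymised):
    """Any changed value counts as full loss for that value."""
    changed = sum(1 for o, a in zip(original, anonymised) if o != a)
    return changed / len(original)


ages_before = [0, 50, 100]
ages_after = [20, 30, 120]          # each value is off by 20
ili = numeric_ili(ages_before, ages_after)
print(round(ili, 2), round(1 - ili, 2))  # → 0.2 0.8

print(categorical_ili(["Flu", "Flu"], ["9f8e7d", "Flu"]))  # → 0.5
```

Because hashing or redacting a categorical value always counts as full loss, fields treated that way lower this score even when they remain useful for linkage.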

Threshold recommendations:

Metric	Good	Acceptable	Needs review
Field Retention	≥ 0.80	0.50–0.80	< 0.50
Statistical Fidelity	≥ 0.80	0.60–0.80	< 0.60
Information Loss (score)	≥ 0.80	0.50–0.80	< 0.50

from maskme.analytics.utility import run, report

results = run(
    original=original_records,
    anonymised=anonymised_records,
)

for r in results:
    print(f"{r.name}: score={r.score:.2f}  passed={r.passed}")

# HTML report
report.generate(
    results,
    output_path="utility_report.html",
    dataset_info={"records": len(original_records)},
)

Explicitly declare field types for better accuracy:

results = run(
    original=original_records,
    anonymised=anonymised_records,
    numerical_fields=["age", "purchase_count"],
    categorical_fields=["region", "diagnosis"],
    field_retention_threshold=0.6,
    statistical_fidelity_threshold=0.7,
    information_loss_threshold=0.6,
)

Both maskme analyze risk and maskme analyze utility generate self-contained HTML reports when --report is specified:

maskme analyze risk --input visits_masked.csv \
    --qi age postal_code --sa diagnosis \
    --k-threshold 3 --report risk_report.html
# Opens in any browser → risk_report.html

maskme analyze utility --original visits.csv \
    --anonymized visits_masked.csv \
    --report utility_report.html
# → utility_report.html

Each report includes:

Executive summary with pass/fail badges
Per-metric section with score, threshold, and SVG charts
Detailed per-field / per-class breakdowns
Recommendations for improvement

There is no single “right” threshold. The balance depends on your use case:

Scenario	Privacy priority	Utility priority	Typical thresholds
Medical research	High	High	k=5, l=3, t=0.1, utility ≥ 0.7
Public dataset release	High	Medium	k=10, l=5, t=0.05, utility ≥ 0.5
Internal analytics	Medium	High	k=3, l=2, t=0.2, utility ≥ 0.8
Exploratory (masked sample)	Low	Maximum	k=2, l=2, t=0.5, utility ≥ 0.9

Rule of thumb: Run risk + utility together. If privacy passes but utility fails, your anonymization is too aggressive. If utility passes but privacy fails, you need stronger strategies. Adjust and re-measure.

Part 4: Anonymizing Unstructured Text with NER

So far we’ve worked with structured data (tabular records). But what about unstructured text — patient notes, emails, support tickets, or any free-form text containing names, locations, or dates?

MaskMe’s NER module detects and replaces personally identifiable information in free text using Named Entity Recognition (spaCy).

The NER module requires extra dependencies:

pip install "maskme[ner]"
python -m spacy download en_core_web_lg
python -m spacy download fr_core_news_lg

MaskMe auto-detects French (fr) and English (en). Install only the language(s) you need.

maskme ner document.txt -o document_anon.txt

Example (report.txt):

Le Dr. Martin a diagnostiqué Alice Johnson le 12 janvier 2024
à l'Hôpital Saint-Louis à Paris. Le traitement commence à 14h30.

Run:

maskme ner report.txt -l fr -o report_anon.txt

Output (report_anon.txt):

Le [PERSON] a diagnostiqué [PERSON] le [DATE]
à l'[ORGANISATION] à [LOCATION]. Le traitement commence à [TIME].

Key CLI options:

Option	Description
`-o, --output`	Output file path (stdout if omitted)
`-l, --language`	Language override (`fr` or `en`); auto-detected if omitted
`--lines`	Process each line as a separate text (batch mode)
`--verbose`	Enable debug logging

Line-by-line mode is useful when each line is a separate record:

maskme ner patients.txt --lines -o patients_anon.txt

from maskme.ner import mask

# Single text → returns a PipelineResult
result = mask("Alice habite à Paris.")
print(result.output)
# → '[PERSON] habite à [LOCATION].'

print(result.entities)
# → [Entity(text='Alice', label=<EntityLabel.PERSON: 'PERSON'>, ...),
#     Entity(text='Paris', label=<EntityLabel.LOCATION: 'LOCATION'>, ...)]

print(result.language)
# → 'fr'

# Batch processing → returns a list of PipelineResult
results = mask([
    "Alice habite à Paris.",
    "John lives in London.",
])
for r in results:
    print(f"[{r.language}] {r.output}")
# → [fr] [PERSON] habite à [LOCATION].
# → [en] [PERSON] lives in [LOCATION].

# Language hint (skip auto-detection)
result = mask("Dr. Smith works at General Hospital.", language="en")
print(result.output)
# → '[PERSON] works at [ORGANISATION].'

Detected entity types are replaced with bracketed tags:

Entity Label	Tag	Example
PERSON	`[PERSON]`	Names, doctors, patients
LOCATION	`[LOCATION]`	Cities, countries, addresses
ORGANISATION	`[ORGANISATION]`	Hospitals, companies
DATE	`[DATE]`	Full dates, “12 janvier 2024”
TIME	`[TIME]`	Times, “14h30”
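The replacement step itself is simple once entity positions are known. The sketch below hard-codes two spans instead of running a detector, and represents each span as a (start, end, label) tuple; that tuple shape is hypothetical and not MaskMe's Entity class.

```python
def replace_spans(text: str, spans) -> str:
    """Replace each (start, end, label) span with a bracketed tag.

    Spans are applied from right to left so earlier offsets stay valid.
    Overlapping spans are not handled in this sketch.
    """
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text


text = "Alice habite à Paris."
spans = [
    (text.index("Alice"), text.index("Alice") + len("Alice"), "PERSON"),
    (text.index("Paris"), text.index("Paris") + len("Paris"), "LOCATION"),
]
print(replace_spans(text, spans))  # → [PERSON] habite à [LOCATION].
```

Working right to left matters because each tag usually has a different length than the text it replaces, which would shift the offsets of any span that comes later in the string.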

result = mask("Alice habite à Paris.")

result.output        # Tagged text
result.input         # Original text
result.entities      # List of detected Entity objects
result.language      # Detected language
result.entity_count  # Number of entities masked
result.labels_found  # Deduplicated, sorted entity labels
result.stats         # Processing metadata (timing, detectors)
result.as_dict()     # Serialize everything to a dict

Note

Without spaCy installed, the module degrades gracefully: mask() returns the original text unchanged and logs a warning with install instructions.

Part 5: Real-World Example

Let’s put it all together with a realistic scenario:

Scenario: A healthcare organization wants to share patient visit data for research while protecting privacy and meeting HIPAA compliance requirements.

Field classification (using the privacy-compliance decision tree):

patient_id: Direct identifier → drop (STEP 1: Remove direct identifiers per HIPAA)
age: Quasi-identifier → generalize (STEP 2: Protect quasi-identifiers as per Latanya Sweeney’s research)
postal_code: Quasi-identifier → generalize (STEP 2: Postal code + age can re-identify individuals)
diagnosis: Analytical payload → hash (STEP 3: Preserve for research while protecting identity)
visit_date: Quasi-identifier → generalize (STEP 2: Protect date information)
medication: Non-sensitive → keep (STEP 4: Analytical value without privacy risk)

Rules file (healthcare_rules.json):

{
  "patient_id": "drop",
  "age": {
    "strategy": "generalize",
    "step": 5,
    "method": "range"
  },
  "postal_code": {
    "strategy": "generalize",
    "depth": 2
  },
  "diagnosis": {"strategy": "hash", "salt": "healthcare-2026"},
  "visit_date": {
    "strategy": "generalize",
    "method": "date_year"
  },
  "medication": "keep"
}

Run anonymization:

maskme --rules healthcare_rules.json --input visits.csv --output visits_masked.csv

Next Steps

Now that you understand the basics:

Need to build a custom strategy? See How to Create a Custom Strategy
Want to know more about each strategy? See Anonymization Strategies
Building an application with MaskMe? See API Reference
Anonymizing unstructured text? See NER Module — Unstructured Text Anonymization
Curious about the architecture? See Architecture: The Anonymization Engine

Start with the least-destructive strategy: Keep > Generalize > Redact > Hash > Noise > Drop
Test with a small sample first: Verify behavior before running on production data
Always measure privacy: Don’t anonymize without understanding the privacy-utility trade-off
Use a secret salt for hashing: Makes hashes unique to your organization and harder to reverse by guessing inputs
Document your choices: Track which strategies you chose and why for compliance