Getting Started with MaskMe

In this tutorial, you’ll learn how to anonymize datasets using MaskMe. Whether you prefer a command-line tool or a Python API for custom applications, you’ll see both approaches in action.

By the end, you’ll understand:

  • The two ways to use MaskMe (CLI vs Python API)

  • How each anonymization strategy works

  • How to choose the right strategy for your data

  • How to measure re-identification risk and data utility

Prerequisites

MaskMe requires Python 3.9+. Install the package with pip:

pip install maskme

Verify the installation:

maskme --version

Part 1: Your First Anonymization

Let’s start with a simple example. Imagine you have customer data you want to anonymize:

Sample data (customers.csv):

id,name,email,phone,region,purchase_count
1,Alice Johnson,alice@example.com,555-0101,US-CA,42
2,Bob Smith,bob@example.com,555-0102,US-NY,15
3,Carol White,carol@example.com,555-0103,US-TX,87

Define a rules file that specifies which strategy to apply to each field:

Rules file (rules.json):

{
  "id": "hash",
  "name": "drop",
  "email": "drop",
  "phone": {"strategy": "redact", "char": "X", "keep_start": 0, "keep_end": 4},
  "region": "keep",
  "purchase_count": "keep"
}

Now run MaskMe from the command line:

maskme --rules rules.json --input customers.csv --output customers_masked.csv

Inspect the result:

cat customers_masked.csv

Output:

id,phone,region,purchase_count
6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b,XXXX0101,US-CA,42
d4735e3a265e16eee03f59718b9b5d03019c07d8b6c51f90da3a666eec13ab35,XXXX0102,US-NY,15
4e07408562bedb8b60ce05c1decfe3ad16b72230967de01f640b7e4729b49fce,XXXX0103,US-TX,87

What happened:

  • id: Hashed (unreadable, but consistent for the same input)

  • name: Removed entirely (drop)

  • email: Removed entirely (drop)

  • phone: Last 4 digits kept, rest redacted (keep_end: 4)

  • region and purchase_count: Left unchanged (keep)

For developers building applications on top of MaskMe, use the Python API:

import csv
from maskme import MaskMe

# Define rules (same as the JSON file)
rules = {
    "id": "hash",
    "name": "drop",
    "email": "drop",
    "phone": {"strategy": "redact", "char": "X", "keep_start": 0, "keep_end": 4},
    "region": "keep",
    "purchase_count": "keep"
}

# Load data
with open("customers.csv", "r", newline="") as f:
    reader = csv.DictReader(f)
    records = list(reader)

# Initialize engine
engine = MaskMe(rules)

# Process records
masked_records = list(engine.mask(records))

# Save results
# newline="" stops the csv module from writing blank rows on Windows
with open("customers_masked.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=masked_records[0].keys())
    writer.writeheader()
    writer.writerows(masked_records)

Key advantage: You can integrate MaskMe directly into your Python applications — pipelines, data processing scripts, ML workflows, etc.

Part 2: Understanding Strategies

MaskMe provides six strategies, each striking a different balance between privacy and utility. Let’s explore when to use each.

Strategy Overview

| Strategy | What It Does | Best For |
|---|---|---|
| Keep | Preserves original value unchanged | Non-sensitive fields, public data |
| Drop | Removes field entirely | Direct identifiers (PII) |
| Hash | One-way cryptographic digest | Consistent linking without revealing values |
| Redact | Replaces characters with placeholders | Partial visibility (e.g., last 4 digits) |
| Noise | Adds statistical noise to numeric values | Numeric data with preserved distributions |
| Generalize | Coarsens data into broader categories | Ages, dates, locations |


Keep

The Keep strategy keeps the original value unchanged.

Use when: The field is already public, non-sensitive, or represents a key analytical dimension.

Example: Geographic region, product category, or timestamp.

In rules file:

{
  "region": "keep",
  "category": "keep"
}

Input data:

{
  "region": "US-West",
  "category": "Electronics"
}

Output data:

{
  "region": "US-West",
  "category": "Electronics"
}

Drop

The Drop strategy removes the field entirely.

Use when: The field is a direct identifier (PII) that could lead to re-identification, or if the field is unnecessary for the final dataset.

Example: Names, email addresses, phone numbers, social security numbers.

In rules file:

{
  "user_id": "drop",
  "internal_ref": "drop"
}

Input data:

{
  "user_id": "USR-12345",
  "internal_ref": "REF-999",
  "email": "john@example.com"
}

Output data:

{
  "email": "john@example.com"
}

Hash

The Hash strategy converts the value into a fixed-length hexadecimal digest.

Use when: You need a consistent, one-way transformation (cannot be reversed).

Common use: Email addresses, usernames, customer IDs when consistency matters for record linking.

Parameters:

  • algo: Hashing algorithm — sha256 (default), sha512, blake2b, etc.

  • salt: Recommended. Salt string for extra security. Same input + same salt = same output.

In rules file:

{
  "email": "hash",
  "customer_id": {
    "strategy": "hash",
    "salt": "my-org-secret-2026"
  },
  "diagnosis_code": {
    "strategy": "hash",
    "algo": "sha512",
    "salt": "healthcare-key"
  }
}

Input data:

{
  "email": "alice@example.com",
  "customer_id": "CUST-5678",
  "diagnosis_code": "E11.9"
}

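Output data (illustrative, digests shortened for readability; a real sha256 digest has 64 hex characters and a sha512 digest has 128):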

{
  "email": "2a7f8e9c3b1d5f4a6e8c9b1d3f5a7e8c",
  "customer_id": "f4d8b7a9c1e3f5d9b2a4c6e8f1a3d5b7",
  "diagnosis_code": "9f8e7d6c5b4a3f2e1d0c9b8a7f6e5d4ca1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6"
}

Note

Same input + same salt = same output. This is useful for linking records across datasets.
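To see why hashed values still link across datasets, here is a minimal sketch built on Python's standard hashlib module. It is not MaskMe's implementation: the helper name hash_value and the choice to prepend the salt to the value are assumptions made for illustration, and MaskMe may combine salt and value differently.

```python
import hashlib


def hash_value(value: str, salt: str = "", algo: str = "sha256") -> str:
    """Return the hex digest of salt + value (the salt placement is an assumption)."""
    digest = hashlib.new(algo)
    digest.update((salt + value).encode("utf-8"))
    return digest.hexdigest()


# Same input and same salt always give the same digest, so records still link.
a = hash_value("CUST-5678", salt="my-org-secret-2026")
b = hash_value("CUST-5678", salt="my-org-secret-2026")
print(a == b)

# A different salt gives an unrelated digest for the same input.
c = hash_value("CUST-5678", salt="another-org")
print(a == c)

# Digest length depends on the algorithm: 64 hex characters for sha256, 128 for sha512.
print(len(hash_value("E11.9")), len(hash_value("E11.9", algo="sha512")))
```

With an empty salt, hash_value("1") reproduces the first id digest shown in Part 1, which suggests that example is a plain SHA-256 of the id string.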


Redact

The Redact strategy replaces characters with a placeholder, preserving length.

Use when: You need some visible information (like last 4 digits) while hiding the rest.

Common use: Phone numbers, credit cards, partially-visible IDs.

Parameters:

  • char: Placeholder character (default: *)

  • keep_start: Characters to show at the beginning (default: 0)

  • keep_end: Characters to show at the end (default: 0)

In rules file:

{
  "phone": {
    "strategy": "redact",
    "keep_end": 4
  },
  "credit_card": {
    "strategy": "redact",
    "char": "X",
    "keep_end": 4
  },
  "email": {
    "strategy": "redact",
    "keep_start": 1,
    "keep_end": 3,
    "char": "*"
  },
  "name": "redact"
}

Input data:

{
  "phone": "555-0101",
  "credit_card": "4532123456787890",
  "email": "alice@example.com",
  "name": "John Doe"
}

Output data:

{
  "phone": "****0101",
  "credit_card": "XXXXXXXXXXXX7890",
  "email": "a*************com",
  "name": "********"
}
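The length-preserving behavior described above can be sketched in a few lines of plain Python. This is an illustration, not MaskMe's code: the function name redact is hypothetical, and masking the whole value when keep_start + keep_end covers the entire string is an assumption chosen to err on the side of privacy.

```python
def redact(value: str, char: str = "*", keep_start: int = 0, keep_end: int = 0) -> str:
    """Replace every character except the kept prefix and suffix, preserving length."""
    n = len(value)
    if keep_start + keep_end >= n:
        # Assumption: never reveal the full value, mask everything instead.
        return char * n
    head = value[:keep_start]
    tail = value[n - keep_end:] if keep_end else ""
    return head + char * (n - keep_start - keep_end) + tail


print(redact("555-0101", keep_end=4))   # ****0101
print(redact("John Doe"))               # ********
```

Because separators such as the hyphen are masked like any other character, the output always has the same length as the input.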

Noise

The Noise strategy adds statistical noise to numeric values.

Use when: You need to keep numbers but make individual values unrecognizable while preserving statistical distributions.

Common use: Ages, salaries, purchase amounts, measurements.

Two modes — which one for you?

| Mode | Pick this when… | You must provide |
|---|---|---|
| Direct sigma | You want simple, predictable noise. No formal privacy guarantees needed. | sigma |
| Differential Privacy | You need a formal, provable privacy guarantee (HIPAA, research, regulation). | epsilon + sensitivity |

Direct sigma — picking sigma

Adds Gaussian noise with standard deviation sigma. About 68% of outputs land within ±sigma of the true value, 95% within ±2×sigma.

Pick sigma relative to your data’s scale:

Rule of thumb

| Field | Typical range | Suggested sigma | Effect |
|---|---|---|---|
| Age | 0–100 | 2–5 | Most values shift ±2–5 years |
| Salary | $30k–$200k | 5k–10k | Most shift ±$5k–$10k |
| Rating (1–5) | 1–5 | 0.5–1 | Most shift ±0.5–1 stars |
| Purchase count | 0–1,000 | 50–100 | Most shift ±50–100 |

Start at the low end, check if the noise is sufficient for your use case, and increase if needed.

{
  "age": {
    "strategy": "noise",
    "sigma": 2,
    "seed": "reproducible-2026"
  },
  "salary": {
    "strategy": "noise",
    "sigma": 5000,
    "precision": 0,
    "min_val": 20000,
    "max_val": 500000,
    "seed": "reproducible-2026"
  }
}
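As a rough picture of what direct sigma mode does to a single value, the sketch below adds Gaussian noise with Python's standard random module, then clips and rounds. The function name add_noise, the clip-then-round order, and reseeding per call are all assumptions for illustration; MaskMe's internals may differ.

```python
import random


def add_noise(value, sigma, seed=None, precision=None, min_val=None, max_val=None):
    """Add zero-mean Gaussian noise, then clip and round (the order is an assumption)."""
    rng = random.Random(seed)  # a string seed makes the draw reproducible
    noisy = value + rng.gauss(0.0, sigma)
    if min_val is not None:
        noisy = max(min_val, noisy)
    if max_val is not None:
        noisy = min(max_val, noisy)
    if precision is not None:
        noisy = round(noisy, precision)
    return noisy


# Same seed, same input: identical output on every run.
first = add_noise(95000, sigma=5000, seed="reproducible-2026", precision=0,
                  min_val=20000, max_val=500000)
second = add_noise(95000, sigma=5000, seed="reproducible-2026", precision=0,
                   min_val=20000, max_val=500000)
print(first == second)

# Empirical check of the 68% rule: share of draws within one sigma of zero.
rng = random.Random(0)
draws = [rng.gauss(0.0, 2.0) for _ in range(10_000)]
share = sum(abs(d) <= 2.0 for d in draws) / len(draws)
print(round(share, 2))
```

A real pipeline would keep one generator per field and advance it across records. Reseeding for every value, as this sketch does, would give every record the same offset.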

Differential Privacy — picking epsilon

Noise is calibrated from epsilon (privacy budget). Smaller epsilon = stronger privacy = more noise.

Epsilon cheat sheet

| Epsilon | Privacy | When to use |
|---|---|---|
| 0.1 – 0.5 | High | Medical records, financial data, strict compliance |
| 0.5 – 2.0 | Moderate | Most business & research datasets |
| 2.0 – 10 | Low | Non-sensitive analytics, aggregate stats |
| > 10 | Minimal | Rarely useful; little privacy left |

Picking sensitivity

sensitivity = the maximum possible change one person’s data can cause.

The safe formula: sensitivity = max_value − min_value for the field.

If you clamp salaries to $30k–$200k with min_val / max_val, set sensitivity to 170k. This way the DP noise is proportional to your actual data range.

Picking delta

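delta is the probability that the epsilon guarantee fails to hold. Standard default: 1e-5 (a 1 in 100,000 chance). For larger datasets, a common guideline is to keep delta below 1 / number_of_records.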

{
  "salary": {
    "strategy": "noise",
    "epsilon": 1.0,
    "sensitivity": 170000,
    "delta": 1e-5,
    "min_val": 30000,
    "max_val": 200000,
    "seed": "reproducible-2026"
  }
}
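To get a feel for how epsilon, sensitivity, and delta translate into noise, the sketch below uses the classic Gaussian-mechanism calibration, sigma = sensitivity × sqrt(2 ln(1.25 / delta)) / epsilon. This tutorial does not state which calibration MaskMe applies, so treat the formula choice and the function name gaussian_sigma as assumptions; the classic bound is also only derived for epsilon below 1.

```python
import math


def gaussian_sigma(epsilon: float, sensitivity: float, delta: float = 1e-5) -> float:
    """Classic Gaussian-mechanism calibration (derived for 0 < epsilon < 1)."""
    if epsilon <= 0 or not 0 < delta < 1:
        raise ValueError("epsilon must be > 0 and delta must lie in (0, 1)")
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


# Smaller epsilon means stronger privacy and therefore a larger sigma.
for eps in (0.1, 0.5, 0.9):
    print(eps, round(gaussian_sigma(eps, sensitivity=170_000)))
```

Note how quickly sigma grows when sensitivity spans the whole salary range. Tighter min_val / max_val bounds reduce sensitivity and therefore the noise.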

All parameters quick reference

| Parameter | Mode | What it does | How to choose |
|---|---|---|---|
| sigma | Direct | Std dev of noise | ~5–10% of data range |
| epsilon | DP | Privacy budget | 0.1–0.5 (high), 0.5–2 (moderate), 2–10 (low) |
| sensitivity | DP | Max possible change | max − min of your field |
| delta | DP | Failure probability | 1e-5 (default) |
| min_val | Both | Lower clip bound | Realistic minimum |
| max_val | Both | Upper clip bound | Realistic maximum |
| precision | Both | Decimal places | 0 for integers |
| seed | Both | Reproducible noise | Any string; same seed = same noise |

Example 1: Direct Sigma Mode

Rules:

{
  "age": {
    "strategy": "noise",
    "sigma": 2,
    "seed": "reproducible-2026"
  }
}

Input data:

{
  "age": 45,
  "visitor_id": "V-001"
}

Output data (example):

{
  "age": 42,
  "visitor_id": "V-001"
}

Note

Without a seed, noise is random, so the same input produces different output on each run. With a seed, as in this example, the output is reproducible. Larger sigma = more privacy but more data distortion.

Example 2: With Clipping Bounds

Rules:

{
  "salary": {
    "strategy": "noise",
    "sigma": 5000,
    "min_val": 20000,
    "max_val": 150000,
    "seed": "reproducible-2026"
  }
}

Input data:

{
  "salary": 95000,
  "dept": "Engineering"
}

Output data (example):

{
  "salary": 98234,
  "dept": "Engineering"
}

Generalize

The Generalize strategy coarsens data to broader categories.

Use when: You want to keep the type of information but remove specificity.

Common use: Dates (year only), locations (state instead of city), ages (brackets).

For numeric data (ages, scores, amounts):

{
  "age": {
    "strategy": "generalize",
    "step": 10,
    "method": "range"
  },
  "score": {
    "strategy": "generalize",
    "bins": [0, 50, 70, 90, 100],
    "method": "range"
  }
}

For dates:

{
  "birth_date": {
    "strategy": "generalize",
    "method": "date_year"
  },
  "visit_month": {
    "strategy": "generalize",
    "method": "date_month"
  }
}

For locations (comma-separated):

{
  "full_address": {
    "strategy": "generalize",
    "depth": 1
  }
}

Parameters:

  • step: Fixed bracket size (e.g., 10 for ages 0-10, 10-20, etc.)

  • bins: Custom brackets [0, 18, 30, 50, 100] → “0-18”, “18-30”, etc.

  • method: "range" (shows bracket), "floor" (lower bound only), "date_year", "date_month"

  • depth: For locations, number of leading parts to remove

Example 1: Numeric with Step

Rules:

{
  "age": {
    "strategy": "generalize",
    "step": 10,
    "method": "range"
  }
}

Input data:

{
  "age": 27,
  "name": "Alice"
}

Output data:

{
  "age": "20-30",
  "name": "Alice"
}

Example 2: Custom Bins

Rules:

{
  "score": {
    "strategy": "generalize",
    "bins": [0, 50, 70, 90, 100],
    "method": "range"
  }
}

Input data:

{
  "score": 87,
  "student_id": "S-456"
}

Output data:

{
  "score": "70-90",
  "student_id": "S-456"
}
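The two numeric examples above can be reproduced with a short sketch. The function names are hypothetical, and the boundary handling (lower edge inclusive, values outside the bins clamped to the nearest bracket) is an assumption, since the tutorial does not specify it.

```python
from bisect import bisect_right


def generalize_step(value: float, step: int, method: str = "range") -> str:
    """Map a number to its fixed-width bracket, e.g. 27 with step 10 -> '20-30'."""
    lower = int(value // step) * step
    return f"{lower}-{lower + step}" if method == "range" else str(lower)


def generalize_bins(value: float, bins: list, method: str = "range") -> str:
    """Map a number to the custom bracket that contains it."""
    # Index of the bracket whose lower edge is <= value, clamped to a valid bracket.
    i = min(max(bisect_right(bins, value) - 1, 0), len(bins) - 2)
    return f"{bins[i]}-{bins[i + 1]}" if method == "range" else str(bins[i])


print(generalize_step(27, step=10))                # 20-30
print(generalize_bins(87, [0, 50, 70, 90, 100]))   # 70-90
print(generalize_step(27, step=10, method="floor"))  # 20
```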

Example 3: Dates

Rules:

{
  "birth_date": {
    "strategy": "generalize",
    "method": "date_year"
  },
  "visit_date": {
    "strategy": "generalize",
    "method": "date_month"
  }
}

Input data:

{
  "birth_date": "1995-06-15",
  "visit_date": "2024-03-15"
}

Output data:

{
  "birth_date": "1995",
  "visit_date": "2024-03"
}

Example 4: Locations

Rules:

{
  "full_address": {
    "strategy": "generalize",
    "depth": 1
  }
}

Input data:

{
  "full_address": "New York,USA,Home"
}

Output data:

{
  "full_address": "USA,Home"
}
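Date and location generalization follow the same idea of discarding detail. The sketch below assumes ISO-formatted dates (YYYY-MM-DD) and comma-separated locations, as in the examples above; the function names are hypothetical and MaskMe may accept more input formats.

```python
from datetime import date


def generalize_date(value: str, method: str) -> str:
    """Truncate an ISO date (YYYY-MM-DD) to its year or year-month."""
    parsed = date.fromisoformat(value)
    if method == "date_year":
        return parsed.strftime("%Y")
    if method == "date_month":
        return parsed.strftime("%Y-%m")
    raise ValueError(f"unsupported method: {method}")


def generalize_location(value: str, depth: int) -> str:
    """Drop the first `depth` comma-separated parts of a location string."""
    parts = [part.strip() for part in value.split(",")]
    return ",".join(parts[depth:])


print(generalize_date("1995-06-15", "date_year"))     # 1995
print(generalize_date("2024-03-15", "date_month"))    # 2024-03
print(generalize_location("New York,USA,Home", 1))    # USA,Home
```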

Privacy regulations like HIPAA require a structured approach to data anonymization:

STEP 1: Remove Direct Identifiers (HIPAA Requirement)
├─ Is it a direct identifier? (name, SSN, medical record ID, etc.)
│  ├─ YES → drop
│  └─ NO → go to STEP 2

STEP 2: Handle Quasi-Identifiers (Latanya Sweeney's Research)
│  Note: Deletion alone is not sufficient.
│  Quasi-identifiers can be re-linked to external data.
│  Example: Birthdate + gender + 5-digit ZIP code uniquely identify about 87% of the US population
├─ Is it a quasi-identifier? (age, zipcode, birthdate, etc.)
│  ├─ Age/Income/Numeric → generalize (brackets) or noise
│  ├─ Date → generalize (year/month only)
│  ├─ Location → generalize (depth)
│  └─ NO → go to STEP 3

STEP 3: Preserve Analytical Payloads
├─ Is it analysis-critical? (medical codes, procedures, measurements)
│  ├─ Sensitive + Need Linkage → hash (consistent transformation)
│  ├─ Sensitive + Need Pattern → redact (partial visibility)
│  ├─ Sensitive + Numeric → noise (with bounds)
│  ├─ Needed information → keep
│  └─ NO → go to STEP 4

STEP 4: Non-Sensitive Data
└─ Not sensitive, not quasi-identifier → keep
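If you classify fields programmatically, the four steps can be expressed as a small function. This is only a sketch of the decision tree above: the function name choose_strategy and the labels "direct_identifier", "quasi_identifier", "analytical_sensitive", "linkage", and "pattern" are hypothetical and are not part of MaskMe.

```python
def choose_strategy(kind: str, data_type: str = "text", need: str = "") -> str:
    """Map a field classification to a strategy name, following the four steps."""
    if kind == "direct_identifier":        # STEP 1: remove direct identifiers
        return "drop"
    if kind == "quasi_identifier":         # STEP 2: coarsen quasi-identifiers
        return "generalize"                # noise is the alternative for numerics
    if kind == "analytical_sensitive":     # STEP 3: preserve analytical payloads
        if need == "linkage":
            return "hash"
        if need == "pattern":
            return "redact"
        if data_type == "numeric":
            return "noise"
    return "keep"                          # STEP 3 "needed" and STEP 4


fields = {
    "name": ("direct_identifier", "text", ""),
    "birthdate": ("quasi_identifier", "date", ""),
    "account_ref": ("analytical_sensitive", "text", "linkage"),
    "category": ("non_sensitive", "text", ""),
}
rules = {name: choose_strategy(*spec) for name, spec in fields.items()}
print(rules)
```

The resulting dictionary only names strategies. Parameters such as step, depth, or salt still have to be chosen per field.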

Part 3: Measuring Privacy and Utility

Anonymization is not a one-shot decision — you need to measure the outcome to verify privacy is protected and data remains useful.

MaskMe provides two families of analytics:

  • Risk analytics — measure re-identification risk

  • Utility analytics — measure data quality loss

Both are available through CLI and Python API, and can generate rich HTML reports.

# Re-identification risk
maskme analyze risk --input visits_masked.csv \
    --qi age postal_code --sa diagnosis \
    --report risk_report.html

# Data utility
maskme analyze utility --original visits.csv \
    --anonymized visits_masked.csv \
    --report utility_report.html

The --qi flag takes the quasi-identifier fields — attributes that could be combined to re-identify someone (age + postal code, birthdate + gender, etc.). The --sa flag specifies the sensitive attribute to protect.

Three metrics work together to measure re-identification risk.

k-Anonymity

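Every record must be indistinguishable from at least k-1 other records based on its quasi-identifiers. Records that share the same quasi-identifier values form an equivalence class, and a low k_min means some records sit in very small classes and are easy to single out.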

maskme analyze risk --input visits_masked.csv \
    --qi age postal_code --sa diagnosis \
    --k-threshold 5

Key metrics in the output:

| Metric | What it tells you |
|---|---|
| k_min | Smallest equivalence class size (worst case) |
| k_mean / k_median | Typical class size across the dataset |
| at_risk_records | Records in classes below the threshold |
| pct_at_risk | Percentage of records at risk |

If k_min is 1, the dataset has unique records that can be directly re-identified. Solution: generalize or suppress more aggressively.
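The metrics above are straightforward to compute by hand, which helps when interpreting a report. The sketch below groups records by their quasi-identifier values with collections.Counter. The function name k_anonymity and the returned dictionary are illustrative assumptions that mirror the metric names in the table, not MaskMe's API.

```python
from collections import Counter
from statistics import mean, median


def k_anonymity(records, quasi_identifiers, k_threshold=5):
    """Group records by their quasi-identifier values and summarise class sizes."""
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    sizes = list(classes.values())
    at_risk = sum(size for size in sizes if size < k_threshold)
    return {
        "k_min": min(sizes),
        "k_mean": mean(sizes),
        "k_median": median(sizes),
        "at_risk_records": at_risk,
        "pct_at_risk": 100.0 * at_risk / len(records),
    }


records = [
    {"age": "30-35", "postal_code": "75001", "diagnosis": "Flu"},
    {"age": "30-35", "postal_code": "75001", "diagnosis": "Diabetes"},
    {"age": "30-35", "postal_code": "75001", "diagnosis": "Flu"},
    {"age": "40-45", "postal_code": "75002", "diagnosis": "Asthma"},
]
summary = k_anonymity(records, ["age", "postal_code"], k_threshold=2)
print(summary)
```

Here the fourth record is alone in its class, so k_min is 1 and one record in four is at risk.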

l-Diversity

Even when records are k-anonymous, if everyone in a class has the same sensitive value (e.g. all have the same diagnosis), attribute disclosure is possible. l-diversity requires at least l distinct sensitive values per equivalence class.

maskme analyze risk --input visits_masked.csv \
    --qi age postal_code --sa diagnosis \
    --l-threshold 3

Key metrics:

| Metric | What it tells you |
|---|---|
| l_min | Fewest distinct sensitive values in any class |
| at_risk_classes | Classes with fewer distinct values than the threshold |

Distinct l-diversity (the variant MaskMe uses) is a good baseline. An l_min of 1 means at least one class has only one sensitive value, which is full attribute disclosure for those records.
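Distinct l-diversity can be checked by counting the different sensitive values inside each equivalence class. The following is a minimal sketch with a hypothetical function name and return shape, not MaskMe's implementation.

```python
from collections import defaultdict


def l_diversity(records, quasi_identifiers, sensitive_attr, l_threshold=3):
    """Count distinct sensitive values per equivalence class."""
    values_by_class = defaultdict(set)
    for r in records:
        key = tuple(r[q] for q in quasi_identifiers)
        values_by_class[key].add(r[sensitive_attr])
    distinct = {key: len(values) for key, values in values_by_class.items()}
    return {
        "l_min": min(distinct.values()),
        "at_risk_classes": [key for key, n in distinct.items() if n < l_threshold],
    }


records = [
    {"age": "30-35", "postal_code": "75001", "diagnosis": "Flu"},
    {"age": "30-35", "postal_code": "75001", "diagnosis": "Diabetes"},
    {"age": "30-35", "postal_code": "75001", "diagnosis": "Flu"},
    {"age": "40-45", "postal_code": "75002", "diagnosis": "Asthma"},
]
result = l_diversity(records, ["age", "postal_code"], "diagnosis", l_threshold=2)
print(result)
```

The first class holds two distinct diagnoses and passes a threshold of 2. The second holds only one, so anyone known to be in that class has their diagnosis disclosed.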

t-Closeness

Measures how much the distribution of the sensitive attribute within each class deviates from the global distribution. Uses Earth Mover’s Distance (EMD).

maskme analyze risk --input visits_masked.csv \
    --qi age postal_code --sa diagnosis \
    --t-threshold 0.15

Key metrics:

| Metric | What it tells you |
|---|---|
| t_max | Maximum EMD across all classes (worst case) |
| at_risk_classes | Classes exceeding the threshold |

A t_max above 0.2 means some classes have a significantly different sensitive distribution — an adversary could infer the sensitive value with higher confidence than from the global distribution alone.
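For a categorical sensitive attribute where every pair of values is treated as equally far apart, Earth Mover's Distance reduces to half the sum of absolute differences between the two distributions. The sketch below uses that simplification. Whether MaskMe uses the same ground distance is not stated here, so treat it, and the function names, as assumptions.

```python
from collections import Counter, defaultdict


def distribution(values):
    """Relative frequency of each value."""
    counts = Counter(values)
    return {v: c / len(values) for v, c in counts.items()}


def t_closeness(records, quasi_identifiers, sensitive_attr, t_threshold=0.15):
    """Distance between each class's sensitive distribution and the global one."""
    overall = distribution([r[sensitive_attr] for r in records])
    by_class = defaultdict(list)
    for r in records:
        by_class[tuple(r[q] for q in quasi_identifiers)].append(r[sensitive_attr])
    distances = {}
    for key, values in by_class.items():
        local = distribution(values)
        # EMD with equal ground distance: half the L1 distance.
        distances[key] = 0.5 * sum(abs(local.get(v, 0.0) - p) for v, p in overall.items())
    return {
        "t_max": max(distances.values()),
        "at_risk_classes": [k for k, d in distances.items() if d > t_threshold],
    }


records = [
    {"age": "30-35", "diagnosis": "Flu"},
    {"age": "30-35", "diagnosis": "Flu"},
    {"age": "40-45", "diagnosis": "Flu"},
    {"age": "40-45", "diagnosis": "Diabetes"},
]
result = t_closeness(records, ["age"], "diagnosis")
print(result)
```

Globally, 75% of records are Flu. The first class is 100% Flu and the second 50%, so both sit at a distance of 0.25 from the global distribution and exceed the 0.15 threshold.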

Threshold recommendations:

| Metric | Strict | Moderate | Relaxed |
|---|---|---|---|
| k-anonymity | 5–10 | 3–5 | 2 |
| l-diversity | 5+ | 3–5 | 2 |
| t-closeness | 0.05–0.10 | 0.10–0.20 | 0.20–0.50 |

from maskme.analytics.risk import run, report

records = [
    {"age": "30-35", "postal_code": "75001", "diagnosis": "Flu"},
    {"age": "30-35", "postal_code": "75001", "diagnosis": "Diabetes"},
    {"age": "30-35", "postal_code": "75001", "diagnosis": "Flu"},
    # ...
]

results = run(
    records=records,
    quasi_identifiers=["age", "postal_code"],
    sensitive_attr="diagnosis",
)

for r in results:
    status = "✓" if r.passed else "✗"
    print(f"{status}  {r.name}: passed={r.passed}")

# HTML report
report.generate(
    results,
    output_path="risk_report.html",
    dataset_info={"records": len(records)},
)

Run individual metrics:

results = run(
    records=records,
    quasi_identifiers=["age", "postal_code"],
    sensitive_attr="diagnosis",
    analytics=["k_anonymity", "t_closeness"],  # skip l-diversity
    k_threshold=5,
    t_threshold=0.1,
)

Anonymization always destroys some information. Utility metrics measure how much data quality is preserved.

Field Retention

Measures how many values remain unchanged per field.

maskme analyze utility --original visits.csv \
    --anonymized visits_masked.csv

Key output:

✓  Field Retention           score=0.85  passed

This means 85% of values across all fields were preserved identically. If a score is low, check which fields are being heavily modified — they may be over-anonymized.
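As an illustration of what the score counts, the sketch below computes the share of identical values per field and averages across fields. It assumes that records are aligned by position and that a dropped field counts as zero retention. Both are assumptions, as is the function name; MaskMe's exact scoring may differ.

```python
def field_retention(original, anonymised):
    """Share of values identical before and after, per field and overall."""
    per_field = {}
    for field in original[0].keys():
        same = sum(
            1
            for o, a in zip(original, anonymised)
            if field in a and a[field] == o[field]
        )
        per_field[field] = same / len(original)
    overall = sum(per_field.values()) / len(per_field)
    return per_field, overall


original = [
    {"name": "Alice", "age": 27, "region": "US-CA"},
    {"name": "Bob", "age": 41, "region": "US-NY"},
]
anonymised = [
    {"age": "20-30", "region": "US-CA"},
    {"age": "40-50", "region": "US-NY"},
]
per_field, overall = field_retention(original, anonymised)
print(per_field, round(overall, 2))
```

The per-field breakdown makes it easy to see which fields pull the overall score down: here only region survives unchanged.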

Statistical Fidelity

Compares statistical properties before and after anonymization: means, standard deviations, distributions. For categorical fields, uses Total Variation Distance; for numerical, uses normalised delta and Spearman rank correlation.

✓  Statistical Fidelity      score=0.72  passed

A score of 0.72 means statistical patterns are reasonably preserved. Low-scoring fields are listed in the detailed output.

Information Loss

The Information Loss Index (ILI) measures how much each field deviates from its original values. Utility score = 1 − ILI.

✓  Information Loss          score=0.65  passed

Numerical fields use Normalised Mean Absolute Error — a value with 20% error contributes ILI of 0.2. Categorical fields count any change (hash, redact) as full loss (ILI = 1.0).
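A per-field version of this calculation might look like the sketch below, reading the utility score as 1 minus ILI. Normalising each error by the original value (so a 20% error contributes 0.2) and capping at 1 are assumptions; a range-based normaliser is another common choice, and the function name is hypothetical.

```python
def information_loss(original, anonymised, numerical=True):
    """Mean per-value loss in [0, 1] for one field."""
    losses = []
    for o, a in zip(original, anonymised):
        if numerical:
            if o == 0:
                # Relative error is undefined at zero: count any change as full loss.
                losses.append(0.0 if a == o else 1.0)
            else:
                losses.append(min(abs(a - o) / abs(o), 1.0))
        else:
            # Categorical: any change (hash, redact) is a full loss.
            losses.append(0.0 if a == o else 1.0)
    return sum(losses) / len(losses)


ili = information_loss([100, 200], [120, 160])   # each value is off by 20%
print(round(ili, 2), round(1 - ili, 2))           # 0.2 0.8

categorical = information_loss(["Flu", "Flu"], ["Flu", "a1b2"], numerical=False)
print(categorical)                                # 0.5
```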

Threshold recommendations:

| Metric | Good | Acceptable | Needs review |
|---|---|---|---|
| Field Retention | ≥ 0.80 | 0.50–0.80 | < 0.50 |
| Statistical Fidelity | ≥ 0.80 | 0.60–0.80 | < 0.60 |
| Information Loss (score) | ≥ 0.80 | 0.50–0.80 | < 0.50 |

from maskme.analytics.utility import run, report

results = run(
    original=original_records,
    anonymised=anonymised_records,
)

for r in results:
    print(f"{r.name}: score={r.score:.2f}  passed={r.passed}")

# HTML report
report.generate(
    results,
    output_path="utility_report.html",
    dataset_info={"records": len(original_records)},
)

Explicitly declare field types for better accuracy:

results = run(
    original=original_records,
    anonymised=anonymised_records,
    numerical_fields=["age", "purchase_count"],
    categorical_fields=["region", "diagnosis"],
    field_retention_threshold=0.6,
    statistical_fidelity_threshold=0.7,
    information_loss_threshold=0.6,
)

Both maskme analyze risk and maskme analyze utility generate self-contained HTML reports when --report is specified:

maskme analyze risk --input visits_masked.csv \
    --qi age postal_code --sa diagnosis \
    --k-threshold 3 --report risk_report.html
# Opens in any browser → risk_report.html

maskme analyze utility --original visits.csv \
    --anonymized visits_masked.csv \
    --report utility_report.html
# → utility_report.html

Each report includes:

  • Executive summary with pass/fail badges

  • Per-metric section with score, threshold, and SVG charts

  • Detailed per-field / per-class breakdowns

  • Recommendations for improvement

There is no single “right” threshold. The balance depends on your use case:

| Scenario | Privacy priority | Utility priority | Typical thresholds |
|---|---|---|---|
| Medical research | High | High | k=5, l=3, t=0.1, utility ≥ 0.7 |
| Public dataset release | High | Medium | k=10, l=5, t=0.05, utility ≥ 0.5 |
| Internal analytics | Medium | High | k=3, l=2, t=0.2, utility ≥ 0.8 |
| Exploratory (masked sample) | Low | Maximum | k=2, l=2, t=0.5, utility ≥ 0.9 |

Rule of thumb: Run risk + utility together. If privacy passes but utility fails, your anonymization is too aggressive. If utility passes but privacy fails, you need stronger strategies. Adjust and re-measure.

Part 4: Anonymizing Unstructured Text with NER

So far we’ve worked with structured data (tabular records). But what about unstructured text — patient notes, emails, support tickets, or any free-form text containing names, locations, or dates?

MaskMe’s NER module detects and replaces personally identifiable information in free text using Named Entity Recognition (spaCy).

The NER module requires extra dependencies:

pip install "maskme[ner]"
python -m spacy download en_core_web_lg
python -m spacy download fr_core_news_lg

MaskMe auto-detects French (fr) and English (en). Install only the language(s) you need.

maskme ner document.txt -o document_anon.txt

Example (report.txt):

Le Dr. Martin a diagnostiqué Alice Johnson le 12 janvier 2024
à l'Hôpital Saint-Louis à Paris. Le traitement commence à 14h30.

Run:

maskme ner report.txt -l fr -o report_anon.txt

Output (report_anon.txt):

Le [PERSON] a diagnostiqué [PERSON] le [DATE]
à l'[ORGANISATION] à [LOCATION]. Le traitement commence à [TIME].

Key CLI options:

| Option | Description |
|---|---|
| -o, --output | Output file path (stdout if omitted) |
| -l, --language | Language override (fr or en); auto-detected if omitted |
| --lines | Process each line as a separate text (batch mode) |
| --verbose | Enable debug logging |

Line-by-line mode is useful when each line is a separate record:

maskme ner patients.txt --lines -o patients_anon.txt

The same functionality is available from Python through the mask function:

from maskme.ner import mask

# Single text → returns a PipelineResult
result = mask("Alice habite à Paris.")
print(result.output)
# → '[PERSON] habite à [LOCATION].'

print(result.entities)
# → [Entity(text='Alice', label=<EntityLabel.PERSON: 'PERSON'>, ...),
#     Entity(text='Paris', label=<EntityLabel.LOCATION: 'LOCATION'>, ...)]

print(result.language)
# → 'fr'

# Batch processing → returns a list of PipelineResult
results = mask([
    "Alice habite à Paris.",
    "John lives in London.",
])
for r in results:
    print(f"[{r.language}] {r.output}")
# → [fr] [PERSON] habite à [LOCATION].
# → [en] [PERSON] lives in [LOCATION].

# Language hint (skip auto-detection)
result = mask("Dr. Smith works at General Hospital.", language="en")
print(result.output)
# → '[PERSON] works at [ORGANISATION].'

Detected entity types are replaced with bracketed tags:

| Entity Label | Tag | Example |
|---|---|---|
| PERSON | [PERSON] | Names, doctors, patients |
| LOCATION | [LOCATION] | Cities, countries, addresses |
| ORGANISATION | [ORGANISATION] | Hospitals, companies |
| DATE | [DATE] | Full dates, “12 janvier 2024” |
| TIME | [TIME] | Times, “14h30” |

result = mask("Alice habite à Paris.")

result.output        # Tagged text
result.input         # Original text
result.entities      # List of detected Entity objects
result.language      # Detected language
result.entity_count  # Number of entities masked
result.labels_found  # Deduplicated, sorted entity labels
result.stats         # Processing metadata (timing, detectors)
result.as_dict()     # Serialize everything to a dict

Note

Without spaCy installed, the module degrades gracefully: mask() returns the original text unchanged and logs a warning with install instructions.

Part 5: Real-World Example

Let’s put it all together with a realistic scenario:

Scenario: A healthcare organization wants to share patient visit data for research while protecting privacy and meeting HIPAA compliance requirements.

Field classification (using the privacy-compliance decision tree):

  • patient_id: Direct identifier → drop (STEP 1: Remove direct identifiers per HIPAA)

  • age: Quasi-identifier → generalize (STEP 2: Protect quasi-identifiers as per Latanya Sweeney’s research)

  • postal_code: Quasi-identifier → generalize (STEP 2: Postal code + age can re-identify individuals)

  • diagnosis: Analytical payload → hash (STEP 3: Preserve for research while protecting identity)

  • visit_date: Quasi-identifier → generalize (STEP 2: Protect date information)

  • medication: Non-sensitive → keep (STEP 4: Analytical value without privacy risk)

Rules file (healthcare_rules.json):

{
  "patient_id": "drop",
  "age": {
    "strategy": "generalize",
    "step": 5,
    "method": "range"
  },
  "postal_code": {
    "strategy": "generalize",
    "depth": 2
  },
  "diagnosis": {"strategy": "hash", "salt": "healthcare-2026"},
  "visit_date": {
    "strategy": "generalize",
    "method": "date_year"
  },
  "medication": "keep"
}

Run anonymization:

maskme --rules healthcare_rules.json --input visits.csv --output visits_masked.csv

Next Steps

Now that you understand the basics:

  1. Start with the least-destructive strategy: Keep > Generalize > Redact > Hash > Noise > Drop

  2. Test with a small sample first: Verify behavior before running on production data

  3. Always measure privacy: Don’t anonymize without understanding the privacy-utility trade-off

  4. Use salt for hashing: Makes hashes unique to your organization

  5. Document your choices: Track which strategies you chose and why for compliance