Architecture: The Anonymization Engine

Understanding MaskMe’s design helps you use it effectively and extend it for custom needs.

Design Principles

Modularity: Each component (engine, strategies, I/O handlers, analytics) is independent.

Composability: Mix strategies, formats, and analytics as needed.

Simplicity: One mental model works for CLI, Python API, and custom extensions.

Streaming: Process data record-by-record to handle datasets larger than RAM.

Observability: Measure privacy formally; don’t guess.

Extensibility: Add custom strategies or I/O handlers without modifying core code.

High-Level Overview

Input Data (CSV/JSON/JSONL)
       ↓
I/O Handler (parse records)
       ↓
Engine (apply rules + strategies)
       ↓
Output Data (CSV/JSON/JSONL)
       ↓
Analytics (measure privacy)

The Engine: Core Anonymization Logic

The MaskMe engine is the heart of MaskMe. It:

Receives a rules dict (e.g., {"email": "hash", "id": "drop"})
For each record, resolves strategies and applies transformations
Streams output record-by-record (memory efficient)

Workflow for a single record:

Input Record: {"id": 1, "email": "alice@example.com", "name": "Alice"}

For each rule in rules dict:
  ├─ Rule: "id" → "drop"
  │  └─ Apply drop strategy → "__DROP__"
  │     (Signal: remove this field)
  │
  ├─ Rule: "email" → "hash"
  │  └─ Apply hash strategy → "d6d5d09f12b3f0f1a..."
  │     (Update record)
  │
  └─ Rule: "name" → "keep"
     └─ Apply keep strategy → "Alice"
        (No change)

Output Record: {"email": "d6d5d09f12b3f0f1a...", "name": "Alice"}

Key insight: Rules operate on paths (e.g., user.email for nested dicts).

Path Resolution

MaskMe uses dot notation for nested fields:

record = {
    "user": {
        "profile": {
            "email": "alice@example.com"
        }
    }
}

# Rule path: "user.profile.email"
# MaskMe navigates: record["user"]["profile"]["email"]

This is handled by the navigation module (get_nested, set_nested, delete_nested).

Strategy Resolution

Rules can be:

Simple (string strategy name):

"email": "hash"

Parameterized (dict with strategy + params):

"email": {
  "strategy": "hash",
  "algo": "sha512",
  "salt": "my-salt"
}

The engine resolves this and calls:

strategies["hash"](value, algo="sha512", salt="my-salt")

Registry Pattern:

Strategies are stored in a dict STRATEGIES:

STRATEGIES = {
    "hash": hashing.apply,
    "redact": redaction.apply,
    "keep": noop.apply,
    "drop": drop.apply,
    "noise": noise.apply,
    "generalize": generalization.apply,
}

This makes it easy to: - Look up strategies by name - Register custom strategies (add to the dict) - Mock strategies for testing

I/O Handlers: Format Abstraction

MaskMe works with multiple formats (CSV, JSON, JSONL) using a consistent abstraction.

I/O Handler Interface:

class FormatHandler(ABC):
    def read(self, file_path: str) -> Generator[Dict]:
        """Read file, yield records as dicts."""

    def write(self, records: Iterable[Dict], file_path: str):
        """Write records to file."""

Implementations:

CSVHandler: Reads CSV → dicts, writes dicts → CSV
JSONHandler: Reads JSON list → dicts, writes dicts → JSON array
JSONLHandler: Reads JSONL (one record per line) → dicts, writes JSONL

Key design: All handlers return/accept records as dicts. The engine doesn’t care about format.

Example:

customers.csv (CSV format)
     ↓
CSVHandler.read() → [{"id": 1, "name": "Alice"}, ...]
     ↓
Engine.mask(records) → [{"name": "Alice"}, ...]  (anonymized)
     ↓
CSVHandler.write(records) → customers_masked.csv

The engine logic is identical whether you use CSV, JSON, or JSONL.

Streaming Architecture

Both I/O handlers and the engine stream data:

# CSVHandler.read() is a generator
for record in handler.read("large_file.csv"):  # One record at a time
    # Process record
    pass

# Engine.mask() is a generator
for anonymized_record in masker.mask(records):  # One at a time
    # Write or process
    pass

Benefit: Handle multi-GB files without loading everything into RAM.

The CLI: User-Friendly Interface

The CLI wraps the engine and I/O handlers:

maskme --rules rules.json --input data.csv --output masked.csv

Steps:

Parse rules: Load and validate rules.json
Detect format: Infer CSV/JSON/JSONL from file extensions
Get handlers: Instantiate appropriate I/O handler
Initialize engine: Create MaskMe with rules
Stream: Read records → mask → write records

Pseudo-code:

def main(rules_path, input_path, output_path):
    rules = load_rules(rules_path)
    fmt = detect_format(input_path, output_path)

    handler = get_handler(fmt)
    masker = MaskMe(rules)

    input_records = handler.read(input_path)
    masked_records = masker.mask(input_records)
    handler.write(masked_records, output_path)

Analytics Module

After anonymization, measure privacy:

Metrics computed:

K-anonymity: Minimum group size for quasi-identifiers
L-diversity: Diversity of sensitive attributes within groups
T-closeness: Distance between masked and original distributions
Information loss: Aggregate utility metrics

Architecture:

# Metric classes (compute.py)
class KAnonymity:
    def compute(records, quasi_identifiers) -> int

class LDiversity:
    def compute(records, quasi_identifiers, sensitive) -> float

# Results aggregation (base.py)
class AnalyticResult:
    name: str
    passed: bool
    details: Dict
    charts: List[Chart]

# Reporting (report.py)
def generate(results, output_path, dataset_info):
    # Generates HTML report with charts and summaries

Usage flow:

Original Records + Masked Records
       ↓
Compute K-anonymity, L-diversity, etc.
       ↓
Create AnalyticResult objects
       ↓
Generate HTML report
       ↓
Privacy report (human-readable)

Design Patterns Used

Registry Pattern:

Strategies are registered in a dict, enabling: - Dynamic strategy lookup - Easy custom strategy registration - Dependency injection for testing

Strategy Pattern:

Each strategy is a callable with the same interface:

def apply(value, **kwargs) -> Any

This enables: - Uniform treatment across strategies - Easy swapping/composition - Clear contracts

Factory Pattern:

get_handler(format_name) returns the appropriate handler:

def get_handler(fmt: str) -> FormatHandler:
    handler_class = IO_HANDLERS[fmt]
    return handler_class()

Generator Pattern:

Readers and the engine use generators for memory efficiency:

def read(file) -> Generator[Dict]:
    for line in file:
        yield parse(line)

def mask(records) -> Generator[Dict]:
    for record in records:
        yield process(record)

Extension Points

To extend MaskMe:

1. Add a Custom Strategy:

def my_strategy(value, **kwargs):
    return transform(value)

masker.strategies["my_strategy"] = my_strategy

2. Add a Custom I/O Handler:

class MyFormatHandler(FormatHandler):
    def read(self, file_path):
        # Yield dicts

    def write(self, records, file_path):
        # Write dicts

IO_HANDLERS["myformat"] = MyFormatHandler

3. Add a Custom Analytic:

class MyAnalytic:
    def compute(self, records):
        return AnalyticResult(...)

4. Subclass the Engine:

class MyMaskMe(MaskMe):
    def _process_record(self, record):
        # Custom pre/post-processing
        return super()._process_record(record)

Data Flow Diagrams

CLI Flow:

User Input (rules.json, data.csv)
         ↓
CLI Parser
         ↓
Load Rules & Detect Format
         ↓
I/O Handler: Read CSV
         ↓
Engine: Apply Strategies (streaming)
         ↓
I/O Handler: Write CSV
         ↓
Progress Logging
         ↓
Output (masked.csv)

Python API Flow:

User Code
         ↓
Load Data (any way)
         ↓
Create MaskMe(rules)
         ↓
Call masker.mask(records)
         ↓
Process Anonymized Records
         ↓
User handles I/O

Analytics Flow:

Original + Masked Records
         ↓
Compute Metrics (K-anon, L-div, etc.)
         ↓
AnalyticResult Objects
         ↓
Generate Report (HTML)
         ↓
Privacy Report

Performance Characteristics

Time Complexity:

Per record: O(number of rules) with O(1) average strategy execution
Total: O(n × m) where n = records, m = rules

Space Complexity:

Streaming: O(1) per record (constant memory during processing)
Rules + strategies: O(m) where m = number of rules
Output buffering: O(k) where k = write buffer size (typically small)

Large-file handling:

MaskMe handles multi-GB files efficiently
No need to load entire dataset into RAM
Bottleneck is typically I/O, not processing

Deployment Considerations

CLI Deployment:

Package as entry point (poetry, setuptools)
Use environment variables for sensitive config (salt, API keys)
Log progress for long-running jobs
Add dry-run flag (--limit) for testing

Python API Deployment:

Import as library in data pipelines
Handle errors gracefully (invalid rules, bad data)
Use async/threading for parallelization (outside MaskMe)
Cache rules to avoid reloading

Analytics Deployment:

Run separately after anonymization
Store HTML reports for audit trails
Alert if privacy metrics fail thresholds
Version reports with anonymization batch

Security Considerations

Salt Management:

Salt should be random and organization-specific
Store in secure configuration (not in rules files)
Different salts for different datasets (optional)
Rotate salts periodically (new salt = new anonymization)

Error Messages:

Don’t leak sensitive data in error messages
Validate rules before processing
Log errors securely (no PII in logs)

Input Validation:

Validate rules structure
Check file permissions before reading/writing
Validate CSV/JSON structure early

Access Control:

Limit who can write custom strategies
Audit rules file changes
Require approval for rule changes

Next Steps

Getting Started with MaskMe — Learn by example
Anonymization Strategies — Understand each strategy in detail
How to Create a Custom Strategy — Extend MaskMe with custom logic