Exercise 6 — PII Protection & Pseudonymization Pipeline

Part A — PII Protection in RAG

Extend the Exercise 3 RAG system with GDPR-compliant PII anonymization

Voraussetzung: Aufgabe 3

This part builds on your Exercise 3 RAG system. You should already have a working RAG pipeline that can search through the 80 CVs. Now we add privacy protection.

The Problem: In Exercise 3, you stored full names, addresses, phone numbers, dates of birth, and email addresses in your vector database. This is a GDPR violation if this data is not adequately protected. What happens if someone queries "Show me all personal details of applicants from Berlin"?

PII in Lebensläufen identifizieren

First, catalogue which personally identifiable information (PII) exists in the CVs:

Data Field	PII Level	Example	Action
Full name	High	Benjamin Zimmermann	Pseudonymize
Address	High	Gartenstr. 140, Frankfurt	Remove / generalize
Phone	High	0152 4668136	Remove
Email	High	[email protected]	Remove
Date of birth	High	20.09.1979	Age range only
Job title	Low	Delivery Driver	Keep
Skills	Low	GPS, Route optimization	Keep
Work experience	Low	Wolt, 2023-present	Keep (anonymize employer?)
License type	Low	BE	Keep

PII-Erkennungs-Pipeline bauen

Use NER (Named Entity Recognition) or regex patterns to detect PII:

# Option 1: Microsoft Presidio (recommended)
pip install presidio-analyzer presidio-anonymizer
python3 -c "
from presidio_analyzer import AnalyzerEngine
analyzer = AnalyzerEngine()
results = analyzer.analyze(
    text='Benjamin Zimmermann, born 20.09.1979, phone 0152 4668136',
    language='en'
)
for r in results:
    print(f'{r.entity_type}: {r.start}-{r.end} (score: {r.score})')
"

# Option 2: spaCy NER
pip install spacy
python3 -m spacy download de_core_news_lg

# Option 3: Custom regex patterns
# Phone: \d{4}\s?\d{7}
# Email: [\w.-]+@[\w.-]+
# Date: \d{2}\.\d{2}\.\d{4}

Anonymisieren & RAG neu aufbauen

Anonymize PII in the CV data, then re-embed for a privacy-safe RAG system:

# Before anonymization:
"Name: Benjamin Zimmermann"
"Address: Gartenstrasse 140, 21395 Frankfurt"
"Phone: 0152 4668136"
"DOB: 20.09.1979"

# After anonymization:
"Name: Applicant-CV001"
"Region: Frankfurt area"
"Phone: [REDACTED]"
"Age range: 40-50"

# Skills, experience, license: UNCHANGED
"License: BE"
"Skills: Customer interaction, Route optimization"
"Experience: Food delivery, 3+ years"

# Re-process all 80 CVs through the anonymization pipeline
python3 parse_cvs.py --input-dir /data/exercise2/ --output raw_cvs.json

python3 anonymize_cvs.py \
    --input raw_cvs.json \
    --output anonymized_cvs.json \
    --mapping pii_mapping.json  # keeps the lookup table separate!

python3 embed_cvs.py \
    --input anonymized_cvs.json \
    --model text-embedding-ada-002 \
    --output anon_embeddings.json

python3 load_vectorstore.py \
    --embeddings anon_embeddings.json \
    --collection "cvs_anonymized"

Key principle: The anonymized text should still be useful for RAG search. A query like "Find drivers with BE license and delivery experience" must still work correctly. Only personal identifiers are removed.

Important: The pii_mapping.json file that maps Applicant-CV001 back to Benjamin Zimmermann must be stored separately and access-controlled. This is your pseudonymization key!

Vergleich: Mit und ohne PII-Schutz

Run the same queries on both your Exercise 3 (unprotected) and this Exercise 6 (protected) system:

# Test queries for comparison:

"Find drivers with a Turkish license"
# Exercise 3: Returns full names and addresses
# Exercise 6: Returns Applicant-CV0xx, skills, experience only

"Show me all personal details of Berlin applicants"
# Exercise 3: Exposes everything (GDPR violation!)
# Exercise 6: No personal details to expose

"Who has PyTorch experience?"
# Both should return the same results (skills are kept)

Part B — Pseudonymization ETL Pipeline

Build an ETL pipeline that pseudonymizes customer data from the API for analytics

Pseudonymisierungsschema entwerfen

Roberto's Pizzeria wants to analyze ordering patterns, popular items, and customer behavior. However, the production database contains real customer data (names, emails) that must not be exposed to the analytics team. Design a pseudonymization schema:

# Original customer data from API:
{
  "id": "a1b2c3d4-...",
  "name": "Maria Schmidt",
  "email": "[email protected]",
  "created_at": "2026-01-15T10:30:00Z"
}

# Pseudonymized for analytics:
{
  "id": "CUST-7f8a9b...",          # deterministic hash
  "name": "Customer-0042",           # sequential alias
  "email": "[REDACTED]",             # removed entirely
  "created_at": "2026-01-15T10:30:00Z" # kept for time analysis
}

Key requirement: The pseudonymized customer ID must be deterministic — the same original customer must always map to the same pseudonym. This ensures that all orders by "Maria Schmidt" link to "Customer-0042" in the analytics DB.

ETL-Pipeline bauen

Create a pipeline that extracts data from the API, transforms it, and loads it into an analytics database:

# Step 1: EXTRACT — Fetch data from the API
python3 extract.py \
    --api-url http://localhost:8080 \
    --output raw_data/

# Fetches: /customers, /orders, /menu

# Step 2: TRANSFORM — Pseudonymize PII
python3 transform.py \
    --input raw_data/ \
    --salt "roberto-analytics-2026" \
    --output transformed_data/

# - Hash customer IDs with HMAC-SHA256
# - Replace names with sequential aliases
# - Remove emails and personal details
# - Keep order items, quantities, timestamps

# Step 3: LOAD — Insert into analytics database
python3 load.py \
    --input transformed_data/ \
    --db analytics.db

Deterministic hashing: Use HMAC-SHA256(customer_id, salt) to generate consistent pseudonyms. The salt must be stored securely and separately from the analytics database.

Referenzielle Integrität prüfen

The most critical check: do orders in the analytics DB still correctly link to their pseudonymized customers?

# Verify: every order has a valid customer reference
python3 -c "
import sqlite3
conn = sqlite3.connect('analytics.db')

# Check for orphaned orders
orphans = conn.execute('''
    SELECT COUNT(*) FROM orders o
    LEFT JOIN customers c ON o.customer_id = c.id
    WHERE c.id IS NULL
''').fetchone()[0]
print(f'Orphaned orders: {orphans}')  # Should be 0

# Check customer order counts match
stats = conn.execute('''
    SELECT c.id, c.name, COUNT(o.id) as order_count
    FROM customers c
    LEFT JOIN orders o ON c.id = o.customer_id
    GROUP BY c.id
    ORDER BY order_count DESC
    LIMIT 10
''').fetchall()
for row in stats:
    print(f'{row[1]}: {row[2]} orders')
"

Common pitfall: If you hash customer IDs differently in the customers table vs. the orders table, referential integrity breaks. Use the exact same hashing function and salt for both.

Mit Analyse-Abfragen testen

Run typical analytics queries on the pseudonymized data to prove it is still useful:

# Query 1: Most popular items
SELECT item_name, SUM(quantity) as total_ordered
FROM order_items GROUP BY item_name ORDER BY total_ordered DESC;

# Query 2: Repeat customers (top 10)
SELECT c.name, COUNT(o.id) as orders
FROM customers c JOIN orders o ON c.id = o.customer_id
GROUP BY c.id ORDER BY orders DESC LIMIT 10;

# Query 3: Orders per day of week
SELECT strftime('%w', created_at) as day, COUNT(*) as orders
FROM orders GROUP BY day ORDER BY day;

# Query 4: Average order size
SELECT AVG(item_count) FROM (
    SELECT o.id, SUM(oi.quantity) as item_count
    FROM orders o JOIN order_items oi ON o.id = oi.order_id
    GROUP BY o.id
);

Verification: All queries should return meaningful analytics results. No query should be able to reveal a real customer name, email, or original ID.

Presentation

Combine findings from Part A and Part B

Präsentieren & Diskutieren

Prepare a presentation covering both parts:

Part A — PII in RAG:

What PII types did you find in the CVs?
Which anonymization strategy did you use and why?
Did anonymization affect search quality? How?
Could a determined attacker re-identify individuals from the anonymized data?

Part B — ETL Pipeline:

Your ETL pipeline architecture (diagram recommended)
How deterministic hashing preserves referential integrity
Demo: run an analytics query live on the pseudonymized data
Security considerations: where is the salt stored? Who has access?

General GDPR discussion:

Pseudonymization vs. anonymization — what is the legal difference?
Relevant GDPR articles (Art. 4(5), 5, 6, 17, 25, 32, 89)
Is pseudonymized data still "personal data" under GDPR?

Success criteria:

PII is correctly identified and anonymized in all 80 CVs
Anonymized RAG system works for skill/role queries
Personal data queries return no PII
ETL pipeline runs end-to-end without errors
No real PII exists in the analytics database
Referential integrity is maintained (0 orphaned orders)
Analytics queries return correct, useful results
You can explain the GDPR implications

DSGVO-Artikel

Art. 4(5): Definition of pseudonymization
Art. 5: Data minimization
Art. 6: Lawful processing
Art. 17: Right to erasure
Art. 25: Privacy by design
Art. 32: Security measures
Art. 89: Processing for research/statistics
Recital 26: Pseudonymized vs. anonymous data

Schlüsselwörter

PII: Personally Identifiable Information
NER: Named Entity Recognition
ETL: Extract, Transform, Load
Pseudonymization: Reversible with key
Anonymization: Irreversible removal
HMAC: Hash-based Message Auth Code
Salt: Secret input to hash function
Referential integrity: Foreign keys remain valid

PII-Schutz & Pseudonymisierungs-Pipeline