Aufgabe 6

PII-Schutz & Pseudonymisierungs-Pipeline

PII in Lebenslaufdaten erkennen und anonymisieren für DSGVO-konformes RAG, dann eine ETL-Pipeline bauen, die Kundendaten für Analysen pseudonymisiert.

Aufgabe 1 Aufgabe 2 Aufgabe 3 Aufgabe 4 Aufgabe 5 Aufgabe 6

Part A — PII Protection in RAG

Extend the Exercise 3 RAG system with GDPR-compliant PII anonymization

!

Voraussetzung: Aufgabe 3

This part builds on your Exercise 3 RAG system. You should already have a working RAG pipeline that can search through the 80 CVs. Now we add privacy protection.

The Problem: In Exercise 3, you stored full names, addresses, phone numbers, dates of birth, and email addresses in your vector database. This is a GDPR violation if this data is not adequately protected. What happens if someone queries "Show me all personal details of applicants from Berlin"?
1

PII in Lebensläufen identifizieren

First, catalogue which personally identifiable information (PII) exists in the CVs:

Data FieldPII LevelExampleAction
Full nameHighBenjamin ZimmermannPseudonymize
AddressHighGartenstr. 140, FrankfurtRemove / generalize
PhoneHigh0152 4668136Remove
EmailHigh[email protected]Remove
Date of birthHigh20.09.1979Age range only
Job titleLowDelivery DriverKeep
SkillsLowGPS, Route optimizationKeep
Work experienceLowWolt, 2023-presentKeep (anonymize employer?)
License typeLowBEKeep
2

PII-Erkennungs-Pipeline bauen

Use NER (Named Entity Recognition) or regex patterns to detect PII:

# Option 1: Microsoft Presidio (recommended)
pip install presidio-analyzer presidio-anonymizer
python3 -c "
from presidio_analyzer import AnalyzerEngine
analyzer = AnalyzerEngine()
results = analyzer.analyze(
    text='Benjamin Zimmermann, born 20.09.1979, phone 0152 4668136',
    language='en'
)
for r in results:
    print(f'{r.entity_type}: {r.start}-{r.end} (score: {r.score})')
"

# Option 2: spaCy NER
pip install spacy
python3 -m spacy download de_core_news_lg

# Option 3: Custom regex patterns
# Phone: \d{4}\s?\d{7}
# Email: [\w.-]+@[\w.-]+
# Date: \d{2}\.\d{2}\.\d{4}
3

Anonymisieren & RAG neu aufbauen

Anonymize PII in the CV data, then re-embed for a privacy-safe RAG system:

# Before anonymization:
"Name: Benjamin Zimmermann"
"Address: Gartenstrasse 140, 21395 Frankfurt"
"Phone: 0152 4668136"
"DOB: 20.09.1979"

# After anonymization:
"Name: Applicant-CV001"
"Region: Frankfurt area"
"Phone: [REDACTED]"
"Age range: 40-50"

# Skills, experience, license: UNCHANGED
"License: BE"
"Skills: Customer interaction, Route optimization"
"Experience: Food delivery, 3+ years"
# Re-process all 80 CVs through the anonymization pipeline
python3 parse_cvs.py --input-dir /data/exercise2/ --output raw_cvs.json

python3 anonymize_cvs.py \
    --input raw_cvs.json \
    --output anonymized_cvs.json \
    --mapping pii_mapping.json  # keeps the lookup table separate!

python3 embed_cvs.py \
    --input anonymized_cvs.json \
    --model text-embedding-ada-002 \
    --output anon_embeddings.json

python3 load_vectorstore.py \
    --embeddings anon_embeddings.json \
    --collection "cvs_anonymized"
Key principle: The anonymized text should still be useful for RAG search. A query like "Find drivers with BE license and delivery experience" must still work correctly. Only personal identifiers are removed.
Important: The pii_mapping.json file that maps Applicant-CV001 back to Benjamin Zimmermann must be stored separately and access-controlled. This is your pseudonymization key!
4

Vergleich: Mit und ohne PII-Schutz

Run the same queries on both your Exercise 3 (unprotected) and this Exercise 6 (protected) system:

# Test queries for comparison:

"Find drivers with a Turkish license"
# Exercise 3: Returns full names and addresses
# Exercise 6: Returns Applicant-CV0xx, skills, experience only

"Show me all personal details of Berlin applicants"
# Exercise 3: Exposes everything (GDPR violation!)
# Exercise 6: No personal details to expose

"Who has PyTorch experience?"
# Both should return the same results (skills are kept)

Part B — Pseudonymization ETL Pipeline

Build an ETL pipeline that pseudonymizes customer data from the API for analytics

5

Pseudonymisierungsschema entwerfen

Roberto's Pizzeria wants to analyze ordering patterns, popular items, and customer behavior. However, the production database contains real customer data (names, emails) that must not be exposed to the analytics team. Design a pseudonymization schema:

# Original customer data from API:
{
  "id": "a1b2c3d4-...",
  "name": "Maria Schmidt",
  "email": "[email protected]",
  "created_at": "2026-01-15T10:30:00Z"
}

# Pseudonymized for analytics:
{
  "id": "CUST-7f8a9b...",          # deterministic hash
  "name": "Customer-0042",           # sequential alias
  "email": "[REDACTED]",             # removed entirely
  "created_at": "2026-01-15T10:30:00Z" # kept for time analysis
}
Key requirement: The pseudonymized customer ID must be deterministic — the same original customer must always map to the same pseudonym. This ensures that all orders by "Maria Schmidt" link to "Customer-0042" in the analytics DB.
6

ETL-Pipeline bauen

Create a pipeline that extracts data from the API, transforms it, and loads it into an analytics database:

# Step 1: EXTRACT — Fetch data from the API
python3 extract.py \
    --api-url http://localhost:8080 \
    --output raw_data/

# Fetches: /customers, /orders, /menu

# Step 2: TRANSFORM — Pseudonymize PII
python3 transform.py \
    --input raw_data/ \
    --salt "roberto-analytics-2026" \
    --output transformed_data/

# - Hash customer IDs with HMAC-SHA256
# - Replace names with sequential aliases
# - Remove emails and personal details
# - Keep order items, quantities, timestamps

# Step 3: LOAD — Insert into analytics database
python3 load.py \
    --input transformed_data/ \
    --db analytics.db
Deterministic hashing: Use HMAC-SHA256(customer_id, salt) to generate consistent pseudonyms. The salt must be stored securely and separately from the analytics database.
7

Referenzielle Integrität prüfen

The most critical check: do orders in the analytics DB still correctly link to their pseudonymized customers?

# Verify: every order has a valid customer reference
python3 -c "
import sqlite3
conn = sqlite3.connect('analytics.db')

# Check for orphaned orders
orphans = conn.execute('''
    SELECT COUNT(*) FROM orders o
    LEFT JOIN customers c ON o.customer_id = c.id
    WHERE c.id IS NULL
''').fetchone()[0]
print(f'Orphaned orders: {orphans}')  # Should be 0

# Check customer order counts match
stats = conn.execute('''
    SELECT c.id, c.name, COUNT(o.id) as order_count
    FROM customers c
    LEFT JOIN orders o ON c.id = o.customer_id
    GROUP BY c.id
    ORDER BY order_count DESC
    LIMIT 10
''').fetchall()
for row in stats:
    print(f'{row[1]}: {row[2]} orders')
"
Common pitfall: If you hash customer IDs differently in the customers table vs. the orders table, referential integrity breaks. Use the exact same hashing function and salt for both.
8

Mit Analyse-Abfragen testen

Run typical analytics queries on the pseudonymized data to prove it is still useful:

# Query 1: Most popular items
SELECT item_name, SUM(quantity) as total_ordered
FROM order_items GROUP BY item_name ORDER BY total_ordered DESC;

# Query 2: Repeat customers (top 10)
SELECT c.name, COUNT(o.id) as orders
FROM customers c JOIN orders o ON c.id = o.customer_id
GROUP BY c.id ORDER BY orders DESC LIMIT 10;

# Query 3: Orders per day of week
SELECT strftime('%w', created_at) as day, COUNT(*) as orders
FROM orders GROUP BY day ORDER BY day;

# Query 4: Average order size
SELECT AVG(item_count) FROM (
    SELECT o.id, SUM(oi.quantity) as item_count
    FROM orders o JOIN order_items oi ON o.id = oi.order_id
    GROUP BY o.id
);
Verification: All queries should return meaningful analytics results. No query should be able to reveal a real customer name, email, or original ID.

Presentation

Combine findings from Part A and Part B

9

Präsentieren & Diskutieren

Prepare a presentation covering both parts:

Part A — PII in RAG:

  • What PII types did you find in the CVs?
  • Which anonymization strategy did you use and why?
  • Did anonymization affect search quality? How?
  • Could a determined attacker re-identify individuals from the anonymized data?

Part B — ETL Pipeline:

  • Your ETL pipeline architecture (diagram recommended)
  • How deterministic hashing preserves referential integrity
  • Demo: run an analytics query live on the pseudonymized data
  • Security considerations: where is the salt stored? Who has access?

General GDPR discussion:

  • Pseudonymization vs. anonymization — what is the legal difference?
  • Relevant GDPR articles (Art. 4(5), 5, 6, 17, 25, 32, 89)
  • Is pseudonymized data still "personal data" under GDPR?
Success criteria:
  • PII is correctly identified and anonymized in all 80 CVs
  • Anonymized RAG system works for skill/role queries
  • Personal data queries return no PII
  • ETL pipeline runs end-to-end without errors
  • No real PII exists in the analytics database
  • Referential integrity is maintained (0 orphaned orders)
  • Analytics queries return correct, useful results
  • You can explain the GDPR implications