PII in Lebenslaufdaten erkennen und anonymisieren für DSGVO-konformes RAG, dann eine ETL-Pipeline bauen, die Kundendaten für Analysen pseudonymisiert.
Extend the Exercise 3 RAG system with GDPR-compliant PII anonymization
This part builds on your Exercise 3 RAG system. You should already have a working RAG pipeline that can search through the 80 CVs. Now we add privacy protection.
First, catalogue which personally identifiable information (PII) exists in the CVs:
| Data Field | PII Level | Example | Action |
|---|---|---|---|
| Full name | High | Benjamin Zimmermann | Pseudonymize |
| Address | High | Gartenstr. 140, Frankfurt | Remove / generalize |
| Phone | High | 0152 4668136 | Remove |
| High | [email protected] | Remove | |
| Date of birth | High | 20.09.1979 | Age range only |
| Job title | Low | Delivery Driver | Keep |
| Skills | Low | GPS, Route optimization | Keep |
| Work experience | Low | Wolt, 2023-present | Keep (anonymize employer?) |
| License type | Low | BE | Keep |
Use NER (Named Entity Recognition) or regex patterns to detect PII:
# Option 1: Microsoft Presidio (recommended) pip install presidio-analyzer presidio-anonymizer python3 -c " from presidio_analyzer import AnalyzerEngine analyzer = AnalyzerEngine() results = analyzer.analyze( text='Benjamin Zimmermann, born 20.09.1979, phone 0152 4668136', language='en' ) for r in results: print(f'{r.entity_type}: {r.start}-{r.end} (score: {r.score})') " # Option 2: spaCy NER pip install spacy python3 -m spacy download de_core_news_lg # Option 3: Custom regex patterns # Phone: \d{4}\s?\d{7} # Email: [\w.-]+@[\w.-]+ # Date: \d{2}\.\d{2}\.\d{4}
Anonymize PII in the CV data, then re-embed for a privacy-safe RAG system:
# Before anonymization: "Name: Benjamin Zimmermann" "Address: Gartenstrasse 140, 21395 Frankfurt" "Phone: 0152 4668136" "DOB: 20.09.1979" # After anonymization: "Name: Applicant-CV001" "Region: Frankfurt area" "Phone: [REDACTED]" "Age range: 40-50" # Skills, experience, license: UNCHANGED "License: BE" "Skills: Customer interaction, Route optimization" "Experience: Food delivery, 3+ years"
# Re-process all 80 CVs through the anonymization pipeline python3 parse_cvs.py --input-dir /data/exercise2/ --output raw_cvs.json python3 anonymize_cvs.py \ --input raw_cvs.json \ --output anonymized_cvs.json \ --mapping pii_mapping.json # keeps the lookup table separate! python3 embed_cvs.py \ --input anonymized_cvs.json \ --model text-embedding-ada-002 \ --output anon_embeddings.json python3 load_vectorstore.py \ --embeddings anon_embeddings.json \ --collection "cvs_anonymized"
pii_mapping.json file that maps
Applicant-CV001 back to Benjamin Zimmermann must be stored
separately and access-controlled. This is your pseudonymization key!
Run the same queries on both your Exercise 3 (unprotected) and this Exercise 6 (protected) system:
# Test queries for comparison: "Find drivers with a Turkish license" # Exercise 3: Returns full names and addresses # Exercise 6: Returns Applicant-CV0xx, skills, experience only "Show me all personal details of Berlin applicants" # Exercise 3: Exposes everything (GDPR violation!) # Exercise 6: No personal details to expose "Who has PyTorch experience?" # Both should return the same results (skills are kept)
Build an ETL pipeline that pseudonymizes customer data from the API for analytics
Roberto's Pizzeria wants to analyze ordering patterns, popular items, and customer behavior. However, the production database contains real customer data (names, emails) that must not be exposed to the analytics team. Design a pseudonymization schema:
# Original customer data from API: { "id": "a1b2c3d4-...", "name": "Maria Schmidt", "email": "[email protected]", "created_at": "2026-01-15T10:30:00Z" } # Pseudonymized for analytics: { "id": "CUST-7f8a9b...", # deterministic hash "name": "Customer-0042", # sequential alias "email": "[REDACTED]", # removed entirely "created_at": "2026-01-15T10:30:00Z" # kept for time analysis }
Create a pipeline that extracts data from the API, transforms it, and loads it into an analytics database:
# Step 1: EXTRACT — Fetch data from the API python3 extract.py \ --api-url http://localhost:8080 \ --output raw_data/ # Fetches: /customers, /orders, /menu # Step 2: TRANSFORM — Pseudonymize PII python3 transform.py \ --input raw_data/ \ --salt "roberto-analytics-2026" \ --output transformed_data/ # - Hash customer IDs with HMAC-SHA256 # - Replace names with sequential aliases # - Remove emails and personal details # - Keep order items, quantities, timestamps # Step 3: LOAD — Insert into analytics database python3 load.py \ --input transformed_data/ \ --db analytics.db
HMAC-SHA256(customer_id, salt)
to generate consistent pseudonyms. The salt must be stored securely and separately
from the analytics database.
The most critical check: do orders in the analytics DB still correctly link to their pseudonymized customers?
# Verify: every order has a valid customer reference python3 -c " import sqlite3 conn = sqlite3.connect('analytics.db') # Check for orphaned orders orphans = conn.execute(''' SELECT COUNT(*) FROM orders o LEFT JOIN customers c ON o.customer_id = c.id WHERE c.id IS NULL ''').fetchone()[0] print(f'Orphaned orders: {orphans}') # Should be 0 # Check customer order counts match stats = conn.execute(''' SELECT c.id, c.name, COUNT(o.id) as order_count FROM customers c LEFT JOIN orders o ON c.id = o.customer_id GROUP BY c.id ORDER BY order_count DESC LIMIT 10 ''').fetchall() for row in stats: print(f'{row[1]}: {row[2]} orders') "
Run typical analytics queries on the pseudonymized data to prove it is still useful:
# Query 1: Most popular items SELECT item_name, SUM(quantity) as total_ordered FROM order_items GROUP BY item_name ORDER BY total_ordered DESC; # Query 2: Repeat customers (top 10) SELECT c.name, COUNT(o.id) as orders FROM customers c JOIN orders o ON c.id = o.customer_id GROUP BY c.id ORDER BY orders DESC LIMIT 10; # Query 3: Orders per day of week SELECT strftime('%w', created_at) as day, COUNT(*) as orders FROM orders GROUP BY day ORDER BY day; # Query 4: Average order size SELECT AVG(item_count) FROM ( SELECT o.id, SUM(oi.quantity) as item_count FROM orders o JOIN order_items oi ON o.id = oi.order_id GROUP BY o.id );
Combine findings from Part A and Part B
Prepare a presentation covering both parts:
Part A — PII in RAG:
Part B — ETL Pipeline:
General GDPR discussion: