Linux Regex Lab: Locating Sensitive Data
Create files containing fake credit card numbers, then use grep and regular expressions to locate them — simulating how DLP tools detect sensitive data at rest.
Lab Objectives
- Create files containing fake sensitive data such as credit card numbers.
- Understand common credit card number formats and Luhn validation.
- Use grep with regular expressions to locate files containing credit card patterns.
- Use find and grep together to search recursively across directories.
- Understand how data loss prevention (DLP) tools use pattern matching.
Prerequisites
- A Linux system (Kali, Ubuntu, or any distribution with a terminal).
- Basic familiarity with the Linux command line (cd, ls, cat, echo).
- No special permissions required — all commands run as a normal user.
Part 1: Credit Card Number Formats
Before searching for credit card numbers, you need to understand their structure. Each card brand has a distinct prefix and length:
| Brand | Prefix | Length | Example (Fake) | Regex Pattern |
|---|---|---|---|---|
| Visa | 4 | 16 digits | 4539 1488 0343 6467 | 4[0-9]{15} |
| Mastercard | 51-55 | 16 digits | 5425 2334 3010 9903 | 5[1-5][0-9]{14} |
| American Express | 34 or 37 | 15 digits | 3714 496353 98431 | 3[47][0-9]{13} |
| Discover | 6011 | 16 digits | 6011 1111 1111 1117 | 6011[0-9]{12} |
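As a quick sanity check, the patterns in the table can be tried against the example numbers before any lab files exist. The sketch below assumes the spaces have been removed from each example. The check_brand helper is a hypothetical name introduced for illustration, and the ^ and $ anchors are an addition so that each pattern must match the whole input.

```shell
# Hypothetical helper: report which brand pattern (from the table) a number matches.
check_brand() {
    num=$1
    if printf '%s\n' "$num" | grep -Eq '^4[0-9]{15}$'; then
        echo "Visa"
    elif printf '%s\n' "$num" | grep -Eq '^5[1-5][0-9]{14}$'; then
        echo "Mastercard"
    elif printf '%s\n' "$num" | grep -Eq '^3[47][0-9]{13}$'; then
        echo "American Express"
    elif printf '%s\n' "$num" | grep -Eq '^6011[0-9]{12}$'; then
        echo "Discover"
    else
        echo "unknown"
    fi
}

# Example numbers from the table, with spaces removed.
for n in 4539148803436467 5425233430109903 371449635398431 6011111111111117; do
    printf '%s -> %s\n' "$n" "$(check_brand "$n")"
done
```

Anything that fits none of the four patterns is reported as unknown, which is also what happens to a number that still contains spaces or dashes.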
Part 2: Create the Lab Environment
Create a directory structure with several files. Some will contain fake credit card numbers and some will not. You will then search for the sensitive files.
Step 1: Create the lab directory structure:
mkdir -p ~/regex-lab/documents ~/regex-lab/logs ~/regex-lab/config
Step 2: Create a file with a fake Visa number:
cat > ~/regex-lab/documents/customer-order.txt << 'EOF'
Customer Order #10234
Name: Jane Smith
Item: Wireless Keyboard
Amount: $49.99
Payment: Visa ending in 6467
Full card on file: 4539148803436467
Expiry: 12/26
Status: Shipped
EOF
Step 3: Create a file with a fake Mastercard number (with spaces):
cat > ~/regex-lab/logs/payment-log.csv << 'EOF'
timestamp,transaction_id,card_number,amount,status
2025-01-15 09:23:11,TXN-0042,5425 2334 3010 9903,129.99,approved
2025-01-15 09:25:44,TXN-0043,N/A,0.00,declined
2025-01-15 09:30:02,TXN-0044,4111111111111111,59.50,approved
EOF
Step 4: Create a file with a fake Amex number:
cat > ~/regex-lab/documents/expense-report.txt << 'EOF'
Expense Report - Q1 2025
Employee: John Doe
Corporate Amex: 371449635398431
Jan: $234.50 - Client dinner
Feb: $89.00 - Office supplies
Mar: $412.75 - Travel
Total: $736.25
EOF
Step 5: Create a clean file (no card numbers):
cat > ~/regex-lab/config/app-settings.conf << 'EOF'
# Application Settings
debug_mode=false
max_connections=100
timeout=30
log_level=INFO
database_port=5432
EOF
Step 6: Create one more file with a Discover card buried in notes:
cat > ~/regex-lab/documents/meeting-notes.txt << 'EOF'
Team Meeting - 15 Jan 2025
Attendees: Alice, Bob, Charlie
Action items:
- Update firewall rules by Friday
- Review access control policies
- Bob to follow up on PCI audit findings
Note from Bob: found an old test card 6011111111111117 in the staging database.
Need to remove before audit.
EOF
Optional: Add a file of your own. Make up a fake card number that follows one of the formats in Part 1 and save it in a new file under ~/regex-lab.
Run find ~/regex-lab -type f to confirm all files were created. You should see 5 files (or 6 if you added your own).
Part 3: Search with grep and Regular Expressions
Now use grep with regular expressions to locate files containing credit card number patterns.
Step 1: Search for any 16-digit number sequence:
grep -rn '[0-9]\{16\}' ~/regex-lab/
-r searches recursively through all subdirectories. -n shows line numbers. This catches 16-digit card numbers stored without spaces or dashes. The 15-digit Amex number will not match.
Step 2: Search for Visa card patterns specifically:
grep -rn '4[0-9]\{15\}' ~/regex-lab/
This matches any 16-digit number starting with 4 (the Visa prefix). Check your output: it should match the Visa numbers in customer-order.txt and payment-log.csv.
Step 3: Search for Mastercard patterns:
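grep -rn '5[1-5][0-9]\{14\}' ~/regex-lab/
This matches 16-digit numbers starting with 51–55. In this lab the only Mastercard number is stored with spaces, so this search returns no results. The spaced pattern in Step 6 finds it.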
Step 4: Search for American Express patterns:
grep -rn '3[47][0-9]\{13\}' ~/regex-lab/
Amex cards start with 34 or 37 and are 15 digits long.
Step 5: Search for Discover patterns:
grep -rn '6011[0-9]\{12\}' ~/regex-lab/
Discover cards start with 6011 and are 16 digits long.
Step 6: Search for card numbers that include spaces (formatted numbers):
grep -rn '[0-9]\{4\} [0-9]\{4\} [0-9]\{4\} [0-9]\{4\}' ~/regex-lab/
This matches the common XXXX XXXX XXXX XXXX display format. It should find the Mastercard in payment-log.csv.
Step 7: Combine all patterns into a single comprehensive search:
grep -rn -E '(4[0-9]{15}|5[1-5][0-9]{14}|3[47][0-9]{13}|6011[0-9]{12})' ~/regex-lab/
-E enables extended regex so you can use | (OR) and {} without escaping. This single command detects Visa, Mastercard, Amex, and Discover patterns.
Record the following in your notes:
- Which files contained credit card numbers.
- Which brand was found in each file.
- Did the comprehensive search (Step 7) find all the cards? Were there any it missed (e.g. spaced numbers)?
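To see exactly which text a pattern matches, grep's -o flag (print only the matched part of each line) can help. The sketch below is illustrative rather than part of the lab: it copies the three data rows of payment-log.csv into a temporary file (the tmp variable is an assumption) so that it runs on its own.

```shell
# Work on a throwaway copy of the payment-log.csv rows so the sketch is self-contained.
tmp=$(mktemp)
cat > "$tmp" << 'EOF'
2025-01-15 09:23:11,TXN-0042,5425 2334 3010 9903,129.99,approved
2025-01-15 09:25:44,TXN-0043,N/A,0.00,declined
2025-01-15 09:30:02,TXN-0044,4111111111111111,59.50,approved
EOF

# -o prints only the matched text rather than the whole line.
unspaced=$(grep -oE '(4[0-9]{15}|5[1-5][0-9]{14}|3[47][0-9]{13}|6011[0-9]{12})' "$tmp")
echo "Step 7 pattern matched: $unspaced"

# The same file searched with the spaced pattern from Step 6.
spaced=$(grep -oE '[0-9]{4} [0-9]{4} [0-9]{4} [0-9]{4}' "$tmp")
echo "Spaced pattern matched: $spaced"

rm -f "$tmp"
```

Comparing the two results shows which numbers each pattern covers, which is the comparison the last question asks for.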
Part 4: Handling Spaces, Dashes, and Edge Cases
Real-world data often has card numbers formatted with spaces or dashes. Your regex needs to handle these variants.
Step 1: Create a file with dashes in the card number:
cat > ~/regex-lab/documents/invoice.txt << 'EOF'
Invoice #INV-2025-0087
Bill to: Acme Corp
Card: 4539-1488-0343-6467
Amount: $1,250.00
EOF
Step 2: Search for card numbers with optional spaces or dashes:
grep -rn -E '4[0-9]{3}[- ]?[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}' ~/regex-lab/
[- ]? matches an optional space or dash between groups. This catches 4539148803436467, 4539 1488 0343 6467, and 4539-1488-0343-6467.
Step 3: Build the ultimate all-brand search with optional separators:
grep -rn -E '(4[0-9]{3}[- ]?[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}|5[1-5][0-9]{2}[- ]?[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}|3[47][0-9]{2}[- ]?[0-9]{6}[- ]?[0-9]{5}|6011[- ]?[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4})' ~/regex-lab/
This comprehensive regex detects all four card brands with or without spaces and dashes. Production DLP tools use patterns of a similar shape, usually combined with further validation.
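One edge case these patterns do not address is a card-like run of digits inside a longer number, because grep matches anywhere in a line. A possible mitigation, sketched below, is the -w flag, which requires the match to start and end at a word boundary. The tracking number is made up for illustration, and the simple unseparated Visa pattern is used to keep the example short.

```shell
# A made-up 20-digit tracking number that happens to contain a Visa-like run of digits.
line='tracking=99453914880343646700'

# Without -w the pattern matches inside the longer number (a false positive).
# "|| true" keeps the script going when grep finds nothing and exits non-zero.
loose=$(printf '%s\n' "$line" | grep -cE '4[0-9]{15}' || true)

# With -w the match must start and end at a word boundary, so it is rejected.
strict=$(printf '%s\n' "$line" | grep -cwE '4[0-9]{15}' || true)

echo "without -w: $loose matching line(s)"
echo "with -w:    $strict matching line(s)"
```

The trade-off is that -w also rejects real card numbers glued to letters, digits, or underscores, so it reduces false positives at the cost of some false negatives.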
Step 4: Show only the filenames (not the matching lines):
grep -rl -E '(4[0-9]{3}[- ]?[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}|5[1-5][0-9]{2}[- ]?[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}|3[47][0-9]{2}[- ]?[0-9]{6}[- ]?[0-9]{5}|6011[- ]?[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4})' ~/regex-lab/
-l (lowercase L) lists only the filenames, not the matching content. Useful for identifying which files need remediation.
Step 5: Count how many matches are in each file:
grep -rc -E '(4[0-9]{3}[- ]?[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}|5[1-5][0-9]{2}[- ]?[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}|3[47][0-9]{2}[- ]?[0-9]{6}[- ]?[0-9]{5}|6011[- ]?[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4})' ~/regex-lab/
-c shows the number of matching lines per file (a line containing two card numbers is counted once). Files showing :0 have no card numbers.
Record the following in your notes:
- How many total files contain credit card patterns.
- Which file has the most matches.
- Did the enhanced regex catch the dashed and spaced formats that the basic regex missed?
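When answering the question about the most matches, keep in mind that grep -c counts matching lines, not individual matches. One possible way to count every match is to combine -o with wc -l, as in the sketch below. It uses a temporary file with made-up contents (the two numbers are the lab's fake Visa examples), and the variable names are illustrative.

```shell
tmp=$(mktemp)
# One line holding two fake card numbers, plus one line with none.
printf '%s\n' 'old: 4539148803436467 new: 4111111111111111' 'no card here' > "$tmp"

# -c counts lines that contain at least one match.
lines=$(grep -cE '4[0-9]{15}' "$tmp")

# -o prints each match on its own line, so wc -l counts individual matches.
matches=$(grep -oE '4[0-9]{15}' "$tmp" | wc -l)

echo "matching lines: $lines"
echo "individual matches: $((matches))"
rm -f "$tmp"
```

In the lab files every card number sits on its own line, so both counts agree there. The difference matters for denser data such as database dumps.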
Part 5: Combining find and grep
Use find to filter by file type before running the regex search. This is how you would scan specific file types across an entire system.
Step 1: Search only .txt files for card numbers:
find ~/regex-lab -name '*.txt' -exec grep -lE '(4[0-9]{3}[- ]?[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}|5[1-5][0-9]{2}[- ]?[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4})' {} +
find locates all .txt files, then -exec grep runs the regex against each one. The + batches files for efficiency.
Step 2: Search only .csv files:
find ~/regex-lab -name '*.csv' -exec grep -nE '[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}' {} +
Step 3: Exclude config files from the search:
find ~/regex-lab -type f ! -name '*.conf' -exec grep -lE '(4[0-9]{15}|5[1-5][0-9]{14}|3[47][0-9]{13}|6011[0-9]{12})' {} +
! -name '*.conf' excludes configuration files from the search.
Record the following in your notes:
- The difference between using grep -r and find + grep.
- Why you might want to filter by file type before searching.
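For comparison, the two approaches can be run side by side. The sketch below assumes GNU grep, whose --include option restricts a recursive search to file names matching a glob. It builds a small throwaway directory with made-up contents rather than using ~/regex-lab, so it can be run independently of the lab files.

```shell
# Build a small throwaway tree so the comparison is self-contained.
dir=$(mktemp -d)
mkdir -p "$dir/documents" "$dir/config"
echo 'Full card on file: 4539148803436467' > "$dir/documents/order.txt"
echo 'test_card=4111111111111111' > "$dir/config/app.conf"

# Approach 1: find selects the files, grep searches them.
via_find=$(find "$dir" -type f -name '*.txt' -exec grep -lE '4[0-9]{15}' {} +)

# Approach 2: grep walks the tree itself; --include (GNU grep) filters by file name.
via_grep=$(grep -rlE --include='*.txt' '4[0-9]{15}' "$dir")

echo "find + grep: $via_find"
echo "grep -r:     $via_grep"
rm -rf "$dir"
```

Both commands skip app.conf here. find becomes the more flexible choice once the filter involves more than the file name, for example size, modification time, or ownership.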
Deliverables
- Screenshot of the directory structure created with find ~/regex-lab -type f.
- Screenshot of each brand-specific regex search (Visa, Mastercard, Amex, Discover) showing matches.
- Screenshot of the comprehensive search (all brands, with separators) showing all detected files.
- Screenshot of the grep -rc output showing match counts per file.
- A short written answer: Why can't regex alone reliably detect all credit card numbers? What additional validation would a real DLP tool use?
grep Flags Used
- -r: Search recursively through directories.
- -n: Show line numbers.
- -l: Show only filenames (not content).
- -c: Show the count of matching lines per file.
- -E: Extended regex (no escaping of {} or |).
- --color: Highlight matches (often enabled by default).
Regex Quick Reference
- [0-9]: Any single digit.
- {n}: Exactly n repetitions.
- {n,m}: Between n and m repetitions.
- [- ]?: Optional dash or space.
- |: OR (unescaped form requires -E).
- (): Grouping (unescaped form requires -E).
DLP Connection
Data Loss Prevention (DLP) tools use techniques like these to detect sensitive data:
- Pattern matching — regex for known formats (what you did in this lab).
- Luhn validation — mathematical check on digit sequence.
- Context analysis — looking for keywords like "card", "expiry", "CVV" near numbers.
- Data fingerprinting — exact or partial match against known datasets.
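The Luhn check mentioned above can be sketched in plain shell. Starting from the rightmost digit, every second digit is doubled (subtracting 9 when the result exceeds 9), and the number passes if the total is divisible by 10. The luhn_valid function is a hypothetical helper written for this lab, not part of any DLP product, and it expects digits only, so separators are stripped first with tr.

```shell
# Hypothetical helper: succeed (exit 0) if the digit string passes the Luhn check.
luhn_valid() {
    num=$1
    # Reject empty input and anything that is not purely digits.
    case $num in
        ''|*[!0-9]*) return 1 ;;
    esac
    sum=0
    double=0                      # the rightmost digit is not doubled
    while [ -n "$num" ]; do
        rest=${num%?}             # everything except the last character
        digit=${num#"$rest"}      # the last character
        num=$rest
        if [ "$double" -eq 1 ]; then
            digit=$((digit * 2))
            if [ "$digit" -gt 9 ]; then
                digit=$((digit - 9))
            fi
        fi
        sum=$((sum + digit))
        double=$((1 - double))    # alternate between doubling and not doubling
    done
    [ $((sum % 10)) -eq 0 ]
}

# The lab's dashed Visa number, and the same number with its last digit changed.
for raw in 4539-1488-0343-6467 4539-1488-0343-6468; do
    digits=$(printf '%s' "$raw" | tr -d ' -')
    if luhn_valid "$digits"; then
        echo "$raw passes the Luhn check"
    else
        echo "$raw fails the Luhn check"
    fi
done
```

Running the regex first and the Luhn check second filters out many random 16-digit strings that merely look like card numbers, since only about one in ten such strings has a valid check digit.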
Lab Files Summary
- customer-order.txt: Visa (no spaces)
- payment-log.csv: Mastercard (spaces) and Visa (no spaces)
- expense-report.txt: Amex (no spaces)
- meeting-notes.txt: Discover (no spaces)
- invoice.txt: Visa (dashes)
- app-settings.conf: No card numbers