Linux Regex Lab: Locating Sensitive Data

Create files containing fake credit card numbers, then use grep and regular expressions to locate them — simulating how DLP tools detect sensitive data at rest.

Lab Objectives

  • Create files containing fake sensitive data such as credit card numbers.
  • Understand common credit card number formats and Luhn validation.
  • Use grep with regular expressions to locate files containing credit card patterns.
  • Use find and grep together to search recursively across directories.
  • Understand how data loss prevention (DLP) tools use pattern matching.

Prerequisites

  • A Linux system (Kali, Ubuntu, or any distribution with a terminal).
  • Basic familiarity with the Linux command line (cd, ls, cat, echo).
  • No special permissions required — all commands run as a normal user.

Part 1: Credit Card Number Formats

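Before searching for credit card numbers, you need to understand their structure. Each card brand uses characteristic prefixes and lengths. The table lists the most common range for each brand; issuers also use other ranges (for example, Mastercard issues numbers in the 2221-2720 range), so treat these patterns as a teaching subset: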

Brand             Prefix    Length     Example (Fake)       Regex Pattern
Visa              4         16 digits  4539 1488 0343 6467  4[0-9]{15}
Mastercard        51-55     16 digits  5425 2334 3010 9903  5[1-5][0-9]{14}
American Express  34 or 37  15 digits  3714 496353 98431    3[47][0-9]{13}
Discover          6011      16 digits  6011 1111 1111 1117  6011[0-9]{12}
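
To sanity-check the table, the following sketch strips the spaces from each example and tests it against its brand pattern. It is only an illustration: check_format is a made-up helper name, and grep's -x flag (match the whole line) is used so that a number of the wrong length does not slip through.

```shell
# check_format is a hypothetical helper for this illustration only.
# Arguments: brand name, extended regex, example number (may contain spaces).
check_format() {
  # Strip spaces so the example becomes one run of digits.
  digits=$(printf '%s' "$3" | tr -d ' ')
  # -q: exit status only; -x: the whole line must match; -E: extended regex.
  if printf '%s\n' "$digits" | grep -qxE "$2"; then
    echo "$1: matches"
  else
    echo "$1: no match"
  fi
}

check_format "Visa"             '4[0-9]{15}'      '4539 1488 0343 6467'
check_format "Mastercard"       '5[1-5][0-9]{14}' '5425 2334 3010 9903'
check_format "American Express" '3[47][0-9]{13}'  '3714 496353 98431'
check_format "Discover"         '6011[0-9]{12}'   '6011 1111 1111 1117'
```

Each of the four calls should report a match. Swapping a pattern and an example (say, the Visa pattern against the Amex number) should report no match.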

Part 2: Create the Lab Environment

Create a directory structure with several files. Some will contain fake credit card numbers and some will not. You will then search for the sensitive files.

Step 1: Create the lab directory structure:
mkdir -p ~/regex-lab/documents ~/regex-lab/logs ~/regex-lab/config
Step 2: Create a file with a fake Visa number:
cat > ~/regex-lab/documents/customer-order.txt << 'EOF'
Customer Order #10234
Name: Jane Smith
Item: Wireless Keyboard
Amount: $49.99
Payment: Visa ending in 6467
Full card on file: 4539148803436467
Expiry: 12/26
Status: Shipped
EOF
Step 3: Create a file with a fake Mastercard number (with spaces):
cat > ~/regex-lab/logs/payment-log.csv << 'EOF'
timestamp,transaction_id,card_number,amount,status
2025-01-15 09:23:11,TXN-0042,5425 2334 3010 9903,129.99,approved
2025-01-15 09:25:44,TXN-0043,N/A,0.00,declined
2025-01-15 09:30:02,TXN-0044,4111111111111111,59.50,approved
EOF
Step 4: Create a file with a fake Amex number:
cat > ~/regex-lab/documents/expense-report.txt << 'EOF'
Expense Report - Q1 2025
Employee: John Doe
Corporate Amex: 371449635398431

Jan: $234.50 - Client dinner
Feb: $89.00 - Office supplies
Mar: $412.75 - Travel

Total: $736.25
EOF
Step 5: Create a clean file (no card numbers):
cat > ~/regex-lab/config/app-settings.conf << 'EOF'
# Application Settings
debug_mode=false
max_connections=100
timeout=30
log_level=INFO
database_port=5432
EOF
Step 6: Create one more file with a Discover card buried in notes:
cat > ~/regex-lab/documents/meeting-notes.txt << 'EOF'
Team Meeting - 15 Jan 2025
Attendees: Alice, Bob, Charlie

Action items:
- Update firewall rules by Friday
- Review access control policies
- Bob to follow up on PCI audit findings

Note from Bob: found an old test card 6011111111111117 in the staging database.
Need to remove before audit.
EOF
Optional: Add your own fake card number to the lab.

Make up a fake number following one of the formats above and save it in a new file under ~/regex-lab/, using the same cat > ... << 'EOF' pattern as the steps above.

Part 3: Search with grep and Regular Expressions

Now use grep with regular expressions to locate files containing credit card number patterns.

Step 1: Search for any 16-digit number sequence:
grep -rn '[0-9]\{16\}' ~/regex-lab/

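-r searches recursively through all subdirectories. -n shows line numbers. \{16\} is the basic-regex form of "exactly 16 repetitions". This catches 16-digit card numbers stored without spaces; the 15-digit Amex number and the Mastercard number written with spaces are not matched.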

Step 2: Search for Visa card patterns specifically:
grep -rn '4[0-9]\{15\}' ~/regex-lab/

This matches any run of 16 digits starting with 4 (the Visa prefix). It should match the Visa numbers in customer-order.txt (4539148803436467) and payment-log.csv (4111111111111111).
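
One caveat: the pattern is not anchored, so it also fires on any longer digit string that happens to contain a 4 followed by 15 more digits. As a hedged sketch (both sample strings are invented for illustration), grep's -w flag requires the match to be bounded by non-word characters and filters those cases out:

```shell
# Invented sample lines: a 20-digit tracking number and a card field.
tracking='Tracking: 99453914880343646700'
card='Full card on file: 4539148803436467'

# grep -c prints the number of matching lines.
# Unanchored pattern: the tracking number is reported (a false positive).
loose=$(printf '%s\n' "$tracking" | grep -cE '4[0-9]{15}')

# With -w the 16-digit run must stand alone as a "word".
# grep exits non-zero when nothing matches, hence the "|| true".
strict=$(printf '%s\n' "$tracking" | grep -cwE '4[0-9]{15}' || true)
strict_card=$(printf '%s\n' "$card" | grep -cwE '4[0-9]{15}')

echo "tracking number, unanchored: $loose"
echo "tracking number, with -w:    $strict"
echo "card field, with -w:         $strict_card"
```

In the lab directory the equivalent search would be grep -rnwE '4[0-9]{15}' ~/regex-lab/.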

Step 3: Search for Mastercard patterns:
grep -rn '5[1-5][0-9]\{14\}' ~/regex-lab/

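Matches 16-digit numbers starting with 51–55. In this lab the only Mastercard number is stored with spaces (in payment-log.csv), so this command prints nothing. Step 6 and Part 4 show how to catch the spaced form.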

Step 4: Search for American Express patterns:
grep -rn '3[47][0-9]\{13\}' ~/regex-lab/

Amex cards start with 34 or 37 and are 15 digits long.

Step 5: Search for Discover patterns:
grep -rn '6011[0-9]\{12\}' ~/regex-lab/

Discover cards start with 6011 and are 16 digits long.

Step 6: Search for card numbers that include spaces (formatted numbers):
grep -rn '[0-9]\{4\} [0-9]\{4\} [0-9]\{4\} [0-9]\{4\}' ~/regex-lab/

This matches the common XXXX XXXX XXXX XXXX display format. It should find the Mastercard in payment-log.csv.

Step 7: Combine all patterns into a single comprehensive search:
grep -rn -E '(4[0-9]{15}|5[1-5][0-9]{14}|3[47][0-9]{13}|6011[0-9]{12})' ~/regex-lab/

-E enables extended regex so you can use | (OR) and {} without escaping. This single command detects Visa, Mastercard, Amex, and Discover numbers that are stored without separators.
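
A common slip is to drop -E but keep the unescaped braces. In basic regex an unescaped { is a literal character, so the search silently finds nothing. A small sketch, using the test Visa number from payment-log.csv as the sample value:

```shell
sample='4111111111111111'

# Basic regex (no -E): braces must be escaped to act as a repetition count.
bre=$(printf '%s\n' "$sample" | grep -c '4[0-9]\{15\}')

# Extended regex (-E): braces work unescaped.
ere=$(printf '%s\n' "$sample" | grep -cE '4[0-9]{15}')

# Basic regex with unescaped braces: grep looks for the literal text "{15}".
# No line matches, so grep prints 0 and exits non-zero (hence "|| true").
slip=$(printf '%s\n' "$sample" | grep -c '4[0-9]{15}' || true)

echo "basic, escaped braces:   $bre"
echo "extended (-E):           $ere"
echo "basic, unescaped braces: $slip"
```

The first two forms are equivalent and each report one matching line; the third reports zero.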

Part 4: Handling Spaces, Dashes, and Edge Cases

Real-world data often has card numbers formatted with spaces or dashes. Your regex needs to handle these variants.

Step 1: Create a file with dashes in the card number:
cat > ~/regex-lab/documents/invoice.txt << 'EOF'
Invoice #INV-2025-0087
Bill to: Acme Corp
Card: 4539-1488-0343-6467
Amount: $1,250.00
EOF
Step 2: Search for card numbers with optional spaces or dashes:
grep -rn -E '4[0-9]{3}[- ]?[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}' ~/regex-lab/

[- ]? matches an optional space or dash between groups. This catches 4539148803436467, 4539 1488 0343 6467, and 4539-1488-0343-6467.
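
To see the optional-separator pattern at work without touching the lab files, this sketch feeds it four variants of the same fake number. The first three should match; the dotted form (invented here as a negative case) should not, because . is not in the [- ] class.

```shell
pattern='4[0-9]{3}[- ]?[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}'

# printf reuses the format for each argument, producing one variant per line.
variants=$(printf '%s\n' \
  '4539148803436467' \
  '4539 1488 0343 6467' \
  '4539-1488-0343-6467' \
  '4539.1488.0343.6467')

# Print the lines that match, then count them.
printf '%s\n' "$variants" | grep -E "$pattern"
matched=$(printf '%s\n' "$variants" | grep -cE "$pattern")
echo "matched $matched of 4 variants"
```

Because each [- ]? is independent, mixed forms such as 4539-1488 03436467 also match. That is usually acceptable for detection, where missing a number is worse than flagging an oddly formatted one.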

Step 3: Build the ultimate all-brand search with optional separators:
grep -rn -E '(4[0-9]{3}[- ]?[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}|5[1-5][0-9]{2}[- ]?[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}|3[47][0-9]{2}[- ]?[0-9]{6}[- ]?[0-9]{5}|6011[- ]?[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4})' ~/regex-lab/

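This comprehensive regex detects all four card brands with or without spaces and dashes. Production DLP tools start from patterns like this and then add further checks, such as Luhn validation and context keywords, to cut down false positives.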
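
Since the same long expression is reused in the following steps, one option is to assemble it once from per-brand pieces and keep it in a shell variable. This is only a sketch; the variable names (SEP, G4, VISA, MC, AMEX, DISC, CC_RE) are made up for illustration.

```shell
# All variable names here are hypothetical, chosen for this illustration.
SEP='[- ]?'        # optional dash or space between digit groups
G4='[0-9]{4}'      # a group of four digits

VISA="4[0-9]{3}${SEP}${G4}${SEP}${G4}${SEP}${G4}"
MC="5[1-5][0-9]{2}${SEP}${G4}${SEP}${G4}${SEP}${G4}"
AMEX="3[47][0-9]{2}${SEP}[0-9]{6}${SEP}[0-9]{5}"      # 4-6-5 grouping
DISC="6011${SEP}${G4}${SEP}${G4}${SEP}${G4}"

# Same expression as the command above, built from the pieces.
CC_RE="(${VISA}|${MC}|${AMEX}|${DISC})"

# Quick check: four lines with fake numbers from the lab files, one clean line.
printf '%s\n' \
  'Card: 4539-1488-0343-6467' \
  'card_number: 5425 2334 3010 9903' \
  'Corporate Amex: 371449635398431' \
  'old test card 6011111111111117' \
  'database_port=5432' | grep -nE "$CC_RE"
```

With the variable defined, the lab searches shorten to grep -rnE "$CC_RE" ~/regex-lab/ (or -rl, -rc). The double quotes matter: they stop the shell from splitting the pattern at the space inside [- ].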

Step 4: Show only the filenames (not the matching lines):
grep -rl -E '(4[0-9]{3}[- ]?[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}|5[1-5][0-9]{2}[- ]?[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}|3[47][0-9]{2}[- ]?[0-9]{6}[- ]?[0-9]{5}|6011[- ]?[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4})' ~/regex-lab/

-l (lowercase L) lists only the filenames, not the matching content. Useful for identifying which files need remediation.

Step 5: Count the matching lines in each file:
grep -rc -E '(4[0-9]{3}[- ]?[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}|5[1-5][0-9]{2}[- ]?[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}|3[47][0-9]{2}[- ]?[0-9]{6}[- ]?[0-9]{5}|6011[- ]?[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4})' ~/regex-lab/

-c shows the number of matching lines per file. Files showing :0 have no card numbers.
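
Note that grep -c counts lines containing at least one match, so a line with two card numbers counts once. To count individual matches, one approach is to print each match on its own line with -o and count those lines. A minimal sketch in a throwaway directory (the file name and contents are invented for illustration):

```shell
# Work in a throwaway directory so the lab files are left untouched.
tmp=$(mktemp -d)

# Invented sample: line 1 holds two card numbers, line 2 holds one.
printf '%s\n' \
  '4111111111111111,4539148803436467' \
  '6011111111111117' > "$tmp/cards.txt"

pattern='(4[0-9]{15}|6011[0-9]{12})'

# -c counts lines that contain at least one match.
lines=$(grep -cE "$pattern" "$tmp/cards.txt")

# -o prints each match on its own line, so wc -l counts individual matches.
matches=$(grep -oE "$pattern" "$tmp/cards.txt" | wc -l)

echo "matching lines: $lines"
echo "individual matches: $matches"

rm -rf "$tmp"
```

Here the file has two matching lines but three card numbers.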

Part 5: Combining find and grep

Use find to filter by file type before running the regex search. This is how you would scan specific file types across an entire system.

Step 1: Search only .txt files for card numbers:
find ~/regex-lab -name '*.txt' -exec grep -lE '(4[0-9]{3}[- ]?[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}|5[1-5][0-9]{2}[- ]?[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4})' {} +

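find locates all .txt files, then -exec grep runs the regex against them. The + passes many files to each grep invocation instead of starting one grep per file. Note that this shortened pattern covers only Visa and Mastercard, so it lists customer-order.txt and invoice.txt but not the Amex and Discover files.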

Step 2: Search only .csv files:
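find ~/regex-lab -name '*.csv' -exec grep -nE '[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}' {} +

This brand-agnostic pattern matches any four groups of four digits with optional separators, so it finds both the spaced Mastercard and the unspaced Visa in payment-log.csv. Because find passes grep only one file here, grep prints line numbers without a filename prefix.
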
Step 3: Exclude config files from the search:
find ~/regex-lab -type f ! -name '*.conf' -exec grep -lE '(4[0-9]{15}|5[1-5][0-9]{14}|3[47][0-9]{13}|6011[0-9]{12})' {} +

! -name '*.conf' excludes configuration files from the search.
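
The same idea extends to several file types at once. As a sketch, the following builds a throwaway directory (file names and contents invented for illustration) and uses find's -o (OR) operator inside escaped parentheses to search .txt and .csv files in one pass:

```shell
tmp=$(mktemp -d)
printf 'Full card on file: 4539148803436467\n' > "$tmp/order.txt"
printf 'TXN-0044,4111111111111111,59.50\n'     > "$tmp/log.csv"
printf 'test_card=6011111111111117\n'          > "$tmp/app.conf"
printf 'No sensitive data here\n'              > "$tmp/notes.txt"

pattern='(4[0-9]{15}|5[1-5][0-9]{14}|3[47][0-9]{13}|6011[0-9]{12})'

# \( ... -o ... \) groups the two -name tests. Without the parentheses the
# implicit AND binds tighter than -o, so -exec would apply only to '*.csv'.
hits=$(find "$tmp" -type f \( -name '*.txt' -o -name '*.csv' \) \
         -exec grep -lE "$pattern" {} + | sort)

echo "$hits"
rm -rf "$tmp"
```

The result lists log.csv and order.txt. app.conf is skipped even though it contains a card number: filtering by extension speeds up a scan but can hide data stored in unexpected places.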

Deliverables

  • Screenshot of the directory structure created with find ~/regex-lab -type f.
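  • Screenshot of each brand-specific regex search (Visa, Mastercard, Amex, Discover) showing its output. The unformatted Mastercard search returns no matches in this lab, because the only Mastercard number contains spaces.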
  • Screenshot of the comprehensive search (all brands, with separators) showing all detected files.
  • Screenshot of the grep -rc output showing match counts per file.
  • A short written answer: Why can't regex alone reliably detect all credit card numbers? What additional validation would a real DLP tool use?

grep Flags Used

  • -r — Search recursively through directories.
  • -n — Show line numbers.
  • -l — Show only filenames (not content).
  • -c — Show the number of matching lines per file.
  • -E — Extended regex (no escaping {} or |).
  • --color — Highlight matches (often already enabled through a shell alias such as grep --color=auto).

Regex Quick Reference

  • [0-9] — Any single digit.
  • {n} — Exactly n repetitions.
  • {n,m} — Between n and m repetitions.
  • [- ]? — Optional dash or space.
  • | — OR (with -E; GNU basic regex needs \|).
  • () — Grouping (with -E; basic regex needs \( \)).

DLP Connection

Data Loss Prevention (DLP) tools use techniques like these to detect sensitive data:

  • Pattern matching — regex for known formats (what you did in this lab).
  • Luhn validation — a checksum over the digits that rejects most random or mistyped digit strings.
  • Context analysis — looking for keywords like "card", "expiry", "CVV" near numbers.
  • Data fingerprinting — exact or partial match against known datasets.
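
The Luhn check in the list above can be sketched in plain shell. This is an illustrative helper (the name luhn_ok is made up), not the implementation of any particular DLP product. Working from the rightmost digit, it doubles every second digit, subtracts 9 from any result above 9, and accepts the number if the total is divisible by 10.

```shell
# luhn_ok is a hypothetical helper name. It returns exit status 0 when the
# argument passes the Luhn check and 1 otherwise.
luhn_ok() {
  local num sum double d
  num=$(printf '%s' "$1" | tr -d ' -')      # drop spaces and dashes
  case $num in
    ''|*[!0-9]*) return 1 ;;                # reject empty or non-digit input
  esac
  sum=0
  double=0                                  # the rightmost digit is not doubled
  while [ -n "$num" ]; do
    d=${num#"${num%?}"}                     # take the last digit...
    num=${num%?}                            # ...and remove it from the string
    if [ "$double" -eq 1 ]; then
      d=$((d * 2))
      if [ "$d" -gt 9 ]; then d=$((d - 9)); fi
    fi
    sum=$((sum + d))
    double=$((1 - double))                  # alternate between digits
  done
  [ $((sum % 10)) -eq 0 ]
}

# Two fake numbers from the lab, plus the first one with its last digit changed.
for n in 4539148803436467 '5425 2334 3010 9903' 4539148803436468; do
  if luhn_ok "$n"; then echo "$n: passes Luhn"; else echo "$n: fails Luhn"; fi
done
```

The first two numbers pass and the altered one fails. A scanner could pipe the output of grep -oE through a check like this to discard 16-digit strings that merely look like card numbers.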

Lab Files Summary

  • customer-order.txt — Visa (no spaces)
  • payment-log.csv — Mastercard (spaces) + Visa
  • expense-report.txt — Amex (no spaces)
  • meeting-notes.txt — Discover (no spaces)
  • invoice.txt — Visa (dashes)
  • app-settings.conf — No card numbers