Business Account Data Verification
Date: 2026-02-09
Category: Parsing / Data Processing
Parts Completed: 5/5
Language: Python
Problem Summary
Build a KYC verification system that processes business accounts in CSV format. Validate accounts through progressively complex rules: field completeness, descriptor length constraints, generic name blocklist, name consistency checks, and specific error code classification.
Solutions by Part
Part 1: Complete Field Validation
Approach: Parse CSV by splitting on newlines, skip header, split each row by comma, use all() to check every field is non-empty after stripping whitespace.
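A minimal sketch of this completeness check, assuming a header row and no quoted commas inside fields. The helper names `is_complete` and `complete_flags` and the sample data are hypothetical, chosen only for illustration.

```python
def is_complete(row: str) -> bool:
    """Return True when every comma-separated field is non-empty after stripping."""
    return all(field.strip() for field in row.split(","))


def complete_flags(csv_data: str) -> list[bool]:
    """Skip the header row and report completeness for each non-blank data row."""
    rows = csv_data.split("\n")[1:]
    return [is_complete(row) for row in rows if row.strip()]


# Hypothetical sample; the column layout is an assumption for illustration.
sample = (
    "id,name,url,short,full\n"
    "1,Acme LLC,acme.com,ACME,ACME WIDGETS\n"
    "2,,x.com,X,X STORE"
)
print(complete_flags(sample))  # [True, False]
```

An empty string is falsy, so `all()` stops at the first blank field without needing an explicit comparison.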
Part 2: Descriptor Length Validation
Approach: After field completeness check, verify col5 (full descriptor) length is between 5 and 31 inclusive using chained comparison 5 <= len(descriptor) <= 31.
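The boundary behavior can be sketched as follows. The constant names and `has_valid_length` are hypothetical. Stripping before measuring is an assumption that matches how the combined solution reads the descriptor.

```python
MIN_LEN, MAX_LEN = 5, 31  # inclusive bounds from the problem statement


def has_valid_length(descriptor: str) -> bool:
    """Check the stripped descriptor length against the inclusive bounds."""
    return MIN_LEN <= len(descriptor.strip()) <= MAX_LEN


print(has_valid_length("ABCDE"))    # True  (exactly 5)
print(has_valid_length("ABCD"))     # False (4 is too short)
print(has_valid_length("A" * 31))   # True  (exactly 31)
print(has_valid_length("A" * 32))   # False (32 is too long)
```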
Part 3: Generic Name Blocklist
Approach: Define blocked terms in a set. Use any(term in descriptor.upper() for term in BLOCKED_TERMS) for case-insensitive substring matching against col5.
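A short sketch of the substring check, using the blocked terms from the combined solution. The function name `contains_blocked_term` is hypothetical.

```python
BLOCKED_TERMS = {"ONLINE STORE", "ECOMMERCE", "RETAIL", "SHOP", "GENERAL MERCHANDISE"}


def contains_blocked_term(descriptor: str) -> bool:
    """Case-insensitive substring match against every blocked term."""
    upper = descriptor.upper()  # normalize once rather than once per term
    return any(term in upper for term in BLOCKED_TERMS)


print(contains_blocked_term("Joe's Coffee Shop"))  # True
print(contains_blocked_term("Acme Widgets"))       # False
print(contains_blocked_term("Oak Workshop"))       # True, because "SHOP" is a substring of "WORKSHOP"
```

The last call shows the trade-off of substring matching: it is simple, but it also flags terms embedded inside longer words.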
Part 4: Name Consistency Check
Approach: Extract words from col2, col4, and col5, filter out “LLC”/“Inc” (case-insensitive), combine descriptor words into one set, count matches from business name words. Verify >= 50% match rate.
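To make the 50% threshold concrete, here is a small sketch that returns the match rate itself rather than a pass/fail flag. The names `significant_words` and `match_rate` are hypothetical, and whitespace-only tokenization is an assumption (punctuation is not handled).

```python
IGNORED_WORDS = {"llc", "inc"}


def significant_words(text: str) -> list[str]:
    """Lowercase the words and drop ignored corporate suffixes."""
    return [w.lower() for w in text.split() if w.lower() not in IGNORED_WORDS]


def match_rate(business_name: str, short_descriptor: str, long_descriptor: str) -> float:
    """Fraction of business-name words found in either descriptor."""
    name_words = significant_words(business_name)
    if not name_words:
        return 0.0  # nothing to match, and avoids dividing by zero
    descriptor_words = set(significant_words(short_descriptor)) | set(
        significant_words(long_descriptor)
    )
    matched = sum(1 for w in name_words if w in descriptor_words)
    return matched / len(name_words)


# "Acme Widgets LLC" keeps ["acme", "widgets"]; only "acme" is in the descriptors.
print(match_rate("Acme Widgets LLC", "ACME", "ACME ONLINE"))  # 0.5
```

A rate of exactly 0.5 passes, since the requirement is at least 50%.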
Part 5: Error Code Classification
Approach: Replace single if/else with if/elif chain checking validations in priority order. Each branch returns a specific error code. Only reaches VERIFIED if all checks pass.
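The priority ordering can be sketched as a small function that takes the four check results and returns the first applicable code. The function name `classify` and its boolean parameters are hypothetical. The error codes are the ones printed by the combined solution.

```python
def classify(complete: bool, length_ok: bool, has_blocked_term: bool, name_ok: bool) -> str:
    """Return the highest-priority error code, or VERIFIED when all checks pass."""
    if not complete:
        return "ERROR_MISSING_FIELDS"
    elif not length_ok:
        return "ERROR_INVALID_LENGTH"
    elif has_blocked_term:
        return "ERROR_GENERIC_NAME"
    elif not name_ok:
        return "ERROR_NAME_MISMATCH"
    else:
        return "VERIFIED"


# Several checks fail here, but only the first one in priority order is reported.
print(classify(complete=False, length_ok=False, has_blocked_term=True, name_ok=False))
# ERROR_MISSING_FIELDS
```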
Combined Final Solution
def validate_business_name(business_name, long_descriptor, short_descriptor):
    # Corporate suffixes that should not count toward the match rate.
    IGNORED_WORDS = ["llc", "inc"]

    def get_words(col, IGNORED_WORDS):
        # split() with no argument collapses repeated whitespace,
        # so consecutive spaces do not produce empty "words".
        words = col.split()
        return [word.lower() for word in words if word.lower() not in IGNORED_WORDS]

    business_words = get_words(business_name, IGNORED_WORDS)
    if not business_words:
        # Nothing left to match; this also guards the division below.
        return False
    descriptor_words = set(get_words(long_descriptor, IGNORED_WORDS) + get_words(short_descriptor, IGNORED_WORDS))
    matches = [w for w in business_words if w in descriptor_words]
    return len(matches) / len(business_words) >= 0.5


def validate_businesses(csv_data: str):
    csv_data_arr = csv_data.split('\n')
    BLOCKED_TERMS = set(["ONLINE STORE", "ECOMMERCE", "RETAIL", "SHOP", "GENERAL MERCHANDISE"])
    for i, row in enumerate(csv_data_arr):
        if i == 0 or not row.strip():
            # Skip the header row and blank lines (e.g. after a trailing newline).
            continue
        row_arr = row.split(',')
        business_name = row_arr[1].strip() if row_arr[1] else ""
        complete = all(field.strip() for field in row_arr)
        long_descriptor = row_arr[4].strip()   # col5: full descriptor
        short_descriptor = row_arr[3].strip()  # col4: short descriptor
        business_name_valid = validate_business_name(business_name, long_descriptor, short_descriptor)
        meets_length_requirement = (5 <= len(long_descriptor) <= 31)
        descriptor_contains_blocked_term = any(term in long_descriptor.upper() for term in BLOCKED_TERMS)
        # Checks are reported in priority order; only the first failure is printed.
        if not complete:
            print(f"ERROR_MISSING_FIELDS: {business_name}")
        elif not meets_length_requirement:
            print(f"ERROR_INVALID_LENGTH: {business_name}")
        elif descriptor_contains_blocked_term:
            print(f"ERROR_GENERIC_NAME: {business_name}")
        elif not business_name_valid:
            print(f"ERROR_NAME_MISMATCH: {business_name}")
        else:
            print(f"VERIFIED: {business_name}")
Edge Cases
- Empty business name (col2) → output shows empty after the colon
- Fields with only whitespace → treated as empty by field.strip()
- Descriptor exactly at boundaries → 5 and 31 are valid (inclusive)
- “SHOP” inside “WORKSHOP” → current substring matching would flag it (clarify with interviewer)
- Business name is just “LLC Inc” → zero words after filtering, returns False
- Division by zero → guarded by if not business_words: return False
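If the interviewer wants “WORKSHOP” to pass, one possible alternative is whole-word matching with a regular expression. This is a sketch of that option, not the behavior of the submitted solution. The names `BLOCKED_PATTERN` and `contains_blocked_word` are hypothetical.

```python
import re

BLOCKED_TERMS = {"ONLINE STORE", "ECOMMERCE", "RETAIL", "SHOP", "GENERAL MERCHANDISE"}

# \b anchors each term at word boundaries; re.escape guards against
# special characters should the term list ever contain any.
BLOCKED_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(term) for term in sorted(BLOCKED_TERMS)) + r")\b",
    re.IGNORECASE,
)


def contains_blocked_word(descriptor: str) -> bool:
    """Match blocked terms only when they appear as whole words or phrases."""
    return BLOCKED_PATTERN.search(descriptor) is not None


print(contains_blocked_word("Corner Shop"))     # True
print(contains_blocked_word("WORKSHOP TOOLS"))  # False
```

Whole-word matching also stops flagging plurals such as “SHOPS”, so which behavior is correct depends on the requirements.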
Bugs & Issues
- Part 1: Used /n instead of \n for newline character
- Part 1: Used row.split(',') where row was an integer index, not the string
- Part 1: Printing instead of returning, which breaks composability
- Part 4: List comprehension syntax wrong: [word if condition] instead of [word for word in ... if condition]
- Part 4: Trailing / in float division expression
- Part 4: Not lowering words before comparing against lowercase IGNORED_WORDS, so “LLC” wouldn’t match “llc”
- Part 4: Passing already-split list to get_words() which expects a string
- Part 4: name_words undefined; the variable was named business_words
- Part 4: get_words defined with 2 params but called with 1
- Part 4: Unreachable return False after unconditional return statement
- Part 4: row[3] instead of row_arr[3], accessing the raw string instead of the split array
- Part 4: Code indented inside get_words function instead of at function body level
Key Learnings
- all() and any() with generators: clean Pythonic way to check conditions across collections
- Chained comparisons: 5 <= len(x) <= 31 is valid Python and reads naturally
- if/elif chain for priority-ordered validation: cleanly maps to “return first error” requirement
- Combine sets for OR matching: set(list_a + list_b) then check membership, instead of checking two lists separately
- word.lower() not in IGNORED_WORDS: must normalize before comparing against normalized list
- Guard against empty collections before division: always check before len(x) / len(y)
Code Quality Notes
- Good extraction of validate_business_name as a separate function
- get_words helper reduces repetition for word normalization
- Could further clean up by making IGNORED_WORDS a module-level constant
- Should return results instead of printing for testability
- The IGNORED_WORDS parameter to get_words could use the closure instead of being passed explicitly
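The “return instead of print” note can be sketched like this: the parsing loop yields (status, name) pairs and the caller decides whether to print them. The names `iter_results` and `completeness_only` are hypothetical. The classifier here is a stand-in covering only the Part 1 check, not the full rule set.

```python
from typing import Callable, Iterator


def iter_results(
    csv_data: str, classify: Callable[[list[str]], str]
) -> Iterator[tuple[str, str]]:
    """Yield (status, business_name) pairs instead of printing them."""
    for row in csv_data.split("\n")[1:]:  # skip the header row
        if not row.strip():
            continue  # ignore blank lines such as a trailing newline
        fields = [field.strip() for field in row.split(",")]
        name = fields[1] if len(fields) > 1 else ""
        yield classify(fields), name


def completeness_only(fields: list[str]) -> str:
    """Stand-in classifier for the demo: only the field-completeness rule."""
    return "VERIFIED" if all(fields) else "ERROR_MISSING_FIELDS"


# Hypothetical sample data for illustration.
sample = (
    "id,name,url,short,full\n"
    "1,Acme LLC,acme.com,ACME,ACME WIDGETS\n"
    "2,,x.com,X,X STORE\n"
)
results = list(iter_results(sample, completeness_only))
for status, name in results:
    print(f"{status}: {name}")
```

Because the results are plain tuples, a test can assert on them directly without capturing stdout.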
Q&A Highlights
- Can you do boolean conditions in comprehensions? Yes: all(condition for item in list) and any(condition for item in list)
- ord() not needed here but was learned in the font renderer problem; good to have in the toolkit
- Substring vs exact word matching for blocked terms: current implementation uses substring (term in string), should clarify with interviewer whether “SHOP” in “WORKSHOP” should match