Fatskills
Practice. Master. Repeat.
Study Guide: TECH **SQL String Functions for Data Analysis: Zero-Fluff, Hands-On Guide**
Source: https://www.fatskills.com/introdution-to-engineering/chapter/tech-sql-string-functions-for-data-analysis-zero-fluff-hands-on-guide

TECH **SQL String Functions for Data Analysis: Zero-Fluff, Hands-On Guide**

By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.

⏱️ ~8 min read

SQL String Functions for Data Analysis: Zero-Fluff, Hands-On Guide

(CONCAT, SUBSTRING, REPLACE, UPPER/LOWER, TRIM)


1. What This Is & Why It Matters

You’re analyzing customer data, and your boss asks: "Can you merge first and last names into a full name column, clean up extra spaces, and extract the first 3 characters of each email for a quick lookup?"

If you don’t know string functions, you’ll waste hours manually editing data in Excel—or worse, writing brittle Python scripts. String functions are your Swiss Army knife for cleaning, transforming, and extracting text in SQL. They let you: - Fix messy data (e.g., " [email protected] ""[email protected]").
- Combine fields (e.g., first_name + last_name"John Doe").
- Extract substrings (e.g., grab the first 3 letters of a product code).
- Standardize text (e.g., "USA" vs "usa" → all uppercase).

Without them, you’re stuck with:
❌ Manual data cleaning (slow, error-prone).
❌ Broken joins (e.g., "New York" vs "new york" won’t match).
❌ Ugly reports (e.g., " Customer Name: John " instead of "Customer Name: John").

Real-world scenario:
You inherit a database where phone numbers are stored as "(123) 456-7890" but your CRM expects "1234567890". With REPLACE and TRIM, you fix this in one SQL query instead of exporting to Excel.


2. Core Concepts & Components


1. CONCAT() / || (Concatenation)

  • Definition: Combines two or more strings into one.
  • Production insight: Useful for merging columns (e.g., first_name + last_name) or generating dynamic SQL (e.g., "SELECT * FROM " + table_name).
  • Syntax quirk: Some databases (PostgreSQL, SQLite) use ||; others (MySQL, SQL Server) use CONCAT().

2. SUBSTRING() / SUBSTR()

  • Definition: Extracts a portion of a string (start position + length).
  • Production insight: Critical for parsing fixed-format data (e.g., extracting area codes from phone numbers or product IDs from SKUs).
  • Syntax quirk: MySQL uses SUBSTRING(str, pos, len); SQL Server uses SUBSTRING(str, start, length).

3. REPLACE()

  • Definition: Replaces all occurrences of a substring with another substring.
  • Production insight: Fixes inconsistent data (e.g., "Street""St") or removes unwanted characters (e.g., REPLACE(phone, "-", "")).
  • Performance trap: Avoid REPLACE in WHERE clauses—it can slow down queries.

4. UPPER() / LOWER()

  • Definition: Converts text to uppercase or lowercase.
  • Production insight: Ensures case-insensitive comparisons (e.g., WHERE LOWER(country) = 'usa').
  • Gotcha: Doesn’t handle Unicode well in some databases (e.g., "ß""SS" in German).

5. TRIM() / LTRIM() / RTRIM()

  • Definition: Removes leading/trailing spaces (or other characters).
  • Production insight: Prevents "silent" data issues (e.g., "John" vs " John " breaking joins).
  • Syntax quirk: TRIM() removes both sides; LTRIM() (left) and RTRIM() (right) are more precise.

6. LEN() / LENGTH()

  • Definition: Returns the length of a string (in characters).
  • Production insight: Useful for validating input (e.g., WHERE LENGTH(password) >= 8).
  • Syntax quirk: SQL Server uses LEN(); MySQL/PostgreSQL use LENGTH().

7. LEFT() / RIGHT()

  • Definition: Extracts the first/last N characters of a string.
  • Production insight: Quickly grabs prefixes/suffixes (e.g., LEFT(email, 3) for email domains).

8. CHARINDEX() / POSITION() / INSTR()

  • Definition: Finds the position of a substring in a string.
  • Production insight: Helps parse delimited data (e.g., CHARINDEX('@', email) to split usernames/domains).
  • Syntax quirk: SQL Server uses CHARINDEX(); PostgreSQL uses POSITION(); MySQL uses INSTR().


3. Step-by-Step Hands-On: Cleaning a Customer Dataset


Prerequisites

  • A SQL database (PostgreSQL, MySQL, SQL Server, or SQLite).
  • A table with messy customer data (example below).

Task:

Clean a customers table with: - Extra spaces in names.
- Inconsistent email formats (e.g., "[email protected]").
- Phone numbers with parentheses/dashes.

Step 1: Create the Table

CREATE TABLE customers (
id INT PRIMARY KEY,
first_name VARCHAR(50),
last_name VARCHAR(50),
email VARCHAR(100),
phone VARCHAR(20) ); INSERT INTO customers VALUES (1, ' John ', ' Doe ', '[email protected]', '(123) 456-7890'), (2, 'Jane', 'Smith', '[email protected]', '555-123-4567'), (3, ' Bob ', 'Johnson', '[email protected]', '123.456.7890');

Step 2: Clean the Data

-- 1. Trim spaces from names
UPDATE customers
SET first_name = TRIM(first_name),
last_name = TRIM(last_name); -- 2. Standardize email case (lowercase) UPDATE customers SET email = LOWER(email); -- 3. Remove dots from emails (e.g., "[email protected]" → "[email protected]") UPDATE customers SET email = REPLACE(email, '.', ''); -- 4. Clean phone numbers (remove non-digits) UPDATE customers SET phone = REPLACE(REPLACE(REPLACE(phone, '(', ''), ')', ''), '-', ''); -- 5. Create a full_name column ALTER TABLE customers ADD COLUMN full_name VARCHAR(100); UPDATE customers SET full_name = CONCAT(TRIM(first_name), ' ', TRIM(last_name)); -- 6. Extract the first 3 letters of the email (for quick lookup) ALTER TABLE customers ADD COLUMN email_prefix VARCHAR(3); UPDATE customers SET email_prefix = LEFT(email, 3);

Step 3: Verify the Results

SELECT * FROM customers;

Expected Output:
| id | first_name | last_name | email | phone | full_name | email_prefix | |----|------------|-----------|---------------------|------------|-------------|--------------| | 1 | John | Doe | [email protected] | 1234567890 | John Doe | joh | | 2 | Jane | Smith | [email protected] | 5551234567 | Jane Smith | jan | | 3 | Bob | Johnson | [email protected] | 1234567890 | Bob Johnson | bob |


4. ? Production-Ready Best Practices


Security

  • Sanitize inputs: Never concatenate raw user input into SQL (risk of injection). Use parameterized queries.
    ```sql -- ❌ UNSAFE (SQL injection risk) EXEC('SELECT * FROM users WHERE name = ''' + @user_input + '''');

-- ✅ SAFE (parameterized) EXEC sp_executesql N'SELECT * FROM users WHERE name = @name', N'@name VARCHAR(50)', @name = @user_input; ```

Performance

  • Avoid functions in WHERE clauses: They prevent index usage.
    ```sql -- ❌ Slow (can't use index on email) SELECT * FROM users WHERE LOWER(email) = '[email protected]';

-- ✅ Fast (uses index) SELECT * FROM users WHERE email = '[email protected]'; `` - Pre-clean data: If possible, clean data at insertion time (e.g.,TRIM()inINSERT` statements).

Reliability & Maintainability

  • Use consistent case: Decide on UPPER/LOWER for all text fields (e.g., always store emails in lowercase).
  • Document transformations: Add comments explaining why data is cleaned a certain way.
    sql -- Standardize phone numbers: remove all non-digits UPDATE customers SET phone = REGEXP_REPLACE(phone, '[^0-9]', '');

Observability

  • Log transformations: Track data changes (e.g., log before/after values in an audit table).
  • Validate outputs: After cleaning, run checks like: ```sql -- Ensure no emails have spaces SELECT COUNT(*) FROM customers WHERE email LIKE '% %';

-- Ensure phone numbers are 10 digits SELECT COUNT(*) FROM customers WHERE LENGTH(phone) != 10; ```


5. ⚠️ Common Mistakes & Traps

Mistake Symptom Fix/Prevention
Forgetting TRIM() Joins fail ("John"" John "). Always TRIM() text fields in WHERE clauses.
Case sensitivity WHERE country = 'USA' misses 'usa'. Use LOWER()/UPPER() for comparisons.
Off-by-one errors SUBSTRING(email, 1, 3) cuts off too much. Test with LENGTH() first (e.g., SELECT LENGTH(email) FROM users LIMIT 10).
Overusing REPLACE() Nested REPLACE() calls get messy. Use regex (e.g., REGEXP_REPLACE() in PostgreSQL) for complex patterns.
Assuming ASCII UPPER('ß')'SS' (German). Use COLLATE for Unicode (e.g., UPPER(name COLLATE utf8_general_ci) in MySQL).


6. ? Exam/Certification Focus


Typical Question Patterns

  1. String extraction:
    "Extract the domain from an email ([email protected]example.com)."
  2. Answer: SUBSTRING(email, CHARINDEX('@', email) + 1, LENGTH(email)) (SQL Server) or SUBSTRING(email, POSITION('@' IN email) + 1) (PostgreSQL).

  3. Case standardization:
    "Ensure all country codes are uppercase in a WHERE clause."

  4. Answer: WHERE UPPER(country) = 'USA'.

  5. Data cleaning:
    "Remove all non-alphanumeric characters from a phone number."

  6. Answer: REGEXP_REPLACE(phone, '[^0-9]', '') (PostgreSQL) or nested REPLACE() calls.

⚠️ Trap Distinctions

  • CONCAT() vs +:
  • CONCAT('a', NULL)'a' (ignores NULL).
  • 'a' + NULLNULL (SQL Server).
  • SUBSTRING indexing:
  • MySQL/PostgreSQL: SUBSTRING(str, 1, 3) (1-based).
  • Python: str[0:3] (0-based).
  • TRIM() defaults:
  • TRIM() removes spaces; TRIM('x' FROM str) removes 'x'.

Scenario-Based Question

"You need to merge first_name and last_name into a full_name column, but some names have extra spaces. Which function do you use?" - Answer: CONCAT(TRIM(first_name), ' ', TRIM(last_name)).
- Why? TRIM() removes extra spaces; CONCAT() merges the strings.


7. ? Hands-On Challenge

Challenge:
Given a table products with a sku column like "ABC-123-XYZ", extract the middle part ("123") into a new column product_id.

Solution:


-- PostgreSQL/MySQL
UPDATE products SET product_id = SUBSTRING(sku, 5, 3);

-- SQL Server
UPDATE products SET product_id = SUBSTRING(sku, 5, 3);

Why it works:
- SUBSTRING(sku, 5, 3) starts at position 5 ("-" is position 4) and takes 3 characters.


8. ? Rapid-Reference Crib Sheet

Function Syntax Example Notes
CONCAT() CONCAT(str1, str2, ...) CONCAT('A', 'B')'AB' Ignores NULLs.
|| (PostgreSQL) str1 || str2 'A' || 'B''AB' Fails if any input is NULL.
SUBSTRING() SUBSTRING(str, start, length) SUBSTRING('abc', 2, 1)'b' 1-based indexing.
LEFT() LEFT(str, length) LEFT('abc', 2)'ab' Extracts from start.
RIGHT() RIGHT(str, length) RIGHT('abc', 2)'bc' Extracts from end.
REPLACE() REPLACE(str, old, new) REPLACE('abc', 'a', 'x')'xbc' Case-sensitive.
UPPER() UPPER(str) UPPER('aBc')'ABC'
LOWER() LOWER(str) LOWER('aBc')'abc'
TRIM() TRIM(str) TRIM(' a ')'a' Removes spaces by default.
LTRIM() LTRIM(str) LTRIM(' a')'a' Removes left spaces.
RTRIM() RTRIM(str) RTRIM('a ')'a' Removes right spaces.
LENGTH() LENGTH(str) LENGTH('abc')3 SQL Server uses LEN().
CHARINDEX() CHARINDEX(substr, str) CHARINDEX('b', 'abc')2 SQL Server only.
POSITION() POSITION(substr IN str) POSITION('b' IN 'abc')2 PostgreSQL.

⚠️ Exam Traps:
- SUBSTRING('abc', 0, 1) → Error in most databases (1-based indexing).
- CONCAT('a', NULL)'a' (ignores NULL), but 'a' + NULLNULL (SQL Server).
- TRIM('x' FROM 'xxabcxx')'abc' (removes 'x' from both sides).


9. ? Where to Go Next

  1. PostgreSQL String Functions – Official docs with examples.
  2. MySQL String Functions – Comprehensive reference.
  3. SQLZoo String Functions – Interactive exercises.
  4. "SQL for Data Analysis" (O’Reilly) – Chapter 5 covers string manipulation in depth.


ADVERTISEMENT