Study Guide: Data Science and Machine Learning 101: Deep Learning and NLP Natural Language Processing NLP Pipeline Tokenization Stemming Lemmatization Word Embeddings TFIDF Word2Vec
Source: https://www.fatskills.com/introdution-to-engineering/chapter/data-science-and-machine-learning-data-science-and-machine-learning-deep-learning-and-nlp-natural-language-processing-nlp-pipeline-tokenization-stemming-lemmatization-word-embeddings-tfidf-word2vec

By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.

What This Is

A Natural Language Processing (NLP) pipeline is the sequence of transformations that turn raw text (e.g., customer reviews, support tickets, or news articles) into numeric features a machine‑learning model can consume. It typically starts with tokenization, moves through stemming/lemmatization, builds TF‑IDF or word‑embedding matrices, and ends with a model (logistic regression, BERT, etc.). In practice, a well‑engineered pipeline lets you predict churn from free‑form user feedback, flag toxic comments, or power a product‑recommendation engine that understands the semantics of product titles.


Key Terms & Formulas

  • Tokenization – Splits a string into atomic units (tokens). tokens = nltk.word_tokenize(text.lower()).
  • Stemming – Reduces a word to its root by chopping suffixes (e.g., running → run). Common algorithm: PorterStemmer.
  • Lemmatization – Maps a word to its dictionary base form using POS tags (e.g., better → good). lemma = WordNetLemmatizer().lemmatize(word, pos='a').
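  • TF‑IDF – Weight = tf × idf where tf = count(word, doc) / len(doc) and idf = log(N / (df + 1)), with N the number of documents and df the number of documents containing the word. This is one common smoothed variant; scikit‑learn's TfidfVectorizer defaults to raw counts for tf, idf = ln((1 + N) / (1 + df)) + 1, and L2‑normalized rows, so its values differ.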
  • Cosine Similarity – sim(u, v) = (u·v) / (||u||·||v||). Used to compare TF‑IDF or embedding vectors.
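  • Word2Vec Skip‑Gram Objective (negative sampling) – Maximize ∑_{w} ∑_{c∈C(w)} [ log σ(v_cᵀ v_w) + ∑_{k=1}^K E_{w_k∼P_n}[log σ(−v_{w_k}ᵀ v_w)] ], summed over every target word w in the corpus and every context word c in its window C(w); for each pair, K negative words w_k are drawn from the noise distribution P_n.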
  • Embedding Matrix – E ∈ ℝ^{|V|×d} where each row E_i is the d‑dimensional vector for token i.
  • Out‑of‑Vocabulary (OOV) Handling – Use a special <UNK> token or sub‑word models (e.g., FastText) to embed unseen words.
  • Bag‑of‑Words (BoW) vs. Embedding – BoW counts token frequencies (sparse); embeddings capture context (dense). Choose BoW for linear models, embeddings for deep nets.
  • Gensim Word2Vec API – model = Word2Vec(sentences, vector_size=100, window=5, min_count=2, workers=4).
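
The TF‑IDF and cosine‑similarity formulas above can be checked by hand on a tiny corpus. The following is a minimal sketch using only the standard library; the four toy documents and the helper names tfidf_vector and cosine are assumptions made for illustration, and the weights follow the tf × log(N / (df + 1)) variant given in this guide rather than scikit‑learn's default.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized (illustrative only).
docs = [
    ["great", "battery", "life"],
    ["battery", "died", "fast"],
    ["great", "screen", "great", "sound"],
    ["slow", "shipping"],
]

N = len(docs)
# df[t] = number of documents that contain term t at least once
df = Counter(term for doc in docs for term in set(doc))
vocab = sorted(df)

def tfidf_vector(doc):
    """TF-IDF with tf = count / len(doc) and idf = log(N / (df + 1))."""
    counts = Counter(doc)
    return [(counts[t] / len(doc)) * math.log(N / (df[t] + 1)) for t in vocab]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

vectors = [tfidf_vector(d) for d in docs]
print(cosine(vectors[0], vectors[1]))  # docs 0 and 1 share "battery", so this is positive
print(cosine(vectors[0], vectors[3]))  # docs 0 and 3 share no terms, so this is 0.0
```

One property of this idf variant worth knowing: a term that appears in every document gets log(N / (N + 1)), which is slightly negative, and a term that appears in N − 1 documents gets exactly zero weight.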


Step‑by‑Step / Process Flow

  1. Load & Inspect
    python
    import pandas as pd
    df = pd.read_csv('reviews.csv')   # expects 'text' and 'label' columns
    print(df['text'].head())
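  2. Clean & Tokenize – lower‑case, remove HTML, keep alphanumerics, then tokenize.
    python
    import re, nltk
    # word_tokenize needs the Punkt data: nltk.download('punkt')
    # (newer NLTK releases ask for 'punkt_tab' instead)
    def clean(txt):
        txt = re.sub(r'<.*?>', ' ', txt.lower())   # lower-case, strip HTML tags
        txt = re.sub(r'[^a-z0-9\s]', ' ', txt)     # keep alphanumerics only
        return nltk.word_tokenize(txt)
    df['tokens'] = df['text'].apply(clean)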
  3. Stem / Lemmatize (choose one based on downstream model).
    python
    # WordNetLemmatizer needs the WordNet data: nltk.download('wordnet')
    stemmer = nltk.PorterStemmer()
    lemmatizer = nltk.WordNetLemmatizer()
    df['stemmed'] = df['tokens'].apply(lambda t: [stemmer.stem(w) for w in t])
    # pos='v' treats every token as a verb; pass real POS tags for better lemmas
    df['lemma'] = df['tokens'].apply(lambda t: [lemmatizer.lemmatize(w, pos='v') for w in t])
  4. Feature Construction – build one or both of the representations in steps 5 and 6.
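  5. TF‑IDF (scikit‑learn):
    python
    from sklearn.feature_extraction.text import TfidfVectorizer
    tfidf = TfidfVectorizer(max_features=5000, ngram_range=(1,2))
    # TfidfVectorizer tokenizes the raw text itself. Fitting on the full corpus
    # is for brevity; in a real evaluation, fit on the training split only so
    # test-set vocabulary and idf values do not leak into training.
    X_tfidf = tfidf.fit_transform(df['text'])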
  6. Word2Vec Embeddings (Gensim):
    python
    import numpy as np
    from gensim.models import Word2Vec
    w2v = Word2Vec(df['tokens'].tolist(), vector_size=100, window=5, min_count=2)
    # sentence embedding = mean of token vectors
    def embed(tokens):
        vecs = [w2v.wv[w] for w in tokens if w in w2v.wv]
        # fall back to a zero vector when no token is in the vocabulary
        return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.wv.vector_size)
    df['embed'] = df['tokens'].apply(embed)
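  7. Train / Evaluate – split, fit a baseline (e.g., LogisticRegression on TF‑IDF), then compare it against a deep model (e.g., simple feed‑forward on embeddings). The snippet covers the baseline only.
    python
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report
    X_train, X_test, y_train, y_test = train_test_split(X_tfidf, df['label'], test_size=0.2, random_state=42)
    clf = LogisticRegression(max_iter=200).fit(X_train, y_train)
    # report precision, recall, and F1 per class on the held-out split
    print(classification_report(y_test, clf.predict(X_test)))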
  8. Iterate – tune max_features, vector_size, window, add n‑grams, try pretrained embeddings (GloVe, fastText), or fine‑tune a transformer if accuracy stalls.
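
The steps above fit the TF‑IDF vectorizer on the whole dataset and split afterwards, which lets test‑set statistics influence the features. One way to avoid that is to split the raw text first and wrap the vectorizer and classifier in a scikit‑learn Pipeline. The sketch below assumes a small in‑memory dataset (the texts and labels lists are made up and stand in for reviews.csv); it is an illustration of the pattern, not a tuned model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for reviews.csv (1 = positive, 0 = negative).
texts = [
    "great battery life", "love the screen", "works great", "excellent sound",
    "battery died fast", "terrible screen", "stopped working", "awful sound",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first, keeping the class balance in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The vectorizer lives inside the pipeline, so it is fit on X_train only.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=200),
)
model.fit(X_train, y_train)

# zero_division=0 silences warnings if a class is never predicted.
print(classification_report(y_test, model.predict(X_test), zero_division=0))
```

Because the pipeline is a single estimator, it can also be passed to cross_val_score or GridSearchCV, and the vectorizer is then refit inside each fold.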

Common Mistakes

  • Mistake: Using stemming and lemmatization together.
    Correction: Pick one; lemmatization preserves more meaning and works better with embeddings, while stemming can over‑truncate useful tokens.

  • Mistake: Feeding raw token strings directly into scikit‑learn models.
    Correction: Convert to numeric vectors (TF‑IDF, CountVectorizer, or embeddings) first; scikit‑learn estimators expect numeric arrays and raise an error when given raw strings or token lists.

  • Mistake: Ignoring OOV words when using a pretrained Word2Vec.
    Correction: Provide an <UNK> vector (e.g., zeros or the average of known vectors) or switch to sub‑word models like FastText that can compose OOV embeddings.

  • Mistake: Setting min_df=1 in TF‑IDF for a large corpus.
    Correction: Raise min_df (e.g., 5) to drop very rare terms and lower max_df (e.g., 0.9) to drop near‑ubiquitous ones; this reduces noise and memory usage.

  • Mistake: Evaluating only accuracy on highly imbalanced sentiment data.
    Correction: Use precision, recall, F1, or ROC‑AUC; they surface performance on the minority class (e.g., negative reviews).
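
To see why accuracy misleads on imbalanced data, consider a classifier that always predicts the majority class. The sketch below computes the metrics by hand with the standard library; the 95/5 class split and the binary_metrics helper are assumptions chosen for illustration.

```python
# Hypothetical labels: 95 positive reviews (0) and 5 negative reviews (1, the minority class).
y_true = [0] * 95 + [1] * 5
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * 100

def binary_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall, f1) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = binary_metrics(y_true, y_pred)
print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
# → accuracy=0.95 precision=0.00 recall=0.00 f1=0.00
```

The model never finds a single negative review, yet accuracy alone reports 95%.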


Data Science Interview / Practical Insights

  1. “When would you prefer TF‑IDF over Word2Vec?” – Expect to discuss sparsity vs. semantic richness, linear models vs. deep nets, and dataset size.
  2. “Explain the difference between stemming and lemmatization and their impact on downstream performance.” – Interviewers look for awareness of morphological vs. lexical normalization.
  3. “How do you handle OOV tokens in a production pipeline that uses pretrained embeddings?” – Mention <UNK> vectors, character‑n‑gram embeddings (FastText), or fallback to TF‑IDF.
  4. “What is the role of the ‘window’ hyperparameter in Word2Vec, and how does changing it affect the learned vectors?” – Larger windows capture broader context (topic level), smaller windows capture syntactic relations.
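
For the OOV question, it can help to have a concrete fallback in mind. The following is a minimal sketch of the <UNK>‑vector approach, assuming a tiny hand‑written table of 3‑dimensional vectors in place of real pretrained embeddings; pretrained, unk_vector, embed_tokens, and oov_rate are hypothetical names, and the mean of known vectors is only one of several reasonable choices for <UNK>.

```python
import numpy as np

# Hypothetical 3-dimensional "pretrained" vectors; real ones would be loaded from disk.
pretrained = {
    "wireless": np.array([0.9, 0.1, 0.0]),
    "earbuds": np.array([0.8, 0.2, 0.1]),
    "headphones": np.array([0.7, 0.3, 0.1]),
}

# One common choice for <UNK>: the mean of all known vectors.
unk_vector = np.mean(list(pretrained.values()), axis=0)

def embed_tokens(tokens):
    """Mean-pool token vectors, substituting unk_vector for OOV tokens."""
    if not tokens:
        return np.zeros_like(unk_vector)
    return np.mean([pretrained.get(t, unk_vector) for t in tokens], axis=0)

def oov_rate(tokens):
    """Fraction of tokens missing from the vocabulary; worth monitoring in production."""
    if not tokens:
        return 0.0
    return sum(t not in pretrained for t in tokens) / len(tokens)

title = ["wireless", "bluetooth", "earbuds"]  # "bluetooth" is OOV here
print(embed_tokens(title))
print(oov_rate(title))
```

Tracking the OOV rate over time gives an early signal that the production vocabulary has drifted away from the one the embeddings were trained on.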

Quick Check Questions

  1. Q: Your sentiment classifier using TF‑IDF shows high training accuracy but low test accuracy. What is the most likely cause?
    A: Over‑fitting; reduce max_features, increase min_df, or strengthen regularization (e.g., lower C in LogisticRegression, since C is the inverse of regularization strength).

  2. Q: You need a similarity score between two product titles that captures meaning (e.g., “wireless earbuds” vs. “Bluetooth headphones”). Which representation should you use?
    A: Word embeddings (Word2Vec/fastText) with cosine similarity, because they encode semantic relationships beyond exact token overlap.

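  3. Q: In a Word2Vec Skip‑Gram model, you raise negative from Gensim's default of 5 to 10. What effect does this have?
    A: Each positive (target, context) pair is contrasted with 10 noise words instead of 5, so training takes longer; the extra negatives can help on small corpora, but better embeddings are not guaranteed.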
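
The cost of extra negative samples is easier to see when the per‑pair objective is written out. Below is a minimal NumPy sketch of the skip‑gram negative‑sampling term for a single (target, context) pair; the vectors are random placeholders, the values of d and K are illustrative, and sgns_objective is a hypothetical helper rather than Gensim's internal implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 8, 5  # embedding size and number of negative samples (illustrative values)

v_w = rng.normal(scale=0.1, size=d)         # target-word vector
v_c = rng.normal(scale=0.1, size=d)         # true context-word vector
v_neg = rng.normal(scale=0.1, size=(K, d))  # K vectors standing in for draws from P_n

def log_sigmoid(x):
    """Numerically stable log σ(x) = -log(1 + exp(-x))."""
    return -np.logaddexp(0.0, -x)

def sgns_objective(v_w, v_c, v_neg):
    """log σ(v_c·v_w) + Σ_k log σ(−v_k·v_w) for one (target, context) pair."""
    positive = log_sigmoid(v_c @ v_w)
    negative = log_sigmoid(-(v_neg @ v_w)).sum()
    return positive + negative

print(sgns_objective(v_w, v_c, v_neg))  # always <= 0; training pushes it toward 0
```

The negative term is a sum over K dot products, so each training pair costs roughly K + 1 dot products and gradient updates; doubling K roughly doubles that part of the work.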


Last‑Minute Cram Sheet (10 one‑liners)

  1. Tokenization → list of words – lower‑case first for BoW/TF‑IDF; keep case when it carries signal (e.g., names, cased transformer models).
  2. Stemming = heuristic root; Lemmatization = dictionary base form – lemmatization ≈ “smart” stemming.
  3. TF‑IDF = tf × log(N / (df+1)) – idf down‑weights ubiquitous words.
  4. Cosine similarity = (u·v) / (||u||·||v||) – works on both sparse TF‑IDF and dense embeddings.
  5. Word2Vec Skip‑Gram objective – predicts context words from a target word; use negative sampling for efficiency.
  6. Embedding dimension (d) ≈ 100–300 – larger d captures more nuance but needs more data.
  7. OOV handling: <UNK> vector or FastText sub‑word composition.
  8. Gensim default min_count=5 – filters rare words; adjust for small corpora.
  9. ⚠️ TF‑IDF vectors are high‑dimensional & sparse → use linear models (LogReg, SVM) or dimensionality reduction (TruncatedSVD).
  10. ⚠️ Pretrained embeddings are static; fine‑tune only if you have enough labeled data or a transformer‑based model.
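
Item 9 mentions TruncatedSVD as a way to shrink sparse TF‑IDF features. A minimal sketch with scikit‑learn follows; the six toy texts and n_components=2 are assumptions for illustration (real corpora typically use a few hundred components).

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus (illustrative only).
texts = [
    "great battery life",
    "battery died fast",
    "great screen and sound",
    "terrible screen",
    "fast shipping",
    "slow shipping and bad packaging",
]

# Sparse TF-IDF matrix: one column per vocabulary term.
X_sparse = TfidfVectorizer().fit_transform(texts)

# TruncatedSVD works directly on sparse input (no centering, unlike PCA).
svd = TruncatedSVD(n_components=2, random_state=42)
X_dense = svd.fit_transform(X_sparse)

print(X_sparse.shape, "->", X_dense.shape)
print(svd.explained_variance_ratio_.sum())  # share of variance kept by 2 components
```

Applying TruncatedSVD to a TF‑IDF matrix is commonly known as latent semantic analysis (LSA); the dense output can feed models that do not handle sparse input well.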

