By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.
A Natural Language Processing (NLP) pipeline is the sequence of transformations that turn raw text (e.g., customer reviews, support tickets, or news articles) into numeric features a machine‑learning model can consume. It typically starts with tokenization, moves through stemming/lemmatization, builds TF‑IDF or word‑embedding matrices, and ends with a model (logistic regression, BERT, etc.). In practice, a well‑engineered pipeline lets you predict churn from free‑form user feedback, flag toxic comments, or power a product‑recommendation engine that understands the semantics of product titles.
tokens = nltk.word_tokenize(text.lower())
lemma = WordNetLemmatizer().lemmatize(word, pos='a')
tf = count(word, doc) / len(doc)
idf = log(N / (df + 1))
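The two formulas above can be combined into a small worked example. This is a minimal sketch in plain Python, assuming documents are already tokenized; the helper name `tf_idf` and the toy corpus are illustrative only. Library implementations such as scikit-learn's `TfidfVectorizer` use a slightly different smoothing and normalization, so their numbers will not match these.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Uses tf = count(word, doc) / len(doc) and idf = log(N / (df + 1)).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / (df[term] + 1))
            for term, count in counts.items()
        })
    return weights

# Toy corpus (made up for illustration).
docs = [
    ["great", "battery", "life"],
    ["battery", "died", "fast"],
    ["great", "screen"],
    ["slow", "shipping"],
]
weights = tf_idf(docs)
```

In the first document, "life" (found in one document) gets a higher weight than "great" (found in two). Note that with the `df + 1` smoothing, a term present in every document receives a slightly negative idf.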
sim(u, v) = (u·v) / (||u||·||v||)
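The cosine similarity formula above can be written directly in plain Python. This is a minimal sketch; the function name and the choice to return 0.0 for a zero vector are assumptions made here, not part of the formula.

```python
import math

def cosine_similarity(u, v):
    """Compute (u . v) / (||u|| * ||v||) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: treat a zero vector as dissimilar to everything.
        return 0.0
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Vectors pointing the same way score close to 1 regardless of length, and orthogonal vectors score 0, which is why cosine similarity is a common choice for comparing embeddings of different magnitudes.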
∑_{w∈V} ∑_{c∈C(w)} log σ(v_cᵀ v_w) + ∑_{k=1}^K E_{w_k∼P_n}[log σ(−v_{w_k}ᵀ v_w)]
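To make the objective above concrete, the sketch below evaluates it for a single (word, context) pair, approximating the expectation with a fixed list of sampled negative vectors. The function names and the toy vectors are illustrative assumptions; real implementations such as gensim also handle sampling and gradient updates, which are omitted here.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sgns_pair_objective(v_w, v_c, negatives):
    """log σ(v_c · v_w) plus the sum over sampled negatives of log σ(−v_k · v_w)."""
    positive_term = log_sigmoid(dot(v_c, v_w))
    negative_term = sum(log_sigmoid(-dot(v_k, v_w)) for v_k in negatives)
    return positive_term + negative_term

v_w = [1.0, 0.0]                       # center word vector (toy values)
negatives = [[-1.0, 0.0], [0.0, 1.0]]  # K = 2 sampled noise vectors (toy values)

aligned = sgns_pair_objective(v_w, [2.0, 0.0], negatives)   # context agrees with v_w
opposed = sgns_pair_objective(v_w, [-2.0, 0.0], negatives)  # context opposes v_w
```

The objective is higher (closer to zero) when the true context vector aligns with the center word and the noise vectors do not, which is the behavior training tries to maximize.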
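Embedding matrix: E ∈ ℝ^{|V|×d}, where row E_i is the vector for token i and <UNK> is the placeholder token for out‑of‑vocabulary words.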
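An embedding lookup with an unknown-token fallback might look like the sketch below. The vocabulary, the vector values, and the `lookup` helper are all made up for illustration; a real matrix would come from a trained model.

```python
# Hypothetical vocabulary mapping each token to its row index in E.
vocab = {"<UNK>": 0, "wireless": 1, "earbuds": 2}

# Toy embedding matrix E with |V| = 3 rows and d = 2 columns.
E = [
    [0.0, 0.0],  # <UNK>
    [0.9, 0.1],  # wireless
    [0.8, 0.3],  # earbuds
]

def lookup(token):
    """Return row E_i for the token, falling back to the <UNK> row."""
    index = vocab.get(token, vocab["<UNK>"])
    return E[index]

vectors = [lookup(t) for t in ["wireless", "headset"]]
```

Here "headset" is not in the vocabulary, so it maps to the `<UNK>` row instead of raising a `KeyError`.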
model = Word2Vec(sentences, vector_size=100, window=5, min_count=2, workers=4)
```python
import pandas as pd

df = pd.read_csv('reviews.csv')
df['text'].head()
```
```python
import re
import nltk

def clean(txt):
    # Lowercase, replace HTML tags with spaces, then tokenize.
    txt = re.sub(r'<.*?>', ' ', txt.lower())
    return nltk.word_tokenize(txt)

df['tokens'] = df['text'].apply(clean)
```
```python
stemmer = nltk.PorterStemmer()
lemmatizer = nltk.WordNetLemmatizer()

df['stemmed'] = df['tokens'].apply(lambda t: [stemmer.stem(w) for w in t])
# pos='v' lemmatizes every token as if it were a verb.
df['lemma'] = df['tokens'].apply(lambda t: [lemmatizer.lemmatize(w, pos='v') for w in t])
```
```python
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X_tfidf = tfidf.fit_transform(df['text'])
```
```python
import numpy as np
from gensim.models import Word2Vec

w2v = Word2Vec(df['tokens'], vector_size=100, window=5, min_count=2)

# Sentence embedding = mean of token vectors
def embed(tokens):
    vecs = [w2v.wv[w] for w in tokens if w in w2v.wv]
    # Fall back to a zero vector when no token is in the vocabulary.
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.wv.vector_size)

df['embed'] = df['tokens'].apply(embed)
```
```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Note: X_tfidf was fitted on the full corpus, so the test split is not fully unseen.
X_train, X_test, y_train, y_test = train_test_split(
    X_tfidf, df['label'], test_size=0.2, random_state=42)
clf = LogisticRegression(max_iter=200).fit(X_train, y_train)
```
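The steps above fit the TF‑IDF vectorizer on the whole corpus before splitting. One common alternative, sketched below under the assumption that scikit-learn is installed, is to split the raw text first and wrap the vectorizer and classifier in a single pipeline so the vocabulary and IDF weights are learned from the training split only. The tiny corpus and labels are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy data: 1 = positive review, 0 = negative review.
texts = [
    "great battery life", "love the screen", "works perfectly", "fast shipping",
    "battery died fast", "screen cracked quickly", "stopped working", "slow shipping",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first so the test documents stay unseen.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=200),
)
pipeline.fit(X_train, y_train)        # vectorizer is fitted on training text only
predictions = pipeline.predict(X_test)
```

A pipeline also keeps preprocessing and model together, so the same object can be passed to cross-validation or saved for inference without repeating the transformation steps by hand.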
Hyperparameters used above: max_features (TF‑IDF), vector_size and window (Word2Vec).
Mistake: Using stemming and lemmatization together. Correction: Pick one; lemmatization preserves more meaning and works better with embeddings, while stemming can over‑truncate useful tokens.
Mistake: Feeding raw token strings directly into scikit‑learn models. Correction: Convert to numeric vectors (TF‑IDF, CountVectorizer, or embeddings) first; estimators such as LogisticRegression expect a numeric feature matrix and raise a ValueError when given strings.
Mistake: Ignoring out‑of‑vocabulary (OOV) words when using a pretrained Word2Vec model. Correction: Provide an <UNK> vector (e.g., zeros or the average of known vectors) or switch to sub‑word models like FastText that can compose OOV embeddings.
Mistake: Leaving min_df=1 in TF‑IDF for a large corpus. Correction: Raise min_df (e.g., 5) to drop extremely rare terms and lower max_df to drop extremely common ones; this reduces noise and memory usage.
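The effect of these two thresholds can be sketched without any library. The helper below mimics the usual convention where an integer `min_df` is an absolute document count and a float `max_df` is a fraction of documents; the function name and toy corpus are illustrative assumptions.

```python
from collections import Counter

def prune_vocabulary(docs, min_df=2, max_df=0.8):
    """Keep terms found in at least `min_df` documents and at most a `max_df` fraction of them."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    return {
        term
        for term, count in df.items()
        if count >= min_df and count / n_docs <= max_df
    }

# Toy tokenized corpus (made up for illustration).
docs = [
    ["the", "battery", "lasts"],
    ["the", "battery", "died"],
    ["the", "screen", "cracked"],
    ["the", "screen", "glows"],
]
kept = prune_vocabulary(docs, min_df=2, max_df=0.8)
```

In this corpus "the" is dropped for appearing in every document, the one-off words are dropped for being too rare, and only "battery" and "screen" remain.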
Mistake: Evaluating only accuracy on highly imbalanced sentiment data. Correction: Use precision, recall, F1, or ROC‑AUC; they surface performance on the minority class (e.g., negative reviews).
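A small worked example shows why accuracy alone can mislead. The sketch below assumes a made-up test set of 95 positive and 5 negative reviews and a degenerate model that always predicts "positive"; the helper name is illustrative.

```python
def classification_metrics(y_true, y_pred, positive):
    """Compute accuracy plus precision, recall, and F1 for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# 0 = negative review (minority class), 1 = positive review.
y_true = [0] * 5 + [1] * 95
y_pred = [1] * 100  # model that always predicts "positive"

# Score the minority class, since that is what accuracy hides.
metrics = classification_metrics(y_true, y_pred, positive=0)
```

The model reaches 95% accuracy while its recall and F1 on negative reviews are both zero, so it never finds a single negative review.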
Q: Your sentiment classifier using TF‑IDF shows high training accuracy but low test accuracy. What is the most likely cause? A: Overfitting. Reduce max_features, increase min_df, or strengthen regularization (e.g., lower C in LogisticRegression, since C is the inverse of regularization strength).
Q: You need a similarity score between two product titles that captures meaning (e.g., “wireless earbuds” vs. “Bluetooth headphones”). Which representation should you use? A: Word embeddings (Word2Vec/fastText) with cosine similarity, because they encode semantic relationships beyond exact token overlap.
Q: In a Word2Vec Skip‑Gram model, you raise negative from the gensim default of 5 to 10. What effect does this have? A: Each training pair is contrasted against more noise words, which can improve embedding quality (especially on smaller corpora) at the cost of longer training time.