By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.
These are the foundational mechanics of how large language models (LLMs) process and generate text. Understanding them helps you write better prompts, avoid errors, and design AI workflows that scale. For example, a marketing team using an LLM to draft social media posts must stay within the model’s context window (e.g., 8K tokens) to avoid losing earlier instructions—otherwise, the model might ignore brand guidelines mentioned at the start.
Model: A trained AI system that predicts the next word (or token) in a sequence. Different models (e.g., GPT-4, Claude 3) have unique strengths (e.g., coding vs. creative writing) and trade-offs (speed vs. accuracy). Example: Use Mistral for multilingual tasks but GPT-4 for complex reasoning.
Token: The basic unit of text an LLM processes. Tokens can be words, subwords (e.g., "unhappiness" → "un", "happi", "ness"), or punctuation. As a rule of thumb for English, 1 token ≈ 4 characters or 0.75 words. Example: "Hello, world!" is typically 4 tokens in GPT-style tokenizers: ["Hello", ",", " world", "!"]; exact splits vary by tokenizer.
Context window: The maximum number of tokens an LLM can "remember" in a single interaction (e.g., 4K, 32K, 128K tokens). Includes both your input (prompt) and the model’s output. Example: A 4K-token window fits ~3,000 words—enough for a short report but not a full book chapter.
Input/output limits: Models have hard caps on tokens per request. Depending on the API, an oversized prompt is rejected with an error or truncated, and a response that reaches the output cap is cut off. Example: If your prompt + output would exceed 8K tokens, the response may stop mid-sentence.
Attention mechanism: How models weigh the importance of tokens in the context window. In long prompts, information in the middle of the window is often used less reliably than information at the start or end. Example: Put critical instructions (e.g., "Summarize in 3 bullet points") at the end of your prompt for better results.
Temperature: A setting (commonly 0.0–2.0; the allowed range varies by provider) controlling randomness in outputs. Lower = more deterministic (good for facts); higher = more creative (good for brainstorming). Example: Set temperature=0.2 for legal document drafting; 0.8 for ad copy.
Hallucination: When a model generates plausible but false information. More common with vague prompts or insufficient context. Example: Asking "What’s the CEO’s favorite color?" without providing data may yield a confident guess like "blue."
Token efficiency: Strategies to reduce token usage (e.g., abbreviations, bullet points) to fit more content in the context window. Example: Replace "approximately" with "~" to save tokens.
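To make the temperature setting defined above more concrete, here is a minimal sketch of the standard idea behind it: the model's raw scores for candidate next tokens are divided by the temperature before being turned into probabilities. The function name and the three example scores are assumptions for illustration, not any particular model's implementation.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, scaled by temperature."""
    if temperature <= 0:
        # This sketch rejects 0; it only covers the temperature > 0 case.
        raise ValueError("temperature must be > 0 in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate next tokens.
logits = [2.0, 1.0, 0.5]
low = softmax_with_temperature(logits, 0.2)   # sharply favors the top token
high = softmax_with_temperature(logits, 1.5)  # spreads probability more evenly

print([round(p, 3) for p in low])
print([round(p, 3) for p in high])
```

At the low setting almost all probability lands on the top-scoring token, which is why low temperatures suit factual tasks; at the high setting the alternatives get a real chance of being sampled.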
Estimate token count: Use a tokenizer tool to check your prompt + expected output.
Design your prompt for the context window:
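Separate instructions from data with delimiters such as ###, e.g., a "### Instructions ###" block followed by a "### Data ###" block.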
For long documents, split into chunks and process sequentially.
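The chunking step can be sketched roughly as follows. This is a minimal illustration that assumes the document is split on blank lines and that the ~4 characters per token rule of thumb is close enough for budgeting; `estimate_tokens` and `chunk_paragraphs` are hypothetical helper names, and a real tokenizer would give more accurate counts.

```python
def estimate_tokens(text):
    """Rough estimate using the ~4 characters per token rule of thumb."""
    return max(1, len(text) // 4)

def chunk_paragraphs(text, max_tokens):
    """Group paragraphs into chunks whose estimated size stays within max_tokens.

    A single paragraph larger than max_tokens is kept whole as its own
    oversized chunk; this sketch does not split inside paragraphs.
    """
    chunks, current, current_tokens = [], [], 0
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        cost = estimate_tokens(para)
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and current_tokens + cost > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(para)
        current_tokens += cost
    if current:
        chunks.append("\n\n".join(current))
    return chunks

# Three paragraphs of roughly 100 estimated tokens each.
document = "\n\n".join(["A" * 400, "B" * 400, "C" * 400])
chunks = chunk_paragraphs(document, max_tokens=250)
print(len(chunks))
```

Each chunk can then be sent to the model in order, with the instructions repeated in every request so that no chunk depends on context from an earlier one.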
Set model parameters:
Max tokens: Set it so that prompt tokens plus max tokens stays slightly below the model’s limit (e.g., a combined 3,800 for a 4K window) to avoid truncation.
Validate outputs:
Use repetition detection (e.g., "Does this answer repeat earlier points?") to catch loops.
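One simple way to automate a repetition check is to count repeated word sequences in the output. The sketch below is an assumption about how such a check might look, not a standard library feature; the function names, the 4-word window, and any threshold you apply to the ratio are illustrative choices.

```python
from collections import Counter

def word_ngrams(text, n=4):
    """Return all overlapping n-word sequences in the text, lowercased."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def repeated_ngrams(text, n=4, min_count=2):
    """Return n-grams that occur at least min_count times, with their counts."""
    counts = Counter(word_ngrams(text, n))
    return {gram: c for gram, c in counts.items() if c >= min_count}

def repetition_ratio(text, n=4):
    """Share of n-grams that are repeats; 0.0 means every n-gram is unique."""
    grams = word_ngrams(text, n)
    if not grams:
        return 0.0
    return 1 - len(set(grams)) / len(grams)

looping = "the product is great " * 5
clean = "our new kettle boils water in under two minutes"

print(round(repetition_ratio(looping), 2))
print(repetition_ratio(clean))
```

An output with a high ratio is a candidate for regeneration or manual review; the cutoff is best tuned on real outputs from your own workflow.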
Optimize for cost/speed:
Cache frequent prompts (e.g., "Generate a meeting summary template") to avoid reprocessing.
Monitor and iterate:
Mistake: Ignoring the context window and pasting an entire 10K-word report into a 4K-token model. Correction: Split the report into sections (e.g., by headings) and process each separately. Why: The model will truncate or forget early content.
Mistake: Assuming all models have the same token limits (e.g., treating Claude’s 200K window like GPT-4’s 8K). Correction: Check the model’s documentation for its specific limits. Why: Overloading a model causes errors or silent failures.
Mistake: Setting temperature=1.0 for factual tasks (e.g., financial summaries). Correction: Use temperature=0.2–0.5 for tasks requiring consistency. Why: Higher temperatures increase randomness and hallucinations.
Mistake: Writing prompts with redundant instructions (e.g., "Be concise. Write a short summary. Keep it brief."). Correction: Consolidate instructions (e.g., "Summarize in 3 bullet points, <50 words each"). Why: Redundancy wastes tokens and dilutes focus.
Mistake: Not accounting for output tokens, leading to truncated responses. Correction: Estimate output length (e.g., 200 tokens for a short email) and set max_tokens accordingly. Why: The model stops mid-sentence if it hits the limit.
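The output-token correction above comes down to simple arithmetic, sketched here under stated assumptions: the function names, the 50-token safety margin, and the 25% headroom are illustrative choices, and the prompt token count would come from a tokenizer or an estimate.

```python
def output_budget(context_window, prompt_tokens, safety_margin=50):
    """Tokens left for the response after the prompt and a safety margin."""
    remaining = context_window - prompt_tokens - safety_margin
    if remaining <= 0:
        raise ValueError("prompt does not fit; shorten it or split the input")
    return remaining

def pick_max_tokens(context_window, prompt_tokens, expected_output, headroom=1.25):
    """Request the expected length plus headroom, capped by what the window allows."""
    wanted = int(expected_output * headroom)
    return min(wanted, output_budget(context_window, prompt_tokens))

# A short email (~200 tokens expected) with a 1,200-token prompt in a 4K window.
roomy = pick_max_tokens(4096, 1200, 200)

# The same email when the prompt already uses 3,900 tokens.
tight = pick_max_tokens(4096, 3900, 200)

print(roomy, tight)
```

When the second case comes up, the capped value is a signal that the response will probably be cut short unless the prompt is trimmed first.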
Scenario: You’re using an LLM to draft a press release about a new product. Your prompt includes:
1. A 500-word product description.
2. Brand guidelines (200 words).
3. Instructions: "Write a 300-word press release in our brand voice. Use the product name at least twice."
The model’s context window is 4K tokens. After generating the output, you notice the press release ignores the brand voice.
Question: What’s the likely issue, and how would you fix it?
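Answer: Truncation is unlikely if the inputs are as described: prompt plus output totals roughly 1,000 words (about 1,300–1,400 tokens), well inside a 4K window. The more likely cause is placement: the brand guidelines sit in the middle of the prompt, where they tend to carry less weight than the instructions at the end. Fix: Move the brand guidelines to the end of the prompt, next to the instructions, and shorten the product description if the real prompt approaches the 4K limit.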