Fatskills
Practice. Master. Repeat.
Study Guide: AI Literacy: Models, Tokens, Context Windows, and Outputs
Source: https://www.fatskills.com/ai-for-work/chapter/ai-ai-literacy-models-tokens-context-windows-and-outputs

By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.

Models, Tokens, Context Windows, and Outputs

What This Is

These are the foundational mechanics of how large language models (LLMs) process and generate text. Understanding them helps you write better prompts, avoid errors, and design AI workflows that scale. For example, a marketing team using an LLM to draft social media posts must stay within the model’s context window (e.g., 8K tokens) to avoid losing earlier instructions—otherwise, the model might ignore brand guidelines mentioned at the start.


Key Facts & Principles

  • Model: A trained AI system that predicts the next token (a word or word fragment) in a sequence. Different models (e.g., GPT-4, Claude 3) have different strengths (e.g., coding vs. creative writing) and trade-offs (speed vs. accuracy). Example: A team might use Mistral for multilingual tasks and GPT-4 for complex reasoning; test candidates on your own task before committing.

  • Token: The basic unit of text an LLM processes. Tokens can be words, subwords (e.g., "unhappiness" → "un", "happi", "ness"), or punctuation. As a rule of thumb, 1 token ≈ 4 characters ≈ 0.75 English words. Example: "Hello, world!" is about 4 tokens in a typical GPT-style tokenizer: ["Hello", ",", " world", "!"]. Exact splits vary by tokenizer.

  • Context window: The maximum number of tokens an LLM can "remember" in a single interaction (e.g., 4K, 32K, 128K tokens). Includes both your input (prompt) and the model’s output. Example: A 4K-token window fits ~3,000 words—enough for a short report but not a full book chapter.

  • Input/output limits: Models have hard caps on tokens per request. Depending on the tool, exceeding them returns an error, drops the oldest text, or cuts the response short, sometimes without warning. Example: If your prompt + output exceeds 8K tokens, the response may stop mid-sentence.

  • Attention mechanism: How models weigh the importance of tokens in the context window. In long prompts, tokens in the middle of the window often get less "attention" than those at the start or end. Example: Put critical instructions (e.g., "Summarize in 3 bullet points") at the start or end of your prompt for better results.

  • Temperature: A setting (commonly 0.0–2.0, though the range varies by provider) controlling randomness in outputs. Lower = more deterministic (good for facts); higher = more varied (good for brainstorming). Example: Set temperature=0.2 for legal document drafting; 0.8 for ad copy.

  • Hallucination: When a model generates plausible but false information. More common with vague prompts or insufficient context. Example: Asking "What’s the CEO’s favorite color?" without providing data may yield a confident guess like "blue."

  • Token efficiency: Strategies to reduce token usage (e.g., abbreviations, bullet points) to fit more content in the context window. Example: Replace "approximately" with "~" to save tokens.
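
To make the temperature setting concrete, here is a minimal sketch of the textbook formulation: raw scores for candidate tokens are divided by the temperature before being turned into probabilities. The function name and the example scores are illustrative assumptions, and providers do not all document their sampling internals, so this is not any specific model's implementation.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw token scores into probabilities, scaled by temperature."""
    if temperature <= 0:
        # Treat temperature 0 as greedy decoding: all mass on the top score.
        top = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == top else 0.0 for i in range(len(logits))]
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate next tokens.
logits = [2.0, 1.0, 0.5]
for t in (0.2, 0.8, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

At low temperature nearly all probability lands on the top-scoring token, which is why outputs look consistent; at high temperature the distribution flattens and less likely tokens are picked more often.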


Step-by-Step Application

  1. Audit your use case:
     • Identify whether you need precision (e.g., legal contracts) or creativity (e.g., taglines).
     • Estimate token count: Use a tokenizer tool to check your prompt + expected output.

  2. Design your prompt for the context window:
     • Place critical instructions last (e.g., "Format as a table" after the data).
     • Use separators (e.g., ###) to organize sections (e.g., ### Instructions ### Data ###).
     • For long documents, split into chunks and process sequentially.

  3. Set model parameters:
     • Temperature: 0.0–0.3 for factual tasks; 0.7–1.0 for creative tasks.
     • Max tokens: Cap the output so that prompt + output stays under the model’s limit (e.g., with a 1,000-token prompt in a 4K window, set max tokens to about 3,000 or less) to avoid truncation.

  4. Validate outputs:
     • Check for hallucinations by cross-referencing with source material. If you ask the model to cite sources, verify that the citations exist and say what the model claims.
     • Use repetition detection (e.g., "Does this answer repeat earlier points?") to catch loops.

  5. Optimize for cost/speed:
     • Use smaller models (e.g., GPT-3.5) for simple tasks to reduce cost and latency.
     • Cache responses to frequent prompts (e.g., "Generate a meeting summary template") to avoid reprocessing.

  6. Monitor and iterate:
     • Track token usage in logs to identify inefficiencies (e.g., overly verbose prompts).
     • A/B test prompts (e.g., "Summarize this" vs. "Extract 3 key takeaways") to find the most token-efficient version.
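
The prompt-design and parameter steps above can be sketched in a few helper functions. This is a rough illustration, not a production recipe: the function names, the 200-token safety margin, and the word-based chunking are all assumptions, and a real tokenizer would give more accurate counts than the 0.75-words-per-token rule of thumb.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb for English text

def split_into_chunks(text, max_tokens_per_chunk):
    """Split text on whitespace into chunks of roughly max_tokens_per_chunk."""
    max_words = max(1, int(max_tokens_per_chunk * WORDS_PER_TOKEN))
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

def output_budget(window, prompt_tokens, safety_margin=200):
    """Tokens left for the response after the prompt and a safety margin."""
    return max(0, window - prompt_tokens - safety_margin)

def build_prompt(data, instructions):
    """Use ### separators and place the critical instructions last."""
    return f"### Data ###\n{data}\n\n### Instructions ###\n{instructions}"

report = "word " * 10_000                  # stand-in for a 10,000-word report
chunks = split_into_chunks(report, 2_000)  # about 1,500 words per chunk
prompt = build_prompt(chunks[0], "Summarize in 3 bullet points.")
print(len(chunks), output_budget(window=4_096, prompt_tokens=2_050))
```

Splitting on whitespace can cut a sentence in half, so for real documents it is usually better to split on headings or paragraphs and use this kind of budget only as an upper bound per chunk.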

Common Mistakes

  • Mistake: Ignoring the context window and pasting an entire 10K-word report into a 4K-token model. Correction: Split the report into sections (e.g., by headings) and process each separately. Why: The model will truncate or forget early content.

  • Mistake: Assuming all models have the same token limits (e.g., treating Claude’s 200K window like GPT-4’s 8K). Correction: Check the model’s documentation for its specific limits. Why: Overloading a model causes errors or silent failures.

  • Mistake: Setting temperature=1.0 for factual tasks (e.g., financial summaries). Correction: Use temperature=0.0–0.3 for tasks requiring consistency. Why: Higher temperatures increase randomness, which can raise the risk of hallucinations.

  • Mistake: Writing prompts with redundant instructions (e.g., "Be concise. Write a short summary. Keep it brief."). Correction: Consolidate instructions (e.g., "Summarize in 3 bullet points, <50 words each"). Why: Redundancy wastes tokens and dilutes focus.

  • Mistake: Not accounting for output tokens, leading to truncated responses. Correction: Estimate output length (e.g., 200 tokens for a short email) and set max_tokens accordingly. Why: The model stops mid-sentence if it hits the limit.
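
Several of these mistakes can be caught before a request is sent. The sketch below is a hypothetical pre-flight check: the model names and window sizes in MODEL_WINDOWS are illustrative figures taken from this guide, not authoritative limits, and the 0.3 temperature threshold simply mirrors the guidance above.

```python
# Illustrative limits only; always confirm against the model's documentation.
MODEL_WINDOWS = {"gpt-4-8k": 8_000, "claude-3": 200_000}

def preflight(model, prompt_tokens, max_output_tokens, temperature, factual):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    window = MODEL_WINDOWS.get(model)
    if window is None:
        problems.append(f"unknown model {model!r}: check its documentation")
    elif prompt_tokens + max_output_tokens > window:
        over = prompt_tokens + max_output_tokens - window
        problems.append(f"request exceeds the {window}-token window by {over}")
    if factual and temperature > 0.3:
        problems.append(f"temperature {temperature} is high for a factual task")
    return problems

# A 10,000-word report is roughly 13,333 tokens by the 0.75 rule of thumb.
for issue in preflight("gpt-4-8k", 13_333, 500, temperature=1.0, factual=True):
    print("WARNING:", issue)
```

Returning a list instead of raising on the first problem lets a workflow log every issue at once, which fits the "monitor and iterate" habit better than fixing one failure per run.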


Practical Tips

  • Use "token math": For English, estimate 1 token ≈ 4 characters ≈ 0.75 words. A 1,000-word document ≈ 1,333 tokens.
  • Leverage "few-shot" examples: Include 2–3 examples in your prompt (e.g., "Here’s how to format a product description: [example]") to guide the model without wasting tokens on lengthy instructions.
  • Watch for "lost in the middle": Models pay less attention to tokens in the middle of long prompts. Put key instructions at the start or end.
  • Batch similar tasks: Group related requests (e.g., "Summarize these 5 articles") into one prompt to minimize token overhead from repeated instructions.
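
The few-shot and batching tips combine naturally: state the instruction once, show a couple of examples, then list every item to process. The helper below is a minimal sketch; the function name, the "Input/Output/Item" labels, and the sample product text are assumptions made for illustration.

```python
def build_batch_prompt(instruction, examples, items):
    """One instruction, a few worked examples, then all items to process."""
    parts = [instruction.strip(), ""]
    for source, target in examples:
        parts += [f"Input: {source}", f"Output: {target}", ""]
    for n, item in enumerate(items, start=1):
        parts.append(f"Item {n}: {item}")
    return "\n".join(parts)

prompt = build_batch_prompt(
    "Write a one-line product description for each item, matching the examples.",
    examples=[
        ("steel water bottle, 750 ml", "Keeps drinks cold all day in a 750 ml steel body."),
        ("canvas tote bag", "A sturdy canvas tote for daily errands."),
    ],
    items=["bamboo cutting board", "wool beanie", "LED desk lamp"],
)
print(prompt)
```

Because the instruction and examples appear once instead of once per request, the overhead is paid a single time for all three items.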

Quick Practice Scenario

Scenario: You’re using an LLM to draft a press release about a new product. Your prompt includes:
1. A 500-word product description.
2. Brand guidelines (200 words).
3. Instructions: "Write a 300-word press release in our brand voice. Use the product name at least twice."

The model’s context window is 4K tokens. After generating the output, you notice the press release ignores the brand voice.

Question: What’s the likely issue, and how would you fix it?

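Answer: First check the token math. The prompt is about 720 words (roughly 950 tokens) and the output about 400 tokens, so the request fits comfortably within 4K tokens and truncation is unlikely here. The more probable cause is that the brand guidelines sit in the middle of the prompt, where they get less attention, or are too vague to act on. Fix: Move the brand guidelines next to the instructions at the end of the prompt, separate the sections clearly (e.g., with ###), and add a short example of the brand voice. If the inputs were much longer and did exceed the window, shortening the product description would be the first step.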
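
A quick check of the scenario's numbers with the guide's rule of thumb is worth running before blaming the context window. The sketch below assumes 0.75 words per token and treats "4K" as 4,000 tokens (some models use 4,096); a real tokenizer would give slightly different counts.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb for English
CONTEXT_WINDOW = 4_000  # "4K" taken as 4,000 (assumption)

def words_to_tokens(word_count):
    """Convert a word count to an approximate token count."""
    return round(word_count / WORDS_PER_TOKEN)

instructions = ("Write a 300-word press release in our brand voice. "
                "Use the product name at least twice.")

budget = {
    "product description": words_to_tokens(500),
    "brand guidelines": words_to_tokens(200),
    "instructions": words_to_tokens(len(instructions.split())),
    "expected output": words_to_tokens(300),
}
total = sum(budget.values())
headroom = CONTEXT_WINDOW - total

for part, tokens in budget.items():
    print(f"{part:>20}: {tokens}")
print(f"{'total':>20}: {total} of {CONTEXT_WINDOW} (headroom {headroom})")
```

Under these assumptions the request uses roughly a third of the window, which points toward instruction placement, not truncation, as the first thing to check.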


Last-Minute Cram Sheet

  1. 1 token ≈ 4 chars ≈ 0.75 words in English. Punctuation/symbols count as tokens.
  2. Context window = max tokens the model "remembers" (input + output). Exceeding it truncates text.
  3. Attention drops in the middle of long prompts—put key instructions at the start or end.
  4. Temperature: 0.0–0.3 for facts; 0.7–1.0 for creativity. High temp = more hallucinations.
  5. Token limits vary by model: GPT-4 = 8K–32K; Claude 3 = 200K.
  6. Estimate output tokens: 1 paragraph ≈ 50–100 tokens. Set max_tokens to avoid truncation.
  7. Hallucinations = false but confident outputs. Mitigate with sources or human review.
  8. Token efficiency: Use abbreviations, bullet points, and concise instructions.
  9. Few-shot > zero-shot: Include 2–3 examples to guide the model without verbose instructions.
  10. Cache frequent prompts to save tokens and reduce costs. Don’t reprocess the same instructions repeatedly.
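
As a rough illustration of the caching point, the sketch below memoizes responses by prompt text using Python's standard library. The call_model function is a hypothetical stand-in for a real LLM API call, and caching like this only makes sense when the same prompt should always return the same answer (e.g., templates generated at low temperature).

```python
from functools import lru_cache

calls = {"count": 0}  # tracks how often the (stand-in) model is really called

def call_model(prompt):
    """Hypothetical stand-in for a real LLM API call."""
    calls["count"] += 1
    return f"[model output for: {prompt}]"

@lru_cache(maxsize=256)
def cached_completion(prompt):
    """Return a stored response when the exact same prompt was seen before."""
    return call_model(prompt)

first = cached_completion("Generate a meeting summary template")
second = cached_completion("Generate a meeting summary template")
print(f"model calls: {calls['count']}, identical: {first == second}")
```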