AI Engineering by Chip Huyen

Applied ML
Author

Ritesh Kumar Maurya

Published

January 13, 2026

Chapter 1 Introduction to Building AI Applications with Foundation Models

Language Models (LMs)

Masked Language Models (MLM)

  • Predict missing tokens in a sentence
    Example: “My favorite __ is blue” → the model predicts the masked word (e.g., “color”)
  • Use bidirectional context (both previous and next tokens)
  • Common use cases:
    • Sentiment analysis
    • Text classification
    • Code debugging (requires full contextual understanding)
  • Example: BERT (Bidirectional Encoder Representations from Transformers)
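As a quick illustration, masked prediction can be tried directly with a pretrained BERT. The sketch below assumes the Hugging Face transformers library is installed; the model name and output fields follow that library’s fill-mask pipeline:

```python
# Minimal sketch of masked-token prediction with a pretrained BERT,
# assuming the Hugging Face `transformers` library is available.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("My favorite [MASK] is blue."):
    # each prediction carries a candidate token and its probability score
    print(prediction["token_str"], round(prediction["score"], 3))
```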

Generative Language Models

  • A language model can generate an effectively unlimited number of possible outputs from a finite vocabulary
  • Models capable of producing open-ended outputs are called generative models
  • This forms the basis of Generative AI

Learning Paradigms

Self-Supervised Learning

  • Labels are automatically derived from the input data itself
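For intuition, the toy sketch below (plain Python, word-level “tokens” purely for illustration) shows how next-token labels fall directly out of raw text, with no human annotation:

```python
text = "My favorite color is blue"
tokens = text.split()  # word-level tokens, for illustration only

# next-token prediction: each prefix is an input, the token that follows is its label
examples = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
for context, label in examples:
    print(context, "->", label)
```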

Unsupervised Learning

  • No explicit labels are required

When to Outsource Model Building

Build In-House When

  • The model is critical to business differentiation
  • There is a risk of exposing intellectual property to competitors

Outsource When

  • It improves productivity and profitability
  • It provides better performance and multiple vendor options

Role of AI in Products

Critical vs Complementary

  • Complementary:
    The application can function without AI (e.g., Gmail)
  • Critical:
    The application cannot function without AI (e.g., face recognition)
    Requires high robustness and reliability

Reactive vs Proactive

  • Reactive:
    Responds to user inputs (e.g., chat-based responses)
  • Proactive:
    Takes initiative (e.g., traffic alerts in navigation apps)

Static vs Dynamic

  • Static:
    Features are updated only during application upgrades
  • Dynamic:
    Features evolve continuously based on user feedback

Automation in Products (Microsoft Framework)

  • Crawl:
    Human involvement is required
  • Walk:
    AI assists and interacts with internal employees
  • Run:
    AI directly interacts with end users
Note: Key Insight

Achieving 0–60% automation is relatively easy, but moving from 60–100% automation is extremely challenging.


Three Layers of the AI Stack

Application Development Layer

  • AI Interface
  • Prompt Engineering
  • Context Construction
  • Evaluation

Model Development Layer

  • Inference Optimization
  • Dataset Engineering
  • Modeling and Training
  • Evaluation

Infrastructure Layer

  • Compute Management
  • Data Management
  • Model Serving
  • Monitoring

AI Engineering vs ML Engineering

AI Engineering

  • Uses pretrained models
  • Works with large-scale models
  • Requires higher compute resources
  • Workflow:
    Product → Data → Model

ML Engineering

  • Trains models from scratch
  • Requires fewer resources
  • Workflow:
    Data → Model → Product

Adapting Pretrained Models

  • Prompt Engineering
  • Fine-tuning
    • Training a pretrained model on a new task not seen during pretraining

Training Stages

Pretraining

  • Resource-intensive (large data and compute)
  • Model weights initialized randomly
  • Trained for general text completion

Fine-tuning

  • Requires significantly less data and compute
  • Adapts the model to task-specific objectives

Post-training

  • Often used interchangeably with fine-tuning
  • Model developers:
    Perform post-training before releasing the model (e.g., instruction-following)
  • Application developers:
    Fine-tune released models for specific downstream tasks

Chapter 2 Understanding Foundation Models

Model Performance Fundamentals

  • Model performance depends on both:
    • Training process
    • Sampling (decoding) strategy
  • An AI model is only as good as the data it is trained on
  • The “use what we have, not what we want” dataset mindset often leads to:
    • Strong performance on training data
    • Weak performance on real-world tasks
  • Small, high-quality datasets can outperform large, low-quality datasets

RNNs vs Transformers

  • Recurrent Neural Networks (RNNs)
    • Compress everything seen so far into a fixed-size hidden state, like generating text from a summary of a book
    • Struggle with long-range dependencies
  • Transformers
    • Attend directly to many tokens at once
    • Comparable to generating text with the relevant pages of the book open in front of you
    • Enable long-context modeling via attention

Transformer Architecture

Transformer Block

Each transformer block consists of:

  • Attention module
  • MLP (feedforward network)

Transformer-Based Language Model

A typical transformer-based LM includes:

  • Embedding Module (pre-transformer)
    • Converts tokens into embedding vectors
  • Transformer Blocks
  • Model Head (post-transformer)
    • Converts hidden states into token probabilities
Figure 2.1: Visualization of the weight composition of a transformer model.
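A highly simplified PyTorch sketch of this layout (embedding module, a stack of transformer blocks, and a model head); the dimensions are arbitrary, and positional encodings and causal masking are omitted for brevity:

```python
import torch.nn as nn

class TinyTransformerLM(nn.Module):
    def __init__(self, vocab_size=32_000, d_model=512, n_heads=8, n_layers=4):
        super().__init__()
        # embedding module (pre-transformer): token ids -> embedding vectors
        self.embed = nn.Embedding(vocab_size, d_model)
        # transformer blocks: attention + MLP in each layer
        block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=n_layers)
        # model head (post-transformer): hidden states -> logits over the vocabulary
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids):
        hidden = self.blocks(self.embed(token_ids))
        return self.head(hidden)  # softmax over these logits gives token probabilities
```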

Alternatives to Transformers

  • RWKV
    • RNN-based architecture
    • No fixed context-length limitation
    • Performance on extremely long contexts is not guaranteed
  • State Space Models (SSMs)
    • Designed for long-range memory
    • Promising alternative to attention
    • Examples:
      • Mamba
      • Jamba (Hybrid of Transformer and Mamba)

Training Tokens and Scaling Laws

  • \text{Total Training Tokens} = \text{Epochs} \times \text{Tokens per epoch}

  • Chinchilla Scaling Law

    • Optimal training tokens ≈ 20× model parameters (see the worked example below)
  • Research from Microsoft and OpenAI suggests:

    • Hyperparameters can be transferred from a 40M model to a 6.7B model
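As a rough worked example of the 20× rule (the model size here is chosen purely for illustration):

\text{Optimal training tokens} \approx 20 \times 7\text{B parameters} = 140\text{B tokens}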

Pre-training and Post-training

  • Pre-training
    • Equivalent to reading to acquire knowledge
    • Resource-intensive
    • Produces general-purpose representations
  • Post-training
    • Equivalent to learning how to use knowledge
    • Includes instruction tuning, alignment, and RLHF
Figure 2.2: Training workflow of a large language model.
  • In principle, a model could be trained only on high-quality curated data without large-scale pre-training, but pre-training followed by post-training yields far better results

Reinforcement Learning from Human Feedback (RLHF)

RLHF is a two-stage process for aligning language models with human preferences:

  1. Train a reward model that scores the foundation model’s outputs
  2. Optimize the foundation model to generate responses that maximize the reward model’s score

Preference Data Structure

Instead of using scalar scores (which vary across individuals), RLHF uses comparative preferences:

  • Each training example consists of: (prompt, winning_response, losing_response)
  • This pairwise comparison approach captures relative quality more reliably than absolute ratings

Reward Model Training

Objective

Maximize the score difference between winning and losing responses for each prompt.

Mathematical Formulation

Notation:

  • r: reward model
  • x: prompt
  • y_w: winning response
  • y_l: losing response
  • s_w = r(x, y_w): reward score for winning response
  • s_l = r(x, y_l): reward score for losing response
  • \sigma: sigmoid function

Loss Function:

\mathcal{L} = -\log(\sigma(s_w - s_l))

where \sigma(z) = \frac{1}{1 + e^{-z}}

Goal: Minimize this loss function across all preference pairs
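A minimal PyTorch sketch of this pairwise loss; the reward model itself is abstracted away, and the scores below are made-up batch values:

```python
import torch
import torch.nn.functional as F

def preference_loss(s_w: torch.Tensor, s_l: torch.Tensor) -> torch.Tensor:
    # -log(sigmoid(s_w - s_l)) rewritten as softplus(s_l - s_w) for numerical stability
    return F.softplus(s_l - s_w).mean()

s_w = torch.tensor([2.1, 0.3, 1.5])   # reward scores for winning responses
s_l = torch.tensor([1.0, 0.8, -0.2])  # reward scores for losing responses
print(preference_loss(s_w, s_l))      # small when s_w > s_l, large otherwise
```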

Model Architecture Choices

The reward model can be:

Note: Reward Model Options
  • Trained from scratch on preference data
  • Fine-tuned from a foundation model (often preferred, as the reward model should ideally match the capability level of the model being optimized)
  • A smaller model (can also work effectively in practice, offering computational efficiency)

Key Insights

Tip: Important Takeaways
  • Pairwise preferences are more consistent and reliable than absolute ratings
  • The sigmoid in the loss function ensures the model learns relative differences in quality
  • The reward model’s capacity should generally align with the foundation model it’s evaluating

Sampling Strategies and Model Hallucination

Temperature-Based Sampling

Core Formula

P(token) = \text{softmax}\left(\frac{\text{logits}}{\text{temperature}}\right)

Temperature Effects

  • Higher temperature reduces the probability of common tokens, thereby increasing the probability of rare tokens
  • This enables more creative and diverse responses
  • A common default value of 0.7 balances creativity with coherent generation (see the sketch below)
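A minimal numpy sketch of temperature scaling; the logits below are made up for illustration:

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    z = logits / temperature
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([4.0, 2.0, 1.0, 0.5])
print(softmax_with_temperature(logits, 0.5))  # sharper: common tokens dominate
print(softmax_with_temperature(logits, 1.5))  # flatter: rare tokens gain probability
```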
Tip: Monitoring Model Learning

To check whether a model is learning, examine its output probability distributions. If they stay close to uniform as training progresses, the model is not learning effectively.

Numerical Stability

Because a sequence’s probability is the product of many small per-token probabilities, it can underflow to zero. In practice we therefore work with log probabilities, summing logs instead of multiplying probabilities.
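A tiny illustration of why: multiplying many small per-token probabilities underflows to zero, while summing their logs stays perfectly representable (the numbers are arbitrary):

```python
import math

per_token_probs = [1e-5] * 200                        # 200 tokens, each quite unlikely
product = math.prod(per_token_probs)                  # underflows to 0.0
log_prob = sum(math.log(p) for p in per_token_probs)  # ≈ -2302.6, still usable
print(product, log_prob)
```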

Advanced Sampling Strategies

Top-k Sampling

  • Select the k largest logits
  • Apply softmax only to these k tokens to compute probabilities
  • Benefit: Reduces computational cost and filters out unlikely tokens

Top-p (Nucleus) Sampling

  • Find the smallest set of tokens whose cumulative probabilities sum to p
  • Process: Sort tokens by probability (descending order) and keep adding until cumulative probability reaches p
  • More adaptive than top-k as the number of considered tokens varies

Min-p Sampling

  • Set a minimum probability threshold for tokens to be considered, typically scaled by the probability of the most likely token
  • Any token whose probability falls below this threshold is excluded from sampling
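A minimal numpy sketch of all three filters (top-k, top-p, and min-p). The logits are illustrative, and the min-p threshold is scaled by the top token’s probability, which is one common formulation:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def sample_top_k(logits, k, rng):
    # keep only the k largest logits, renormalize, then sample
    keep = np.argsort(logits)[-k:]
    return rng.choice(keep, p=softmax(logits[keep]))

def sample_top_p(logits, p, rng):
    # smallest set of tokens whose cumulative probability reaches p
    probs = softmax(logits)
    order = np.argsort(probs)[::-1]
    n_keep = int(np.searchsorted(np.cumsum(probs[order]), p)) + 1
    keep = order[:n_keep]
    return rng.choice(keep, p=probs[keep] / probs[keep].sum())

def sample_min_p(logits, p_min, rng):
    # drop tokens whose probability falls below p_min * (top token's probability)
    probs = softmax(logits)
    keep = np.where(probs >= p_min * probs.max())[0]
    return rng.choice(keep, p=probs[keep] / probs[keep].sum())

rng = np.random.default_rng(0)
logits = np.array([3.0, 2.5, 1.0, 0.2, -1.0])
print(sample_top_k(logits, 2, rng), sample_top_p(logits, 0.9, rng), sample_min_p(logits, 0.1, rng))
```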

Best-of-N Sampling

  • Generate multiple responses for the same prompt
  • Select the response with the highest average (log) probability per token
  • Improves output quality at the cost of increased compute
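A sketch of the selection step, assuming a hypothetical generate() that returns a candidate’s text together with its per-token log probabilities (many inference APIs can return these):

```python
def best_of_n(prompt, n, generate):
    # `generate` is a hypothetical callable: prompt -> (text, per_token_logprobs)
    candidates = [generate(prompt) for _ in range(n)]
    # score each candidate by its average per-token log probability
    best_text, _ = max(candidates, key=lambda c: sum(c[1]) / len(c[1]))
    return best_text
```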

Handling Structured Outputs

1. Prompting Techniques

  • Better prompt engineering with clear instructions
  • Two-query approach:
    • First query generates the response
    • Second query validates the response format

2. Post-Processing

  • Identify repeated common mistakes in model outputs
  • Write scripts to correct these systematic errors
  • Works well when the model produces mostly correct formats with predictable issues

3. Constrained Sampling

  • Sample only from a selected set of valid tokens
  • Example: In JSON generation, prevent invalid syntax like {{ without an intervening key
  • Enforces grammatical/syntactic correctness at the token level
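A minimal sketch of the core mechanic: given the set of token ids the grammar allows next (the allowed list here is made up), mask everything else before sampling:

```python
import numpy as np

def mask_invalid_tokens(logits, valid_token_ids):
    # set disallowed tokens to -inf so softmax assigns them zero probability
    masked = np.full_like(logits, -np.inf)
    masked[valid_token_ids] = logits[valid_token_ids]
    return masked

logits = np.array([1.2, 0.4, 3.1, -0.5])
allowed = [0, 2]  # e.g., only the tokens that keep the JSON valid at this step
print(mask_invalid_tokens(logits, allowed))
```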

4. Test-Time Compute

  • Generate multiple candidate responses
  • Use a selection mechanism to output the best one
  • Trade compute for quality

5. Fine-Tuning

Important: Best Solution

Fine-tuning is the most effective approach for structured outputs:

  • Full model fine-tuning (better if resources available)
  • Partial fine-tuning (LoRA, adapters, etc.)
  • Directly teaches the model the desired output format

Model Hallucination

Definition

Hallucination occurs when a model generates responses that are not grounded in factual information.

Consistency Issues

Due to the probabilistic nature of language models:

  • Same prompt can produce multiple different outputs
  • This inconsistency can be problematic for production systems

Mitigation Strategies

  1. Caching: Store responses for repeated queries to ensure consistency
  2. Sampling adjustments: Modify temperature, top-k, or top-p parameters
  3. Deterministic seeds: Use fixed random seeds
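A minimal sketch of strategy 1, caching by exact prompt; call_model is a hypothetical stand-in for whatever inference call the application actually makes:

```python
from functools import lru_cache

def call_model(prompt: str) -> str:
    # hypothetical placeholder for the real inference call
    return f"response to: {prompt}"

@lru_cache(maxsize=10_000)
def cached_answer(prompt: str) -> str:
    # identical prompts always return the same stored response
    return call_model(prompt)
```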
Warning: No Perfect Solution

Even with these techniques, 100% consistency is not guaranteed. Hardware differences in floating-point computation can lead to variations across different systems.

Why Models Hallucinate

Hypothesis 1: Self-Supervision is the Culprit

Problem: Models cannot differentiate between:

  • Given data (the prompt)
  • Generated data (their own outputs)

Example Scenario

  1. Prompt: “Who is Chip Huyen?”
  2. Model generates: “Chip Huyen is an architect”
  3. Next token generation: The model treats “Chip Huyen is an architect” as a fact, just like the original prompt
  4. Consequence: If the initial generation is incorrect, the model will continue to justify and build upon the incorrect information

Mitigation Techniques

Note: Approach 1: Reinforcement Learning

From an RL perspective, train the model to differentiate between:

  • Observations about the world (given text/prompt)
  • Model’s actions (generated text)

This helps the model maintain awareness of what is given vs. what it produces.

Note: Approach 2: Supervised Fine-Tuning with Counterfactuals

Include both factual and counterfactual examples in training data, explicitly teaching the model to recognize and avoid false information.

Hypothesis 2: Supervision is the Culprit

Problem: A conflict between the model’s internal knowledge and the labelers’ knowledge during SFT (Supervised Fine-Tuning).

The Issue

  • During SFT, models are trained on responses written by human labelers
  • Labelers use their own knowledge to write responses
  • The model may not have access to the same knowledge base
  • Result: The model learns to produce responses it cannot properly ground, leading to hallucination

Better Approach

Tip: Solution: Include Reasoning Chains

When creating training data:

  1. Document the information sources used to arrive at the response
  2. Include reasoning steps that show how the conclusion was reached
  3. This allows the model to understand not just what to say, but why and based on what information