DSPy: Revolutionizing Prompting with Programmatic Pipelines
Introduction
Working with large language models often necessitates crafting numerous prompts. However, as applications expand, managing these prompts manually becomes cumbersome, inefficient, and prone to inconsistencies. Traditional "prompt templates" are typically crafted by trial and error, which can be time-consuming and hard to scale.
This is where DSPy comes into play. DSPy reimagines prompts as programs that can be optimized, allowing users to define structured pipelines for tasks such as question answering, summarization, or information retrieval. These pipelines are then automatically refined for better interaction with language models.
DSPy simplifies the process by transforming language model (LM) pipelines into manageable text transformation graphs. These graphs are composed of modules that can learn and improve their prompting, fine-tuning, and reasoning capabilities.
Imagine building a customer support AI assistant for an e-commerce platform:
Without DSPy
- You need to create multiple prompts for classification, retrieval, and response generation.
- Manual adjustments are necessary when results are subpar.
- Scaling becomes difficult with increasing use cases.
With DSPy
- You define a pipeline: Understand query → Retrieve info → Generate response.
- DSPy optimizes interactions at each step with the language model.
- The system improves over time using data and feedback.
Instead of repeatedly rewriting prompts, DSPy enables the construction of a self-improving AI system. Its optimizers can raise pipeline performance enough that smaller, open models compete with expert-written prompts driving larger proprietary models such as GPT-3.5.
Key Takeaways
- DSPy replaces manual prompts with systematic optimization.
- DSPy facilitates the creation of reliable and reusable AI pipelines, enhancing performance without constant prompt rewriting.
- Developers often encounter issues like prompts failing in production, maintaining consistency across tasks, and difficulty measuring or improving output quality.
- DSPy addresses these by introducing declarative programming for LLMs, where desired outcomes are defined instead of specific prompt instructions. DSPy optimizers refine prompts and reasoning strategies.
- Modular components in DSPy make systems easier to debug and scale, particularly useful for teams developing production-grade AI systems.

Prerequisites
Before exploring DSPy, ensure the following:
- Basic programming knowledge, particularly in Python.
- Understanding of large language models (LLMs) and their prompting mechanisms.
- Access to a Python development environment like Jupyter Notebook or VS Code with the necessary libraries installed.
- Basic knowledge of the PyTorch framework, as DSPy is inspired by it.
What is DSPy?
DSPy is a framework that simplifies the optimization of language model prompts and weights, especially when using LMs extensively. Without DSPy, building complex systems with LMs involves numerous manual steps, which can become unwieldy.
DSPy separates the program’s flow from the parameters (prompts and weights) and introduces optimizers that adjust these parameters for desired outcomes. This makes powerful models more reliable and effective. Instead of manually adjusting prompts, DSPy uses algorithms to update parameters, allowing for seamless program recompilation when code, data, or metrics change.
Much like using frameworks such as PyTorch for neural networks, DSPy provides modules and optimizers that automate and enhance working with LMs, shifting focus from manual tweaking to systematic improvement and higher performance.
What does DSPy stand for?
The name DSPy stands for "Declarative Self-improving Language Programs," reflecting its capability to streamline the complex process of optimizing language model prompts and weights, particularly for multi-step pipelines.
Why do we need DSPy?
Prompt templates are predefined instructions crafted through trial and error. They may work for specific tasks but often fail in different contexts due to their lack of adaptability. Manual crafting and fine-tuning of prompt templates are time-consuming and labor-intensive, becoming inefficient as task complexity increases.
Hardcoded prompt templates often lead to issues such as lack of context and relevance, inconsistent output, poor quality responses, and inaccuracy. These challenges arise from limited flexibility and scalability, as templates may not generalize effectively across different models, data domains, or input variations.
Install DSPy
Installing DSPy is straightforward. Use pip to install it:
pip install -U dspy
For those using the OpenAI model, authenticate by setting the OPENAI_API_KEY environment variable or passing the api_key.
import dspy
lm = dspy.LM("openai/gpt-5-mini", api_key="YOUR_OPENAI_API_KEY")
dspy.configure(lm=lm)
Other providers, such as Anthropic and Google Gemini, can be used as well.
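For example, other providers can be configured the same way. The model identifier strings below follow the `provider/model` convention DSPy inherits from LiteLLM; check your provider's documentation for the current model names.

```python
import dspy

# Anthropic (model name is illustrative; verify against Anthropic's docs)
lm = dspy.LM("anthropic/claude-3-5-sonnet-20240620", api_key="YOUR_ANTHROPIC_API_KEY")

# Google Gemini would use a "gemini/..." identifier instead:
# lm = dspy.LM("gemini/gemini-1.5-flash", api_key="YOUR_GEMINI_API_KEY")

dspy.configure(lm=lm)
```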
Major Components in DSPy
Before delving deeper, understand these key components of DSPy:
- Signatures
- Modules
- Teleprompters or Optimizers
A DSPy signature is a function declaration that specifies the required text transformation without detailing how a specific language model should be prompted. It comprises input and output fields with optional instructions.
Question Answering
The first step with DSPy involves configuring your language model.
import dspy
lm = dspy.LM('openai/gpt-4o-mini')
dspy.settings.configure(lm=lm)
predict = dspy.Predict("question -> answer")
prediction = predict(question="who is the president of France?")
prediction.answer
Defining the signature is straightforward:
class QA(dspy.Signature):
    question = dspy.InputField()
    answer = dspy.OutputField()

predict = dspy.Predict(QA)
prediction = predict(question="......")
print(prediction.answer)
A DSPy module is essential for creating programs utilizing language models. Each module encapsulates a specific prompting technique and is versatile enough to work with any DSPy Signature. Modules can be combined to form larger, more complex programs.
For example:
sentence = "it's a charming and often affecting journey."
classify = dspy.Predict('sentence -> sentiment')
response = classify(sentence=sentence)
print(response.sentiment)
Output:
Positive
import dspy
lm = dspy.OpenAI(model="gpt-4o-mini")
dspy.settings.configure(lm=lm)
class ClassifySentiment(dspy.Signature):
    text = dspy.InputField()
    sentiment = dspy.OutputField(desc="positive, negative, or neutral")

class SentimentModule(dspy.Module):
    def __init__(self):
        super().__init__()
        self.classify = dspy.Predict(ClassifySentiment)

    def forward(self, text):
        result = self.classify(text=text)
        return result.sentiment

classifier = SentimentModule()
output = classifier("I love using DSPy, it's so efficient!")
print(output)
Here:
- Signature → Defines inputs and outputs
- Predict() → Manages the LLM interaction
- Module → Wraps everything into a reusable pipeline
Other DSPy modules include:
- dspy.ChainOfThought
- dspy.ReAct
- dspy.MultiChainComparison
- dspy.ProgramOfThought
A DSPy teleprompter (the original name for an optimizer) is used for optimization. Teleprompters are flexible strategies guiding how modules learn from data.
A DSPy optimizer fine-tunes the parameters of a DSPy program to maximize specified metrics. DSPy offers various built-in optimizers. Typically, a DSPy optimizer needs a DSPy program, a metric function to evaluate output, and a few training inputs.
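A metric is simply a Python function that scores a prediction against an example. The sketch below shows a minimal exact-match metric of the kind an optimizer consumes; the `SimpleNamespace` objects stand in for DSPy examples and predictions, and the commented-out compile call illustrates how such a metric would be handed to an optimizer.

```python
from types import SimpleNamespace

def exact_match(example, pred, trace=None):
    """Return True when the predicted answer matches the gold answer."""
    return example.answer.strip().lower() == pred.answer.strip().lower()

# With a program, a metric like this, and a small trainset, an optimizer runs as:
#   optimizer = dspy.teleprompt.BootstrapFewShot(metric=exact_match)
#   compiled_program = optimizer.compile(program, trainset=trainset)

example = SimpleNamespace(answer="Paris")
pred = SimpleNamespace(answer=" paris ")
print(exact_match(example, pred))  # True
```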
How do the Optimizers Enhance Performance?
Traditional deep neural networks are optimized using gradient descent with a loss function and training data. DSPy programs, however, comprise multiple language model calls integrated as DSPy modules. Each module has internal parameters: LM weights, instructions, and demonstrations of input/output behavior.
DSPy optimizes all three using multi-stage algorithms, combining gradient descent for LM weights and LM-driven optimization for refining instructions and demonstrations. This approach often produces better prompts than human writing by systematically exploring more options.
A few DSPy optimizers are:
- LabeledFewShot
- BootstrapFewShot
- BootstrapFewShotWithRandomSearch
- BootstrapFewShotWithOptuna
- KNNFewShot
Refer to DSPy documentation for more information on optimizers.
Getting started with DSPy
Start by installing the necessary packages:
!pip install dspy-ai
Import the necessary packages:
import sys
import os
import dspy
from dspy.datasets import HotPotQA
from dspy.teleprompt import BootstrapFewShot
from dspy.evaluate.evaluate import Evaluate
from dsp.utils import deduplicate
Getting started and loading the data
turbo = dspy.OpenAI(model='gpt-3.5-turbo')
colbertv2_wiki17_abstracts = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')
dspy.settings.configure(lm=turbo, rm=colbertv2_wiki17_abstracts)
dataset = HotPotQA(train_seed=1, train_size=20, eval_seed=2023, dev_size=50, test_size=0)
HotpotQA is a question-answering dataset sourced from English Wikipedia, comprising around 113,000 crowd-sourced questions. We will create a question-answering system using 20 data points for training and 50 for the development set.
trainset = [x.with_inputs('question') for x in dataset.train]
devset = [x.with_inputs('question') for x in dataset.dev]
len(trainset), len(devset)
(20, 50)
Let's examine some examples:
train_example = trainset[0]
print(f"Question: {train_example.question}")
print(f"Answer: {train_example.answer}")
Question: At My Window was released by which American singer-songwriter?
Answer: John Townes Van Zandt
dev_example = devset[18]
print(f"Question: {dev_example.question}")
print(f"Answer: {dev_example.answer}")
print(f"Relevant Wikipedia Titles: {dev_example.gold_titles}")
Question: What is the nationality of the chef and restaurateur featured in Restaurant: Impossible?
Answer: English
Relevant Wikipedia Titles: {'Robert Irvine', 'Restaurant: Impossible'}
Creating a chatbot
Define a signature class called BasicQA for questions requiring short, factoid answers. Each question will have one answer, limited to one to five words.
class BasicQA(dspy.Signature):
    question = dspy.InputField()
    answer = dspy.OutputField(desc="often between 1 and 5 words")
Generate the response using dspy.Predict: pass the BasicQA class, then call the resulting generate_answer predictor with an example question.
generate_answer = dspy.Predict(BasicQA)
pred = generate_answer(question=dev_example.question)
print(f"Question: {dev_example.question}")
print(f"Predicted Answer: {pred.answer}")
Question: What is the nationality of the chef and restaurateur featured in Restaurant: Impossible?
Predicted Answer: American
The answer is incorrect, necessitating further refinement. Inspect how this output was generated:
turbo.inspect_history(n=1)
The chef in question, Robert Irvine, is English, yet the model guessed "American." Introducing chain of thought can help.
Chain of Thought
For complex questions, a simple prompt may yield incorrect answers. Using the chain of thought prompts the model to think step-by-step for accurate answers.
Creating a chatbot using Chain of Thought
The chain of thought includes a series of intermediate reasoning steps, significantly improving large language models’ ability to perform complex reasoning.
generate_answer_with_chain_of_thought = dspy.ChainOfThought(BasicQA)
pred = generate_answer_with_chain_of_thought(question=dev_example.question)
print(f"Question: {dev_example.question}")
print(f"Thought: {pred.rationale.split('.', 1)[1].strip()}")
print(f"Predicted Answer: {pred.answer}")
Question: What is the nationality of the chef and restaurateur featured in Restaurant: Impossible?
Thought: We know that the chef and restaurateur featured in Restaurant: Impossible is Robert Irvine.
Predicted Answer: British
The generated answer shows reasoning before arriving at a conclusion. Run the code below to check the reasoning and response generation.
turbo.inspect_history(n=1)
Creating a RAG Application
Build a retrieval-augmented pipeline for answer generation: define a signature (a class called GenerateAnswer), wrap it in a module, set up an optimizer, and run the compiled RAG program.
RAG Signature
Define the signature: context, question --> answer.
class GenerateAnswer(dspy.Signature):
    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="often between 1 and 5 words")
RAG Module
In the RAG class, which acts as a module, define the model in the init function. Focus on ‘Retrieve’ and ‘GenerateAnswer.’ ‘Retrieve’ gathers relevant passages as context, and ‘GenerateAnswer’ uses ‘ChainOfThought’ for predictions.
class RAG(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)

    def forward(self, question):
        context = self.retrieve(question).passages
        prediction = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=prediction.answer)
RAG Optimizer
Compile the RAG program using a training set, define a validation metric, and select a teleprompter for optimization. Teleprompters are powerful optimizers that select effective prompts for modules.
def validate_context_and_answer(example, pred, trace=None):
    answer_EM = dspy.evaluate.answer_exact_match(example, pred)
    answer_PM = dspy.evaluate.answer_passage_match(example, pred)
    return answer_EM and answer_PM
teleprompter = BootstrapFewShot(metric=validate_context_and_answer)
compiled_rag = teleprompter.compile(RAG(), trainset=trainset)
Execute this pipeline:
my_question = "What castle did David Gregory inherit?"
pred = compiled_rag(my_question)
print(f"Question: {my_question}")
print(f"Predicted Answer: {pred.answer}")
print(f"Retrieved Contexts (truncated): {[c[:200] + '...' for c in pred.context]}")
Question: What castle did David Gregory inherit?
Predicted Answer: Kinnairdy Castle
Retrieved Contexts (truncated): ['David Gregory (physician) | David Gregory (20 December 1625 – 1720) was a Scottish physician and inventor. His surname is sometimes spelt as Gregorie, the original Scottish spelling. He inherited Kinn...']
Inspect the history:
turbo.inspect_history(n=1)
Evaluate
Evaluate the RAG model’s performance: assess the basic RAG, uncompiled RAG (without optimizer), and compiled RAG (with optimizer).
Basic RAG
def gold_passages_retrieved(example, pred, trace=None):
    gold_titles = set(map(dspy.evaluate.normalize_text, example['gold_titles']))
    found_titles = set(map(dspy.evaluate.normalize_text, [c.split(' | ')[0] for c in pred.context]))
    return gold_titles.issubset(found_titles)
evaluate_on_hotpotqa = Evaluate(devset=devset, num_threads=1, display_progress=True, display_table=5)
compiled_rag_retrieval_score = evaluate_on_hotpotqa(compiled_rag, metric=gold_passages_retrieved)
Uncompiled Baleen RAG (Without Optimizer)
Baleen automatically decomposes a complex question into multi-hop sub-queries, retrieves passages for each hop, and accumulates the results as context, which helps generate more accurate answers.
class GenerateSearchQuery(dspy.Signature):
    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    query = dspy.OutputField()
Create the module:
class SimplifiedBaleen(dspy.Module):
    def __init__(self, passages_per_hop=3, max_hops=2):
        super().__init__()
        self.generate_query = [dspy.ChainOfThought(GenerateSearchQuery) for _ in range(max_hops)]
        self.retrieve = dspy.Retrieve(k=passages_per_hop)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)
        self.max_hops = max_hops

    def forward(self, question):
        context = []
        for hop in range(self.max_hops):
            query = self.generate_query[hop](context=context, question=question).query
            passages = self.retrieve(query).passages
            context = deduplicate(context + passages)
        pred = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=pred.answer)
Inspect the zero-shot version of the Baleen program:
my_question = "How many storeys are in the castle that David Gregory inherited?"
uncompiled_baleen = SimplifiedBaleen()
pred = uncompiled_baleen(my_question)
print(f"Question: {my_question}")
print(f"Predicted Answer: {pred.answer}")
print(f"Retrieved Contexts (truncated): {[c[:200] + '...' for c in pred.context]}")
Compiled Baleen RAG (with Optimizer)
Define validation logic to ensure:
- The predicted answer matches the correct answer.
- The retrieved context includes the correct answer.
- No generated queries are too long or repetitive.
def validate_context_and_answer_and_hops(example, pred, trace=None):
    if not dspy.evaluate.answer_exact_match(example, pred):
        return False
    if not dspy.evaluate.answer_passage_match(example, pred):
        return False
    hops = [example.question] + [outputs.query for *_, outputs in trace if 'query' in outputs]
    if max([len(h) for h in hops]) > 100:
        return False
    if any(dspy.evaluate.answer_exact_match_str(hops[idx], hops[:idx], frac=0.8) for idx in range(2, len(hops))):
        return False
    return True
Use BootstrapFewShot teleprompter:
teleprompter = BootstrapFewShot(metric=validate_context_and_answer_and_hops)
Compile the program and evaluate the retrieval quality of the compiled and uncompiled Baleen pipelines:
compiled_baleen = teleprompter.compile(SimplifiedBaleen(), teacher=SimplifiedBaleen(passages_per_hop=2), trainset=trainset)
uncompiled_baleen_retrieval_score = evaluate_on_hotpotqa(uncompiled_baleen, metric=gold_passages_retrieved)
compiled_baleen_retrieval_score = evaluate_on_hotpotqa(compiled_baleen, metric=gold_passages_retrieved)
Print scores for comparison:
print(f"## Retrieval Score for RAG: {compiled_rag_retrieval_score}")
print(f"## Retrieval Score for uncompiled Baleen: {uncompiled_baleen_retrieval_score}")
print(f"## Retrieval Score for compiled Baleen: {compiled_baleen_retrieval_score}")
Output:
## Retrieval Score for RAG: 26.0
## Retrieval Score for uncompiled Baleen: 48.0
## Retrieval Score for compiled Baleen: 60.0
The compiled Baleen method provides more accurate answers than the basic RAG application by breaking a question into multi-hop sub-queries and retrieving context for each hop before answering.
compiled_baleen("How many storeys are in the castle that David Gregory inherited?")
turbo.inspect_history(n=3)
Comparison with LangChain and LlamaIndex
LangChain and LlamaIndex Overview
- Both LangChain and LlamaIndex are prominent libraries in prompting LMs.
- They focus on providing pre-packaged components and chains for application developers, offering reusable pipelines and tools.
Key Differences Between DSPy and LangChain/LlamaIndex:
- LangChain and LlamaIndex rely on manual prompt engineering, which DSPy aims to eliminate.
- DSPy offers a structured framework that automatically bootstraps prompts, negating the need for hand-written prompt demonstrations.
- In contrast to LangChain's extensive use of long strings for prompts, DSPy achieves high quality without hand-written prompts, offering more modularity and power.
FAQs
1. What is DSPy, and how is it different from prompt engineering?
DSPy is a framework that replaces manual prompt engineering with structured, programmable pipelines. Instead of repeatedly writing and adjusting prompts, tasks are defined as modules with clear inputs and outputs, and DSPy optimizes these interactions automatically, enhancing reliability and scalability.
2. Do I need to train a model to use DSPy?
No, DSPy works on top of existing large language models. Users configure a model and define their pipeline, and DSPy handles optimization internally, facilitating faster development of production-ready applications.
3. Can DSPy be used for real-world applications?
Yes, DSPy is designed for real-world use cases, including chatbots, AI agents, and retrieval-augmented generation systems. By structuring tasks into modules and optimizing them, DSPy ensures consistent and high-quality outputs in production environments.
4. How does DSPy improve the performance of LLM applications?
DSPy uses optimizers to refine the structure of prompts and reasoning steps. Instead of manually experimenting with different prompts, DSPy evaluates and improves them based on defined metrics, enhancing accuracy and consistency while reducing manual maintenance.
5. Is DSPy only useful for developers with AI/ML experience?
No, DSPy is accessible to both beginners and experienced developers. While understanding LLMs is beneficial, DSPy simplifies complex aspects, abstracting prompt engineering into reusable components for building advanced AI systems without deep expertise.
6. How does DSPy relate to RAG (Retrieval-Augmented Generation)?
DSPy can build and optimize RAG pipelines by structuring retrieval, reasoning, and generation steps into modules, ensuring effective use of retrieved context by the language model for more accurate and context-aware responses.
7. Can DSPy work with frameworks like LangChain?
Yes, DSPy can complement frameworks like LangChain, focusing on optimizing components rather than chaining them. It can be used independently or integrated into existing workflows for improved performance.
8. What are the main benefits of using DSPy in production?
DSPy reduces prompt instability, enhances output quality, and simplifies scaling. It transforms experimental setups into structured AI systems, enabling faster development and reliable deployment of LLM-powered applications.
Conclusion
This article explored DSPy, a structured approach to building AI systems with language models. Unlike manual prompt engineering, DSPy provides a more reliable way to design workflows using signatures, modules, and teleprompters, transforming loosely defined prompts into organized, scalable pipelines.
By constructing simple Q&A chatbots and RAG-based applications, DSPy simplifies complex tasks into manageable steps. It demonstrates that achieving strong results doesn't always require large, heavily fine-tuned models; well-structured pipelines and optimization can make a significant difference.
Overall, DSPy shifts the focus from "writing better prompts" to "designing better systems," making it especially valuable for real-world applications where consistency, scalability, and performance are crucial.
References
- DSPy GitHub repository
- DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines (paper)
- DSPy official documentation