A vendor-neutral catalog of design patterns for building with large language models. Browse individual patterns or follow guides to compose them into production systems.
Retrieval-augmented generation patterns for grounding LLM output in external knowledge.
Give an AI agent control over when, where, and how to retrieve information rather than using a fixed retrieval pipeline.
Ground LLM responses in external knowledge by retrieving relevant documents before generation to reduce hallucinations and stay current.
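The retrieve-then-generate flow can be sketched in a few lines. Here the overlap-based ranker and the document list are illustrative stand-ins: a real system would embed the query, search a vector store, and send the assembled prompt to a model.

```python
import re

def retrieve(query, docs, k=2):
    """Rank docs by term overlap with the query (a stand-in for vector search)."""
    q = set(re.findall(r"\w+", query.lower()))
    return sorted(docs,
                  key=lambda d: len(q & set(re.findall(r"\w+", d.lower()))),
                  reverse=True)[:k]

def build_prompt(query, docs):
    # Ground the model: restrict it to the retrieved context.
    context = "\n".join(f"- {d}" for d in retrieve(query, docs))
    return ("Answer using ONLY the context below; say 'I don't know' otherwise.\n"
            f"Context:\n{context}\n\nQuestion: {query}")

docs = [
    "The warranty period for the X100 is 24 months.",
    "Shipping to EU countries takes 3-5 business days.",
    "The X100 battery lasts roughly 12 hours per charge.",
]
prompt = build_prompt("How long is the X100 warranty?", docs)
```

The "only the context" instruction plus a permission to say "I don't know" is what pushes the model toward grounded answers instead of guesses.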
Answer complex multi-hop questions through iterative cycles of retrieval, reasoning, and gap analysis across multiple sources.
Build trust in RAG outputs through inline citations, out-of-domain detection, and self-correcting retrieval strategies that reduce hallucinations.
Bridge the vocabulary gap between user queries and knowledge base content using hypothetical answers, query expansion, and hybrid search.
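A minimal sketch of the query-expansion half of this pattern: search with several paraphrases of the user's wording and merge the results. The synonym table here is a toy stand-in for an LLM-generated expansion step.

```python
# Toy expansion table; in practice an LLM or thesaurus generates these variants.
SYNONYMS = {"pto": ["paid time off", "vacation"]}

def expand(query):
    """Return the original query plus paraphrased variants for retrieval."""
    variants = [query]
    for term, alts in SYNONYMS.items():
        if term in query.lower():
            variants += [query.lower().replace(term, alt) for alt in alts]
    return variants

variants = expand("How do I request PTO?")
```

Each variant is searched independently and the result sets are merged, so a knowledge base that only says "paid time off" can still answer a query about "PTO".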
Improve retrieval quality by reranking, compressing, and filtering retrieved chunks between the vector search step and LLM generation.
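The reranking stage can be sketched as a rescore-then-truncate step between retrieval and generation. The overlap scorer below is an illustrative placeholder; real rerankers are learned cross-encoder models.

```python
def rerank(query, candidates, scorer, keep=2):
    """Rescore retrieved chunks and keep only the strongest before generation."""
    return sorted(candidates, key=lambda c: scorer(query, c), reverse=True)[:keep]

def overlap_scorer(query, doc):
    # Stub scorer: shared-word count. A cross-encoder would score (query, doc) pairs.
    return len(set(query.lower().split()) & set(doc.lower().split()))

chunks = ["reset your password in settings",
          "pricing tiers and billing",
          "password rules require 12 characters"]
top = rerank("how do I reset my password", chunks, overlap_scorer)
```

Trimming to the best few chunks both cuts token cost and keeps distracting passages out of the model's context.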
Replace keyword matching with vector embeddings to find documents by meaning rather than exact words, enabling semantic similarity search.
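At its core, semantic search is a nearest-neighbor lookup over embedding vectors. The sketch below uses tiny hand-made 3-d vectors for illustration; real embeddings come from an embedding model and have hundreds of dimensions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def search(query_vec, index, k=1):
    """index: list of (doc_id, vector). Return top-k doc ids by similarity."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

# Toy embeddings standing in for model output.
index = [("refund-policy", [0.9, 0.1, 0.0]),
         ("api-reference", [0.1, 0.9, 0.2]),
         ("onboarding",    [0.0, 0.2, 0.9])]
top = search([0.85, 0.15, 0.05], index)  # a query embedded near "refunds"
```

Because similarity is computed in embedding space, a query phrased as "money back" can still land on the refund document even with zero word overlap.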
Autonomous systems that plan, use tools, execute code, and coordinate with other agents.
Let LLMs generate and execute code in sandboxed environments for tasks that demand computational precision, such as data analysis and visualization.

Coordinate multiple specialized agents to solve complex tasks that exceed any single agent's capabilities using supervisor or peer topologies.
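The supervisor topology can be sketched as a coordinator that routes subtasks to specialist agents and assembles their results. The agents here are plain functions standing in for LLM-backed workers; the agent names and task flow are illustrative.

```python
# Stub specialists; each would be its own model-backed agent in practice.
AGENTS = {
    "research": lambda task: f"notes on {task}",
    "write":    lambda task: f"draft using {task}",
}

def supervisor(goal):
    """Decompose the goal, delegate to specialists, and combine their output."""
    notes = AGENTS["research"](goal)
    draft = AGENTS["write"](notes)
    return draft

result = supervisor("solar batteries")
```

In a peer topology the same agents would hand work to each other directly; the supervisor variant keeps routing decisions in one place.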
Separate strategic planning from tactical execution: one agent drafts the plan and another executes each step, yielding more structured workflows.
Interleave reasoning and action in a loop where the agent thinks, acts, observes, and repeats until the task is complete.
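The think-act-observe loop can be sketched as follows. The "model" is a scripted stub standing in for a real LLM, and the `Action:`/`Observation:`/`Final Answer:` step format is an illustrative convention.

```python
# One safe stub tool; real agents register many.
TOOLS = {"calculator": lambda expr: str(eval(expr, {"__builtins__": {}}))}

def scripted_model(transcript):
    """Stub policy: decide to compute once, then finish. A real LLM reads the
    growing transcript and emits the next step."""
    if "Observation:" not in transcript:
        return "Action: calculator[12 * 7]"
    return "Final Answer: 84"

def react(question, model, max_steps=5):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = model(transcript)
        transcript += step + "\n"
        if step.startswith("Final Answer:"):
            return step.removeprefix("Final Answer:").strip()
        # Parse "Action: tool[argument]" and run the tool.
        tool, arg = step.removeprefix("Action: ").split("[", 1)
        observation = TOOLS[tool](arg.rstrip("]"))
        transcript += f"Observation: {observation}\n"
    return None

answer = react("What is 12 * 7?", scripted_model)
```

The key property is that each tool observation is appended to the transcript, so the model's next "thought" is conditioned on real results rather than its own guesses.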
Let LLMs interact with external systems by emitting structured function calls that your code executes safely on their behalf.
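The host-side half of tool calling is a dispatch step: the model emits a structured call, and your code validates it against a registry before executing. The JSON shape and tool name below are illustrative assumptions, not any particular vendor's schema.

```python
import json

def get_weather(city: str) -> str:
    return f"Sunny in {city}"  # stub for a real weather API

# Only registered functions can ever be executed.
REGISTRY = {"get_weather": get_weather}

def execute_tool_call(raw: str) -> str:
    """Parse a model-emitted function call and run it safely via the registry."""
    call = json.loads(raw)
    fn = REGISTRY.get(call["name"])
    if fn is None:
        raise ValueError(f"unknown tool: {call['name']}")
    return fn(**call["arguments"])

result = execute_tool_call('{"name": "get_weather", "arguments": {"city": "Oslo"}}')
```

The registry lookup is the safety boundary: the model can only request functions you explicitly exposed, never arbitrary code.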
Techniques for structuring model inputs to get better reasoning, consistency, and output quality.
Prompt models to show their reasoning step by step to improve accuracy on multi-step problems like math, logic, and complex analysis.
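A minimal sketch of triggering and consuming chain-of-thought output: append a step-by-step instruction, and keep the final answer machine-parseable. The `ANSWER:` delimiter convention is an assumption, not a standard.

```python
def cot_prompt(question):
    """Ask for step-by-step reasoning with a parseable final line."""
    return (f"{question}\n"
            "Think step by step, then give the final answer on its own line "
            "as 'ANSWER: <value>'.")

def extract_answer(reply):
    """Scan from the end for the final-answer line."""
    for line in reversed(reply.splitlines()):
        if line.startswith("ANSWER:"):
            return line.removeprefix("ANSWER:").strip()
    return None

# Simulated model reply showing the expected shape.
ans = extract_answer("First, 3*4=12.\nThen 12+5=17.\nANSWER: 17")
```

Separating the reasoning from a delimited final answer lets downstream code ignore the scratch work while still benefiting from it.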
Include input-output examples in your prompt so the model learns the expected format, tone, and behavior by demonstration.
Break complex tasks into a sequence of focused prompts where each step's output feeds into the next for more reliable multi-step results.
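Chaining reduces to a fold over prompt templates, each wrapping the previous step's output. The `llm` stub below fakes two steps deterministically; a real chain calls a model API at each step.

```python
def llm(prompt):
    """Stub model: 'summarize' keeps the first sentence, 'shout' upper-cases."""
    if prompt.startswith("Summarize:"):
        return prompt.removeprefix("Summarize:").strip().split(".")[0] + "."
    if prompt.startswith("Shout:"):
        return prompt.removeprefix("Shout:").strip().upper()
    return prompt

def chain(text, steps):
    """Run each templated prompt in order, feeding output forward."""
    out = text
    for template in steps:
        out = llm(template.format(input=out))
    return out

result = chain("LLMs are useful. They also hallucinate.",
               ["Summarize: {input}", "Shout: {input}"])
```

Because each step has one narrow job, failures are easier to localize than in a single do-everything prompt.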
Automatically optimize prompts against evaluation datasets instead of relying on manual trial-and-error tuning of instructions.
Generate multiple reasoning paths and take the majority answer to reduce errors from stochastic generation and improve reliability.
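The voting step is simple to sketch: sample the same prompt several times and take the modal answer. The cycling stub sampler stands in for a real model called with temperature above zero.

```python
from collections import Counter
import itertools

def self_consistent_answer(prompt, sample, n=5):
    """Sample n reasoning paths and return the majority final answer."""
    answers = [sample(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Stub sampler: canned answers simulating stochastic generation.
_fake = itertools.cycle(["42", "42", "41", "42", "40"])
answer = self_consistent_answer("What is 6 * 7?", lambda p: next(_fake))
```

Uncorrelated mistakes tend to land on different wrong answers, so the correct one usually wins the vote even when no single sample is reliable.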
Directing requests to the right model, chain, or agent based on intent and constraints.
Try cheaper models first and escalate to more capable ones only when confidence is low, reducing costs while maintaining output quality.
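The escalation logic can be sketched as a confidence-gated fallback. Both "models" and the confidence signal below are stubs; in practice confidence might come from token log-probabilities or a verifier.

```python
def cheap_model(prompt):
    """Stub small model: pretend it is only confident on short, simple prompts."""
    confident = len(prompt.split()) < 8
    return ("small-answer", 0.9 if confident else 0.3)

def strong_model(prompt):
    return ("large-answer", 0.95)  # stub expensive model

def cascade(prompt, threshold=0.7):
    """Answer cheaply when confident; otherwise escalate."""
    answer, confidence = cheap_model(prompt)
    if confidence >= threshold:
        return answer, "cheap"
    answer, _ = strong_model(prompt)
    return answer, "strong"

easy = cascade("What is 2+2?")
hard = cascade("Explain the trade-offs between eventual and strong consistency in replicated databases")
```

If most traffic is easy, the expensive model only sees the residual hard tail, which is where the cost savings come from.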
Route queries to the right model tier based on estimated complexity to optimize cost without sacrificing quality on harder tasks.
Classify query intent using embeddings and route to the appropriate handler, tool, or agent pipeline without relying on keyword rules.
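Intent routing by embedding reduces to a nearest-prototype lookup: compare the query embedding to one reference vector per intent and dispatch to that intent's handler. The 2-d vectors and keyword-keyed stub embedder are illustrative.

```python
# One prototype vector per intent (toy 2-d; real ones come from an embedding model).
PROTOTYPES = {"billing": [1.0, 0.0], "technical": [0.0, 1.0]}

HANDLERS = {
    "billing":   lambda q: f"billing queue: {q}",
    "technical": lambda q: f"tech queue: {q}",
}

def route(query, embed):
    """Dispatch to the handler whose prototype is nearest the query embedding."""
    vec = embed(query)
    intent = max(PROTOTYPES,
                 key=lambda name: sum(a * b for a, b in zip(vec, PROTOTYPES[name])))
    return HANDLERS[intent](query)

def embed(text):
    # Stub embedder; a real one maps any text to a dense vector.
    return [0.9, 0.1] if "invoice" in text.lower() else [0.1, 0.9]

out = route("Where is my invoice?", embed)
```

Unlike keyword rules, prototype vectors generalize: "charge on my card" would land near the billing prototype with no explicit rule for "charge".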
Protecting systems from harmful inputs, hallucinated outputs, and policy violations.
Insert safety layers at input, output, retrieval, and execution points to enforce content policies, prevent harm, and block prompt injection.
Detect potential hallucinations by analyzing token probabilities and confidence scores in LLM outputs before they reach the user.
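A minimal version of the confidence check: many completion APIs can return per-token log-probabilities, and unusually low averages often correlate with fabricated spans. The threshold and the sample logprobs below are illustrative assumptions, not calibrated values.

```python
def flag_low_confidence(token_logprobs, threshold=-2.5):
    """Flag an output whose mean token log-probability is suspiciously low."""
    mean_logprob = sum(token_logprobs) / len(token_logprobs)
    return mean_logprob < threshold

# Illustrative logprob sequences, as an API might return them.
confident = flag_low_confidence([-0.1, -0.3, -0.2])   # high-probability tokens
suspect = flag_low_confidence([-3.0, -4.2, -2.8])     # model was "guessing"
```

Flagged outputs can be regenerated, routed to a stronger model, or shown with a warning rather than reaching the user unchecked.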
Measuring and improving the quality of LLM outputs through automated and human feedback.
Use an LLM with a custom scoring rubric to evaluate open-ended outputs at scale, replacing expensive human review with consistent automated grading.
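The judge pattern has two mechanical pieces: building a rubric prompt, and parsing the grader's reply into a score. The rubric text and `SCORE:` reply format below are assumptions; the judge reply is simulated.

```python
RUBRIC = """Rate the answer from 1-5 for factual accuracy.
Reply exactly as: SCORE: <n>"""

def build_judge_prompt(question, answer):
    """Combine the rubric with the item under evaluation."""
    return f"{RUBRIC}\n\nQuestion: {question}\nAnswer: {answer}"

def parse_score(reply):
    """Extract the numeric score from the judge model's reply."""
    for line in reply.splitlines():
        if line.startswith("SCORE:"):
            return int(line.removeprefix("SCORE:").strip())
    raise ValueError("judge reply missing SCORE line")

# Simulated judge reply showing the expected shape.
score = parse_score("SCORE: 4")
```

Forcing a rigid reply format is what makes judge output aggregatable at scale; free-text verdicts are far harder to score consistently.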
Improve LLM outputs through iterative generate-evaluate-critique-regenerate loops that refine quality without retraining the model.
Operating LLM systems efficiently through caching, model selection, and inference optimization.
Maximize inference throughput through batching, KV cache optimization, and model parallelism to reduce latency and serve more requests per GPU.
Reuse responses for repeated or similar prompts through semantic and prefix caching strategies to cut latency and reduce API costs.
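The simplest layer of this pattern is an exact-match cache on a normalized prompt; a semantic cache extends the same structure by comparing embeddings instead of normalized strings. The in-memory dict and stub model are illustrative.

```python
class PromptCache:
    """Exact-match response cache keyed on a normalized prompt."""

    def __init__(self):
        self._store = {}
        self.hits = 0

    def _key(self, prompt):
        return " ".join(prompt.lower().split())  # normalize case and whitespace

    def get_or_call(self, prompt, llm):
        key = self._key(prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self._store[key] = llm(prompt)
        return self._store[key]

cache = PromptCache()
calls = []
fake_llm = lambda p: calls.append(p) or f"answer:{len(calls)}"  # stub model
a = cache.get_or_call("What is RAG?", fake_llm)
b = cache.get_or_call("  what is rag?  ", fake_llm)  # normalized cache hit
```

Even trivial normalization catches many repeats; swapping the key function for an embedding-similarity lookup turns this into a semantic cache.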
Reduce model size through distillation, quantization, or speculative decoding while preserving quality for cost-efficient deployment.
Maintaining context across conversations and sessions beyond the context window.
Manage conversation state across turns using sliding windows, summarization, or entity tracking to maintain coherent multi-turn dialogue.
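The sliding-window strategy can be sketched with a bounded deque: keep only the most recent turns, plus an optional pinned system message, so the conversation always fits the context window. The message shape is the common role/content convention.

```python
from collections import deque

class SlidingWindowMemory:
    """Keep the last N turns; older turns fall off automatically."""

    def __init__(self, max_turns=3, system=None):
        self.system = system
        self.turns = deque(maxlen=max_turns)

    def add(self, role, content):
        self.turns.append({"role": role, "content": content})

    def messages(self):
        """Messages to send: pinned system prompt plus the recent window."""
        prefix = [{"role": "system", "content": self.system}] if self.system else []
        return prefix + list(self.turns)

mem = SlidingWindowMemory(max_turns=2, system="Be concise.")
for i in range(4):
    mem.add("user", f"turn {i}")
msgs = mem.messages()
```

Pinning the system message outside the window matters: instructions must survive even when the oldest turns are evicted.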
Persist important facts and preferences in external memory stores and retrieve them to maintain continuity and personalization across sessions.
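A minimal sketch of the store-and-recall loop: persist facts as key-value entries and inject any relevant ones into the next session's prompt. A production store would use a database with semantic retrieval; this in-memory dict and substring match are illustrative.

```python
class MemoryStore:
    """Toy persistent memory: remember facts, recall the ones a query mentions."""

    def __init__(self):
        self.facts = {}

    def remember(self, key, value):
        self.facts[key] = value

    def recall(self, query):
        """Return facts whose key appears in the query text."""
        q = query.lower()
        return {k: v for k, v in self.facts.items() if k in q}

store = MemoryStore()
store.remember("dietary preference", "vegetarian")
store.remember("home city", "Lisbon")
hits = store.recall("Suggest a restaurant matching my dietary preference")
```

Recalled facts are typically prepended to the prompt ("Known about the user: ..."), giving continuity across sessions without retraining or huge contexts.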