An Introduction to Large Language Models: Prompt Engineering and P-Tuning
Cross-posted to Cognition, Planning, and Memory.
Abstract
This NVIDIA Developer Blog article (Tanay Varshney, 2023) introduces large language models, makes the case for LLMs over traditional chatbot ensembles, and covers the two primary mechanisms for customising LLM behaviour: prompt engineering and p-tuning. It explains zero-shot, few-shot, and chain-of-thought prompting strategies with concrete examples, then motivates p-tuning as a parameter-efficient alternative to full fine-tuning — using a small trainable model to generate “virtual tokens” that are prepended to the LLM’s input at inference time. The article positions NVIDIA NeMo as the platform for p-tuning at scale. It is a foundational conceptual reference for anyone building agents that interact with LLMs, particularly for understanding how to guide LLM behaviour without retraining the full model.
Key Concepts
Why LLMs over Ensembles
Traditional chatbots use an ensemble of purpose-built BERT-scale models coordinated by a dialog manager. The limitations of this approach:
- Each skill requires its own data, fine-tuning pipeline, and MLOps maintenance cycle.
- Adding new capabilities means building and maintaining another pipeline (“time-to-ship” is long).
- Ensembles are rigid: they cannot generalise beyond their training distribution.
LLMs trade per-task model efficiency for flexibility: a single model, pretrained on a large corpus, handles diverse tasks, and new capabilities are added via prompting or p-tuning rather than via new pipelines.
Prompts and Prompt Engineering
A prompt is any input provided to an LLM to elicit a desired response — text, instructions, examples, data, constraints, or images. The quality and structure of the prompt directly determine the quality of the output.
Three prompting strategies:
| Strategy | Description | Use case |
|---|---|---|
| Zero-shot | No examples; task is described in natural language only | Simple factual or instructional tasks |
| Few-shot | Several input→output examples are prepended to the actual query | Tasks where format or reasoning pattern matters |
| Chain-of-thought (CoT) | Examples include step-by-step reasoning traces, or a zero-shot trigger like “Let’s think about this logically” | Multi-step reasoning, arithmetic, logic problems |
Chain-of-thought prompting causes the LLM to generate an explicit reasoning chain alongside its answer. Zero-shot CoT (appending a trigger phrase such as “Let’s think about this logically” or “Let’s think step by step”) can elicit reasoning without labelled examples, though it is less reliable than few-shot CoT.
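The sketch below shows how the three strategies differ purely in how the prompt is assembled. The tasks, examples, and trigger phrase are illustrative assumptions for demonstration, not the article’s exact prompts.

```python
# Illustrative prompt construction for the three strategies. The tasks, examples,
# and trigger phrase are assumptions for demonstration, not the article's prompts.

def build_few_shot_prompt(instruction, examples, query):
    """Prepend K labelled input -> output examples to the actual query."""
    shots = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in examples)
    return f"{instruction}\n\n{shots}\n\nInput: {query}\nOutput:"

# Zero-shot: the task is described in natural language only.
zero_shot = "What is the capital of France? Answer with the city name only."

# Few-shot: the output format is learned in-context from the examples.
few_shot = build_few_shot_prompt(
    "Classify the sentiment of each review as positive or negative.",
    [
        ("The battery lasts all day.", "positive"),
        ("It stopped working after a week.", "negative"),
        ("Setup was quick and painless.", "positive"),
    ],
    "The screen scratches far too easily.",
)

# Zero-shot chain-of-thought: a reasoning trigger appended to the question.
zero_shot_cot = (
    "A shop sells pens in packs of 12. How many packs are needed for 100 pens?\n"
    "Let's think step by step."
)

print(zero_shot, few_shot, zero_shot_cot, sep="\n\n---\n\n")
```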
P-Tuning (Prompt Tuning)
P-tuning is a parameter-efficient fine-tuning (PEFT) technique that customises LLM behaviour by training a small auxiliary model (a “soft prompt encoder”) rather than the LLM itself.
Mechanism:
- A small trainable model encodes the task-specific text prompt into a set of virtual tokens.
- Virtual tokens are prepended to the LLM’s input context at inference time.
- After training completes, the virtual tokens are stored in a lookup table — the auxiliary model is no longer needed at runtime.
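A minimal PyTorch sketch of this mechanism, assuming a decoder-only LLM whose input embeddings are accessible. The `PromptEncoder` here is a plain MLP and the sizes are placeholders; this illustrates the idea rather than reproducing NeMo’s implementation.

```python
import torch
import torch.nn as nn

class PromptEncoder(nn.Module):
    """Small trainable model that produces virtual tokens (soft prompt vectors).

    Sketch only: sizes are placeholders, and a simple MLP over learned prompt
    embeddings stands in for the encoder used in practice.
    """

    def __init__(self, num_virtual_tokens=20, hidden_size=768):
        super().__init__()
        self.prompt_embeds = nn.Parameter(torch.randn(num_virtual_tokens, hidden_size) * 0.02)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.Tanh(),
            nn.Linear(hidden_size, hidden_size),
        )

    def forward(self, batch_size):
        virtual_tokens = self.mlp(self.prompt_embeds)         # (num_virtual, hidden)
        return virtual_tokens.unsqueeze(0).expand(batch_size, -1, -1)

def prepend_virtual_tokens(virtual_tokens, input_embeds):
    """Prepend virtual tokens to the real token embeddings fed to the frozen LLM."""
    return torch.cat([virtual_tokens, input_embeds], dim=1)

# Toy shapes only: dummy_embeds stands in for the LLM's embedding-layer output.
encoder = PromptEncoder()
dummy_embeds = torch.randn(2, 16, 768)                        # (batch, seq_len, hidden)
extended = prepend_virtual_tokens(encoder(batch_size=2), dummy_embeds)
print(extended.shape)                                          # torch.Size([2, 36, 768])
```

After training, the encoder output can be computed once and stored as the lookup table described above, so the encoder itself is not needed at runtime.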
Advantages over full fine-tuning:
- Drastically fewer trainable parameters → orders-of-magnitude less compute.
- Training time measured in minutes (~20 min on NeMo) rather than days.
- Multiple task-specific virtual token sets can be saved and swapped without storing multiple full model copies (see the sketch after this list).
- The base LLM weights remain unchanged, preserving its general capabilities.
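The swap-without-copies point can be made concrete: each task’s artefact is just a small tensor. A minimal sketch, with hypothetical file names and sizes:

```python
import torch

# Hypothetical artefacts: after p-tuning each task, persist only its virtual token
# tensor. A (20, 768) float32 tensor is roughly 60 KB, versus gigabytes for a full
# model copy.
torch.save(torch.randn(20, 768), "intent_virtual_tokens.pt")      # stand-in for a trained prompt
torch.save(torch.randn(20, 768), "summarise_virtual_tokens.pt")

# At serving time, switching tasks means loading a different small tensor and
# prepending it to the input embeddings; the frozen base LLM is shared by all tasks.
virtual_tokens = torch.load("intent_virtual_tokens.pt")
print(virtual_tokens.shape)                                        # torch.Size([20, 768])
```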
P-tuning on NVIDIA NeMo: NVIDIA’s NeMo LLM service provides a managed p-tuning workflow. The virtual tokens learned on NeMo can be used directly at inference time via the NeMo API.
Key Algorithms
- Zero-shot prompting: Task description only → LLM generates completion. No examples required; accuracy depends on the model’s pretraining coverage of the task.
- Few-shot prompting: Prepend K labelled examples (K typically 3–10) to the query. The LLM learns the task format in-context without gradient updates.
- Chain-of-thought prompting: Prepend examples that include intermediate reasoning steps. The LLM mirrors the reasoning structure and applies it to the new query.
- P-tuning: Backpropagate through a small encoder model; the LLM’s weights are frozen. Gradients flow only through the virtual token encoder. At inference, frozen virtual tokens replace the encoder call.
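A hedged sketch of this gradient flow, using Hugging Face GPT-2 as a stand-in for a large model and a single hard-coded example as a stand-in for task data. For brevity the virtual tokens are trained directly as a parameter tensor (the lookup-table form) rather than through the intermediate encoder; the frozen-LLM setup is the same.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-ins for illustration: GPT-2 replaces a large model, one hard-coded example
# replaces a task dataset, and the sizes/hyperparameters are arbitrary.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
llm = AutoModelForCausalLM.from_pretrained("gpt2")

num_virtual = 20
soft_prompt = torch.nn.Parameter(                     # the only trainable parameters
    torch.randn(num_virtual, llm.config.hidden_size) * 0.02
)
for param in llm.parameters():                        # the base LLM stays frozen
    param.requires_grad = False

optimizer = torch.optim.AdamW([soft_prompt], lr=1e-4)

batch = tokenizer(["Review: great value. Sentiment: positive"], return_tensors="pt")
input_ids = batch["input_ids"]                                    # (1, seq_len)
input_embeds = llm.get_input_embeddings()(input_ids)              # (1, seq_len, hidden)
extended = torch.cat(
    [soft_prompt.unsqueeze(0).expand(input_ids.size(0), -1, -1), input_embeds], dim=1
)

# Mask out (-100) the virtual prefix so the LM loss is computed on real tokens only.
labels = torch.cat(
    [torch.full((input_ids.size(0), num_virtual), -100, dtype=torch.long), input_ids], dim=1
)
loss = llm(inputs_embeds=extended, labels=labels).loss
optimizer.zero_grad()
loss.backward()                                        # gradients reach only soft_prompt
optimizer.step()
```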
Key Claims and Findings
- LLMs become viable when scale reaches approximately 1B parameters; emergent abilities (in-context learning, chain-of-thought) arise at this scale.
- The cost-of-ownership argument for LLMs over ensembles must include engineering time, data acquisition, and maintenance — not just inference FLOPs.
- Prompt quality determines output quality: even small changes in phrasing can switch an LLM from an incorrect to a correct answer (demonstrated with a capital-city example).
- P-tuning achieves fine-tuning-level performance customisation at a fraction of the computational cost; NVIDIA demonstrated ~20-minute tuning cycles on the NeMo service.
- Zero-shot chain-of-thought (“Let’s think about this logically”) can improve LLM reasoning on multi-step problems without any labelled examples, though few-shot CoT remains more reliable.
Terminology
| Term | Definition |
|---|---|
| LLM | Large language model — a deep learning model with ≥1B parameters trained on large text corpora |
| Zero-shot | Inference without any task-specific examples in the prompt |
| Few-shot | Inference with a small number of labelled examples prepended as context |
| Chain-of-thought (CoT) | Prompting technique that elicits explicit intermediate reasoning steps from an LLM |
| P-tuning / prompt tuning | PEFT technique: trains a small encoder to produce virtual tokens that steer LLM behaviour without modifying LLM weights |
| Virtual tokens | Soft prompt vectors produced by the p-tuning encoder; prepended to the input at inference time |
| PEFT | Parameter-efficient fine-tuning — any technique that tunes a small subset of parameters rather than the full model |
| NeMo | NVIDIA’s LLM training, customisation, and serving framework; supports p-tuning |
Connections
- What Is Agent Memory? — p-tuning and chain-of-thought prompting are mechanisms for shaping agent cognition at inference time; they complement the memory taxonomy by addressing how the LLM processes information present in the context window
- Building Autonomous AI with NVIDIA Agentic NeMo — NeMo is the platform for p-tuning described here; the agentic NeMo stack relies on customised LLMs whose behaviour is shaped by exactly these techniques
- Cross-section: Cognition, Planning, and Memory — cognition perspective on the same material