Chapter 3 of [[ai-ml/nvidia-certs/ncp-aai/cognition-planning-and-memory/Understanding-the-planning-of-LLM-agents-A-survey|Understanding the planning of LLM agents: A survey]]

Abstract

This chapter surveys the methodologies surrounding task decomposition in Large Language Model (LLM) agents, specifically focusing on the structural breakdown of complex objectives into manageable sub-tasks. It establishes the technical distinction between decomposition-first strategies, such as HuggingGPT and Plan-and-Solve, and interleaved decomposition methods that dynamically alternate between reasoning and planning, exemplified by ReAct and the Chain-of-Thought variants. Furthermore, the text introduces multi-plan selection mechanisms, including Self-consistency and Tree-of-Thought, which address the inherent uncertainty of LLM generation by evaluating diverse candidate plans. These contributions are critical for understanding the planning trajectories of LLM agents, highlighting the trade-offs between the rigidity of pre-determined sub-tasks and the dynamic-adjustment capabilities of interleaved methods with respect to fault tolerance and hallucination risk.

Key Concepts

  • Decomposition-First Planning: This approach involves explicitly instructing the LLM to break down a given task into multiple sub-tasks at the initial stage before execution begins. It establishes dependencies between tasks, as seen in HuggingGPT, which requires the LLM to structure the workflow prior to any model selection or response generation. The primary technical motivation is to create a stronger correlation between the sub-tasks and the original objectives, thereby reducing the risk of task forgetting and hallucinations during execution.
  • Interleaved Decomposition: Contrasting with decomposition-first methods, this concept interleaves task decomposition with sub-task planning, revealing only one or two sub-tasks at the current state. This allows for dynamic adjustment based on environmental feedback, improving fault tolerance by permitting corrections during the trajectory. However, excessively long trajectories in this mode may lead the LLM to deviate from the original goals due to hallucinations.
  • Zero-shot Chain-of-Thought (Zero-shot CoT): An evolution of the standard Chain-of-Thought series, this method unlocks the LLM’s zero-shot reasoning abilities through the specific instruction prompt “Let’s think step by step.” Plan-and-Solve subsequently refines this into a two-step instruction, “Let’s first devise a plan” followed by “Let’s carry out the plan,” explicitly separating planning from execution. These techniques achieve measurable improvements in mathematical reasoning, common-sense reasoning, and symbolic reasoning without requiring few-shot trajectories.
  • ProgPrompt Task Formalization: This concept translates natural language descriptions of tasks into coding problems to symbolize the agent’s action space and objects within the environment. Specifically, each action is formalized as a function and each object is represented as a variable, which naturally transforms task planning into function generation. The execution process involves generating a plan in the form of function callings and then executing them step by step via the code interpreter.
  • ReAct Decoupling: The ReAct framework decouples reasoning and planning into distinct operational phases, alternating between a reasoning stage (Thought step) and a planning stage (Action step). This structural decoupling demonstrates significant improvements in the planning capabilities of the agent compared to methods that embed reasoning strictly within the planning process. It enables the utilization of external tools or models, such as the visual models in Visual ChatGPT, by structuring the interaction as a sequence of thoughts and actions.
  • Program-of-Thought (PoT): This method completely formalizes the reasoning process as programming tasks, leveraging a Codex model trained on code-related data to enhance performance on mathematical and financial problems. Unlike methods that use code as an auxiliary tool, PoT relies on the formal structure of programming to manage the logic required for complex problem-solving. It ensures that the reasoning is constrained by the syntactic correctness of the code generated by the agent.
  • PAL (Program-Aided Language Models): PAL improves upon standard Chain-of-Thought by leveraging the LLM’s coding abilities during the reasoning phase. It guides the LLM to generate code, which is then executed by a code interpreter, such as Python, to obtain the final solution. This hybrid approach proves particularly helpful for agents solving mathematical and symbolic reasoning problems where natural language reasoning alone is insufficient.
  • Self-consistency Sampling: This multi-plan generation strategy employs the intuition that solutions for complex problems are rarely unique. Instead of generating a single reasoning path, Self-consistency obtains multiple distinct reasoning paths via sampling strategies embodied in the decoding process, such as temperature sampling or top-k sampling. This method constructs a candidate plan set by generating on the order of a dozen paths that comprise the final selection pool.
  • Tree-of-Thought (ToT): ToT proposes two specific strategies to generate plans, referred to as thoughts, within a search structure. It includes a sample strategy, which is consistent with Self-consistency, where the LLM samples multiple plans during the decoding process. It also includes a propose strategy, in which the LLM is explicitly prompted to propose several distinct candidate thoughts in one pass, allowing for a more structured exploration of the planning space than sampling alone.
  • Context Length Constraints: A significant conceptual challenge identified is the limitation imposed by the context length of the LLM when tasks are decomposed into dozens of sub-tasks. This condition leads to the forgetting of planning trajectories, as the model cannot retain the full history of the decomposition and sub-planning process within its window. This constraint dictates the feasibility of long-horizon planning strategies in current LLM architectures.
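The ProgPrompt-style task-to-function mapping described above can be sketched in a few lines. This is a minimal illustrative example, not the paper's implementation: the action names (`grab`, `put_in`), objects, and plan are hypothetical stand-ins for what an LLM would generate.

```python
# Hypothetical ProgPrompt-style sketch: each action is formalized as a
# function, each object as a variable, and the LLM's "plan" is a list of
# function callings executed step by step. All names are illustrative.

world_state = {"apple": "table", "basket": "floor"}  # object -> location

def grab(obj):
    """Action: pick up an object (formalized as a function)."""
    world_state[obj] = "hand"

def put_in(obj, container):
    """Action: place a currently held object into a container."""
    assert world_state[obj] == "hand", f"{obj} must be held first"
    world_state[obj] = container

# A generated plan in the form of function callings:
plan = [("grab", ("apple",)), ("put_in", ("apple", "basket"))]
actions = {"grab": grab, "put_in": put_in}

for name, args in plan:
    actions[name](*args)  # step-by-step execution by the interpreter

print(world_state["apple"])  # -> basket
```

Because the plan is code, malformed steps (e.g. placing an object that was never grabbed) fail at execution time rather than silently, which is one motivation for symbolizing the action space.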

Key Equations and Algorithms

  • Plan-and-Solve Instruction Sequence: The algorithm transforms the original “Let’s think step-by-step” prompt into a specific two-step instruction: “Let’s first devise a plan” and “Let’s carry out the plan”. This procedure guides the LLM to explicitly separate the planning phase from the execution phase before generating the final response. It serves as a zero-shot approach to improve performance in reasoning tasks.
  • ReAct Thought-Action Loop: The algorithm alternates between reasoning (Thought step) and planning (Action step) in a sequential loop. In the Thought step, the LLM reasons about the current state; in the Action step, the LLM generates a plan or utilizes a tool. This loop continues until the task is completed, decoupling the cognitive reasoning from the immediate action execution.
  • PAL Code Generation and Execution: The procedure guides the LLM to generate code during the reasoning process rather than outputting natural language answers directly. Subsequently, a code interpreter executes the generated code to obtain the solution. This algorithmic separation ensures that arithmetic or symbolic execution is handled by the interpreter rather than by the LLM’s prediction distribution.
  • Self-consistency Decoding Sampling: The algorithm employs sampling strategies embodied in the decoding process, such as temperature sampling and top-k sampling, to obtain multiple distinct reasoning paths. These paths are generated in parallel to form a candidate plan set for subsequent selection. The computational complexity is increased by the need to generate multiple paths, but the selection accuracy is theoretically improved by leveraging the consistency among them.
  • Task-to-Function Mapping (ProgPrompt): The algorithm symbolizes the agent’s action space and objects in the environment through code, with each action formalized as a function and each object represented as a variable. Consequently, task planning is transformed into function generation, where the agent generates a plan in the form of function callings. Execution follows by processing these function callings step by step within the defined code environment.
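The Self-consistency procedure above can be sketched as "sample several paths, then vote." In this minimal sketch, `sample_path` is a hypothetical stand-in for one temperature-sampled reasoning chain from an LLM, and plan selection is reduced to a simple majority vote over final answers.

```python
# Minimal self-consistency sketch. A real system would draw N chains from
# an LLM under temperature / top-k decoding; sample_path() is a stand-in
# that returns a (reasoning, answer) pair with illustrative answers.
import random
from collections import Counter

def sample_path(rng):
    """Stand-in for one stochastically decoded reasoning path."""
    # Most sampled paths land on one answer; a minority diverge.
    answer = rng.choices(["18", "18", "18", "24"], k=1)[0]
    return f"... step-by-step reasoning -> {answer}", answer

def self_consistency(n_paths=10, seed=0):
    """Generate a candidate set of paths, then pick the majority answer."""
    rng = random.Random(seed)
    answers = [sample_path(rng)[1] for _ in range(n_paths)]
    return Counter(answers).most_common(1)[0][0]

print(self_consistency())  # majority answer across the sampled paths
```

The vote is the simplest possible "optimal plan selection"; the survey's later sections cover richer selection mechanisms such as tree search.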

Key Claims and Findings

  • Interleaved decomposition methods improve fault tolerance by dynamically adjusting decomposition based on environmental feedback, unlike decomposition-first methods which are rigid. However, this dynamic nature carries the risk that excessively long trajectories may lead to the LLM deviating from original goals during subsequent sub-tasks.
  • Decomposition-first methods reduce the risk of task forgetting and hallucinations by creating a stronger correlation between the sub-tasks and the original tasks, but they require additional mechanisms for adjustment if errors occur. If a predetermined step fails in a decomposition-first framework, the entire plan may fail without built-in correction mechanisms.
  • Task decomposition introduces additional overhead in terms of reasoning and generation, incurring additional time and computational costs during the planning phase. For highly complex tasks decomposed into dozens of sub-tasks, the planning is strictly constrained by the context length of the LLM.
  • The introduction of Chain-of-Thought (CoT) reveals the few-shot learning capabilities of the LLM, while Zero-shot CoT unlocks zero-shot reasoning abilities without the need for constructed trajectories. Zero-shot CoT has demonstrated improvements across mathematical, common-sense, and symbolic reasoning domains.
  • Multi-plan generation is a necessary approach due to the inherent uncertainty of LLMs, as a single plan generated by an LLM is likely to be suboptimal or even infeasible. This necessitates a process comprising two major steps: multi-plan generation and optimal plan selection to navigate the diverse solution space effectively.
  • ReAct demonstrates significant improvements in planning capabilities by decoupling reasoning and planning, allowing the LLM to alternate between internal thought and external action. This structure enables agents like Visual ChatGPT to use the LLM as a controlling “brain” equipped with a series of visual models for image processing capabilities.
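The ReAct Thought/Action alternation claimed above can be sketched as a simple loop. Here `llm()` and `search()` are hypothetical stand-ins (a canned policy and a canned tool); a real agent would call a model and live tools, but the control flow is the same.

```python
# Sketch of a ReAct-style loop: reasoning stage (Thought step) alternates
# with planning stage (Action step) until a Finish step. llm() and
# search() are illustrative stand-ins, not real APIs.

def llm(history):
    """Stand-in policy: emits a Thought, then an Action, then finishes."""
    if not history:
        return "Thought: I should look up the population of Paris."
    if history[-1].startswith("Thought:"):
        return "Action: search[population of Paris]"
    return "Finish: about 2.1 million"

def search(query):
    """Stand-in tool returning a canned observation for any query."""
    return "Observation: Paris has roughly 2.1 million inhabitants."

history = []
while True:
    step = llm(history)
    history.append(step)
    if step.startswith("Finish:"):
        break
    if step.startswith("Action:"):
        # Planning stage: execute the tool, feed the observation back in.
        query = step[len("Action: search["):-1]
        history.append(search(query))

for line in history:
    print(line)
```

The resulting trajectory (Thought, Action, Observation, Finish) is exactly the decoupled structure that lets frameworks like Visual ChatGPT slot external models in at the Action step.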

Terminology

  • Decomposition-First: A planning paradigm where the initial task decomposition into sub-tasks is completed before any execution or sub-task planning begins. This method predetermines sub-tasks to ensure strong correlation with the original task but risks error propagation if no adjustment mechanism exists.
  • Interleaved Decomposition: A planning method where task decomposition and sub-task planning occur alternately, revealing only one or two sub-tasks at the current state. This allows for dynamic adjustments based on feedback but risks goal deviation over long trajectories due to hallucinations.
  • Sub-task: A component of a larger task that the LLM breaks down, which may involve selecting models, generating responses, or executing function callings. Dependencies between sub-tasks are explicitly provided in methods like HuggingGPT to ensure correct ordering.
  • Function Generation: The process in ProgPrompt where task planning is transformed into generating code functions rather than natural language steps. Each action in the environment is formalized as a function within this specific paradigm.
  • Decoding Process: The mechanism within the generative model where sampling strategies like temperature sampling or top-k sampling are applied to generate multiple distinct reasoning paths. This process is critical for Self-consistency methods to create a diverse candidate plan set.
  • Code Interpreter: An external tool, such as Python, used in PAL to execute generated code and obtain solutions. It is used to handle mathematical and symbolic reasoning problems that the LLM might otherwise solve incorrectly in natural language.
  • Hallucinations: Errors where the LLM generates content that is factually incorrect or deviates from the original goal, identified as a risk in both decomposition-first (due to lack of adjustment) and interleaved methods (due to long trajectories).
  • Thought Step: The specific stage in the ReAct framework where the agent engages in reasoning before generating an action. It is decoupled from the Action step to improve the overall planning capability of the agent.
  • Action Step: The specific stage in the ReAct framework where the agent generates a plan or utilizes a tool following a Thought step. It represents the execution component of the interleaved reasoning and planning cycle.
  • Top-k Sampling: A decoding strategy mentioned in Self-consistency where the model samples from the k most probable tokens during the generation process. This allows for the generation of multiple distinct reasoning paths from a single model instance.
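Top-k sampling itself is simple enough to sketch over a toy next-token distribution. The probabilities below are illustrative, not drawn from any model.

```python
# Minimal top-k sampling: keep only the k most probable tokens,
# renormalize their probabilities, and sample from that truncated set.
# This is the decoding step Self-consistency uses to draw distinct paths.
import random

def top_k_sample(probs, k, rng):
    """Sample one token from the k most probable entries of probs."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    tokens = [t for t, _ in top]
    weights = [p / total for _, p in top]  # renormalized distribution
    return rng.choices(tokens, weights=weights, k=1)[0]

probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.05}
rng = random.Random(0)
samples = [top_k_sample(probs, k=2, rng=rng) for _ in range(5)]
print(samples)  # only tokens from the top-2 set {"the", "a"} appear
```

With k equal to the vocabulary size this reduces to plain sampling; smaller k trades diversity for coherence, which is why it pairs naturally with generating a dozen candidate reasoning paths.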