Beam Search Details
Description.
Beam search is a pruned breadth-first search algorithm.
At each step $t$, beam search constructs a candidate set $\mathcal{C}_t$ consisting of every length-$(t-1)$ sequence in the previous hypothesis set $Y_{t-1}$ extended by a single new token $y$ from the vocabulary $\mathcal{V}$.
The $B$ ("beam width") most likely sequences from the candidate set define the updated hypothesis set $Y_t$.
We use a beam width of $\boldsymbol{B=256}$. The size of the vocabulary is $\boldsymbol{|\mathcal{V}| = 100}$, corresponding to the number of bins used in the discretization procedure.$^1$
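For concreteness, the following is a minimal sketch of this pruned breadth-first procedure, assuming a hypothetical log_prob(seq, x) function that returns the model log-likelihood $\log P_\theta(\mathbf{y} \mid \mathbf{x})$ of a candidate token sequence; it illustrates the generic algorithm rather than the exact implementation used in our experiments.

\begin{verbatim}
import numpy as np

def beam_search(log_prob, x, vocab_size=100, beam_width=256, horizon=160):
    # Y_0 contains only the empty sequence.
    hypotheses = [()]
    for t in range(horizon):
        # Candidate set C_t: every current hypothesis extended by every
        # token in the vocabulary V.
        candidates = [seq + (token,) for seq in hypotheses
                      for token in range(vocab_size)]
        scores = np.array([log_prob(seq, x) for seq in candidates])
        # Keep the B most likely candidates; fewer are available at t = 1,
        # when |Y_0| * |V| < B (see footnote 1).
        keep = min(beam_width, len(candidates))
        hypotheses = [candidates[i] for i in np.argsort(scores)[-keep:]]
    # Return the most likely sequence in the final hypothesis set.
    return max(hypotheses, key=lambda seq: log_prob(seq, x))
\end{verbatim}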
Notation.
Following the presentation in Meister et al. (2020), we overload $\log P_\theta(\cdot \mid \mathbf{x})$ to denote the likelihood of a set of sequences in addition to that of a single sequence: $\log P_\theta(Y \mid \mathbf{x}) = \sum_{\mathbf{y} \in Y} \log P_\theta(\mathbf{y} \mid \mathbf{x})$.
We use $(\;)$ to denote the empty sequence and $\circ$ to represent the concatenation operation.
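Under this convention, the likelihood of a set is simply the sum of per-sequence log-likelihoods, as in the following sketch (again assuming the hypothetical log_prob interface from above):

\begin{verbatim}
def set_log_likelihood(log_prob, Y, x):
    # log P_theta(Y | x) = sum over y in Y of log P_theta(y | x)
    return sum(log_prob(y, x) for y in Y)
\end{verbatim}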
Planning with beam search.
Algorithm 1 is presented in an abstracted form so as to apply to a generic sequence generation problem.
To use beam search for control, we use the last $c$ transitions (the "context size") of the current trajectory as the input $\mathbf{x}$, corresponding to the most recent $c \cdot (N + M + 2)$ tokens.$^2$
Similarly, when evaluating the likelihood of a generated sequence $\mathbf{y}$, only the most recent $c$ transitions of the sequence are passed as input to the model $P_\theta$.
We use a context size of $\boldsymbol{c=5}$ and a planning horizon of $10$ transitions, corresponding to a generated sequence length of $\boldsymbol{T=10 \cdot (N + M + 2)}$ tokens.
For example, in the Hopper benchmark, $T = 10 \cdot (11 + 3 + 2) = 160$.
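The token bookkeeping implied by this setup can be summarized in a short sketch; the dimensions below use Hopper ($N=11$, $M=3$) purely as an example, and the trajectory is assumed to be a flat sequence of discretized tokens.

\begin{verbatim}
def planning_context(trajectory_tokens, c=5, N=11, M=3):
    # Each transition occupies N + M + 2 tokens: an N-dimensional state,
    # an M-dimensional action, a scalar reward, and a reward-to-go.
    tokens_per_transition = N + M + 2
    # The conditioning input x is the most recent c transitions.
    return trajectory_tokens[-c * tokens_per_transition:]

# A planning horizon of 10 transitions yields a generated sequence of
# T = 10 * (N + M + 2) tokens; for Hopper, T = 10 * (11 + 3 + 2) = 160.
\end{verbatim}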
$^1$Because the beam width is larger than the vocabulary size, at $t=1$ of Algorithm 1 we can fit all single tokens in the hypothesis set $Y_1$. At $t=2$, when there are $100^2$ options, we must begin pruning. To account for this edge case, the size constraint in line 4, $|Y| = B$, could be more precisely written as $|Y| = \min(B, |Y_{t-1}| \cdot |\mathcal{V}|)$, but we present the simpler version in the pseudocode description for conceptual clarity.
$^2$$N+M+2$ accounts for $N$-dimensional states, $M$-dimensional actions, and scalar rewards and rewards-to-go.