Beam Search Hyperparameters



| Hyperparameter | Value |
| --- | --- |
| Beam width | 256 |
| Planning horizon | 10 |
| Vocabulary size | 100 |
| Context size [number of $(\mathbf{s}, \mathbf{a}, r, V)$ tuples] | 5 |
| $k_\text{obs}$ [top-k tokens from which observations are sampled] | 1 |
| $k_\text{act}$ [top-k tokens from which actions are sampled] | 20 |


Beam width and context size are standard hyperparameters for decoding Transformer language models. Planning horizon is a standard trajectory optimization hyperparameter. With a vocabulary of 100 tokens, $k_\text{act} = 20$ means that actions are sampled from the 20 most likely action tokens (the top $20\%$), and $k_\text{obs} = 1$ means that next observations are decoded greedily, conditioned on previous observations and actions.
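
As an illustration of how $k_\text{obs}$ and $k_\text{act}$ act on a single decoding step, the following is a minimal sketch of top-k token sampling. It is not the authors' implementation: the function name `sample_top_k`, the constants, and the random logits standing in for model output are all hypothetical, and renormalizing with a softmax over the retained tokens is an assumption about how sampling within the top-k set is done.

```python
import numpy as np

# Values taken from the hyperparameter table; the names are hypothetical.
VOCAB_SIZE = 100
K_OBS = 1    # observations: top-1, i.e. greedy decoding
K_ACT = 20   # actions: sample among the 20 most likely tokens


def sample_top_k(logits, k, rng):
    """Sample one token id from the k highest-logit tokens.

    Assumption: probabilities are renormalized with a softmax over
    the k retained logits before sampling.
    """
    logits = np.asarray(logits, dtype=np.float64)
    # Indices of the k largest logits (unordered within the top-k set).
    top_ids = np.argpartition(logits, -k)[-k:]
    top_logits = logits[top_ids]
    # Numerically stable softmax restricted to the retained tokens.
    probs = np.exp(top_logits - top_logits.max())
    probs /= probs.sum()
    return int(rng.choice(top_ids, p=probs))


rng = np.random.default_rng(0)
# Random logits stand in for the model's output over one token position.
logits = rng.normal(size=VOCAB_SIZE)

obs_token = sample_top_k(logits, K_OBS, rng)  # always the argmax token
act_token = sample_top_k(logits, K_ACT, rng)  # one of the top-20 tokens
```

With `k = 1` the retained set contains only the most likely token, so the call reduces to greedy decoding; with `k = 20` and a vocabulary of 100, the sample is restricted to the top $20\%$ of tokens.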