Hyperparameters

Beam Search Hyperparameters

Hyperparameter	Value
Beam width	256
Planning horizon	10
Vocabulary size	100
Context size [number of $(\mathbf{s}, \mathbf{a}, r, V)$ tuples]	5
$k_\text{obs}$ [top-k tokens from which observations are sampled]	1
$k_\text{act}$ [top-k tokens from which actions are sampled]	20

Beam width and context size are standard hyperparameters for decoding Transformer language models. Planning horizon is a standard trajectory optimization hyperparameter. $k_\text{obs}$ and $k_\text{act}$ restrict sampling to the most likely tokens at each step. With a vocabulary of 100 tokens, $k_\text{act} = 20$ means that actions are sampled from the most likely $20\%$ of action tokens, and $k_\text{obs} = 1$ means that next observations are decoded greedily, conditioned on previous observations and actions.
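
The effect of $k_\text{obs}$ and $k_\text{act}$ can be illustrated with a minimal top-k sampling sketch. This is not the implementation used for the experiments: the function name `top_k_sample`, the random logits, and the choice to renormalize a softmax over the retained tokens are all assumptions made for illustration.

```python
import numpy as np

def top_k_sample(logits, k, rng):
    """Sample one token index from the k highest-scoring entries of `logits`.

    Hypothetical helper: assumes a softmax renormalized over the kept tokens.
    """
    logits = np.asarray(logits, dtype=np.float64)
    # Indices of the k largest logits (their relative order does not matter).
    top = np.argpartition(logits, -k)[-k:]
    # Softmax restricted to the retained tokens, shifted for numerical stability.
    shifted = logits[top] - logits[top].max()
    probs = np.exp(shifted)
    probs /= probs.sum()
    return int(rng.choice(top, p=probs))

rng = np.random.default_rng(0)
logits = rng.normal(size=100)  # stand-in logits for a vocabulary of 100 tokens

# k_obs = 1: only the single most likely token survives, i.e. greedy decoding.
obs_token = top_k_sample(logits, k=1, rng=rng)
# k_act = 20: sample among the 20 most likely tokens (20% of the vocabulary).
act_token = top_k_sample(logits, k=20, rng=rng)
```

With `k=1` the sampler always returns the argmax, which is why $k_\text{obs} = 1$ corresponds to greedy decoding of observations, while `k=20` keeps some diversity among candidate actions.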