Beam Search Hyperparameters
| Hyperparameter | Value |
| --- | --- |
| Beam width | 256 |
| Planning horizon | 10 |
| Vocabulary size | 100 |
| Context size [number of $(\mathbf{s}, \mathbf{a}, r, V)$ tuples] | 5 |
| $k_\text{obs}$ [top-k tokens from which observations are sampled] | 1 |
| $k_\text{act}$ [top-k tokens from which actions are sampled] | 20 |
Beam width and context size are standard hyperparameters for decoding Transformer language models.
Planning horizon is a standard trajectory optimization hyperparameter.
With a vocabulary of 100 tokens, $k_\text{act} = 20$ means that actions are sampled from the most likely $20\%$ of action tokens, and $k_\text{obs} = 1$ means that next observations are decoded greedily, conditioned on previous observations and actions.
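
To make the roles of $k_\text{obs}$ and $k_\text{act}$ concrete, the following is a minimal sketch of top-k sampling over a single token distribution. It is an illustration under stated assumptions, not the implementation used for the experiments: the function name `sample_top_k`, the constants, and the randomly generated logits are all hypothetical stand-ins for the model's per-token outputs.

```python
import math
import random

def sample_top_k(logits, k, rng=random):
    """Sample a token index from the k highest-scoring entries of `logits`.

    Hypothetical helper for illustration; `logits` is a plain list of floats.
    """
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and len(logits)")
    # Indices of the k largest logits, highest first.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax restricted to the top-k entries, shifted by the max for stability.
    max_logit = logits[top[0]]
    weights = [math.exp(logits[i] - max_logit) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

# Values taken from the table above; the logits are random placeholders.
VOCAB_SIZE = 100
K_OBS = 1
K_ACT = 20

rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(VOCAB_SIZE)]

# With k = 1 the candidate set is a single token, so this is greedy decoding.
obs_token = sample_top_k(logits, K_OBS, rng)
# With k = 20 the token is drawn from the 20 most likely of 100 tokens.
act_token = sample_top_k(logits, K_ACT, rng)
```

Setting `k = 1` removes all randomness from observation decoding, while `k = 20` keeps some diversity among candidate actions without sampling from the low-probability tail.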