(Transformer variants)
A continuous variant of the Transformer architecture, with outputs parameterizing a Gaussian distribution, performed on par with a standard MLP dynamics model.
We found that the unmodified GPT architecture applied to discretized data yielded more accurate long-horizon predictions.
For details of the discretization methods used, see this page describing their implementation.
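As a rough illustration of what discretizing continuous data can look like, here is a minimal sketch of uniform per-dimension binning, one common scheme. The function names, signatures, and bin count below are illustrative assumptions, not the actual implementation referenced above.

```python
import numpy as np

def discretize(x, low, high, n_bins=100):
    """Map continuous values in [low, high] to integer tokens in [0, n_bins - 1]."""
    # Clip to the valid range, then scale into bin indices.
    x = np.clip(x, low, high)
    tokens = np.floor((x - low) / (high - low) * n_bins).astype(int)
    # Values exactly at `high` would map to n_bins; fold them into the last bin.
    return np.minimum(tokens, n_bins - 1)

def undiscretize(tokens, low, high, n_bins=100):
    """Map integer tokens back to the center of their bin."""
    return low + (tokens + 0.5) / n_bins * (high - low)
```

A GPT-style model can then operate on the resulting integer tokens exactly as it would on text, with `undiscretize` recovering approximate continuous values from predicted tokens.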