Sequence-to-sequence vision-language models are showing promise, but their applicability is limited by their inference latency, which stems from generating predictions autoregressively. We propose a sequence-to-sequence vision-language model with a flexible hypothesis space, manifest in the training set and encoded in a layer of learnable query tokens. The architecture is trained with a novel loss, inspired by the language domain, that marginalizes over multiple inference paths in the decoder. This gives us the flexibility to adapt the hypothesis space to the task, rather than restricting it to the embedding of a single token as in an autoregressive model. The resulting model, NARVL, achieves performance on par with its autoregressive counterpart, but is faster at inference time since the decoder needs to be executed only once to produce all output tokens jointly, rather than sequentially, one token at a time. We test our model on four vision-language tasks, and perform ablation studies to single out the contribution of each component.
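The following is a minimal sketch, not the authors' implementation, of the two ideas the abstract describes: a layer of learnable query tokens that lets the decoder emit all output positions in a single forward pass, and a loss that marginalizes over multiple inference paths, here illustrated with CTC as one plausible instantiation of such a loss from the language domain. All names, sizes, and the use of PyTorch's generic transformer modules are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class NonAutoregressiveDecoder(nn.Module):
    """Illustrative non-autoregressive decoder with learnable query tokens."""
    def __init__(self, vocab_size, d_model=512, num_queries=64, num_layers=6):
        super().__init__()
        # Learnable query tokens define a fixed-length hypothesis space.
        self.queries = nn.Parameter(torch.randn(num_queries, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, vocab_size)  # vocab assumed to reserve index 0 for blank

    def forward(self, encoder_features):
        # encoder_features: (batch, src_len, d_model) from a vision-language encoder.
        batch = encoder_features.size(0)
        queries = self.queries.unsqueeze(0).expand(batch, -1, -1)
        # A single decoder pass produces logits for every output position jointly,
        # instead of one token per decoding step.
        hidden = self.decoder(tgt=queries, memory=encoder_features)
        return self.head(hidden)  # (batch, num_queries, vocab_size)

def training_step(model, encoder_features, targets, target_lengths):
    # CTC sums over all monotonic alignments between the num_queries output
    # positions and the (shorter) target sequence, i.e. it marginalizes over
    # multiple inference paths in the decoder output.
    logits = model(encoder_features)                    # (B, Q, V)
    log_probs = logits.log_softmax(-1).transpose(0, 1)  # (Q, B, V), as nn.CTCLoss expects
    input_lengths = torch.full((logits.size(0),), logits.size(1), dtype=torch.long)
    ctc = nn.CTCLoss(blank=0, zero_infinity=True)
    return ctc(log_probs, targets, input_lengths, target_lengths)
```

At inference time, under this sketch, one forward pass followed by a per-position argmax and collapsing of blanks/repeats yields the full output sequence, which is where the latency advantage over step-by-step autoregressive decoding comes from.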