A paper from a week ago found that models trained on multiple data modalities perform an order of magnitude better than text-only models of the same or even larger size.
Some of these large models can do zero-shot learning and perform tasks they weren't explicitly trained on, since the training objective is very general.
Being able to perform more advanced kinds of zero-shot tasks would make these models directly comparable, and accuracy on those tasks can be evaluated.
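For a concrete picture of what zero-shot use looks like, here's a minimal sketch using Hugging Face's zero-shot-classification pipeline (the input text and candidate labels are made-up examples; the underlying model was trained on natural language inference, not on this classification task):

```python
from transformers import pipeline

# A model trained on a general objective (entailment) can be repurposed
# at inference time to classify arbitrary labels it never saw in training.
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

result = classifier(
    "The new GPU doubles throughput on transformer inference.",
    candidate_labels=["hardware", "cooking", "politics"],
)
# Labels come back sorted by score, so the top one is the prediction.
print(result["labels"][0], round(result["scores"][0], 3))
```

Because the task is posed at inference time rather than baked into training, you can swap in any label set and measure accuracy on it directly.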
The next big step for coding LLMs will be context window increases. Leaked docs have OpenAI pricing for up to 16K tokens, I believe, which is 4x the current maximum. At that size you're talking "write a class" instead of "write this line, and maybe sometimes a method."
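To make those window sizes concrete, here's a rough sketch of checking whether a prompt fits, using OpenAI's tiktoken tokenizer. The 16K limit and the reply budget are assumptions for illustration, not confirmed specs:

```python
import tiktoken

# cl100k_base is the encoding used by recent OpenAI chat models.
enc = tiktoken.get_encoding("cl100k_base")

def fits_in_context(prompt: str, limit: int = 16_384) -> bool:
    """Check whether a prompt fits in a hypothetical 16K-token window,
    reserving room for the model's reply."""
    reply_budget = 1_024  # tokens kept free for the completion (assumption)
    return len(enc.encode(prompt)) + reply_budget <= limit

# A fake "source file" long enough to matter; a real one would be read from disk.
prompt = "Refactor this into a class:\n" + "\n".join(["def f(): pass"] * 500)
print(fits_in_context(prompt))
```

The practical point: at 4K tokens a whole module rarely fits alongside instructions, while at 16K you can hand the model an entire class plus context.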
Are you referring to PaLM-E? It didn't show any positive transfer to NLP tasks; in fact, the unfrozen model performed slightly worse after the fine-tune.
That being said, PaLM-E wasn't really a multimodal model from the start; it's still basically a text model with a visual encoder glued on top. Whether a truly multimodal model will be better at reasoning and data efficiency is still an open question, though.
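For what "glued on top" means in practice: roughly, features from a pretrained vision encoder get projected into the language model's token embedding space and prepended to the text tokens. A toy PyTorch sketch of that pattern, with all names and dimensions made up for illustration (this is the general shape, not PaLM-E's actual code):

```python
import torch
import torch.nn as nn

class GluedMultimodalLM(nn.Module):
    """Toy 'vision model glued onto a text model': the only new trained
    piece is a linear projection into the LM's embedding space."""

    def __init__(self, vision_encoder, language_model, vis_dim=768, lm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder  # pretrained, often frozen
        self.language_model = language_model  # pretrained text-only LM
        # The "glue": map vision features to pseudo-token embeddings.
        self.projector = nn.Linear(vis_dim, lm_dim)

    def forward(self, image, text_embeds):
        # (batch, num_patches, vis_dim) -> (batch, num_patches, lm_dim)
        image_tokens = self.projector(self.vision_encoder(image))
        # Prepend image "tokens"; the LM treats them like ordinary
        # token embeddings (HF-style inputs_embeds kwarg assumed here).
        inputs = torch.cat([image_tokens, text_embeds], dim=1)
        return self.language_model(inputs_embeds=inputs)
```

The text model's weights were never trained jointly with vision from scratch, which is part of why transfer back to pure NLP tasks isn't guaranteed.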