Humans can easily solve multimodal tasks in context, with only a few demonstrations or simple instructions, an ability that current multimodal systems largely struggle to imitate. In this work, we demonstrate that effectively scaling up generative multimodal models significantly enhances their task-agnostic in-context learning capabilities. We introduce Emu2, a generative multimodal model with 37 billion parameters, which serves as a base model and general-purpose interface for a variety of multimodal tasks. Emu2 not only achieves strong performance in the few-shot setting, but can also be instruction-tuned to follow instructions for specific tasks such as visual question answering and object-grounded image generation. Emu2 even exhibits emergent abilities to solve tasks that require on-the-fly reasoning, such as visual prompting, which existing models are unlikely to handle. We identify tasks where Emu2's in-context learning can still improve, and discuss its broader societal impact. Our code and models will be made publicly available to facilitate future research.