Existing physics-based motion generation methods rely heavily on imitation learning and reward shaping to achieve high motion quality and interactivity; this reliance hinders their ability to generalize to unseen scenarios. To address this limitation, we propose AnySkill, a novel hierarchical method that learns physically plausible interactions following open-vocabulary instructions. Our approach first constructs a repertoire of atomic actions by training a low-level controller via imitation learning. Then, given an open-vocabulary text instruction, we train a high-level policy that assembles appropriate atomic actions to maximize the CLIP similarity between rendered images of the agent and the text instruction. Because our high-level policy uses an image-based reward, the agent naturally learns to interact with objects without hand-crafted reward engineering. We show that AnySkill produces reasonable and natural motion sequences in response to unseen captions of varying lengths, making it the first method to perform open-vocabulary physical skill learning for interactive humanoid agents.
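To make the image-based reward concrete, the following is a minimal sketch (not the authors' released code) of how the CLIP similarity between a rendered frame of the agent and the open-vocabulary instruction could be computed as a reward signal for the high-level policy. It assumes the OpenAI CLIP package and PyTorch; `render_agent_frame` is a hypothetical placeholder for the simulator's rendering call.

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def clip_reward(frame: Image.Image, instruction: str) -> float:
    """Cosine similarity between a rendered agent frame and the text instruction."""
    image = preprocess(frame).unsqueeze(0).to(device)
    text = clip.tokenize([instruction]).to(device)
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(text)
    # Normalize both embeddings so the dot product is a cosine similarity.
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return (img_feat @ txt_feat.T).item()

# Example usage (hypothetical simulator call):
# frame = render_agent_frame(env)
# reward = clip_reward(frame, "pick up the box and walk forward")
```

Because the reward is computed purely from rendered images and text, the same mechanism applies to object interactions without any task-specific reward terms, which is the property the abstract highlights.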