With the advent of generative models and vision-language pretraining, significant progress has been made in text-driven face manipulation. The text embedding can serve as target supervision for expression control; however, it is non-trivial to associate it with explicit 3D attributes, \ie, pose and illumination. To address this issue, we propose a Text-conditional Attribute aLignment approach for 3D controllable face image synthesis, referred to as TcALign. Specifically, since a 3D rendered image can be precisely controlled by its 3D face representation, we first propose a Text-conditional 3D Editor that produces the target face representation, realizing text-driven manipulation in the 3D space. We also introduce an attribute embedding space spanned by the target-related attribute embeddings to infer a disentangled, task-specific editing direction. Next, we train a cross-modal latent mapping network, conditioned on the derived difference of 3D representations, to infer a correction vector in the latent space of StyleGAN. This correction-vector learning design accurately transfers the attribute manipulation from the rendered 3D images to the 2D images. We show that the proposed method delivers more precise text-driven multi-attribute manipulation for 3D controllable face image synthesis. Extensive qualitative and quantitative experiments verify the effectiveness and superiority of our method over competing methods.
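To make the correction-vector idea concrete, the following is a minimal sketch, not the authors' actual implementation: it assumes a 3DMM-style coefficient vector for the 3D face representation and a 512-dimensional StyleGAN latent code, and the module name \texttt{LatentCorrectionMapper} and all dimensions are hypothetical. The mapper consumes the difference between the source and text-edited 3D representations and predicts only a residual added to the original latent code.

\begin{verbatim}
import torch
import torch.nn as nn

class LatentCorrectionMapper(nn.Module):
    """Hypothetical cross-modal mapper: takes the difference between the
    source and text-edited 3D face representations (e.g., 3DMM-style
    coefficients) and predicts a correction vector added to the StyleGAN
    latent code w."""
    def __init__(self, repr_dim=257, latent_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(repr_dim, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, latent_dim),
        )

    def forward(self, w, delta_repr):
        # Correction-vector design: only the residual in latent space is
        # learned, so attributes untouched by the 3D edit stay close to w.
        return w + self.net(delta_repr)

# Toy usage with assumed dimensions.
mapper = LatentCorrectionMapper()
w = torch.randn(4, 512)            # original StyleGAN latent codes
repr_src = torch.randn(4, 257)     # source 3D face representation
repr_tgt = torch.randn(4, 257)     # text-edited target 3D representation
w_edited = mapper(w, repr_tgt - repr_src)
print(w_edited.shape)              # torch.Size([4, 512])
\end{verbatim}

Predicting a residual rather than a full latent code is one plausible way to keep attributes not touched by the 3D edit close to the original image, consistent with the disentanglement goal stated above.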