Recovering 3D human mesh from videos has recently made significant progress. However, most existing methods focus on the temporal consistency of videos while ignoring the spatial representation in complex scenes, and thus fail to recover a reasonable and smooth human mesh sequence under extreme illumination and cluttered backgrounds. To alleviate this problem, we propose a two-stage co-segmentation network based on discriminative representation for recovering human body meshes from videos. Specifically, the first stage of the network segments the video spatial domain to spotlight spatially fine-grained information, and then learns and enhances the intra-frame discriminative representation through a dual-excitation mechanism and a frequency domain enhancement module, while suppressing irrelevant information (e.g., background). The second stage focuses on temporal context by segmenting the video temporal domain, and models the inter-frame discriminative representation via a dynamic integration strategy. Furthermore, to efficiently generate reasonable and discriminative human motions, we carefully design a landmark anchor area loss to constrain the variation of the human motion area. Extensive experimental results on large publicly available datasets demonstrate the superiority of our method over most state-of-the-art approaches. Code will be made publicly available.
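To make the idea of constraining the variation of the human motion area concrete, the following is a minimal, hypothetical sketch (not the paper's implementation) of a loss that penalizes abrupt frame-to-frame changes in the bounding area spanned by predicted 2D landmarks; the tensor shape `(T, J, 2)` and the function name are illustrative assumptions.

```python
# Toy illustration of a landmark-area smoothness loss (assumed design,
# not the authors' landmark anchor area loss).
import torch

def landmark_anchor_area_loss(joints2d: torch.Tensor) -> torch.Tensor:
    """joints2d: (T, J, 2) per-frame 2D landmarks of one person.

    Penalizes the relative change of the landmark bounding-box area
    between consecutive frames, discouraging implausible jumps of the
    human motion area across the sequence.
    """
    mins = joints2d.min(dim=1).values            # (T, 2) per-frame min x/y
    maxs = joints2d.max(dim=1).values            # (T, 2) per-frame max x/y
    wh = (maxs - mins).clamp(min=1e-6)           # (T, 2) box width/height
    area = wh[:, 0] * wh[:, 1]                   # (T,)  box area per frame
    rel_change = (area[1:] - area[:-1]).abs() / area[:-1]
    return rel_change.mean()

# Usage sketch: joints2d predicted by the mesh-recovery model for T frames
# loss = landmark_anchor_area_loss(joints2d)
```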