Reconstructing CAD construction sequences from raw 3D geometry serves as an interface between real-world objects and digital designs. In this paper, we propose CAD-Diffuser, a multimodal diffusion scheme that integrates the top-down design paradigm into generative reconstruction. In particular, we unify CAD point clouds and CAD construction sequences at the token level, which guides the proposed multimodal diffusion strategy to link the geometry with the design intent encoded in construction sequences. To leverage the strong decoding abilities of language models, we model the forward process as a random walk between the original token and the [MASK] token, so that the reverse process naturally fits the masked token modeling scheme. We design a volume-based noise schedule that encourages outline-first generation, decomposing the top-down design methodology into a machine-understandable procedure. To tokenize multimodal CAD data, we introduce a point cloud tokenizer trained with a self-supervised face segmentation task, which compresses local and global geometric information, and we transform the CAD construction sequence into a primitive token string. Experimental results show that CAD-Diffuser perceives geometric details and produces results that human designers are more likely to reuse.
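As an illustration of the forward process and the volume-based schedule described above, the following is a minimal sketch rather than the paper's implementation. It assumes each construction token carries a normalized volume rank in [0, 1] and that larger-volume primitives are masked later in the forward process, so they are recovered earlier in the reverse process (outline first). The function names, the `sharpness` parameter, the power-law form of the schedule, and the token names are all hypothetical.

```python
import random

MASK = "[MASK]"


def mask_probability(t, volume_rank, sharpness=4.0):
    """Hypothetical schedule: probability that a token is masked at time t.

    t is the diffusion time in [0, 1]; volume_rank is in [0, 1], where 1 marks
    the largest-volume primitive. A larger rank raises the exponent, which
    lowers the probability at intermediate t, so such tokens stay visible longer.
    """
    return t ** (1.0 + sharpness * volume_rank)


def forward_mask(tokens, volume_ranks, t, rng=None):
    """Sample a corrupted sequence from the marginal q(x_t | x_0).

    Each token is independently either kept or replaced by [MASK], which is
    the two-state walk between the original token and the absorbing state.
    """
    rng = rng or random.Random()
    corrupted = []
    for token, rank in zip(tokens, volume_ranks):
        if rng.random() < mask_probability(t, rank):
            corrupted.append(MASK)
        else:
            corrupted.append(token)
    return corrupted


# Hypothetical primitive token string with assumed volume ranks.
tokens = ["EXTRUDE_BASE", "EXTRUDE_BOSS", "CUT_HOLE", "FILLET"]
volume_ranks = [1.0, 0.6, 0.2, 0.0]

clean = forward_mask(tokens, volume_ranks, t=0.0, rng=random.Random(0))
noisy = forward_mask(tokens, volume_ranks, t=0.5, rng=random.Random(0))
fully_masked = forward_mask(tokens, volume_ranks, t=1.0, rng=random.Random(0))
```

Under this assumed schedule, at t = 0.5 the smallest-volume token is masked with probability 0.5 while the largest-volume token is masked with probability 0.5^5, so a reverse process that unmasks from t = 1 down to t = 0 tends to commit to the coarse outline before the details.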