Single-pixel imaging (SPI) is a computational imaging technique that has emerged in recent years. It uses a spatial light modulator (SLM) to modulate the light distribution and a single-pixel detector (SPD) to record the total reflected or transmitted light intensity, from which a 2- or 3-dimensional object is reconstructed. Compared with conventional array detectors, SPI offers low cost, a wide detection range, and high sensitivity. However, SPI requires multiple projections to obtain spatial resolution, and the imaging time and quality are linearly related to the number of detections, which largely restricts its use in real-time applications. The introduction of deep learning has significantly improved the imaging quality and speed of SPI. How to further improve the interpretability and performance of deep learning while reducing the computational workload, especially for large-scale imaging, remains an open problem. In this paper, we introduce a novel 2-D modulation method for large-scale SPI. Specifically, we use the properties of the Kronecker product to decompose the large-scale sampling matrix into two much smaller matrices for the initialization of deep learning, which improves the training speed and reduces GPU memory usage. In addition, we propose a cross-stage multi-scale deep unfolding network (DUN) with dual-scale attention (DSA) for SPI reconstruction. The cross-stage multi-scale design of the DUN ensures that deep features are extracted and adequately transferred among stages. Inspired by the multi-scale Transformer, the DSA is introduced into the DUN to capture multi-frequency features for further denoising. Finally, we demonstrate the feasibility and effectiveness of the proposed method with both simulation and real experimental results.
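The memory saving from a Kronecker-structured sampling matrix can be illustrated with a minimal NumPy sketch. It is not the implementation used in this paper: the matrix names `A` and `B`, the Gaussian entries, and the sizes are all assumptions chosen for illustration. The sketch relies only on the standard identity that, for row-major vectorization, applying the Kronecker product of `A` and `B` to a flattened image equals flattening `A @ X @ B.T`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a 32x32 image, 8 measurements along each axis.
n1, n2 = 32, 32
m1, m2 = 8, 8

# Two small factor matrices (assumed Gaussian here, purely for illustration).
A = rng.standard_normal((m1, n1))
B = rng.standard_normal((m2, n2))
X = rng.standard_normal((n1, n2))  # stand-in for the object image

# Large-scale route: build the full sampling matrix explicitly.
Phi = np.kron(A, B)                # shape (m1*m2, n1*n2)
y_full = Phi @ X.reshape(-1)       # row-major flattening of the image

# Factored route: the same measurements without ever forming Phi.
Y_sep = A @ X @ B.T                # shape (m1, m2)

max_abs_diff = np.max(np.abs(y_full - Y_sep.reshape(-1)))
full_params = Phi.size             # entries stored by the full matrix
factored_params = A.size + B.size  # entries stored by the two factors

print(f"max |difference|   : {max_abs_diff:.2e}")
print(f"full matrix entries: {full_params}")
print(f"factored entries   : {factored_params}")
```

Under these assumed sizes the full matrix stores 65,536 entries while the two factors together store 512, and the gap widens as the image grows, which is consistent with the reduction in GPU memory described above.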