Scaled relative pose estimation, i.e., estimating the relative rotation and scaled relative translation between two images, has long been a major challenge in global Structure-from-Motion (SfM). The difficulty arises because the two-view relative translation computed by traditional geometric vision methods, e.g., the five-point algorithm, is scaleless. Many researchers have proposed diverse translation averaging methods to address this problem. Instead of resolving scale in the motion averaging phase, we focus on estimating the scaled relative pose directly, with the help of panoramic cameras and deep neural networks. In this paper, we propose a novel network, PanoPose, which estimates relative motion in a fully self-supervised manner, and we build a global SfM pipeline for panoramic images. PanoPose comprises a depth-net and a pose-net; self-supervision is achieved by reconstructing the reference image from its neighboring images based on the estimated depth and relative pose. To maintain precise pose estimation under large viewing-angle differences, we randomly rotate the panoramic images and pre-train the pose-net on image pairs before and after rotation. To enhance scale accuracy, a fusion block is introduced to incorporate depth information into the pose estimation. Extensive experiments on panoramic SfM datasets demonstrate the effectiveness of PanoPose compared with state-of-the-art methods.
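The self-supervision signal described above can be sketched as a photometric reconstruction loss on equirectangular panoramas: back-project each reference pixel with the predicted depth, transform it by the predicted relative pose, re-project it into the neighboring view, and penalize the appearance difference. The following is a minimal NumPy illustration under assumptions of our own (equirectangular projection, nearest-neighbor sampling, and all function names are hypothetical simplifications, not the paper's implementation):

```python
import numpy as np

def equirect_to_rays(h, w):
    # Map each equirectangular pixel center to a unit direction vector.
    lon = (np.arange(w) + 0.5) / w * 2 * np.pi - np.pi   # longitude in [-pi, pi)
    lat = np.pi / 2 - (np.arange(h) + 0.5) / h * np.pi   # latitude in (-pi/2, pi/2)
    lon, lat = np.meshgrid(lon, lat)
    x = np.cos(lat) * np.sin(lon)
    y = -np.sin(lat)
    z = np.cos(lat) * np.cos(lon)
    return np.stack([x, y, z], axis=-1)                  # shape (h, w, 3)

def rays_to_equirect(pts, h, w):
    # Project 3-D points back to equirectangular pixel coordinates.
    x, y, z = pts[..., 0], pts[..., 1], pts[..., 2]
    lon = np.arctan2(x, z)
    lat = np.arcsin(np.clip(-y / np.linalg.norm(pts, axis=-1), -1.0, 1.0))
    u = (lon + np.pi) / (2 * np.pi) * w - 0.5
    v = (np.pi / 2 - lat) / np.pi * h - 0.5
    return u, v

def photometric_loss(ref, src, depth, R, t):
    # Warp the neighboring image `src` into the reference frame using the
    # estimated per-pixel depth and relative pose (R, t), then take the
    # mean L1 difference -- this is the quantity self-supervision minimizes.
    h, w = ref.shape[:2]
    pts = equirect_to_rays(h, w) * depth[..., None]      # back-project
    pts = pts @ R.T + t                                  # move into src frame
    u, v = rays_to_equirect(pts, h, w)
    ui = np.clip(np.round(u).astype(int), 0, w - 1)      # nearest-neighbor
    vi = np.clip(np.round(v).astype(int), 0, h - 1)      # sampling
    warped = src[vi, ui]
    return float(np.abs(ref - warped).mean())
```

As a sanity check, an identity pose with the source image equal to the reference yields zero loss; in training, the gradient of this loss with respect to depth and pose (computed with a differentiable sampler rather than the nearest-neighbor lookup used here) drives both networks.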