Panoramic viewport prediction is crucial for 360-degree video streaming: forecasting a user's future viewing region enables efficient bandwidth management. To improve prediction accuracy, existing frameworks exploit multi-modal inputs that combine trajectory, visual, and audio data. However, they process every modality through the same standardized pipeline and fuse features by concatenation, regardless of modality characteristics. Because this uniform design applies computationally intensive Transformer architectures without modification, it incurs substantial computational overhead. Moreover, concatenation-based fusion cannot model global dependencies or explicit interactions between modalities, which limits prediction accuracy. To overcome these issues, we introduce a lightweight Modality Diversity-Aware (MDA) framework with two primary components: a lightweight feature refinement module and a cross-modal attention module. The feature refinement module uses compact latent tokens to sequentially process audio-visual data, filtering out irrelevant background signals and reducing model parameters. The cross-modal attention module then fuses trajectory features with the refined audio-visual features by allocating attention weights to the most informative features, improving prediction accuracy. Experimental results on a standard 360-degree video benchmark show that our MDA framework achieves higher prediction accuracy than current multi-modal frameworks while requiring up to 50% fewer parameters.
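To make the fusion idea concrete, the following is a minimal NumPy sketch of cross-modal attention in which trajectory features act as queries over a small set of refined audio-visual latent tokens. All dimensions, token counts, projection matrices, and function names here are illustrative assumptions for exposition, not the paper's actual configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(traj, av_tokens, d_k=32, rng=None):
    """Fuse trajectory features with audio-visual latent tokens.

    traj:      (T, D) trajectory features, used as attention queries.
    av_tokens: (K, D) compact refined audio-visual tokens, used as
               keys and values (K is small, which keeps cost low).
    Returns (T, d_k) fused features and the (T, K) attention weights.
    Projection weights are random here purely for illustration; in a
    trained model they would be learned parameters.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    D = traj.shape[1]
    Wq = rng.standard_normal((D, d_k)) / np.sqrt(D)
    Wk = rng.standard_normal((D, d_k)) / np.sqrt(D)
    Wv = rng.standard_normal((D, d_k)) / np.sqrt(D)
    Q, K_, V = traj @ Wq, av_tokens @ Wk, av_tokens @ Wv
    # Scaled dot-product attention: each trajectory step allocates
    # weight across the K latent tokens, then mixes their values.
    weights = softmax(Q @ K_.T / np.sqrt(d_k))
    return weights @ V, weights

# Toy example: 8 head-position steps, 4 latent audio-visual tokens.
traj = np.random.default_rng(1).standard_normal((8, 64))
tokens = np.random.default_rng(2).standard_normal((4, 64))
fused, weights = cross_modal_attention(traj, tokens)
print(fused.shape, weights.shape)  # (8, 32) (8, 4)
```

Because the number of latent tokens K is much smaller than a full spatial feature map, the attention matrix is only T x K, which is one way such a design can stay lightweight relative to full self-attention over raw audio-visual features.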