Social Activity Forecasting: A Foundational Study of Multi-Level Forecasting in Robot-Centric Social Scenes

MARCHIORO, PIERLUIGI
2024/2025

Abstract

This thesis introduces Social Activity Forecasting (SAF), a multi-level forecasting task in which future activities are predicted at a fixed horizon for heterogeneous social entities (individuals, pairwise interactions, and groups) in robot-centric panoramic scenes. SAF is formulated as a multi-label, multi-entity forecasting problem with level-specific label spaces, and is instantiated on JRDB-Social under multiple settings that vary observation length and forecast horizon. To address the task, the thesis proposes MMSAFNet, a modular multimodal architecture that supports heterogeneous inputs (including RGB, current activity vectors, textual embeddings, and spatial cues), configurable fusion mechanisms, cross-level information sharing, and alternative latent dynamics modules, enabling controlled architectural and modality ablations. A dedicated evaluation protocol is also introduced, combining per-level samplewise precision/recall/F1 with subset-based reporting over overall, action change, and action unchanged entities, and mean average precision (mAP) for tail-sensitive analysis. Experiments across multiple dataset variants show that persistence-based baselines are highly competitive on the overall subset due to strong dataset priors, while learned multimodal models provide stronger evidence of non-trivial forecasting on the action change subset, especially in tail-sensitive metrics. Results further indicate that modality choice has a larger impact than the explored architectural toggles, with semantic modalities (activity vectors and textual embeddings) providing the strongest gains in the current benchmark regime. Overall, the thesis provides a foundational task formulation, modular baseline architecture, and evaluation methodology for multi-level social activity forecasting in robot-centric scenes.
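The persistence baseline and the samplewise precision/recall/F1 protocol mentioned above can be sketched as follows. This is a minimal illustration under assumed conventions (entities as rows of a multi-hot activity matrix, per-entity metrics averaged across entities), not the thesis implementation; all function names are hypothetical.

```python
import numpy as np

def persistence_forecast(current_activities: np.ndarray) -> np.ndarray:
    """Persistence baseline: predict that each entity's current
    multi-hot activity vector is unchanged at the forecast horizon."""
    return current_activities.copy()

def samplewise_prf1(y_true: np.ndarray, y_pred: np.ndarray):
    """Samplewise precision/recall/F1 for multi-label predictions:
    computed per entity (row), then averaged over entities.
    Rows with no predicted (resp. true) labels contribute 0."""
    tp = (y_true * y_pred).sum(axis=1).astype(float)
    pred_pos = y_pred.sum(axis=1)
    true_pos = y_true.sum(axis=1)
    prec = np.divide(tp, pred_pos, out=np.zeros_like(tp), where=pred_pos > 0)
    rec = np.divide(tp, true_pos, out=np.zeros_like(tp), where=true_pos > 0)
    denom = prec + rec
    f1 = np.divide(2 * prec * rec, denom, out=np.zeros_like(denom), where=denom > 0)
    return prec.mean(), rec.mean(), f1.mean()
```

By construction, the persistence baseline scores perfectly on entities whose activities do not change, which is why subset-based reporting over action change vs. action unchanged entities is needed to expose non-trivial forecasting ability.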
Files in this record:
MSc_Thesis_2025___Pierluigi_Marchioro (PDF-A).pdf — open access, 6.04 MB, Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14247/28247