Social Activity Forecasting: A Foundational Study of Multi-Level Forecasting in Robot-Centric Social Scenes
MARCHIORO, PIERLUIGI
2024/2025
Abstract
This thesis introduces Social Activity Forecasting (SAF), a multi-level forecasting task in which future activities are predicted at a fixed horizon for heterogeneous social entities (individuals, pairwise interactions, and groups) in robot-centric panoramic scenes. SAF is formulated as a multi-label, multi-entity forecasting problem with level-specific label spaces, and is instantiated on JRDB-Social under multiple settings that vary observation length and forecast horizon. To address the task, the thesis proposes MMSAFNet, a modular multimodal architecture that supports heterogeneous inputs (including RGB, current activity vectors, textual embeddings, and spatial cues), configurable fusion mechanisms, cross-level information sharing, and alternative latent dynamics modules, enabling controlled architectural and modality ablations. A dedicated evaluation protocol is also introduced, combining per-level samplewise precision/recall/F1 with subset-based reporting over overall, action change, and action unchanged entities, and mean average precision (mAP) for tail-sensitive analysis. Experiments across multiple dataset variants show that persistence-based baselines are highly competitive on the overall subset due to strong dataset priors, while learned multimodal models provide stronger evidence of non-trivial forecasting on the action change subset, especially in tail-sensitive metrics. Results further indicate that modality choice has a larger impact than the explored architectural toggles, with semantic modalities (activity vectors and textual embeddings) providing the strongest gains in the current benchmark regime. Overall, the thesis provides a foundational task formulation, modular baseline architecture, and evaluation methodology for multi-level social activity forecasting in robot-centric scenes.
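As a rough illustration of the subset-based reporting mentioned in the abstract, the sketch below scores a persistence baseline (predicting that each entity's future label set equals its current one) with samplewise precision/recall/F1 over the overall, action change, and action unchanged subsets. The label names, toy data, function names, and the treatment of empty label sets are all assumptions made for this example; the exact definitions used in the thesis may differ.

```python
from statistics import mean


def sample_prf(pred, true):
    """Precision, recall, and F1 for one entity's predicted vs. true label set."""
    tp = len(pred & true)
    precision = tp / len(pred) if pred else 0.0  # assumption: empty set scores 0
    recall = tp / len(true) if true else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def samplewise_scores(preds, trues):
    """Average the per-entity (precision, recall, F1) triples; None if no entities."""
    if not preds:
        return None
    rows = [sample_prf(p, t) for p, t in zip(preds, trues)]
    return tuple(mean(col) for col in zip(*rows))


def subset_report(current, future, preds):
    """Report samplewise scores on overall / action change / action unchanged."""
    n = len(current)
    changed = [i for i in range(n) if current[i] != future[i]]
    unchanged = [i for i in range(n) if current[i] == future[i]]
    subsets = {
        "overall": list(range(n)),
        "action_change": changed,
        "action_unchanged": unchanged,
    }
    return {
        name: samplewise_scores([preds[i] for i in idx], [future[i] for i in idx])
        for name, idx in subsets.items()
    }


# Hypothetical toy data: one multi-label set per entity (labels are made up).
current = [{"walking"}, {"standing", "talking"}, {"sitting"}, {"walking"}]
future = [{"walking"}, {"standing", "talking"}, {"standing"}, {"walking", "talking"}]

# Persistence baseline: forecast the current labels unchanged.
persistence_preds = current

report = subset_report(current, future, persistence_preds)
for name, scores in report.items():
    print(name, scores)
```

In this toy setup the persistence baseline is perfect on the action unchanged subset by construction, so any difference between forecasting methods can only show up on the action change subset. That is consistent with the abstract's point that the overall subset alone can make persistence look highly competitive.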
| File | Size | Format | Access |
|---|---|---|---|
| MSc_Thesis_2025___Pierluigi_Marchioro (PDF-A).pdf | 6.04 MB | Adobe PDF | Open access |
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14247/28247