Future Localization of Human Trajectories in Dynamic First-Person View Video
KUTTUBEK KYZY, GAUKHAR
2024/2025
Abstract
Autonomous robot navigation remains a fundamental yet challenging problem due to the complexity and variability of real-world environments. In human-populated settings, navigation systems must not only detect and track individuals but also anticipate their future trajectories to ensure safe interaction. Accurate human trajectory forecasting is therefore critical for deployment in crowded and dynamic scenes.

Human motion arises from non-linear interactions between people, shaped by both social behavior and environmental context. Work in this field falls into two settings: trajectory prediction approaches that adopt a bird’s-eye view (BEV), and real-world robotic systems that operate from a first-person view (FPV), where the central agent is part of the observed scene. A key distinction in FPV is the mobility of the central agent, the robot, which introduces global motion on top of the local motion of the other agents in the scene. The interplay between global and local motion complicates trajectory forecasting in FPV settings, creating what we call a dynamic scene.

This thesis introduces an approach for reducing the effects of global motion in FPV trajectory forecasting by transforming the problem into a form compatible with BEV approaches. We compensate for global motion by mapping FPV agent coordinates onto a static BEV world coordinate system. This transformation allows us to apply existing BEV-based models, originally developed for static environments, to dynamic scenes.

The proposed pipeline contains two main components. First, we implement a global motion compensation module based on affine transformations, which detects and exploits visual features from objects in the dynamic scene. We extract and match ORB keypoints between every pair of consecutive grayscale frames [13], then employ RANSAC to compute a robust affine matrix representing global camera motion, which we subsequently neutralize [13]. Second, we introduce Pixel2World, an autoencoder-style network with attention layers (sketched below) trained to project agent coordinates from FPV image space (pixels) to BEV world space (meters). The model combines geometric and visual features with camera calibration parameters to perform this transformation accurately.

We integrate these components into an end-to-end pipeline that processes FPV input frames in three steps: first, it compensates for global motion; then, it converts historical FPV coordinates into BEV world coordinates; and finally, it predicts future BEV world coordinates for each observed agent. The resulting trajectories can be fed directly into existing BEV-based forecasting models without retraining.

We evaluate the proposed pipeline on both the JRDB dataset [5], which contains FPV scenarios, and the ETH dataset with static BEV scenes [16]. Experimental results demonstrate that global motion compensation improves the performance of the HST model [11] trained solely on static BEV data from ETH and tested on dynamic scenes from JRDB. Specifically, the average displacement error (ADE) is reduced from 2.61 meters to 2.11 meters.

This work presents a novel FPV-to-BEV transformation framework that operates using only visual input and camera calibration, eliminating the need for high-precision odometry. The pipeline allows existing BEV trajectory forecasting approaches, developed for static environments, to be adopted in FPV dynamic scenarios.
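For concreteness, the global motion compensation step can be sketched with OpenCV roughly as follows. This is a minimal illustration of ORB matching followed by RANSAC-based affine estimation; the function name, the parameter values, and the choice to apply the inverse transform directly to agent coordinates are assumptions made for illustration, not the exact implementation used in the thesis.

```python
import cv2
import numpy as np

def compensate_global_motion(prev_gray, curr_gray, agent_points):
    """Estimate global camera motion between two consecutive grayscale
    frames and remove it from agent pixel coordinates.
    Minimal sketch; names and thresholds are illustrative."""
    # Detect and describe ORB keypoints in both frames.
    orb = cv2.ORB_create(nfeatures=2000)
    kp1, des1 = orb.detectAndCompute(prev_gray, None)
    kp2, des2 = orb.detectAndCompute(curr_gray, None)

    # Match descriptors with a brute-force Hamming matcher.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)
    src = np.float32([kp1[m.queryIdx].pt for m in matches])
    dst = np.float32([kp2[m.trainIdx].pt for m in matches])

    # Robustly fit an affine transform for the global camera motion;
    # RANSAC rejects correspondences on independently moving agents.
    A, _ = cv2.estimateAffine2D(src, dst, method=cv2.RANSAC,
                                ransacReprojThreshold=3.0)
    pts = np.asarray(agent_points, dtype=np.float32).reshape(-1, 1, 2)
    if A is None:
        # Too few reliable matches: leave the coordinates unchanged.
        return pts.reshape(-1, 2)

    # Neutralize the camera motion by mapping the current frame's agent
    # coordinates back into the previous (stabilized) frame with the
    # inverse transform, so that only local agent motion remains.
    A_inv = cv2.invertAffineTransform(A)
    return cv2.transform(pts, A_inv).reshape(-1, 2)
```

Over a full sequence, the per-frame affine transforms would be accumulated so that every observation is expressed in a single stabilized reference frame before the FPV-to-BEV projection.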
Furthermore, the proposed motion compensation strategy offers broader applicability in domains requiring the disentanglement of camera motion from dynamic visual scenes.
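As a rough illustration of the Pixel2World component described above, the following PyTorch module shows one way an autoencoder with an attention layer could map per-agent FPV features (pixel coordinates, visual descriptors, and camera calibration parameters concatenated into one vector) to metric BEV coordinates. The architecture, dimensions, and feature layout are assumptions for illustration; they are not the thesis design.

```python
import torch
import torch.nn as nn

class Pixel2WorldSketch(nn.Module):
    """Illustrative autoencoder-style network with attention that maps
    per-agent FPV features to BEV world coordinates (meters).
    Dimensions, layer choices, and feature layout are assumptions."""

    def __init__(self, in_dim: int = 16, hidden: int = 64, heads: int = 4):
        super().__init__()
        # Encoder over per-agent input: pixel coordinates plus visual
        # features and camera calibration parameters, concatenated.
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden)
        )
        # Self-attention over agents in the same frame, so each agent's
        # projection can use context from the others.
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        # Decoder from the latent representation to (x, y) in meters.
        self.decoder = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 2)
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_agents, in_dim)
        z = self.encoder(feats)
        z, _ = self.attn(z, z, z)   # attend across agents
        return self.decoder(z)      # (batch, num_agents, 2) BEV coordinates
```

A batch of shape (B, N, 16) would yield (B, N, 2) metric coordinates; in the pipeline described above, such outputs would form the BEV trajectories fed to the downstream forecasting model.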
https://hdl.handle.net/20.500.14247/26982