Future Localization of Human Trajectories in Dynamic First-Person View Video
KUTTUBEK KYZY, GAUKHAR
2024/2025
Abstract
Autonomous robot navigation remains a fundamental yet challenging problem due to the complexity and variability of real-world environments. In human-populated settings, navigation systems must not only detect and track individuals but also anticipate their future trajectories to ensure safe interaction. Accurate human trajectory forecasting is therefore critical for deployment in crowded and dynamic scenes.

Human motion arises from non-linear interactions between people, shaped by both social behavior and environmental context. Work in this field falls into two settings: trajectory prediction approaches that adopt a bird’s-eye view (BEV), and real-world robotic systems that operate from a first-person view (FPV), where the central agent is part of the observed scene. A key distinction in FPV is the mobility of the central agent, the robot, which introduces global motion on top of the local motion of the other agents in the scene. The interplay between global and local motion complicates trajectory forecasting in FPV settings, creating what we call a dynamic scene.

This thesis introduces an approach for reducing the effects of global motion in FPV trajectory forecasting by transforming the problem into a form compatible with BEV approaches. We compensate for global motion by mapping FPV agent coordinates onto a static BEV world coordinate system. This transformation allows us to apply existing BEV-based models, originally developed for static environments, to dynamic scenes.

The proposed pipeline contains two main components. First, we implement a global motion compensation module based on affine transformations, which detects and exploits visual features from objects in the dynamic scene. We extract and match ORB keypoints between every pair of consecutive grayscale frames [13], then employ RANSAC to compute a robust affine matrix representing global camera motion, which we subsequently neutralize [13]. Second, we introduce Pixel2World, an autoencoder-style network with attention layers (sketched below) trained to project agent coordinates from FPV image space (pixels) to BEV world space (meters). The model combines geometric and visual features with camera calibration parameters to perform this transformation accurately.

We integrate these components into an end-to-end pipeline that processes FPV input frames in three steps: first, it compensates for global motion; then, it converts historical FPV coordinates into BEV world coordinates; and finally, it predicts future BEV world coordinates for each observed agent. The resulting trajectories can be fed directly into existing BEV-based forecasting models without retraining.

We evaluate the proposed pipeline on both the JRDB dataset [5], which contains FPV scenarios, and the ETH dataset with static BEV scenes [16]. Experimental results demonstrate that global motion compensation improves the performance of the HST model [11] trained solely on static BEV data from ETH and tested on dynamic scenes from JRDB. Specifically, the average displacement error (ADE) is reduced from 2.61 meters to 2.11 meters.

This work presents a novel FPV-to-BEV transformation framework that operates using only visual input and camera calibration, eliminating the need for high-precision odometry. The pipeline allows existing BEV trajectory forecasting approaches, developed for static environments, to be adopted in FPV dynamic scenarios.
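For concreteness, the global motion compensation step can be sketched with OpenCV roughly as follows. This is a minimal illustration of ORB matching followed by RANSAC-based affine estimation; the function name, the parameter values, and the choice to apply the inverse transform directly to agent coordinates are assumptions made for illustration, not the exact implementation used in the thesis.

```python
import cv2
import numpy as np

def compensate_global_motion(prev_gray, curr_gray, agent_points):
    """Estimate global camera motion between two consecutive grayscale
    frames and remove it from agent pixel coordinates.
    Minimal sketch; names and thresholds are illustrative."""
    # Detect and describe ORB keypoints in both frames.
    orb = cv2.ORB_create(nfeatures=2000)
    kp1, des1 = orb.detectAndCompute(prev_gray, None)
    kp2, des2 = orb.detectAndCompute(curr_gray, None)

    # Match descriptors with a brute-force Hamming matcher.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)
    src = np.float32([kp1[m.queryIdx].pt for m in matches])
    dst = np.float32([kp2[m.trainIdx].pt for m in matches])

    # Robustly fit an affine transform for the global camera motion;
    # RANSAC rejects correspondences on independently moving agents.
    A, _ = cv2.estimateAffine2D(src, dst, method=cv2.RANSAC,
                                ransacReprojThreshold=3.0)
    pts = np.asarray(agent_points, dtype=np.float32).reshape(-1, 1, 2)
    if A is None:
        # Too few reliable matches: leave the coordinates unchanged.
        return pts.reshape(-1, 2)

    # Neutralize the camera motion by mapping the current frame's agent
    # coordinates back into the previous (stabilized) frame with the
    # inverse transform, so that only local agent motion remains.
    A_inv = cv2.invertAffineTransform(A)
    return cv2.transform(pts, A_inv).reshape(-1, 2)
```

Over a full sequence, the per-frame affine transforms would be accumulated so that every observation is expressed in a single stabilized reference frame before the FPV-to-BEV projection.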
Furthermore, the proposed motion compensation strategy offers broader applicability in domains requiring the disentanglement of camera motion from dynamic visual scenes.
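As a rough illustration of the Pixel2World component described above, the following PyTorch module shows one way an autoencoder with an attention layer could map per-agent FPV features (pixel coordinates, visual descriptors, and camera calibration parameters concatenated into one vector) to metric BEV coordinates. The architecture, dimensions, and feature layout are assumptions for illustration; they are not the thesis design.

```python
import torch
import torch.nn as nn

class Pixel2WorldSketch(nn.Module):
    """Illustrative autoencoder-style network with attention that maps
    per-agent FPV features to BEV world coordinates (meters).
    Dimensions, layer choices, and feature layout are assumptions."""

    def __init__(self, in_dim: int = 16, hidden: int = 64, heads: int = 4):
        super().__init__()
        # Encoder over per-agent input: pixel coordinates plus visual
        # features and camera calibration parameters, concatenated.
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden)
        )
        # Self-attention over agents in the same frame, so each agent's
        # projection can use context from the others.
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        # Decoder from the latent representation to (x, y) in meters.
        self.decoder = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 2)
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_agents, in_dim)
        z = self.encoder(feats)
        z, _ = self.attn(z, z, z)   # attend across agents
        return self.decoder(z)      # (batch, num_agents, 2) BEV coordinates
```

A batch of shape (B, N, 16) would yield (B, N, 2) metric coordinates; in the pipeline described above, such outputs would form the BEV trajectories fed to the downstream forecasting model.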
https://hdl.handle.net/20.500.14247/26982