Driver and Pedestrian Mutual Awareness for Path Prediction in Intelligent Vehicles

Research output: Thesis › Dissertation (TU Delft)


Abstract

This thesis addresses sensor-based perception of the driver and pedestrians to improve joint path prediction of the ego-vehicle and pedestrians, based on their mutual awareness, in the domain of intelligent vehicles.

According to the World Health Organization (WHO), more than half of global traffic deaths occur among Vulnerable Road Users (VRUs), such as pedestrians and riders, and human error is still a major cause of accidents. This motivates paying special attention to pedestrians and drivers while they interact in traffic. For the foreseeable future, the reality on the road (and the accident numbers) will largely be determined by Advanced Driver-Assistance Systems (ADAS), where the driver is still required to keep their eyes on the road. To that end, the scope of this thesis resides within ADAS and driving automation up to and including Level 3 as defined by the Society of Automotive Engineers (SAE). While current ADAS consider pedestrians and the driver individually, their mutual awareness has not been leveraged to improve path prediction and thereby road safety. This thesis presents a framework that estimates driver head pose from driver camera images, estimates pedestrian location and orientation from exterior camera images and lidar point clouds, uses this information over time to reason about driver and pedestrian mutual awareness, and performs joint probabilistic path prediction of ego-vehicle and pedestrian to assess collision risk.

Deep neural networks demand a large training set to tune their vast number of parameters. This thesis introduces DD-Pose, the Daimler TU Delft Driver Head Pose Benchmark, a large-scale and diverse benchmark for image-based head pose estimation and driver analysis. It contains 330k measurements from multiple cameras acquired by an in-car setup during naturalistic drives. Large out-of-plane head rotations and occlusions are induced by complex driving scenarios. Precise head pose annotations are obtained by a motion capture sensor and a novel calibration device. The new dataset offers a broad distribution of head poses, comprising an order of magnitude more samples of rare poses than a comparable dataset.

Utilizing the dataset, this thesis presents intrApose, a novel method for continuous 6 degrees of freedom (DOF) head pose estimation from a single camera image without prior detection or landmark localization. intrApose uses camera intrinsics consistently within the deep neural network and is crop-aware and scale-aware: poses estimated from bounding boxes within the overall image are converted to a consistent pose within the camera frame. It employs a continuous, differentiable rotation representation that simplifies the overall architecture compared to existing methods. Experiments show that leveraging camera intrinsics and a continuous rotation representation (SVDO+) results in improved pose estimation compared to intrinsics-agnostic variants and variants with discontinuous rotation representations. Driver head poses in naturalistic driving are biased towards close-to-frontal orientations. Training with an unbiased data distribution, i.e., a more uniform distribution of head poses, further reduces rotation error, specifically for extreme orientations and occlusions.
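The projection step behind an SVD-based continuous rotation representation can be sketched as follows. This is a minimal NumPy illustration of the general idea (mapping an unconstrained 3x3 network output to the closest proper rotation matrix), not the thesis's actual implementation; the function name is illustrative.

```python
import numpy as np

def svd_orthogonalize(m: np.ndarray) -> np.ndarray:
    """Project an arbitrary 3x3 matrix (e.g. a 9D regression output
    reshaped to 3x3) onto SO(3), i.e. the closest rotation matrix in
    the Frobenius norm. The last singular direction is flipped if
    needed so the determinant is +1 (a rotation, not a reflection)."""
    u, _, vt = np.linalg.svd(m)
    d = np.linalg.det(u @ vt)          # +1 or -1
    s = np.diag([1.0, 1.0, d])         # correct an improper rotation
    return u @ s @ vt

# Example: a noisy near-rotation, as a regression head might produce
noisy = np.eye(3) + 0.1 * np.random.default_rng(0).standard_normal((3, 3))
r = svd_orthogonalize(noisy)
assert np.allclose(r @ r.T, np.eye(3), atol=1e-8)   # orthogonal
assert np.isclose(np.linalg.det(r), 1.0)            # proper rotation
```

Because the mapping is differentiable almost everywhere and free of the discontinuities that plague Euler angles or quaternions, it can sit directly at the end of a regression network.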

In addition to considering the inside of the vehicle, this thesis also focuses on the outside environment and presents a method for 3D person detection from a paired camera image and lidar point cloud in automotive scenes. The method comprises a deep neural network that estimates the 3D location, spatial extent, and yaw orientation of persons present in the scene. 3D anchor proposals are refined in two stages: a region proposal network and a subsequent detection network. For both input modalities, high-level feature representations are learned from raw sensor data instead of being manually designed. To that end, the method uses Voxel Feature Encoders to obtain point cloud features instead of widely used projection-based point cloud representations. Experiments are conducted on the KITTI 3D object detection benchmark, a commonly used dataset in the automotive domain.
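The voxel grouping step underlying such point cloud encoders can be sketched as follows. This is a simplified stand-in: a learned Voxel Feature Encoder applies a small per-point network followed by pooling inside each voxel, whereas this hand-crafted sketch only illustrates the grouping and uses the per-voxel mean as the feature. All names here are illustrative.

```python
import numpy as np

def voxelize(points: np.ndarray, voxel_size: float = 0.5):
    """Group lidar points (N x 3) into voxels and compute a simple
    per-voxel feature (the mean point). A Voxel Feature Encoder would
    replace the mean with learned per-point features and max-pooling."""
    idx = np.floor(points / voxel_size).astype(np.int64)   # voxel index per point
    keys, inverse = np.unique(idx, axis=0, return_inverse=True)
    counts = np.bincount(inverse, minlength=len(keys))
    feats = np.zeros((len(keys), 3))
    for dim in range(3):
        feats[:, dim] = np.bincount(inverse, weights=points[:, dim],
                                    minlength=len(keys)) / counts
    return keys, feats

pts = np.array([[0.1, 0.1, 0.0], [0.2, 0.3, 0.1], [1.4, 0.0, 0.0]])
keys, feats = voxelize(pts)
# two occupied voxels: one holding the first two points, one the third
```

Operating on such sparse voxel features avoids the information loss of projecting the point cloud to a 2D view before feature extraction.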

Finally, the outputs provided by the methods of the preceding chapters, namely driver head pose and 3D person locations, are leveraged by a novel method for vehicle-pedestrian path prediction that takes into account the awareness of the driver and the pedestrian of each other’s presence. The method jointly models the paths of the ego-vehicle and a pedestrian within a single Dynamic Bayesian Network (DBN). In this DBN, subgraphs model the environment and entity-specific context cues of the vehicle and pedestrian (incl. awareness), which affect their future motion. These subgraphs share a latent state which models whether the vehicle and pedestrian are on collision course. The method is validated with real-world data obtained by on-board vehicle sensing, spanning various awareness conditions and dynamic characteristics of the participants. Results show that at a prediction horizon of 1.5 s, context-aware models outperform context-agnostic models in path prediction for scenarios with a dynamics change while performing similarly otherwise. Results further indicate that driver attention-aware models improve collision risk estimation compared to driver-agnostic models. This illustrates that driver contextual cues can support a more anticipatory collision warning and vehicle control strategy.
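The core idea of a shared latent collision-course state can be sketched as a discrete Bayes filter: context cues (e.g. an "unaware" observation) update the belief over the binary latent state at each time step. This is a minimal illustration of the filtering principle only; the probabilities and function name are invented for the sketch and are not the thesis's learned DBN parameters.

```python
def filter_collision_belief(observations, p_stay=0.9,
                            p_cue_given_collision=0.8,
                            p_cue_given_clear=0.2,
                            prior=0.5):
    """Discrete Bayes filter over a binary latent state ("on collision
    course"). `observations` are booleans, e.g. whether a context cue
    such as "pedestrian not aware of the vehicle" fired at that step."""
    belief = prior
    history = []
    for obs in observations:
        # Predict: the latent state persists with probability p_stay.
        pred = p_stay * belief + (1 - p_stay) * (1 - belief)
        # Update: weigh prediction by the likelihood of the observed cue.
        lc = p_cue_given_collision if obs else 1 - p_cue_given_collision
        ln = p_cue_given_clear if obs else 1 - p_cue_given_clear
        belief = lc * pred / (lc * pred + ln * (1 - pred))
        history.append(belief)
    return history

# Repeated "unaware" cues push the collision-course belief upward.
beliefs = filter_collision_belief([True, True, True])
```

Conditioning the motion models of both entities on such a shared latent state is what lets awareness cues observed for one participant reshape the predicted path of the other.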

The main conclusions and findings of this thesis are: using a measurement device with a per-subject calibration procedure simplifies the data acquisition process to obtain a broad distribution of head poses. Using an intrinsics-aware head pose estimation method with a continuous rotation representation allows for a simple architecture that yields robust head pose estimates across a broad spectrum of head poses. Modeling driver and pedestrian mutual awareness in a unified DBN improves joint probabilistic path prediction compared to driver-agnostic models. Additionally, it provides explainability of model parameters and interpretability of the internal decision-making process. Further research can be conducted to understand the behavior of humans inside and outside an intelligent vehicle. Two major trends are integrating uncertainties into the components and combining them into a system that can be trained end-to-end from raw sensor data to predicted paths. Future work would greatly benefit from representative, worldwide, naturalistic, multi-sensor, temporal data which cover the outside environment as well as the inside of the vehicle, ideally shared across research institutions and companies.
Original language: English
Qualification: Doctor of Philosophy
Awarding Institution
  • Delft University of Technology
Supervisors/Advisors
  • Gavrila, D., Supervisor
  • Kooij, J.F.P., Advisor
Award date: 20 Dec 2023
Electronic ISBNs: 978-94-6384-502-1
Publication status: Published - 2023

Keywords

  • Head pose estimation
  • Head pose dataset
  • Person detection
  • Ego-vehicle path prediction
  • Pedestrian path prediction
  • Intelligent vehicles
  • Automated driving
