Learning to Model Diverse Interactive Traffic with Driving Tendency-Guided Policy Optimization
Published in IEEE INTELLIGENT VEHICLES SYMPOSIUM (IV), 2025
Jialin Fan1,2, Ying Ni1,2, Yuhao Yang1,2, Wentao Zheng1,2, Jie Sun1,2, and Jian Sun1,2
1 Department of Transportation Engineering, Tongji University, Shanghai, China
2 Key Laboratory of Road and Traffic Engineering, Ministry of Education, Shanghai, China
The safe deployment of autonomous vehicles (AVs) into real-world traffic requires robust interaction with human drivers exhibiting heterogeneous behavioral tendencies, ranging from rational cooperation to adversarial aggression. Existing simulation frameworks often lack the capacity to systematically model such behavioral diversity, limiting their applicability for rigorous AV evaluation. To address this challenge, we propose a multi-agent reinforcement learning framework that generates dynamically controllable traffic through Tendency-Guided Policy Optimization (TGPO). Central to TGPO is the Adversary-Rationality-Tendency (ART), a continuous hyperparameter that enables fine-grained control over the spectrum of driving behaviors by fusing separately learned adversarial and rational value functions. Furthermore, we design an ART-guided policy network that incorporates multi-head mechanisms to process high-dimensional multi-agent observations, adaptively prioritizing the context features aligned with the assigned driving tendency. Extensive experiments across urban and highway scenarios demonstrate that TGPO generates traffic flows with enhanced behavioral controllability and diversity. The proposed method provides a scalable solution for simulating realistic driver interactions, thereby facilitating the development of AV systems capable of handling complex real-world corner cases.
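For concreteness, the sketch below shows one way the ART-controlled fusion of the two value functions could be realized, assuming a convex combination of separately trained adversarial and rational value heads; the class name, layer sizes, and fusion rule are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class ARTValueFusion(nn.Module):
    """Fuse separately learned adversarial and rational value estimates.

    The convex-combination fusion rule and the layer sizes are
    illustrative assumptions, not the paper's exact formulation.
    """

    def __init__(self, obs_dim: int, hidden_dim: int = 128):
        super().__init__()
        # One value head trained under the adversarial objective ...
        self.v_adversarial = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1)
        )
        # ... and one trained under the rational (task-completion) objective.
        self.v_rational = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1)
        )

    def forward(self, obs: torch.Tensor, art: float) -> torch.Tensor:
        # art in [0, 1]: 0 -> fully rational behavior, 1 -> fully adversarial.
        return art * self.v_adversarial(obs) + (1.0 - art) * self.v_rational(obs)
```

Under this reading, sweeping art across [0, 1] yields the continuous spectrum of driving tendencies described above.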
To validate the capability of TGPO to generate diverse and controllable driving behaviors, we construct three representative interactive scenarios, each involving an autonomous vehicle (AV) governed by a vanilla PPO agent trained with single-agent RL [24] and four surrounding vehicles (SVs) governed by TGPO agents. Each scenario defines fixed start and goal positions for all vehicles, requiring agents to dynamically negotiate trajectories without predefined rules. Two SVs are designed to interact directly with the AV, while the remaining two serve as a control group to verify baseline driving task completion.
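As a concrete illustration of this setup, one scenario could be described with a structure like the following; every field name, coordinate, and ART value here is a hypothetical placeholder, since the paper fixes start and goal positions per scenario but does not publish a configuration schema.

```python
# Hypothetical description of one interactive scenario. Field names,
# coordinates, and ART values are illustrative assumptions only.
SCENARIO = {
    "av": {  # vanilla PPO agent trained with single-agent RL
        "policy": "vanilla_ppo",
        "start": (0.0, -1.75),
        "goal": (150.0, -1.75),
    },
    "svs": [
        # Two SVs assigned to interact directly with the AV.
        {"policy": "tgpo", "art": 0.9, "role": "interactive",
         "start": (12.0, 1.75), "goal": (150.0, -1.75)},
        {"policy": "tgpo", "art": 0.2, "role": "interactive",
         "start": (-12.0, -1.75), "goal": (150.0, 1.75)},
        # Two control-group SVs that only verify baseline task completion.
        {"policy": "tgpo", "art": 0.5, "role": "control",
         "start": (30.0, 1.75), "goal": (150.0, 1.75)},
        {"policy": "tgpo", "art": 0.5, "role": "control",
         "start": (-30.0, 1.75), "goal": (150.0, 1.75)},
    ],
}
```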
