Vision Language Models (VLMs) are transforming intelligent systems by bridging the gap between visual perception and linguistic reasoning. In autonomous driving, this synergy has catalyzed the development of Vision Language Action (VLA) models, which aim to translate high-level multimodal understanding into actionable driving behaviors, typically represented as future trajectories. However, current VLA models predominantly focus on generating generic collision-free trajectories. While collision avoidance is a fundamental requirement, adapting to diverse driving styles (e.g., sporty, comfortable) is essential for personalized user experiences. Furthermore, existing approaches often treat trajectory generation as a naive token-prediction task, which can lead to kinematically infeasible actions.
To address these limitations, we present StyleVLA, a physics-informed VLA framework that generates diverse, physically plausible driving behaviors. We introduce a novel hybrid loss function that integrates a physics-informed kinematic consistency constraint with a continuous regression head, enhancing the physical feasibility of generated trajectories. To train StyleVLA, built on Qwen3-VL-4B, we construct a large-scale instruction dataset containing over 1.2k scenarios with 76k BEV and 42k FPV samples, featuring ground-truth trajectories for five distinct driving styles paired with natural-language instructions. Extensive experiments demonstrate that our 4B-parameter StyleVLA significantly outperforms proprietary models (e.g., Gemini-3-Pro) and state-of-the-art VLA baselines.
We present a large-scale StyleVLA dataset — 1,216 scenarios with 27.13 s average duration, 76,030 BEV samples and 42,015 FPV samples — featuring ground-truth trajectories for five distinct driving styles (Default, Balanced, Comfort, Sporty, Safety) with natural language instructions, bridging the gap in behavioral diversity for VLA model training.
We evaluate off-the-shelf VLMs and state-of-the-art methods on the StyleVLA dataset, analyzing their ability to generate driving-style-aware trajectories. Results show that even top proprietary models like Gemini-3-Pro fail to generate valid style-conditioned trajectories in zero-shot settings.
We propose a new framework that combines a cross-entropy loss with an MLP regression head and a physics-informed kinematic consistency (PIKC) loss to fine-tune a 4B VLM for driving-style-aware trajectory generation, outperforming zero-shot VLMs and SOTA VLAs on unseen data with real-time inference.
100 scenarios are randomly selected to demonstrate StyleVLA's trajectory generation across five distinct driving styles. The 2D BEV animation (top) is rendered in CommonRoad, and the FPV video (bottom) is simulated in CARLA. Use Prev / Next or the dots to browse all scenarios.
| Style | Samples (Count / %) | Avg Velocity (m/s) | RMS Accel (m/s²) | RMS Jerk (m/s³) | Path Length (m) |
|---|---|---|---|---|---|
| Balanced | 14,102 / 18.5% | 7.15 | 0.588 | 0.750 | 24.44 |
| Comfort | 13,766 / 18.1% | 7.21 | 0.585 | 0.727 | 24.53 |
| Default | 17,684 / 23.3% | 6.80 | 0.486 | 0.794 | 23.31 |
| Sporty | 14,790 / 19.5% | 7.32 | 0.558 | 0.780 | 25.13 |
| Safety | 15,688 / 20.6% | 6.39 | 0.583 | 0.746 | 21.44 |
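The per-style statistics above can be reproduced from raw waypoints. The following is a minimal NumPy sketch, assuming uniformly sampled 2-D positions and finite-difference derivatives (the dataset's exact sampling rate and derivative scheme are not specified here, and `trajectory_stats` is a hypothetical helper):

```python
import numpy as np

def trajectory_stats(xy: np.ndarray, dt: float):
    """Per-trajectory statistics from 2-D waypoints sampled every dt seconds.

    xy: array of shape (T, 2) with positions in metres.
    Returns (avg_velocity, rms_accel, rms_jerk, path_length).
    """
    vel = np.diff(xy, axis=0) / dt    # (T-1, 2) m/s
    acc = np.diff(vel, axis=0) / dt   # (T-2, 2) m/s^2
    jerk = np.diff(acc, axis=0) / dt  # (T-3, 2) m/s^3
    speed = np.linalg.norm(vel, axis=1)
    avg_velocity = speed.mean()
    rms_accel = np.sqrt((np.linalg.norm(acc, axis=1) ** 2).mean())
    rms_jerk = np.sqrt((np.linalg.norm(jerk, axis=1) ** 2).mean())
    path_length = speed.sum() * dt    # sum of segment lengths
    return avg_velocity, rms_accel, rms_jerk, path_length

# Constant-velocity straight line: 2 m/s, zero acceleration and jerk.
xy = np.stack([np.arange(5, dtype=float), np.zeros(5)], axis=1)  # 1 m spacing
stats = trajectory_stats(xy, dt=0.5)
```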
100 randomly selected FPV question-answer pairs from the StyleVLA VQA dataset. Each entry shows the ego-centric CARLA front-view image, the full trajectory planning prompt (with structured vehicle state), and StyleVLA's predicted trajectory response.
StyleVLA generates diverse, style-conditioned trajectories for the same scenario (DEU_Salzwedel-41) across all five driving behaviors. The BEV trajectory comparison (top) is rendered in CommonRoad; the FPV video (bottom) is simulated in CARLA.
All experiments use a Dell Alienware R15 (Intel i7-13700KF, NVIDIA RTX 4090, 128 GB RAM). StyleVLA (Qwen3-VL-4B) significantly outperforms all baselines across both the BEV and FPV domains.
| Model | Score ↑ | PSR ↑ | MR ↓ | ADE (m) ↓ | FDE (m) ↓ | KCE (m) ↓ | Time (s) ↓ |
|---|---|---|---|---|---|---|---|
| Open Source Models | |||||||
| Qwen3-VL-4B | 0.00 | 0.00% | 100.0% | — | — | — | 2.00 |
| Qwen2.5-VL-7B | 0.00 | 0.00% | 100.0% | — | — | — | 3.43 |
| InternVL3-9B | 0.00 | 0.00% | 100.0% | 12.63 | 25.89 | 3.72 | 5.77 |
| Proprietary Models | |||||||
| Gemini-2.5-Flash | 0.26 | 13.30% | 73.40% | 2.40 | 5.70 | 0.07 | 44.18 |
| Gemini-2.5-Pro | 0.27 | 13.41% | 71.57% | 2.25 | 5.74 | 0.09 | 44.77 |
| Gemini-3-Pro | 0.32 | 16.38% | 66.21% | 1.72 | 4.37 | 0.11 | 73.83 |
| Ours (Fine-Tuned) | |||||||
| Qwen2.5-VL-7B | 0.46 | 33.19% | 48.18% | 1.17 | 3.06 | 0.12 | 3.70 |
| Qwen3-VL-4B (StyleVLA) | 0.55 | 39.47% | 39.91% | 1.15 | 2.93 | 0.08 | 1.92 |
Metrics are computed on generated trajectories. InternVL3-9B achieves a 90.91% generation rate; Qwen3-VL-4B and Qwen2.5-VL-7B generate no valid trajectories. — indicates metrics not computable due to a zero generation rate.
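ADE and FDE in the tables follow the standard trajectory-forecasting definitions: mean and final per-waypoint Euclidean error against ground truth. A minimal sketch (the paper's exact evaluation code is not shown here):

```python
import numpy as np

def ade_fde(pred: np.ndarray, gt: np.ndarray):
    """ADE: mean per-waypoint Euclidean error (m) over two (T, 2) arrays;
    FDE: the error at the final waypoint."""
    d = np.linalg.norm(pred - gt, axis=1)
    return d.mean(), d[-1]

gt = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
pred = gt + np.array([0.5, 0.0])   # uniform 0.5 m offset
ade, fde = ade_fde(pred, gt)       # both 0.5 m
```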
| Model | Score ↑ | PSR ↑ | MR ↓ | ADE (m) ↓ | FDE (m) ↓ | KCE (m) ↓ | Time (s) ↓ |
|---|---|---|---|---|---|---|---|
| Open Source Baselines | |||||||
| Qwen3-VL-4B (Base) | 0.00 | — | — | — | — | — | 4.97 |
| Proprietary Models | |||||||
| Gemini-2.5-Flash | 0.12 | 9.04% | 87.21% | 9.39 | 18.21 | 3.42 | 1.80 |
| Gemini-2.5-Pro | 0.29 | 16.63% | 70.74% | 2.23 | 5.76 | 0.06 | 35.48 |
| GPT-5 Nano | 0.29 | 16.67% | 69.70% | 2.36 | 6.06 | 0.11 | 49.05 |
| Gemini-3-Pro | 0.35 | 17.65% | 62.35% | 1.54 | 3.94 | 0.06 | 91.39 |
| SOTA VLA Models | |||||||
| SimLingo (1B) | 0.00 | 0.30% | 99.40% | 8.01 | 17.58 | / | 0.55 |
| Orion (7B) | 0.05 | 2.10% | 96.40% | 11.13 | 21.35 | / | 0.36 |
| OpenDriveVLA (0.5B) | 0.13 | 7.38% | 87.25% | 5.83 | 9.82 | / | 0.51 |
| Alpamayo-R1 (10B) | 0.19 | 13.60% | 73.10% | 3.37 | 5.85 | / | 0.65 |
| Qwen3-VL-4B (StyleVLA) | 0.51 | 38.60% | 36.90% | 1.17 | 3.13 | 0.11 | 2.13 |
Metrics are computed on successfully generated trajectories. All models achieve 100% generation except OpenDriveVLA (14.9%), Gemini-2.5-Flash (90.9%), and Qwen3-VL-4B Base (0%). — indicates metrics not computable due to a zero generation rate. / indicates models that lack velocity or acceleration outputs, for which KCE cannot be computed.
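The note above explains why KCE requires velocity or acceleration outputs: a kinematic consistency error compares the predicted waypoints against a rollout of the predicted dynamics. The paper's exact KCE formula is not given here; one plausible form, sketched as an assumption, integrates the predicted velocities from the first waypoint and measures the mean gap:

```python
import numpy as np

def kce(xy: np.ndarray, vel: np.ndarray, dt: float) -> float:
    """Mean distance (m) between predicted waypoints and an Euler rollout
    of the predicted velocities, started at the first waypoint.
    NOTE: illustrative definition, not necessarily the paper's exact metric."""
    rollout = xy[0] + np.cumsum(vel[:-1] * dt, axis=0)
    return float(np.linalg.norm(rollout - xy[1:], axis=1).mean())

# A self-consistent prediction (constant 1 m/s along x) scores 0.
xy = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
vel = np.tile(np.array([1.0, 0.0]), (3, 1))
err = kce(xy, vel, dt=1.0)
```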
Impact of data scaling and loss components on BEV StyleVLA (Qwen2.5-VL-7B, 3 s horizon).
| Configuration | ADE (m) ↓ | FDE (m) ↓ | PSR ↑ | Heading MAE (rad) ↓ |
|---|---|---|---|---|
| Impact of Training Data Scaling (with Physics-Informed Hybrid Loss) | ||||
| Small (4.5k) | 2.08 | 5.43 | 20.60% | 0.073 |
| Medium (20k) | 1.51 | 3.92 | 27.14% | 0.046 |
| Large (40k) | 1.47 | 3.81 | 29.37% | 0.044 |
| Standard (50k) | 1.17 | 3.06 | 33.19% | 0.035 |
| Impact of Loss Components (50k training dataset) | ||||
| CE | 1.47 | 3.82 | 29.00% | 0.043 |
| CE + REG | 1.21 | 3.17 | 32.08% | 0.036 |
| CE + REG + PIKC (Ours) | 1.17 | 3.06 | 33.19% | 0.035 |
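The three loss components ablated above (CE, REG, PIKC) can be sketched together. The exact forms of the regression and PIKC terms are not specified here, so the following NumPy sketch assumes REG is an L2 waypoint loss and PIKC penalises disagreement between regressed waypoints and a velocity rollout; `hybrid_loss` and the weights `w_reg`, `w_pikc` are illustrative, not the paper's implementation:

```python
import numpy as np

def hybrid_loss(logits, tokens, reg_xy, gt_xy, vel, dt, w_reg=1.0, w_pikc=0.5):
    """Sketch of CE + REG + PIKC (weights are illustrative)."""
    # CE: cross-entropy over discretised trajectory tokens.
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    ce = -logp[np.arange(len(tokens)), tokens].mean()
    # REG: L2 regression on continuous waypoints from the MLP head.
    reg = np.linalg.norm(reg_xy - gt_xy, axis=1).mean()
    # PIKC: waypoints should agree with an Euler rollout of the velocities.
    rollout = reg_xy[0] + np.cumsum(vel[:-1] * dt, axis=0)
    pikc = np.linalg.norm(rollout - reg_xy[1:], axis=1).mean()
    return ce + w_reg * reg + w_pikc * pikc

# Perfectly consistent prediction: all three terms (nearly) vanish.
V = 8
tokens = np.array([1, 3, 5])
logits = np.zeros((3, V))
logits[np.arange(3), tokens] = 100.0   # near-one-hot token prediction
gt_xy = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
vel = np.tile(np.array([1.0, 0.0]), (3, 1))
loss = hybrid_loss(logits, tokens, gt_xy, gt_xy, vel, dt=1.0)
```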
@inproceedings{stylevla2026,
  title     = {StyleVLA: Driving Style-Aware Vision Language Action Model for Autonomous Driving},
  author    = {Anonymous Authors},
  booktitle = {Under Review},
  year      = {2026}
}