Vision Language Models (VLMs) are transforming intelligent systems by bridging the gap between visual perception and linguistic reasoning. In autonomous driving, this synergy has catalyzed the development of Vision Language Action (VLA) models, which aim to translate high-level multimodal understanding into actionable driving behaviors, typically represented as future trajectories. However, current VLA models predominantly focus on generating generic collision-free trajectories. While collision avoidance is a fundamental requirement, adapting to diverse driving styles (e.g., sporty, comfortable) is essential for personalized user experiences. Furthermore, existing approaches often treat trajectory generation as a naive token-prediction task, which can lead to kinematically infeasible actions.
To address these limitations, we present StyleVLA, a physics-informed VLA framework that generates diverse, physically plausible driving behaviors. We introduce a novel hybrid loss function that integrates a physics-informed kinematic consistency constraint with a continuous regression head, enhancing the physical feasibility of generated trajectories. To train StyleVLA, built on Qwen3-VL-4B, we construct a large-scale instruction dataset containing over 1.2k scenarios with 76k BEV and 42k FPV samples, featuring ground-truth trajectories for five distinct driving styles paired with natural-language instructions. Extensive experiments demonstrate that our 4B-parameter StyleVLA significantly outperforms proprietary models (e.g., Gemini-3-Pro) and state-of-the-art VLA baselines.
We present a large-scale StyleVLA dataset — 1,216 scenarios with 27.13 s average duration, 76,030 BEV samples and 42,015 FPV samples — featuring ground-truth trajectories for five distinct driving styles (Default, Balanced, Comfort, Sporty, Safety) with natural language instructions, bridging the gap in behavioral diversity for VLA model training.
We evaluate off-the-shelf VLMs and state-of-the-art methods on the StyleVLA dataset, analyzing their ability to generate driving-style-aware trajectories. Results show that even top proprietary models like Gemini-3-Pro fail to generate valid style-conditioned trajectories in zero-shot settings.
We propose a new framework that combines a cross-entropy loss with an MLP regression head and a physics-informed kinematic consistency (PIKC) loss to fine-tune a 4B VLM for driving-style-aware trajectory generation, outperforming zero-shot VLMs and SOTA VLAs on unseen data with real-time inference.
100 scenarios are randomly selected to demonstrate StyleVLA's trajectory generation across five distinct driving styles. The 2D BEV animation (top) is rendered in CommonRoad, and the FPV video (bottom) is simulated in CARLA. Use Prev / Next or the dots to browse all scenarios.
| Style | Samples (Count / %) | Avg Velocity (m/s) | RMS Accel (m/s²) | RMS Jerk (m/s³) | Path Length (m) |
|---|---|---|---|---|---|
| Balanced | 14,102 / 18.5% | 7.15 | 0.588 | 0.750 | 24.44 |
| Comfort | 13,766 / 18.1% | 7.21 | 0.585 | 0.727 | 24.53 |
| Default | 17,684 / 23.3% | 6.80 | 0.486 | 0.794 | 23.31 |
| Sporty | 14,790 / 19.5% | 7.32 | 0.558 | 0.780 | 25.13 |
| Safety | 15,688 / 20.6% | 6.39 | 0.583 | 0.746 | 21.44 |
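The per-style statistics above can be reproduced from raw waypoints. The following is a minimal NumPy sketch, assuming uniformly sampled 2-D positions and finite-difference derivatives (the dataset's exact sampling rate and derivative scheme are not specified here, and `trajectory_stats` is a hypothetical helper):

```python
import numpy as np

def trajectory_stats(xy: np.ndarray, dt: float):
    """Per-trajectory statistics from 2-D waypoints sampled every dt seconds.

    xy: array of shape (T, 2) with positions in metres.
    Returns (avg_velocity, rms_accel, rms_jerk, path_length).
    """
    vel = np.diff(xy, axis=0) / dt    # (T-1, 2) m/s
    acc = np.diff(vel, axis=0) / dt   # (T-2, 2) m/s^2
    jerk = np.diff(acc, axis=0) / dt  # (T-3, 2) m/s^3
    speed = np.linalg.norm(vel, axis=1)
    avg_velocity = speed.mean()
    rms_accel = np.sqrt((np.linalg.norm(acc, axis=1) ** 2).mean())
    rms_jerk = np.sqrt((np.linalg.norm(jerk, axis=1) ** 2).mean())
    path_length = speed.sum() * dt    # sum of segment lengths
    return avg_velocity, rms_accel, rms_jerk, path_length

# Constant-velocity straight line: 2 m/s, zero acceleration and jerk.
xy = np.stack([np.arange(5, dtype=float), np.zeros(5)], axis=1)  # 1 m spacing
stats = trajectory_stats(xy, dt=0.5)
```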
100 randomly selected FPV question-answer pairs from the StyleVLA VQA dataset. Each entry shows the ego-centric CARLA front-view image, the full trajectory planning prompt (with structured vehicle state), and StyleVLA's predicted trajectory response.
StyleVLA generates diverse, style-conditioned trajectories for the same scenario (DEU_Salzwedel-41) across all five driving behaviors. The BEV trajectory comparison (top) is rendered in CommonRoad; the FPV video (bottom) is simulated in CARLA.
All experiments use a Dell Alienware R15 (Intel i7-13700KF, NVIDIA RTX 4090, 128 GB RAM). StyleVLA (Qwen3-VL-4B) significantly outperforms all baselines across both the BEV and FPV domains.
| Model | Score ↑ | PSR ↑ | MR ↓ | ADE (m) ↓ | FDE (m) ↓ | KCE (m) ↓ | Time (s) ↓ |
|---|---|---|---|---|---|---|---|
| Open Source Models | |||||||
| Qwen3-VL-4B | 0.00 | 0.00% | 100.0% | — | — | — | 2.00 |
| Qwen2.5-VL-7B | 0.00 | 0.00% | 100.0% | — | — | — | 3.43 |
| InternVL3-9B | 0.00 | 0.00% | 100.0% | 12.63 | 25.89 | 3.72 | 5.77 |
| Proprietary Models | |||||||
| Gemini-2.5-Flash | 0.26 | 13.30% | 73.40% | 2.40 | 5.70 | 0.07 | 44.18 |
| Gemini-2.5-Pro | 0.27 | 13.41% | 71.57% | 2.25 | 5.74 | 0.09 | 44.77 |
| Gemini-3-Pro | 0.32 | 16.38% | 66.21% | 1.72 | 4.37 | 0.11 | 73.83 |
| Ours (Fine-Tuned) | |||||||
| Qwen2.5-VL-7B | 0.46 | 33.19% | 48.18% | 1.17 | 3.06 | 0.12 | 3.70 |
| Qwen3-VL-4B (StyleVLA) | 0.55 | 39.47% | 39.91% | 1.15 | 2.93 | 0.08 | 1.92 |
Metrics are computed on generated trajectories. InternVL3-9B achieves a 90.91% generation rate; Qwen3-VL-4B and Qwen2.5-VL-7B generate no valid trajectories. — indicates metrics not computable due to a zero generation rate.
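ADE and FDE in the tables follow the standard trajectory-forecasting definitions: mean and final per-waypoint Euclidean error against ground truth. A minimal sketch (the paper's exact evaluation code is not shown here):

```python
import numpy as np

def ade_fde(pred: np.ndarray, gt: np.ndarray):
    """ADE: mean per-waypoint Euclidean error (m) over two (T, 2) arrays;
    FDE: the error at the final waypoint."""
    d = np.linalg.norm(pred - gt, axis=1)
    return d.mean(), d[-1]

gt = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
pred = gt + np.array([0.5, 0.0])   # uniform 0.5 m offset
ade, fde = ade_fde(pred, gt)       # both 0.5 m
```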
| Model | Score ↑ | PSR ↑ | MR ↓ | ADE (m) ↓ | FDE (m) ↓ | KCE (m) ↓ | Time (s) ↓ |
|---|---|---|---|---|---|---|---|
| Open Source Baselines | |||||||
| Qwen3-VL-4B (Base) | 0.00 | — | — | — | — | — | 4.97 |
| Proprietary Models | |||||||
| Gemini-2.5-Flash | 0.12 | 9.04% | 87.21% | 9.39 | 18.21 | 3.42 | 1.80 |
| Gemini-2.5-Pro | 0.29 | 16.63% | 70.74% | 2.23 | 5.76 | 0.06 | 35.48 |
| GPT-5 Nano | 0.29 | 16.67% | 69.70% | 2.36 | 6.06 | 0.11 | 49.05 |
| Gemini-3-Pro | 0.35 | 17.65% | 62.35% | 1.54 | 3.94 | 0.06 | 91.39 |
| SOTA VLA Models | |||||||
| SimLingo (1B) | 0.00 | 0.30% | 99.40% | 8.01 | 17.58 | / | 0.55 |
| Orion (7B) | 0.05 | 2.10% | 96.40% | 11.13 | 21.35 | / | 0.36 |
| OpenDriveVLA (0.5B) | 0.13 | 7.38% | 87.25% | 5.83 | 9.82 | / | 0.51 |
| Alpamayo-R1 (10B) | 0.19 | 13.60% | 73.10% | 3.37 | 5.85 | / | 0.65 |
| Qwen3-VL-4B (StyleVLA) | 0.51 | 38.60% | 36.90% | 1.17 | 3.13 | 0.11 | 2.13 |
Metrics are computed on successfully generated trajectories. All models achieve 100% generation except OpenDriveVLA (14.9%), Gemini-2.5-Flash (90.9%), and Qwen3-VL-4B Base (0%). — indicates metrics not computable due to a zero generation rate. / indicates models that lack velocity or acceleration outputs, for which KCE cannot be computed.
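The note above explains why KCE requires velocity or acceleration outputs: a kinematic consistency error compares the predicted waypoints against a rollout of the predicted dynamics. The paper's exact KCE formula is not given here; one plausible form, sketched as an assumption, integrates the predicted velocities from the first waypoint and measures the mean gap:

```python
import numpy as np

def kce(xy: np.ndarray, vel: np.ndarray, dt: float) -> float:
    """Mean distance (m) between predicted waypoints and an Euler rollout
    of the predicted velocities, started at the first waypoint.
    NOTE: illustrative definition, not necessarily the paper's exact metric."""
    rollout = xy[0] + np.cumsum(vel[:-1] * dt, axis=0)
    return float(np.linalg.norm(rollout - xy[1:], axis=1).mean())

# A self-consistent prediction (constant 1 m/s along x) scores 0.
xy = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
vel = np.tile(np.array([1.0, 0.0]), (3, 1))
err = kce(xy, vel, dt=1.0)
```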
Impact of data scaling and loss components on BEV StyleVLA (Qwen2.5-VL-7B, 3 s horizon).
| Configuration | ADE (m) ↓ | FDE (m) ↓ | PSR ↑ | Heading MAE (rad) ↓ |
|---|---|---|---|---|
| Impact of Training Data Scaling (with Physics-Informed Hybrid Loss) | ||||
| Small (4.5k) | 2.08 | 5.43 | 20.60% | 0.073 |
| Medium (20k) | 1.51 | 3.92 | 27.14% | 0.046 |
| Large (40k) | 1.47 | 3.81 | 29.37% | 0.044 |
| Standard (50k) | 1.17 | 3.06 | 33.19% | 0.035 |
| Impact of Loss Components (50k training dataset) | ||||
| CE | 1.47 | 3.82 | 29.00% | 0.043 |
| CE + REG | 1.21 | 3.17 | 32.08% | 0.036 |
| CE + REG + PIKC (Ours) | 1.17 | 3.06 | 33.19% | 0.035 |
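The three loss components ablated above (CE, REG, PIKC) can be sketched together. The exact forms of the regression and PIKC terms are not specified here, so the following NumPy sketch assumes REG is an L2 waypoint loss and PIKC penalises disagreement between regressed waypoints and a velocity rollout; `hybrid_loss` and the weights `w_reg`, `w_pikc` are illustrative, not the paper's implementation:

```python
import numpy as np

def hybrid_loss(logits, tokens, reg_xy, gt_xy, vel, dt, w_reg=1.0, w_pikc=0.5):
    """Sketch of CE + REG + PIKC (weights are illustrative)."""
    # CE: cross-entropy over discretised trajectory tokens.
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    ce = -logp[np.arange(len(tokens)), tokens].mean()
    # REG: L2 regression on continuous waypoints from the MLP head.
    reg = np.linalg.norm(reg_xy - gt_xy, axis=1).mean()
    # PIKC: waypoints should agree with an Euler rollout of the velocities.
    rollout = reg_xy[0] + np.cumsum(vel[:-1] * dt, axis=0)
    pikc = np.linalg.norm(rollout - reg_xy[1:], axis=1).mean()
    return ce + w_reg * reg + w_pikc * pikc

# Perfectly consistent prediction: all three terms (nearly) vanish.
V = 8
tokens = np.array([1, 3, 5])
logits = np.zeros((3, V))
logits[np.arange(3), tokens] = 100.0   # near-one-hot token prediction
gt_xy = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
vel = np.tile(np.array([1.0, 0.0]), (3, 1))
loss = hybrid_loss(logits, tokens, gt_xy, gt_xy, vel, dt=1.0)
```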
@inproceedings{stylevla2026,
  title     = {StyleVLA: Driving Style-Aware Vision Language Action Model for Autonomous Driving},
  author    = {Anonymous Authors},
  booktitle = {Under Review},
  year      = {2026}
}