StyleVLA: Driving Style-Aware
Vision Language Action Model
for Autonomous Driving

Anonymous Authors · Anonymous Institution · Under Review

StyleVLA Framework Overview
Figure 1. StyleVLA framework: dataset construction (top), CARLA instruction generation (middle), and physics-informed fine-tuning with hybrid loss ℒtotal for BEV & FPV trajectory generation (bottom).

Abstract

Vision Language Models (VLMs) are transforming intelligent systems by bridging the gap between visual perception and linguistic reasoning. In autonomous driving, this synergy has catalyzed the development of Vision Language Action (VLA) models, which aim to translate high-level multimodal understanding into actionable driving behaviors, typically represented as future trajectories. However, current VLA models predominantly focus on generating generic collision-free trajectories. While collision avoidance is a fundamental requirement, adapting to diverse driving styles (e.g., sporty, comfortable) is essential for personalized user experiences. Furthermore, they often treat trajectory generation as a naive token prediction task, leading to kinematically infeasible actions.

To address these limitations, we present StyleVLA, a physics-informed VLA framework that generates diverse, physically plausible driving behaviors. We introduce a novel hybrid loss function that integrates a physics-informed kinematic consistency constraint with a continuous regression head, enhancing the physical feasibility of generated trajectories. To train StyleVLA, which is built on Qwen3-VL-4B, we construct a large-scale instruction dataset of over 1.2k scenarios with 76k BEV and 42k FPV samples, featuring ground-truth trajectories for five distinct driving styles paired with natural-language instructions. Extensive experiments demonstrate that our 4B-parameter StyleVLA significantly outperforms proprietary models (e.g., Gemini-3-Pro) and state-of-the-art VLA baselines.

Key Contributions

1. StyleVLA Dataset

We present a large-scale StyleVLA dataset — 1,216 scenarios with 27.13 s average duration, 76,030 BEV samples and 42,015 FPV samples — featuring ground-truth trajectories for five distinct driving styles (Default, Balanced, Comfort, Sporty, Safety) with natural language instructions, bridging the gap in behavioral diversity for VLA model training.

2. Comprehensive Benchmarking

We evaluate off-the-shelf VLMs and state-of-the-art methods on the StyleVLA dataset, analyzing their ability to generate driving-style-aware trajectories. Results show that even top proprietary models like Gemini-3-Pro fail to generate valid style-conditioned trajectories in zero-shot settings.

3. Physics-Informed VLA Framework

We propose a new framework that combines a cross-entropy loss with an MLP regression head and a physics-informed kinematic consistency (PIKC) loss to fine-tune a 4B VLM for driving-style-aware trajectory generation, outperforming zero-shot VLMs and SOTA VLAs on unseen data with real-time inference.
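As an illustration of how the three terms could fit together, here is a minimal pure-Python sketch of the hybrid objective: token cross-entropy, continuous waypoint regression, and a PIKC penalty that ties predicted waypoints to the predicted speed and acceleration. The weights `w_reg`/`w_pikc`, the step size `dt`, and the constant-acceleration rollout are illustrative assumptions, not the paper's exact formulation.

```python
import math

def hybrid_loss(logits, target_ids, pred_traj, gt_traj,
                vel, acc, dt=0.1, w_reg=1.0, w_pikc=0.1):
    """Sketch of L_total = CE + w_reg * REG + w_pikc * PIKC (weights assumed)."""
    # 1) Cross-entropy over the discretized trajectory tokens.
    ce = 0.0
    for step_logits, tgt in zip(logits, target_ids):
        z = max(step_logits)  # log-sum-exp with max-shift for stability
        log_norm = z + math.log(sum(math.exp(l - z) for l in step_logits))
        ce += log_norm - step_logits[tgt]
    ce /= len(target_ids)

    # 2) L2 regression between continuous waypoints (MLP head output).
    reg = sum((px - gx) ** 2 + (py - gy) ** 2
              for (px, py), (gx, gy) in zip(pred_traj, gt_traj)) / len(gt_traj)

    # 3) PIKC: each predicted step length should match the displacement
    #    implied by the predicted speed/acceleration (constant-accel rollout).
    pikc = 0.0
    for t in range(len(pred_traj) - 1):
        (px, py), (nx, ny) = pred_traj[t], pred_traj[t + 1]
        implied = vel[t] * dt + 0.5 * acc[t] * dt ** 2
        actual = math.hypot(nx - px, ny - py)
        pikc += (actual - implied) ** 2
    pikc /= max(len(pred_traj) - 1, 1)

    return ce + w_reg * reg + w_pikc * pikc
```

A kinematically consistent prediction leaves the PIKC term at zero, so the penalty only activates when the language head's waypoints contradict the model's own speed/acceleration outputs.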

Dataset Samples

Map Conversion Demonstrations

100 randomly selected scenarios, each showing the CommonRoad BEV map and the corresponding CARLA rendering side by side.

Driving Behavior Demonstrations

100 randomly selected scenarios demonstrate StyleVLA's trajectory generation across the five driving styles. For each style, the 2D BEV animation (top) is rendered in CommonRoad and the FPV video (bottom) is simulated in CARLA.


StyleVLA Dataset

1,216 Scenarios
76,030 BEV Samples
42,015 FPV Samples
27.13 s Avg. Duration
14 Countries

Driving Styles & Kinematic Characteristics

| Style | Samples (Count / %) | Avg Velocity (m/s) | RMS Accel (m/s²) | RMS Jerk (m/s³) | Path Length (m) |
|---|---|---|---|---|---|
| Balanced | 14,102 / 18.5% | 7.15 | 0.588 | 0.750 | 24.44 |
| Comfort | 13,766 / 18.1% | 7.21 | 0.585 | 0.727 | 24.53 |
| Default | 17,684 / 23.3% | 6.80 | 0.486 | 0.794 | 23.31 |
| Sporty | 14,790 / 19.5% | 7.32 | 0.558 | 0.780 | 25.13 |
| Safety | 15,688 / 20.6% | 6.39 | 0.583 | 0.746 | 21.44 |
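The table's kinematic descriptors can be recovered from a sampled 2-D trajectory by finite differencing. The sketch below assumes a fixed sampling interval `dt` (the dataset's actual rate is not stated here); it is a minimal reading of the column definitions, not the dataset's exact pipeline.

```python
import math

def style_stats(xy, dt=0.1):
    """Kinematic descriptors for one trajectory: a list of (x, y) points
    sampled every dt seconds (dt is an assumption)."""
    # Finite-difference speed, then acceleration, then jerk magnitudes.
    speeds = [math.hypot(x2 - x1, y2 - y1) / dt
              for (x1, y1), (x2, y2) in zip(xy, xy[1:])]
    accels = [(v2 - v1) / dt for v1, v2 in zip(speeds, speeds[1:])]
    jerks = [(a2 - a1) / dt for a1, a2 in zip(accels, accels[1:])]

    def rms(xs):
        return math.sqrt(sum(v * v for v in xs) / len(xs)) if xs else 0.0

    return {
        "avg_velocity": sum(speeds) / len(speeds),   # m/s
        "rms_accel": rms(accels),                    # m/s^2
        "rms_jerk": rms(jerks),                      # m/s^3
        "path_length": sum(v * dt for v in speeds),  # m
    }
```

On a straight segment driven at a constant 5 m/s, this yields an average velocity of 5.0 with zero RMS acceleration and jerk, matching the intuition that Comfort-style trajectories should score low on the jerk column.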

VQA Dataset Samples

100 randomly selected FPV question-answer pairs from the StyleVLA VQA dataset. Each entry shows the ego-centric CARLA front-view image, the full trajectory planning prompt (with structured vehicle state), and StyleVLA's predicted trajectory response.


Qualitative Demo

StyleVLA generates diverse, style-conditioned trajectories for the same scenario (DEU_Salzwedel-41) across all five driving behaviors. The BEV trajectory comparison (top) is rendered in CommonRoad; the FPV video (bottom) is simulated in CARLA.

For each of the five styles, a BEV trajectory comparison and an FPV video are shown; the legend marks the goal, the ground-truth trajectory, and StyleVLA's prediction.

Quantitative Results

All experiments use a Dell Alienware R15 (Intel i7-13700KF, NVIDIA RTX 4090, 128 GB RAM). StyleVLA (Qwen3-VL-4B) significantly outperforms all baselines across both domains.

BEV Domain — Zero-Shot Benchmarking

| Model | Score ↑ | PSR ↑ | MR ↓ | ADE (m) ↓ | FDE (m) ↓ | KCE (m) ↓ | Time (s) ↓ |
|---|---|---|---|---|---|---|---|
| *Open Source Models* | | | | | | | |
| Qwen3-VL-4B | 0.00 | 0.00% | 100.0% | — | — | — | 2.00 |
| Qwen2.5-VL-7B | 0.00 | 0.00% | 100.0% | — | — | — | 3.43 |
| InternVL3-9B | 0.00 | 0.00% | 100.0% | 12.63 | 25.89 | 3.72 | 5.77 |
| *Proprietary Models* | | | | | | | |
| Gemini-2.5-Flash | 0.26 | 13.30% | 73.40% | 2.40 | 5.70 | 0.07 | 44.18 |
| Gemini-2.5-Pro | 0.27 | 13.41% | 71.57% | 2.25 | 5.74 | 0.09 | 44.77 |
| Gemini-3-Pro | 0.32 | 16.38% | 66.21% | 1.72 | 4.37 | 0.11 | 73.83 |
| *Ours (Fine-Tuned)* | | | | | | | |
| Qwen2.5-VL-7B | 0.46 | 33.19% | 48.18% | 1.17 | 3.06 | 0.12 | 3.70 |
| Qwen3-VL-4B (StyleVLA) | 0.55 | 39.47% | 39.91% | 1.15 | 2.93 | 0.08 | 1.92 |

Metrics are computed on generated trajectories. InternVL3-9B achieves a 90.91% generation rate; Qwen3-VL-4B and Qwen2.5-VL-7B generate no valid trajectories, so their displacement metrics (marked —) cannot be computed.
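For readers reproducing the table, here is a minimal sketch of how the displacement and consistency metrics can be computed. ADE/FDE follow the standard definitions; the KCE formulation (step length vs. the displacement implied by the model's own speed/acceleration outputs, with an assumed step size `dt`) is our reading of the metric, not a definition taken from the paper.

```python
import math

def ade_fde(pred, gt):
    """Average and Final Displacement Error between matched waypoints."""
    dists = [math.hypot(px - gx, py - gy)
             for (px, py), (gx, gy) in zip(pred, gt)]
    return sum(dists) / len(dists), dists[-1]

def kce(pred, vel, acc, dt=0.5):
    """Kinematic Consistency Error (assumed form): mean gap between each
    predicted step length and the displacement implied by the predicted
    speed/acceleration under a constant-acceleration step of dt seconds."""
    gaps = []
    for t in range(len(pred) - 1):
        (px, py), (nx, ny) = pred[t], pred[t + 1]
        implied = vel[t] * dt + 0.5 * acc[t] * dt ** 2
        gaps.append(abs(math.hypot(nx - px, ny - py) - implied))
    return sum(gaps) / len(gaps)
```

This explains the `/` entries in the FPV table: models that emit waypoints without velocity or acceleration outputs provide nothing to check the step lengths against, so KCE is undefined for them.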

FPV Domain (CARLA) — Zero-Shot Benchmarking

| Model | Score ↑ | PSR ↑ | MR ↓ | ADE (m) ↓ | FDE (m) ↓ | KCE (m) ↓ | Time (s) ↓ |
|---|---|---|---|---|---|---|---|
| *Open Source Baselines* | | | | | | | |
| Qwen3-VL-4B (Base) | 0.00 | — | — | — | — | — | 4.97 |
| *Proprietary Models* | | | | | | | |
| Gemini-2.5-Flash | 0.12 | 9.04% | 87.21% | 9.39 | 18.21 | 3.42 | 1.80 |
| Gemini-2.5-Pro | 0.29 | 16.63% | 70.74% | 2.23 | 5.76 | 0.06 | 35.48 |
| GPT-5 Nano | 0.29 | 16.67% | 69.70% | 2.36 | 6.06 | 0.11 | 49.05 |
| Gemini-3-Pro | 0.35 | 17.65% | 62.35% | 1.54 | 3.94 | 0.06 | 91.39 |
| *SOTA VLA Models* | | | | | | | |
| SimLingo (1B) | 0.00 | 0.30% | 99.40% | 8.01 | 17.58 | / | 0.55 |
| Orion (7B) | 0.05 | 2.10% | 96.40% | 11.13 | 21.35 | / | 0.36 |
| OpenDriveVLA (0.5B) | 0.13 | 7.38% | 87.25% | 5.83 | 9.82 | / | 0.51 |
| Alpamayo-R1 (10B) | 0.19 | 13.60% | 73.10% | 3.37 | 5.85 | / | 0.65 |
| Qwen3-VL-4B (StyleVLA) | 0.51 | 38.60% | 36.90% | 1.17 | 3.13 | 0.11 | 2.13 |

Metrics are computed on successfully generated trajectories. All models achieve a 100% generation rate except OpenDriveVLA (14.9%), Gemini-2.5-Flash (90.9%), and Qwen3-VL-4B Base (0%). — marks metrics not computable due to a zero generation rate; / marks models that do not output velocity or acceleration, for which KCE cannot be computed.

Ablation Study

Impact of data scaling and loss components on BEV StyleVLA (Qwen2.5-VL-7B, 3 s horizon).

| Configuration | ADE (m) ↓ | FDE (m) ↓ | PSR ↑ | Heading MAE (rad) ↓ |
|---|---|---|---|---|
| *Impact of Training Data Scaling (with Physics-Informed Hybrid Loss)* | | | | |
| Small (4.5k) | 2.08 | 5.43 | 20.60% | 0.073 |
| Medium (20k) | 1.51 | 3.92 | 27.14% | 0.046 |
| Large (40k) | 1.47 | 3.81 | 29.37% | 0.044 |
| Standard (50k) | 1.17 | 3.06 | 33.19% | 0.035 |
| *Impact of Loss Components (50k training dataset)* | | | | |
| CE | 1.47 | 3.82 | 29.00% | 0.043 |
| CE + REG | 1.21 | 3.17 | 32.08% | 0.036 |
| CE + REG + PIKC (Ours) | 1.17 | 3.06 | 33.19% | 0.035 |

BibTeX

@inproceedings{stylevla2026,
  title     = {StyleVLA: Driving Style-Aware Vision Language Action Model
               for Autonomous Driving},
  author    = {Anonymous Authors},
  booktitle = {Under Review},
  year      = {2026},
  note      = {Under review}
}