Hardware Co-Design Scaling Laws via Roofline Modelling for On-Device LLMs
Daily Info Dashboard · 2026-02-10
2026-02-10T23:51:00Z
Published
AI Summary
The paper proposes hardware-software co-design scaling laws for on-device LLMs, coupling training-loss scaling with roofline latency modelling to directly optimise the accuracy-latency trade-off; on Jetson Orin this sharply shortens architecture selection and, at matched latency, outperforms Qwen2.5-0.5B.
- Proposes a hardware co-design law: models training loss as an explicit function of architectural hyperparameters and characterises inference latency with a roofline model (see the roofline sketch after this list).
- Evaluates 1,942 candidate architectures on NVIDIA Jetson Orin; 170 selected models are each trained for 10B tokens to fit the scaling law relating architecture to loss.
- Couples the loss scaling law with the latency model to establish a direct accuracy-latency correspondence and derive the Pareto frontier.
- Formulates architecture search as a joint optimisation over accuracy and performance, yielding feasible design regions under industrial hardware and application budgets.
- Cuts the architecture-selection cycle from months to days; at the same latency as Qwen2.5-0.5B, WikiText-2 perplexity drops by 19.42%.
- The authors describe this as the first operational framework for hardware co-design scaling laws for on-device LLMs and plan to open-source the code and checkpoints.
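The roofline model referenced above bounds each operator's latency by the slower of its compute time and its memory-traffic time. A minimal sketch in Python; the Jetson-Orin-class peak numbers and the per-token decode workload are illustrative assumptions, not figures from the paper:

```python
# Minimal roofline latency sketch (hypothetical numbers, not from the paper).
# Roofline: time per operator = max(compute-bound time, memory-bound time).

def roofline_latency_s(flops: float, bytes_moved: float,
                       peak_flops: float, peak_bw: float) -> float:
    """Lower-bound latency of one operator under the roofline model."""
    compute_time = flops / peak_flops      # seconds if compute-bound
    memory_time = bytes_moved / peak_bw    # seconds if bandwidth-bound
    return max(compute_time, memory_time)

# Illustrative Jetson-Orin-class peaks (assumed; check your device's specs).
PEAK_FLOPS = 85e12   # ~85 TFLOPS FP16 (varies by precision and model)
PEAK_BW = 204e9      # ~204 GB/s DRAM bandwidth

# One batch-1 decode step of a d_model=1024, 24-layer dense transformer
# (rough counts: ~12*d^2 params per layer for attn + MLP; ~2 FLOPs per
# param per token; fp16 weights streamed from DRAM once per step).
params = 24 * 12 * 1024**2
flops_per_token = 2 * params
bytes_per_token = 2 * params  # 2 bytes per fp16 weight

t = roofline_latency_s(flops_per_token, bytes_per_token, PEAK_FLOPS, PEAK_BW)
print(f"per-token decode lower bound: {t*1e3:.2f} ms")  # memory-bound here
```

At batch size 1 the memory term dominates by orders of magnitude, which is why decode latency on edge devices tracks weight bytes rather than FLOPs.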
#arXiv #paper #Research/Paper #Scaling Law #Jetson Orin #VLA
Content Excerpt
Vision-Language-Action Models (VLAs) have emerged as a key paradigm of Physical AI and are increasingly deployed in autonomous vehicles, robots, and smart spaces. In these resource-constrained on-device settings, selecting an appropriate large language model (LLM) backbone is a critical challenge: models must balance accuracy with strict inference latency and hardware efficiency constraints. This makes hardware-software co-design a game-changing requirement for on-device LLM deployment, where each hardware platform demands a tailored architectural solution. We propose a hardware co-design law that jointly captures model accuracy and inference performance. Specifically, we model training loss as an explicit function of architectural hyperparameters and characterise inference latency via roofline modelling. We empirically evaluate 1,942 candidate architectures on NVIDIA Jetson Orin, training 170 selected models for 10B tokens each to fit a scaling law relating architecture to training loss. By coupling this scaling law with latency modelling, we establish a direct accuracy-latency correspondence and identify the Pareto frontier for hardware co-designed LLMs. We further formulate architecture search as a joint optimisation over precision and performance, deriving feasible design regions under industrial hardware and application budgets. Our approach reduces architecture selection from months to days. At the same latency as Qwen2.5-0.5B on the target hardware, our co-designed architecture achieves 19.42% lower perplexity on WikiText-2. To our knowledge, this is the first principled and operational framework for hardware co-design scaling laws in on-device LLM deployment. We will make the code and related checkpoints publicly available.
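For readers who want the mechanics: the loss side of the co-design law amounts to fitting a parametric scaling law over the trained candidates. A minimal sketch, assuming a Chinchilla-style form L(N) = a·N^(−α) + L∞ over parameter count alone and made-up (parameter count, loss) pairs; the paper's actual fit is over richer architectural hyperparameters:

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical (param_count, final_loss) pairs; the paper obtains such
# points by training 170 selected models for 10B tokens each.
n = np.array([50e6, 125e6, 250e6, 500e6, 1e9])
loss = np.array([3.45, 3.18, 2.99, 2.84, 2.71])

def power_law(n, a, alpha, l_inf):
    # L(N) = a * N^(-alpha) + L_inf  (assumed Chinchilla-style form)
    return a * n**(-alpha) + l_inf

(a, alpha, l_inf), _ = curve_fit(power_law, n, loss, p0=(1e3, 0.3, 2.0))
print(f"L(N) ~= {a:.1f} * N^-{alpha:.3f} + {l_inf:.2f}")
```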
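The latency side then couples in directly: score every candidate with the fitted loss law and the roofline model, discard those over the application latency budget, and keep the non-dominated points. A sketch with hypothetical candidate tuples and budget:

```python
# Sketch: feasible region + Pareto frontier over (latency, predicted loss).
# Candidate tuples are illustrative, not the paper's 1,942 architectures.
candidates = [
    # (name, latency_ms, predicted_loss)
    ("d512-L12",  2.1, 3.10),
    ("d768-L16",  3.4, 2.95),
    ("d1024-L24", 5.8, 2.80),
    ("d1024-L32", 7.5, 2.74),
    ("d2048-L24", 9.9, 2.70),
]

LATENCY_BUDGET_MS = 8.0  # assumed application budget

feasible = [c for c in candidates if c[1] <= LATENCY_BUDGET_MS]

# Pareto frontier: sort by latency; keep points that improve the best loss.
frontier, best_loss = [], float("inf")
for name, lat, loss in sorted(feasible, key=lambda c: c[1]):
    if loss < best_loss:
        frontier.append((name, lat, loss))
        best_loss = loss

print(frontier)  # non-dominated designs within the budget
```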