Scaling Laws for Masked-Reconstruction Transformers on Single-Cell Transcriptomics

Daily Information Dashboard · 2026-02-16
Category: Research / Papers
Source: arxiv_search
Score: 90
Published: 2026-02-16T23:20:58Z

AI Summary

The paper systematically validates scaling laws for masked-reconstruction Transformers on single-cell transcriptomics: when data are plentiful, loss decreases as a power law in model size, whereas with scarce data there is almost no scaling, indicating that the data-to-parameter ratio is the key factor in building efficient single-cell foundation models.
#arXiv #paper #research/papers #scRNA-seq #Transformer

Content Excerpt

Neural scaling laws -- power-law relationships between loss, model size, and data -- have been extensively documented for language and vision transformers, yet their existence in single-cell genomics remains largely unexplored. We present the first systematic study of scaling behaviour for masked-reconstruction transformers trained on single-cell RNA sequencing (scRNA-seq) data. Using expression profiles from the CELLxGENE Census, we construct two experimental regimes: a data-rich regime (512 highly variable genes, 200,000 cells) and a data-limited regime (1,024 genes, 10,000 cells). Across seven model sizes spanning three orders of magnitude in parameter count (533 to 3.4 x 10^8 parameters), we fit the parametric scaling law to validation mean squared error (MSE). The data-rich regime exhibits clear power-law scaling with an irreducible loss floor of c ~ 1.44, while the data-limited regime shows negligible scaling, indicating that model capacity is not the binding constraint when data are scarce. These results establish that scaling laws analogous to those observed in natural language processing do emerge in single-cell transcriptomics when sufficient data are available, and they identify the data-to-parameter ratio as a critical determinant of scaling behaviour. A preliminary conversion of the data-rich asymptotic floor to information-theoretic units yields an estimate of approximately 2.30 bits of entropy per masked gene position. We discuss implications for the design of single-cell foundation models and outline the additional measurements needed to refine this entropy estimate.
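For readers who want to see the curve-fitting step concretely, the sketch below fits a saturating power law L(N) = a·N^(−α) + c to validation MSE across model sizes and converts the fitted floor to bits per masked gene position. The functional form, the example loss values, and the Gaussian-residual assumption are illustrative guesses under stated assumptions, not the authors' code or data.

```python
# Minimal sketch (hypothetical data): fit L(N) = a * N**(-alpha) + c to
# validation MSE over model sizes, then convert the floor c to bits by
# treating residuals at the floor as Gaussian with variance c.

import numpy as np
from scipy.optimize import curve_fit

def scaling_law(n_params, a, alpha, c):
    """Saturating power law: loss = a * N^(-alpha) + c."""
    return a * n_params ** (-alpha) + c

# Hypothetical (model size, validation MSE) pairs; the paper reports seven
# model sizes spanning 533 to 3.4e8 parameters but not the per-size losses.
n_params = np.array([5.3e2, 4.1e3, 3.2e4, 2.5e5, 2.0e6, 4.3e7, 3.4e8])
val_mse = np.array([2.95, 2.41, 2.02, 1.78, 1.62, 1.52, 1.47])

(a, alpha, c), _ = curve_fit(
    scaling_law, n_params, val_mse, p0=[5.0, 0.2, 1.0], maxfev=10_000
)
print(f"fit: a={a:.3g}, alpha={alpha:.3g}, irreducible floor c={c:.3g}")

# Differential entropy of a Gaussian with variance c, in bits:
# h = 0.5 * log2(2 * pi * e * c).
entropy_bits = 0.5 * np.log2(2 * np.pi * np.e * 1.44)
print(f"entropy per masked gene position: {entropy_bits:.2f} bits")
```

With c ≈ 1.44, the Gaussian differential entropy 0.5·log2(2πe·c) evaluates to roughly 2.3 bits, consistent with the paper's preliminary per-masked-gene entropy estimate, though the paper does not state which conversion it used.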