LMD3: Language Model Data Density Dependence
CoRR (2024)
Abstract
We develop a methodology for analyzing language model task performance at the
individual example level based on training data density estimation. Experiments
with paraphrasing as a controlled intervention on finetuning data demonstrate
that increasing the support in the training distribution for specific test
queries results in a measurable increase in density, which is also a
significant predictor of the performance increase caused by the intervention.
Experiments with pretraining data demonstrate that we can explain a significant
fraction of the variance in model perplexity via density measurements. We
conclude that our framework can provide statistical evidence of the dependence
of a target model's predictions on subsets of its training data, and can more
generally be used to characterize the support (or lack thereof) in the training
data for a given test task.
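As a concrete illustration of the pipeline the abstract describes, the following Python snippet estimates the density of test queries under the training data distribution and correlates it with per-example perplexity. This is a minimal sketch, not the paper's implementation: the sentence-transformers embedder, the Gaussian KDE with a fixed bandwidth, and all example data are assumptions made purely for illustration.

```python
# Sketch of the analysis described above: estimate the density of each
# test query under the training distribution, then test whether density
# predicts per-example model performance.
#
# Everything here is an illustrative stand-in, not the paper's actual
# pipeline: the embedder, the Gaussian KDE and its bandwidth, and the
# toy data are all assumptions.
import numpy as np
from scipy.stats import pearsonr
from sklearn.neighbors import KernelDensity
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedder

# Hypothetical training corpus and held-out test queries.
train_texts = [
    "How do I reverse a list in Python?",
    "Explain the difference between a stack and a queue.",
    "What is the capital of France?",
]
test_texts = [
    "How can I reverse a Python list?",
    "Describe how a queue differs from a stack.",
    "Summarize the plot of Hamlet.",
]
# Per-example perplexities of the target model on the test queries
# (fabricated numbers, purely for illustration).
test_ppl = np.array([8.1, 9.4, 27.6])

# Fit a density estimator on embeddings of the training data.
train_emb = embedder.encode(train_texts, normalize_embeddings=True)
kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(train_emb)

# Score each test query under the estimated training distribution.
test_emb = embedder.encode(test_texts, normalize_embeddings=True)
log_density = kde.score_samples(test_emb)

# Correlate density with performance: the abstract's claim is that
# higher training-data density predicts better (lower-perplexity)
# model behavior on the corresponding test examples.
r, p = pearsonr(log_density, np.log(test_ppl))
print(f"Pearson r between log-density and log-perplexity: {r:.3f} (p={p:.3f})")
```

With real data, the fraction of variance in perplexity explained by density (as reported in the abstract) would come from regressing per-example perplexity on these density scores over a large test set rather than a three-point correlation.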