Text Grafting: Near-Distribution Weak Supervision for Minority Classes in Text Classification
CoRR (2024)
Abstract
For extremely weak-supervised text classification, pioneering research generates
pseudo labels by mining texts similar to the class names from the raw corpus,
which may end up with very limited or even no samples for the minority classes.
Recent works have started to generate the relevant texts by prompting LLMs
using the class names or definitions; however, there is a high risk that LLMs
cannot generate in-distribution (i.e., similar to the corpus where the text
classifier will be applied) data, leading to ungeneralizable classifiers. In
this paper, we combine the advantages of these two approaches and propose to
bridge the gap via a novel framework, text grafting, which aims to
obtain clean and near-distribution weak supervision for minority classes.
Specifically, we first use LLM-based logits to mine masked templates from the
raw corpus that have a high potential to be synthesized into data for the target
minority class. Then, the templates are filled by state-of-the-art LLMs to
synthesize near-distribution texts falling into minority classes. Text grafting
shows significant improvement over direct mining or synthesis on minority
classes. We also use analysis and case studies to understand the properties of
text grafting.
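
The two-stage pipeline described above can be illustrated with a minimal sketch. All helper names (mask_template, score_template, fill_template) and the keyword-based scoring stand-in below are assumptions for illustration only; the paper's actual method relies on LLM logits for template mining and prompting state-of-the-art LLMs for template filling.

```python
"""Illustrative sketch of the two-stage text-grafting pipeline, under the
assumptions stated above; not the authors' implementation."""
from typing import List


def mask_template(sentence: str) -> str:
    """Turn a corpus sentence into a masked template by blanking every other token."""
    tokens = sentence.split()
    return " ".join(tok if i % 2 == 0 else "[MASK]" for i, tok in enumerate(tokens))


def score_template(template: str, class_name: str) -> float:
    """Stand-in for the LLM-logit score of how likely a completion of this
    template could fall into the target minority class (here a trivial
    keyword-overlap heuristic, purely for illustration)."""
    words = class_name.lower().split()
    return sum(w in template.lower() for w in words) / max(len(words), 1)


def fill_template(template: str, class_name: str) -> str:
    """Stand-in for prompting a strong LLM to fill the [MASK] slots with
    content that belongs to the minority class."""
    return template.replace("[MASK]", class_name)


def text_grafting(corpus: List[str], class_name: str, top_k: int = 100) -> List[str]:
    """Mine high-potential templates from the raw corpus, then fill them to
    synthesize near-distribution training texts for the minority class."""
    # Stage 1: mine masked templates and rank them by their potential
    # to be completed into the target minority class.
    templates = [mask_template(s) for s in corpus]
    ranked = sorted(templates, key=lambda t: score_template(t, class_name), reverse=True)
    # Stage 2: fill the top-ranked templates to obtain pseudo-labeled texts.
    return [fill_template(t, class_name) for t in ranked[:top_k]]


if __name__ == "__main__":
    corpus = [
        "The team announced a new policy on renewable energy subsidies.",
        "Fans celebrated the championship victory late into the night.",
    ]
    for text in text_grafting(corpus, class_name="energy", top_k=2):
        print(text)
```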