
A First Look at Few-Shot Multi-Label Text Classification

A few bits of knowledge accumulated while helping out with a project.

Datasets

datasets/AAPD/data · master · Jimmy Lin / Castor-data · GitLab

The Extreme Classification Repository

Dataset Preprocessing

Train/Test Split

Problem Description

  1. Balanced splitting. Given a raw dataset in which every sample carries one or more labels, how do we split it in a balanced way?
// The dataset format looks roughly like this
[
	{
	"content": "Abc", 
	"label": [100, 102, 400]
	}, 
	... 
]

That is, the label sets of the two resulting subsets should be approximately or exactly identical, and on top of that the desired split ratio should be achieved. The end goal is len( set(train_labels).difference(set(test_labels)) ) --> 0

  2. Label pruning. How do we choose a subset of labels such that the total number of samples covered by those labels stays within a given range? For removing a label there are two strategies:
  • Remove every sample that contains the label
  • Remove the label from each individual sample, then drop any sample left with no labels

The second strategy is preferable, since it only discards samples that end up with no labels at all; a minimal sketch follows.
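Below is a minimal sketch of the second strategy, assuming the JSON-like sample format shown above (prune_labels is a hypothetical helper of my own, not from any library):

# Strategy 2: strip the selected labels from every sample, then drop
# samples whose label set becomes empty.
def prune_labels(samples, labels_to_remove):
    removed = set(labels_to_remove)
    pruned = []
    for sample in samples:
        kept = [l for l in sample["label"] if l not in removed]
        if kept:  # keep only samples that still have at least one label
            pruned.append({"content": sample["content"], "label": kept})
    return pruned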

Libraries and Related Methods
  • Scikit-learn splitting: "As you've noticed, stratification for scikit-learn's train_test_split() does not consider the labels individually, but rather as a 'label set'. This does not work well at all for multi-label data because the number of unique combinations grows exponentially with the number of labels."

Train-test-split for multilabel datasets · Issue #25596 · scikit-learn/scikit-learn · GitHub

MultiLabelBinarizer — scikit-learn 1.5.2 documentation

Because scikit-learn does not consider each label individually, its built-in split utilities are not recommended for multi-label data.

scikit-multilearn: Multi-Label Classification in Python — Multi-Label Classification for Python

GitHub - scikit-multilearn/scikit-multilearn: A scikit-learn based module for multi-label et. al. classification

These methods all require the labels to be converted into one-hot vectors; scikit-learn's MultiLabelBinarizer does this, or you can implement it yourself.

When doing so, make sure to store the vectors as a sparse matrix (from scipy.sparse import csr_matrix); otherwise large datasets will very easily run out of memory (OOM).
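A minimal sketch of the conversion (toy label lists and my own variable names): MultiLabelBinarizer can emit the sparse matrix directly via sparse_output=True.

from sklearn.preprocessing import MultiLabelBinarizer

label_lists = [[100, 102, 400], [102], [400, 500]]  # toy data for illustration
mlb = MultiLabelBinarizer(sparse_output=True)       # emit a scipy.sparse CSR matrix
y = mlb.fit_transform(label_lists)
print(y.shape)  # (3, 4): three samples, four distinct labels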

Also, for datasets that are inherently imbalanced (e.g. some labels have only a single sample), these libraries cannot automatically drop such samples, so you need to remove them up front according to your needs. Splitting directly without this preprocessing is unlikely to produce good results.

from collections import Counter

# label_flatten: a flat list of every label occurrence in the dataset
count = Counter(label_flatten)
least_common = count.most_common()
least_common.reverse()
# Get the least common classes, which have only 1-10 instances each
least_common = [item[0] for item in least_common if item[1] <= 10]
print(least_common[0:10])
print(len(least_common))
print(max(label_flatten))

[669667, 667207, 376867, 376864, 73946, 73852, 291455, 291447, 512571, 495124]
352605
670090  # out of 670090+1 labels, 352605 have only a handful of instances each!

For a subset of the AmazonTitles dataset with 157167 samples and 670091 labels, converted into a (157167, 50) feature matrix and a (157167, 670091) one-hot label matrix (stored as sparse matrices) and split with scikit-multilearn, the reference running time is as follows:

from skmultilearn.model_selection import iterative_train_test_split

# i9-12900K, 32 GB DDR4
# a: (157167, 50) feature matrix; b: (157167, 670091) sparse one-hot label matrix
X_train, y_train, X_test, y_test = iterative_train_test_split(a, b, test_size=0.35)
# wall time: 254m7.4s
StackOverflow and Other Discussions

python - How to perform MultiLabel stratified sampling? - Stack Overflow

data mining - How can I perform stratified sampling for multi-label multi-class classification? - Data Science Stack Exchange

machine learning - How to use sklearn train_test_split to stratify data for multi-label classification? - Data Science Stack Exchange

Balancing a multilabel dataset - nlp - PyTorch Forums

GitHub - stxupengyu/Multi-Label-Classification-Data-Preprocessing: Data preprocessing for multi-label classification.

Check This: GameTrainSkill | 严鸿昌's personal blog

Original paper: Sechidis K., Tsoumakas G., Vlahavas I. (2011). On the Stratification of Multi-Label Data. In: Gunopulos D., Hofmann T., Malerba D., Vazirgiannis M. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2011. Lecture Notes in Computer Science, vol 6913. Springer, Berlin, Heidelberg.

Memory Optimization
  • np.memmap: keep oversized matrices on disk instead of in RAM (see the sketch below)
  • csr_matrix: store matrices in sparse form; scikit-multilearn supports this
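A minimal np.memmap sketch (the file name and shape here are placeholders):

import numpy as np

# Create a disk-backed array instead of allocating the whole thing in RAM.
big = np.memmap("features.dat", dtype=np.float32, mode="w+", shape=(157167, 50))
big[0] = 1.0   # reads and writes are paged to the file transparently
big.flush()    # force pending writes out to disk

# Reopen later (read-only) with the same dtype and shape.
big_ro = np.memmap("features.dat", dtype=np.float32, mode="r", shape=(157167, 50))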

Summary

  1. Obtain the dataset and gather its statistics
  2. Select a label set based on the label distribution
  3. Remove the unwanted samples, then revisit the selected label set in light of the remaining sample count and label distribution
  4. Split the dataset with scikit-multilearn
  5. Check that the label sets of the resulting subsets are identical (a sketch follows)
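Step 5 can be checked directly on the binarized label matrices; a minimal sketch, assuming y_train and y_test are the sparse one-hot matrices returned by the split:

import numpy as np

# A label occurs in a subset iff its column contains at least one nonzero entry.
train_labels = set(np.flatnonzero(np.asarray(y_train.sum(axis=0)).ravel()))
test_labels = set(np.flatnonzero(np.asarray(y_test.sum(axis=0)).ravel()))
print(len(train_labels.symmetric_difference(test_labels)))  # should print 0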

Methods

CRFA

Combining Context-relevant Features with Multi-stage Attention Network for Short Text Classification

GitHub - peipeilihfut/CRFA: Combining Context-relevant Features with Multi-stage Attention Network for Short Text Classification

Comal

Evaluation Metrics

Understanding Micro, Macro, and Weighted Averages for Scikit-Learn metrics in multi-class classification with example

Creating a balanced multi-label dataset for machine learning | by Adam K | GumGum Tech Blog | Medium
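As a quick illustration of the averaging modes discussed in the first link (toy indicator matrices of my own, not from the post):

import numpy as np
from sklearn.metrics import f1_score

# Rows are samples, columns are labels (binary indicator format).
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred = np.array([[1, 0, 1], [0, 1, 0], [1, 0, 0]])

print(f1_score(y_true, y_pred, average="micro"))     # pool all label decisions
print(f1_score(y_true, y_pred, average="macro"))     # unweighted mean over labels
print(f1_score(y_true, y_pred, average="weighted"))  # mean weighted by label support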

Related Resources

Microsoft Concept Probase

If the MSRA download is unavailable, it can be fetched from the repository below: GitHub - HKUST-KnowComp/atomic-conceptualization: Code and data for the paper Acquiring and Modelling Abstract Commonsense Knowledge via Conceptualization

GloVe

GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.

GloVe: Global Vectors for Word Representation