Datasets
datasets/AAPD/data · master · Jimmy Lin / Castor-data · GitLab
- Extreme-classification provides the commonly used datasets, e.g. AmazonTitles, Amazon, WikiSeeAlsoTitles, Wikipedia, ORCAS, etc. GitHub - Extreme-classification/MUFIN: Multimodal extreme classification
The Extreme Classification Repository
Dataset Preprocessing
Train/Test Split
Problem Description
- Dataset balancing: how to split a raw dataset, in which each sample carries one or more labels, in a balanced way.
// The dataset format is roughly like this
[
{
"content": "Abc",
"label": [100, 102, 400]
},
...
]
That is, the label sets of the two resulting subsets should be approximately or exactly identical, and on top of that the split should follow a given ratio.
The desired end result is
len( set(train_labels).difference(set(test_labels)) ) --> 0
- Label pruning: how to select a subset of labels such that the number of samples in the whole dataset carrying those labels falls within a given range. As for the removal strategy, there are two options:
- remove every sample that carries the label
- remove the label from each sample, then drop samples left with no labels at all
The second approach can be used (a sketch follows below).
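A minimal sketch of the second strategy, assuming the dataset is a plain list of dicts in the format shown above (the function name prune_labels and the argument labels_to_drop are placeholders):

```python
def prune_labels(samples, labels_to_drop):
    """Drop the given labels from every sample, then drop samples left with no labels."""
    pruned = []
    for sample in samples:
        remaining = [l for l in sample["label"] if l not in labels_to_drop]
        if remaining:  # keep only samples that still carry at least one label
            pruned.append({"content": sample["content"], "label": remaining})
    return pruned
```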
库和相关方法
- Scikit-learn's split: "As you've noticed, stratification for scikit-learn's `train_test_split()` does not consider the labels individually, but rather as a "label set". This does not work well at all for multi-label data because the number of unique combinations grows exponentially with the number of labels."
Train-test-split for multilabel datasets · Issue #25596 · scikit-learn/scikit-learn · GitHub MultiLabelBinarizer — scikit-learn 1.5.2 documentation
Scikit-learn does not take each label into account individually, so its built-in split methods are not recommended here.
- iterative-stratification: stratified splitting. GitHub - trent-b/iterative-stratification: scikit-learn cross validators for iterative stratification of multilabel data
- scikit-multilearn: Multi-Label Classification in Python — Multi-Label Classification for Python
Both of these require the labels to be converted to one-hot vectors, either with scikit-learn's MultiLabelBinarizer or by hand. When doing so, convert the vectors to a sparse matrix (`from scipy.sparse import csr_matrix`); otherwise large datasets will very easily run into OOM.
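A minimal sketch of the conversion, using sparse output so that the (n_samples, n_labels) matrix is never materialised densely (the label lists are made-up examples):

```python
from sklearn.preprocessing import MultiLabelBinarizer

label_lists = [[100, 102, 400], [102], [400, 500]]  # one label list per sample

# sparse_output=True makes fit_transform return a scipy sparse matrix directly
mlb = MultiLabelBinarizer(sparse_output=True)
y = mlb.fit_transform(label_lists)
print(y.shape, type(y))
```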
Also, for datasets that are inherently imbalanced (e.g. some labels have only a single sample), these libraries cannot automatically drop such samples, so they need to be removed beforehand according to your needs. Splitting without this preprocessing will not give pretty results.
from collections import Counter

# label_flatten: all label ids across all samples, flattened into a single list
count = Counter(label_flatten)
least_common = count.most_common()
least_common.reverse()
# Get the least common classes, which have only 1-10 instances each
least_common = [item[0] for item in least_common if item[1] <= 10]
print(least_common[0:10])
print(len(least_common))
print(max(label_flatten))
# Output:
# [669667, 667207, 376867, 376864, 73946, 73852, 291455, 291447, 512571, 495124]
# 352605
# 670090  -> out of 670090+1 labels, 352605 labels have only a handful of instances!
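The rare labels found this way can then be removed with the second pruning strategy before splitting (assuming samples holds the raw dataset and prune_labels is the hypothetical helper sketched earlier):

```python
# Drop the rare labels from every sample, then drop samples left with no labels
rare_labels = set(least_common)
samples = prune_labels(samples, rare_labels)
```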
For a subset of the AmazonTitles dataset with 157167 samples and 670091 labels, split with scikit-multilearn, where the samples are converted to a (157167, 50) feature matrix and the labels to a (157167, 670091) one-hot matrix, both stored as sparse matrices, the reference running time is as follows:
# i9-12900K, 32G DDR4
from skmultilearn.model_selection import iterative_train_test_split

# a: (157167, 50) sparse feature matrix; b: (157167, 670091) sparse one-hot label matrix
X_train, y_train, X_test, y_test = iterative_train_test_split(a, b, test_size=0.35)
# 254m7.4s
- Multilabel Balanced Sampler GitHub - issamemari/pytorch-multilabel-balanced-sampler: PyTorch samplers that output roughly balanced batches with support for multilabel datasets
StackOverflow and other discussions
python - How to perform MultiLabel stratified sampling? - Stack Overflow
Balancing a multilabel dataset - nlp - PyTorch Forums
Check This: GameTrainSkill | 严鸿昌的个人博客
Original paper: Sechidis K., Tsoumakas G., Vlahavas I. (2011) On the Stratification of Multi-Label Data. In: Gunopulos D., Hofmann T., Malerba D., Vazirgiannis M. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2011. Lecture Notes in Computer Science, vol 6913. Springer, Berlin, Heidelberg.
Memory Optimization
- np.memmap: keep oversized matrices in external storage (on disk) instead of RAM.
- csr_matrix: store matrices in sparse form, which scikit-multilearn supports.
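A minimal np.memmap sketch, where the file name and shape are placeholders:

```python
import numpy as np

# Disk-backed array: the data lives in a file and is paged in on access,
# so a matrix larger than RAM can still be indexed like a normal ndarray.
X = np.memmap("features.dat", dtype=np.float32, mode="w+", shape=(157167, 50))
X[0, :] = 1.0  # writes go through to the backing file
X.flush()
```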
Summary
- Obtain the dataset and gather its statistics
- Select a set of labels from the label distribution
- Remove the unwanted samples, then revisit the selected label set based on the remaining sample count and label distribution
- Split the dataset with scikit-multilearn
- Check whether the label sets of the resulting subsets are identical (see the sketch below)
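A quick way to do the final check, assuming y_train and y_test are the sparse one-hot label matrices returned by the split:

```python
import numpy as np

# A label is covered by a subset if at least one of its samples carries it
train_labels = set(np.flatnonzero(np.asarray(y_train.sum(axis=0)).ravel()))
test_labels = set(np.flatnonzero(np.asarray(y_test.sum(axis=0)).ravel()))

# Ideally the symmetric difference is empty (or at least very small)
print(len(train_labels ^ test_labels))
```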
Methods
CRFA
Combining Context-relevant Features with Multi-stage Attention Network for Short Text Classification
Comal
Performance Metrics
Creating a balanced multi-label dataset for machine learning | by Adam K | GumGum Tech Blog | Medium
Related Resources
Microsoft Concept Probase
MSRA. If it is unavailable, it can be downloaded from the repository below: GitHub - HKUST-KnowComp/atomic-conceptualization: Code and data for the paper Acquiring and Modelling Abstract Commonsense Knowledge via Conceptualization
GloVe
GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
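For reference, a minimal sketch of loading pretrained GloVe vectors from the standard plain-text format (one word followed by its float components per line; the file name is just an example):

```python
import numpy as np

def load_glove(path):
    """Load GloVe vectors from a text file into a {word: vector} dict."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings

vectors = load_glove("glove.6B.100d.txt")
```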