
An Oversampling Ensemble Algorithm Based on Margin Theory


  Abstract: To solve the problem that traditional ensemble algorithms are not suitable for imbalanced data classification, an oversampling AdaBoost algorithm based on margin theory (MOSBoost) was proposed. Firstly, the margins of the original samples were obtained by pretraining. Then, the minority-class samples were heuristically duplicated according to margin sorting, forming a new balanced sample set. Finally, the balanced sample set was fed into AdaBoost for training to obtain the final ensemble classifier. Experiments were carried out on UCI datasets, and the F-measure and G-mean criteria were used to evaluate four algorithms: MOSBoost, AdaBoost, Random OverSampling AdaBoost (ROSBoost) and Random DownSampling AdaBoost (RDSBoost). The experimental results show that MOSBoost outperforms the other three algorithms; compared with AdaBoost, it achieves improvements of 8.4% and 6.2% under the F-measure and G-mean criteria respectively.
  Key words: imbalanced data; margin theory; oversampling method; ensemble classifier; machine learning
  CLC number: TP181
  Document code: A
  0 Introduction
  In recent years, imbalanced data classification has become a hot topic in machine learning, and it arises widely in real-world applications such as e-mail filtering [1], image classification [2], software defect prediction [3], medical diagnosis [4] and gene data analysis [5]. In a binary classification problem, the majority class of an imbalanced dataset contains far more samples than the minority class. Traditional classification methods take overall classification accuracy as their objective and ignore the imbalance between classes, which lowers the classification accuracy on minority samples; yet minority samples often carry the most value, so misclassifying them is costly.
  Methods for handling imbalanced data fall roughly into two levels, the algorithm level and the data level. Algorithm-level methods construct new algorithms, or modify existing ones, to bias learning toward the minority class; data-level methods mainly use resampling to obtain a balanced sample set, which is then classified with existing classifiers. Resampling methods, including undersampling and oversampling, are simple in form and do not interfere with classifier design, so they have been studied extensively. By strategy, resampling can further be divided into random sampling and heuristic sampling: random sampling ignores the information in the data and simply deletes or adds samples at random, whereas heuristic sampling exploits the internal characteristics of the data. Typical heuristic undersampling methods such as Tomek links [6], one-sided selection [7] and the Neighborhood Cleaning Rule [8] overcome the tendency of random undersampling to discard useful information and improve performance to some extent. Among heuristic oversampling methods, the most representative is SMOTE (Synthetic Minority Oversampling TEchnique) [9] and its improved variants [10-12]. The basic assumption of SMOTE is that the convex set generated by neighboring data points of the same class also belongs to that class (see the sketch after this paragraph). Heuristic resampling methods essentially screen samples under some criterion and therefore depend strongly on the dataset; however, imbalanced datasets often exhibit within-class imbalance, small disjuncts and high noise, which makes the criterion hard to satisfy and degrades performance. On the surface this is a mismatch between dataset and criterion; fundamentally, these methods lack a theoretical basis and generalize poorly.
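  To make that assumption concrete, here is a minimal Python sketch of the core SMOTE interpolation step: a synthetic minority sample is drawn uniformly from the line segment between a minority sample and one of its same-class nearest neighbours. The function name and arguments are ours for illustration, not from the original paper [9].

```python
import numpy as np

def smote_interpolate(x_i, x_nn, rng=None):
    """Core SMOTE step: return a synthetic minority sample on the
    segment between x_i and its same-class nearest neighbour x_nn."""
    rng = rng or np.random.default_rng()
    lam = rng.uniform(0.0, 1.0)        # random position on the segment
    return x_i + lam * (x_nn - x_i)    # convex combination of the two points
```

  Under SMOTE's convexity assumption every point this produces carries the minority label; when the assumption fails (noise, small disjuncts), synthetic points land in majority-class regions, which is precisely the fragility discussed above.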
  AdaBoost is a classic ensemble classification algorithm with wide application in machine learning [13-15]. Because AdaBoost aims to minimize the overall classification error, it ignores the imbalance between classes and is therefore not suitable for imbalanced data classification. Margin theory is an important theoretical foundation of AdaBoost and has successfully explained phenomena such as AdaBoost's resistance to overfitting [16-17]. Starting from margin theory, this paper defines the minority-class margin and the majority-class margin, screens minority samples by the sign of their margins, and heuristically duplicates the minority samples with positive margins to form a new balanced sample set. Training AdaBoost on this set gives the MOSBoost algorithm, which improves classification performance on imbalanced data.
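  For reference, Schapire et al. [16] define the voting margin of a sample (xi, yi) as margin(xi) = yi·Σt αt ht(xi) / Σt |αt|, so a positive margin means the ensemble's weighted vote classifies the sample correctly, and its magnitude measures the confidence of that vote. The following Python sketch is only our reconstruction of the pipeline described in this paragraph, with scikit-learn's AdaBoostClassifier standing in for the pretraining step; the function name, the cycling duplication rule and the use of decision_function as an (unnormalised) margin are assumptions, not the authors' code.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def mosboost_fit(X, y, random_state=0):
    """Sketch of the MOSBoost pipeline described above, assuming
    binary labels in {-1, +1} with +1 as the minority class."""
    # Step 1 -- pretraining: fit AdaBoost once to obtain sample margins.
    # decision_function(x) is sign-aligned with the +1 class, so
    # y * decision_function(x) > 0 iff the weighted vote on x is correct.
    pre = AdaBoostClassifier(random_state=random_state).fit(X, y)
    margins = y * pre.decision_function(X)

    # Step 2 -- heuristic duplication: keep minority samples whose margin
    # is positive, sort them by descending margin, and replicate them
    # (cycling through the sorted list) until both classes are the same size.
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == -1)
    candidates = minority[margins[minority] > 0]
    candidates = candidates[np.argsort(-margins[candidates])]
    deficit = len(majority) - len(minority)
    extra = np.resize(candidates, deficit)   # assumes at least one candidate
    X_bal = np.vstack([X, X[extra]])
    y_bal = np.concatenate([y, y[extra]])

    # Step 3 -- train the final AdaBoost ensemble on the balanced sample set.
    return AdaBoostClassifier(random_state=random_state).fit(X_bal, y_bal)
```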
  1 Related Work
  1.1 The AdaBoost Algorithm
  AdaBoost takes a training sample set {(x1, y1), (x2, y2), …, (xN, yN)} as input, where xi is a sample and yi its class label; for binary classification, yi ∈ {-1, 1}. A given base learning algorithm is then run repeatedly over rounds t = 1, 2, …, T. Dt(i) denotes the weight of the i-th training sample in round t. The task of the base learning algorithm is to obtain, under the weight distribution Dt, a base classifier ht that minimizes the weighted classification error. Once ht is trained, AdaBoost chooses a parameter αt ∈ R that measures the classification quality of ht, and then updates the weight distribution Dt. The final ensemble classifier F is the weighted output of the T base classifiers. The complete procedure is given in Algorithm 1.
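  Algorithm 1 itself is not reproduced in this excerpt. As a reference point, the following Python sketch implements the standard binary AdaBoost loop exactly as described above, using decision stumps as base classifiers and the conventional choice αt = (1/2)ln((1 − εt)/εt); both are common defaults and are assumptions here, not necessarily the paper's exact Algorithm 1.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, y, T=50):
    """Minimal sketch of the binary AdaBoost loop described above,
    assuming labels y in {-1, +1}; decision stumps are the base learner."""
    N = len(y)
    D = np.full(N, 1.0 / N)                     # D_1: uniform sample weights
    classifiers, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = h.predict(X)
        eps = np.clip(D[pred != y].sum(), 1e-10, 1 - 1e-10)  # weighted error of h_t
        alpha = 0.5 * np.log((1 - eps) / eps)   # alpha_t: vote weight of h_t
        D = D * np.exp(-alpha * y * pred)       # up-weight misclassified samples
        D = D / D.sum()                         # renormalise D_{t+1} to a distribution
        classifiers.append(h)
        alphas.append(alpha)

    def F(X_query):
        """Final ensemble: sign of the weighted vote of the T base classifiers."""
        votes = sum(a * h.predict(X_query) for a, h in zip(alphas, classifiers))
        return np.sign(votes)
    return F
```

  The unsigned weighted vote, normalised by the sum of the αt, is exactly the per-sample margin that MOSBoost sorts and thresholds in its pretraining step.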
  References
  [1] DAI H L. Class imbalance learning via a fuzzy total margin based support vector machine[J]. Applied Soft Computing, 2015, 31(C): 172-184.
  [2] TAN J F, ZHU Y, CHEN T X, et al. Imbalanced image classification approach based on convolutional neural network and cost-sensitivity[J]. Journal of Computer Applications, 2018, 38(7): 1862-1865, 1871. (in Chinese)
  [3] WANG S, YAO X. Using class imbalance learning for software defect prediction[J]. IEEE Transactions on Reliability, 2013, 62(2): 434-443.
  [4] OZCIFT A, GULTEN A. Classifier ensemble construction with rotation forest to improve medical diagnosis performance of machine learning algorithms[J]. Computer Methods and Programs in Biomedicine, 2011, 104(3): 443-451.
  [5] YU H, NI J, ZHAO J. ACOSampling: an ant colony optimization-based undersampling method for classifying imbalanced DNA microarray data[J]. Neurocomputing, 2013, 101: 309-318.
  [6] TOMEK I. Two modifications of CNN[J]. IEEE Transactions on Systems, Man and Cybernetics, 1976, SMC-6(11): 769-772.
  [7] KUBAT M, MATWIN S. Addressing the curse of imbalanced training sets: one-sided selection[C]// Proceedings of the 14th International Conference on Machine Learning. San Francisco: Morgan Kaufmann, 1997: 179-186.
  [8] LAURIKKALA J. Improving identification of difficult small classes by balancing class distribution[C]// Proceedings of the 8th Conference on Artificial Intelligence in Medicine in Europe. Berlin: Springer, 2001: 63-66.
  [9] CHAWLA N, BOWYER K, HALL L, et al. SMOTE: synthetic minority over-sampling technique[J]. Journal of Artificial Intelligence Research, 2002, 16: 321-357.
  [10] RIVERA W A. Noise reduction a priori synthetic over-sampling for class imbalanced data sets[J]. Information Sciences, 2017, 408(C): 146-161.
  [11] MA L, FAN S. CURE-SMOTE algorithm and hybrid algorithm for feature selection and parameter optimization based on random forests[J]. BMC Bioinformatics, 2017, 18(1): 169.
  [12] BOROWSKA K, STEPANIUK J. Imbalanced data classification: a novel resampling approach combining versatile improved SMOTE and rough sets[C]// CISIM 2016: IFIP International Conference on Computer Information Systems and Industrial Management. Berlin: Springer, 2016: 31-42.
  [13] BAIG M M, AWAIS M M, EL-ALFY E S M. AdaBoost-based artificial neural network learning[J]. Neurocomputing, 2017, 248(C): 120-126.
  [14] MINZ A, MAHOBIYA C. MR image classification using Adaboost for brain tumor type[C]// Proceedings of the 2017 IEEE 7th International Advance Computing Conference. Washington, DC: IEEE Computer Society, 2017:701-705.
  [15] WANG J, FEI K, CHENG Y. Prediction of rainfall based on improved Adaboost-BP model[J]. Journal of Computer Applications, 2017, 37(9): 2689-2693. (in Chinese)
  [16] SCHAPIRE R E, FREUND Y, BARTLETT P, et al. Boosting the margin: a new explanation for the effectiveness of voting methods[J]. Annals of Statistics, 1998, 26(5): 1651-1686.
  [17] GAO W, ZHOU Z H. On the doubt about margin explanation of boosting[J]. Artificial Intelligence, 2013,203:1-18.
  [18] BACHE K, LICHMAN M. UCI repository of machine learning databases[DB/OL].[2018-06-20].http://www.ics.uci.edu/~mlearn/MLRepository.html.
  [19] van HULSE J, KHOSHGOFTAAR T M, NAPOLITANO A. Experimental perspectives on learning from imbalanced data[C]// Proceedings of the 24th International Conference on Machine Learning. New York: ACM, 2007: 935-942.
  [20] LIU N, WEI L W, AUNG Z. Handling class imbalance in customer behavior prediction[C]// Proceedings of the 2014 International Conference on Collaboration Technologies and Systems. Piscataway, NJ: IEEE, 2014: 100-103.