Complex Systems and Complexity Science, 2020, Vol. 17, Issue 3: 47-51. DOI: 10.13306/j.1672-3813.2020.03.004
Detecting Abnormal Water Consumption Pattern of Enterprise Based on Isolation Forest Sampling
LIN Qingxuan1, GUO Qiang1, DENG Chunyan1, WANG Yajing1, LIU Jianguo2
1. Research Center for Complex Systems Science, University of Shanghai for Science & Technology, Shanghai 200093, China;
2. Institute of Accounting and Finance, Shanghai University of Finance and Economics, Shanghai 200433, China
Abstract: To address the low-frequency, short time-series data and the class imbalance that arise when detecting abnormal water consumption patterns of enterprises, this paper proposes a binary classification method based on an Isolation Forest sampling strategy. First, volatility features and statistical features of water consumption are constructed. The Isolation Forest algorithm is then used to compute the degree of isolation of each sample in the majority class as a measure of its representativeness; the samples are ranked by representativeness, and the most representative ones are sampled first. Next, the sampled majority-class instances are merged with the minority class to form a more balanced training set. Finally, an XGBoost classifier is trained on this more balanced set and used for prediction. On a dataset covering 13 months of water consumption for 7,604 enterprises in one city, the AUC and recall of the proposed method can reach 0.927 and 0.891, compared with 0.885 and 0.733 for XGBoost with random undersampling, which are relative improvements of 4.7% and 21.6%, respectively.
Key words: abnormal water consumption pattern detection; unbalanced classification; Isolation Forest; XGBoost
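The sampling step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it uses scikit-learn's `IsolationForest` and NumPy, and the function name `isolation_forest_undersample`, the `ratio` and `prefer_typical` parameters, the label convention (0 = majority, 1 = minority), and the synthetic data are all assumptions made for the example. In particular, the abstract does not say whether high representativeness corresponds to low or high isolation; the sketch assumes the least isolated majority samples are the most representative and exposes the choice as a flag.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def isolation_forest_undersample(X, y, ratio=1.0, prefer_typical=True, seed=0):
    """Undersample the majority class (label 0) using Isolation Forest ranking.

    ratio: majority samples kept per minority sample (assumed parameter).
    prefer_typical: if True, keep the least isolated majority samples first
        (an assumption; the abstract does not state the direction).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    maj_idx = np.flatnonzero(y == 0)
    min_idx = np.flatnonzero(y == 1)

    # Fit the forest on the majority class only.
    forest = IsolationForest(n_estimators=100, random_state=seed)
    forest.fit(X[maj_idx])
    # score_samples: lower values mean a sample is more isolated.
    scores = forest.score_samples(X[maj_idx])

    order = np.argsort(scores)      # most isolated first
    if prefer_typical:
        order = order[::-1]         # least isolated first

    # Keep the top-ranked majority samples and all minority samples.
    n_keep = min(len(maj_idx), int(round(ratio * len(min_idx))))
    keep = np.concatenate([maj_idx[order[:n_keep]], min_idx])
    return X[keep], y[keep]


# Synthetic stand-in for the feature matrix (not the paper's data).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = np.zeros(500, dtype=int)
y[:25] = 1                          # 5% minority class
X_bal, y_bal = isolation_forest_undersample(X, y)
```

The balanced arrays `X_bal` and `y_bal` would then be passed to a classifier; the paper uses XGBoost for that step, which is omitted here to keep the sketch focused on the sampling.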
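Received: 2020-01-20      Published: 2020-09-23
CLC number: N94; TP391
Funding: National Natural Science Foundation of China (61773248, 71771152); Major Program of the National Social Science Fund of China (18ZDA088, 20ZDA060)
About the author: LIN Qingxuan (born 1992), male, from Wenzhou, Zhejiang; master's student; main research interests: complex networks and data mining.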
Cite this article:
LIN Qingxuan, GUO Qiang, DENG Chunyan, WANG Yajing, LIU Jianguo. Detecting Abnormal Water Consumption Pattern of Enterprise Based on Isolation Forest Sampling. Complex Systems and Complexity Science, 2020, 17(3): 47-51.
Link to this article:
http://fzkx.qdu.edu.cn/CN/10.13306/j.1672-3813.2020.03.004      or      http://fzkx.qdu.edu.cn/CN/Y2020/V17/I3/47