Complex Systems and Complexity Science, 2020, Vol. 17, Issue 3: 47-51    DOI: 10.13306/j.1672-3813.2020.03.004
Detecting Abnormal Water Consumption Pattern of Enterprise Based on Isolation Forest Sampling
LIN Qingxuan1, GUO Qiang1, DENG Chunyan1, WANG Yajing1, LIU Jianguo2
1. Research Center for Complex Systems Science, University of Shanghai for Science & Technology, Shanghai 200093, China;
2. Institute of Accounting and Finance, Shanghai University of Finance and Economics, Shanghai 200433, China
Abstract: To address the low-frequency, short time-series data and the class imbalance encountered when detecting abnormal water consumption patterns of enterprises, this paper proposes a binary classification method based on an Isolation Forest sampling strategy. First, volatility features and statistical features of water consumption are constructed. The Isolation Forest algorithm is then used to compute the degree of isolation of each sample in the majority class as a measure of its representativeness; the samples are ranked by representativeness, and the most representative ones are sampled first. Next, the selected samples are merged with the minority class to form a relatively balanced training set. Finally, an XGBoost classifier is trained on this relatively balanced set and used for prediction. On a dataset covering 13 months of water consumption for 7,604 enterprises in one city, the proposed method reaches an AUC of 0.927 and a recall of 0.891, compared with 0.885 and 0.733 for XGBoost with random under-sampling, which are relative improvements of 4.7% and 21.6% respectively.
Key words: abnormal water consumption pattern detection; unbalanced classification; isolation forest; XGBoost
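The sampling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the synthetic data, the 0/1 label convention (1 = abnormal, the minority class), and the 1:1 sampling ratio are all assumptions made for illustration. In particular, the abstract does not say which direction of "isolation" counts as representative; the sketch assumes that samples which are harder to isolate (more typical) are the more representative ones. It uses scikit-learn's `IsolationForest`, and the resulting set would then be passed to an XGBoost classifier, a step omitted here.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def isolation_forest_undersample(X_major, n_keep, random_state=0):
    """Return indices of the n_keep majority samples ranked most representative.

    Assumption (not stated in the abstract): a sample that is harder to
    isolate, i.e. has a higher score_samples value, is more representative.
    """
    forest = IsolationForest(n_estimators=100, random_state=random_state)
    forest.fit(X_major)
    # score_samples: higher means more normal, lower means more anomalous.
    scores = forest.score_samples(X_major)
    order = np.argsort(-scores)  # most typical samples first
    return order[:n_keep]


def build_balanced_set(X, y, ratio=1.0, random_state=0):
    """Under-sample the majority class (label 0) and merge it with the minority (label 1)."""
    X = np.asarray(X)
    y = np.asarray(y)
    major_idx = np.flatnonzero(y == 0)
    minor_idx = np.flatnonzero(y == 1)
    # Hypothetical choice: keep ratio * (minority size) majority samples.
    n_keep = min(len(major_idx), int(round(ratio * len(minor_idx))))
    kept_local = isolation_forest_undersample(X[major_idx], n_keep, random_state)
    idx = np.concatenate([major_idx[kept_local], minor_idx])
    return X[idx], y[idx]


# Synthetic stand-in for the volatility and statistical features (not the paper's data).
rng = np.random.default_rng(0)
X_normal = rng.normal(0.0, 1.0, size=(500, 4))
X_abnormal = rng.normal(3.0, 1.0, size=(40, 4))
X = np.vstack([X_normal, X_abnormal])
y = np.array([0] * 500 + [1] * 40)

X_bal, y_bal = build_balanced_set(X, y)
print(X_bal.shape, int(y_bal.sum()))
```

Ranking by `score_samples` rather than calling `predict` avoids depending on the `contamination` threshold, since only the ordering of majority samples matters for this kind of under-sampling.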
Received: 2020-01-20      Published: 2020-09-23
CLC number: N94; TP391
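Funding: National Natural Science Foundation of China (61773248, 71771152); Major Program of the National Social Science Fund of China (18ZDA088, 20ZDA060)
About the author: LIN Qingxuan (b. 1992), male, from Wenzhou, Zhejiang; master's student whose main research interests are complex networks and data mining.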
Cite this article:
LIN Qingxuan, GUO Qiang, DENG Chunyan, WANG Yajing, LIU Jianguo. Detecting Abnormal Water Consumption Pattern of Enterprise Based on Isolation Forest Sampling[J]. Complex Systems and Complexity Science, 2020, 17(3): 47-51.
Link to this article:
https://fzkx.qdu.edu.cn/CN/10.13306/j.1672-3813.2020.03.004      or      https://fzkx.qdu.edu.cn/CN/Y2020/V17/I3/47