Detecting Abnormal Water Consumption Pattern of Enterprise Based on Isolation Forest Sampling
LIN Qingxuan1, GUO Qiang1, DENG Chunyan1, WANG Yajing1, LIU Jianguo2
1. Research Center for Complex Systems Science, University of Shanghai for Science & Technology, Shanghai 200093, China; 2. Institute of Accounting and Finance, Shanghai University of Finance and Economics, Shanghai 200433, China
Abstract: To address the problems of low-frequency, short-sequence data and class imbalance in detecting abnormal water consumption patterns of enterprises, this paper proposes a binary classification method based on Isolation Forest sampling. First, volatility and statistical features of water consumption are constructed. The Isolation Forest algorithm is then used to compute the degree of isolation of each sample in the majority class as a measure of how representative that sample is, and samples are drawn according to their representativeness. The drawn samples are merged with the minority class to form a balanced training dataset. Finally, an XGBoost classifier is trained on the balanced dataset and used to predict abnormal patterns. On a dataset of 13 months of water consumption for 7,604 enterprises in one city, the proposed method reaches an AUC of 0.927 and a recall of 0.891, compared with 0.855 and 0.733 for the XGBoost method based on random under-sampling, improvements of 4.7% and 21.6%, respectively.
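The sampling step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes scikit-learn's IsolationForest as the isolation model, assumes that less isolated samples count as more representative, and assumes that majority-class samples are drawn without replacement with probability proportional to that weight until the two classes are the same size. The function name `isolation_forest_undersample` and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def isolation_forest_undersample(X_major, X_minor, seed=0):
    """Return a balanced (X, y) by weighted under-sampling of the large class."""
    rng = np.random.default_rng(seed)
    forest = IsolationForest(n_estimators=100, random_state=seed).fit(X_major)

    # score_samples returns the negated anomaly score: values near 0 mean
    # "hard to isolate" (typical), values near -1 mean "easily isolated".
    anomaly = -forest.score_samples(X_major)

    # Assumption: the less isolated a sample is, the more representative it is.
    weights = np.clip(1.0 - anomaly, 1e-6, None)

    # Draw as many majority samples as there are minority samples.
    n_keep = min(len(X_minor), len(X_major))
    keep = rng.choice(len(X_major), size=n_keep, replace=False,
                      p=weights / weights.sum())

    # Label 0 = sampled large class, label 1 = small (abnormal) class.
    X = np.vstack([X_major[keep], X_minor])
    y = np.concatenate([np.zeros(n_keep, dtype=int),
                        np.ones(len(X_minor), dtype=int)])
    return X, y


# Synthetic stand-in for the feature matrix (not the paper's data).
data_rng = np.random.default_rng(42)
X_normal = data_rng.normal(0.0, 1.0, size=(500, 4))
X_abnormal = data_rng.normal(2.0, 1.0, size=(40, 4))
X_bal, y_bal = isolation_forest_undersample(X_normal, X_abnormal)
```

The balanced pair `(X_bal, y_bal)` would then be passed to the classifier, for example an XGBoost model, in place of the original imbalanced training set. The weighting rule is the main design choice here; reversing it would favor boundary samples instead of typical ones.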
林青轩, 郭强, 邓春燕, 王雅静, 刘建国. 基于孤立森林采样策略的企业异常用水模式检测[J]. 复杂系统与复杂性科学, 2020, 17(3): 47-51.
LIN Qingxuan, GUO Qiang, DENG Chunyan, WANG Yajing, LIU Jianguo. Detecting Abnormal Water Consumption Pattern of Enterprise Based on Isolation Forest Sampling. Complex Systems and Complexity Science, 2020, 17(3): 47-51.