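Research on Distributed Support Vector Regression Model Under the Background of Big Data

CAI Chao, RAN Xiaoting, XUE Wei, TIAN Yuxin

  1. School of Statistics, Shandong Technology and Business University, Yantai 264005
  • Received: 2022-05-09 Revised: 2022-10-27 Published: 2023-05-18
  • Corresponding author: XUE Wei, Email: bestxuewei@163.com
  • Funding:
    Supported by the Shandong Provincial Social Science Planning Project (19BYSJ40).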

CAI Chao, RAN Xiaoting, XUE Wei, TIAN Yuxin. Research on Distributed Support Vector Regression Model Under the Background of Big Data[J]. Journal of Systems Science and Mathematical Sciences, 2023, 43(4): 1081-1092.

To address the low computational efficiency of the classical support vector regression (SVR) model on large-scale data, this paper combines the communication-efficient method with the SVR model and proposes a distributed support vector regression model based on the communication-efficient method (CE-SVR). The model first uses distributed storage to assign the large-scale data randomly to multiple machines. It then uses the communication-efficient method to construct an approximate loss function for SVR, which replaces the global loss function and yields approximate predictions, so that large-scale data can be analyzed effectively. Numerical simulations and an application study show that, in the linear model, the prediction performance of CE-SVR is essentially the same as that of the global SVR model and significantly better than that of the distributed SVR model based on the one-shot method (OS-SVR); in the nonlinear model, the prediction performance of CE-SVR decreases as the number of machines increases, but it remains significantly better than that of OS-SVR.
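The two steps described in the abstract (random assignment of the data to machines, then an approximate loss that stands in for the global loss) can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not confirm: a linear model, a squared ε-insensitive loss with a small ridge penalty (chosen here only so that plain gradient descent applies), and an approximate loss formed as the first machine's local loss plus a linear gradient correction, which is one common form of the communication-efficient construction. The names `svr_grad`, `fit`, `ce_svr`, and `os_svr` are hypothetical and are not taken from the paper.

```python
import numpy as np

def svr_grad(theta, X, y, eps=0.1, lam=1e-3):
    """Gradient of 0.5*mean(max(|X@theta - y| - eps, 0)^2) + 0.5*lam*||theta||^2."""
    r = X @ theta - y
    excess = np.sign(r) * np.maximum(np.abs(r) - eps, 0.0)  # zero inside the eps-tube
    return X.T @ excess / len(y) + lam * theta

def fit(X, y, shift=None, eps=0.1, lam=1e-3, steps=500):
    """Gradient descent on the local loss minus the linear term <shift, theta>."""
    n, d = X.shape
    shift = np.zeros(d) if shift is None else shift
    # Step size 1/L, where L bounds the curvature of the smooth loss.
    step = 1.0 / (np.linalg.eigvalsh(X.T @ X / n).max() + lam)
    theta = np.zeros(d)
    for _ in range(steps):
        theta -= step * (svr_grad(theta, X, y, eps, lam) - shift)
    return theta

def ce_svr(shards, eps=0.1, lam=1e-3):
    """One round of the approximate-loss update, solved on the first machine."""
    X1, y1 = shards[0]
    theta_bar = fit(X1, y1, eps=eps, lam=lam)        # initial estimate from machine 1
    sizes = np.array([len(y) for _, y in shards])
    local_grads = np.array([svr_grad(theta_bar, X, y, eps, lam) for X, y in shards])
    global_grad = sizes @ local_grads / sizes.sum()  # each machine sends only d numbers
    shift = local_grads[0] - global_grad             # gradient correction term
    return fit(X1, y1, shift=shift, eps=eps, lam=lam)

def os_svr(shards, eps=0.1, lam=1e-3):
    """One-shot baseline (assumed form): average the local estimates."""
    return np.mean([fit(X, y, eps=eps, lam=lam) for X, y in shards], axis=0)

# Synthetic linear data; all settings below are illustrative assumptions.
rng = np.random.default_rng(0)
n, d, m = 4000, 5, 8
beta = np.arange(1.0, d + 1)
X = rng.normal(size=(n, d))
y = X @ beta + 0.5 * rng.normal(size=n)
parts = np.array_split(rng.permutation(n), m)        # random assignment to m machines
shards = [(X[i], y[i]) for i in parts]

theta_global = fit(X, y)                             # fit on the pooled data
theta_ce = ce_svr(shards)
theta_os = os_svr(shards)
for name, est in [("CE", theta_ce), ("OS", theta_os)]:
    print(name, "distance to global fit:", np.linalg.norm(est - theta_global))
```

In this sketch the only quantities exchanged between machines are the initial estimate and one gradient vector per machine, which is why the approach is described as communication-efficient. The correction term makes the gradient of the approximate loss equal to the global gradient at the initial estimate, so the first machine can refine its estimate without access to the other machines' raw data.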
