Analyzing the Dissemination of News by Model Averaging and Subsampling

ZOU Jiahui

Journal of Systems Science & Complexity ›› 2024, Vol. 37 ›› Issue (5) : 2104-2131. DOI: 10.1007/s11424-023-3176-7

Abstract

The dissemination of news is a vital topic in management science, social science and data science. With the development of technology, the sample sizes and dimensions of digital news data have increased remarkably. To alleviate the computational burden of big data, this paper proposes a method for handling massive, moderate-dimensional data in linear regression models by combining model averaging and subsampling methodologies. The author first draws a subsample from the full data according to special probabilities and splits the covariates into several groups to construct candidate models. Then, the author solves each candidate model and calculates model-averaging weights to combine these estimators based on the subsample. Additionally, the asymptotic optimality of the subsampling form is proved, and a way to calculate the optimal subsampling probabilities is provided. The author also illustrates the proposed method via simulations, which show that it takes less running time than using the full data and produces more accurate estimates than uniform subsampling. Finally, the author applies the proposed method to analyze and predict the number of shares of news articles, and finds that topic, vocabulary and dissemination time are the determinants.
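The pipeline described in the abstract — subsample by given probabilities, split covariates into groups to form candidate models, fit each, then combine with data-driven weights — can be sketched as follows. This is a minimal NumPy illustration, not the paper's actual algorithm: it uses uniform probabilities as a stand-in for the optimal subsampling probabilities derived in the paper, nested covariate groups as the candidate models, and a clipped-and-renormalized least-squares fit in place of the paper's weight-selection criterion; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, r = 100_000, 12, 2_000          # full-data size, covariates, subsample size

# Simulate full data from a linear model.
X = rng.normal(size=(n, p))
beta = np.linspace(1.0, 0.1, p)
y = X @ beta + rng.normal(size=n)

# Subsampling probabilities; uniform here as a stand-in for the
# optimal probabilities the paper derives.
probs = np.full(n, 1.0 / n)
idx = rng.choice(n, size=r, replace=True, p=probs)
Xs, ys = X[idx], y[idx]
w = 1.0 / (r * probs[idx])            # inverse-probability weights

# Split covariates into groups and build nested candidate models.
groups = np.array_split(np.arange(p), 3)
preds = []
for k in range(len(groups)):
    cols = np.concatenate(groups[: k + 1])          # model k uses groups 1..k
    sw = np.sqrt(w)                                 # weighted least squares
    bk, *_ = np.linalg.lstsq(sw[:, None] * Xs[:, cols], sw * ys, rcond=None)
    preds.append(Xs[:, cols] @ bk)

# Model-averaging weights: regress y on the stacked candidate
# predictions, then project onto the simplex by clipping and
# renormalizing (a simplification of the paper's criterion).
P = np.column_stack(preds)
omega, *_ = np.linalg.lstsq(P, ys, rcond=None)
omega = np.clip(omega, 0.0, None)
omega /= omega.sum()
yhat = P @ omega                      # model-averaged fitted values
```

With uniform probabilities the inverse-probability weights are constant, so the weighting is inert here; it becomes essential once nonuniform (optimal) probabilities are plugged in, which is the case the paper analyzes.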

Key words

Asymptotic optimality / dissemination of news / linear regression models / model averaging / optimal subsampling

Cite this article

Download Citations
ZOU Jiahui. Analyzing the Dissemination of News by Model Averaging and Subsampling. Journal of Systems Science & Complexity, 2024, 37(5): 2104-2131 https://doi.org/10.1007/s11424-023-3176-7


Funding

This research was supported by the National Natural Science Foundation of China under Grant No. 12201431, and the Young Teacher Foundation of Capital University of Economics and Business under Grant Nos. XRZ2022-070 and 00592254413070.