Journal Portal of the Academy of Mathematics and Systems Science, Chinese Academy of Sciences

Collections

Sampling Theory and Quality Management in the Context of Big Data
  • JIN Yongjin, LIU Xiaoyu
    Journal of Systems Science and Mathematical Sciences. 2022, 42(1): 2-16. https://doi.org/10.12341/jssms21449
    Big data is characterized by large volume, rich variety, and rapid growth, but it also suffers from low value density and poor representativeness, which brings both opportunities and challenges to sampling surveys. In the context of big data, how do sampling surveys adapt to these changes, and how are they being developed and applied? This paper discusses the question from three perspectives. First, new sampling methods suited to the data stream environment can obtain representative samples efficiently and accurately while taking storage space, processing time, and processing capacity into account. Second, non-probability sampling methods that require no sampling frame have been developed through internet surveys and social network data collection; they can obtain a large number of analysis samples quickly and at low cost. Third, the advantages of big data and sampling surveys can be combined by integrating online and offline survey data. For the case where the online sample is a non-probability sample and the offline sample is a probability sample, this article puts forward a basic idea for data integration: on the one hand, the probability sample is used to carry out a "probability test" on the non-probability sample; on the other hand, information is extracted from the probability sample and inferences are made based on a model or on pseudo-randomization.
  • YANG Haoyu, QIN Yichen, LI Yang
    Journal of Systems Science and Mathematical Sciences. 2022, 42(1): 17-34. https://doi.org/10.12341/jssms21515
    Sampling surveys remain an essential tool in the era of big data. However, traditional sampling surveys face the dual challenges of rising execution costs and declining data quality. Split questionnaire design has received increasing attention from researchers as an effective way to reduce cost and improve data quality. In this paper, we discuss the sub-questionnaire assignment process in split questionnaire design. Assuming that participants arrive according to a Poisson process, we design a sequential randomization method that accounts for covariate balance, with the goal of improving the similarity among the sub-samples and the population. Both theoretical and numerical results show that the proposed method outperforms existing methods in sub-sample balance and estimation accuracy.
  • MENG Jie, YANG Guijun, FENG Guolei, HUA Mengke
    Journal of Systems Science and Mathematical Sciences. 2022, 42(1): 35-49. https://doi.org/10.12341/jssms21520
    Owing to various factors, census results inevitably deviate from the true total population. How to construct an estimator of the total population that has good statistical properties and a wide scope of application, and how to track the trend of population change accurately, are important issues for government statistical work. In this paper, we examine the empirical method used by the UK Statistics Office for annual population estimation from the census, and propose a population estimation method that combines triple system estimation with ratio estimation. The simulation results show that the new method better overcomes the interaction bias caused by the independence assumption of the two-system approach, and that, with a reasonable stratification of the population, it improves the accuracy of the total population estimate. The paper also puts forward the idea of constructing a "third set of demographic data resources", which is not only the data foundation for building and applying triple system estimation, but also helps to further promote the modernization of statistics.
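For orientation, the classical dual-system (capture-recapture) estimator that triple system estimation extends can be sketched as follows. This is a minimal illustration and not the method proposed in the paper; the function name and the counts used below are hypothetical.

```python
def dual_system_estimate(n_census: int, n_survey: int, n_matched: int) -> float:
    """Classical dual-system (Lincoln-Petersen) estimate of a population total.

    Assumes the census and the second source capture people independently.
    When that assumption fails the estimate is biased, which is the usual
    motivation for bringing in a third data source.
    """
    if n_matched <= 0:
        raise ValueError("at least one matched record is required")
    if n_matched > min(n_census, n_survey):
        raise ValueError("matched count cannot exceed either source count")
    # N_hat = (count in source 1) * (count in source 2) / (count in both)
    return n_census * n_survey / n_matched
```

With hypothetical counts of 900 census records, 800 survey records, and 720 records found in both, `dual_system_estimate(900, 800, 720)` gives an estimated total of 1000.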
  • LI Lili, JIN Shilei, ZHOU Kaihe
    Journal of Systems Science and Mathematical Sciences. 2022, 42(1): 50-63. https://doi.org/10.12341/jssms21494
    With the advent of the big data era, and in order to improve computational efficiency, Wang et al. (2018) proposed an optimal subsampling algorithm for logistic regression that provides a better tradeoff between estimation efficiency and computational efficiency. To address multicollinearity among variables, this paper proposes an optimal subsampling algorithm for ridge regression and proves the consistency and asymptotic normality of the resulting estimator. Numerical experiments on both simulated and real data are carried out to evaluate the proposed method. The results show that the optimal subsampling algorithm produces results similar to those of the full data analysis while significantly reducing computational cost.
  • LIU Yawen, MA Wenbo, DU Zifang
    Journal of Systems Science and Mathematical Sciences. 2022, 42(1): 64-71. https://doi.org/10.12341/jssms21482
    Statistical inference usually measures estimation accuracy by two indicators, the confidence coefficient and the bias, but when both differ between estimators, comparing their accuracy becomes difficult. In this paper, a widely applicable dimensionless accuracy statistic is proposed. As a function of the confidence coefficient and the bias, it allows accuracy to be compared when the estimation bias and the degree of confidence differ at the same time. In addition, starting from the factors that affect accuracy and their mechanism of action, a logical consistency is found between the sample size determination formula and the Shannon theorem of information theory, which gives a new explanation of the physical meaning of the sample size determination formula.
  • NIU Xiaoyang, ZOU Jiahui
    Journal of Systems Science and Mathematical Sciences. 2022, 42(1): 72-84. https://doi.org/10.12341/jssms21475
    In this paper, we extend the subsampling method for the linear model to the nonparametric regression model and propose two subsampling methods for nonparametric local polynomial regression. First, we derive the rate at which the subsample-based weighted least squares estimator converges to the full-sample weighted least squares estimator, as well as the asymptotic normality of the subsample estimator. Then, using the criterion of minimizing the asymptotic variance, we propose two subsampling methods, OPT and PL, for the nonparametric local polynomial regression model. Finally, numerical simulations compare OPT subsampling, PL subsampling, uniform subsampling, and basic leveraging subsampling in terms of mean square error, fitting performance, and computational cost. The results show that the subsampling methods based on the OPT and PL criteria have clear advantages in improving estimation accuracy and reducing computational burden.
  • JIANG Yan, MENG Zhufeng, WANG Tianjia, LIU Xiaoyu
    Journal of Systems Science and Mathematical Sciences. 2022, 42(1): 85-99. https://doi.org/10.12341/jssms21627
    In the big data era, Respondent-Driven Sampling (RDS) is increasingly applied as a network sampling method for general populations. It offers a possible solution to problems of traditional sampling surveys, including the difficulty of obtaining usable sampling frames, respondents, or responses. Moreover, it allows network surveys to be probabilistic and to yield estimates of population parameters within a certain error range. However, homogeneity (when recruiting a peer for the study, a respondent is more likely to introduce someone with similar characteristics) biases RDS estimates. To offer a practical solution, this paper assumes that the population follows the Degree-Corrected Stochastic Block Model (DCSBM). We post-stratify the sample based on transition probabilities and propose an inverse probability weighted estimator, PS-IPW. In a simulation analysis, we compare the relative efficiency across network populations with varying homogeneity. In an empirical study, we select the stratification variables of our sample based on the spectral characteristics of the block matrix, which further verifies the usability of RDS and the efficiency of the PS-IPW estimator.
  • SHI Junyi
    Journal of Systems Science and Mathematical Sciences. 2022, 42(1): 100-108. https://doi.org/10.12341/jssms21538
    In the context of big data, the necessity and significance of sampling surveys remain a matter of some controversy. This article defines two types of big data scenarios: one in which a massive amount of data already exists, and one in which a massive sampling frame directory already exists. In the latter case, sampling surveys are both necessary and important. Based on the massive sampling frame directory of a platform company, this article uses catalog sampling to conduct a sample survey on the issues that the company is concerned about, and considers sample size allocation, target quantity estimation, and evaluation under sample rotation, in an effort to provide a useful reference for similar sampling survey applications in the future.
  • ZONG Xianpeng, WANG Tongtong
    Journal of Systems Science and Mathematical Sciences. 2022, 42(1): 109-132. https://doi.org/10.12341/jssms21524
    With the development of the information age, mining useful information from massive data quickly and effectively has become a new challenge. As an effective tool for large-scale data analysis, the sub-sampling method has attracted extensive attention from scholars at home and abroad. However, traditional sub-sampling methods usually do not take model uncertainty into account, and when the assumed model is incorrect, the conclusions may be wrong. To solve this problem, a sub-sampling model averaging estimator (SSMA estimator) is constructed from the sampled data. Theoretically, we prove that the SSMA estimator is an asymptotically unbiased and consistent estimator of the model averaging estimator based on the full data. In addition, we propose a weight choice criterion for the SSMA estimator, based on the Mallows criterion proposed by Hansen (2007), and derive the asymptotic optimality of the weight estimator. Notably, the proofs of these theoretical properties account for the double randomness introduced by the model and the sampling design. Finally, numerical analysis further demonstrates the effectiveness of the proposed method.
  • ZHANG Xuan, ZHAO Jing, DING Wenxing
    Journal of Systems Science and Mathematical Sciences. 2022, 42(1): 133-140. https://doi.org/10.12341/jssms21654
    Product quality sampling surveys are an important means for government quality supervision departments to understand the state of product quality. Over the years, these surveys have also accumulated a large amount of actual data. In this paper, the prior information provided by big data is combined with sample size design in sampling surveys. The valuable information provided by big data is used as auxiliary information, and the survey objects are stratified by a clustering method. According to the characteristics of each stratum, the relationship between the relative error limits of the strata is determined by a series of preferred numbers, and the sample size for stratified random sampling is then determined, so that the sample size determination method is both scientifically grounded and practical. At the same time, by selecting different parameter levels for different strata of supervised objects, the effectiveness of supervision is improved under a limited budget.
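As a rough sketch of how relative error limits linked by a preferred-number series could translate into stratum sample sizes, the following assumes the R5 series (1.00, 1.60, 2.50, 4.00, 6.30), simple random sampling within each stratum, and the textbook formula n0 = (z · CV / r)² with a finite population correction. The series choice, the function names, and the stratum sizes and coefficients of variation are all assumptions for illustration; the paper's actual procedure may differ.

```python
import math

# R5 preferred-number series; using it to link the relative error limits
# of successive strata is an assumption made for this sketch.
R5_SERIES = [1.00, 1.60, 2.50, 4.00, 6.30]


def stratum_sample_size(pop_size, cv, rel_error, z=1.96):
    """Sample size for estimating a stratum mean by simple random sampling.

    n0 = (z * cv / rel_error)^2, followed by a finite population correction.
    """
    n0 = (z * cv / rel_error) ** 2
    n = n0 / (1.0 + n0 / pop_size)
    return min(pop_size, math.ceil(n))


def allocate(strata, base_rel_error):
    """Return one sample size per stratum.

    strata: list of (pop_size, cv) pairs, ordered from the most to the least
    closely supervised stratum. Stratum k is given the relative error limit
    base_rel_error * R5_SERIES[k].
    """
    sizes = []
    for k, (pop_size, cv) in enumerate(strata):
        rel_error = base_rel_error * R5_SERIES[k]
        sizes.append(stratum_sample_size(pop_size, cv, rel_error))
    return sizes


# Hypothetical strata: (population size, coefficient of variation)
example_sizes = allocate([(1000, 0.5), (5000, 0.5), (20000, 0.5)], 0.05)
```

Loosening the error limit along the series concentrates the sample in the strata that need the tightest control, which is one way to raise supervision effectiveness under a fixed cost.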
  • CHEN Meng, CHEN Wangxue, DENG Cuihong, YANG Rui
    Journal of Systems Science and Mathematical Sciences. 2022, 42(1): 141-152. https://doi.org/10.12341/jssms21498
    In this article, we study the Fisher information about the scale parameter $\theta$ of the Inverse Rayleigh distribution contained in samples drawn by simple random sampling and by ranked set sampling. The numerical results show that a ranked set sample carries more information about $\theta$ than a simple random sample of the same size. We then use the simple random sample and the ranked set sample, respectively, to construct some optimal estimators of $\theta$, and compare the numerical results of these estimators.
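The two sampling schemes being compared can be sketched as follows. This is a minimal illustration under stated assumptions: the Inverse Rayleigh distribution is parametrized by the CDF $F(x;\theta)=\exp(-(\theta/x)^2)$ for $x>0$ (the paper may use a different parametrization), ranking is assumed to be perfect, and all function names are hypothetical.

```python
import math
import random


def inverse_rayleigh(theta, rng):
    """Draw one value by inverting the assumed CDF F(x) = exp(-(theta / x)**2)."""
    u = rng.random()
    while u == 0.0:  # rng.random() is in [0, 1); avoid log(0)
        u = rng.random()
    return theta / math.sqrt(-math.log(u))


def simple_random_sample(theta, n, rng):
    """n independent draws, all of which are measured."""
    return [inverse_rayleigh(theta, rng) for _ in range(n)]


def ranked_set_sample(theta, set_size, cycles, rng):
    """Balanced ranked set sample with perfect ranking.

    In each cycle, for every rank i, draw a fresh set of `set_size` units,
    rank them, and measure only the i-th smallest. The result contains
    set_size * cycles measured values.
    """
    sample = []
    for _ in range(cycles):
        for i in range(set_size):
            candidates = sorted(
                inverse_rayleigh(theta, rng) for _ in range(set_size)
            )
            sample.append(candidates[i])
    return sample


rng = random.Random(0)
srs = simple_random_sample(2.0, 12, rng)
rss = ranked_set_sample(2.0, set_size=3, cycles=4, rng=rng)
```

Both samples contain 12 measured values, but the ranked set sample inspects 36 units to obtain them. Here the ranking uses the simulated values themselves; in practice it would rely on judgment or an inexpensive auxiliary variable.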
  • LIU Shengnan, LIU Wenjun, ZOU Guohua
    Journal of Systems Science and Mathematical Sciences. 2022, 42(1): 153-174. https://doi.org/10.12341/jssms21660
    Data are very often mixed with interference data. For interference data that are easy to recognize, the literature offers many methods of analysis. However, if the interference data are difficult to identify, the traditional methods are no longer valid. Propensity scores can be used to characterize multidimensional data, and data with the same propensity score are similar and likely to come from the same population. Therefore, by applying propensity score matching, this paper proposes a new method for detecting interference data. Two algorithms are developed for this purpose. The first obtains the proportion of real data in the original data and estimates the population mean of the real data. The second extracts the pseudo-real data and conducts modeling analysis. An extensive simulation study shows good performance of the proposed algorithms.
  • LI Lei, MA Yulin, HU Gang, KONG Xuefeng, YANG Jun, XU Yanwei
    Journal of Systems Science and Mathematical Sciences. 2022, 42(1): 175-192. https://doi.org/10.12341/jssms21526
    To accurately identify head defects of the GH159 bolt after hot upsetting, this paper proposes a defect recognition method based on transfer learning, in which datasets collected under scenes with different brightness serve as the source domain and the target domain, respectively. First, considering that the conditional distribution within a domain has multiple clusters, the paper uses the K-means algorithm to cluster samples with the same defect and determine the cluster centers for that defect; a novel measure of distribution discrepancy is then constructed on the cluster centers. Second, based on the distances between cluster centers and the distances between each cluster center and the samples belonging to its cluster, a new intra-class discrepancy is established to improve the computational efficiency of transfer learning. Finally, the optimization objective of the proposed method minimizes the weighted sum of the constructed distribution discrepancy and the intra-class discrepancy, so as to identify defects effectively under scenes with different brightness. Because some parameters of the proposed method must be set in advance, a pseudo-accuracy is designed using a reverse verification strategy, and the parameters are set to the combination with the highest pseudo-accuracy. Using the collected dataset on head defects of the GH159 bolt after hot upsetting, the defect recognition is analyzed and applied to verify the effectiveness of the proposed method.