QUICK REVIEW

[Paper Review] On Spatial Lag Models estimated using crowdsourcing, web-scraping or other unconventionally collected data

Giuseppe Arbia, Vincenzo Nardelli|arXiv (Cornell University)|Oct 11, 2020
Spatial and Panel Data Analysis | References: 17 | Cited by: 1
One-Sentence Summary

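This paper proposes a post-sampling correction that reduces bias in Spatial Lag Model (SLM) estimates obtained from non-probabilistic convenience samples, such as crowdsourced or web-scraped spatial data, while acknowledging that the correction increases estimator variance. It derives a mean squared error (MSE) minimization strategy for choosing the optimal post-sampling parameter, validates it through Monte Carlo simulation, and applies it to a hedonic price model of real estate in Milan.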

ABSTRACT

The Big Data revolution is challenging the state-of-the-art statistical and econometric techniques not only for the computational burden connected with the high volume and speed at which data are generated, but even more for the variety of sources through which data are collected (Arbia, 2021). This paper concentrates specifically on this last aspect. Common examples of non-traditional Big Data sources are represented by crowdsourcing (data voluntarily collected by individuals) and web scraping (data extracted from websites and reshaped in a structured dataset). A characteristic common to these unconventional data collections is the lack of any precise statistical sample design, a situation described in statistics as 'convenience sampling'. As is well known, in these conditions no probabilistic inference is possible. To overcome this problem, Arbia et al. (2018) proposed the use of a special form of post-stratification (termed 'post-sampling'), with which data are manipulated prior to their use in an inferential context. In this paper we generalize this approach using the same idea to estimate a Spatial Lag Model (SLM). We start by showing through a Monte Carlo study that using data collected without a proper design, parameters' estimates can be biased. Secondly, we propose a post-sampling strategy to tackle this problem. We show that the proposed strategy indeed achieves a bias reduction, but at the price of a concomitant increase in the variance of the estimators. We thus suggest an MSE-correction operational strategy. The paper also contains a formal derivation of the increase in variance implied by the post-sampling procedure and concludes with an empirical application of the method in the estimation of a hedonic price model in the city of Milan using web-scraped data.

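Research Motivation and Objectives

  • Address the bias that arises in spatial econometric parameter estimates when the data come from non-probabilistic convenience samples, such as crowdsourced or web-scraped datasets.
  • Extend the post-sampling technique, originally developed for estimating population means, to the estimation of the Spatial Lag Model (SLM).
  • Quantify the trade-off between bias reduction and variance increase when post-sampling is applied to SLM estimation.
  • Propose and implement an MSE-based correction strategy for choosing the optimal post-sampling parameter in empirical applications.
  • Demonstrate the feasibility of the method by estimating a hedonic price model for Milan from web-scraped real estate data.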

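Proposed Method

  • Apply a post-sampling strategy that reweights the data using auxiliary population information to correct the selection bias of convenience samples.
  • Use a modified likelihood function that incorporates sampling weights into SLM parameter estimation, adjusting for unequal inclusion probabilities.
  • Derive the Hessian matrix of the SLM log-likelihood to obtain the asymptotic variance-covariance matrix of the estimators under post-sampling.
  • Use the estimated Fisher information matrix to compute the asymptotic variance of the coefficient estimator β̂ for different post-sampling weights ζ.
  • Propose an MSE-minimization procedure that selects the optimal post-sampling parameter ζ by balancing bias against variance.
  • Validate the method with a Monte Carlo simulation study that compares bias and MSE across sampling conditions and post-sampling levels.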
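
For reference, the quantities named in the list above can be written in their standard textbook form. This is a sketch under stated assumptions: the notation (W as a row-standardized spatial weight matrix, ρ as the spatial autoregressive parameter, θ as the stacked parameter vector) is not taken from the paper, and the unweighted Gaussian log-likelihood shown here may differ from the modified, weight-adjusted likelihood the authors actually use.

```latex
% Spatial Lag Model (standard form; notation assumed, not from the paper)
y = \rho W y + X\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I_n)

% Gaussian log-likelihood, with A = I_n - \rho W
\ell(\rho, \beta, \sigma^2) = -\frac{n}{2}\ln(2\pi\sigma^2) + \ln\lvert A \rvert
  - \frac{1}{2\sigma^2}(Ay - X\beta)^\top (Ay - X\beta)

% Asymptotic covariance from the Hessian, with \theta = (\rho, \beta, \sigma^2)
\widehat{\mathrm{Var}}(\hat\theta) \approx
  \left[-\frac{\partial^2 \ell}{\partial\theta\,\partial\theta^\top}\Big|_{\hat\theta}\right]^{-1}

% MSE trade-off used to choose the post-sampling parameter
\mathrm{MSE}(\hat\beta_\zeta) = \mathrm{Bias}(\hat\beta_\zeta)^2 + \mathrm{Var}(\hat\beta_\zeta),
\qquad \zeta^\ast = \arg\min_\zeta \mathrm{MSE}(\hat\beta_\zeta)
```

The last line is the generic bias-variance decomposition; how the paper estimates the bias term in empirical settings is not described in this summary.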

Experimental Results

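Research Questions

  • RQ1: How does convenience sampling affect the bias and variance of parameter estimates in the Spatial Lag Model?
  • RQ2: When data are collected without a formal sampling design, can post-sampling reweighting reduce the bias of SLM estimates?
  • RQ3: What is the trade-off between bias reduction and variance increase when post-sampling is applied to the SLM?
  • RQ4: How should the optimal post-sampling parameter ζ be chosen to minimize the mean squared error (MSE) of the coefficient estimator?
  • RQ5: To what extent does the method improve estimation accuracy in real-world applications that use non-probabilistic spatial data?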
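
Questions such as RQ1 are usually explored by simulation, and the first step is generating data from a known SLM. The following is a minimal sketch of that data-generating step only, not the paper's Monte Carlo design: the regular lattice, the rook-contiguity weights, the single predictor, and every parameter value are assumptions chosen for illustration, and the helper names `rook_weights` and `simulate_slm` are hypothetical.

```python
import numpy as np

def rook_weights(side):
    """Row-standardized rook-contiguity matrix for a side x side lattice (side >= 2)."""
    n = side * side
    W = np.zeros((n, n))
    for r in range(side):
        for c in range(side):
            i = r * side + c
            # Neighbours share an edge: up, down, left, right.
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < side and 0 <= cc < side:
                    W[i, rr * side + cc] = 1.0
    # Divide each row by its sum so that every row adds up to 1.
    return W / W.sum(axis=1, keepdims=True)

def simulate_slm(W, x, beta, rho, sigma, rng):
    """Draw y from y = rho*W*y + beta*x + eps by solving (I - rho*W) y = beta*x + eps."""
    n = W.shape[0]
    eps = rng.normal(0.0, sigma, size=n)
    y = np.linalg.solve(np.eye(n) - rho * W, beta * x + eps)
    return y, eps

# Illustrative values only (not taken from the paper).
rng = np.random.default_rng(0)
W = rook_weights(10)
x = rng.normal(size=W.shape[0])
y, eps = simulate_slm(W, x, beta=1.0, rho=0.5, sigma=1.0, rng=rng)
```

A convenience sample could then be imitated by drawing locations with unequal probabilities from this simulated population and re-estimating the model on the subset; that step is left out here because the summary does not specify how the paper does it.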

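Key Findings

  • The Monte Carlo study shows that post-sampling substantially reduces the bias of SLM parameter estimates obtained from convenience-sampled data.
  • The post-sampling procedure increases the variance of the estimators, confirming a fundamental trade-off between bias and precision.
  • The proposed MSE-correction strategy successfully identifies the post-sampling parameter ζ that minimizes the MSE of the coefficient estimator.
  • The empirical application to the Milan real estate market indicates that post-sampling improves the reliability of hedonic price model estimates based on web-scraped data.
  • The asymptotic variance of the coefficient estimator β̂ is derived explicitly from the Hessian of the likelihood function, which supports the MSE-based optimization of ζ.
  • The method is effective for an SLM with a single predictor; extending it to models with multiple predictors and to the estimation of the spatial correlation parameter remains an open research question.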
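
The MSE-based choice of ζ described in the findings above can be illustrated with a small grid search over Monte Carlo output. This is a sketch under stated assumptions rather than the authors' procedure: the function name `select_zeta`, the grid of ζ values, and the toy estimates are all hypothetical, and the true coefficient is treated as known, which holds only in a simulation setting.

```python
import numpy as np

def select_zeta(zetas, estimates, beta_true):
    """Pick the zeta whose estimates minimize bias^2 + variance.

    estimates has shape (len(zetas), n_replications): one row of
    replicated coefficient estimates per candidate zeta.
    """
    estimates = np.asarray(estimates, dtype=float)
    bias = estimates.mean(axis=1) - beta_true      # mean error per zeta
    variance = estimates.var(axis=1, ddof=1)       # sample variance per zeta
    mse = bias**2 + variance                       # bias-variance decomposition
    best = int(np.argmin(mse))
    return zetas[best], bias, variance, mse

# Toy numbers (hypothetical): small zeta is biased but tight,
# large zeta is unbiased but noisy, the middle value balances both.
zetas = [0.0, 0.5, 1.0]
estimates = [
    [1.30, 1.32, 1.28, 1.30],
    [1.10, 1.05, 1.15, 1.10],
    [1.00, 0.70, 1.30, 1.00],
]
best_zeta, bias, variance, mse = select_zeta(zetas, estimates, beta_true=1.0)
print(best_zeta)
```

In empirical work the true coefficient is unknown, so the bias term cannot be computed this way; according to the summary, the paper derives the variance term from the Hessian of the likelihood, and the details of its bias estimate are not given here.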


This review was generated by AI and reviewed by a human editor.