QUICK REVIEW

[Paper Review] Evolutionary Dynamics of the World Wide Web

Bernardo A. Huberman, Lada A. Adamic | arXiv.org | Jan 8, 1999

Web visibility and informetrics | 1 reference | 47 citations

TL;DR

This paper proposes a stochastic evolutionary model of the World Wide Web that explains the power-law distribution of pages per site by accounting for variable growth rates and differing creation times of websites. Multiplicative growth makes each site's size log-normally distributed, and weighting those log-normals by site creation time yields a universal power law. Fits to large-scale crawls from Alexa and Infoseek give an exponent β between roughly 1.65 and 1.91, and the power law makes it possible to estimate the number of rare large sites without exhaustive crawling.

ABSTRACT

We present a theory for the growth dynamics of the World Wide Web that takes into account the wide range of stochastic growth rates in the number of pages per site, as well as the fact that new sites are created at different times. This leads to the prediction of a universal power law in the distribution of the number of pages per site which we confirm experimentally by analyzing data from large crawls made by the search engines Alexa and Infoseek. The existence of this power law not only implies the lack of any length scale for the Web, but also allows one to determine the expected number of sites of any given size without having to exhaustively crawl the Web.

Motivation & Objective

To develop a stochastic model explaining the observed distribution of pages per website on the World Wide Web.
To account for variable growth rates across sites and differing creation times in the Web's evolution.
To predict the existence of a universal power law in site size distribution, independent of scale.
To validate the theoretical model using empirical data from large-scale web crawls by Alexa and Infoseek.
To enable estimation of the number of very large sites without exhaustive crawling, leveraging the power law.

Proposed method

Model site growth as a stochastic process in which the number of pages grows in proportion to the number already present, with time-varying growth rate g(t) = g₀ + ξ(t), where ξ(t) is uncorrelated noise with zero mean and variance σ².
Derive the log-normal distribution of site size over time from the solution of the stochastic differential equation dn/dt = [g₀ + ξ(t)]n, namely n(t) = n(0)exp(g₀t + σwₜ), where wₜ is a standard Wiener process, so that ln n(t) is normally distributed with mean g₀t and variance σ²t (taking n(0) = 1).
Account for the time-dependent creation of new sites by integrating over an exponential distribution of creation times, resulting in a mixture of log-normal distributions.
Derive the asymptotic power law P(n) ∝ n⁻ᵝ by analytically solving the time-weighted integral of the mixture, yielding an exponent β dependent on g₀, σ², and the creation rate λ.
Account for heterogeneous growth rates across sites by summing over individual power laws P(n|gᵢ) ∝ n⁻ᵝ⁽ᵍⁱ⁾. Because the slowest-decaying term dominates at large n, the sum is again a power law whose exponent β is the smallest β in the mixture.
Validate the model by fitting the theoretical power law to empirical data from two large web crawls (Alexa and Infoseek), using linear regression on log-log plots of site frequency vs. size.
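
The steps above can be checked with a small simulation. The sketch below is illustrative and not the authors' code: it assumes every site starts at n(0) = 1, that site ages are exponentially distributed with rate λ, and that ln n(t) is normal with mean g₀t and variance σ²t. Under those assumptions the time-weighted integral has the closed form β = 1 + (√(g₀² + 2λσ²) − g₀)/σ², which is worked out here from the assumptions rather than quoted from the review. The parameter values are hypothetical, and the exponent is estimated with a Hill-type maximum-likelihood estimator instead of the log-log regression described above, because it is less sensitive to binning.

```python
import math
import random


def predicted_exponent(g0: float, sigma: float, lam: float) -> float:
    """Tail exponent beta of P(n) ~ n**(-beta) for the exponential-age mixture.

    Follows from integrating the log-normal density against lam * exp(-lam * t);
    treat it as a derived result under the stated assumptions.
    """
    s2 = sigma ** 2
    return 1.0 + (math.sqrt(g0 ** 2 + 2.0 * lam * s2) - g0) / s2


def simulate_log_sizes(g0, sigma, lam, n_sites, seed=0):
    """Draw ln n for n_sites sites, each assumed to start at n(0) = 1."""
    rng = random.Random(seed)
    log_sizes = []
    for _ in range(n_sites):
        age = rng.expovariate(lam)   # time since the site was created
        noise = rng.gauss(0.0, 1.0)  # standard normal draw
        # ln n(t) = g0 * t + sigma * w_t, with w_t ~ Normal(0, t)
        log_sizes.append(g0 * age + sigma * math.sqrt(age) * noise)
    return log_sizes


def estimate_exponent(log_sizes):
    """Hill-type maximum-likelihood estimate of beta from sites with n > 1."""
    tail = [x for x in log_sizes if x > 0.0]
    return 1.0 + len(tail) / sum(tail)


# Hypothetical parameter values, chosen only so that beta comes out near 1.8.
G0, SIGMA, LAM = 0.1, 0.5, 0.16

beta_theory = predicted_exponent(G0, SIGMA, LAM)
beta_hat = estimate_exponent(simulate_log_sizes(G0, SIGMA, LAM, n_sites=200_000))
print(f"predicted beta = {beta_theory:.3f}, estimated beta = {beta_hat:.3f}")
```

With these inputs the predicted exponent is 1.8, and the simulated estimate should land close to it. Changing λ or σ shifts the exponent, which is one way to see how creation rate and growth volatility jointly set the tail.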

Experimental results

Research questions

RQ1: Does the distribution of pages per website follow a power law, and if so, what mechanism explains its universality?
RQ2: How do variable growth rates and differing creation times of websites jointly shape the observed size distribution?
RQ3: Can a stochastic growth model based on proportional growth and uncorrelated fluctuations reproduce the empirical power law in site sizes?
RQ4: What is the functional form of the site size distribution when accounting for both time since creation and stochastic growth?
RQ5: Can the power law be used to reliably estimate the number of very large websites without full web crawls?

Key findings

The distribution of pages per site follows a universal power law P(n) ∝ n⁻ᵝ with exponent β in the range [1.647, 1.853] for the Alexa crawl and [1.775, 1.909] for the Infoseek crawl, confirming theoretical predictions.
The power law is robust across two independent large-scale web crawls, indicating a fundamental structural property of the Web’s growth dynamics.
The power law allows the expected number of sites of any given size to be extrapolated from a single measured point via P(n₂) = P(n₁)(n₂/n₁)⁻ᵝ, enabling estimation of the number of rare large sites.
The power law emerges from a mixture of log-normal distributions weighted by site creation times, with the exponent β determined by g₀, σ², and the creation rate λ.
The observed drop-off in site frequency at ~10⁵ pages is attributed to crawler limitations rather than to a true change in the distribution.
The existence of a power law implies no characteristic scale in the Web, supporting the idea of self-similar, scale-free growth dynamics.
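
Because P(n) ∝ n⁻ᵝ, the ratio of expected counts at two sizes depends only on the ratio of the sizes: P(n₂)/P(n₁) = (n₂/n₁)⁻ᵝ. The sketch below applies that relation at both ends of the Alexa exponent interval reported above. The function name and the input count of 1,000 sites at 100 pages are hypothetical, chosen only to make the arithmetic concrete.

```python
def extrapolate_site_count(count_n1: float, n1: float, n2: float, beta: float) -> float:
    """Scale a measured count at size n1 to size n2, assuming P(n) ~ n**(-beta)."""
    if n1 <= 0 or n2 <= 0:
        raise ValueError("site sizes must be positive")
    return count_n1 * (n2 / n1) ** (-beta)


# Hypothetical input: 1,000 sites observed with about 100 pages each.
observed_count, n1, n2 = 1_000.0, 100.0, 10_000.0

# A larger beta means a steeper tail, so it gives the lower estimate.
low = extrapolate_site_count(observed_count, n1, n2, beta=1.853)
high = extrapolate_site_count(observed_count, n1, n2, beta=1.647)
print(f"expected sites with ~{n2:.0f} pages: between {low:.2f} and {high:.2f}")
```

The spread between the two estimates shows how sensitive the extrapolation is to the fitted exponent: over two decades in size, a difference of about 0.2 in β changes the estimate by more than a factor of two.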

This review was created by AI and reviewed by human editors.