QUICK REVIEW

[Paper Review] The Tower of Babel Meets Web 2.0: User-Generated Content and its Applications in a Multilingual Context

Brent Hecht, Darren Gergle | arXiv (Cornell University) | Apr 2, 2019
Wikis in Education and Collaboration | 26 references | 25 citations
TL;DR

This paper investigates linguistic and cultural diversity in user-generated content by analyzing 25 Wikipedia language editions, revealing significant variation in knowledge representation across languages. It demonstrates that this diversity—beyond mere translation differences—significantly impacts multilingual applications and proposes leveraging it to build culturally-aware and hyperlingual systems.

ABSTRACT

This study explores language's fragmenting effect on user-generated content by examining the diversity of knowledge representations across 25 different Wikipedia language editions. This diversity is measured at two levels: the concepts that are included in each edition and the ways in which these concepts are described. We demonstrate that the diversity present is greater than has been presumed in the literature and has a significant influence on applications that use Wikipedia as a source of world knowledge. We close by explicating how knowledge diversity can be beneficially leveraged to create "culturally-aware applications" and "hyperlingual applications".

Motivation & Objective

  • To examine how language-specific cultural and linguistic perspectives shape knowledge representation in user-generated content.
  • To quantify the extent of diversity in concepts and their descriptions across multilingual Wikipedia editions.
  • To assess the implications of this diversity for applications relying on Wikipedia as a source of world knowledge.
  • To explore opportunities for designing applications that leverage linguistic and cultural diversity rather than treating it as noise.
  • To propose new application paradigms—'culturally-aware' and 'hyperlingual'—that benefit from multilingual knowledge variation.

Proposed method

  • Systematic comparison of 25 Wikipedia language editions across multiple language groups.
  • Identification and analysis of unique concepts present in one edition but absent in others.
  • Examination of differences in descriptive approaches (e.g., structure, depth, focus) for shared concepts.
  • Use of linguistic and cultural metadata to correlate content variation with sociolinguistic factors.
  • Application of natural language processing techniques to detect and categorize representational differences.
  • Development of a framework for identifying and utilizing multilingual knowledge diversity in system design.
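
The concept-level comparison in the list above can be pictured as simple set arithmetic. The following is a minimal sketch, not the authors' pipeline: it assumes articles have already been aligned to language-independent concept IDs, and the data, the `coverage_histogram` helper and the `unique_concepts` helper are all hypothetical names introduced for illustration.

```python
from collections import Counter

# Hypothetical toy data: language edition code -> set of concept IDs it covers.
# Real concept alignment across editions is assumed to have been done already.
editions = {
    "en": {"c1", "c2", "c3", "c4"},
    "de": {"c1", "c2", "c5"},
    "ja": {"c1", "c6"},
}


def coverage_histogram(editions):
    """Map k -> number of concepts that appear in exactly k editions."""
    per_concept = Counter()
    for concepts in editions.values():
        per_concept.update(concepts)  # each concept counted once per edition
    return Counter(per_concept.values())


def unique_concepts(editions, lang):
    """Concepts present in `lang` and absent from every other edition."""
    others = set().union(*(c for code, c in editions.items() if code != lang))
    return editions[lang] - others


if __name__ == "__main__":
    print(dict(coverage_histogram(editions)))
    print(sorted(unique_concepts(editions, "en")))
```

A histogram weighted toward low values of k would indicate that many concepts live in only one or a few editions, which is the pattern the review's key findings describe.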

Experimental results

Research questions

  • RQ1: To what extent do different Wikipedia language editions represent distinct sets of concepts?
  • RQ2: How do descriptions of shared concepts vary across language editions in terms of content, structure, and focus?
  • RQ3: What are the implications of this knowledge diversity for multilingual applications relying on Wikipedia as a knowledge source?
  • RQ4: How can applications be designed to benefit from, rather than be hindered by, linguistic and cultural variation in user-generated content?
  • RQ5: What design principles enable the creation of 'hyperlingual' and 'culturally-aware' applications using multilingual knowledge?

Key findings

  • Significant divergence exists in the set of concepts covered across Wikipedia language editions, with many concepts appearing in only one or a few languages.
  • Even for shared concepts, descriptive approaches vary widely in depth, structure, and cultural framing across languages.
  • The diversity in knowledge representation exceeds what is typically assumed in multilingual NLP applications.
  • This diversity poses challenges for applications using Wikipedia as a universal knowledge source, particularly in cross-lingual tasks.
  • The variation can be systematically leveraged to build hyperlingual systems that integrate multiple linguistic perspectives.
  • Culturally-aware applications can be developed by embedding language-specific knowledge representations to improve relevance and inclusivity.
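
One way to make the description-level finding concrete is to score how much two editions' treatments of the same concept overlap. The sketch below is an illustration under stated assumptions rather than the measure used in the paper: each edition's description is reduced to a set of features (for example, the other concepts it mentions), and Jaccard similarity is a stand-in metric chosen here for simplicity. The data and the `pairwise_overlap` helper are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical toy data for ONE shared concept:
# edition code -> set of features found in that edition's description.
descriptions = {
    "en": {"river", "bridge", "trade"},
    "de": {"river", "bridge", "castle"},
    "ja": {"river", "festival"},
}


def pairwise_overlap(descriptions):
    """Return {(lang_a, lang_b): similarity} for every unordered edition pair."""
    return {
        (a, b): jaccard(descriptions[a], descriptions[b])
        for a, b in combinations(sorted(descriptions), 2)
    }


if __name__ == "__main__":
    for pair, score in pairwise_overlap(descriptions).items():
        print(pair, round(score, 2))
```

Low pairwise scores for a concept that every edition covers would suggest the editions describe it differently, which is the kind of signal a hyperlingual application could surface instead of discarding.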


This review was created by AI and reviewed by human editors.