QUICK REVIEW

[Paper Review] Symphonym: Universal Phonetic Embeddings for Cross-Script Name Matching

Stephen Gadd|arXiv (Cornell University)|Jan 11, 2026
Geographic Information Systems Studies | Cited by 0
One-Sentence Summary

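Symphonym is a teacher-student neural embedding system that maps place names from any script into a unified 128-dimensional phonetic space, enabling cross-script name matching without runtime phonetic resources; it achieves state-of-the-art Recall@1 on Hebrew-Arabic toponyms and will power phonetic search in the World Historical Gazetteer (WHG).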

ABSTRACT

Linking names across historical sources, languages, and writing systems remains a fundamental challenge in digital humanities and geographic information retrieval. Existing approaches require language-specific phonetic algorithms or fail to capture phonetic relationships across different scripts. This paper presents Symphonym, a neural embedding system that maps names from any script into a unified 128-dimensional phonetic space, enabling direct similarity comparison without runtime phonetic conversion. Symphonym uses a Teacher-Student architecture where a Teacher network trained on articulatory phonetic features produces target embeddings, while a Student network learns to approximate these embeddings directly from characters. The Teacher combines Epitran (extended with 100 new language-script mappings), Phonikud for Hebrew, and CharsiuG2P for Chinese, Japanese, and Korean. Training used 32.7 million triplet samples of toponyms spanning 20 writing systems from GeoNames, Wikidata, and Getty Thesaurus of Geographic Names. On the MEHDIE Hebrew-Arabic historical toponym benchmark, Symphonym achieves Recall@10 of 97.6% and MRR of 90.3%, outperforming Levenshtein and Jaro-Winkler baselines (Recall@1: 86.7% vs 81.5% and 78.5%). Evaluation on 12,947 real cross-script training pairs shows 82.6% achieve greater than 0.75 cosine similarity, with best performance on Arabic-Cyrillic (94–100%) and Cyrillic-Latin (94.3%) combinations. The fixed-length embeddings enable efficient retrieval in digital humanities workflows, with a case study on medieval personal names demonstrating effective transfer from modern place names to historical orthographic variation.
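
For readers unfamiliar with the retrieval metrics quoted in the abstract, the following minimal Python sketch shows how Recall@k and MRR are conventionally computed from per-query similarity scores. The function names, the toy scores, and the tie-handling rule are illustrative assumptions; none of them come from the paper or its evaluation code.

```python
def rank_of_gold(scores, gold_index):
    """Return the 1-based rank of the gold candidate under descending score order.

    Assumption: ties are resolved in favour of the gold candidate.
    """
    gold_score = scores[gold_index]
    # Every candidate scoring strictly higher pushes the gold candidate down one rank.
    return 1 + sum(1 for s in scores if s > gold_score)


def recall_at_k(all_scores, gold_indices, k):
    """Fraction of queries whose gold candidate appears in the top k."""
    hits = sum(
        1
        for scores, gold in zip(all_scores, gold_indices)
        if rank_of_gold(scores, gold) <= k
    )
    return hits / len(gold_indices)


def mean_reciprocal_rank(all_scores, gold_indices):
    """Average of 1/rank of the gold candidate over all queries."""
    total = sum(
        1.0 / rank_of_gold(scores, gold)
        for scores, gold in zip(all_scores, gold_indices)
    )
    return total / len(gold_indices)


# Hypothetical toy data: one row of candidate scores per query.
toy_scores = [
    [0.91, 0.40, 0.15],  # gold candidate 0 is ranked 1st
    [0.55, 0.62, 0.30],  # gold candidate 0 is ranked 2nd
    [0.20, 0.35, 0.80],  # gold candidate 2 is ranked 1st
    [0.70, 0.60, 0.50],  # gold candidate 2 is ranked 3rd
]
toy_gold = [0, 0, 2, 2]

print(recall_at_k(toy_scores, toy_gold, 1))
print(recall_at_k(toy_scores, toy_gold, 2))
print(mean_reciprocal_rank(toy_scores, toy_gold))
```

On this toy data, two of four queries rank the gold candidate first, so Recall@1 is 0.5, while MRR also credits the queries where the gold candidate appears lower in the list.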

Research Motivation and Objectives

  • Address cross-script toponym matching without language-specific phonetic resources.
  • Learn a unified phonetic embedding space for 20+ scripts that supports scalable inference.
  • Transfer phonetic knowledge from articulatory features to character-based inference via distillation.
  • Mitigate false cognates and OCR/spelling noise through a three-phase curriculum and noise augmentation.
  • Enable integration with WHG for phonetic search across 67M+ toponyms.

Proposed Method

  • Teacher network encodes IPA-based articulatory features (Epitran G2P + PanPhon) into 128-d embeddings.
  • Student network learns to approximate Teacher embeddings directly from raw characters, enabling inference without phonetic resources.
  • Three-phase training curriculum: Phase 1 triplet loss on phonetic features, Phase 2 distillation to Student, Phase 3 hard negative training.
  • Script-aware input with detection of 20 scripts and script-token embeddings, yielding a unified, script-agnostic embedding space.
  • Cosine similarity used for inference and incorporated into training loss, addressing limitations of L1/Manhattan distances in phonetic space.
  • Noise augmentation during Student training to mimic OCR/spelling variation and transcription inconsistencies.
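
The training objectives named in the list above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the margin value, the use of 1 minus cosine similarity as the distance, the cosine form of the distillation loss, and the toy 2-d vectors are all hypothetical, since this summary does not give the paper's exact formulation.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def triplet_cosine_loss(anchor, positive, negative, margin=0.3):
    """Hinge-style triplet loss using cosine distance (assumed form).

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    d_pos = 1.0 - cosine_similarity(anchor, positive)
    d_neg = 1.0 - cosine_similarity(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


def distillation_loss(student_vec, teacher_vec):
    """Cosine distance between a Student embedding and its Teacher target (assumed form)."""
    return 1.0 - cosine_similarity(student_vec, teacher_vec)


# Hypothetical 2-d stand-ins for the 128-d embeddings.
anchor = [1.0, 0.0]
positive = [0.8, 0.6]       # a variant spelling of the same name
easy_negative = [0.0, 1.0]  # an unrelated name
hard_negative = [0.6, 0.8]  # a similar-sounding but different name

print(triplet_cosine_loss(anchor, positive, easy_negative))  # no penalty
print(triplet_cosine_loss(anchor, positive, hard_negative))  # positive penalty
print(distillation_loss(positive, anchor))
```

The easy negative already sits far from the anchor and contributes no loss, whereas the hard negative violates the margin and produces a gradient signal, which is the usual motivation for a dedicated hard-negative phase.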

Experimental Results

Research Questions

  • RQ1: Can cross-script toponym matching be achieved without language identification or runtime phonetic conversion?
  • RQ2: How well do unified 128-d phonetic embeddings perform across 20 scripts compared to traditional string metrics or single-script methods?
  • RQ3: Does teacher-student distillation enable robust cross-script generalization to low-resource scripts?
  • RQ4: What is the impact of noise augmentation and hard negative mining on embedding quality and retrieval performance?

Key Findings

  • On the MEHDIE Hebrew-Arabic benchmark, Symphonym attains 86.7% Recall@1, with Recall@10 of 97.6% and MRR of 90.3%.
  • It outperforms Levenshtein (81.5% Recall@1) and Jaro-Winkler (78.5% Recall@1) baselines on the benchmark.
  • The model supports cross-script matching such as 北京 vs Beijing where standard string metrics fail, and generalizes to scripts with limited training data due to phonetic grounding.
  • Training data comprise 5,088,419 unique training pairs after stratification and deduplication, derived from GeoNames, Wikidata, and Getty TGN across 66.9M toponyms.
  • The WHG deployment will enable phonetic search and reconciliation across 67M+ toponyms within the World Historical Gazetteer.
  • Embeddings are 128-dimensional and learned via a Teacher (articulatory features) and a Student (character sequences) with a 3-phase curriculum.
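
To make concrete why edit-distance baselines struggle on cross-script pairs such as 北京 vs Beijing, here is a small self-contained Python sketch of Levenshtein distance with a length-normalized similarity. The normalization by the longer string's length is an assumption chosen for illustration; the summary does not specify how the paper's baselines were normalized.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def normalized_similarity(a, b):
    """Map edit distance to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


# The two spellings share no characters, so every position counts as an edit.
print(levenshtein("北京", "Beijing"))
print(normalized_similarity("北京", "Beijing"))
```

Because the two strings have no characters in common, the distance equals the length of the longer string and the normalized similarity is 0.0, even though both spell the same place name. A phonetic embedding compares pronunciations rather than code points, so it is not bound by this limitation.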


This review was generated by AI and reviewed by human editors.