QUICK REVIEW

[Paper Review] SyntheT2C: Generating Synthetic Data for Fine-Tuning Large Language Models on the Text2Cypher Task

Ziije Zhong, Linqing Zhong | arXiv (Cornell University) | Jun 15, 2024

Topic Modeling | 5 citations

TL;DR

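SyntheT2C uses two pipelines, LLM-based prompting and template-filling, to generate synthetic Query-Cypher pairs for Neo4j databases, which are then used to fine-tune LLMs on the Text2Cypher task. The generated pairs are validated automatically and manually, and fine-tuning on the resulting dataset improves the Cypher queries that LLMs write.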

ABSTRACT

Integrating Large Language Models (LLMs) with existing Knowledge Graph (KG) databases presents a promising avenue for enhancing LLMs' efficacy and mitigating their "hallucinations". Given that most KGs reside in graph databases accessible solely through specialized query languages (e.g., Cypher), it is critical to connect LLMs with KG databases by automating the translation of natural language into Cypher queries (termed as "Text2Cypher" task). Prior efforts tried to bolster LLMs' proficiency in Cypher generation through Supervised Fine-Tuning (SFT). However, these explorations are hindered by the lack of annotated datasets of Query-Cypher pairs, resulting from the labor-intensive and domain-specific nature of such annotation. In this study, we propose SyntheT2C, a methodology for constructing a synthetic Query-Cypher pair dataset, comprising two distinct pipelines: (1) LLM-based prompting and (2) template-filling. SyntheT2C is applied to two medical KG databases, culminating in the creation of a synthetic dataset, MedT2C. Comprehensive experiments demonstrate that the MedT2C dataset effectively enhances the performance of backbone LLMs on Text2Cypher task via SFT. Both the SyntheT2C codebase and the MedT2C dataset are released in https://github.com/ZGChung/SyntheT2C.

Motivation & Objective

Bridge the gap between natural language and Cypher queries for Neo4j databases.
Generate high-quality synthetic Query-Cypher pairs without manual annotation.
Enable effective fine-tuning of backbone LLMs to write executable Cypher queries.
Provide validation tools and datasets to support Text2Cypher research.

Proposed method

Two complementary pipelines generate synthetic Query-Cypher pairs: (i) LLM-based prompting, which produces semantically diverse Cypher queries, and (ii) template-filling, which produces syntactically complex Cypher queries.
Database metadata is extracted and generation is grounded in the schema so that the generated Cypher queries are executable.
Automatic validators (Grammatical, Semantic, Entity, Schema, Coherence) screen the Cypher queries before manual validation.
Manual validation with consensus voting to ensure high-quality ground-truth data.
Fine-tuning backbone LLMs with LoRA using MedT2C on two Neo4j medical databases (LHY and Hetionet).
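
The template-filling pipeline and the schema check described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the toy schema, the single template, and the helper names `fill_templates` and `passes_schema_check` are all hypothetical, and the regex-based check is only a crude stand-in for the Schema validator named in the review.

```python
import re

# Hypothetical toy schema. The review says metadata is extracted from the
# database; here it is hard-coded so the sketch runs without Neo4j.
SCHEMA = {
    "labels": {"Disease": ["name"], "Drug": ["name"]},
    "relationships": [("Drug", "TREATS", "Disease")],
}

# Hypothetical template: each {slot} is filled with a schema element.
TEMPLATE = {
    "question": "How many {src} nodes are linked to each {dst} by {rel}?",
    "cypher": (
        "MATCH (a:{src})-[:{rel}]->(b:{dst}) "
        "RETURN b.{dst_prop} AS {dst_prop}, count(a) AS n ORDER BY n DESC"
    ),
}


def fill_templates(schema, template):
    """Instantiate the template once per (relationship, target property)."""
    pairs = []
    for src, rel, dst in schema["relationships"]:
        for dst_prop in schema["labels"][dst]:
            slots = {"src": src, "rel": rel, "dst": dst, "dst_prop": dst_prop}
            pairs.append({
                "question": template["question"].format(**slots),
                "cypher": template["cypher"].format(**slots),
            })
    return pairs


def passes_schema_check(cypher, schema):
    """Crude check: every node label and relationship type must be in the schema."""
    labels = set(re.findall(r"\(\w*:(\w+)", cypher))
    rels = set(re.findall(r"\[\w*:(\w+)", cypher))
    known_rels = {rel for _, rel, _ in schema["relationships"]}
    return labels <= set(schema["labels"]) and rels <= known_rels


# Keep only the generated pairs that reference known schema elements.
PAIRS = [
    pair for pair in fill_templates(SCHEMA, TEMPLATE)
    if passes_schema_check(pair["cypher"], SCHEMA)
]
```

Filling slots only from extracted schema elements is what keeps template output executable by construction; a separate check is still useful for the LLM-prompted pipeline, where the model can invent labels that do not exist in the database.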

Experimental results

Research questions

RQ1: Can synthetic Query-Cypher pairs produced by SyntheT2C effectively train LLMs to generate executable Cypher queries?
RQ2: How do the two synthetic data pipelines complement each other in improving Cypher writing performance?
RQ3: What is the impact of data validation (automatic and manual) on the quality and execution accuracy of generated Cyphers?
RQ4: How does scaling the synthetic dataset affect Cypher quality and execution accuracy?
RQ5: Is the MedT2C dataset effective for fine-tuning different LLM families (open and closed) on Text2Cypher?

Key findings

MedT2C improves Cypher writing quality across several backbone LLMs after fine-tuning.
Combining both pipelines yields the best overall performance compared to using a single pipeline.
All five validators together provide the strongest gains in Cypher quality and execution accuracy during ablations.
Scaling results show improved performance up to a dataset size similar to MedT2C, with diminishing returns beyond that size.
Template-based data alone can hinder performance if used without complementary semantic data from prompting.
The MedT2C dataset (synthetic data generated from LHY and Hetionet) enables fine-tuned LLMs to write Cypher queries that approach or exceed human-annotated ones in quality.
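
Several findings above refer to execution accuracy. The review does not give the paper's exact definition, so the following is only one common way to compute such a metric: run the predicted and the reference query and compare their result rows as order-insensitive multisets. The function names, the `execute` callable, and the in-memory `FAKE_RESULTS` table (which replaces a real database connection) are assumptions made for illustration.

```python
from collections import Counter


def normalize(rows):
    """Turn a list of row dicts into an order-insensitive multiset of rows."""
    return Counter(tuple(sorted(row.items())) for row in rows)


def execution_accuracy(examples, execute):
    """Fraction of examples whose predicted query returns the gold result set.

    `execute` is any callable that maps a Cypher string to a list of row
    dicts; a prediction that raises on execution counts as incorrect.
    """
    correct = 0
    for example in examples:
        try:
            predicted = normalize(execute(example["predicted"]))
        except Exception:
            continue
        if predicted == normalize(execute(example["gold"])):
            correct += 1
    return correct / len(examples) if examples else 0.0


# Hypothetical stand-in for a database: canned results keyed by query text.
Q_GOLD = "MATCH (d:Drug) RETURN d.name AS name"
Q_REORDERED = "MATCH (d:Drug) RETURN d.name AS name ORDER BY name DESC"
FAKE_RESULTS = {
    Q_GOLD: [{"name": "aspirin"}, {"name": "ibuprofen"}],
    Q_REORDERED: [{"name": "ibuprofen"}, {"name": "aspirin"}],
}


def fake_execute(cypher):
    """Return canned rows, or fail the way an invalid query would."""
    if cypher not in FAKE_RESULTS:
        raise ValueError("query failed to execute")
    return FAKE_RESULTS[cypher]


EXAMPLES = [
    {"gold": Q_GOLD, "predicted": Q_REORDERED},            # same rows, other order
    {"gold": Q_GOLD, "predicted": "MATCH (d:Drg RETURN d"},  # does not execute
]
SCORE = execution_accuracy(EXAMPLES, fake_execute)
```

Comparing results rather than query text credits predictions that are written differently from the reference but return the same rows. Note that this sketch also compares column aliases, which a more lenient metric might ignore.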

This review was created by AI and reviewed by human editors.