QUICK REVIEW

[Paper Review] Exploring parameter-efficient fine-tuning (PEFT) of billion-parameter vision models with QLoRA and DoRA: insights into generalization for limited-data image classification under a 98:1 test-to-train regime

Haiyu Yang, Sumit Sharma | arXiv (Cornell University) | Mar 18, 2026

Smart Agriculture and AI | Citations: 0

One-Sentence Summary

The paper systematically compares training from scratch, frozen feature extraction, and PEFT (QLoRA and DoRA) of DINOv3 on nine dairy-cow behaviors. PEFT reaches the highest test accuracy of the compared approaches (83.16%) with only 2,160 training images and shorter training time, under a 98:1 test-to-train regime.

ABSTRACT

Automated behavior classification is essential for precision livestock farming but faces challenges of high computational costs and limited labeled data. This study systematically compared three approaches: training from scratch (ResNet-18, ViT-Small), frozen feature extraction, and parameter-efficient fine-tuning (PEFT) of the DINOv3 foundation model (6.7 billion parameters). We evaluated QLoRA and DoRA across multiple configurations varying rank (8, 16, 64) and target modules (q_proj versus all-linear layers). With 2,160 verified training images, we assessed generalization of our model on 211,800 test samples, which is essentially a 98:1 test-to-train ratio. Results demonstrated that PEFT substantially outperformed alternatives, where the best QLoRA configuration (all-linear layers and rank=64) achieved 83.16% test accuracy with only 2.72% parameters (183.0M) in 5.8 hours, compared to 72.87% for ResNet-18 (16.8 hours), 61.91% for ViT-Small (18.7 hours), and 76.56% for frozen DINOv3 (17.5 hours). DoRA achieved comparable accuracy (83.14%) but with longer training time (11.0 hours). Notably, increasing adapter capacity consistently improved generalization while simultaneously not causing overfitting: reducing rank from 16 to 8 decreased test accuracy from 78.38% to 77.17%, while expanding from q_proj-only to all-linear layers with rank=64 improved accuracy from 78.38% to 83.16%. This suggests underfitting, instead of overfitting, is the primary challenge when adapting foundation models to agricultural imagery. Our findings provide guidelines for deploying billion-parameter vision models with PEFT in agricultural livestock applications.
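The headline figures in the abstract can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers quoted in the abstract; the variable names are illustrative.

```python
# Figures quoted in the abstract.
train_images = 2_160
test_samples = 211_800
trainable_params = 183.0e6   # best QLoRA configuration (all-linear, rank=64)
trainable_fraction = 0.0272  # reported as 2.72% of all parameters

# Ratio of held-out test samples to training images.
test_to_train = test_samples / train_images

# Total parameter count implied by the trainable count and its percentage.
implied_total = trainable_params / trainable_fraction

print(f"test-to-train ratio: {test_to_train:.2f}:1")
print(f"implied total parameters: {implied_total / 1e9:.2f}B")
```

The ratio comes out at about 98:1 and the implied total at about 6.7 billion parameters, so the quoted figures are mutually consistent.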

Motivation and Objectives

Assess generalization of vision models to agricultural imagery with limited labeled data.
Systematically compare three learning paradigms: training from scratch, frozen feature extraction, and PEFT on a billion-parameter foundation model.
Evaluate PEFT hyperparameters (rank and target modules) for QLoRA and DoRA.
Provide practical deployment guidelines for industrial livestock monitoring using PEFT-based foundation models.

Proposed Method

Fine-tune DINOv3 (6.7B params) with PEFT using QLoRA and DoRA.
Quantize backbone to 4-bit and inject low-rank adapters; vary rank r in {8,16,64} and target modules {q_proj, all-linear}.
Train for 80 epochs with batch size 4 (effective batch size 32 via gradient accumulation), learning rate 1e-4, warmup, and cosine annealing; use mixed precision and gradient checkpointing.
Data: 2,160 training images (80% of each class across 9 behaviors) with an augmented training set; 540 validation images and 211,800 test samples from two sources (MMCows, PlayBehaviour).
Evaluation: accuracy, weighted F1-score, per-class metrics; latency and throughput for inference.
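As a rough illustration of the method settings above, the sketch below counts adapter parameters for q_proj-only LoRA and DoRA and implements a warmup-plus-cosine learning-rate schedule. The hidden size (4096) and depth (40 layers) are assumptions about the backbone that this review does not state, as is the placement of one DoRA magnitude entry per output feature; the warmup length is not given in the review, so it is left as a free parameter. This is not the authors' code.

```python
import math

def lora_params(d_in, d_out, rank):
    """LoRA adds two factors next to a frozen weight: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)

def dora_params(d_in, d_out, rank):
    """DoRA keeps the LoRA factors and adds a learnable magnitude vector.
    Assumption: one magnitude entry per output feature."""
    return lora_params(d_in, d_out, rank) + d_out

# Assumptions (not stated in this review): backbone hidden size and depth.
HIDDEN = 4096
LAYERS = 40

for rank in (8, 16):
    qlora = LAYERS * lora_params(HIDDEN, HIDDEN, rank)
    dora = LAYERS * dora_params(HIDDEN, HIDDEN, rank)
    print(f"q_proj, rank={rank}: LoRA {qlora / 1e6:.2f}M, DoRA {dora / 1e6:.2f}M")

def lr_at(step, total_steps, warmup_steps, peak_lr=1e-4):
    """Linear warmup to peak_lr, then cosine annealing to zero."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))

# Batch size 4 with an effective batch size of 32 implies 8 accumulation steps.
accum_steps = 32 // 4
print("gradient accumulation steps:", accum_steps)
```

Under these assumptions the rank-8 counts are about 2.62M (LoRA) and 2.79M (DoRA), in line with the 2.6M and 2.8M reported for the q_proj configurations.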

Experimental Results

Research Questions

RQ1: Can PEFT enable competitive performance on billion-parameter vision models with very limited training data in agricultural image classification?
RQ2: How do QLoRA and DoRA compare in terms of accuracy, training efficiency, and stability under different adapter configurations?
RQ3: What is the effect of adapter capacity (rank) and scope (q_proj vs. all-linear) on generalization vs. overfitting in this domain?
RQ4: Do frozen pretrained feature extractors or models trained from scratch offer advantages over PEFT for livestock behavior classification?
RQ5: What practical guidelines emerge for deploying PEFT-based foundation models in precision livestock farming?

Key Findings

Method	Target	Rank	Trainable Params	Training Time	Test Acc	Test F1
ResNet-18 (scratch)	—	—	11.2M (100%)	16h 45m	72.87%	0.7526
ViT-Small (scratch)	—	—	21.7M (100%)	18h 39m	61.91%	0.6600
DINOv3 (frozen)	—	—	4.7M (0.07%)	17h 27m	76.56%	0.7691
QLoRA	q_proj	8	2.6M (0.04%)	6h 32m	77.17%	0.7646
QLoRA	q_proj	16	5.2M (0.08%)	7h 16m	78.38%	0.7753
QLoRA	all-linear	16	46.8M (0.70%)	4h 43m	80.40%	0.8069
QLoRA	all-linear	64	183.0M (2.72%)	5h 46m	83.16%	0.8380
DoRA	q_proj	8	2.8M (0.04%)	11h 31m	81.53%	0.8182
DoRA	q_proj	16	5.4M (0.08%)	10h 27m	81.03%	0.8153
DoRA	all-linear	16	48.4M (0.72%)	11h 51m	81.23%	0.8139
DoRA	all-linear	64	184.5M (2.75%)	10h 59m	83.14%	0.8338
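To compare the training times in the table on a common scale, the sketch below converts the "Xh Ym" strings to fractional hours and expresses each run relative to the fastest top-accuracy configuration (QLoRA, all-linear, rank 64). The values are copied from the table; the helper name `to_hours` is illustrative.

```python
import re

def to_hours(duration):
    """Convert a string such as '5h 46m' to fractional hours."""
    match = re.fullmatch(r"(\d+)h (\d+)m", duration)
    if match is None:
        raise ValueError(f"unrecognized duration: {duration!r}")
    hours, minutes = map(int, match.groups())
    return hours + minutes / 60

# Training times copied from the table above.
training_time = {
    "ResNet-18 (scratch)": "16h 45m",
    "ViT-Small (scratch)": "18h 39m",
    "DINOv3 (frozen)": "17h 27m",
    "QLoRA all-linear r=64": "5h 46m",
    "DoRA all-linear r=64": "10h 59m",
}

best = to_hours(training_time["QLoRA all-linear r=64"])
ratios = {name: to_hours(t) / best for name, t in training_time.items()}

for name, t in training_time.items():
    print(f"{name:24s} {to_hours(t):5.2f} h  ({ratios[name]:.2f}x the best QLoRA run)")
```

On these numbers the three baselines take roughly 2.9 to 3.2 times as long as the best QLoRA run, and the matching DoRA run about 1.9 times as long.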

PEFT substantially outperforms training from scratch and frozen feature extraction on nine-class dairy cow behavior classification; the best QLoRA configuration (all-linear, rank=64) achieves 83.16% test accuracy.
DoRA achieves comparable accuracy (83.14% test accuracy) but takes roughly twice as long to train as the best QLoRA setup (10h 59m vs. 5h 46m).
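For QLoRA, increasing adapter capacity (higher rank or broader target modules) improves test accuracy at every step (77.17%, 78.38%, 80.40%, 83.16%) rather than causing overfitting, indicating underfitting as the main challenge. DoRA stays between 81.03% and 81.53% for its three smaller configurations and reaches 83.14% only at all-linear, rank=64.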
QLoRA and DoRA with their best settings reach about 83% test accuracy while training only 2.72% to 2.75% of total parameters; the smallest adapters (0.04% of parameters) reach 77.17% (QLoRA) and 81.53% (DoRA).
Training time for QLoRA is substantially shorter (e.g., 5h 46m for all-linear, rank=64) than for the from-scratch and frozen-feature baselines (16h 45m to 18h 39m); DoRA runs take 10h 27m to 11h 51m.
QLoRA and DoRA differ in their sensitivity to adapter choices: DoRA test accuracy spans 81.03% to 83.14% across configurations, while QLoRA spans 77.17% to 83.16%, so DoRA is the more stable of the two.
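The stability comparison between the two methods can be quantified directly from the table. The sketch below computes the spread (maximum minus minimum) of test accuracy across the four configurations of each method; treating the spread as a stability measure is a choice made for this illustration, not a metric reported in the paper.

```python
# Test accuracy (%) per configuration, copied from the results table.
# Keys are (target modules, rank).
test_acc = {
    "QLoRA": {("q_proj", 8): 77.17, ("q_proj", 16): 78.38,
              ("all-linear", 16): 80.40, ("all-linear", 64): 83.16},
    "DoRA": {("q_proj", 8): 81.53, ("q_proj", 16): 81.03,
             ("all-linear", 16): 81.23, ("all-linear", 64): 83.14},
}

def spread(values):
    """Range of a collection of accuracies, in percentage points."""
    values = list(values)
    return max(values) - min(values)

spreads = {method: spread(accs.values()) for method, accs in test_acc.items()}

for method, accs in test_acc.items():
    best_config = max(accs, key=accs.get)
    print(f"{method}: best {accs[best_config]:.2f}% at {best_config}, "
          f"spread {spreads[method]:.2f} points")
```

QLoRA varies by about 6 percentage points across configurations and DoRA by about 2, while both peak at the all-linear, rank=64 setting.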

This review was generated by AI and checked by a human editor.