QUICK REVIEW

[Paper Review] SCENE: Semantic-aware Codec Enhancement with Neural Embeddings

Han-Yu Lin, Li-Wei Chen | arXiv (Cornell University) | 2026-01-29

Image and Video Quality Assessment | Citations: 0

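One-line summary

SCENE is a lightweight, semantic-guided pre-processing framework that modulates assembled convolutions with vision-language embeddings and is trained with a differentiable codec proxy to improve perceptual video quality; at inference it runs as a standalone, real-time pre-processor.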

ABSTRACT

Compression artifacts from standard video codecs often degrade perceptual quality. We propose a lightweight, semantic-aware pre-processing framework that enhances perceptual fidelity by selectively addressing these distortions. Our method integrates semantic embeddings from a vision-language model into an efficient convolutional architecture, prioritizing the preservation of perceptually significant structures. The model is trained end-to-end with a differentiable codec proxy, enabling it to mitigate artifacts from various standard codecs without modifying the existing video pipeline. During inference, the codec proxy is discarded, and SCENE operates as a standalone pre-processor, enabling real-time performance. Experiments on high-resolution benchmarks show improved performance over baselines in both objective (MS-SSIM) and perceptual (VMAF) metrics, with notable gains in preserving detailed textures within salient regions. Our results show that semantic-guided, codec-aware pre-processing is an effective approach for enhancing compressed video streams.

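Motivation and objectives

Motivated by the perceptual quality gap left by standard codecs, the paper explores semantic-aware enhancement.
Introduces SCENE, which uses vision-language embeddings to guide content-aware restoration.
Bridges the gap between training and deployment with a differentiable codec proxy.
Demonstrates real-time performance and quality improvements on high-resolution benchmarks.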

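Proposed method

Downsamples the input frame and extracts low-level features with 3x3 convolution layers.
Extracts semantic embeddings with a frozen SigLIP 2 encoder and converts them into channel-wise convolution coefficients.
Applies an assembled convolution whose content-dependent kernels are modulated by the semantic coefficients.
Trains with a differentiable JPEG proxy that simulates codec distortion, optimizing a multi-term loss.
At inference, drops the codec proxy and runs SCENE as a standalone pre-processor.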
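
The modulation step can be sketched roughly as follows. This is a minimal, single-channel illustration and not the authors' implementation: the softmax mixing, the three-kernel bank, the random `projection` matrix, and the 8-dimensional stand-in embedding are all assumptions made for the example (a real system would use a SigLIP 2 embedding and learned weights).

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def mixing_coefficients(embedding, projection):
    """Project a semantic embedding (length D) to K kernel-mixing weights."""
    logits = [sum(w * e for w, e in zip(row, embedding)) for row in projection]
    return softmax(logits)


def assemble_kernel(kernel_bank, coeffs):
    """Blend K basis 3x3 kernels into one content-dependent 3x3 kernel."""
    return [[sum(c * k[i][j] for c, k in zip(coeffs, kernel_bank))
             for j in range(3)] for i in range(3)]


def conv3x3(image, kernel):
    """Single-channel 3x3 cross-correlation with zero padding."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[dy + 1][dx + 1] * image[yy][xx]
            out[y][x] = acc
    return out


# Hypothetical kernel bank; each basis kernel sums to 1.
IDENTITY = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
BOX_BLUR = [[1.0 / 9.0] * 3 for _ in range(3)]
SHARPEN = [[0.0, -1.0, 0.0], [-1.0, 5.0, -1.0], [0.0, -1.0, 0.0]]
kernel_bank = [IDENTITY, BOX_BLUR, SHARPEN]

rng = random.Random(0)
embedding = [rng.uniform(-1, 1) for _ in range(8)]  # stand-in for a semantic embedding
projection = [[rng.uniform(-1, 1) for _ in range(8)] for _ in kernel_bank]  # untrained

coeffs = mixing_coefficients(embedding, projection)
kernel = assemble_kernel(kernel_bank, coeffs)
frame = [[float((x + y) % 4) for x in range(6)] for y in range(6)]
filtered = conv3x3(frame, kernel)
```

Because the coefficients depend only on the embedding, the same kernel bank yields a different effective filter for each frame's semantic content, which is the idea the list above describes.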

Fig. 1 : Illustration of the proposed SCENE framework.

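Experimental results

Research questions

RQ1: Can semantic-aware, codec-aware pre-processing improve VMAF perceptual quality and the integrity of salient regions on standard codecs (H.264/H.265/AV1) without changing the decoding pipeline?
RQ2: Do vision-language model embeddings guide content-adaptive restoration more effectively than the baseline assembled convolution?
RQ3: Does training with a differentiable codec proxy generalize better to real codec distortions at inference?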
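
To see why a codec proxy has to be differentiable at all, consider quantization, the lossy step in JPEG-style coding. Hard rounding has zero gradient almost everywhere, so no training signal would reach the pre-processor. The sketch below uses a cubic rounding surrogate, one common choice in differentiable JPEG implementations; whether SCENE's proxy uses this particular surrogate is an assumption, and the function names are illustrative.

```python
def quantize_hard(coeff, step):
    """Codec-style quantization: snap to the nearest multiple of `step`."""
    return round(coeff / step) * step


def quantize_soft(coeff, step):
    """Differentiable surrogate: round(u) + (u - round(u))**3, rescaled by `step`."""
    u = coeff / step
    r = round(u)
    return (r + (u - r) ** 3) * step


def numerical_grad(fn, coeff, step, eps=1e-4):
    """Central-difference estimate of d fn / d coeff."""
    return (fn(coeff + eps, step) - fn(coeff - eps, step)) / (2 * eps)


# Hypothetical transform coefficient and quantization step.
hard_grad = numerical_grad(quantize_hard, 13.0, 10.0)  # flat between rounding boundaries
soft_grad = numerical_grad(quantize_soft, 13.0, 10.0)  # analytically 3 * (1.3 - 1)**2
```

The surrogate stays close to the hard quantizer's output while still passing a nonzero gradient back to the network, which is what makes end-to-end training through a simulated codec possible.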

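Key results

On H.264, SCENE achieves a larger VMAF BD-rate reduction than AsConvSR (−32.0% vs. −29.4%).
On H.265, SCENE's VMAF BD-rate reduction is −37.4%, versus −33.9% for AsConvSR.
MS-SSIM BD-rate changes are small and positive (+6% to +11%), suggesting that the perceptual gains come at a limited cost in pixel-level fidelity.
On AV1, SCENE yields VMAF gains of up to +10.6 points, but the accompanying bitrate increase moves the results outside the codec-only operating range, so BD-rate is undefined.
SCENE maintains MS-SSIM comparable to AsConvSR while improving perceptual metrics in the low-bitrate range.
Inference latency is about 27.74 ms per 1080p frame on an RTX 4090 (~36 fps), which supports real-time deployment.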
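
For readers unfamiliar with BD-rate, the sketch below shows how such a number is typically obtained: the average log-bitrate difference between two rate-quality curves over their shared quality interval, expressed as a percentage. It is a simplified version that uses piecewise-linear interpolation instead of the cubic fit found in standard tools, and every data point is made up for illustration. It returns `None` when the curves share no quality interval, which is one way BD-rate can become undefined; whether that is exactly the AV1 situation reported above is an assumption.

```python
import math


def _interp(x, xs, ys):
    """Piecewise-linear interpolation, clamped at the ends; xs strictly increasing."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    for i in range(len(xs) - 1):
        if xs[i] <= x <= xs[i + 1]:
            t = (x - xs[i]) / (xs[i + 1] - xs[i])
            return ys[i] + t * (ys[i + 1] - ys[i])
    raise ValueError("xs must be strictly increasing")


def _mean_log_rate(curve, lo, hi, samples=1000):
    """Mean of log(bitrate) over quality in [lo, hi], via the trapezoidal rule."""
    quality = [q for _, q in curve]
    log_rate = [math.log(r) for r, _ in curve]
    total = 0.0
    for i in range(samples + 1):
        q = lo + (hi - lo) * i / samples
        weight = 0.5 if i in (0, samples) else 1.0
        total += weight * _interp(q, quality, log_rate)
    return total / samples


def bd_rate(anchor, test):
    """Percent bitrate change of `test` vs `anchor` at equal quality.

    Each curve is a list of (bitrate, quality) points. Returns None when the
    two curves have no overlapping quality interval.
    """
    anchor = sorted(anchor, key=lambda p: p[1])
    test = sorted(test, key=lambda p: p[1])
    lo = max(anchor[0][1], test[0][1])
    hi = min(anchor[-1][1], test[-1][1])
    if hi <= lo:
        return None
    diff = _mean_log_rate(test, lo, hi) - _mean_log_rate(anchor, lo, hi)
    return (math.exp(diff) - 1.0) * 100.0


# Hypothetical (kbps, VMAF) points, not taken from the paper.
codec_only = [(1000.0, 70.0), (2000.0, 80.0), (4000.0, 90.0)]
with_preproc = [(800.0, 70.0), (1600.0, 80.0), (3200.0, 90.0)]
shifted = [(5000.0, 92.0), (6000.0, 95.0)]

saving = bd_rate(codec_only, with_preproc)  # about -20: 20% fewer bits at equal quality
undefined = bd_rate(codec_only, shifted)    # None: no shared quality interval
```

A negative value means the test configuration needs fewer bits for the same quality, which is how the −32.0% and −37.4% figures above should be read.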

Fig. 2 : Qualitative comparison between standard AV1 and our SCENE-enhanced AV1.

This review was generated by AI and reviewed by a human editor.