QUICK REVIEW

[Paper Review] Video Classification with Channel-Separated Convolutional Networks

Du Tran, Heng Wang | arXiv (Cornell University) | April 4, 2019
Human Pose and Action Recognition | References: 42 | Citations: 84
One-Line Summary

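This paper proposes Channel-Separated Convolutional Networks (CSNs) for video classification. By separating channel interactions from spatiotemporal interactions, CSNs reduce FLOPs by 2–3x while maintaining accuracy, and they match or outperform existing 3D CNNs on multiple datasets.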

ABSTRACT

Group convolution has been shown to offer great computational savings in various 2D convolutional architectures for image classification. It is natural to ask: 1) if group convolution can help to alleviate the high computational cost of video classification networks; 2) what factors matter the most in 3D group convolutional networks; and 3) what are good computation/accuracy trade-offs with 3D group convolutional networks. This paper studies the effects of different design choices in 3D group convolutional networks for video classification. We empirically demonstrate that the amount of channel interactions plays an important role in the accuracy of 3D group convolutional networks. Our experiments suggest two main findings. First, it is a good practice to factorize 3D convolutions by separating channel interactions and spatiotemporal interactions as this leads to improved accuracy and lower computational cost. Second, 3D channel-separated convolutions provide a form of regularization, yielding lower training accuracy but higher test accuracy compared to 3D convolutions. These two empirical findings lead us to design an architecture -- Channel-Separated Convolutional Network (CSN) -- which is simple, efficient, yet accurate. On Sports1M, Kinetics, and Something-Something, our CSNs are comparable with or better than the state-of-the-art while being 2-3 times more efficient.

Motivation and Objectives

  • Reduce the computational cost of 3D CNNs for video classification without sacrificing accuracy.
  • Investigate the role of channel interactions versus spatiotemporal interactions in 3D group convolutions.
  • Propose CSNs that factorize channel and spatiotemporal processing to improve efficiency and regularization.
  • Evaluate CSNs against state-of-the-art on large-scale video datasets to establish performance and efficiency gains.

Proposed Method

  • Introduce 3D channel-separated networks where all convolutions (except conv1) are either 1x1x1 for channel interaction or depthwise 3x3x3 for local spatiotemporal interaction.
  • Define and quantify channel interactions as the number of interacting channel pairs through filters.
  • Present interaction-preserving (ip-CSN) and interaction-reduced (ir-CSN) bottleneck blocks and corresponding architectures.
  • Compare traditional 3D convolutions with CSN variants in terms of FLOPs, parameters, and channel interactions.
  • Conduct ablations on Kinetics-400 to study the impact of block design, depth, and channel interactions on accuracy and regularization.
  • Evaluate CSNs on Sports1M, Kinetics-400, and Something-Something-v1, including finetuning from Sports1M.
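
The block designs in the list above can be compared with a small counting script. This is a minimal sketch under stated assumptions, not the authors' code: it assumes that a filter makes every pair of its input channels interact (one reading of "interacting channel pairs"), that a bottleneck reduces channels by 4x as in ResNet, that ir-CSN swaps the middle 3x3x3 convolution for a depthwise 3x3x3, and that ip-CSN swaps it for a 1x1x1 followed by a depthwise 3x3x3. The names `conv3d_stats` and `bottleneck_stats` are hypothetical helpers introduced only for this illustration.

```python
from math import comb


def conv3d_stats(c_in, c_out, kernel, groups=1):
    """Return (params, interactions) for one bias-free 3D convolution."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    kt, kh, kw = kernel
    inputs_per_filter = c_in // groups
    params = c_out * inputs_per_filter * kt * kh * kw
    # Assumed counting rule: a filter lets every pair of its input channels interact.
    interactions = c_out * comb(inputs_per_filter, 2)
    return params, interactions


def bottleneck_stats(channels, variant):
    """Sum (params, interactions) over the convolutions of one bottleneck block.

    Assumes the ResNet convention of a 4x channel reduction inside the block.
    """
    mid = channels // 4
    depthwise = conv3d_stats(mid, mid, (3, 3, 3), groups=mid)
    middle = {
        "standard": [conv3d_stats(mid, mid, (3, 3, 3))],
        "ir": [depthwise],  # interaction-reduced: depthwise only
        "ip": [conv3d_stats(mid, mid, (1, 1, 1)), depthwise],  # interaction-preserved
    }[variant]
    layers = [
        conv3d_stats(channels, mid, (1, 1, 1)),
        *middle,
        conv3d_stats(mid, channels, (1, 1, 1)),
    ]
    return sum(p for p, _ in layers), sum(i for _, i in layers)


if __name__ == "__main__":
    for name in ("standard", "ir", "ip"):
        params, interactions = bottleneck_stats(256, name)
        print(f"{name:>8}: params={params:,} interactions={interactions:,}")
```

Under these assumptions, a 256-channel ip block has the same interaction count as the standard block with far fewer parameters, while the ir block drops the interactions contributed by the middle layer. Multiply-accumulate counts at stride 1 scale with the parameter counts, since each weight is applied once per output position.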

Experimental Results

Research Questions

  • RQ1: How do channel interactions influence the accuracy of 3D group convolutional networks for video classification?
  • RQ2: Can separating channel interactions from spatiotemporal interactions reduce computation while preserving or improving accuracy?
  • RQ3: Do interaction-preserving and interaction-reducing CSN variants offer favorable accuracy vs. FLOPs tradeoffs compared with standard 3D CNNs?
  • RQ4: Do channel-separated architectures exhibit regularization benefits that improve generalization on video datasets?

Key Findings

  • CSNs can achieve comparable or superior accuracy to state-of-the-art 3D CNNs while reducing FLOPs by about 2–3x when channel interactions are preserved.
  • Interaction-preserved CSNs (ip-CSN) maintain channel interactions and consistently outperform interaction-reduced CSNs (ir-CSN) in deeper models.
  • Channel separation acts as a regularizer, yielding higher training error but lower test error compared with dense 3D convolutions.
  • Bottleneck-based (ir-CSN) designs provide the best computation/accuracy tradeoff within the studied block designs.
  • On Sports1M, Kinetics, and Something-Something-v1, CSNs are comparable with or better than prior art and considerably faster.
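
As a rough per-layer illustration of where the savings come from (a back-of-the-envelope estimate, not a figure from the paper), assume a single 3x3x3 convolution with C input and C output channels, stride 1, and no bias. Counting multiply-accumulates per output position:

```latex
\begin{align*}
\text{standard } 3\times3\times3: \quad & 27\,C^2 \\
\text{depthwise } 3\times3\times3 \;+\; 1\times1\times1: \quad & 27\,C + C^2 \\
\text{ratio}: \quad & \frac{27\,C^2}{27\,C + C^2} = \frac{27\,C}{27 + C}
\end{align*}
```

For C = 64 the ratio is 1728/91, roughly 19. This is a per-layer number only; conv1 and the 1x1x1 convolutions already present in the network are unchanged by channel separation, which is consistent with the much smaller network-level reduction of about 2–3x.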


This review was generated by AI and reviewed by a human editor.