ECCV 2024 Day3 - Oral: Recognition

2024. 11. 13. 23:29 · ArtificialIntelligence/ECCV2024


ECCV 2024. 10. 01. Tuesday 
Oral 2B: Recognition

1) Generative Models 

 

The generative models oral session was held in the Gold Room.

The hall was absolutely enormous 😲


The session chairs keeping things moving


While listening to a talk, Timothée Chalamet showed up on a slide, which was a delightful surprise.

๊ทธ๋Ÿฐ๋ฐ ์—ฐ์˜ˆ์ธ ์‚ฌ์ง„๋“ค์„ ์ €๋ ‡๊ฒŒ ๋ง‰ ์‚ฌ์šฉํ•ด๋„ ๊ดœ์ฐฎ์€๊ฑด๊ฐ€..? ์ด๋Ÿฐ ์ƒ๊ฐ๋„ ๋“ค์—ˆ๋‹ค. 

AI ๋ชจ๋ธ์— ๋Œ๋ฆด ๋•Œ, ์ดˆ์ƒ๊ถŒ์€ ๋ณ„๊ฐœ์˜ ๋ฌธ์ œ์ธ๊ฑด๊ฐ€ .?? ๋ฌดํŠผ .. 


์ƒ์„ฑ ๋ชจ๋ธ ๋ฐœํ‘œ ๋“ฃ๋‹ค๊ฐ€, recognition์œผ๋กœ ๋„˜์–ด๊ฐ”๋‹ค.

 

์•„๋ฌด๋ž˜๋„ ์‹œ๊ฐ์ ์œผ๋กœ ๋ณด์—ฌ์ง€๋Š”๊ฒŒ ๋งŽ์•„์„œ, ์žฌ๋ฏธ์žˆ๊ฒŒ ๋“ค์—ˆ์ง€๋งŒ,

๊ธฐ์ˆ  ์ž์ฒด์— ์˜๋ฏธ๊ฐ€ ์žˆ๋‹ค๊ณ  ๋Š๋ผ์ง€๋Š” ๋ชปํ–ˆ๋‹ค.


2) Recognition  

Plush movie-theater seats...


# Google MobileNet

๊ตฌ๊ธ€๋งจ์€ ํŒŒ์›Œํฌ์ธํŠธ ์•ˆ์ผ๋‹ค๊ณ , ์กฐํฌ(?)๊ฐ™์€๊ฑธ ๋˜์ง€์…จ๋‹ค

 

The difficulty of multi-HW performance analysis

 

Terms and equations reminiscent of a computer architecture course

 

 

์งˆ์˜์‘๋‹ต ์‹œ๊ฐ„


MobileNetV4: Universal Models for the Mobile Ecosystem - Google 

https://eccv.ecva.net/virtual/2024/oral/482

https://www.ecva.net/papers/eccv_2024/papers_ECCV/papers/05647.pdf

+ I found this talk interesting because it showed real thought about hardware (approaches for how to analyze performance).
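The performance-analysis discussion leaned on roofline-style modeling (is a layer compute-bound or memory-bound?). To remember the idea, here is a minimal sketch of that reasoning with made-up hardware numbers, not figures from the talk:

```python
# Minimal roofline-model sketch: is a layer compute-bound or memory-bound?
# All hardware numbers below are hypothetical, not from the talk.

def roofline_latency_s(macs, bytes_moved, peak_macs_per_s, bytes_per_s):
    """Latency lower bound = max(compute time, memory-traffic time)."""
    return max(macs / peak_macs_per_s, bytes_moved / bytes_per_s)

macs = 5e8         # hypothetical layer: 0.5 GMACs
bytes_moved = 4e6  # hypothetical: 4 MB of weights + activations
peak = 2e12        # assumed accelerator peak: 2 TMAC/s
bw = 20e9          # assumed DRAM bandwidth: 20 GB/s

# Operational intensity vs. the "ridge point" tells you which bound wins.
intensity = macs / bytes_moved  # MACs per byte moved
ridge = peak / bw               # intensity where the two bounds cross
print(f"intensity={intensity:.0f} MACs/B, ridge={ridge:.0f} MACs/B")
print(f"latency >= {roofline_latency_s(macs, bytes_moved, peak, bw) * 1e3:.3f} ms")
```

Here the intensity (125 MACs/B) sits above the ridge (100 MACs/B), so this hypothetical layer is compute-bound and extra bandwidth would not help it.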

 

Abstract.
 We present the latest generation of MobileNets: MobileNetV4 (MNv4). They feature universally-efficient architecture designs for mobile devices. We introduce the Universal Inverted Bottleneck (UIB) search block, a unified and flexible structure that merges Inverted Bottleneck (IB), ConvNext, Feed Forward Network (FFN), and a novel Extra Depthwise (ExtraDW) variant. Alongside UIB, we present Mobile MQA, an attention block for mobile accelerators, delivering a significant 39% speedup. An optimized neural architecture search (NAS) recipe is also introduced which improves MNv4 search effectiveness.

 The integration of UIB, Mobile MQA and the refined NAS recipe results in a new suite of MNv4 models that are mostly Pareto optimal across mobile CPUs, DSPs, GPUs, as well as accelerators like Apple Neural Engine and Google Pixel EdgeTPU. This performance uniformity is not found in any other models tested. We introduce performance modeling and analysis techniques to explain how this performance is achieved. Finally, to further boost accuracy, we introduce a novel distillation technique. Enhanced by this technique, our MNv4-Hybrid-Large model delivers 87% ImageNet-1K accuracy, with a Pixel 8 EdgeTPU runtime of 3.8ms.
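For my own notes, the UIB block boils down to two optional depthwise convolutions around the usual pointwise expansion/projection pair; the four on/off combinations recover the four variants the abstract names. A minimal PyTorch sketch of that structure follows (my reading of the paper: the BatchNorm/ReLU placement and channel sizes are simplifying assumptions, and stride/residual handling is omitted):

```python
import torch
import torch.nn as nn

class UIB(nn.Module):
    """Universal Inverted Bottleneck sketch: two optional depthwise convs
    around a 1x1 expansion/projection unify IB, ConvNext, FFN, ExtraDW."""
    def __init__(self, c_in, c_out, expand=4, start_dw=False, mid_dw=True, k=3):
        super().__init__()
        c_mid = c_in * expand

        def dw(c):  # depthwise conv + BN + activation
            return nn.Sequential(
                nn.Conv2d(c, c, k, padding=k // 2, groups=c, bias=False),
                nn.BatchNorm2d(c), nn.ReLU())

        def pw(ci, co, act):  # pointwise 1x1 conv + BN (+ activation)
            layers = [nn.Conv2d(ci, co, 1, bias=False), nn.BatchNorm2d(co)]
            if act:
                layers.append(nn.ReLU())
            return nn.Sequential(*layers)

        stages = []
        if start_dw:
            stages.append(dw(c_in))             # optional leading depthwise
        stages.append(pw(c_in, c_mid, True))    # 1x1 expansion
        if mid_dw:
            stages.append(dw(c_mid))            # optional middle depthwise
        stages.append(pw(c_mid, c_out, False))  # 1x1 projection, no activation
        self.block = nn.Sequential(*stages)

    def forward(self, x):
        return self.block(x)

# The four variants as flag combinations (my reading of the paper):
ib       = UIB(32, 32, start_dw=False, mid_dw=True)   # classic Inverted Bottleneck
convnext = UIB(32, 32, start_dw=True,  mid_dw=False)  # ConvNext-like
ffn      = UIB(32, 32, start_dw=False, mid_dw=False)  # FFN (1x1 convs only)
extra_dw = UIB(32, 32, start_dw=True,  mid_dw=True)   # ExtraDW
print(extra_dw(torch.randn(1, 32, 56, 56)).shape)     # torch.Size([1, 32, 56, 56])
```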


From Fake to Real: Pretraining on Balanced Synthetic Images to Prevent Spurious Correlations in Image Recognition

https://eccv.ecva.net/virtual/2024/oral/498

https://www.ecva.net/papers/eccv_2024/papers_ECCV/papers/07557.pdf


Relation DETR: Exploring Explicit Position Relation Prior for Object Detection

https://eccv.ecva.net/virtual/2024/oral/502

์ด ๋‹ค์Œ์œผ๋กœ ๋ฐœํ‘œํ•˜์‹  ์—ฐ๊ตฌ ์ฃผ์ œ๊ฐ€ ํฅ๋ฏธ๋กœ์› ๋Š”๋ฐ, ์‹œ์ž‘ ์ „์— 'Still Me' ๋ผ๊ณ  ํ•˜์…”์„œ ์งฑ๋ฉ‹์กŒ๋‹ค. 

(๊ฐ™์€ ๋ถ„์ด ๋ฐœํ‘œ๋ฅผ ์—ฐ๋‹ฌ์•„์„œ ํ•˜์…จ๋‹ค.) 


Projecting Points to Axes: Oriented Object Detection via Point-Axis Representation

https://www.ecva.net/papers/eccv_2024/papers_ECCV/papers/04051.pdf

The idea of how to represent varying angles was interesting.

Identifying an object's key points -> object detection

Replacing x, y with d -> direction information

 

Abstract.

 This paper introduces the point-axis representation for oriented object detection, as depicted in aerial images in Figure 1, emphasizing its flexibility and geometrically intuitive nature with two key components: points and axes. 1) Points delineate the spatial extent and contours of objects, providing detailed shape descriptions. 2) Axes define the primary directionalities of objects, providing essential orientation cues crucial for precise detection. The point-axis representation decouples location and rotation, addressing the loss discontinuity issues commonly encountered in traditional bounding box-based approaches.

 For effective optimization without introducing additional annotations, we propose the max-projection loss to supervise point set learning and the cross-axis loss for robust axis representation learning. Further, leveraging this representation, we present the Oriented DETR model, seamlessly integrating the DETR framework for precise point-axis prediction and end-to-end detection. Experimental results demonstrate significant performance improvements in oriented object detection tasks.
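The max-projection loss is the part I'd most like to remember. My guess at its flavor (a hedged sketch, not the paper's exact formulation; the direction sampling, L1 comparison, and function names are all mine): along each sampled direction, the furthest extent of the predicted point set should match that of the ground-truth points.

```python
import torch
import torch.nn.functional as F

def support(points, directions):
    """Support values max_i <p_i, u> of a point set along each direction.
    points: (N, 2); directions: (K, 2) unit vectors -> (K,) extents."""
    return (points @ directions.T).max(dim=0).values

def max_projection_loss(pred_points, gt_points, num_dirs=16):
    """Match the supports of predicted vs. ground-truth point sets over K
    evenly sampled directions; equal convex hulls give zero loss."""
    theta = torch.arange(num_dirs) * (2 * torch.pi / num_dirs)
    dirs = torch.stack([torch.cos(theta), torch.sin(theta)], dim=1)
    return F.l1_loss(support(pred_points, dirs), support(gt_points, dirs))

pred = torch.rand(13, 2, requires_grad=True)                 # predicted point set (dummy)
gt = torch.tensor([[0., 0.], [2., 0.], [2., 1.], [0., 1.]])  # GT corners
loss = max_projection_loss(pred, gt)
loss.backward()  # differentiable w.r.t. the predicted points
```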


CLIFF: Continual Latent Diffusion for Open-Vocabulary Object Detection

https://eccv.ecva.net/virtual/2024/oral/488

  • As with Professor Gyeongmoon Park's research earlier, I think I would have understood this better if I had studied up on continual learning.

Abstract. Open-vocabulary object detection (OVD) utilizes image-level cues to expand the linguistic space of region proposals, thereby facilitating the detection of diverse novel classes. Recent works adapt CLIP embedding by minimizing the object-image and object-text discrepancy combinatorially in a discriminative paradigm. However, they ignore the underlying distribution and the disagreement between the image and text objective, leading to the misaligned distribution between the vision and language sub-space. To address the deficiency, we explore the advanced generative paradigm with distribution perception and propose a novel framework based on the diffusion model, coined Continual Latent Diffusion (CLIFF), which formulates a continual distribution transfer among the object, image, and text latent space probabilistically. CLIFF consists of a Variational Latent Sampler (VLS) enabling the probabilistic modeling and a Continual Diffusion Module (CDM) for the distribution transfer. Specifically, in VLS, we first establish a probabilistic object space with region proposals by estimating distribution parameters. Then, the object-centric noise is sampled from the estimated distribution to generate text embedding for OVD. To achieve this generation process, CDM conducts a short-distance object-to-image diffusion from the sampled noise to generate image embedding as the medium, which guides the long-distance diffusion to generate text embedding. Extensive experiments verify that CLIFF can significantly surpass state-of-the-art methods on benchmarks. The code is available at https://github.com/CUHK-AIM-Group/CLIFF.

 

https://github.com/CUHK-AIM-Group/CLIFF
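My rough reading of the pipeline (and I may well be off, which is exactly why I want to study continual learning): estimate a distribution for each proposal's object space, sample object-centric noise from it, then diffuse that sample first to an image embedding (short hop) and on to a text embedding (long hop). A toy sketch of just that data flow, with each diffusion chain stubbed out by a linear layer and every name and shape invented:

```python
import torch
import torch.nn as nn

D = 512  # assumed embedding width

class VLS(nn.Module):
    """Variational Latent Sampler stub: region feature -> (mu, sigma) -> sample."""
    def __init__(self):
        super().__init__()
        self.mu, self.log_var = nn.Linear(D, D), nn.Linear(D, D)

    def forward(self, region_feat):
        mu, log_var = self.mu(region_feat), self.log_var(region_feat)
        return mu + torch.randn_like(mu) * (0.5 * log_var).exp()  # reparameterize

# Stand-ins for the Continual Diffusion Module's two denoising chains.
obj_to_img = nn.Linear(D, D)  # short-distance: object noise -> image embedding
img_to_txt = nn.Linear(D, D)  # long-distance: image embedding -> text embedding

region_feat = torch.randn(8, D)  # 8 region proposals (dummy features)
obj_noise = VLS()(region_feat)   # sample from the probabilistic object space
img_emb = obj_to_img(obj_noise)  # image embedding as the medium
txt_emb = img_to_txt(img_emb)    # generated text embedding for OVD matching
```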

 

On Calibration of Object Detectors: Pitfalls, Evaluation and Baselines

 

Abstract. Reliable usage of object detectors require them to be calibrated - a crucial problem that requires careful attention. Recent approaches towards this involve (1) designing new loss functions to obtain calibrated detectors by training them from scratch, and (2) post-hoc Temperature Scaling (TS) that learns to scale the likelihood of a trained detector to output calibrated predictions. These approaches are then evaluated based on a combination of Detection Expected Calibration Error (D-ECE) and Average Precision. In this work, via extensive analysis and insights, we highlight that these recent evaluation frameworks, evaluation metrics, and the use of TS have notable drawbacks leading to incorrect conclusions. As a step towards fixing these issues, we propose a principled evaluation framework to jointly measure calibration and accuracy of object detectors. We also tailor efficient and easy-to-use post-hoc calibration approaches such as Platt Scaling and Isotonic Regression specifically for object detection task. Contrary to the common notion, our experiments show that once designed and evaluated properly, post-hoc calibrators, which are extremely cheap to build and use, are much more powerful and effective than the recent train-time calibration methods. To illustrate, D-DETR with our post-hoc Isotonic Regression calibrator outperforms the recent train-time state-of-the-art calibration method Cal-DETR by more than 7 D-ECE on the COCO dataset. Additionally, we propose improved versions of the recently proposed Localization-aware ECE and show the efficacy of our method on these metrics. Code is available at: https://github.com/fiveai/detection_calibration.

 

https://github.com/fiveai/detection_calibration
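Their point that post-hoc calibrators are "extremely cheap to build and use" is easy to see: on a held-out set of detections, both calibrators are a few lines. A generic sketch (not the paper's tailored pipeline; the TP/FP matching rule and the dummy data are assumptions):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

# Dummy held-out detections: a confidence score per detection, plus whether
# it matched a ground-truth box (e.g., IoU >= 0.5); both are fabricated here.
rng = np.random.default_rng(0)
scores = rng.random(1000)
is_tp = rng.random(1000) < scores  # stand-in for real TP/FP labels

# Isotonic Regression: monotone, piecewise-constant map score -> P(correct).
iso = IsotonicRegression(out_of_bounds="clip").fit(scores, is_tp)
calibrated_iso = iso.predict(scores)

# Platt Scaling: a logistic fit on the raw score.
platt = LogisticRegression().fit(scores.reshape(-1, 1), is_tp)
calibrated_platt = platt.predict_proba(scores.reshape(-1, 1))[:, 1]
```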

 

To make impactful slides, use thick lines.

์•ฝ๊ฐ„์˜ ์• ๋‹ˆ๋ฉ”์ด์…˜์œผ๋กœ ๋ชฐ์ž…์‹œํ‚ค๊ธฐ (๊ทธ๋ž˜ํ”„์—์„œ ์˜ฎ๊ฒจ๊ฐ€๊ธฐ) 

A loud voice · a confident posture

The speaker emphasized ease of use.

