

Poster

Exploring Region-Word Alignment in Built-in Detector for Open-Vocabulary Object Detection

Heng Zhang · Qiuyu Zhao · Linyu Zheng · Hao Zeng · Zhiwei Ge · Tianhao Li · Sulong Xu

Arch 4A-E Poster #230
Thu 20 Jun 5 p.m. PDT — 6:30 p.m. PDT

Abstract:

Open-vocabulary object detection aims to detect novel categories that are independent of the base categories used during training. Most modern methods adhere to the paradigm of learning a vision-language space from a large-scale multi-modal corpus and subsequently transferring the acquired knowledge to off-the-shelf detectors like Faster-RCNN. However, information attenuation or destruction may occur during knowledge transfer due to the domain gap, hampering generalization to novel categories. To mitigate this predicament, in this paper, we present a novel framework named BIND, standing for Built-IN Detector, which eliminates the need for module replacement or knowledge transfer to off-the-shelf detectors. Specifically, we design a two-stage training framework with an Encoder-Decoder structure. In the first stage, an image-text dual encoder is trained to learn region-word alignment from a corpus of image-text pairs. In the second stage, a DETR-style decoder is trained to perform detection on annotated object detection datasets. In contrast to conventional manually designed, non-adaptive anchors, which generate numerous redundant proposals, we develop an anchor proposal network that adaptively generates high-likelihood anchor proposals from candidates, thereby substantially improving detection efficiency. Experimental results on two public benchmarks, COCO and LVIS, demonstrate that our method is a state-of-the-art approach for open-vocabulary object detection. The code and models will be publicly available.
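The region-word alignment learned by the dual encoder in the first stage can be illustrated with a minimal NumPy sketch: region features and word embeddings are projected into a shared space, L2-normalized, and compared by temperature-scaled cosine similarity, with a softmax over the vocabulary giving per-region category scores. The function name, shapes, and temperature value below are illustrative assumptions, not details from the paper.

```python
import numpy as np

def region_word_scores(region_feats, word_feats, temperature=0.07):
    """Illustrative region-word alignment scores (not the paper's exact method).

    region_feats: (num_regions, dim) region embeddings from the image encoder.
    word_feats:   (num_words, dim) category/word embeddings from the text encoder.
    Returns a (num_regions, num_words) matrix of softmax-normalized scores.
    """
    # L2-normalize both sides so the dot product is cosine similarity.
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    w = word_feats / np.linalg.norm(word_feats, axis=1, keepdims=True)
    # Temperature-scaled similarity logits, one row per region.
    logits = (r @ w.T) / temperature
    # Numerically stable softmax over the word axis.
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)
```

At inference in an open-vocabulary setting, the word embeddings can come from arbitrary category names, so the same scoring applies to novel classes without retraining the detector.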
