Oral Session
Orals 5A: Datasets and Evaluation
Summit Ballroom
Deep Generative Model based Rate-Distortion for Image Downscaling Assessment
Yuanbang Liang · Bhavesh Garg · Paul L. Rosin · Yipeng Qin
In this paper, we propose Image Downscaling Assessment by Rate-Distortion (IDA-RD), a novel measure to quantitatively evaluate image downscaling algorithms. In contrast to image-based methods that measure the quality of downscaled images, ours is a process-based measure that draws on rate-distortion theory to quantify the distortion incurred during downscaling. Our main idea is that downscaling and super-resolution (SR) can be viewed as the encoding and decoding processes in the rate-distortion model, respectively, and that a downscaling algorithm that preserves more details in the resulting low-resolution (LR) images should lead to less distorted high-resolution (HR) images in SR. In other words, the distortion should increase as the downscaling algorithm deteriorates. However, measuring this distortion is non-trivial, as it requires the SR algorithm to be both blind and stochastic. Our key insight is that these requirements can be met by recent SR algorithms based on deep generative models, which can find all matching HR images for a given LR image on their learned image manifolds. Extensive experimental results show the effectiveness of our IDA-RD measure.
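A minimal sketch of how such a process-based score could be computed is given below. The `downscale`, `sr_sample`, and `distortion` callables are placeholders rather than the paper's actual generative SR model or distortion metric, and the toy example at the bottom only illustrates the interface; lower scores would indicate a downscaler that preserves more of the information needed to reconstruct the original HR image.

```python
import numpy as np

def ida_rd_score(hr_images, downscale, sr_sample, distortion, n_samples=8):
    """Hypothetical rate-distortion style score for a downscaling algorithm:
    downscale each HR image (encoding), draw several stochastic blind-SR
    reconstructions of the LR result (decoding), and average their distortion
    with respect to the original HR image."""
    scores = []
    for hr in hr_images:
        lr = downscale(hr)                                   # encoding step under test
        recons = [sr_sample(lr) for _ in range(n_samples)]   # stochastic decoding
        scores.append(np.mean([distortion(hr, r) for r in recons]))
    return float(np.mean(scores))

# Toy stand-ins (not the paper's models), shown only to exercise the interface:
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    toy_hr = [rng.random((64, 64, 3)) for _ in range(2)]
    naive_down = lambda x: x[::4, ::4]                        # 4x nearest-neighbour downscale
    toy_sr = lambda lr: np.kron(lr, np.ones((4, 4, 1)))       # deterministic upsample stand-in
    mse = lambda a, b: float(np.mean((a - b) ** 2))
    print(ida_rd_score(toy_hr, naive_down, toy_sr, mse))
```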
360+x: A Panoptic Multi-modal Scene Understanding Dataset
Hao Chen · Yuqi Hou · Chenyuan Qu · Irene Testini · Xiaohan Hong · Jianbo Jiao
Human perception of the world is shaped by a multitude of viewpoints and modalities. While many existing datasets focus on scene understanding from a certain perspective (e.g. egocentric or third-person views), our dataset offers a panoptic perspective (i.e. multiple viewpoints with multiple data modalities). Specifically, we encapsulate third-person panoramic and front views, as well as egocentric monocular/binocular views, with rich modalities including video, multi-channel audio, directional audio, location data, and textual scene descriptions for each captured scene, presenting a comprehensive observation of the world. To the best of our knowledge, this is the first database that covers multiple viewpoints with multiple data modalities to mimic how daily information is accessed in the real world. Through our benchmark analysis, we present five different scene understanding tasks on the proposed 360+x dataset to evaluate the impact and benefit of each data modality and perspective. Extensive experimental analysis reveals the effectiveness of each data modality and perspective in enhancing panoptic scene understanding. We hope this unique dataset can broaden the scope of comprehensive scene understanding and encourage the community to approach these problems from more diverse perspectives.
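As an illustration of how one scene's multi-viewpoint, multi-modality record could be organized when working with data of this kind, here is a hedged Python sketch; the field names are invented for this example and do not reflect the dataset's actual file layout or schema.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class SceneCapture:
    """Hypothetical per-scene record mirroring the modalities listed in the
    abstract; illustrative only, not the dataset's real schema."""
    scene_id: str
    panoramic_video: str          # third-person 360-degree view
    front_view_video: str         # third-person front view
    egocentric_videos: List[str]  # monocular/binocular first-person streams
    multichannel_audio: str
    directional_audio: str
    location: Dict[str, float]    # e.g. {"lat": ..., "lon": ...}
    text_description: str = ""

def modalities_present(capture: SceneCapture) -> List[str]:
    """List which modality fields of a capture are populated."""
    return [name for name, value in vars(capture).items()
            if name != "scene_id" and value]
```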
Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives
Kristen Grauman · Andrew Westbury · Lorenzo Torresani · Kris Kitani · Jitendra Malik · Triantafyllos Afouras · Kumar Ashutosh · Vijay Baiyya · Siddhant Bansal · Bikram Boote · Eugene Byrne · Zachary Chavis · Joya Chen · Feng Cheng · Fu-Jen Chu · Sean Crane · Avijit Dasgupta · Jing Dong · Maria Escobar · Cristhian David Forigua Diaz · Abrham Gebreselasie · Sanjay Haresh · Jing Huang · Md Mohaiminul Islam · Suyog Jain · Rawal Khirodkar · Devansh Kukreja · Kevin Liang · Jia-Wei Liu · Sagnik Majumder · Yongsen Mao · Miguel Martin · Effrosyni Mavroudi · Tushar Nagarajan · Francesco Ragusa · Santhosh Kumar Ramakrishnan · Luigi Seminara · Arjun Somayazulu · Yale Song · Shan Su · Zihui Xue · Edward Zhang · Jinxu Zhang · Angela Castillo · Changan Chen · Fu Xinzhu · Ryosuke Furuta · Cristina González · Gupta · Jiabo Hu · Yifei Huang · Yiming Huang · Weslie Khoo · Anush Kumar · Robert Kuo · Sach Lakhavani · Miao Liu · Mi Luo · Zhengyi Luo · Brighid Meredith · Austin Miller · Oluwatumininu Oguntola · Xiaqing Pan · Penny Peng · Shraman Pramanick · Merey Ramazanova · Fiona Ryan · Wei Shan · Kiran Somasundaram · Chenan Song · Audrey Southerland · Masatoshi Tateno · Huiyu Wang · Yuchen Wang · Takuma Yagi · Mingfei Yan · Xitong Yang · Zecheng Yu · Shengxin Zha · Chen Zhao · Ziwei Zhao · Zhifan Zhu · Jeff Zhuo · Pablo ARBELAEZ · Gedas Bertasius · Dima Damen · Jakob Engel · Giovanni Maria Farinella · Antonino Furnari · Bernard Ghanem · Judy Hoffman · C.V. Jawahar · Richard Newcombe · Hyun Soo Park · James Rehg · Yoichi Sato · Manolis Savva · Jianbo Shi · Mike Zheng Shou · Michael Wray
We present Ego-Exo4D, a diverse, large-scale multimodal, multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dance, bike repair). More than 800 participants from 13 cities worldwide performed these activities in 131 different natural scene contexts, yielding long-form captures from 1 to 42 minutes each and 1,422 hours of video combined. The multimodal nature of the dataset is unprecedented: the video is accompanied by multichannel audio, eye gaze, 3D point clouds, camera poses, IMU, and multiple paired language descriptions, including a novel "expert commentary" done by coaches and teachers and tailored to the skilled-activity domain. To push the frontier of first-person video understanding of skilled human activity, we also present a suite of benchmark tasks and their annotations, including fine-grained activity understanding, proficiency estimation, cross-view translation, and 3D hand/body pose. All resources will be open sourced to fuel new research in the community.
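Because the egocentric and exocentric streams are captured simultaneously, downstream code typically needs to map a shared timestamp to frame indices in each view. The sketch below shows that mapping in its simplest form; the `Clip` record and its fields are hypothetical, and the dataset's own synchronization metadata should be preferred where available.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Clip:
    camera: str        # e.g. "ego" or "exo_1", "exo_2", ...
    start_s: float     # clip start time within the shared take, in seconds
    fps: float
    video_path: str

def paired_frame_indices(ego: Clip, exo: Clip, t_s: float) -> Tuple[int, int]:
    """Map one shared time t (seconds) of a simultaneously captured take to
    frame indices in an ego clip and an exo clip. Illustrative only."""
    ego_idx = round((t_s - ego.start_s) * ego.fps)
    exo_idx = round((t_s - exo.start_s) * exo.fps)
    return ego_idx, exo_idx
```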
Rich Human Feedback for Text-to-Image Generation
Youwei Liang · Junfeng He · Gang Li · Peizhao Li · Arseniy Klimovskiy · Nicholas Carolan · Jiao Sun · Jordi Pont-Tuset · Sarah Young · Feng Yang · Junjie Ke · Krishnamurthy Dvijotham · Katherine Collins · Yiwen Luo · Yang Li · Kai Kohlhoff · Deepak Ramachandran · Vidhya Navalpakkam
Recent Text-to-Image (T2I) generation models such as Stable Diffusion and Imagen have made significant progress in generating high-resolution images based on text descriptions. However, many generated images still suffer from issues such as artifacts/implausibility, misalignment with text descriptions, and low aesthetic quality. Inspired by the success of Reinforcement Learning with Human Feedback (RLHF) for large language models, prior works collected human-provided scores as feedback on generated images and trained a reward model to improve T2I generation. In this paper, we enrich the feedback signal by (i) marking image regions that are implausible or misaligned with the text, and (ii) annotating which words in the text prompt are misrepresented or missing in the image. We collect such rich human feedback on 18K generated images (RichHF-18K) and train a multimodal transformer to predict the rich feedback automatically. We show that the predicted rich human feedback can be leveraged to improve image generation, for example, by selecting high-quality training data to finetune and improve the generative models, or by creating masks from predicted heatmaps to inpaint problematic regions. Notably, the improvements generalize to models (Muse) beyond those used to generate the images on which the human feedback data were collected (Stable Diffusion variants). The RichHF-18K dataset will be released in our GitHub repository: https://github.com/google-research/google-research/tree/master/richhf_18k.
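One of the described uses of the predicted feedback, turning a region-level heatmap into an inpainting mask, can be sketched as follows. This is an illustrative approximation: the threshold, the dilation radius, and the downstream inpainting step are assumptions for this example, not details taken from the paper.

```python
import numpy as np

def heatmap_to_inpaint_mask(heatmap: np.ndarray, threshold: float = 0.5,
                            dilate_px: int = 4) -> np.ndarray:
    """Convert a predicted per-pixel implausibility heatmap (values in [0, 1])
    into a binary mask marking regions to regenerate with an inpainting model.
    Threshold and dilation radius are illustrative choices."""
    mask = heatmap >= threshold
    if dilate_px > 0:
        h, w = mask.shape
        dilated = np.zeros_like(mask)
        for y, x in zip(*np.nonzero(mask)):
            y0, y1 = max(0, y - dilate_px), min(h, y + dilate_px + 1)
            x0, x1 = max(0, x - dilate_px), min(w, x + dilate_px + 1)
            dilated[y0:y1, x0:x1] = True
        mask = dilated
    return mask.astype(np.uint8)

# The resulting mask (1 = regenerate) could then be passed, together with the
# original image and prompt, to any off-the-shelf inpainting pipeline.
```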
BioCLIP: A Vision Foundation Model for the Tree of Life
Samuel Stevens · Jiaman Wu · Matthew Thompson · Elizabeth Campolongo · Chan Hee Song · David Carlyn · Li Dong · Wasila Dahdul · Charles Stewart · Tanya Berger-Wolf · Wei-Lun Chao · Yu Su
Images of the natural world, collected by a variety of cameras from drones to individual phones, are increasingly abundant sources of biological information. There is an explosion of computational methods and tools, particularly in computer vision, for extracting biologically relevant information from images for science and conservation. Yet most of these are bespoke approaches designed for a specific task and are not easily adaptable or extendable to new questions, contexts, and datasets. There is a timely need for a vision model that can address general organismal biology questions on images. To approach this, we curate and release TreeOfLife-10M, the largest and most diverse ML-ready dataset of biology images. We then develop BioCLIP, a foundation model for the tree of life, leveraging the unique properties of biology captured by TreeOfLife-10M, namely the abundance and variety of images of plants, animals, and fungi, together with the availability of rich structured biological knowledge. We rigorously benchmark our approach on diverse fine-grained biology classification tasks and find that BioCLIP consistently and substantially outperforms existing baselines (by 17% to 20% absolute). Intrinsic evaluation reveals that BioCLIP has learned a hierarchical representation conforming to the tree of life, shedding light on its strong generalizability. All data, code, and models will be publicly released upon acceptance.
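To make the CLIP-style usage concrete, here is a hedged zero-shot classification sketch built on the open_clip library, with a generic OpenAI ViT-B-16 checkpoint as a stand-in since the BioCLIP weights are described as forthcoming; the taxonomic prompt format, the label set, and the image path are illustrative assumptions, not artifacts from the paper.

```python
import torch
import open_clip
from PIL import Image

# Any CLIP-style checkpoint can stand in here; swap in the BioCLIP weights
# once they are released.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-16", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-16")
model.eval()

# Candidate labels written as full taxonomic strings (illustrative format,
# chosen to reflect the tree-of-life structure the abstract describes).
labels = [
    "Animalia Chordata Aves Passeriformes Corvidae Corvus corax",
    "Plantae Tracheophyta Magnoliopsida Fagales Fagaceae Quercus robur",
    "Fungi Basidiomycota Agaricomycetes Agaricales Amanitaceae Amanita muscaria",
]

image = preprocess(Image.open("specimen.jpg")).unsqueeze(0)  # placeholder path
text = tokenizer(labels)

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(text)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1)

print({label: float(p) for label, p in zip(labels, probs[0])})
```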