[ West 207 ]
The goal of this workshop is to foster research on the next generation of visual perception systems that reason over label spaces going beyond a list of simple category names. Modern applications of computer vision require systems that understand a full spectrum of labels, from plain category names (“person” or “cat”), through descriptions that modify them with attributes, actions, functions or relations (“woman with yellow handbag”, “parked cars”, or “edible item”), to specific referring descriptions (“the man in the white hat walking next to the fire hydrant”). Natural language is a promising direction not only to enable such complex label spaces, but also to train such models from multiple datasets with different, and potentially conflicting, label spaces. Besides an excellent list of invited speakers from both academia and industry, the workshop will present the results of the OmniLabel challenge, held on our newly collected benchmark dataset that subsumes generic object detection, open-vocabulary detection, and referring expression comprehension into one unified and challenging task.
[ West 217 - 219 ]
The Workshop on Fair, Data Efficient and Trusted Computer Vision will address critical issues in enhancing user trust in AI and computer vision systems, namely (i) fairness, (ii) data-efficient learning, and key aspects of trust, including (iii) explainability, (iv) robust mitigation of adversarial attacks, and (v) improved privacy and security in model building, with the right level of credit assignment to data sources and transparency in lineage.
[ West 116 ]
The purpose of this workshop is to challenge the computer vision community to develop robust image captioning models that advance the state-of-the-art both in terms of accuracy and fairness (i.e. mitigating societal biases). Both of these issues must be addressed fully before image captioning technology can be reliably deployed in a large-scale setting.
The workshop will focus on testing the true limits of image captioning models under the zero-shot image captioning setting. It aims to challenge the models by providing a large-scale evaluation dataset that includes a larger variety of visual concepts from many domains (including new concepts such as COVID-19) as well as various image types (photographs, illustrations, graphics). To accomplish this task, models need to broadly understand language-vision relations and also learn how to combine language components for new concepts and image types. Before the workshop, a challenge on zero-shot image captioning will be held, and the results will be shared at the workshop. By evaluating models only on this limited evaluation dataset, the submitted models will be challenged to understand new concepts and unseen environments.
Throughout the workshop and challenge, we will cover a broad range of topics on understanding language and image together, so that …
[ East 17 ]
Content moderation (CM) is a rapidly growing need in today’s industry, with a high societal impact, where automated CM systems can discover discrimination, violent acts, hate/toxicity, and much more, across a variety of signals (visual, text/OCR, speech, audio, language, generated content, etc.). Leaving or providing unsafe content on social platforms and devices can cause a variety of harmful consequences, including brand damage to institutions and public figures, erosion of trust in science and government, marginalization of minorities, geo-political conflicts, suicidal thoughts and more. Besides user-generated content, content generated by powerful AI models such as DALL-E and GPT presents additional challenges to CM systems.
With the prevalence of multimedia social networking and online gaming, the problem of sensitive content detection and moderation is by nature multimodal. The Hateful memes dataset [1] highlights the multimodal nature of content moderation, for example, an image of a skunk and a sentence “you smell good” are benign/neutral separately, but can be hateful when interpreted together. Another aspect is the complementary nature of multimodal analysis where there may be ambiguity in interpreting individual modalities separately. Moreover, content moderation is contextual and culturally multifaceted, for example, different cultures have different conventions about gestures. This requires CM approach …
[ West 103 - 104 ]
Computer vision technologies like generative image models are rapidly being integrated into creative domains to, for example, aid in artistic content retrieval and curation, generate synthetic media, or enable new forms of artistic methods and creations. However, creative AI technologies bring with them a host of ethical concerns, ranging from representational harms associated with culturally sensitive matter to impacts on artistic practices and copyright and ownership concerns. In particular, it is unclear what kinds of performance failures and biases these models bring when deployed in cross-cultural and non-Western settings.
We encourage retrospective discussions and position papers examining the cross-cultural and social impacts of creative applications of computer vision, as well as ethical considerations in this domain, including but not limited to artwork attribution, inequity in cultural performance, cultural appropriation, environmental impacts of generative arts, biases embedded in generative arts, dynamics of art marketplaces/platforms, and policy perspectives on creative AI.
Our aim is to create a platform for interdisciplinary discussions on these issues among computer vision researchers, socio-technical researchers, policy makers, social scientists, artists, and other cultural stakeholders. This year our Generative Art Demo will invite artists to use computer vision technologies to create art pieces that center questions and topics of cultural significance and create …
[ East 3 ]
[ East 1 ]
The CVPR MCV workshop provides a unique forum for researchers and developers in academia, industry and healthcare to present, discuss and learn about cutting-edge advances in machine learning and computer vision for medical image analysis and computer assisted interventions. The workshop offers a venue for potential new collaborative efforts, encouraging more dataset and information exchanges for important clinical applications.
The ultimate goal of the MCV workshop is to bring together stakeholders interested in leveraging medical imaging data, machine learning and computer vision algorithms to build the next generation of tools and products to advance image-based healthcare. It is time to deliver!
The program features invited talks from leading researchers in academia and industry as well as clinicians. There will be no paper submissions at this year's workshop.
[ West 115 ]
The 5th International Workshop on Gaze Estimation and Prediction in the Wild (GAZE 2023) at CVPR 2023 aims to encourage and highlight novel strategies for eye gaze estimation and prediction with a focus on robustness and accuracy in extended parameter spaces, both spatially and temporally. This is expected to be achieved by applying novel neural network architectures, incorporating anatomical insights and constraints, introducing new and challenging datasets, and exploiting multi-modal training. Specifically, the workshop topics include (but are not limited to):
- Reformulating eye detection, gaze estimation, and gaze prediction pipelines with deep networks.
- Incorporating geometric and anatomical constraints into the training of (sparse or dense) deep networks.
- Leveraging additional cues such as context from the face region and head pose information.
- Developing adversarial methods to deal with conditions where current methods fail (illumination, appearance, etc.).
- Exploring attention mechanisms to predict the point of regard.
- Designing new accurate measures to account for rapid eye gaze movement.
- Novel methods for temporal gaze estimation and prediction including Bayesian methods.
- Integrating differentiable components into 3D gaze estimation frameworks.
- Robust estimation from different data modalities such as RGB, depth, head pose, and eye region landmarks.
- Generic …
[ West 121 - 122 ]
[ West 208 - 209 ]
Monocular depth estimation (MDE) is an important low-level vision task, with applications in fields such as augmented reality, robotics and autonomous vehicles. Recently, there has been increased interest in self-supervised systems capable of predicting the 3D scene structure without requiring ground-truth LiDAR training data. Automotive data has accelerated the development of these systems, thanks to the vast quantities of data, the ubiquity of stereo camera rigs and the mostly static world. However, the evaluation process has also remained focused on only the automotive domain and has been largely unchanged since its inception, relying on simple metrics and sparse LiDAR data.
This workshop seeks to answer the following questions:
1. How well do networks generalize beyond their training distribution relative to humans?
2. What metrics provide the most insight into the model’s performance? What is the relative weight of simple cues, e.g. height in the image, in networks and humans?
3. How do the predictions made by the models differ from how humans perceive depth? Are the failure modes the same?
The workshop will therefore consist of two parts: invited keynote talks discussing current developments in MDE and a challenge organized around a novel benchmarking procedure using the SYNS dataset.
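The simple metrics mentioned above typically reduce to a handful of per-image error measures computed against sparse ground-truth depth. Below is a minimal sketch of two of the most common ones (absolute relative error and RMSE), assuming `pred` and `gt` are aligned depth maps in metres and that invalid LiDAR returns are masked out; the function name and depth range are illustrative assumptions, not the workshop's official evaluation code.

```python
import numpy as np

def depth_metrics(pred, gt, min_depth=1e-3, max_depth=80.0):
    """Compute common monocular-depth error metrics on valid pixels only.

    pred, gt : np.ndarray of the same shape, depth in metres.
    Returns a dict with absolute relative error and RMSE.
    """
    # Valid pixels: finite ground truth inside the evaluation range
    # (sparse LiDAR leaves most pixels without ground truth).
    mask = (gt > min_depth) & (gt < max_depth)
    pred, gt = pred[mask], gt[mask]

    abs_rel = np.mean(np.abs(pred - gt) / gt)   # relative error, scale-sensitive
    rmse = np.sqrt(np.mean((pred - gt) ** 2))   # root-mean-square error in metres
    return {"abs_rel": float(abs_rel), "rmse": float(rmse)}
```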
[ East 2 ]
Incorporating new knowledge into existing models to adapt to novel problems is a fundamental challenge of computer vision. Humans and animals continuously assimilate new experiences to survive in new environments and to improve in situations already encountered in the past. Moreover, while current computer vision models have to be trained on independent and identically distributed data, biological systems incrementally learn from non-stationary data distributions. This ability to learn from continuous streams of data, without interfering with previously acquired knowledge and while exhibiting positive transfer, is called Continual Learning. The CVPR Workshop on “Continual Learning in Computer Vision” (CLVision) aims to gather researchers and engineers from academia and industry to discuss the latest advances in Continual Learning. The workshop features regular paper presentations, invited speakers, and technical benchmark challenges to present the current state of the art, as well as the limitations and future directions for Continual Learning, arguably one of the most challenging milestones of AI.
[ East 16 ]
[ West 210 ]
Fine-grained categorization, the precise differentiation between similar plant or animal species, diseases of the retina, architectural styles, etc., is an extremely challenging problem, pushing the limits of both human and machine ability. In these domains expert knowledge is typically required, and the question that must be addressed is how we can develop systems that can efficiently discriminate between large numbers of highly similar visual concepts. The 10th Workshop on Fine-Grained Visual Categorization (FGVC10) explores topics related to supervised learning, self-supervised learning, semi-supervised learning, matching, localization, domain adaptation, transfer learning, few-shot learning, machine teaching, multimodal learning (e.g., audio and video), 3D vision, crowd-sourcing, image captioning and generation, out-of-distribution detection, open-set recognition, human-in-the-loop learning, etc., all through the lens of fine-grained understanding. Topics relevant for FGVC10 are neither restricted to vision nor categorization. FGVC10 consists of invited talks from world-renowned computer vision experts and domain experts (e.g., art), poster sessions, challenges, and peer-reviewed extended abstracts. To mark FGVC’s 10th anniversary, we have confirmed five panellists for a discussion of the history and future of FGVC. We aim to stimulate debate and to expose the wider computer vision community to new challenging problems which have the potential for large societal …
[ West 111 - 112 ]
[ West 110 ]
End-to-end autonomous driving, a relatively new paradigm (compared to the modular design) yet one with great potential, has already attracted attention from both academia and industry. This workshop offers a brand-new perspective for discussing broad areas of end-to-end framework design for autonomous driving from a system-level point of view. Central to the program is a series of invited talks and four new challenges in the self-driving domain. Each challenge combines new perspectives on multiple components in perception and planning compared to conventional pipelines.
[ East 19 - 20 ]
[ West 220 - 222 ]
[ East 7 ]
[ East 11 ]
[ West 223 - 224 ]
The exploitation of the power of big data in the last few years has led to a big step forward in many applications of computer vision. However, most of the tasks tackled so far involve the visual modality only, mainly due to the unbalanced number of labelled samples available across modalities (e.g., there are many huge labelled datasets for images but not as many for audio- or IMU-based classification), resulting in a huge gap in performance when algorithms are trained separately.
Recently, a few works have started to exploit the synchronization of multimodal streams (e.g., audio/video, RGB/depth, RGB/LiDAR, visual/text, text/audio) to transfer semantic information from one modality to another, reaching surprising results. Interesting applications have also been proposed in a self-supervised fashion, where multiple modalities learn correspondences without the need for manual labelling, resulting in a more powerful set of features compared to those learned by processing the two modalities separately. Other works have also shown that particular training paradigms allow neural networks to perform well when one of the modalities is missing due to sensor failure or unfavorable environmental conditions. These topics have gained considerable interest in the computer vision community in recent years.
The information fusion from multiple sensors …
[ East Ballroom C ]
The CVPR 2023 Workshop on Autonomous Driving (WAD) aims to gather researchers and engineers from academia and industry to discuss the latest advances in perception for autonomous driving. In this full-day workshop, we will host speakers as well as technical benchmark challenges to present the current state of the art, limitations and future directions in the field - arguably one of the most promising applications of computer vision and artificial intelligence. Previous editions of this workshop attracted hundreds of researchers. This year, multiple industry sponsors are also joining our organizing efforts to push it to a new level.
[ East 9 ]
Recent years have seen the stunning powers of Visual Language Pre-training (VLP) models. Although VLPs have revolutionized some fundamental principles of visual language reasoning (VLR), several remaining problems prevent them from “thinking” like a human being: how to reason about the world by breaking it into parts (compositionality), how to generalize to novel concepts given only a glimpse of in-context demonstrations (prompts), and how to debias visual language reasoning by imagining what would have happened in counterfactual scenarios (causality).
The workshop provides the opportunity to gather researchers from different fields to review technology trends along these three lines and to better endow VLPs with these reasoning abilities. The workshop also features two multimodal reasoning challenges on cross-modal math word calculation and proving problems. The challenges are practical and closely tied to these issues, thereby shedding more insight into the new frontiers of visual language reasoning.
[ East 4 ]
Many biological organisms have evolved to exhibit diverse quintessential behaviors via physical and social interactions with their surroundings, and understanding these behaviors is a fundamental goal of multiple disciplines including neuroscience, biology, medicine, behavior science, and sociology. For example, ethogramming characterizes behavioral states and their transitions, which further provides a scientific basis to understand innate human behaviors, e.g., decision-making, attention, and group behaviors. These analyses require objective, repeatable, and scalable measurements of animal behaviors that are not possible with existing methodologies that rely on manual encoding by animal experts and specialists. Recently, computer vision has been making a groundbreaking impact by providing a new tool that enables computational measurements of these behaviors.
The workshop offers invited talks, orals, and poster sessions by leading scientists in the field, coming from computer vision, neuroscience, and biology. Our webpage lists the full schedule, accepted papers, and posters.
[ East Exhibit Hall A ]
[ West 302 - 305 ]
[ East 12 ]
With recent advances in AR/VR, a wider range of applications has been emerging, such as virtual touring, Building Information Modeling (BIM), e.g., floorplan generation, and holistic 3D understanding. Such applications have attracted a lot of interest from both academia and industry and motivated substantial investment in the form of dataset collection, research, publications and products. A few recent examples of such datasets are the Zillow Indoor Dataset (ZInD), Apple’s ARKit Scenes dataset and Facebook’s Habitat-Matterport dataset. The size and unique type of annotations provided by each of these datasets offer a huge opportunity for CV/ML researchers to focus on different aspects of scene and environment understanding beyond what was possible before.
Motivated by the recent release of datasets such as the Zillow Indoor Dataset (ZInD), Apple's ARKit Scenes dataset and Facebook's Habitat-Matterport dataset, in this workshop we would like to bring industry and academia together and encourage both to focus on specific underexplored aspects of environment understanding. We encourage researchers to go beyond "scene understanding" and explore "environment understanding" with a focus on understanding structure through tasks such as 2D/3D room layout estimation, understanding the relation of "rooms" for floorplan generation, localization of media within rooms and floorplans, …
[ East 13 ]
[ West 207 ]
Vision-based detection and recognition studies have recently achieved highly accurate performance and have been able to bridge the gap between research and real-world applications. Beyond these well-explored detection and recognition capabilities of modern algorithms, vision-based forecasting will likely be one of the next big research topics in the field of computer vision. Vision-based prediction is one of the critical capabilities of humans, and the potential success of automatic vision-based forecasting will empower and unlock human-like capabilities in machines and robots.
One important application is in autonomous driving technologies, where a vision-based understanding of a traffic scene and prediction of the movement of traffic actors is a critical piece of the autonomous puzzle. Various sensors such as cameras and lidar are used as the "eyes" of a vehicle, and advanced vision-based algorithms are required to allow safe and effective driving. Another area where vision-based prediction is used is the medical domain, allowing deep understanding and prediction of future medical conditions of patients. However, despite its potential and relevance for real-world applications, visual forecasting or precognition has not been the focus of new theoretical studies and practical applications as much as detection and recognition problems.
Through the organization of this workshop, we …
[ East Ballroom B ]
Creative domains make up a big part of modern society and have a strong influence on the economy and cultural life. Much effort within creative domains, such as fashion, art and design, centers around the creation, consumption, manipulation and analytics of visual content. In recent years, there has been an explosion of research in applying machine learning and computer vision algorithms to various aspects of the creative domains. For four years in a row, the CVFAD workshop series has captured important trends and new ideas in this area. At CVPR 2023, we will continue to bring together artists, designers, and computer vision researchers and engineers. We will keep growing the workshop itself to be a space for conversations and idea exchanges at the intersection of computer vision and creative applications.
[ West 111 - 112 ]
The goal of this workshop is to gather researchers, students, and advocates who work at the intersection of accessibility, computer vision, and autonomous and intelligent systems. In particular, we plan to use the workshop to identify challenges and pursue solutions for the current lack of shared and principled development tools for vision-based accessibility systems. For instance, there is a general lack of vision-based benchmarks and methods relevant to accessibility (e.g., people using mobility aids are currently mostly absent from large-scale datasets in pedestrian detection). Towards building a community of accessibility-oriented research in computer vision conferences, we also introduce a large-scale fine-grained computer vision challenge. The challenge involves visual recognition tasks relevant to individuals with disabilities. We aim to use the challenge to uncover research opportunities and spark the interest of computer vision and AI researchers working on more robust and broadly usable visual reasoning models in the future. An interdisciplinary panel of speakers will further provide an opportunity for fostering a mutual discussion between accessibility, computer vision, and robotics researchers and practitioners.
[ West 212 ]
[ West 301 ]
Pixel-level Scene Understanding is one of the fundamental problems in computer vision, which aims at recognizing object classes, masks and semantics of each pixel in a given image. Since the real world is video-based rather than static, learning to perform video semantic/panoptic segmentation is more reasonable and practical for realistic applications. To advance the semantic/panoptic segmentation task from images to videos, we present two large-scale datasets (VSPW and VIPSeg) and a competition in this workshop, aiming at the challenging yet practical task of Pixel-level Video Understanding in the Wild (PVUW).
[ West 215 - 216 ]
4D light fields can capture both the intensity and the directions of light rays, and record 3D geometry in a convenient and efficient manner. In the past few years, various areas of research have tried to use light fields to obtain superior performance by exploiting their internal structure information. Light fields have been widely used with remarkable results in applications such as depth estimation and super-resolution, while attempts in other applications such as object detection and semantic segmentation are still at a preliminary stage due to the lack of corresponding datasets and the incompatibility between redundant context information and limited memory. Meanwhile, as more and more novel and powerful technologies such as Neural Radiance Fields and Multiplane Images are introduced into computer vision, there will be plenty of opportunities and challenges in incorporating them with light fields. To this end, this workshop focuses on two brand-new topics. The first is to introduce the light field into more application areas, break through the bottleneck between rich structural information and limited memory, and achieve stable performance. The second is to explore how to introduce emerging technologies from other research fields into light fields to create new technological effects and drive competition. Besides, this workshop also hosts competitions …
[ East 14 ]
[ East 10 ]
High-throughput microscopy enables researchers to acquire thousands of images automatically over a matter of hours. This makes it possible to conduct large-scale, image-based experiments for biological discovery. The main challenge and bottleneck in such experiments is the conversion of “big visual data” into interpretable information and hence discoveries. Visual analysis of large-scale image data is a daunting task. Cells need to be located and their phenotype (e.g., shape) described. The behaviors of cell components, cells, or groups of cells need to be analyzed. The cell lineage needs to be traced. Not only do computers have more “stamina” than human annotators for such tasks, they also perform analysis that is more reproducible and less subjective. The post-acquisition component of high-throughput microscopy experiments calls for effective and efficient computer vision techniques.
This workshop will bring together computer vision experts from academia, industry, and government who have made progress in developing computer vision tools for microscopy image analysis. It will provide a comprehensive forum on this topic and foster in-depth discussion of technical and application issues as well as cross-disciplinary collaboration. It will also serve as an introduction to researchers and students curious about this important and fertile field.
[ West 212 ]
[ Virtual (AM); West 114 - 115 (PM) ]
Over the past years, mobile AI-based applications have become more and more ubiquitous. Various deep learning models can now be found on any mobile device, from smartphones running portrait segmentation, image enhancement, face recognition and natural language processing models, to smart-TV boards coming with sophisticated image super-resolution algorithms. The performance of mobile NPUs and DSPs is also increasing dramatically, making it possible to run complex deep learning models and to achieve fast runtime in the majority of tasks.
While many research works targeted at efficient deep learning models have been proposed recently, the evaluation of the obtained solutions is usually happening on desktop CPUs and GPUs, making it nearly impossible to estimate the actual inference time and memory consumption on real mobile hardware. To address this problem, we introduce the first Mobile AI Workshop, where all deep learning solutions are developed for and evaluated on mobile devices.
Due to the performance of the latest-generation mobile AI hardware, the topics considered in this workshop will go beyond simple classification tasks and will include such challenging problems as image denoising, HDR photography, accurate depth estimation, learned image ISP pipelines, and real-time image and video super-resolution. All information about the challenges, papers, …
[ West 209 ]
This workshop is dedicated to event-based cameras, smart cameras, and algorithms processing data from these sensors. Event-based cameras are bio-inspired sensors with the key advantages of microsecond temporal resolution, low latency, very high dynamic range, and low power consumption. Because of these advantages, event-based cameras open frontiers that are unthinkable with standard frame-based cameras (which have been the main sensing technology for the past 60 years). These revolutionary sensors enable the design of a new class of algorithms to track a baseball in the moonlight, build a flying robot with the agility of a bee, and perform structure from motion in challenging lighting conditions and at remarkable speeds. These sensors became commercially available in 2008 and are slowly being adopted in computer vision and robotics. In recent years they have received attention from large companies, e.g., the event-sensor company Prophesee collaborated with Intel and Bosch on a high spatial resolution sensor, Samsung announced mass production of a sensor to be used on hand-held devices, and they have been used in various applications on neuromorphic chips such as IBM’s TrueNorth and Intel’s Loihi. The workshop also considers novel vision sensors, such as pixel processor arrays (PPAs), which perform massively parallel processing …
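Unlike frame-based sensors, an event camera outputs an asynchronous stream of events, each roughly a tuple of pixel location, timestamp and polarity. As a hedged illustration of how such a stream is often turned into an input that a conventional vision pipeline can consume, the sketch below accumulates events into a simple two-channel event frame; the array layout is an assumption for illustration, not the format of any specific camera SDK.

```python
import numpy as np

def events_to_frame(events, height, width):
    """Accumulate an event stream into a 2-channel count image.

    events : np.ndarray of shape (N, 4) with columns (x, y, t, polarity),
             polarity in {-1, +1}. Layout is illustrative only.
    Returns an array of shape (2, height, width): channel 0 counts
    positive events, channel 1 counts negative events per pixel.
    """
    frame = np.zeros((2, height, width), dtype=np.float32)
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    pos = events[:, 3] > 0
    # np.add.at performs unbuffered accumulation, so repeated pixels add up.
    np.add.at(frame[0], (y[pos], x[pos]), 1.0)
    np.add.at(frame[1], (y[~pos], x[~pos]), 1.0)
    return frame
```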
[ West 306 ]
The workshop has the unique aspect of fostering cross-pollination between different disciplines, bringing together experts (from academia and industry) and researchers in computer vision and pattern recognition, AI, machine learning, HCI, multimedia, robotics and psychology. The diversity of human behavior, the richness of multi-modal data that arises from its analysis, and the multitude of applications that demand rapid progress in this area ensure that our event provides a timely and relevant discussion and dissemination platform.
The workshop includes keynote talks from Prof. Gunes and Prof. Lapedriza, as well as presentations from experts and researchers within academia and industry on topics related to affective computing and behavior analysis.
The detailed agenda of the workshop can be found on the workshop's website.
[ West 205 - 206 ]
The half-day Women in Computer Vision (WiCV) workshop is a gathering for researchers of all genders and career stages. All are welcome and encouraged to attend the workshop. Topics span a wide range of areas, including object recognition, image understanding, video analysis, 3D reconstruction, etc.
Virtual Poster Session from 12:15 - 1:00 pm at https://topia.io/wicvcvpr2023
[ West 208 ]
The VISION workshop aims to provide a platform for the exchange of scholarly innovations and emerging practical challenges in Vision-based Industrial Inspection. Through a series of keynote talks, technical presentations, and a challenge competition, this workshop is intended to (i) bring together researchers from the interdisciplinary research communities related to computer vision-based inspection; and (ii) connect researchers and industry practitioners to synergize recent research progress and current needs in industrial practice.
[ West 213 ]
Embedded vision is an active field of research, bringing together efficient learning models with fast computer vision and pattern recognition algorithms. It touches many areas of robotics and intelligent systems and is enjoying impressive growth today.
[ West 217 - 219 ]
Federated Learning (FL) has become an important privacy-preserving paradigm in various machine learning tasks. However, the potential of FL in computer vision applications, such as face recognition, person re-identification, and action recognition, is far from being fully exploited. Moreover, FL has rarely been demonstrated effectively in advanced computer vision tasks such as object detection and image segmentation, compared to the traditional centralized training paradigm. This workshop aims at bringing together researchers and practitioners with common interests in FL for computer vision and studying the different synergistic relations in this interdisciplinary area. The day-long event will facilitate interaction among students, scholars, and industry professionals from around the world to discuss future research challenges and opportunities.
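As a rough illustration of the federated paradigm the workshop centres on, the sketch below shows the core of FedAvg-style aggregation: each client trains locally and the server averages parameters weighted by local dataset size. This is a minimal, framework-agnostic sketch using plain NumPy arrays and an assumed helper name, not tied to any particular FL library discussed at the workshop.

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Aggregate client model parameters by weighted averaging (FedAvg).

    client_weights : list of lists of np.ndarray, one list of parameter
                     tensors per client (all clients share the same shapes).
    client_sizes   : number of local training samples per client.
    Returns the averaged parameter list for the global model.
    """
    total = float(sum(client_sizes))
    global_weights = []
    for layer_idx in range(len(client_weights[0])):
        # Weight each client's contribution by its share of the total data.
        layer = sum(
            (size / total) * weights[layer_idx]
            for weights, size in zip(client_weights, client_sizes)
        )
        global_weights.append(layer)
    return global_weights
```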
[ West 107 - 108 ]
The rapid development of computer vision algorithms increasingly allows automatic visual recognition to be incorporated into a suite of emerging applications. Some of these applications operate in less-than-ideal circumstances, such as low-visibility environments, causing captured images to suffer from degradations. In other more extreme applications, such as imagers for flexible wearables, smart clothing sensors, ultra-thin headset cameras, implantable in vivo imaging, and others, standard camera systems cannot even be deployed, requiring new types of imaging devices. Computational photography addresses these concerns by designing new computational techniques and incorporating them into the image capture and formation pipeline. This raises a set of new questions. For example, what is the current state of the art in image restoration for images captured in non-ideal circumstances? How can inference be performed on novel kinds of computational photography devices?
Continuing the success of the 1st (CVPR'18), 2nd (CVPR'19), 3rd (CVPR'20), 4th (CVPR'21), and 5th (CVPR'22) UG2 Prize Challenge workshops, we provide its 6th version for CVPR 2023. It will inherit the successful benchmark dataset, platform and evaluation tools used by the previous UG2 workshops, but will also look at brand new aspects of the overall problem, significantly augmenting its existing scope.
[ West 111 - 112 ]
This joint full-day workshop is the longstanding event that brings together the strongly growing egocentric computer vision community, offering the 3rd Ego4D edition and the 11th Egocentric Perception, Interaction and Computing (EPIC) edition. This year, 17 Ego4D benchmark and 9 EPIC benchmark winners and findings will be presented throughout the day, ranging from social interactions, episodic memory, hand-object interactions, long-term tracking and video object segmentation to audio-based interaction recognition. In addition to the recurring Ego4D and EPIC challenges, new challenges are associated with the recently released benchmarks EgoTracks, PACO, EPIC-KITCHENS VISOR and EPIC-Sounds.
Additionally, the day will include accepted abstracts, invited CVPR papers and 5 keynotes by Andrea Vedaldi (Oxford and Meta), Hyun Soo Park (UMinnesota), David Fouhey (UMich) and Suraj Nair (Stanford). Check the program for details.
[ East Ballroom B ]
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concepts.
Recent works show that learning from large-scale image-text data is a promising approach to building transferable visual models that can effortlessly adapt to a wide range of downstream computer vision (CV) and multimodal (MM) tasks: for example, CLIP, ALIGN and Florence for image classification; ViLD, RegionCLIP, GLIP and OWL-ViT for object detection; GroupViT, OpenSeg, MaskCLIP, X-Decoder, Segment Anything (SAM) and SEEM for segmentation; and Multimodal GPT-4, LLaVA and MiniGPT-4 for language-and-image instruction-following chat assistants. These vision models with a language or interactive interface are naturally open-vocabulary recognition models, showing superior zero-shot and few-shot adaptation performance in various real-world scenarios.
We host this "Computer Vision in the Wild (CVinW)" workshop aiming to gather academic and industry communities to work on CV and MM problems in real-world scenarios, focusing on the challenge of open-set/domain visual recognition at different granularities and efficient task-level transfer. To measure the progress of CVinW, we develop new benchmarks for image classification, object detection and segmentation to measure the task-level transfer ability of …
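As a concrete example of the open-vocabulary, zero-shot behaviour described above, the sketch below classifies an image against an arbitrary list of text labels using a publicly released CLIP checkpoint via the Hugging Face transformers API; the image path and label set are placeholders, and this is an illustrative sketch rather than a workshop baseline.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a publicly available CLIP checkpoint (one of several possible choices).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # any RGB image (placeholder path)
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

# Encode image and candidate text labels jointly, then compare embeddings.
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # one score per label
print(dict(zip(labels, probs[0].tolist())))
```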
[ East 12 ]
The Visual Copy Detection Workshop (VCDW) explores the task of identifying copied images and videos, robust to common transformations. This task is central to social problems facing online services where users share media, such as combating misinformation and exploitative imagery, as well as enforcing copyright. Recently, copy detection methods have been used to identify and promote original content, and to reduce memorization in both predictive and generative models.
The workshop will explore technical advances in copy detection as well as the applications that motivate this research. The workshop will feature the Video Similarity Challenge, a copy detection challenge in the video domain, including presentations by challenge participants.
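One classical building block for copy detection of this kind is a compact descriptor that is stable under common transformations; a simple example is an average hash, whose Hamming distance between two images stays small when one is a lightly edited copy of the other. The sketch below is a minimal illustration with PIL and NumPy under that assumption, not the method used in the Video Similarity Challenge.

```python
import numpy as np
from PIL import Image

def average_hash(path, hash_size=8):
    """Compute a simple perceptual (average) hash of an image.

    The image is shrunk to hash_size x hash_size in grayscale and each
    pixel is compared to the mean, giving a small binary fingerprint
    that is fairly robust to resizing and mild re-encoding.
    """
    img = Image.open(path).convert("L").resize((hash_size, hash_size))
    pixels = np.asarray(img, dtype=np.float32)
    return (pixels > pixels.mean()).flatten()

def hamming_distance(h1, h2):
    """Number of differing bits; small values suggest a possible copy."""
    return int(np.count_nonzero(h1 != h2))
```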
[ East 3 ]
[ East 13 ]
The workshop focuses on bringing together researchers, engineers, and practitioners from academia, industry, and government to exchange ideas, share their latest research, and discuss the latest trends and challenges in this field. The workshop also aims to foster collaboration between different stakeholders, including computer vision researchers, machine learning experts, robotics engineers and safety experts, to create a comprehensive framework for developing safe AI systems for all domains.
Overall, the SAIAD workshop aims to advance the state-of-the-art in safe AI, address the most pressing challenges, and provide a platform for networking and knowledge sharing among the experts in this field.
[ East 2 ]
Our objective is to provide a venue for novel research in omnidirectional computer vision with an eye toward actualizing these ideas for commercial or societal benefit. As omnidirectional cameras become more widespread, we want to bridge the gap between the research and application of omnidirectional vision technologies. Omnidirectional cameras are already widespread in a number of application areas such as automotive, surveillance, photography, simulation and other use-cases that benefit from large field of view. More recently, they have garnered interest for use in virtual and augmented reality. We want to encourage the development of new models that natively operate on omnidirectional imagery as well as close the performance gap between perspective-image and omnidirectional algorithms.
[ West 116 - 117 ]
[ West 109 - 110 ]
[ West 211 ]
PCV2023 is a half-day workshop at CVPR 2023 which provides a forum for original research in computer vision and photogrammetry. PCV2023 invites submissions of high-quality research papers concerning the generation, processing, and analysis of images, 3D point clouds and surface models, with the goal of enhancing accuracy and completeness. Topics of interest include, but are not limited to:
- Feature extraction, matching, sensor orientation, and sensor fusion
- Structure from motion and SLAM
- Stereo (multi-view) and surface reconstruction
- 3D point cloud processing, segmentation, and classification
- Multi-temporal analysis, dynamic scene understanding
- 3D scene analysis and semantic segmentation
[ West 223 - 224 ]
Reconstruction of general dynamic scenes is motivated by potential applications in film and broadcast production together with the ultimate goal of automatic understanding of real-world scenes from distributed camera networks. With recent advances in hardware and the advent of virtual and augmented reality, dynamic scene reconstruction is being applied to more complex scenes with applications in Entertainment, Games, Film, Creative Industries and AR/VR/MR. We welcome contributions to this workshop in the form of oral presentations and posters.