


Tutorials
Tutorial
Matteo Poggi

[ Arch 213 ]

Abstract

For decades, stereo matching was approached with hand-crafted algorithms focused on measuring the visual similarity between local patterns in the two images and propagating this information globally. Since 2015, deep learning has led to a paradigm shift in this field, driving the community to design end-to-end deep networks capable of matching pixels. This revolution brought stereo matching to a whole new level of accuracy, yet not without drawbacks. Indeed, some hard challenges remained unsolved by the first generation of deep stereo models, which were often incapable of generalizing properly across different domains -- e.g., from synthetic to real, from indoor to outdoor -- or of dealing with high-resolution images.
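As a concrete illustration of the hand-crafted paradigm described above, the sketch below implements naive block matching with a sum-of-absolute-differences (SAD) cost in Python; the window size, disparity range, and function name are illustrative choices rather than a specific published method.

```python
import numpy as np

def block_matching_disparity(left, right, max_disp=64, window=7):
    """Naive SAD block matching: for each pixel in the left image, search
    along the same row in the right image for the best-matching patch."""
    left = left.astype(np.float32)
    right = right.astype(np.float32)
    h, w = left.shape
    half = window // 2
    disparity = np.zeros((h, w), dtype=np.float32)
    for y in range(half, h - half):
        for x in range(half, w - half):
            patch_l = left[y - half:y + half + 1, x - half:x + half + 1]
            best_cost, best_d = np.inf, 0
            for d in range(min(max_disp, x - half) + 1):
                patch_r = right[y - half:y + half + 1,
                                x - d - half:x - d + half + 1]
                cost = np.abs(patch_l - patch_r).sum()  # SAD matching cost
                if cost < best_cost:
                    best_cost, best_d = cost, d
            disparity[y, x] = best_d
    return disparity
```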

This was, however, three years ago. These and other challenges have been tackled by the research community in the Twenties, making deep stereo matching even more mature and suitable as a practical solution for everyday applications. For instance, we now have networks capable of generalizing much better from synthetic to real images, as well as of handling high-resolution images or even estimating disparity correctly in the presence of non-Lambertian surfaces -- known to be among the ill-posed challenges for stereo. Accordingly, in this …

Tutorial
Sijia Liu · Yang Liu · Nathalie Baracaldo · Eleni Triantafillou

[ Arch 305 ]

Abstract

This tutorial aims to offer a comprehensive understanding of emerging machine unlearning (MU) techniques. These techniques are designed to accurately assess the impact of specific data points, classes, or concepts (e.g., related to copyrighted information, biases and stereotypes, and personally identifying data) on model performance and efficiently eliminate their potentially harmful influence within a pre-trained model. With the recent shift to foundation models, MU has become indispensable, as re-training from scratch is prohibitively costly in terms of time, computational resources, and finances. Despite increasing research interest, MU for vision tasks remains significantly underexplored compared to its prominence in the security and privacy (SP) field. Within this tutorial, we will delve into the algorithmic foundations of MU methods, including techniques such as localization-informed unlearning, unlearning-focused finetuning, and vision model-specific optimizers. We will provide a comprehensive and clear overview of the diverse range of applications for MU in CV. Furthermore, we will emphasize the importance of unlearning from an industry perspective, where modifying the model during its life-cycle is preferable to re-training it entirely, and where metrics to verify the unlearning process become paramount. Our tutorial will furnish the general audience with sufficient background information to grasp the motivation, research progress, opportunities, …
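For intuition only, the following Python sketch shows one of the simplest unlearning-oriented fine-tuning baselines: gradient ascent on the forget set combined with gradient descent on the retain set. It is a generic illustration under assumed inputs (`model`, `loss_fn`, and the two batches), not one of the specific localization-informed or vision-specific methods covered in the tutorial.

```python
import torch

def unlearning_finetune_step(model, optimizer, forget_batch, retain_batch,
                             loss_fn, forget_weight=1.0):
    """One step of a simple unlearning-oriented fine-tuning baseline:
    ascend the loss on the forget set while descending it on the retain set."""
    optimizer.zero_grad()
    xf, yf = forget_batch
    xr, yr = retain_batch
    loss_forget = loss_fn(model(xf), yf)   # influence we want to remove
    loss_retain = loss_fn(model(xr), yr)   # utility we want to keep
    # Minimizing (retain - forget) pushes the model away from the forget data
    # while preserving performance on the data we keep.
    total = loss_retain - forget_weight * loss_forget
    total.backward()
    optimizer.step()
    return loss_forget.item(), loss_retain.item()
```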

Tutorial
Xin Jin · Wenjun Zeng · Tao Yang · Yue Song · Nicu Sebe · Xingyi Yang · Xinchao Wang · Shuicheng Yan

[ Arch 2B ]

Abstract

This tutorial aims to explore the concepts of disentanglement and compositionality in the field of computer vision. These concepts play a crucial role in enabling machines to understand and interpret visual information with more sophistication and human-like reasoning. Participants will learn about advanced techniques and models that allow for the disentanglement of visual factors in images and the compositionality of these factors to produce more meaningful representations. All in all, disentanglement and compositionality are believed to be among the possible paths for AI to fundamentally understand the world and eventually achieve Artificial General Intelligence (AGI).

Tutorial
Zhengyuan Yang · Linjie Li · Zhe Gan · Chunyuan Li · Jianwei Yang

[ Summit 437 - 439 ]

Abstract

This tutorial covers the advanced topics in designing and training vision foundation models, including the state-of-the-art approaches and principles in (i) learning vision foundation models for multimodal understanding and generation, (ii) benchmarking and evaluating vision foundation models, and (iii) agents and other advanced systems based on vision foundation models.

Tutorial
Edward Kim · Sanjit Seshia · Daniel Fremont · Jinkyu Kim · Kimin Lee · Hazem Torfah · Necmiye Ozay · Parasara Sridhar Duggirala · Marcell Vazquez-Chanlatte

[ Arch 307-308 ]

Abstract

Autonomous systems, such as self-driving cars or intelligent robots, increasingly operate in complex, stochastic environments where they dynamically interact with multiple entities (human and robot). There is a need to formally model and generate such environments in simulation, for use cases that span synthetic training data generation and rigorous evaluation of safety. This tutorial provides an in-depth introduction to Scenic, a simulator-agnostic probabilistic programming language for modeling complex multi-agent, physical environments with stochasticity and spatio-temporal constraints. Scenic has been used in a variety of domains such as self-driving, aviation, indoor robotics, multi-agent systems, and augmented/virtual reality. Using Scenic and associated open-source tools, one can (1) model and sample from distributions with spatial and temporal constraints, (2) generate synthetic data in a controlled, programmatic fashion to train and test machine learning components, (3) reason about the safety of AI-enabled autonomous systems, (4) automatically find edge cases, (5) debug and root-cause failures of AI components, including for perception, and (6) bridge the sim-to-real gap in autonomous system design. We will provide a hands-on tutorial on the basics of Scenic and its applications, how to create Scenic programs and your own new applications on top of Scenic, and to …
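The sketch below is not Scenic syntax; it is a plain-Python illustration (with made-up parameters) of the underlying idea behind item (1): rejection-sampling a concrete scene from a distribution subject to a spatial constraint, here a minimum gap between agents placed along a road.

```python
import random

def sample_scene(num_agents=3, min_gap=2.0, road_length=50.0, max_tries=1000):
    """Rejection-sample agent positions along a road subject to a spatial
    constraint (a minimum gap between any two agents), mimicking the kind of
    constrained scenario sampling a probabilistic scenario language performs."""
    for _ in range(max_tries):
        positions = sorted(random.uniform(0.0, road_length)
                           for _ in range(num_agents))
        gaps_ok = all(b - a >= min_gap
                      for a, b in zip(positions, positions[1:]))
        if gaps_ok:
            return positions  # one concrete scene satisfying the constraint
    raise RuntimeError("could not satisfy constraints; loosen them or retry")

print(sample_scene())
```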

Tutorial
Orhun Aydin · Philipe Ambrozio Dias · Dalton Lunga

[ Summit 448 ]

Abstract

The 5Vs of big data (volume, value, variety, velocity, and veracity) pose immense opportunities and challenges for implementing local and planet-wide solutions from Earth observation (EO) data. EO data, residing at the center of various multidisciplinary problems, is primarily obtained through satellite imagery, aerial photography, and UAV-based platforms. Understanding Earth observation data unlocks this immense data source for addressing planet-scale problems with computer vision and machine learning techniques for geospatial analysis. This tutorial introduces current EO data sources, problems, and image-based analysis techniques. The most recent advances in data, models, and the open-source analysis ecosystem related to computer vision and deep learning for EO data will be introduced.

Tutorial
Benjamin Kimia · Timothy Duff · Ricardo Fabbri · Hongyi Fan

[ Summit 447 ]

Abstract

Minimal problems and their solvers play an important role in RANSAC-based approaches to several estimation problems in vision. Minimal solvers solve systems of equations, depending on data, which obey a “conservation of number principle”: for sufficiently generic data, the number of solutions over the complex numbers is constant. Homotopy continuation (HC) methods exploit not just this conservation principle, but also the smooth dependence of solutions on problem data. The classical solution of polynomial systems using Gröbner bases, resultants, elimination templates, etc. has been largely successful on smaller problems, but these methods are not able to tackle larger polynomial systems with a larger number of solutions. While HC methods can solve these problems, they have been notoriously slow. Recent research by the presenters and other researchers has enabled efficient HC solvers capable of real-time solutions.
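To make the idea concrete, here is a toy homotopy continuation tracker for a single univariate polynomial, written in Python with NumPy: the known roots of a start system x^n - 1 are deformed toward the target system in small steps of t, with Newton corrections at each step. Real HC solvers for multivariate minimal problems add predictors, adaptive step sizes, and parameter homotopies; this sketch only conveys the principle.

```python
import numpy as np

def track_roots(target_coeffs, steps=200, newton_iters=5):
    """Toy homotopy continuation for one univariate polynomial f(x) = 0.
    The start system g(x) = x^n - 1 has n known roots (the roots of unity);
    we deform g into f via H(x, t) = (1 - t) g(x) + t f(x) and follow each
    root with small steps in t, correcting with Newton's method."""
    f = np.polynomial.Polynomial(target_coeffs[::-1])  # input is highest degree first
    n = f.degree()
    g = np.polynomial.Polynomial([-1.0] + [0.0] * (n - 1) + [1.0])  # x^n - 1
    roots = np.exp(2j * np.pi * np.arange(n) / n)      # start solutions
    for k in range(1, steps + 1):
        t = k / steps
        h = (1 - t) * g + t * f
        dh = h.deriv()
        for _ in range(newton_iters):                   # corrector steps
            roots = roots - h(roots) / dh(roots)
    return roots

# Example: the three complex roots of f(x) = x^3 - 2x + 5
print(track_roots([1.0, 0.0, -2.0, 5.0]))
```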

The main objective of this tutorial is to make this technology more accessible to the computer vision community. Specifically, after an overview of how such methods can be useful for solving problems in vision (e.g., absolute/relative pose, triangulation), we will describe some of the basic theoretical apparatus underlying HC solvers, including both local and global “probability-1” aspects. On the practical side, we will describe …

Tutorial
Mohit Prabhushankar · Ghassan AlRegib

[ Summit 440 - 441 ]

Abstract

Neural networks provide generalizable and task-independent representation spaces that have garnered widespread applicability in image understanding applications. The complicated semantics of feature interactions within image data have been broken down into sets of non-linear functions, convolution parameters, attention, and multi-modal inputs, among others. The complexity of these operations has introduced multiple vulnerabilities within neural network architectures. These vulnerabilities include adversarial and out-of-distribution samples, confidence calibration issues, and catastrophic forgetting, among others. Given that AI promises to herald the fourth industrial revolution, it is critical to understand and overcome these vulnerabilities. Doing so requires creating robust neural networks that drive AI systems. Defining robustness, however, is not trivial. Simple measurements of invariance to noise and perturbations are not applicable in real-life settings. In this tutorial, we provide a human-centric approach to understanding robustness in neural networks that allows AI systems to function in society. Doing so allows us to state the following: 1) All neural networks must provide contextual and relevant explanations to humans, 2) Neural networks must know when and what they don’t know, 3) Neural networks must be amenable to intervention by humans at the decision-making stage. These three statements call for …
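As a small, self-contained example of the calibration aspect mentioned above ("networks must know when and what they don't know"), the following Python snippet computes the Expected Calibration Error, a standard way to quantify the mismatch between a model's confidence and its accuracy; the binning scheme and toy inputs are illustrative.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Expected Calibration Error: bin predictions by confidence and compare
    each bin's average confidence with its empirical accuracy."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap   # weight by the fraction of samples in the bin
    return ece

# Toy example: four predictions with their confidences and correctness flags.
print(expected_calibration_error([0.9, 0.8, 0.95, 0.6], [1, 0, 1, 1]))
```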

Tutorial
Yanwei Fu · Francesco Locatello · Tianjun Xiao · Tong He · Ke Fan

[ Elliott Bay ]

Abstract

This tutorial discusses the evolution of object-centric learning (OCL) representations in computer vision and deep learning. Initially inspired by decomposing visual scenes into surfaces and objects, recent developments focus on learning causal variables from high-dimensional observations like images or videos. The tutorial covers the objectives of OCL, its development, and its connections with other machine learning fields, emphasizing object-centric approaches, especially in unsupervised segmentation. Advances in encoders, decoders, and self-supervised learning objectives are explored, with a focus on real-world applications and challenges. The tutorial also introduces open-source tools and showcases breakthroughs in video-based object-centric learning. It will feature four talks covering the basic ideas, learning good features for object-centric learning, video-based object-centric representation, and more diverse real-world applications.

Tutorial
Fabricio Narcizo · Elizabete Munzlinger · Anuj Dutt · Shan Shaffi · Sai Narsi Reddy Donthi Reddy

[ Summit 446 ]

Abstract

Edge AI refers to artificial intelligence applied to edge devices like smartphones, tablets, laptops, cameras, sensors, and drones. It enables these devices to handle AI tasks autonomously, without cloud or central server connections, offering higher speed, lower latency, greater privacy, and reduced power consumption. Edge AI presents challenges and opportunities in model development and deployment, including size reduction, compression, quantization, and distillation, and involves integration and communication between edge devices and the cloud or other devices in a hybrid, distributed architecture. This tutorial provides practical guidance on developing and deploying optimized models for edge AI, covering theoretical and technical aspects, best practices, and real-world case studies focused on computer vision and deep learning models. We will demonstrate tools and frameworks such as TensorFlow, PyTorch, ONNX, OpenVINO, Google Mediapipe, and Qualcomm SNPE. We will also discuss multi-modal AI applications such as head pose estimation, person segmentation, hand gesture recognition, sound localization, and more. These applications use images, videos, and sounds to create interactive edge AI experiences. The presentation will include developing and deploying these models on Jabra collaborative business cameras and exploring integration with devices like Luxonis OAK-1 MAX, Neural Compute Engine Myriad X, and NVIDIA Jetson Nano Developer Kit.
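As a minimal sketch of the kind of size-reduction and export workflow mentioned above, the snippet below applies post-training dynamic quantization to a stand-in PyTorch model and exports it to ONNX; the architecture, file name, and opset version are placeholder choices, and real deployments would go through the specific toolchains named above.

```python
import torch
import torch.nn as nn

# A small stand-in model; in practice this would be the trained vision model to deploy.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 128),
                      nn.ReLU(), nn.Linear(128, 10)).eval()

# Post-training dynamic quantization: Linear weights are stored in int8,
# shrinking the model and typically speeding up CPU inference at the edge.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear},
                                                dtype=torch.qint8)

# Export the float model to ONNX so it can run through runtimes such as
# ONNX Runtime or be converted further (e.g., to OpenVINO IR) for deployment.
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy_input, "edge_model.onnx", opset_version=17)
```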

Tutorial
Hsin-Ying Lee · Peiye Zhuang · Chaoyang Wang

[ Summit 440 - 441 ]

Abstract

In today's metaverse, where the digital and physical worlds blend seamlessly, capturing, representing, and analyzing 3D structures is vital. Advances in 3D and 4D technologies have revolutionized gaming, AR, and VR, offering immersive experiences. 3D modeling bridges reality and virtuality, enabling realistic simulations and AR overlays. Adding time enhances these experiences with lifelike animations and object tracking, shaping digital interactions.

Traditionally, 3D generation involved directly manipulating 3D data, evolving alongside 2D techniques. Recent breakthroughs in 2D diffusion models have enhanced 3D tasks using large-scale image datasets. Methods like Score Distillation Sampling improve quality. However, biases in 2D data and limited 3D information pose challenges.
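For reference, a minimal sketch of the Score Distillation Sampling idea mentioned above: noise a differentiable rendering, query a frozen 2D diffusion prior for the noise it predicts, and use the prediction error as a gradient on the rendering. The `unet` noise-prediction interface and the timestep range are assumptions for illustration, not a specific implementation.

```python
import torch

def sds_loss(rendered_image, unet, text_embedding, alphas_cumprod, weight=1.0):
    """Score Distillation Sampling, sketched: noise a differentiable rendering,
    ask a frozen 2D diffusion model to predict the noise, and push the rendering
    so the prediction error shrinks.  `unet(x_t, t, cond)` is a hypothetical
    noise-prediction interface."""
    t = torch.randint(20, 980, (1,), device=rendered_image.device)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(rendered_image)
    x_t = a_bar.sqrt() * rendered_image + (1 - a_bar).sqrt() * noise
    with torch.no_grad():
        noise_pred = unet(x_t, t, text_embedding)      # frozen 2D prior
    grad = weight * (noise_pred - noise)
    # Detach so the gradient of this loss w.r.t. the rendering equals `grad`.
    return (grad.detach() * rendered_image).sum()
```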

Generating 3D scenes and reducing biases in 2D data for realistic synthesis are ongoing challenges. Our tutorial explores techniques for diverse scenes and realism, including 3D/4D reconstruction from images and videos. Attendees will learn about various generation methods, from training on 3D data to leveraging 2D models, gaining a deep understanding of modern 3D modeling.

In summary, our tutorial covers the breadth of 3D/4D generation, from basics to the latest. By tackling scene-level complexities and using 2D data for realism, attendees gain insight into the evolving 3D modeling landscape in the metaverse.

Tutorial
Samet Akcay · Paula Ramos Giraldo · Ria Cheruvu · Alexander Kozlov · Zhen Zhao · Zhuo Wu · Raymond Lo · Yury Gorbachev

[ Summit 436 ]

Abstract

This tutorial aims to guide researchers and practitioners in navigating the complex deep learning (DL) landscape, focusing on data management, training methodologies, optimization strategies, and deployment techniques. It highlights open-source libraries like the OpenVINO toolkit, OpenVINO Training eXtensions (OTX), and the Neural Network Compression Framework (NNCF) for streamlining DL development. The tutorial covers how OTX 2.0 simplifies the DL ecosystem (computer vision) by integrating various frameworks and ensuring a consistent experience across different platforms (MMLab, Lightning, or Anomalib). It also demonstrates how to fine-tune generative AI models, specifically Stable Diffusion (SD) with LoRA, and the benefits of customized models in reducing latency and enhancing efficiency. The tutorial explores fine-tuning for visual prompting tasks, including the Segment Anything Model (SAM). It explains how to fine-tune an SD model on custom data using multiple acceleration methods, and how to deploy the fine-tuned model using the OpenVINO Transformation Passes API. Lastly, the tutorial focuses on model optimization capabilities for the inference phase, with the OpenVINO toolkit and OTX library integrating with NNCF to refine neural networks and improve inference speed, especially on edge devices with limited resources. The tutorial includes demos showcasing how the OpenVINO runtime API enables real-time inference on various devices.
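For orientation, a minimal OpenVINO runtime sketch of the inference phase mentioned above: read a model, compile it for a target device, and run one request. The file path, input shape, and CPU device below are placeholders, and the quantization and fine-tuning steps covered in the tutorial are omitted.

```python
import numpy as np
import openvino as ov

# Load a model (OpenVINO IR or ONNX), compile it for a device, and run a
# single inference request.  Paths, shapes, and device name are placeholders.
core = ov.Core()
model = core.read_model("edge_model.onnx")
compiled = core.compile_model(model, device_name="CPU")

dummy_input = np.random.rand(1, 3, 224, 224).astype(np.float32)
result = compiled([dummy_input])          # outputs keyed by output port
output = result[compiled.output(0)]
print(output.shape)
```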

Tutorial
Naoki Wake · Zane Durante · Ran Gong · Jae Sung Park · Bidipta Sarkar · Rohan Taori · Yusuke Noda · Yejin Choi · Demetri Terzopoulos · Katsushi Ikeuchi · Hoi Vo · Li Fei-Fei · Jianfeng Gao · Qiuyuan Huang

[ Summit 446 ]

Abstract

Generalist Agent AI (GAA) is a family of systems that generate effective actions in an environment based on the understanding of multimodal sensory input. While these systems are expanding into various fields with the advent of large foundation models, they share common interests such as data collection, benchmarking, and ethical perspectives. In this tutorial, we focus on several representative research areas of GAA, including gaming, robotics, and healthcare, and aim to provide comprehensive knowledge on the common concerns discussed in these fields. We expect the participants to learn the fundamentals of GAA and gain insights to further advance their research. Specific learning outcomes include: - GAA Overview: A deep dive into its principles and roles in contemporary applications, providing attendees with a thorough grasp of its importance and uses. - Methodologies: Detailed examples of how LLMs and VLMs enhance GAAs, illustrated through case studies. - Performance Evaluation: Guidance on the assessment of GAAs with relevant datasets. - Ethical Considerations: A discussion on the societal impacts and ethical challenges of deploying Agent AI, highlighting responsible development practices. - Future Challenges: A categorization of the latest developments in each domain and a discussion of future directions. Led by experts from academia and …

Tutorial
Xiaoyang Wu · Hengshuang Zhao · Fuxin Li · Zhijian Liu

[ Summit 444 ]

Abstract

Unstructured point clouds serve as a sparse representation of the 3D world, playing pivotal roles in 3D perception, generation, autonomous driving, virtual/augmented reality, and robotics. Despite their significance, there is no comprehensive resource covering state-of-the-art approaches and engineering nuances of deep point cloud networks. This tutorial aims to fill this gap by offering a comprehensive exploration of the subject. It features lectures that progress from classical point cloud backbones to state-of-the-art point transformers, large-scale 3D representation learning (including pre-training technologies), efficient libraries for sparse systems, and diverse applications of deep point cloud networks. Participants will acquire systematic and practical knowledge on managing and extracting robust deep feature representations from point cloud data. They will also learn to make informed decisions regarding model architectures and data structures when dealing with point cloud data. Armed with these skills, attendees will be well-equipped to comprehend and leverage these models in real-world applications across various fields, including autonomous driving, embodied AI, and other domains grappling with sparse data in low-dimensional Euclidean spaces.
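As a pointer to what a "classical point cloud backbone" looks like, here is a minimal PointNet-style network in PyTorch: a shared per-point MLP followed by symmetric max pooling, which makes the global feature independent of point ordering. The layer sizes and class name are illustrative choices.

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    """Minimal PointNet-style backbone: a shared per-point MLP followed by a
    symmetric max-pooling, making the global feature invariant to the
    ordering of the (unstructured) points."""
    def __init__(self, in_dim=3, feat_dim=256, num_classes=10):
        super().__init__()
        self.point_mlp = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(),
            nn.Linear(64, feat_dim), nn.ReLU())
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, points):                        # points: (batch, num_points, 3)
        per_point = self.point_mlp(points)            # (batch, num_points, feat_dim)
        global_feat = per_point.max(dim=1).values     # order-invariant pooling
        return self.head(global_feat)

logits = TinyPointNet()(torch.randn(2, 1024, 3))
print(logits.shape)  # torch.Size([2, 10])
```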

Tutorial
Wenjin Wang · Daniel Mcduff · Xuyu Wang

[ Arch 307-308 ]

Abstract

Understanding people and extracting health-related metrics is an emerging research topic in computer vision that has grown rapidly in recent years. Without the need for any physical contact with the human body, cameras have been used to measure vital signs remotely (e.g., heart rate, heart rate variability, respiration rate, blood oxygen saturation, pulse transit time, body temperature, etc.) from an image sequence of the skin or body, enabling contactless, continuous, and comfortable health monitoring. Cameras also enable the measurement of human behaviors/activities and high-level visual semantic/contextual information by leveraging computer vision and machine learning techniques. Understanding the environment around the people is another unique advantage of cameras compared to contact bio-sensors (e.g., wearables), which facilitates a better understanding of both the person and the scene for health monitoring. In addition to camera-based approaches, Radio Frequency (RF) based methods for health monitoring have also been proposed, using radar, WiFi, RFID, and acoustic signals. Contactless monitoring with cameras and RF will bring a rich set of compelling healthcare applications that directly improve upon contact-based monitoring solutions and improve people’s care experience and quality of life, termed “AI health monitoring”. In this tutorial, we will give an overview of recent …
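A toy example of the camera-based vital-sign idea: given the average green-channel intensity of a skin region over time, the dominant frequency in the physiological band gives a heart-rate estimate. The band limits and sampling rate below are assumptions; practical remote-PPG pipelines add face/skin tracking, filtering, and more robust signal extraction.

```python
import numpy as np

def estimate_heart_rate(green_trace, fps=30.0):
    """Toy remote-PPG pipeline: given the mean green-channel intensity of a
    skin region over time, find the dominant frequency in a typical heart
    rate band (0.7-4 Hz) and convert it to beats per minute."""
    signal = np.asarray(green_trace, dtype=float)
    signal = signal - signal.mean()                      # remove DC component
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fps)
    band = (freqs >= 0.7) & (freqs <= 4.0)               # roughly 42-240 bpm
    peak_freq = freqs[band][np.argmax(spectrum[band])]
    return 60.0 * peak_freq                              # beats per minute

# Synthetic example: a 1.2 Hz (72 bpm) pulse sampled for 10 seconds at 30 fps.
t = np.arange(0, 10, 1 / 30.0)
print(estimate_heart_rate(100 + 0.5 * np.sin(2 * np.pi * 1.2 * t)))
```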

Tutorial
Amir Zamir · Andrei Atanov · Andrew Spielberg

[ Summit 344 ]

Abstract

Animals exhibit a wide variety of morphologies and sensors, believed to have appeared through billions of years of evolution. Common examples relevant to vision include differences in pupil shapes, the positioning of eyes, various types of eyes, and a varying level of multimodality across animals. Such adaptations are hypothesized to be instances of the so-called Ecological Theory, which posits a strong connection between the specifics of vision and the environment surrounding the agent, its objectives, and its body. How can we replicate this diversity and achieve adaptive design in robotics and vision systems?

In this tutorial, we discuss I) alternative forms of visual sensors that can be useful for real-world robots and II) computational approaches to robot and vision design that can achieve the goal of adaptive design automatically, effectively, and efficiently. The tutorial covers topics in sensing, control, simulation, optimization, and learning-based design for various rigid and soft robots and visual sensors. The material is drawn from state-of-the-art breakthroughs in the field and insights from other disciplines.

This material is accessible to individuals of all backgrounds and levels of expertise.

Tutorial
Qing Qu · Zhihui Zhu · Yuqian Zhang · Yi Ma · Sam Buchanan · Beidi Chen · Mojan Javaheripi · Liyue Shen · Zhangyang Wang

[ Summit 442 ]

Abstract

Over the past decade, the advent of machine learning and large-scale computing has immeasurably changed the ways we process, interpret, and predict with data in imaging and computer vision. The “traditional” approach to algorithm design, based around parametric models for specific structures of signals and measurements—say sparse and low-rank models—and the associated optimization toolkit, is now significantly enriched with data-driven learning-based techniques, where large-scale networks are pre-trained and then adapted to a variety of specific tasks. Nevertheless, the successes of both modern data-driven and classic model-based paradigms rely crucially on correctly identifying the low-dimensional structures present in real-world data, to the extent that we see the roles of learning and compression of data processing algorithms—whether explicit or implicit, as with deep networks—as inextricably linked.
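As a concrete instance of the low-rank models mentioned above, the snippet below computes the best rank-r approximation of a data matrix with a truncated SVD; the matrix sizes and noise level are illustrative.

```python
import numpy as np

def low_rank_approx(matrix, rank):
    """Best rank-r approximation (in the least-squares sense) via truncated
    SVD -- the prototypical low-dimensional, model-based prior for data that
    is approximately low-rank."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank, :]

# A rank-2 matrix plus small noise is recovered almost exactly at rank 2.
rng = np.random.default_rng(0)
data = rng.standard_normal((100, 2)) @ rng.standard_normal((2, 50))
noisy = data + 0.01 * rng.standard_normal(data.shape)
print(np.linalg.norm(low_rank_approx(noisy, 2) - data) / np.linalg.norm(data))
```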

As such, this tutorial uniquely and timely bridges low-dimensional models with deep learning in imaging and vision. It will show how: 1. Low-dimensional models and principles provide a valuable lens for formulating problems and understanding the behavior of modern deep models in imaging and computer vision; 2. Ideas from low-dimensional models can provide valuable guidance for designing new parameter-efficient, robust, and interpretable deep learning models for computer vision problems in …

Tutorial
Li Chen · Andreas Geiger · Huijie Wang · Jiajie Xu

[ Summit 447 ]

Abstract

In this tutorial, we explore the intersection of AGI technologies and the advancement of autonomous systems, specifically in the field of robotics. We invite participants to embark on an investigative journey that covers essential concepts, frameworks, and challenges. Through discussion, we aim to shed light on the crucial role of fundamental models in enhancing the cognitive abilities of autonomous agents. Through cooperation, we aim to chart a path for the future of robotics, where the integration of AGI enables autonomous systems to push the limits of their capabilities and intelligence, ushering in a new era of intelligent autonomy.

Tutorial
Raquel Urtasun · Sergio Casas · Abbas Sadat · Sivabalan Manivasagam · Ioan Andrei Bârsan

[ Summit 445 ]

Abstract

A full-day tutorial covering all aspects of autonomous driving. This tutorial will provide the necessary background for understanding the different tasks and associated challenges, the different sensors and data sources one can use and how to exploit them, as well as how to formulate the relevant algorithmic problems such that efficient learning and inference are possible. We will first introduce the self-driving problem setting and a broad range of existing solutions, both top-down from a high-level perspective and bottom-up from technological and algorithmic points of view. We will then extrapolate from the state of the art and discuss where the challenges and open problems are, and where we need to head to provide a scalable, safe, and affordable self-driving solution for the future.

Since last year’s instance (https://waabi.ai/cvpr-2023/), countless new and promising avenues of research have started gaining traction, and we have updated our tutorial accordingly. To name a few examples, this includes topics like occupancy forecasting, self-supervised learning, foundation models, the rise of Gaussian Splatting and diffusion models for simulation, as well as the study of closed-loop vs. open-loop evaluation.

Tutorial
Maying Shen · Danny Yin · Jason Clemons · Pavlo Molchanov · Jan Kautz · Jose M. Alvarez

[ Summit 447 ]

Abstract

This tutorial focuses on describing techniques to allow deep learning practitioners to accelerate the training and inference of large deep networks while also reducing memory requirements across a spectrum of off-the-shelf hardware for important applications such as autonomous driving and large language models. Topics include, but are not limited to:

  • Deep learning specialized hardware overview. We review the architecture of the most used deep learning acceleration hardware, including the main computational processors and memory modules. We will also cover aspects of algorithmic intensity and an overview of theoretical aspects of computing.

  • Best practices for acceleration. We provide an overview of best practices for designing efficient neural networks, including channel number selection, compute-heavy operations, and reduction operations, among others.

  • Existing tools for model acceleration. In this part we will focus on existing tools to accelerate a trained neural network on GPU devices. We will particularly discuss operation folding, TensorRT, ONNX graph optimization, and sparsity.

  • Foundation models. Here we will focus on best practices for training and deploying foundation models efficiently.

  • Research overview of recent techniques. In the last part, we will focus on recent advanced techniques for post-training model optimization, including pruning, quantization, model distillation, and NAS, among others (a minimal pruning sketch follows this list).
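As referenced in the last bullet, here is a minimal post-training pruning sketch using PyTorch's built-in pruning utilities: unstructured L1 (magnitude) pruning of the Linear layers of a stand-in model. The architecture and sparsity level are placeholders, and the other listed techniques (quantization, distillation, NAS) are not shown.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# A small stand-in network; in practice this would be a trained model.
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))

# Unstructured magnitude pruning: zero out the 50% smallest-magnitude weights
# of each Linear layer, then make the sparsity permanent by removing the
# re-parametrization so the pruned weights are baked into the tensor.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")

zeros = sum((m.weight == 0).sum().item() for m in model.modules()
            if isinstance(m, nn.Linear))
total = sum(m.weight.numel() for m in model.modules() if isinstance(m, nn.Linear))
print(f"global sparsity: {zeros / total:.2%}")
```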

Tutorial
Hao Fei · Yuan Yao · Ao Zhang · Haotian Liu · Fuxiao Liu · Zhuosheng Zhang · Shuicheng Yan

[ Summit 446 ]

Abstract

Artificial intelligence (AI) encompasses knowledge acquisition and real-world grounding across various modalities. As a multidisciplinary research field, multimodal large language models (MLLMs) have recently garnered growing interest in both academia and industry, showing an unprecedented trend toward achieving human-level AI via MLLMs. These large models offer an effective vehicle for understanding, reasoning, and planning by integrating and modeling diverse information modalities, including language, visual, auditory, and sensory data. This tutorial aims to deliver a comprehensive review of cutting-edge research in MLLMs, focusing on three key areas: MLLM architecture design, instructional learning, and multimodal reasoning of MLLMs. We will explore technical advancements, synthesize key challenges, and discuss potential avenues for future research. All the resources and materials will be made available online: https://mllm2024.github.io/CVPR2024

Tutorial
Long Chen · Oleg Sinavski · Fergal Cotter · Vassia Simaiaki · Elahe Arani · Gianluca Corrado · Nikhil Mohan · Jamie Shotton

[ Summit 444 ]

Abstract

A comprehensive half-day tutorial focused on End-to-End Autonomous Driving (E2EAD), reflecting the significant shift in focus towards this approach within both industry and academia. Traditional modular approaches in autonomous driving, while effective in specific contexts, often struggle with scalability, long-tail scenarios, and compounding errors from different modules, thereby paving the way for the end-to-end paradigm. This tutorial aims to dissect the complexities and nuances of end-to-end autonomy, covering theoretical foundations, practical implementations and validations, and future directions of this evolving technology.

Tutorial
Zhiqian Chen · Lei Zhang · Liang Zhao

[ Summit 440 - 441 ]

Abstract

Over recent years, Graph Neural Networks (GNNs) have garnered significant attention. However, the proliferation of diverse GNN models, underpinned by various theoretical approaches, complicates the process of model selection, as they are not readily comprehensible within a uniform framework. Specifically, early GNNs were implemented using spectral theory, while others were developed based on spatial theory. This divergence between spectral and spatial methodologies renders direct comparisons challenging. Moreover, the multitude of models within each domain further complicates the evaluation of their respective strengths and weaknesses.

In this half-day tutorial, we examine the state-of-the-art in GNNs and introduce a comprehensive framework that bridges the spatial and spectral domains, elucidating their complex interrelationship. This emphasis on a comprehensive framework enhances our understanding of GNN operations. The tutorial’s objective is to explore the interplay between key paradigms, such as spatial and spectral-based methods, through a synthesis of spectral graph theory and approximation theory. We provide an in-depth analysis of the latest research developments in GNNs in this tutorial, including discussions on emerging issues like over-smoothing. A range of well-established GNN models will be utilized to illustrate the universality of our proposed framework.
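To make the spectral-spatial connection tangible, the snippet below implements one Graph Convolutional Network layer in NumPy: spectrally it is a first-order polynomial graph filter, while spatially it is simply degree-normalized neighborhood averaging followed by a linear map. The toy graph and dimensions are illustrative.

```python
import numpy as np

def gcn_layer(adj, features, weight):
    """One Graph Convolutional Network layer.  Spectrally, it is a first-order
    approximation of a spectral graph filter; spatially, it is just averaging
    each node's (self-looped, degree-normalized) neighborhood features."""
    a_hat = adj + np.eye(adj.shape[0])              # add self-loops
    deg = a_hat.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    norm_adj = d_inv_sqrt @ a_hat @ d_inv_sqrt      # D^-1/2 (A + I) D^-1/2
    return np.maximum(norm_adj @ features @ weight, 0.0)   # ReLU activation

# Toy graph with 4 nodes, 3 input features per node, 2 output features.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
print(gcn_layer(adj, rng.standard_normal((4, 3)), rng.standard_normal((3, 2))))
```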

Tutorial
Mike Zheng Shou · Jay Zhangjie Wu · Deepti Ghadiyaram

[ Summit 437 - 439 ]

Abstract

In the past year, the landscape of video generation has transformed dramatically, achieving remarkable strides from rudimentary outputs to strikingly realistic videos. Central to this evolution are diffusion models, which have become a cornerstone technology in pushing the boundaries of what's possible in video generation. This tutorial will delve into the critical role of diffusion models in video generation and modeling.
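For readers new to the area, the sketch below shows the core training step shared by image and video diffusion models: noise clean data at a random timestep and train a network to predict that noise. The `denoiser(x_t, t)` interface and the noise-schedule tensor are assumptions for illustration; video models add temporal dimensions and conditioning on text or images.

```python
import torch

def diffusion_training_step(denoiser, clean_frames, alphas_cumprod):
    """Core training step of a denoising diffusion model: add noise to clean
    data at a random timestep and train the network to predict that noise.
    `denoiser(x_t, t)` is a hypothetical noise-prediction model; for video,
    clean_frames would carry an extra temporal dimension."""
    batch = clean_frames.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (batch,),
                      device=clean_frames.device)
    a_bar = alphas_cumprod[t].view(batch, *([1] * (clean_frames.dim() - 1)))
    noise = torch.randn_like(clean_frames)
    noisy = a_bar.sqrt() * clean_frames + (1 - a_bar).sqrt() * noise  # q(x_t | x_0)
    predicted = denoiser(noisy, t)
    return torch.nn.functional.mse_loss(predicted, noise)
```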

Participants will engage in a deep dive into the broad spectrum of topics related to video generative models. We will start with the foundational elements, including the core principles of video foundation models. The session will then extend to explore specific applications such as image-to-video animation, video editing, and motion customization. A significant focus will also be placed on the evaluation of video diffusion models, as well as on safety technologies to mitigate the potential risks of using these models.

Attendees will leave this tutorial with a comprehensive understanding of both fundamental techniques and the cutting-edge advancements in diffusion-based video modeling, fully equipped to navigate and contribute to the rapidly evolving field in the GenAI era.