CVPR 2024 Schedule

MON 17 JUN

7 a.m.

Registration / Badge Pickup

(ends 5:00 PM)

Break:

Breakfast

(ends 9:00 AM)

8 a.m.

Workshop:

Efficient Large Vision Models

(ends 12:35 PM)

Workshop:

Computer Vision for Mixed Reality

(ends 12:45 PM)

Workshop:

9th New Trends in Image Restoration and Enhancement Workshop and Challenges

(ends 6:00 PM)

Workshop:

8th AI City Challenge

(ends 5:30 PM)

Workshop:

Domain adaptation, Explainability and Fairness in AI for Medical Image Analysis (DEF-AI-MIA)

(ends 1:00 PM)

8:25 a.m.

Workshop:

Multimodal Algorithmic Reasoning Workshop

(ends 12:15 PM)

Workshop:

SyntaGen: Harnessing Generative Models for Synthetic Visual Datasets

(ends 12:35 PM)

8:30 a.m.

Workshop:

2nd Workshop on Scene Graphs and Graph Representation Learning

(ends 12:00 PM)

Workshop:

The 3rd International Workshop on Federated Learning for Computer Vision (FedVision-2024)

(ends 5:30 PM)

Workshop:

1st Workshop on Dataset Distillation for Computer Vision

(ends 5:30 PM)

Workshop:

VAND 2.0: Visual Anomaly and Novelty Detection

(ends 5:30 PM)

Workshop:

First Joint Egocentric Vision (EgoVis) Workshop

(ends 6:15 PM)

Workshop:

4th International Workshop on Long-form Video Understanding: Towards Multimodal AI Assistant and Copilot

(ends 5:30 PM)

Workshop:

5th International Workshop on Large Scale Holistic Video Understanding

(ends 12:00 PM)

Workshop:

New Challenges in 3D Human Understanding

(ends 1:00 PM)

Workshop:

ViLMa – Visual Localization and Mapping

(ends 5:30 PM)

Workshop:

4th Workshop on Physics Based Vision meets Deep Learning (PBDL2024)

(ends 5:30 PM)

Workshop:

The 7th Workshop and Challenge Bridging the Gap between Computational Photography and Visual Recognition (UG2+)

(ends 12:00 PM)

Workshop:

4th Mobile AI Workshop and Challenges

(ends 5:30 PM)

Workshop:

AI for Content Creation (AI4CC)

(ends 5:30 PM)

Workshop:

AI for 3D Generation

(ends 5:30 PM)

Workshop:

Workshop on Computer Vision for Fashion, Art, and Design

(ends 12:00 PM)

Workshop:

The 5th Face Anti-Spoofing Workshop

(ends 12:00 PM)

Workshop:

2nd Workshop on Multimodal Content Moderation

(ends 6:00 PM)

Workshop:

The Fifth Workshop on Fair, Data-efficient, and Trusted Computer Vision

(ends 5:30 PM)

Workshop:

1st Workshop on Urban Scene Modeling: Where Vision Meets Photogrammetry and Graphics

(ends 5:30 PM)

Workshop:

CV4Science 2025: Using Computer Vision for the Sciences

(ends 5:30 PM)

Workshop:

4th Workshop on CV4Animals: Computer Vision for Animal Behavior Tracking and Modeling

(ends 5:30 PM)

Workshop:

Foundation Models for Medical Vision

(ends 6:00 PM)

Workshop:

2nd Workshop on Foundation Models

(ends 5:30 PM)

Workshop:

The 4th Workshop of Adversarial Machine Learning on Computer Vision: Robustness of Foundation Models

(ends 5:30 PM)

Workshop:

First Workshop on Efficient and On-Device Generation (EDGE)

(ends 5:30 PM)

Workshop:

AIS: Vision, Graphics and AI for Streaming

(ends 5:30 PM)

Workshop:

Computer Vision in the Wild

(ends 5:30 PM)

Workshop:

MetaFood Workshop (MTF)

(ends 12:30 PM)

8:45 a.m.

Workshop:

Tool-Augmented Vision Workshop

(ends 12:45 PM)

Workshop:

Second Workshop for Learning 3D with Multi-View Supervision

(ends 5:30 PM)

9 a.m.

Workshop:

Sight and Sound

(ends 6:00 PM)

Workshop:

Prompting in Vision

(ends 5:30 PM)

Workshop:

Causal and Object-Centric Representations for Robotics

(ends 5:00 PM)

Workshop:

CVPR 2024 Biometrics Workshop

(ends 5:30 PM)

Workshop:

7th Workshop on Autonomous Driving (WAD)

(ends 6:00 PM)

Workshop:

AI4Space 2024

(ends 12:10 PM)

Workshop:

Foundation Models for Autonomous Systems

(ends 5:00 PM)

Workshop:

EarthVision: Large Scale Computer Vision for Remote Sensing Imagery

(ends 5:30 PM)

Tutorial:

Disentanglement and Compositionality in Computer Vision

(ends 12:00 PM)

Tutorial:

Deep Stereo Matching in the Twenties

(ends 12:00 PM)

Tutorial:

Recent Advances in Vision Foundation Models

(ends 5:00 PM)

Tutorial:

Machine Unlearning in Computer Vision: Foundations and Applications

(ends 12:00 PM)

Tutorial:

SCENIC: An Open-Source Probabilistic Programming System for Data Generation and Safety in AI-Based Autonomy

(ends 12:00 PM)

10 a.m.

Break:

Coffee Break

(ends 11:00 AM)

noon

Break:

Lunch

(ends 1:45 PM)

12:45 p.m.

Workshop:

CV 20/20: A Retrospective Vision

(ends 6:05 PM)

1 p.m.

Workshop:

Image Matching: Local Features and Beyond

(ends 5:45 PM)

Workshop:

Workshop on TDLCV: Topological Deep Learning for Computer Vision

(ends 5:45 PM)

Workshop:

2nd Workshop on Embodied "Humans": Symbiotic Intelligence between Virtual Humans and Humanoid Robots

(ends 5:45 PM)

Workshop:

Data Curation and Augmentation in Enhancing Medical Imaging Applications

(ends 6:00 PM)

Workshop:

GenAI Media Generation Challenge for Computer Vision Workshop

(ends 5:30 PM)

1:20 p.m.

Workshop:

Rhobin 2024: The second Rhobin challenge on Reconstruction of Human-Object Interaction

(ends 6:00 PM)

1:30 p.m.

Workshop:

Fifth Workshop on Neural Architecture Search

(ends 5:30 PM)

Workshop:

Multimodalities for 3D Scenes

(ends 5:30 PM)

Workshop:

Pixel-level Video Understanding in the Wild Challenge

(ends 5:30 PM)

Workshop:

Neural Rendering Intelligence

(ends 5:30 PM)

Workshop:

Workshop on Graphic Design Understanding and Generation (GDUG)

(ends 5:30 PM)

Workshop:

2nd Workshop and Challenge on DeepFake Analysis and Detection

(ends 5:30 PM)

Workshop:

Ethical Considerations in Creative Applications of Computer Vision

(ends 5:30 PM)

Workshop:

The Seventh International Workshop on Computer Vision for Physiological Measurement (CVPM)

(ends 5:30 PM)

Workshop:

3rd Workshop on Vision Datasets Understanding and DataCV Challenge

(ends 5:30 PM)

Workshop:

Workshop on Virtual Try-On

(ends 6:00 PM)

Tutorial:

Geospatial Computer Vision and Machine Learning for Large-Scale Earth Observation Data

(ends 5:00 PM)

Tutorial:

Object-centric Representations in Computer Vision

(ends 6:00 PM)

Tutorial:

Robustness at Inference: Towards Explainability, Uncertainty, and Intervenability

(ends 5:30 PM)

Tutorial:

Efficient Homotopy Continuation for Solving Polynomial Systems in Computer Vision Applications

(ends 6:00 PM)

2 p.m.

Tutorial:

Edge AI in Action: Practical Approaches to Developing and Deploying Optimized Models

(ends 5:30 PM)

3 p.m.

Break:

Coffee Break

(ends 4:00 PM)

TUE 18 JUN

7 a.m.

Registration / Badge Pickup

(ends 5:00 PM)

Break:

Breakfast

(ends 9:00 AM)

7:50 a.m.

Workshop:

The 3rd Workshop on Transformers for Vision

(ends 6:00 PM)

8 a.m.

Workshop:

VizWiz Grand Challenge: Describing Images and Videos Taken by Blind People

(ends 12:05 PM)

Workshop:

Agriculture-Vision: Challenges & Opportunities for Computer Vision in Agriculture

(ends 6:00 PM)

8:10 a.m.

Workshop:

2nd Face Recognition Challenge in the Era of Synthetic Data (FRCSyn)

(ends 12:00 PM)

Workshop:

Computer Vision for Materials Science Workshop

(ends 12:50 PM)

8:20 a.m.

Workshop:

The Future of Generative Visual Art

(ends 5:40 PM)

8:30 a.m.

Workshop:

Equivariant Vision: From Theory to Practice

(ends 5:30 PM)

Workshop:

The 7th Workshop on Efficient Deep Learning for Computer Vision

(ends 5:30 PM)

Workshop:

1st Workshop on Test-Time Adaptation: Model, Adapt Thyself! (MAT)

(ends 12:45 PM)

Workshop:

5th Workshop on Continual Learning in Computer Vision (CLVISION)

(ends 5:30 PM)

Workshop:

Computer Vision with Humans in the Loop

(ends 5:45 PM)

Workshop:

Visual Perception via Learning in an Open World

(ends 5:30 PM)

Workshop:

What is Next in Video Understanding?

(ends 1:00 PM)

Workshop:

Workshop on Human Motion Generation

(ends 12:00 PM)

Workshop:

7th International Workshop on Visual Odometry and Computer Vision Applications Based on Location Clues

(ends 5:30 PM)

Workshop:

2nd Workshop on Compositional 3D Vision

(ends 5:30 PM)

Workshop:

XRNeRF: Second Workshop on Advances in Radiance Fields for the Metaverse

(ends 12:30 PM)

Workshop:

Women in Computer Vision

(ends 1:30 PM)

Workshop:

LatinX in Computer Vision Research Workshop

(ends 6:00 PM)

Workshop:

The 5th Omnidirectional Computer Vision Workshop

(ends 12:00 PM)

Workshop:

Third Workshop of Mobile Intelligent Photography & Imaging

(ends 12:20 PM)

Workshop:

The 3rd Explainable AI for Computer Vision (XAI4CV) Workshop

(ends 5:30 PM)

Workshop:

Workshop on Responsible Data

(ends 5:30 PM)

Workshop:

Data-Driven Autonomous Driving Simulation (DDASD)

(ends 5:30 PM)

Workshop:

9th Workshop on Computer Vision for Microscopy Image Analysis

(ends 6:00 PM)

Workshop:

2nd Workshop on "What is Next in Multimodal Foundation Models?"

(ends 1:00 PM)

Workshop:

2nd Workshop on Generative Models for Computer Vision

(ends 5:30 PM)

Workshop:

ReGenAI: First Workshop on Responsible Generative AI

(ends 12:30 PM)

Workshop:

Synthetic Data for Computer Vision

(ends 5:30 PM)

Workshop:

RetailVision - Field Overview and Amazon Deep Dive

(ends 6:00 PM)

Workshop:

GAZE 2024: The 6th International Workshop on Gaze Estimation and Prediction in the Wild

(ends 12:30 PM)

Workshop:

10th IEEE International Workshop on Computer Vision in Sports (CVsports)

(ends 5:30 PM)

Tutorial:

3D/4D Generation and Modeling with Generative Priors

(ends 12:00 PM)

Tutorial:

Edge-Optimized Deep Learning: Harnessing Generative AI and Computer Vision with Open-Source Libraries

(ends 5:00 PM)

Tutorial:

Generalist Agent AI

(ends 12:00 PM)

8:45 a.m.

Workshop:

FGVC11: 11th Workshop on Fine-grained Visual Categorization

(ends 4:45 PM)

Workshop:

IEEE International Workshop on Computational Cameras and Displays

(ends 4:55 PM)

Workshop:

8th Workshop on Media Forensics

(ends 5:00 PM)

8:50 a.m.

Workshop:

ScanNet++ Novel View Synthesis and 3D Semantic Understanding Challenge

(ends 12:30 PM)

Workshop:

The 5th Annual Embodied AI Workshop

(ends 5:30 PM)

Workshop:

Towards 3D Foundation Models: Progress and Prospects

(ends 5:00 PM)

9 a.m.

Workshop:

7th MUltimodal Learning and Applications

(ends 6:00 PM)

Workshop:

New frontiers for zero-shot Image Captioning Evaluation (NICE)

(ends 5:00 PM)

Workshop:

L3D-IVU: 3rd Workshop on Learning with Limited Labelled Data for Image and Video Understanding

(ends 6:10 PM)

Workshop:

The Sixth Workshop on Deep Learning for Geometric Computing (DLGC 2024)

(ends 5:00 PM)

Workshop:

Safe Artificial Intelligence for All Domains (SAIAD)

(ends 5:00 PM)

Workshop:

Vision and Language for Autonomous Driving and Robotics (VLADR)

(ends 6:00 PM)

Workshop:

4th Workshop and Challenge on Computer Vision in the Built Environment for the Design, Construction, and Operation of Buildings

(ends 5:00 PM)

Tutorial:

Towards Building AGI in Autonomy and Robotics

(ends 12:00 PM)

Tutorial:

Contactless AI Healthcare using Cameras and Wireless Sensors

(ends 12:00 PM)

Tutorial:

All You Need to Know about Self-Driving

(ends 6:00 PM)

Tutorial:

All You Need To Know About Point Cloud Understanding

(ends 12:15 PM)

Tutorial:

Learning Deep Low-dimensional Models from High-Dimensional Data: From Theory to Practice

(ends 6:00 PM)

Tutorial:

Computational Design of Diverse Morphologies and Sensors for Vision and Robotics

(ends 5:00 PM)

9:30 a.m.

Workshop:

Social Presence with Codec Avatars

(ends 5:30 PM)

10 a.m.

Break:

Coffee Break

(ends 11:00 AM)

noon

Break:

Lunch

(ends 1:45 PM)

1 p.m.

Workshop:

Implicit Neural Representation for Vision

(ends 6:30 PM)

Workshop:

The First Workshop on the Evaluation of Generative Foundation Models

(ends 6:30 PM)

1:30 p.m.

Workshop:

The Sixth Workshop on Precognition: Seeing through the Future

(ends 5:30 PM)

Workshop:

5th Workshop on Robot Visual Perception in Human Crowded Environments

(ends 5:30 PM)

Workshop:

OpenSUN3D: 2nd Workshop on Open-Vocabulary 3D Scene Understanding

(ends 5:30 PM)

Workshop:

Representation Learning with Very Limited Images: Zero-shot, Unsupervised, and Synthetic Learning in the Era of Big Models

(ends 5:30 PM)

Workshop:

EgoMotion: Egocentric Body Motion Tracking, Synthesis and Action Recognition

(ends 6:00 PM)

Workshop:

Learning from Procedural Videos and Language: What is Next?

(ends 6:00 PM)

Workshop:

New Trends in Multimodal Human Action Perception, Understanding and Generation

(ends 6:00 PM)

Workshop:

(3rd) Monocular Depth Estimation Challenge

(ends 5:30 PM)

Workshop:

1st Workshop on Neural Volumetric Video

(ends 5:50 PM)

Workshop:

20th Workshop on Perception Beyond the Visible Spectrum

(ends 5:30 PM)

Workshop:

Embedded Vision Workshop

(ends 5:30 PM)

Workshop:

6th Workshop and Competition on Affective Behavior Analysis in-the-wild

(ends 6:00 PM)

Workshop:

AVA: Accessibility, Vision and Autonomy Meet

(ends 5:30 PM)

Tutorial:

From Multimodal LLM to Human-level AI: Modality, Instruction, Reasoning and Beyond

(ends 6:00 PM)

Tutorial:

End-to-End Autonomy: A New Era of Self-Driving

(ends 6:00 PM)

Tutorial:

Full-Stack, GPU-based Acceleration of Deep Learning

(ends 5:00 PM)

2 p.m.

Tutorial:

Unifying Graph Neural Networks across Spatial and Spectral Domains

(ends 5:00 PM)

Tutorial:

Diffusion-based Video Generative Models

(ends 5:00 PM)

3 p.m.

Break:

Coffee Break

(ends 4:00 PM)

WED 19 JUN

7 a.m.

Registration / Badge Pickup

(ends 5:00 PM)

Break:

Breakfast

(ends 9:00 AM)

8:30 a.m.

Remarks:

Welcome & Awards

(ends 9:00 AM)

9 a.m.

Orals 1A Low-level vision [9:00-10:30]

[9:00] Specularity Factorization for Low-Light Enhancement

[9:18] FlowIE: Efficient Image Enhancement via Rectified Flow

[9:36] Towards Robust Event-guided Low-Light Image Enhancement: A Large-Scale Real-World Event-Image Dataset and Novel Approach

[9:54] Bilateral Event Mining and Complementary for Event Stream Super-Resolution

[10:12] FMA-Net: Flow-Guided Dynamic Filtering and Iterative Feature Refinement with Multi-Attention for Joint Video Super-Resolution and Deblurring

(ends 10:30 AM)

Orals 1B Vision and Graphics [9:00-10:30]

[9:00] GPLD3D: Latent Diffusion of 3D Shape Generative Models by Enforcing Geometric and Physical Priors

[9:18] Retrieval-Augmented Layout Transformer for Content-Aware Layout Generation

[9:36] Eclipse: Disambiguating Illumination and Materials using Unintended Shadows

[9:54] Objects as Volumes: A Stochastic Geometry View of Opaque Solids

[10:12] DiffusionLight: Light Probes for Free by Painting a Chrome Ball

(ends 10:30 AM)

Orals 1C Humans: Face, body, pose, gesture, movement [9:00-10:30]

[9:00] MultiPly: Reconstruction of Multiple People from Monocular Video in the Wild

[9:18] URHand: Universal Relightable Hands

[9:36] Relightable Gaussian Codec Avatars

[9:54] Semantic Human Mesh Reconstruction with Textures

[10:12] Stratified Avatar Generation from Sparse Observations

(ends 10:30 AM)

10:30 a.m.

Demonstration:

Demos

(ends 6:45 PM)

Poster Session 1 & Exhibit Hall [10:30-12:00]

SEAS: ShapE-Aligned Supervision for Person Re-Identification

Test-Time Domain Generalization for Face Anti-Spoofing

Gradient Alignment for Cross-Domain Face Anti-Spoofing

BigGait: Learning Gait Representation You Want by Large Vision Models

Suppress and Rebalance: Towards Generalized Multi-Modal Face Anti-Spoofing

CFPL-FAS: Class Free Prompt Learning for Generalizable Face Anti-spoofing

Psychometry: An Omnifit Model for Image Reconstruction from Human Brain Activity

KeyPoint Relative Position Encoding for Face Recognition

Distilling CLIP with Dual Guidance for Learning Discriminative Human Body Shape Representation

Flexible Biometrics Recognition: Bridging the Multimodality Gap through Attention Alignment and Prompt Tuning

One-Class Face Anti-spoofing via Spoof Cue Map-Guided Feature Learning

Activity-Biometrics: Person Identification from Daily Activities

Privacy-Preserving Face Recognition Using Trainable Feature Subtraction

Molecular Data Programming: Towards Molecule Pseudo-labeling with Systematic Weak Supervision

Clustering for Protein Representation Learning

Fun with Flags: Robust Principal Directions via Flag Manifolds

CAM Back Again: Large Kernel CNNs from a Weakly Supervised Object Localization Perspective

Confronting Ambiguity in 6D Object Pose Estimation via Score-Based Diffusion on SE(3)

Quantifying Task Priority for Multi-Task Optimization

Unbiased Estimator for Distorted Conics in Camera Calibration

Multi-Object Tracking in the Dark

Implicit Discriminative Knowledge Learning for Visible-Infrared Person Re-Identification

From Correspondences to Pose: Non-minimal Certifiably Optimal Relative Pose without Disambiguation

From Activation to Initialization: Scaling Insights for Optimizing Neural Fields

PairDETR: Joint Detection and Association of Human Bodies and Faces

Move as You Say, Interact as You Can: Language-guided Human Motion Generation with Scene Affordance

OAKINK2: A Dataset of Bimanual Hands-Object Manipulation in Complex Task Completion

Seamless Human Motion Composition with Blended Positional Encodings

VideoRF: Rendering Dynamic Radiance Fields as 2D Feature Video Streams

OMG: Towards Open-vocabulary Motion Generation via Mixture of Controllers

HOLD: Category-agnostic 3D Reconstruction of Interacting Hands and Objects from Video

HUGS: Human Gaussian Splats

HOI-M^3: Capture Multiple Humans and Objects Interaction within Contextual Environment

InterHandGen: Two-Hand Interaction Generation via Cascaded Reverse Diffusion

SiTH: Single-view Textured Human Reconstruction with Image-Conditioned Diffusion

4D-DRESS: A 4D Dataset of Real-World Human Clothing With Semantic Annotations

MultiPly: Reconstruction of Multiple People from Monocular Video in the Wild

FinePOSE: Fine-Grained Prompt-Driven 3D Human Pose Estimation via Diffusion Models

Real-Time Simulated Avatar from Head-Mounted Sensors

Digital Life Project: Autonomous 3D Characters with Social Intelligence

Learning Visual Prompt for Gait Recognition

Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation

LocLLM: Exploiting Generalizable Human Keypoint Localization via Large Language Model

Spatial-Aware Regression for Keypoint Localization

GaussianAvatar: Towards Realistic Human Avatar Modeling from a Single Video via Animatable 3D Gaussians

HHMR: Holistic Hand Mesh Recovery by Enhancing the Multimodal Controllability of Graph Diffusion Models

Capturing Closely Interacted Two-Person Motions with Reaction Priors

SyncTalk: The Devil is in the Synchronization for Talking Head Synthesis

Single-to-Dual-View Adaptation for Egocentric 3D Hand Pose Estimation

Bidirectional Autoregressive Diffusion Model for Dance Generation

High-Quality Facial Geometry and Appearance Capture at Home

Multiple View Geometry Transformers for 3D Human Pose Estimation

PACER+: On-Demand Pedestrian Animation Controller in Driving Scenarios

I'M HOI: Inertia-aware Monocular Capture of 3D Human-Object Interactions

HAVE-FUN: Human Avatar Reconstruction from Few-Shot Unconstrained Images

Can Language Beat Numerical Regression? Language-Based Multimodal Trajectory Prediction

3D Human Pose Perception from Egocentric Stereo Videos

Egocentric Whole-Body Motion Capture with FisheyeViT and Diffusion-Based Motion Refinement

Human Gaussian Splatting: Real-time Rendering of Animatable Avatars

OHTA: One-shot Hand Avatar via Data-driven Implicit Priors

HOIAnimator: Generating Text-prompt Human-object Animations using Novel Perceptive Diffusion Models

Arbitrary Motion Style Transfer with Multi-condition Motion Latent Diffusion Model

Single-View Scene Point Cloud Human Grasp Generation

Attention-Propagation Network for Egocentric Heatmap to 3D Pose Lifting

URHand: Universal Relightable Hands

AnySkill: Learning Open-Vocabulary Physical Skill for Interactive Agents

From a Bird's Eye View to See: Joint Camera and Subject Registration without the Camera Calibration

HMD-Poser: On-Device Real-time Human Motion Tracking from Scalable Sparse Observations

Monocular Identity-Conditioned Facial Reflectance Reconstruction

GAvatar: Animatable 3D Gaussian Avatars with Implicit Mesh Learning

Score-Guided Diffusion for 3D Human Recovery

3D-Aware Face Editing via Warping-Guided Latent Direction Learning

WANDR: Intention-guided Human Motion Generation

Exploring Vision Transformers for 3D Human Motion-Language Models with Motion Patches

NIFTY: Neural Object Interaction Fields for Guided Human Motion Synthesis

DreamAvatar: Text-and-Shape Guided 3D Human Avatar Generation via Diffusion Models

Person-in-WiFi 3D: End-to-End Multi-Person 3D Pose Estimation with Wi-Fi

ScoreHypo: Probabilistic Human Mesh Estimation with Hypothesis Scoring

Relightable and Animatable Neural Avatar from Sparse-View Video

Relightable Gaussian Codec Avatars

From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations

Closely Interactive Human Reconstruction with Proxemics and Physics-Guided Adaption

Video-Based Human Pose Regression via Decoupled Space-Time Aggregation

Rethinking Generalizable Face Anti-spoofing via Hierarchical Prototype-guided Distribution Refinement in Hyperbolic Space

MoML: Online Meta Adaptation for 3D Human Motion Prediction

KITRO: Refining Human Mesh by 2D Clues and Kinematic-tree Rotation

Guess The Unseen: Dynamic 3D Scene Reconstruction from Partial 2D Glimpses

PEGASUS: Personalized Generative 3D Avatars with Composable Attributes

Semantic Human Mesh Reconstruction with Textures

SDPose: Tokenized Pose Estimation via Circulation-Guide Self-Distillation

Mocap Everyone Everywhere: Lightweight Motion Capture With Smartwatches and a Head-Mounted Camera

DPMesh: Exploiting Diffusion Prior for Occluded Human Mesh Recovery

DPHMs: Diffusion Parametric Head Models for Depth-based Tracking

KTPFormer: Kinematics and Trajectory Prior Knowledge-Enhanced Transformer for 3D Human Pose Estimation

Exploiting Style Latent Flows for Generalizing Deepfake Video Detection

EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling

A Unified Framework for Human-centric Point Cloud Video Understanding

ASH: Animatable Gaussian Splats for Efficient and Photoreal Human Rendering

CLOAF: CoLlisiOn-Aware Human Flow

EventEgo3D: 3D Human Motion Capture from Egocentric Event Streams

A Call to Reflect on Evaluation Practices for Age Estimation: Comparative Analysis of the State-of-the-Art and a Unified Benchmark

Holoported Characters: Real-time Free-viewpoint Rendering of Humans from Sparse RGB Cameras

Synergistic Global-space Camera and Human Reconstruction from Videos

3D Face Tracking from 2D Video through Iterative Dense UV to Image Flow

UltrAvatar: A Realistic Animatable 3D Avatar Diffusion Model with Authenticity Guided Textures

OmniMotionGPT: Animal Motion Generation with Limited Data

Text-Guided 3D Face Synthesis - From Generation to Editing

Multi-scale Dynamic and Hierarchical Relationship Modeling for Facial Action Units Recognition

LiveHPS: LiDAR-based Scene-level Human Pose and Shape Estimation in Free Environment

FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio

OpticalDR: A Deep Optical Imaging Model for Privacy-Protective Depression Recognition

SCE-MAE: Selective Correspondence Enhancement with Masked Autoencoder for Self-Supervised Landmark Estimation

TokenHMR: Advancing Human Mesh Recovery with a Tokenized Pose Representation

Optimizing Diffusion Noise Can Serve As Universal Motion Priors

M&M VTO: Multi-Garment Virtual Try-On and Editing

AvatarGPT: All-in-One Framework for Motion Understanding, Planning, Generation and Beyond

A Simple Baseline for Efficient Hand Mesh Reconstruction

VINECS: Video-based Neural Character Skinning

ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis

Programmable Motion Generation for Open-Set Motion Control Tasks

From Feature to Gaze: A Generalizable Replacement of Linear Layer for Gaze Estimation

Unsupervised Gaze Representation Learning from Multi-view Face Images

Joint2Human: High-Quality 3D Human Generation via Compact Spherical Embedding of 3D Joints

DiffHuman: Probabilistic Photorealistic 3D Reconstruction of Humans

Bi-Causal: Group Activity Recognition via Bidirectional Causality

HumanNeRF-SE: A Simple yet Effective Approach to Animate HumanNeRF with Diverse Poses

LPSNet: End-to-End Human Pose and Shape Estimation with Lensless Imaging

MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model

RTMO: Towards High-Performance One-Stage Real-Time Multi-Person Pose Estimation

Human Motion Prediction Under Unexpected Perturbation

Cross-view and Cross-pose Completion for 3D Human Understanding

Lodge: A Coarse to Fine Diffusion Network for Long Dance Generation Guided by the Characteristic Dance Primitives

GALA: Generating Animatable Layered Assets from a Single Scan

MMM: Generative Masked Motion Model

What Do You See in Vehicle? Comprehensive Vision Solution for In-Vehicle Gaze Estimation

Towards Variable and Coordinated Holistic Co-Speech Motion Generation

Text2HOI: Text-guided 3D Motion Generation for Hand-Object Interaction

Garment Recovery with Shape and Deformation Priors

Tri-Modal Motion Retrieval by Learning a Joint Embedding Space

SplattingAvatar: Realistic Real-Time Human Avatars with Mesh-Embedded Gaussian Splatting

Multi-agent Long-term 3D Human Pose Forecasting via Interaction-aware Trajectory Conditioning

HardMo: A Large-Scale Hardcase Dataset for Motion Capture

LAFS: Landmark-based Facial Self-supervised Learning for Face Recognition

Motion Diversification Networks

NRDF: Neural Riemannian Distance Fields for Learning Articulated Pose Priors

3D Face Reconstruction with the Geometric Guidance of Facial Part Segmentation

Unifying Top-down and Bottom-up Scanpath Prediction Using Transformers

CLIB-FIQA: Face Image Quality Assessment with Confidence Calibration

MoST: Motion Style Transformer Between Diverse Action Contents

TexVocab: Texture Vocabulary-conditioned Human Avatars

Forecasting of 3D Whole-body Human Poses with Grasping Objects

Scaling Up Dynamic Human-Scene Interaction Modeling

Design2Cloth: 3D Cloth Generation from 2D Masks

ReGenNet: Towards Human Action-Reaction Synthesis

MoSAR: Monocular Semi-Supervised Model for Avatar Reconstruction using Differentiable Shading

FaceLift: Semi-supervised 3D Facial Landmark Localization

Fast Adaptation for Human Pose Estimation via Meta-Optimization

FlashAvatar: High-fidelity Head Avatar with Efficient Gaussian Embedding

AAMDM: Accelerated Auto-regressive Motion Diffusion Model

SynSP: Synergy of Smoothness and Precision in Pose Sequences Refinement

AiOS: All-in-One-Stage Expressive Human Pose and Shape Estimation

HumanRef: Single Image to 3D Human Generation via Reference-Guided Diffusion

Generating Human Motion in 3D Scenes from Text Descriptions

Stratified Avatar Generation from Sparse Observations

Locally Adaptive Neural 3D Morphable Models

IntrinsicAvatar: Physically Based Inverse Rendering of Dynamic Humans from Monocular Videos via Explicit Ray Tracing

Dynamic Inertial Poser (DynaIP): Part-Based Motion Dynamics Learning for Enhanced Human Pose Estimation with Sparse Inertial Sensors

MoMask: Generative Masked Modeling of 3D Human Motions

G-HOP: Generative Hand-Object Prior for Interaction Reconstruction and Grasp Synthesis

Dynamic Support Information Mining for Category-Agnostic Pose Estimation

Gaussian Head Avatar: Ultra High-fidelity Head Avatar via Dynamic Gaussians

Emotional Speech-driven 3D Body Animation via Disentangled Latent Diffusion

ProxyCap: Real-time Monocular Full-body Capture in World Space via Human-Centric Proxy-to-Motion Learning

MAS: Multi-view Ancestral Sampling for 3D Motion Generation Using 2D Diffusion

Efficient 3D Implicit Head Avatar with Mesh-anchored Hash Table Blendshapes

Neural Sign Actors: A Diffusion Model for 3D Sign Language Production from Text

RAM-Avatar: Real-time Photo-Realistic Avatar from Monocular Videos with Full-body Control

Sharingan: A Transformer Architecture for Multi-Person Gaze Following

Degrees of Freedom Matter: Inferring Dynamics from Point Trajectories

Authentic Hand Avatar from a Phone Scan via Universal Hand Model

UniHuman: A Unified Model For Editing Human Images in the Wild

BlockGCN: Redefine Topology Awareness for Skeleton-Based Action Recognition

GoMAvatar: Efficient Animatable Human Modeling from Monocular Video Using Gaussians-on-Mesh

WHAM: Reconstructing World-grounded Humans with Accurate 3D Motion

Self-Supervised Facial Representation Learning with Facial Region Awareness

ChatPose: Chatting about 3D Human Pose

AUEditNet: Dual-Branch Facial Action Unit Intensity Manipulation with Implicit Disentanglement

Towards a Simultaneous and Granular Identity-Expression Control in Personalized Face Generation

PoseIRM: Enhance 3D Human Pose Estimation on Unseen Camera Settings via Invariant Risk Minimization

Rethinking Human Motion Prediction with Symplectic Integral

Multimodal Sense-Informed Forecasting of 3D Human Motions

Semantics-aware Motion Retargeting with Vision-Language Models

Makeup Prior Models for 3D Facial Makeup Estimation and Applications

FaceCom: Towards High-fidelity 3D Facial Shape Completion via Optimization and Inpainting Guidance

When StyleGAN Meets Stable Diffusion: a W+ Adapter for Personalized Image Generation

MANUS: Markerless Grasp Capture using Articulated 3D Gaussians

Loose Inertial Poser: Motion Capture with IMU-attached Loose-Wear Jacket

Anatomically Constrained Implicit Face Models

DiffusionRegPose: Enhancing Multi-Person Pose Estimation using a Diffusion-Based End-to-End Regression Approach

A Dual-Augmentor Framework for Domain Generalization in 3D Human Pose Estimation

RELI11D: A Comprehensive Multimodal Human Motion Dataset and Method

Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model

HandDiff: 3D Hand Pose Estimation with Diffusion on Image-Point Cloud

Normalizing Flows on the Product Space of SO(3) Manifolds for Probabilistic Human Pose Modeling

Towards Robust 3D Pose Transfer with Adversarial Learning

PhysPT: Physics-aware Pretrained Transformer for Estimating Human Dynamics from Monocular Videos

HumMUSS: Human Motion Understanding using State Space Models

MultiPhys: Multi-Person Physics-aware 3D Motion Estimation

Physics-Aware Hand-Object Interaction Denoising

HOIST-Former: Hand-held Objects Identification, Segmentation, and Tracking in the Wild

SCULPT: Shape-Conditioned Unpaired Learning of Pose-dependent Clothed and Textured Human Meshes

PFStorer: Personalized Face Restoration and Super-Resolution

MS-MANO: Enabling Hand Pose Tracking with Biomechanical Constraints

BOTH2Hands: Inferring 3D Hands from Both Text Prompts and Body Dynamics

MeshPose: Unifying DensePose and 3D Body Mesh Reconstruction

CustomListener: Text-guided Responsive Interaction for User-friendly Listening Head Generation

Generalizable Face Landmarking Guided by Conditional Face Warping

Skeleton-in-Context: Unified Skeleton Sequence Modeling with In-Context Learning

A Unified and Interpretable Emotion Representation and Expression Generation

Artist-Friendly Relightable and Animatable Neural Heads

HanDiffuser: Text-to-Image Generation With Realistic Hand Appearances

BodyMAP - Jointly Predicting Body Mesh and 3D Applied Pressure Map for People in Bed

3D Facial Expressions through Analysis-by-Neural-Synthesis

SelfPose3d: Self-Supervised Multi-Person Multi-View 3d Pose Estimation

DiffusionPoser: Real-time Human Motion Reconstruction From Arbitrary Sparse Sensors Using Autoregressive Diffusion

Specularity Factorization for Low-Light Enhancement

Learning Diffusion Texture Priors for Image Restoration

Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution

Enhancing Video Super-Resolution via Implicit Resampling-based Alignment

Boosting Neural Representations for Videos with a Conditional Decoder

FlowIE: Efficient Image Enhancement via Rectified Flow

Restoration by Generation with Constrained Priors

Towards Robust Event-guided Low-Light Image Enhancement: A Large-Scale Real-World Event-Image Dataset and Novel Approach

Bilateral Event Mining and Complementary for Event Stream Super-Resolution

Fantastic Animals and Where to Find Them: Segment Any Marine Animal with Dual SAM

Estimating Extreme 3D Image Rotations using Cascaded Attention

Learned Scanpaths Aid Blind Panoramic Video Quality Assessment

Automatic Controllable Colorization via Imagination

Reconstruction-free Cascaded Adaptive Compressive Sensing

A Semi-supervised Nighttime Dehazing Baseline with Spatial-Frequency Aware and Realistic Brightness Constraint

AdaBM: On-the-Fly Adaptive Bit Mapping for Image Super-Resolution

Beyond Image Super-Resolution for Image Recognition with Task-Driven Perceptual Loss

Boosting Image Quality Assessment through Efficient Transformer Adaptation with Local Feature Enhancement

Blur-aware Spatio-temporal Sparse Transformer for Video Deblurring

XFeat: Accelerated Features for Lightweight Image Matching

RecDiffusion: Rectangling for Image Stitching with Diffusion Models

Unsupervised Salient Instance Detection

FINER: Flexible Spectral-bias Tuning in Implicit NEural Representation by Variable-periodic Activation Functions

FMA-Net: Flow-Guided Dynamic Filtering and Iterative Feature Refinement with Multi-Attention for Joint Video Super-Resolution and Deblurring

Robust Image Denoising through Adversarial Frequency Mixup

Efficient Multi-scale Network with Learnable Discrete Wavelet Transform for Blind Motion Deblurring

Efficient Scene Recovery Using Luminous Flux Prior

Perception-Oriented Video Frame Interpolation via Asymmetric Blending

Modular Blind Video Quality Assessment

Residual Denoising Diffusion Models

JDEC: JPEG Decoding via Enhanced Continuous Cosine Coefficients

On the Robustness of Language Guidance for Low-Level Vision Tasks: Findings from Depth Estimation

Blur2Blur: Blur Conversion for Unsupervised Image Deblurring on Unknown Domains

Exploring Efficient Asymmetric Blind-Spots for Self-Supervised Denoising in Real-World Scenarios

Deep Equilibrium Diffusion Restoration with Parallel Sampling

PTM-VQA: Efficient Video Quality Assessment Leveraging Diverse PreTrained Models from the Wild

Depth Information Assisted Collaborative Mutual Promotion Network for Single Image Dehazing

Transcending the Limit of Local Window: Advanced Super-Resolution Transformer with Adaptive Token Dictionary

Improving Image Restoration through Removing Degradations in Textual Representations

Towards Real-World HDR Video Reconstruction: A Large-Scale Benchmark Dataset and A Two-Stage Alignment Network

Spatio-Temporal Turbulence Mitigation: A Translational Perspective

Boosting Image Restoration via Priors from Pre-trained Models

Misalignment-Robust Frequency Distribution Loss for Image Transformation

CoDe: An Explicit Content Decoupling Framework for Image Restoration

DSL-FIQA: Assessing Facial Image Quality via Dual-Set Degradation Learning and Landmark-Guided Transformer

CLIPtone: Unsupervised Learning for Text-based Image Tone Adjustment

Adapt or Perish: Adaptive Sparse Transformer with Attentive Feature Refinement for Image Restoration

CPGA: Coding Priors-Guided Aggregation Network for Compressed Video Quality Enhancement

Learning to Control Camera Exposure via Reinforcement Learning

Real-Time Exposure Correction via Collaborative Transformations and Adaptive Sampling

Towards Progressive Multi-Frequency Representation for Image Warping

HIR-Diff: Unsupervised Hyperspectral Image Restoration Via Improved Diffusion Models

ZERO-IG: Zero-Shot Illumination-Guided Joint Denoising and Adaptive Enhancement for Low-Light Images

Masked and Shuffled Blind Spot Denoising for Real-World Images

Continuous Optical Zooming: A Benchmark for Arbitrary-Scale Image Super-Resolution in Real World

Laplacian-guided Entropy Model in Neural Codec with Blur-dissipated Synthesis

SD2Event: Self-supervised Learning of Dynamic Detectors and Contextual Descriptors for Event Cameras

LLaFS: When Large Language Models Meet Few-Shot Segmentation

Telling Left from Right: Identifying Geometry-Aware Semantic Correspondence

One-Shot Open Affordance Learning with Foundation Models

CorrMatch: Label Propagation via Correlation Matching for Semi-Supervised Semantic Segmentation

Collaborating Foundation Models for Domain Generalized Semantic Segmentation

FocSAM: Delving Deeply into Focused Objects in Segmenting Anything

Finsler-Laplace-Beltrami Operators with Application to Shape Analysis

Neural Implicit Representation for Building Digital Twins of Unknown Articulated Objects

Putting the Object Back into Video Object Segmentation

BA-SAM: Scalable Bias-Mode Attention Mask for Segment Anything Model

Task-aligned Part-aware Panoptic Segmentation through Joint Object-Part Representations

Open-World Semantic Segmentation Including Class Similarity

Hierarchical Histogram Threshold Segmentation – Auto-terminating High-detail Oversegmentation

AlignSAM: Aligning Segment Anything Model to Open Context via Reinforcement Learning

SANeRF-HQ: Segment Anything for NeRF in High Quality

UniVS: Unified and Universal Video Segmentation with Prompts as Queries

RankED: Addressing Imbalance and Uncertainty in Edge Detection Using Ranking-based Losses

Event-assisted Low-Light Video Object Segmentation

Density-Guided Semi-Supervised 3D Semantic Segmentation with Dual-Space Hardness Sampling

Exploring Regional Clues in CLIP for Zero-Shot Semantic Segmentation

Category-Level Multi-Part Multi-Joint 3D Shape Assembly

SAI3D: Segment Any Instance in 3D Scenes

Towards the Uncharted: Density-Descending Feature Perturbation for Semi-supervised Semantic Segmentation

Hybrid Functional Maps for Crease-Aware Non-Isometric Shape Matching

Hunting Attributes: Context Prototype-Aware Learning for Weakly Supervised Semantic Segmentation

Self-Calibrating Vicinal Risk Minimisation for Model Calibration

ECLIPSE: Efficient Continual Learning in Panoptic Segmentation with Visual Prompt Tuning

Clustering Propagation for Universal Medical Image Segmentation

Addressing Background Context Bias in Few-Shot Segmentation through Iterative Modulation

Cross-Domain Few-Shot Segmentation via Iterative Support-Query Correspondence Mining

RankMatch: Exploring the Better Consistency Regularization for Semi-supervised Semantic Segmentation

QDFormer: Towards Robust Audiovisual Segmentation in Complex Environments with Quantization-based Semantic Decomposition

Frequency-Adaptive Dilated Convolution for Semantic Segmentation

SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation

PSDPM: Prototype-based Secondary Discriminative Pixels Mining for Weakly Supervised Semantic Segmentation

Coupled Laplacian Eigenmaps for Locally-Aware 3D Rigid Point Cloud Matching

Universal Segmentation at Arbitrary Granularity with Language Instruction

PartDistill: 3D Shape Part Segmentation by Vision-Language Model Distillation

HIT: Estimating Internal Human Implicit Tissues from the Body Surface

Open-Vocabulary Segmentation with Semantic-Assisted Calibration

GraCo: Granularity-Controllable Interactive Segmentation

Unleashing the Potential of SAM for Medical Adaptation via Hierarchical Decoding

EAGLE: Eigen Aggregation Learning for Object-Centric Unsupervised Semantic Segmentation

DuPL: Dual Student with Trustworthy Progressive Learning for Robust Weakly Supervised Semantic Segmentation

Vanishing-Point-Guided Video Semantic Segmentation of Driving Scenes

Diffuse, Attend, and Segment: Unsupervised Zero-Shot Segmentation using Stable Diffusion

ODIN: A Single Model for 2D and 3D Segmentation

Infer from What You Have Seen Before: Temporally-dependent Classifier for Semi-supervised Video Segmentation

Semantic-aware SAM for Point-Prompted Instance Segmentation

Class Tokens Infusion for Weakly Supervised Semantic Segmentation

Separate and Conquer: Decoupling Co-occurrence via Decomposition and Representation for Weakly Supervised Semantic Segmentation

Style Blind Domain Generalized Semantic Segmentation via Covariance Alignment and Semantic Consistence Contrastive Learning

AllSpark: Reborn Labeled Features from Unlabeled in Transformer for Semi-Supervised Semantic Segmentation

Unsupervised Semantic Segmentation Through Depth-Guided Feature Correlation and Sampling

PoNQ: a Neural QEM-based Mesh Representation

Spectral Meets Spatial: Harmonising 3D Shape Matching and Interpolation

CosalPure: Learning Concept from Group Images for Robust Co-Saliency Detection

ContextSeg: Sketch Semantic Segmentation by Querying the Context with Attention

Training-Free Open-Vocabulary Segmentation with Offline Diffusion-Augmented Prototype Generation

ASAM: Boosting Segment Anything Model with Adversarial Tuning

In-Context Matting

Weakly Supervised Point Cloud Semantic Segmentation via Artificial Oracle

Contextrast: Contextual Contrastive Learning for Semantic Segmentation

Parameter Efficient Fine-tuning via Cross Block Orchestration for Segment Anything Model

CADTalk: An Algorithm and Benchmark for Semantic Commenting of CAD Programs

Point2CAD: Reverse Engineering CAD Models from 3D Point Clouds

Rethinking Interactive Image Segmentation with Low Latency, High Quality, and Diverse Prompts

General Object Foundation Model for Images and Videos at Scale

Frozen CLIP: A Strong Backbone for Weakly Supervised Semantic Segmentation

Guided Slot Attention for Unsupervised Video Object Segmentation

Unlocking the Potential of Pre-trained Vision Transformers for Few-Shot Semantic Segmentation through Relationship Descriptors

Grounding Everything: Emerging Localization Properties in Vision-Language Transformers

No Time to Train: Empowering Non-Parametric Networks for Few-shot 3D Scene Segmentation

Continual Segmentation with Disentangled Objectness Learning and Class Recognition

GSVA: Generalized Segmentation via Multimodal Large Language Models

MaGGIe: Masked Guided Gradual Human Instance Matting

EFormer: Enhanced Transformer towards Semantic-Contour Features of Foreground for Portraits Matting

Segment Any Event Streams via Weighted Adaptation of Pivotal Tokens

PolarMatte: Fully Computational Ground-Truth-Quality Alpha Matte Extraction for Images and Video using Polarized Screen Matting

Segment Every Out-of-Distribution Object

Multi-view Aggregation Network for Dichotomous Image Segmentation

pix2gestalt: Amodal Segmentation by Synthesizing Wholes

Rethinking Prior Information Generation with CLIP for Few-Shot Segmentation

Image-to-Image Matching via Foundation Models: A New Perspective for Open-Vocabulary Semantic Segmentation

Domain Separation Graph Neural Networks for Saliency Object Ranking

DIOD: Self-Distillation Meets Object Discovery

DiverGen: Improving Instance Segmentation by Learning Wider Data Distribution with More Diverse Generative Data

Rethinking Few-shot 3D Point Cloud Semantic Segmentation

Training Vision Transformers for Semi-Supervised Semantic Segmentation

Open3DIS: Open-Vocabulary 3D Instance Segmentation with 2D Mask Guidance

Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models

Memory-Scalable and Simplified Functional Map Learning

MFP: Making Full Use of Probability Maps for Interactive Image Segmentation

Spherical Mask: Coarse-to-Fine 3D Point Cloud Instance Segmentation with Spherical Representation

Adaptive Bidirectional Displacement for Semi-Supervised Medical Image Segmentation

RobustSAM: Segment Anything Robustly on Degraded Images

LAKE-RED: Camouflaged Images Generation by Latent Background Knowledge Retrieval-Augmented Diffusion

Learn to Rectify the Bias of CLIP for Unsupervised Semantic Segmentation

CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation

Prompt-Driven Referring Image Segmentation with Instance Contrasting

Kandinsky Conformal Prediction: Efficient Calibration of Image Segmentation Algorithms

OVFoodSeg: Elevating Open-Vocabulary Food Image Segmentation via Image-Informed Textual Representation

Back to 3D: Few-Shot 3D Keypoint Detection with Back-Projected 2D Features

Deciphering ‘What’ and ‘Where’ Visual Pathways from Spectral Clustering of Layer-Distributed Neural Representations

Open Vocabulary Semantic Scene Sketch Understanding

USE: Universal Segment Embeddings for Open-Vocabulary Image Segmentation

Diff-Plugin: Revitalizing Details for Diffusion-based Low-level Tasks

XCube: Large-Scale 3D Generative Modeling using Sparse Voxel Hierarchies

SC-GS: Sparse-Controlled Gaussian Splatting for Editable Dynamic Scenes

StyLitGAN: Image-Based Relighting via Latent Control

GPLD3D: Latent Diffusion of 3D Shape Generative Models by Enforcing Geometric and Physical Priors

Image Sculpting: Precise Object Editing with 3D Geometry Control

Paint3D: Paint Anything 3D with Lighting-Less Texture Diffusion Models

Retrieval-Augmented Layout Transformer for Content-Aware Layout Generation

Holo-Relighting: Controllable Volumetric Portrait Relighting from a Single Image

Neural Fields as Distributions: Signal Processing Beyond Euclidean Space

Eclipse: Disambiguating Illumination and Materials using Unintended Shadows

TexOct: Generating Textures of 3D Models with Octree-based Diffusion

Differentiable Micro-Mesh Construction

TextureDreamer: Image-Guided Texture Synthesis Through Geometry-Aware Diffusion

As-Plausible-As-Possible: Plausibility-Aware Mesh Deformation Using 2D Diffusion Priors

Breathing Life Into Sketches Using Text-to-Video Priors

Real-Time Neural BRDF with Spherically Distributed Primitives

Paint-it: Text-to-Texture Synthesis via Deep Convolutional Texture Map Optimization and Physically-Based Rendering

Neural Super-Resolution for Real-time Rendering with Radiance Demodulation

DiffAvatar: Simulation-Ready Garment Optimization with Differentiable Simulation

Material Palette: Extraction of Materials from a Single Image

PhysGaussian: Physics-Integrated 3D Gaussians for Generative Dynamics

Differentiable Point-based Inverse Rendering

Objects as Volumes: A Stochastic Geometry View of Opaque Solids

Towards a Perceptual Evaluation Framework for Lighting Estimation

Vector Graphics Generation via Mutually Impulsed Dual-domain Diffusion

MatFuse: Controllable Material Generation with Diffusion Models

DiffusionLight: Light Probes for Free by Painting a Chrome Ball

TexTile: A Differentiable Metric for Texture Tileability

PIE-NeRF: Physics-based Interactive Elastodynamics with NeRF

HashPoint: Accelerated Point Searching and Sampling for Neural Rendering

3D Paintbrush: Local Stylization of 3D Shapes with Cascaded Score Distillation

DUDF: Differentiable Unsigned Distance Fields with Hyperbolic Scaling

Diffusion 3D Features (Diff3F): Decorating Untextured Shapes with Distilled Semantic Features

LeGO: Leveraging a Surface Deformation Network for Animatable Stylized Face Generation with One Example

Dr. Bokeh: DiffeRentiable Occlusion-aware Bokeh Rendering

DiffInDScene: Diffusion-based High-Quality 3D Indoor Scene Generation

LightOctree: Lightweight 3D Spatially-Coherent Indoor Lighting Estimation

SVGDreamer: Text Guided SVG Generation with Diffusion Model

Control4D: Efficient 4D Portrait Editing with Text

HumanNorm: Learning Normal Diffusion Model for High-quality and Realistic 3D Human Generation

Video2Game: Real-time, Interactive, Realistic and Browser-Compatible Environment from a Single Video

NIVeL: Neural Implicit Vector Layers for Text-to-Vector Generation

ESR-NeRF: Emissive Source Reconstruction Using LDR Multi-view Images

DreamPropeller: Supercharge Text-to-3D Generation with Parallel Sampling

GenesisTex: Adapting Image Denoising Diffusion to Texture Space

Mosaic-SDF for 3D Generative Models

NeRF Analogies: Example-Based Visual Attribute Transfer for NeRFs

Hyper-MD: Mesh Denoising with Customized Parameters Aware of Noise Intensity and Geometric Characteristics

QUADify: Extracting Meshes with Pixel-level Details and Materials from Images

SfmCAD: Unsupervised CAD Reconstruction by Learning Sketch-based Feature Modeling Operations

Self-Supervised Dual Contouring

SVDTree: Semantic Voxel Diffusion for Single Image Tree Reconstruction

Text-Conditioned Generative Model of 3D Strand-based Human Hairstyles

CAD-SIGNet: CAD Language Inference from Point Clouds using Layer-wise Sketch Instance Guided Attention

Functional Diffusion

(ends 12:00 PM)

Art Program [10:30-6:45]

(ends 6:45 PM)

11 a.m.

noon

Break:

Lunch

(ends 2:00 PM)

1 p.m.

Orals 2A Image & Video Synthesis [1:00-2:30]

Orals 1:00-2:30

[1:00] FreeU: Free Lunch in Diffusion U-Net

[1:18] Ranni: Taming Text-to-Image Diffusion for Accurate Instruction Following

[1:36] Instruct-Imagen: Image Generation with Multi-modal Instruction

[1:54] Attention Calibration for Disentangled Text-to-Image Personalization

[2:12] Style Aligned Image Generation via Shared Attention

(ends 2:30 PM)

Orals 2B Deep learning architectures and techniques [1:00-2:30]

Orals 1:00-2:30

[1:00] Neural Redshift: Random Networks are not Random Functions

[1:18] Neural Lineage

[1:36] Learning Structure-from-Motion with Graph Attention Networks

[1:54] Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks

[2:12] In Search of a Data Transformation That Accelerates Neural Field Training

(ends 2:30 PM)

Orals 2C 3D from multiview and sensors [1:00-2:30]

Orals 1:00-2:30

[1:00] Point Transformer V3: Simpler, Faster, Stronger

[1:18] Matching 2D Images in 3D: Metric Relative Pose from Metric Correspondences

[1:36] Seeing the World through Your Eyes

[1:54] Tri-Perspective View Decomposition for Geometry-Aware Depth Completion

[2:12] Steerers: A Framework for Rotation Equivariant Keypoint Descriptors

(ends 2:30 PM)

1:15 p.m.

Expo Track Keynote:

Computer vision at scale: Driving customer innovation and industry adoption

Swami Sivasubramanian

(ends 2:15 PM)

2:30 p.m.

Break:

Courtesy Break

(ends 2:45 PM)

2:45 p.m.

Keynote:

The Tip and the Iceberg: Deep Learning and Embodiment

Joshua Bongard

(ends 3:45 PM)

3:45 p.m.

Break:

Courtesy Break

(ends 4:00 PM)

4 p.m.

Panel:

Societal opportunities and challenges of AI

Fei-Fei Li · Matt McIlwain · Hadi Partovi · Oren Etzioni · Peter Lee

(ends 5:00 PM)

5 p.m.

Poster Session 2 & Exhibit Hall [5:00-6:30]

Posters 5:00-6:30

Point Transformer V3: Simpler, Faster, Stronger

Matching 2D Images in 3D: Metric Relative Pose from Metric Correspondences

Seeing the World through Your Eyes

Tri-Perspective View Decomposition for Geometry-Aware Depth Completion

Steerers: A Framework for Rotation Equivariant Keypoint Descriptors

VP3D: Unleashing 2D Visual Prompt for Text-to-3D Generation

Entangled View-Epipolar Information Aggregation for Generalizable Neural Radiance Fields

GroupContrast: Semantic-aware Self-supervised Representation Learning for 3D Understanding

iToF-flow-based High Frame Rate Depth Imaging

Generalizable Novel-View Synthesis using a Stereo Camera

EfficientDreamer: High-Fidelity and Robust 3D Creation via Orthogonal-view Diffusion Priors

Leveraging Camera Triplets for Efficient and Accurate Structure-from-Motion

LAENeRF: Local Appearance Editing for Neural Radiance Fields

SuperPrimitive: Scene Reconstruction at a Primitive Level

Revisiting Sampson Approximations for Geometric Estimation Problems

Interactive3D: Create What You Want by Interactive 3D Generation

Multiplane Prior Guided Few-Shot Aerial Scene Rendering

3DGS-Avatar: Animatable Avatars via Deformable 3D Gaussian Splatting

DaReNeRF: Direction-aware Representation for Dynamic Scenes

ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models

LTM: Lightweight Textured Mesh Extraction and Refinement of Large Unbounded Scenes for Efficient Storage and Real-time Rendering

Minimal Perspective Autocalibration

X-3D: Explicit 3D Structure Modeling for Point Cloud Recognition

2S-UDF: A Novel Two-stage UDF Learning Method for Robust Non-watertight Model Reconstruction from Multi-view Images

UFORecon: Generalizable Sparse-View Surface Reconstruction from Arbitrary and Unfavorable Sets

GenN2N: Generative NeRF2NeRF Translation

Text-to-3D Generation with Bidirectional Diffusion using both 2D and 3D priors

Noisy One-point Homographies are Surprisingly Good

Adaptive Multi-Modal Cross-Entropy Loss for Stereo Matching

LiDAR4D: Dynamic Neural Fields for Novel Space-time View LiDAR Synthesis

NC-SDF: Enhancing Indoor Scene Reconstruction Using Neural SDFs with View-Dependent Normal Compensation

VastGaussian: Vast 3D Gaussians for Large Scene Reconstruction

Language-driven Object Fusion into Neural Radiance Fields with Pose-Conditioned Dataset Updates

SPU-PMD: Self-Supervised Point Cloud Upsampling via Progressive Mesh Deformation

Intrinsic Image Diffusion for Indoor Single-view Material Estimation

Learning Dynamic Tetrahedra for High-Quality Talking Head Synthesis

Robust Self-calibration of Focal Lengths from the Fundamental Matrix

RNb-NeuS: Reflectance and Normal-based Multi-View 3D Reconstruction

Neural 3D Strokes: Creating Stylized 3D Scenes with Vectorized 3D Strokes

Unsupervised Template-assisted Point Cloud Shape Correspondence Network

Efficient Detection of Long Consistent Cycles and its Application to Distributed Synchronization

AirPlanes: Accurate Plane Estimation via 3D-Consistent Embeddings

Accurate Training Data for Occupancy Map Prediction in Automated Driving Using Evidence Theory

Continuous Pose for Monocular Cameras in Neural Implicit Representation

Towards 3D Vision with Low-Cost Single-Photon Cameras

Inlier Confidence Calibration for Point Cloud Registration

GaussianShader: 3D Gaussian Splatting with Shading Functions for Reflective Surfaces

Language Embedded 3D Gaussians for Open-Vocabulary Scene Understanding

MVIP-NeRF: Multi-view 3D Inpainting on NeRF Scenes via Diffusion Prior

SuGaR: Surface-Aligned Gaussian Splatting for Efficient 3D Mesh Reconstruction and High-Quality Mesh Rendering

DreamControl: Control-Based Text-to-3D Generation with 3D Self-Prior

VAREN: Very Accurate and Realistic Equine Network

REACTO: Reconstructing Articulated Objects from a Single Video

DITTO: Dual and Integrated Latent Topologies for Implicit 3D Reconstruction

ICON: Incremental CONfidence for Joint Pose and Radiance Field Optimization

Local-consistent Transformation Learning for Rotation-invariant Point Cloud Analysis

PaReNeRF: Toward Fast Large-scale Dynamic NeRF with Patch-based Reference

Fitting Flats to Flats

ANIM: Accurate Neural Implicit Model for Human Reconstruction from a single RGB-D Image

Neural Markov Random Field for Stereo Matching

Improving Physics-Augmented Continuum Neural Radiance Field-Based Geometry-Agnostic System Identification with Lagrangian Particle Optimization

DiffusionAvatars: Deferred Diffusion for High-fidelity 3D Head Avatars

ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions

Pose-Transformed Equivariant Network for 3D Point Trajectory Prediction

UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition

KPConvX: Modernizing Kernel Point Convolution with Kernel Attention

Time-, Memory- and Parameter-Efficient Visual Adaptation

Affine Equivariant Networks Based on Differential Invariants

PeLK: Parameter-efficient Large Kernel ConvNets with Peripheral Convolution

Making Vision Transformers Truly Shift-Equivariant

Once for Both: Single Stage of Importance and Sparsity Search for Vision Transformer Compression

Data-Free Quantization via Pseudo-label Filtering

FedHCA2: Towards Hetero-Client Federated Multi-Task Learning

SpikingResformer: Bridging ResNet and Vision Transformer in Spiking Neural Networks

TetraSphere: A Neural Descriptor for O(3)-Invariant Point Cloud Analysis

Friendly Sharpness-Aware Minimization

RMT: Retentive Networks Meet Vision Transformers

Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications

Boosting Order-Preserving and Transferability for Neural Architecture Search: a Joint Architecture Refined Search and Fine-tuning Approach

Neural Redshift: Random Networks are not Random Functions

InceptionNeXt: When Inception Meets ConvNeXt

Neural Lineage

BiPer: Binary Neural Networks using a Periodic Function

Rewrite the Stars

A&B BNN: Add&Bit-Operation-Only Hardware-Friendly Binary Neural Network

Neural Clustering based Visual Representation Learning

Building Optimal Neural Architectures using Interpretable Knowledge

Towards More Accurate Diffusion Model Acceleration with A Timestep Tuner

UniPTS: A Unified Framework for Proficient Post-Training Sparsity

Learning Structure-from-Motion with Graph Attention Networks

SHViT: Single-Head Vision Transformer with Memory Efficient Macro Design

Denoising Point Clouds in Latent Space via Graph Convolution and Invertible Neural Network

JointSQ: Joint Sparsification-Quantization for Distributed Learning

YolOOD: Utilizing Object Detection Concepts for Multi-Label Out-of-Distribution Detection

RepAn: Enhanced Annealing through Re-parameterization

D^4: Dataset Distillation via Disentangled Diffusion Model

State Space Models for Event Cameras

Your Image is My Video: Reshaping the Receptive Field via Image-To-Video Differentiable AutoAugmentation and Fusion

Sparse Semi-DETR: Sparse Learnable Queries for Semi-Supervised Object Detection

MAPSeg: Unified Unsupervised Domain Adaptation for Heterogeneous Medical Image Segmentation Based on 3D Masked Autoencoding and Pseudo-Labeling

FedUV: Uniformity and Variance for Heterogeneous Federated Learning

Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks

Pick-or-Mix: Dynamic Channel Sampling for ConvNets

Sheared Backpropagation for Fine-tuning Foundation Models

AZ-NAS: Assembling Zero-Cost Proxies for Network Architecture Search

MRFP: Learning Generalizable Semantic Segmentation from Sim-2-Real with Multi-Resolution Feature Perturbation

Training-Free Pretrained Model Merging

Training Generative Image Super-Resolution Models by Wavelet-Domain Losses Enables Better Control of Artifacts

IReNe: Instant Recoloring of Neural Radiance Fields

AdaShift: Learning Discriminative Self-Gated Neural Feature Activation With an Adaptive Shift Factor

Kernel Adaptive Convolution for Scene Text Detection via Distance Map Prediction

Towards Accurate and Robust Architectures via Neural Architecture Search

PDF: A Probability-Driven Framework for Open World 3D Point Cloud Semantic Segmentation

Permutation Equivariance of Transformers and Its Applications

MedBN: Robust Test-Time Adaptation against Malicious Test Samples

Small Scale Data-Free Knowledge Distillation

Identifying Important Group of Pixels using Interactions

Efficiently Assemble Normalization Layers and Regularization for Federated Domain Generalization

OrthCaps: An Orthogonal CapsNet with Sparse Attention Routing and Pruning

Mean-Shift Feature Transformer

You Only Need Less Attention at Each Stage in Vision Transformers

HEAL-SWIN: A Vision Transformer On The Sphere

NC-TTT: A Noise Contrastive Approach for Test-Time Training

Unlocking the Potential of Prompt-Tuning in Bridging Generalized and Personalized Federated Learning

MR-VNet: Media Restoration using Volterra Networks

Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities

GreedyViG: Dynamic Axial Graph Construction for Efficient Vision GNNs

FlowerFormer: Empowering Neural Architecture Encoding using a Flow-aware Graph Transformer

Mixed-Precision Quantization for Federated Learning on Resource-Constrained Heterogeneous Devices

In Search of a Data Transformation That Accelerates Neural Field Training

Wired Perspectives: Multi-View Wire Art Embraces Generative AI

DemoFusion: Democratising High-Resolution Image Generation With No $$$

DiffPerformer: Iterative Learning of Consistent Latent Guidance for Diffusion-based Human Video Generation

InteractDiffusion: Interaction Control in Text-to-Image Diffusion Models

Intelligent Grimm - Open-ended Visual Storytelling via Latent Diffusion Models

ControlRoom3D: Room Generation using Semantic Proxy Rooms

Cache Me if You Can: Accelerating Diffusion Models through Block Caching

Real-time 3D-aware Portrait Video Relighting

InstanceDiffusion: Instance-level Control for Image Generation

Make-It-Vivid: Dressing Your Animatable Biped Cartoon Characters from Text

ZONE: Zero-Shot Instruction-Guided Local Editing

Don’t Drop Your Samples! Coherence-Aware Training Benefits Conditional Diffusion

Generating Illustrated Instructions

SpikeNeRF: Learning Neural Radiance Fields from Continuous Spike Stream

Dancing with Still Images: Video Distillation via Static-Dynamic Disentanglement

UniGS: Unified Representation for Image Generation and Segmentation

Adversarial Text to Continuous Image Generation

Self-correcting LLM-controlled Diffusion Models

TiNO-Edit: Timestep and Noise Optimization for Robust Diffusion-Based Image Editing

Taming Stable Diffusion for Text to 360 Panorama Image Generation

EmoGen: Emotional Image Content Generation with Text-to-Image Diffusion Models

Carve3D: Improving Multi-view Reconstruction Consistency for Diffusion Models with RL Finetuning

FreeU: Free Lunch in Diffusion U-Net

Move Anything with Layered Scene Diffusion

DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model

CapHuman: Capture Your Moments in Parallel Universes

IQ-VFI: Implicit Quadratic Motion Estimation for Video Frame Interpolation

Coarse-to-Fine Latent Diffusion for Pose-Guided Person Image Synthesis

MACE: Mass Concept Erasure in Diffusion Models

GenTron: Diffusion Transformers for Image and Video Generation

Relightful Harmonization: Lighting-aware Portrait Background Replacement

InstructVideo: Instructing Video Diffusion Models with Human Feedback

SportsSloMo: A New Benchmark and Baselines for Human-centric Video Frame Interpolation

TeTriRF: Temporal Tri-Plane Radiance Fields for Efficient Free-Viewpoint Video

SmartMask: Context Aware High-Fidelity Mask Generation for Fine-grained Object Insertion and Layout Control

RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models

LucidDreamer: Towards High-Fidelity Text-to-3D Generation via Interval Score Matching

HyperDreamBooth: HyperNetworks for Fast Personalization of Text-to-Image Models

DreamVideo: Composing Your Dream Videos with Customized Subject and Motion

SurMo: Surface-based 4D Motion Modeling for Dynamic Human Rendering

Ranni: Taming Text-to-Image Diffusion for Accurate Instruction Following

GenHowTo: Learning to Generate Actions and State Transformations from Instructional Videos

A Recipe for Scaling up Text-to-Video Generation with Text-free Videos

WaveFace: Authentic Face Restoration with Efficient Frequency Recovery

AnyDoor: Zero-shot Object-level Image Customization

ElasticDiffusion: Training-free Arbitrary Size Image Generation through Global-Local Content Separation

One-step Diffusion with Distribution Matching Distillation

Check, Locate, Rectify: A Training-Free Layout Calibration System for Text-to-Image Generation

Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation

HumanGaussian: Text-Driven 3D Human Generation with Gaussian Splatting

WonderJourney: Going from Anywhere to Everywhere

Balancing Act: Distribution-Guided Debiasing in Diffusion Models

SIGNeRF: Scene Integrated Generation for Neural Radiance Fields

VideoBooth: Diffusion-based Video Generation with Image Prompts

Total Selfie: Generating Full-Body Selfies

CCEdit: Creative and Controllable Video Editing via Diffusion Models

Cinematic Behavior Transfer via NeRF-based Differentiable Filming

Improving Subject-Driven Image Synthesis with Subject-Agnostic Guidance

Drag Your Noise: Interactive Point-based Editing via Diffusion Semantic Propagation

Learning Continuous 3D Words for Text-to-Image Generation

CHAIN: Enhancing Generalization in Data-Efficient GANs via lipsCHitz continuity constrAIned Normalization

ViVid-1-to-3: Novel View Synthesis with Video Diffusion Models

JeDi: Joint-Image Diffusion Models for Finetuning-Free Personalized Text-to-Image Generation

GaussianDreamer: Fast Generation from Text to 3D Gaussians by Bridging 2D and 3D Diffusion Models

Prompting Hard or Hardly Prompting: Prompt Inversion for Text-to-Image Diffusion Models

MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis

Towards Text-guided 3D Scene Composition

BerfScene: Bev-conditioned Equivariant Radiance Fields for Infinite 3D Scene Generation

Face2Diffusion for Fast and Editable Face Personalization

FreeDrag: Feature Dragging for Reliable Point-based Image Editing

OmniLocalRF: Omnidirectional Local Radiance Fields from Dynamic Videos

DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data

Generate Like Experts: Multi-Stage Font Generation by Incorporating Font Transfer Process into Diffusion Models

Panacea: Panoramic and Controllable Video Generation for Autonomous Driving

360DVD: Controllable Panorama Video Generation with 360-Degree Video Diffusion Model

CLiC: Concept Learning in Context

Z*: Zero-shot Style Transfer via Attention Reweighting

Tackling the Singularities at the Endpoints of Time Intervals in Diffusion Models

CosmicMan: A Text-to-Image Foundation Model for Humans

Customize your NeRF: Adaptive Source Driven 3D Scene Editing via Local-Global Iterative Training

PICTURE: PhotorealistIC virtual Try-on from UnconstRained dEsigns

Focus on Your Instruction: Fine-grained and Multi-instruction Image Editing by Attention Modulation

Make-Your-Anchor: A Diffusion-based 2D Avatar Generation Framework

Revisiting Non-Autoregressive Transformers for Efficient Image Synthesis

Texture-Preserving Diffusion Models for High-Fidelity Virtual Try-On

PromptCoT: Align Prompt Distribution via Adapted Chain-of-Thought

Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis

L-MAGIC: Language Model Assisted Generation of Images with Coherence

Text-Driven Image Editing via Learnable Regions

On Exact Inversion of DPM-Solvers

Instruct-Imagen: Image Generation with Multi-modal Instruction

ConsistNet: Enforcing 3D Consistency for Multi-view Images Diffusion

LAMP: Learn A Motion Pattern for Few-Shot Video Generation

Task-Customized Mixture of Adapters for General Image Fusion

Beyond Textual Constraints: Learning Novel Diffusion Conditions with Fewer Examples

Portrait4D: Learning One-Shot 4D Head Avatar Synthesis using Synthetic Data

Animating General Image with Large Visual Motion Model

Sat2Scene: 3D Urban Scene Generation from Satellite Images with Diffusion

Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners

AVID: Any-Length Video Inpainting with Diffusion Model

Generative Powers of Ten

DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models

Condition-Aware Neural Network for Controlled Image Generation

It's All About Your Sketch: Democratising Sketch Control in Diffusion Models

FaceChain-SuDe: Building Derived Class to Inherit Category Attributes for One-shot Subject-Driven Generation

In-N-Out: Faithful 3D GAN Inversion with Volumetric Decomposition for Face Editing

Video Prediction by Modeling Videos as Continuous Multi-Dimensional Processes

DetDiffusion: Synergizing Generative and Perceptive Models for Enhanced Data Generation and Perception

Structure-Guided Adversarial Training of Diffusion Models

Learning Adaptive Spatial Coherent Correlations for Speech-Preserving Facial Expression Manipulation

On the Content Bias in Fréchet Video Distance

Residual Learning in Diffusion Models

A Unified Approach for Text- and Image-guided 4D Scene Generation

VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models

Neural Implicit Morphing of Face Images

One More Step: A Versatile Plug-and-Play Module for Rectifying Diffusion Schedule Flaws and Enhancing Low-Frequency Controls

Video Interpolation with Diffusion Models

DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation

TFMQ-DM: Temporal Feature Maintenance Quantization for Diffusion Models

Improving Training Efficiency of Diffusion Models via Multi-Stage Framework and Tailored Multi-Decoder Architecture

Scaling Laws of Synthetic Images for Model Training ... for Now

BIVDiff: A Training-Free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models

MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers

Pose Adapted Shape Learning for Large-Pose Face Reenactment

PRDP: Proximal Reward Difference Prediction for Large-Scale Reward Finetuning of Diffusion Models

Discriminative Probing and Tuning for Text-to-Image Generation

Towards Automated Movie Trailer Generation

CDFormer: When Degradation Prediction Embraces Diffusion Model for Blind Image Super-Resolution

FreeControl: Training-Free Spatial Control of Any Text-to-Image Diffusion Model with Any Condition

RealCustom: Narrowing Real Text Word for Real-Time Open-Domain Text-to-Image Customization

VidToMe: Video Token Merging for Zero-Shot Video Editing

Layout-Agnostic Scene Text Image Synthesis with Diffusion Models

3D Multi-frame Fusion for Video Stabilization

DyBluRF: Dynamic Neural Radiance Fields from Blurry Monocular Video

A Video is Worth 256 Bases: Spatial-Temporal Expectation-Maximization Inversion for Zero-Shot Video Editing

StrokeFaceNeRF: Stroke-based Facial Appearance Editing in Neural Radiance Field

Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models

One-dimensional Adapter to Rule Them All: Concepts, Diffusion Models and Erasing Applications

Hierarchical Patch Diffusion Models for High-Resolution Video Generation

Taming the Tail in Class-Conditional GANs: Knowledge Sharing via Unconditional Training at Lower Resolutions

Don't Look into the Dark: Latent Codes for Pluralistic Image Inpainting

Content-Style Decoupling for Unsupervised Makeup Transfer without Generating Pseudo Ground Truth

Generative Rendering: Controllable 4D-Guided Video Generation with 2D Diffusion Models

VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence

Rethinking the Objectives of Vector-Quantized Tokenizers for Image Synthesis

Dysen-VDM: Empowering Dynamics-aware Text-to-Video Diffusion with LLMs

Geometry-aware Reconstruction and Fusion-refined Rendering for Generalizable Neural Radiance Fields

DynVideo-E: Harnessing Dynamic NeRF for Large-Scale Motion- and View-Change Human-Centric Video Editing

High-fidelity Person-centric Subject-to-Image Synthesis

Relation Rectification in Diffusion Model

Diffusion Handles: Enabling 3D Edits for Diffusion Models by Lifting Activations to 3D

LeftRefill: Filling Right Canvas based on Left Reference through Generalized Text-to-Image Diffusion Model

FSRT: Facial Scene Representation Transformer for Face Reenactment from Factorized Appearance, Head-pose, and Facial Expression Features

Tailored Visions: Enhancing Text-to-Image Generation with Personalized Prompt Rewriting

MMA-Diffusion: MultiModal Attack on Diffusion Models

PIA: Your Personalized Image Animator via Plug-and-Play Modules in Text-to-Image Models

Codebook Transfer with Part-of-Speech for Vector-Quantized Image Modeling

Generating Non-Stationary Textures using Self-Rectification

Fast ODE-based Sampling for Diffusion Models in Around 5 Steps

Deformable One-shot Face Stylization via DINO Semantic Guidance

Learning Disentangled Identifiers for Action-Customized Text-to-Image Generation

SwiftBrush: One-Step Text-to-Image Diffusion Model with Variational Score Distillation

Towards Understanding Cross and Self-Attention in Stable Diffusion for Text-Guided Image Editing

SimDA: Simple Diffusion Adapter for Efficient Video Generation

Unlocking Pre-trained Image Backbones for Semantic Image Synthesis

Shadow-Enlightened Image Outpainting

Exploiting Diffusion Prior for Generalizable Dense Prediction

StyleCineGAN: Landscape Cinemagraph Generation using a Pre-trained StyleGAN

MotionEditor: Editing Video Motion via Content-Aware Diffusion

DanceCamera3D: 3D Camera Movement Synthesis with Music and Dance

Diversity-aware Channel Pruning for StyleGAN Compression

DiffMorpher: Unleashing the Capability of Diffusion Models for Image Morphing

StegoGAN: Leveraging Steganography for Non-Bijective Image-to-Image Translation

Grounded Text-to-Image Synthesis with Attention Refocusing

VecFusion: Vector Font Generation with Diffusion

Single Mesh Diffusion Models with Field Latents for Texture Generation

Orthogonal Adaptation for Modular Customization of Diffusion Models

Low-Latency Neural Stereo Streaming

TextCraftor: Your Text Encoder Can be Image Quality Controller

4D-fy: Text-to-4D Generation Using Hybrid Score Distillation Sampling

Image Neural Field Diffusion Models

Learning Multi-Dimensional Human Preference for Text-to-Image Generation

Dynamic Policy-Driven Adaptive Multi-Instance Learning for Whole Slide Image Classification

Structure Matters: Tackling the Semantic Discrepancy in Diffusion Models for Image Inpainting

IMPRINT: Generative Object Compositing by Learning Identity-Preserving Representation

Puff-Net: Efficient Style Transfer with Pure Content and Style Feature Fusion Network

SSR-Encoder: Encoding Selective Subject Representation for Subject-Driven Generation

PEEKABOO: Interactive Video Generation via Masked-Diffusion

CoDeF: Content Deformation Fields for Temporally Consistent Video Processing

DreamMatcher: Appearance Matching Self-Attention for Semantically-Consistent Text-to-Image Personalization

DreamComposer: Controllable 3D Object Generation via Multi-View Conditions

Shadow Generation for Composite Image Using Diffusion Model

Adversarial Score Distillation: When score distillation meets GAN

Uncertainty-Aware Source-Free Adaptive Image Super-Resolution with Wavelet Augmentation Transformer

Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation

Person in Place: Generating Associative Skeleton-Guidance Maps for Human-Object Interaction Image Editing

StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On

Attention Calibration for Disentangled Text-to-Image Personalization

Personalized Residuals for Concept-Driven Text-to-Image Generation

UFOGen: You Forward Once Large Scale Text-to-Image Generation via Diffusion GANs

FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis

Readout Guidance: Learning Control from Diffusion Features

Diffusion Model Alignment Using Direct Preference Optimization

Diffusion Models Without Attention

CommonCanvas: Open Diffusion Models Trained on Creative-Commons Images

Fairy: Fast Parallelized Instruction-Guided Video-to-Video Synthesis

Edit One for All: Interactive Batch Image Editing

Wavelet-based Fourier Information Interaction with Frequency Diffusion Adjustment for Underwater Image Restoration

Accelerating Diffusion Sampling with Optimized Time Steps

One-Shot Structure-Aware Stylized Image Synthesis

Selectively Informative Description can Reduce Undesired Embedding Entanglements in Text-to-Image Personalization

Observation-Guided Diffusion Probabilistic Models

Scaling Up Video Summarization Pretraining with Large Language Models

DREAM: Diffusion Rectification and Estimation-Adaptive Models

Clockwork Diffusion: Efficient Generation With Model-Step Distillation

SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models

CAT-DM: Controllable Accelerated Virtual Try-on with Diffusion Model

Exact Fusion via Feature Distribution Matching for Few-shot Image Generation

Cross Initialization for Face Personalization of Text-to-Image Models

EasyDrag: Efficient Point-based Manipulation on Diffusion Models

MicroCinema: A Divide-and-Conquer Approach for Text-to-Video Generation

Towards Memorization-Free Diffusion Models

SD-DiT: Unleashing the Power of Self-supervised Discrimination in Diffusion Transformer

Towards Effective Usage of Human-Centric Priors in Diffusion Models for Text-based Human Image Generation

Text2QR: Harmonizing Aesthetic Customization and Scanning Robustness for Text-Guided QR Code Generation

Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer

Video Frame Interpolation via Direct Synthesis with the Event-based Reference

DiffEditor: Boosting Accuracy and Flexibility on Diffusion-based Image Editing

EMOPortraits: Emotion-enhanced Multimodal One-shot Head Avatars

Spacetime Gaussian Feature Splatting for Real-Time Dynamic View Synthesis

HOIDiffusion: Generating Realistic 3D Hand-Object Interaction Data

Learned Representation-Guided Diffusion Models for Large-Image Generation

InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning

TokenCompose: Text-to-Image Diffusion with Token-level Supervision

Geometry Transfer for Stylizing Radiance Fields

Align Your Gaussians: Text-to-4D with Dynamic 3D Gaussians and Composed Diffusion Models

DreamSalon: A Staged Diffusion Framework for Preserving Identity-Context in Editable Face Generation

Video-P2P: Video Editing with Cross-attention Control

PAIR Diffusion: A Comprehensive Multimodal Object-Level Image Editor

ArtAdapter: Text-to-Image Style Transfer using Multi-Level Style Encoder and Explicit Adaptation

DemoCaricature: Democratising Caricature Generation with a Rough Sketch

PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding

Predicated Diffusion: Predicate Logic-Based Attention Guidance for Text-to-Image Diffusion Models

SNED: Superposition Network Architecture Search for Efficient Video Diffusion Model

TRIP: Temporal Residual Learning with Image Noise Prior for Image-to-Video Diffusion Models

Prompt-Free Diffusion: Taking “Text” out of Text-to-Image Diffusion Models

DEADiff: An Efficient Stylization Diffusion Model with Disentangled Representations

FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation

Correcting Diffusion Generation through Resampling

AnyScene: Customized Image Synthesis with Composited Foreground

Grid Diffusion Models for Text-to-Video Generation

Direct2.5: Diverse Text-to-3D Generation via Multi-view 2.5D Diffusion

Anomaly Score: Evaluating Generative Models and Individual Generated Images based on Complexity and Vulnerability

Style Aligned Image Generation via Shared Attention

Zero-Painter: Training-Free Layout Control for Text-to-Image Synthesis

X-Adapter: Adding Universal Compatibility of Plugins for Upgraded Diffusion Model

Neural Point Cloud Diffusion for Disentangled 3D Shape and Appearance Generation

Style Injection in Diffusion: A Training-free Approach for Adapting Large-scale Diffusion Models for Style Transfer

Vlogger: Make Your Dream A Vlog

Faces that Speak: Jointly Synthesising Talking Face and Speech from Text

Prompt Augmentation for Self-supervised Text-guided Image Manipulation

DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing

Make Pixels Dance: High-Dynamic Video Generation

LEDITS++: Limitless Image Editing using Text-to-Image Models

Emu Edit: Precise Image Editing via Recognition and Generation Tasks

Concept Weaver: Enabling Multi-Concept Fusion in Text-to-Image Models

ACT-Diffusion: Efficient Adversarial Consistency Training for One-step Diffusion Models

3D Geometry-Aware Deformable Gaussian Splatting for Dynamic View Synthesis

Boosting Diffusion Models with Moving Average Sampling in Frequency Domain

NoiseCollage: A Layout-Aware Text-to-Image Diffusion Model Based on Noise Cropping and Merging

NeRF On-the-go: Exploiting Uncertainty for Distractor-free NeRFs in the Wild

Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model

GeneAvatar: Generic Expression-Aware Volumetric Head Avatar Editing from a Single Image

MaskPLAN: Masked Generative Layout Planning from Partial Input

WOUAF: Weight Modulation for User Attribution and Fingerprinting in Text-to-Image Diffusion Models

Transcending Forgery Specificity with Latent Space Augmentation for Generalizable Deepfake Detection

SCEdit: Efficient and Controllable Image Diffusion Generation via Skip Connection Editing

CONFORM: Contrast is All You Need for High-Fidelity Text-to-Image Diffusion Models

TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models

HIVE: Harnessing Human Feedback for Instructional Visual Editing

Taming Mode Collapse in Score Distillation for Text-to-3D Generation

CoDi: Conditional Diffusion Distillation for Higher-Fidelity and Faster Image Generation

Universal Robustness via Median Randomized Smoothing for Real-World Super-Resolution

ECLIPSE: A Resource-Efficient Text-to-Image Prior for Image Generations

CAMEL: CAusal Motion Enhancement Tailored for Lifting Text-driven Video Editing

FreeCustom: Tuning-Free Customized Image Generation for Multi-Concept Composition

Amodal Completion via Progressive Mixed Context Diffusion

Named Entity Driven Zero-Shot Image Manipulation

Learning Degradation-unaware Representation with Prior-based Latent Transformations for Blind Face Restoration

AEROBLADE: Training-Free Detection of Latent Diffusion Images Using Autoencoder Reconstruction Error

VRetouchEr: Learning Cross-frame Feature Interdependence with Imperfection Flow for Face Retouching in Videos

Generative Unlearning for Any Identity

Doubly Abductive Counterfactual Inference for Text-based Image Editing

Text-conditional Attribute Alignment across Latent Spaces for 3D Controllable Face Image Synthesis

Customization Assistant for Text-to-Image Generation

Contrastive Denoising Score for Text-guided Latent Diffusion Image Editing

Arbitrary-Scale Image Generation and Upsampling using Latent Diffusion Model and Implicit Neural Decoder

VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models

Visual Layout Composer: Image-Vector Dual Diffusion Model for Design Layout Generation

Learning Spatial Adaptation and Temporal Coherence in Diffusion Models for Video Super-Resolution

Open-Vocabulary Attention Maps with Token Optimization for Semantic Segmentation in Diffusion Models

Combining Frame and GOP Embeddings for Neural Video Representation

PLACE: Adaptive Layout-Semantic Fusion for Semantic Image Synthesis

Your Student is Better Than Expected: Adaptive Teacher-Student Collaboration for Text-Conditional Diffusion Models

Mitigating Motion Blur in Neural Radiance Fields with Events and Frames

Unmixing Before Fusion: A Generalized Paradigm for Multi-Source-based Hyperspectral Image Synthesis

Rethinking FID: Towards a Better Evaluation Metric for Image Generation

MarkovGen: Structured Prediction for Efficient Text-to-Image Generation

DisCo: Disentangled Control for Realistic Human Dance Generation

The Devil is in the Details: StyleFeatureEditor for Detail-Rich StyleGAN Inversion and High Quality Image Editing

C3: High-Performance and Low-Complexity Neural Compression from a Single Image or Video

LightIt: Illumination Modeling and Control for Diffusion Models

Rethinking the Spatial Inconsistency in Classifier-Free Diffusion Guidance

InitNO: Boosting Text-to-Image Diffusion Models via Initial Noise Optimization

On the Diversity and Realism of Distilled Dataset: An Efficient Dataset Distillation Paradigm

On the Scalability of Diffusion-based Text-to-Image Generation

Distilling ODE Solvers of Diffusion Models into Smaller Steps

ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Image

Fixed Point Diffusion Models

Gaussian Shell Maps for Efficient 3D Human Generation

Inversion-Free Image Editing with Language-Guided Diffusion Models

TIGER: Time-Varying Denoising Model for 3D Point Cloud Generation with Diffusion Process

Beyond First-Order Tweedie: Solving Inverse Problems using Latent Diffusion

U-VAP: User-specified Visual Appearance Personalization via Decoupled Self Augmentation

(ends 6:30 PM)

THU 20 JUN

7:30 a.m.

Registration / Badge Pickup

(ends 4:00 PM)

Break:

Breakfast

(ends 9:00 AM)

9 a.m.

Orals 3A 3D from single view [9:00-10:30]

Orals 9:00-10:30

[9:00] Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation

[9:18] EscherNet: A Generative Model for Scalable View Synthesis

[9:36] WALT3D: Generating Realistic Training Data from Time-Lapse Imagery for Reconstructing Dynamic Objects Under Occlusion

[9:54] Diffusion-FOF: Single-View Clothed Human Reconstruction via Diffusion-Based Fourier Occupancy Field

[10:12] Rethinking Inductive Biases for Surface Normal Estimation

(ends 10:30 AM)

Orals 3B Vision, Language, and Reasoning [9:00-10:30]

Orals 9:00-10:30

[9:00] Comparing the Decision-Making Mechanisms by Transformers and CNNs via Explanation Methods

[9:18] MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

[9:36] Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs

[9:54] LISA: Reasoning Segmentation via Large Language Model

[10:12] Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models

(ends 10:30 AM)

Orals 3C Medical and Physics-based vision [9:00-10:30]

Orals 9:00-10:30

[9:00] EventPS: Real-Time Photometric Stereo Using an Event Camera

[9:18] EvDiG: Event-guided Direct and Global Components Separation

[9:36] MemSAM: Taming Segment Anything Model for Echocardiography Video Segmentation

[9:54] Transcriptomics-guided Slide Representation Learning in Computational Pathology

[10:12] Correlation-aware Coarse-to-fine MLPs for Deformable Medical Image Registration

(ends 10:30 AM)

10:30 a.m.

Demonstration:

Demos

(ends 6:45 PM)

Art Program [10:30-6:45]

(ends 6:45 PM)

Expo Track Keynote:

Today’s Pictures, Tomorrow’s Training Data: The Synergy Between Human Creativity and AI

Andrea Gagliano

(ends 11:30 AM)

Poster Session 3 & Exhibit Hall [10:30-12:00]

Posters 10:30-12:00

G3DR: Generative 3D Reconstruction in ImageNet

CityDreamer: Compositional Generative Model of Unbounded 3D Cities

6D-Diff: A Keypoint Diffusion Framework for 6D Object Pose Estimation

Generative Proxemics: A Prior for 3D Social Interaction from Images

Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation

MVD-Fusion: Single-view 3D via Depth-consistent Multi-view Generation

WorDepth: Variational Language Prior for Monocular Depth Estimation

Free3D: Consistent Novel View Synthesis without 3D Representation

PostureHMR: Posture Transformation for 3D Human Mesh Recovery

3DFIRES: Few Image 3D REconstruction for Scenes with Hidden Surfaces

Learning the 3D Fauna of the Web

Bilateral Propagation Network for Depth Completion

Efficient Privacy-Preserving Visual Localization Using 3D Ray Clouds

EpiDiff: Enhancing Multi-View Synthesis via Localized Epipolar-Constrained Diffusion

Doodle Your 3D: From Abstract Freehand Sketches to Precise 3D Shapes

LowRankOcc: Tensor Decomposition and Low-Rank Recovery for Vision-based 3D Semantic Occupancy Prediction

CNC-Net: Self-Supervised Learning for CNC Machining Operations

Reconstructing Hands in 3D with Transformers

Boosting Self-Supervision for Single-View Scene Completion via Knowledge Distillation

Know Your Neighbors: Improving Single-View Reconstruction via Spatial Vision-Language Reasoning

Depth Prompting for Sensor-Agnostic Depth Estimation

ViewFusion: Towards Multi-View Consistency via Interpolated Denoising

Slice3D: Multi-Slice Occlusion-Revealing Single View 3D Reconstruction

Consistent3D: Towards Consistent High-Fidelity Text-to-3D Generation with Deterministic Sampling Prior

GigaPose: Fast and Robust Novel Object Pose Estimation via One Correspondence

RichDreamer: A Generalizable Normal-Depth Diffusion Model for Detail Richness in Text-to-3D

Elite360D: Towards Efficient 360 Depth Estimation via Semantic- and Distance-Aware Bi-Projection Fusion

SIFU: Side-view Conditioned Implicit Function for Real-world Usable Clothed Human Reconstruction

Diffusion Time-step Curriculum for One Image to 3D Generation

SecondPose: SE(3)-Consistent Dual-Stream Feature Fusion for Category-Level Pose Estimation

Wonder3D: Single Image to 3D using Cross-Domain Diffusion

En3D: An Enhanced Generative Model for Sculpting 3D Humans from 2D Synthetic Data

MOHO: Learning Single-view Hand-held Object Reconstruction with Multi-view Occlusion-Aware Supervision

Template Free Reconstruction of Human-object Interaction with Procedural Interaction Generation

PatchFusion: An End-to-End Tile-Based Framework for High-Resolution Monocular Metric Depth Estimation

SPAD: Spatially Aware Multi-View Diffusers

GenFlow: Generalizable Recurrent Flow for 6D Pose Refinement of Novel Objects

PointInfinity: Resolution-Invariant Point Diffusion Models

ZeroShape: Regression-based Zero-shot Shape Reconstruction

One-2-3-45++: Fast Single Image to 3D Objects with Consistent Multi-View Generation and 3D Diffusion

ConTex-Human: Free-View Rendering of Human from a Single Image with Texture-Consistent Synthesis

MatchU: Matching Unseen Objects for 6D Pose Estimation from RGB-D Images

UniDepth: Universal Monocular Metric Depth Estimation

G-NeRF: Geometry-enhanced Novel View Synthesis from Single-View Images

3DToonify: Creating Your High-Fidelity 3D Stylized Avatar Easily from 2D Portrait Images

Adaptive Fusion of Single-View and Multi-View Depth for Autonomous Driving

HiPose: Hierarchical Binary Surface Encoding and Correspondence Pruning for RGB-D 6DoF Object Pose Estimation

HandBooster: Boosting 3D Hand-Mesh Reconstruction by Conditional Synthesis and Sampling of Hand-Object Interactions

3D-SceneDreamer: Text-Driven 3D-Consistent Scene Generation

NViST: In the Wild New View Synthesis from a Single Image with Transformers

CAD: Photorealistic 3D Generation via Adversarial Distillation

Splatter Image: Ultra-Fast Single-View 3D Reconstruction

Joint Reconstruction of 3D Human and Object via Contact-Based Refinement Transformer

Sculpt3D: Multi-View Consistent Text-to-3D Generation with Sparse 3D Prior

Object Pose Estimation via the Aggregation of Diffusion Features

MonoCD: Monocular 3D Object Detection with Complementary Depths

MultiDiff: Consistent Novel View Synthesis from a Single Image

SeaBird: Segmentation in Bird’s View with Dice Loss Improves Monocular 3D Detection of Large Objects

Learning Occupancy for Monocular 3D Object Detection

NeRFDeformer: NeRF Transformation from a Single View via 3D Scene Flows

R-Cyclic Diffuser: Reductive and Cyclic Latent Diffusion for 3D Clothed Human Digitalization

Unleashing Network Potentials for Semantic Scene Completion

Triplane Meets Gaussian Splatting: Fast and Generalizable Single-View 3D Reconstruction with Transformers

VOODOO 3D: Volumetric Portrait Disentanglement For One-Shot 3D Head Reenactment

Compressed 3D Gaussian Splatting for Accelerated Novel View Synthesis

Morphable Diffusion: 3D-Consistent Diffusion for Single-image Avatar Creation

Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data

SAOR: Single-View Articulated Object Reconstruction

EscherNet: A Generative Model for Scalable View Synthesis

HOISDF: Constraining 3D Hand-Object Pose Estimation with Global Signed Distance Fields

Diffusion-driven GAN Inversion for Multi-Modal Face Image Generation

Novel View Synthesis with View-Dependent Effects from a Single Image

Weakly-Supervised Emotion Transition Learning for Diverse 3D Co-speech Gesture Generation

Seg2Reg: Differentiable 2D Segmentation to 1D Regression Rendering for 360 Room Layout Reconstruction

Mining Supervision for Dynamic Regions in Self-Supervised Monocular Depth Estimation

WALT3D: Generating Realistic Training Data from Time-Lapse Imagery for Reconstructing Dynamic Objects Under Occlusion

DiffPortrait3D: Controllable Diffusion for Zero-Shot Portrait View Synthesis

3D-LFM: Lifting Foundation Model

MRC-Net: 6-DoF Pose Estimation with MultiScale Residual Correlation

DiffusionGAN3D: Boosting Text-guided 3D Generation and Domain Adaptation by Combining 3D GANs and Diffusion Priors

VS: Reconstructing Clothed 3D Human from Single Image via Vertex Shift

Weakly Supervised Monocular 3D Detection with a Single-View Image

From-Ground-To-Objects: Coarse-to-Fine Self-supervised Monocular Depth Estimation of Dynamic Objects with Ground Contact Prior

Gated Fields: Learning Scene Reconstruction from Gated Videos

SCINeRF: Neural Radiance Fields from a Snapshot Compressive Image

Diffusion-FOF: Single-View Clothed Human Reconstruction via Diffusion-Based Fourier Occupancy Field

Instance-aware Contrastive Learning for Occluded Human Mesh Reconstruction

IBD-SLAM: Learning Image-Based Depth Fusion for Generalizable SLAM

HarmonyView: Harmonizing Consistency and Diversity in One-Image-to-3D

UV-IDM: Identity-Conditioned Latent Diffusion Model for Face UV-Texture Generation

AttriHuman-3D: Editable 3D Human Avatar Generation with Attribute Decomposition and Indexing

Mind The Edge: Refining Depth Edges in Sparsely-Supervised Monocular Depth Estimation

3DiffTection: 3D Object Detection with Geometry-Aware Diffusion Features

Bayesian Diffusion Models for 3D Shape Reconstruction

Rethinking Inductive Biases for Surface Normal Estimation

LaneCPP: Continuous 3D Lane Detection using Physical Priors

Enhancing 3D Fidelity of Text-to-3D using Cross-View Correspondences

MonoDiff: Monocular 3D Object Detection and Pose Estimation with Diffusion Models

HiLo: Detailed and Robust 3D Clothed Human Reconstruction with High-and Low-Frequency Information of Parametric Models

MPOD123: One Image to 3D Content Generation Using Mask-enhanced Progressive Outline-to-Detail Optimization

GeoReF: Geometric Alignment Across Shape Variation for Category-level Object Pose Refinement

Unsupervised 3D Structure Inference from Category-Specific Image Collections

Neural Parametric Gaussians for Monocular Non-Rigid Object Reconstruction

BiTT: Bi-directional Texture Reconstruction of Interacting Two Hands from a Single Image

DeCoTR: Enhancing Depth Completion with 2D and 3D Attentions

MonoNPHM: Dynamic Head Reconstruction from Monocular Videos

FakeInversion: Learning to Detect Images from Unseen Text-to-Image Models by Inverting Stable Diffusion

Forgery-aware Adaptive Transformer for Generalizable Synthetic Image Detection

Towards Modern Image Manipulation Localization: A Large-Scale Dataset and Novel Methods

ManiFPT: Defining and Analyzing Fingerprints of Generative Models

ProMark: Proactive Diffusion Watermarking for Causal Attribution

CGI-DM: Digital Copyright Authentication for Diffusion Models via Contrasting Gradient Inversion

SCoFT: Self-Contrastive Fine-Tuning for Equitable Image Generation

Would Deep Generative Models Amplify Bias in Future Models?

Training Diffusion Models Towards Diverse Image Generation with Reinforcement Learning

Consistency and Uncertainty: Identifying Unreliable Responses From Black-Box Vision-Language Models for Selective Visual Question Answering

Visual Objectification in Films: Towards a New AI Task for Video Interpretation

ToonerGAN: Reinforcing GANs for Obfuscating Automated Facial Indexing

MCPNet: An Interpretable Classifier via Multi-Level Concept Prototypes

Visual Concept Connectome (VCC): Open World Concept Discovery and their Interlayer Connections in Deep Models

Discover and Mitigate Multiple Biased Subgroups in Image Classifiers

CORES: Convolutional Response-based Score for Out-of-distribution Detection

Token Transformation Matters: Towards Faithful Post-hoc Explanation for Vision Transformer

Comparing the Decision-Making Mechanisms by Transformers and CNNs via Explanation Methods

On the Faithfulness of Vision Transformer Explanations

Understanding Video Transformers via Universal Concept Discovery

Explaining the Implicit Neural Canvas: Connecting Pixels to Neurons by Tracing their Contributions

WWW: A Unified Framework for Explaining What, Where and Why of Neural Networks by Interpretation of Neuron Concepts

HDQMF: Holographic Feature Decomposition Using Quantum Algorithms

SLICE: Stabilized LIME for Consistent Explanations for Image Classification

What Sketch Explainability Really Means for Downstream Tasks?

Structured Gradient-based Interpretations via Norm-Regularized Adversarial Training

Learning Triangular Distribution in Visual World

Incremental Residual Concept Bottleneck Models

Uncertainty Visualization via Low-Dimensional Posterior Projections

Epistemic Uncertainty Quantification For Pre-Trained Neural Networks

Interpretable Measures of Conceptual Similarity by Complexity-Constrained Descriptive Auto-Encoding

CAPE: CAM as a Probabilistic Ensemble for Enhanced DNN Interpretation

Discovering and Mitigating Visual Biases through Keyword Explanation

DiG-IN: Diffusion Guidance for Investigating Networks - Uncovering Classifier Differences, Neuron Visualisations, and Visual Counterfactual Explanations

Cross-Dimension Affinity Distillation for 3D EM Neuron Segmentation

Continual Self-supervised Learning: Towards Universal Multi-modal Medical Data Representation Learning

A Unified Framework for Microscopy Defocus Deblur with Multi-Pyramid Transformer and Contrastive Learning

CARZero: Cross-Attention Alignment for Radiology Zero-Shot Classification

Towards Generalizable Tumor Synthesis

Tyche: Stochastic In-Context Learning for Medical Image Segmentation

Structure-Aware Sparse-View X-ray 3D Reconstruction

Each Test Image Deserves A Specific Prompt: Continual Test-Time Adaptation for 2D Medical Image Segmentation

Training Like a Medical Resident: Context-Prior Learning Toward Universal Medical Image Segmentation

C^2RV: Cross-Regional and Cross-View Learning for Sparse-View CBCT Reconstruction

Modality-Agnostic Structural Image Representation Learning for Deformable Multi-Modality Medical Image Registration

SI-MIL: Taming Deep MIL for Self-Interpretability in Gigapixel Histopathology

Bootstrapping Chest CT Image Understanding by Distilling Knowledge from X-ray Expert Models

ViLa-MIL: Dual-scale Vision-Language Multiple Instance Learning for Whole Slide Image Classification

Virtual Immunohistochemistry Staining for Histological Images Assisted by Weakly-supervised Learning

Representing Part-Whole Hierarchies in Foundation Models by Learning Localizability, Composability, and Decomposability from Anatomy via Self Supervision

XFibrosis: Explicit Vessel-Fiber Modeling for Fibrosis Staging from Liver Pathology Images

Prompting Vision Foundation Models for Pathology Image Analysis

One-Prompt to Segment All Medical Images

Learning Large-Factor EM Image Super-Resolution with Generative Priors

Dynamic Graph Representation with Knowledge-aware Attention for Histopathology Whole Slide Image Analysis

MindBridge: A Cross-Subject Brain Decoding Framework

Feature Re-Embedding: Towards Foundation Model-Level Performance in Computational Pathology

Data-Efficient Unsupervised Interpolation Without Any Intermediate Frame for 4D Medical Images

Rethinking Diffusion Model for Multi-Contrast MRI Super-Resolution

Adapting Visual-Language Models for Generalizable Anomaly Detection in Medical Images

ZePT: Zero-Shot Pan-Tumor Segmentation via Query-Disentangling and Self-Prompting

MemSAM: Taming Segment Anything Model for Echocardiography Video Segmentation

Generalizable Whole Slide Image Classification with Fine-Grained Visual-Semantic Interaction

Incremental Nuclei Segmentation from Histopathological Images via Future-class Awareness and Compatibility-inspired Distillation

PH-Net: Semi-Supervised Breast Lesion Segmentation via Patch-wise Hardness

ToNNO: Tomographic Reconstruction of a Neural Network’s Output for Weakly Supervised Segmentation of 3D Medical Images

Think Twice Before Selection: Federated Evidential Active Learning for Medical Image Analysis with Domain Shifts

CPLIP: Zero-Shot Learning for Histopathology with Comprehensive Vision-Language Alignment

Transcriptomics-guided Slide Representation Learning in Computational Pathology

MicroDiffusion: Implicit Representation-Guided Diffusion for 3D Reconstruction from Limited 2D Microscopy Projections

Diversified and Personalized Multi-rater Medical Image Segmentation

Modality-agnostic Domain Generalizable Medical Image Segmentation by Multi-Frequency in Multi-Scale Attention

Decomposing Disease Descriptions for Enhanced Pathology Detection: A Multi-Aspect Vision-Language Pre-training Framework

MedM2G: Unifying Medical Multi-Modal Generation via Cross-Guided Diffusion with Visual Invariant

H-ViT: A Hierarchical Vision Transformer for Deformable Image Registration

Seeing Unseen: Discover Novel Biomedical Concepts via Geometry-Constrained Probabilistic Modeling

Fully Convolutional Slice-to-Volume Reconstruction for Single-Stack MRI

IIRP-Net: Iterative Inference Residual Pyramid Network for Enhanced Image Registration

ChAda-ViT: Channel Adaptive Attention for Joint Representation Learning of Heterogeneous Microscopy Images

Morphological Prototyping for Unsupervised Slide Representation Learning in Computational Pathology

Modeling Dense Multimodal Interactions Between Biological Pathways and Histology for Survival Prediction

Accurate Spatial Gene Expression Prediction by Integrating Multi-Resolution Features

Teeth-SEG: An Efficient Instance Segmentation Framework for Orthodontic Treatment based on Multi-Scale Aggregation and Anthropic Prior Knowledge

Low-Rank Knowledge Decomposition for Medical Foundation Models

M3-UDA: A New Benchmark for Unsupervised Domain Adaptive Fetal Cardiac Structure Detection

CycleINR: Cycle Implicit Neural Representation for Arbitrary-Scale Volumetric Super-Resolution of Medical Data

Constructing and Exploring Intermediate Domains in Mixed Domain Semi-supervised Medical Image Segmentation

PairAug: What Can Augmented Image-Text Pairs Do for Radiology?

Intraoperative 2D/3D Image Registration via Differentiable X-ray Rendering

Mudslide: A Universal Nuclear Instance Segmentation Method

Correlation-aware Coarse-to-fine MLPs for Deformable Medical Image Registration

Rotation-Agnostic Image Representation Learning for Digital Pathology

Tumor Micro-environment Interactions Guided Graph Learning for Survival Analysis of Human Cancers from Whole-slide Pathological Images

MLIP: Enhancing Medical Visual Representation with Divergence Encoder and Knowledge-guided Contrastive Learning

FocusMAE: Gallbladder Cancer Detection from Ultrasound Videos with Focused Masked Autoencoders

Bi-level Learning of Task-Specific Decoders for Joint Registration and One-Shot Medical Image Segmentation

PrPSeg: Universal Proposition Learning for Panoramic Renal Pathology Segmentation

Versatile Medical Image Segmentation Learned from Multi-Source Datasets via Model Self-Disambiguation

Masked Autoencoders for Microscopy are Scalable Learners of Cellular Biology

EMCAD: Efficient Multi-scale Convolutional Attention Decoding for Medical Image Segmentation

Neural Underwater Scene Representation

Hearing Anything Anywhere

VMINer: Versatile Multi-view Inverse Rendering with Near- and Far-field Light Sources

EventPS: Real-Time Photometric Stereo Using an Event Camera

DiLiGenRT: A Photometric Stereo Dataset with Quantified Roughness and Translucency

NeRSP: Neural 3D Reconstruction for Reflective Objects with Sparse Polarized Images

EvDiG: Event-guided Direct and Global Components Separation

Differentiable Display Photometric Stereo

Bayesian Differentiable Physics for Cloth Digitalization

Atlantis: Enabling Underwater Depth Estimation with Stable Diffusion

Sparse Views Near Light: A Practical Paradigm for Uncalibrated Point-light Photometric Stereo

Diffusion Reflectance Map: Single-Image Stochastic Inverse Rendering of Illumination and Reflectance

Deep Single Image Camera Calibration by Heatmap Regression to Recover Fisheye Images Under Manhattan World Assumption

Physics-guided Shape-from-Template: Monocular Video Perception through Neural Surrogate Models

Spin-UP: Spin Light for Natural Light Uncalibrated Photometric Stereo

Discontinuity-preserving Normal Integration with Auxiliary Edges

A Theory of Joint Light and Heat Transport for Lambertian Scenes

IDGuard: Robust General Identity-centric POI Proactive Defense Against Face Editing Abuse

Ungeneralizable Examples

Distilled Datamodel with Reverse Gradient Matching

EditGuard: Versatile Image Watermarking for Tamper Localization and Copyright Protection

SocialCounterfactuals: Probing and Mitigating Intersectional Social Biases in Vision-Language Models with Counterfactual Examples

FedAS: Bridging Inconsistency in Personalized Federated Learning

FairRAG: Fair Human Generation via Fair Retrieval Augmentation

Self-Discovering Interpretable Diffusion Latent Directions for Responsible Text-to-Image Generation

ExMap: Leveraging Explainability Heatmaps for Unsupervised Group Robustness to Spurious Correlations

Data Valuation and Detections in Federated Learning

Utility-Fairness Trade-Offs and How to Find Them

SimAC: A Simple Anti-Customization Method for Protecting Face Privacy against Text-to-Image Synthesis of Diffusion Models

GLOW: Global Layout Aware Attacks on Object Detection

FADES: Fair Disentanglement with Sensitive Relevance

Fair Federated Learning under Domain Skew with Local Consistency and Domain Diversity

WateRF: Robust Watermarks in Radiance Fields for Protection of Copyrights

FLHetBench: Benchmarking Device and State Heterogeneity in Federated Learning

An Upload-Efficient Scheme for Transferring Knowledge From a Server-Side Pre-trained Generator to Clients in Heterogeneous Federated Learning

Privacy-Preserving Optics for Enhancing Protection in Face De-Identification

A Stealthy Wrongdoer: Feature-Oriented Reconstruction Attack against Split Learning

RCL: Reliable Continual Learning for Unified Failure Detection

Global and Local Prompts Cooperation via Optimal Transport for Federated Learning

Gaussian Shading: Provable Performance-Lossless Image Watermarking for Diffusion Models

Explaining CLIP's Performance Disparities on Data from Blind/Low Vision Users

Model Inversion Robustness: Can Transfer Learning Help?

Make Me a BNN: A Simple Strategy for Estimating Bayesian Uncertainty from Pre-trained Models

Validating Privacy-Preserving Face Recognition under a Minimum Assumption

Re-thinking Data Availability Attacks Against Deep Neural Networks

OpenBias: Open-set Bias Detection in Text-to-Image Generative Models

In-distribution Public Data Synthesis with Diffusion Models for Differentially Private Image Classification

Leak and Learn: An Attacker's Cookbook to Train Using Leaked Data from Federated Learning

Countering Personalized Text-to-Image Generation with Influence Watermarks

Fair-VPT: Fair Visual Prompt Tuning for Image Classification

Relaxed Contrastive Learning for Federated Learning

FairCLIP: Harnessing Fairness in Vision-Language Learning

Steganographic Passport: An Owner and User Verifiable Credential for Deep Model IP Protection Without Retraining

Adaptive Hyper-graph Aggregation for Modality-Agnostic Federated Learning

Navigate Beyond Shortcuts: Debiased Learning Through the Lens of Neural Collapse

Enhancing Intrinsic Features for Debiasing via Investigating Class-Discerning Common Attributes in Bias-Contrastive Pair

Device-Wise Federated Network Pruning

All Rivers Run to the Sea: Private Learning with Asymmetric Flows

VA3: Virtually Assured Amplification Attack on Probabilistic Copyright Protection for Text-to-Image Generative Models

CPR: Retrieval Augmented Generation for Copyright Protection

Communication-Efficient Federated Learning with Accelerated Client Gradient

Self-supervised Debiasing Using Low Rank Regularization

Facial Identity Anonymization via Intrinsic and Extrinsic Attention Distraction

Collaborative Learning of Anomalies with Privacy (CLAP) for Unsupervised Video Anomaly Detection: A New Baseline

Label-Efficient Group Robustness via Out-of-Distribution Concept Curation

Long-Tailed Anomaly Detection with Learnable Class Names

Robust Emotion Recognition in Context Debiasing

Correlation-Decoupled Knowledge Distillation for Multimodal Sentiment Analysis with Incomplete Modalities

An Edit Friendly DDPM Noise Space: Inversion and Manipulations

SleepVST: Sleep Staging from Near-Infrared Video Signals using Pre-Trained Transformers

AM-RADIO: Agglomerative Vision Foundation Model Reduce All Domains Into One

Towards Language-Driven Video Inpainting via Multimodal Large Language Models

FedSOL: Stabilized Orthogonal Learning with Proximal Restrictions in Federated Learning

UnionFormer: Unified-Learning Transformer with Multi-View Representation for Image Manipulation Detection and Localization

Motion Blur Decomposition with Cross-shutter Guidance

SNIDA: Unlocking Few-Shot Object Detection with Non-linear Semantic Decoupling Augmentation

Rapid 3D Model Generation with Intuitive 3D Input

SketchINR: A First Look into Sketches as Implicit Neural Representations

ERMVP: Communication-Efficient and Collaboration-Robust Multi-Vehicle Perception in Challenging Environments

DiaLoc: An Iterative Approach to Embodied Dialog Localization

WildlifeMapper: Aerial Image Analysis for Multi-Species Detection and Identification

Harnessing Meta-Learning for Improving Full-Frame Video Stabilization

De-confounded Data-free Knowledge Distillation for Handling Distribution Shifts

Day-Night Cross-domain Vehicle Re-identification

Brush2Prompt: Contextual Prompt Generator for Object Inpainting

Cloud-Device Collaborative Learning for Multimodal Large Language Models

Making Visual Sense of Oracle Bones for You and Me

Boosting Object Detection with Zero-Shot Day-Night Domain Adaptation

InNeRF360: Text-Guided 3D-Consistent Object Inpainting on 360-degree Neural Radiance Fields

Language Models as Black-Box Optimizers for Vision-Language Models

Mind Marginal Non-Crack Regions: Clustering-Inspired Representation Learning for Crack Segmentation

InstructDiffusion: A Generalist Modeling Interface for Vision Tasks

Desigen: A Pipeline for Controllable Design Template Generation

Physical Backdoor: Towards Temperature-based Backdoor Attacks in the Physical World

Behind the Veil: Enhanced Indoor 3D Scene Reconstruction with Occluded Surfaces Completion

EarthLoc: Astronaut Photography Localization by Indexing Earth from Space

DiffForensics: Leveraging Diffusion Prior to Image Forgery Detection and Localization

MuseChat: A Conversational Music Recommendation System for Videos

The Unreasonable Effectiveness of Pre-Trained Features for Camera Pose Refinement

Blind Image Quality Assessment Based on Geometric Order Learning

CrowdDiff: Multi-hypothesis Crowd Density Estimation using Diffusion Models

Towards Efficient Replay in Federated Incremental Learning

MART: Masked Affective RepresenTation Learning via Masked Temporal Distribution Distillation

PolarRec: Improving Radio Interferometric Data Reconstruction Using Polar Coordinates

Constrained Layout Generation with Factor Graphs

Visual In-Context Prompting

Traceable Federated Continual Learning

Interactive Continual Learning: Fast and Slow Thinking

PIGEON: Predicting Image Geolocations

LQMFormer: Language-aware Query Mask Transformer for Referring Image Segmentation

ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts

DePT: Decoupled Prompt Tuning

Grounded Question-Answering in Long Egocentric Videos

HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data

ViTamin: Designing Scalable Vision Models in the Vision-Language Era

The Manga Whisperer: Automatically Generating Transcriptions for Comics

Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs

The Neglected Tails in Vision-Language Models

Unveiling Parts Beyond Objects: Towards Finer-Granularity Referring Expression Segmentation

GLaMM: Pixel Grounding Large Multimodal Model

Alpha-CLIP: A CLIP Model Focusing on Wherever You Want

Pixel-Aligned Language Model

mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration

SNIFFER: Multimodal Large Language Model for Explainable Out-of-Context Misinformation Detection

Towards CLIP-driven Language-free 3D Visual Grounding via 2D-3D Relational Enhancement and Consistency

SC-Tune: Unleashing Self-Consistent Referential Comprehension in Large Vision Language Models

V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs

Improved Visual Grounding through Self-Consistent Explanations

Distilling Vision-Language Models on Millions of Videos

Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language

Referring Image Editing: Object-level Image Editing via Referring Expressions

Vision-and-Language Navigation via Causal Learning

VISTA-LLAMA: Reducing Hallucination in Video Language Models via Equal Distance to Visual Tokens

Ranking Distillation for Open-Ended Video Question Answering with Insufficient Labels

CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor

Quilt-LLaVA: Visual Instruction Tuning by Extracting Localized Narratives from Open-Source Histopathology Videos

Aligning and Prompting Everything All at Once for Universal Visual Perception

Can I Trust Your Answer? Visually Grounded Video Question Answering

Prompt Highlighter: Interactive Control for Multi-Modal LLMs

Language-only Training of Zero-shot Composed Image Retrieval

MoReVQA: Exploring Modular Reasoning Models for Video Question Answering

Let's Think Outside the Box: Exploring Leap-of-Thought in Large Language Models with Creative Humor Generation

CLOVA: A Closed-LOop Visual Assistant with Tool Usage and Update

Naturally Supervised 3D Visual Grounding with Language-Regularized Concept Learners

Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding

AssistGUI: Task-Oriented PC Graphical User Interface Automation

SEED-Bench: Benchmarking Multimodal Large Language Models

Unknown Prompt, the only Lacuna: Unveiling CLIP's Potential for Open Domain Generalization

Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers

Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation

Causal-CoG: A Causal-Effect Look at Context Generation for Boosting Multi-modal Language Models

Posterior Distillation Sampling

Towards More Unified In-context Visual Understanding

Mask4Align: Aligned Entity Prompting with Color Masks for Multi-Entity Localization Problems

SOK-Bench: A Situated Video Reasoning Benchmark with Aligned Open-World Knowledge

Align and Aggregate: Compositional Reasoning with Video Alignment and Answer Aggregation for Video Question-Answering

Segment and Caption Anything

OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation

Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning

Revisiting Counterfactual Problems in Referring Expression Comprehension

ScanFormer: Referring Expression Comprehension by Iteratively Scanning

See, Say, and Segment: Teaching LMMs to Overcome False Premises

SignGraph: A Sign Sequence is Worth Graphs of Nodes

Enhancing Vision-Language Pre-training with Rich Supervisions

De-Diffusion Makes Text a Strong Cross-Modal Interface

MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

Incorporating Geo-Diverse Knowledge into Prompting for Increased Geographical Robustness in Object Recognition

Retrieval-Augmented Egocentric Video Captioning

Towards Better Vision-Inspired Vision-Language Models

PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs

Polos: Multimodal Metric Learning from Human Feedback for Image Captioning

Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding

Koala: Key Frame-Conditioned Long Video-LLM

Generating Enhanced Negatives for Training Language-Based Object Detectors

Non-autoregressive Sequence-to-Sequence Vision-Language Models

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

Synthesize Step-by-Step: Tools, Templates and LLMs as Data Generators for Reasoning-Based Chart VQA

Towards Learning a Generalist Model for Embodied Navigation

Previously on ... From Recaps to Story Summarization

MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning

BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning

Holistic Autonomous Driving Understanding by Bird’s-Eye-View Injected Multi-Modal Large Models

Situational Awareness Matters in 3D Vision Language Reasoning

Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs

SRTube: Video-Language Pre-Training with Action-Centric Video Tube Features and Semantic Role Labeling

Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding

Curriculum Point Prompting for Weakly-Supervised Referring Image Segmentation

Tune-An-Ellipse: CLIP Has Potential to Find What You Want

EVCap: Retrieval-Augmented Image Captioning with External Visual-Name Memory for Open-World Comprehension

Plug-and-Play Diffusion Distillation

Lookahead Exploration with Neural Radiance Representation for Continuous Vision-Language Navigation

Low-Rank Approximation for Sparse Attention in Multi-Modal LLMs

Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding

Iterated Learning Improves Compositionality in Large Vision-Language Models

RegionGPT: Towards Region Understanding Vision Language Model

RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback

Honeybee: Locality-enhanced Projector for Multimodal LLM

E-GPS: Explainable Geometry Problem Solving via Top-Down Solver and Bottom-Up Generator

Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs

Any-Shift Prompting for Generalization over Distributions

Question Aware Vision Transformer for Multimodal Reasoning

Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding

Text-Image Alignment for Diffusion-Based Perception

Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval

FairDeDup: Detecting and Mitigating Vision-Language Fairness Disparities in Semantic Dataset Deduplication

G^3-LQ: Marrying Hyperbolic Alignment with Explicit Semantic-Geometric Modeling for 3D Visual Grounding

LISA: Reasoning Segmentation via Large Language Model

VideoCon: Robust Video-Language Alignment via Contrast Captions

Taming Self-Training for Open-Vocabulary Object Detection

SyncMask: Synchronized Attentional Masking for Fashion-centric Vision-Language Pretraining

Generative Region-Language Pretraining for Open-Ended Object Detection

CoG-DQA: Chain-of-Guiding Learning with Large Language Models for Diagram Question Answering

Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception

Generate Subgoal Images before Act: Unlocking the Chain-of-Thought Reasoning in Diffusion Model for Robot Manipulation with Multimodal Prompts

LoSh: Long-Short Text Joint Prediction Network for Referring Video Object Segmentation

MICap: A Unified Model for Identity-Aware Movie Descriptions

CapsFusion: Rethinking Image-Text Data at Scale

Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation

VidLA: Video-Language Alignment at Scale

Viewpoint-Aware Visual Grounding in 3D Scenes

Multi-Modal Proxy Learning Towards Personalized Visual Multiple Clustering

Jack of All Tasks, Master of Many: Designing General-Purpose Coarse-to-Fine Vision-Language Model

LLaMA-Excitor: General Instruction Tuning via Indirect Feature Interaction

MeaCap: Memory-Augmented Zero-shot Image Captioning

The STVchrono Dataset: Towards Continuous Change Recognition in Time

InstaGen: Enhancing Object Detection by Training on Synthetic Dataset

MiKASA: Multi-Key-Anchor & Scene-Aware Transformer for 3D Visual Grounding

Investigating Compositional Challenges in Vision-Language Models for Visual Grounding

Masked AutoDecoder is Effective Multi-Task Vision Generalist

Efficient Test-Time Adaptation of Vision-Language Models

FFF: Fixing Flawed Foundations in Contrastive Pre-Training Results in Very Strong Vision-Language Models

Open3DSG: Open-Vocabulary 3D Scene Graphs from Point Clouds with Queryable Objects and Open-Set Relationships

Instance-level Expert Knowledge and Aggregate Discriminative Attention for Radiology Report Generation

Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts

Building Vision-Language Models on Solid Foundations with Masked Distillation

GROUNDHOG: Grounding Large Language Models to Holistic Segmentation

Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models

DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback

LASO: Language-guided Affordance Segmentation on 3D Object

Omni-Q: Omni-Directional Scene Understanding for Unsupervised Visual Grounding

VTimeLLM: Empower LLM to Grasp Video Moments

CogAgent: A Visual Language Model for GUI Agents

EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models

Multi-Modal Hallucination Control by Visual Information Grounding

TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding

AHIVE: Anatomy-aware Hierarchical Vision Encoding for Interactive Radiology Report Retrieval

Do Vision and Language Encoders Represent the World Similarly?

Self-Training Large Language Models for Improved Visual Program Synthesis With Visual Reinforcement

Composing Object Relations and Attributes for Image-Text Matching

Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions

HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models

A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames

Generative Multimodal Models are In-Context Learners

A Vision Check-up for Language Models

Compositional Chain-of-Thought Prompting for Large Multimodal Models

On Scaling Up a Multilingual Vision and Language Model

Dual-View Visual Contextualization for Web Navigation

SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities

Beyond Seen Primitive Concepts and Attribute-Object Compositional Learning

(ends 12:00 PM)

11:30 a.m.

Talk:

Doctoral Consortium

(ends 1:30 PM)

noon

Break:

Lunch

(ends 2:00 PM)

1 p.m.

Orals 4A Autonomous navigation and egocentric vision [1:00-2:30]

Orals 1:00-2:30

[1:00] SAFDNet: A Simple and Effective Network for Fully Sparse 3D Object Detection

[1:18] UnO: Unsupervised Occupancy Fields for Perception and Forecasting

[1:36] EgoGen: An Egocentric Synthetic Data Generator

[1:54] Learning to Segment Referred Objects from Narrated Egocentric Videos

[2:12] Producing and Leveraging Online Map Uncertainty in Trajectory Prediction

(ends 2:30 PM)

Orals 4B 3D Vision [1:00-2:30]

Orals 1:00-2:30

[1:00] SceneFun3D: Fine-Grained Functionality and Affordance Understanding in 3D Scenes

[1:18] SpiderMatch: 3D Shape Matching with Global Optimality and Geometric Consistency

[1:36] PaSCo: Urban 3D Panoptic Scene Completion with Uncertainty Awareness

[1:54] PlatoNeRF: 3D Reconstruction in Plato's Cave via Single-View Two-Bounce Lidar

[2:12] A Subspace-Constrained Tyler's Estimator and its Applications to Structure from Motion

(ends 2:30 PM)

Orals 4C Action and motion [1:00-2:30]

Orals 1:00-2:30

[1:00] Modeling Multimodal Social Interactions: New Challenges and Baselines with Densely Aligned Representations

[1:18] An N-Point Linear Solver for Line and Motion Estimation with Event Cameras

[1:36] RoHM: Robust Human Motion Reconstruction via Diffusion

[1:54] Temporally Consistent Unbalanced Optimal Transport for Unsupervised Action Segmentation

[2:12] FineParser: A Fine-grained Spatio-temporal Action Parser for Human-centric Action Quality Assessment

(ends 2:30 PM)

1:30 p.m.

Art Program Panel Discussion [1:30-2:30]

(ends 2:30 PM)

2:30 p.m.

Break:

Courtesy Break

(ends 2:45 PM)

2:45 p.m.

Keynote:

Design of New Protein Functions Using Deep Learning

David Baker

(ends 3:45 PM)

3:45 p.m.

Break:

Courtesy Break

(ends 4:00 PM)

4 p.m.

Meeting:

PAMI TC Meeting

(ends 5:00 PM)

5 p.m.

Poster Session 4 & Exhibit Hall [5:00-6:30]

Posters 5:00-6:30

PaSCo: Urban 3D Panoptic Scene Completion with Uncertainty Awareness

PlatoNeRF: 3D Reconstruction in Plato's Cave via Single-View Two-Bounce Lidar

Self-Supervised Class-Agnostic Motion Prediction with Spatial and Temporal Consistency Regularizations

Multi-Space Alignments Towards Universal LiDAR Segmentation

Generalized Predictive Model for Autonomous Driving

Visual Point Cloud Forecasting enables Scalable Autonomous Driving

SeMoLi: What Moves Together Belongs Together

AIDE: An Automatic Data Engine for Object Detection in Autonomous Driving

Dynamic Adapter Meets Prompt Tuning: Parameter-Efficient Transfer Learning for Point Cloud Analysis

BEVSpread: Spread Voxel Pooling for Bird’s-Eye-View Representation in Vision-based Roadside 3D Object Detection

DualAD: Disentangling the Dynamic and Static World for End-to-End Driving

Towards Realistic Scene Generation with LiDAR Diffusion Models

Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving

VLP: Vision Language Planning for Autonomous Driving

Scaling Diffusion Models to Real-World 3D LiDAR Scene Completion

UniMix: Towards Domain Adaptive and Generalizable LiDAR Semantic Segmentation in Adverse Weather

Not All Voxels Are Equal: Hardness-Aware Semantic Scene Completion with Self-Distillation

OOSTraj: Out-of-Sight Trajectory Prediction With Vision-Positioning Denoising

MGMap: Mask-Guided Learning for Online Vectorized HD Map Construction

Density-Adaptive Model Based on Motif Matrix for Multi-Agent Trajectory Prediction

StreamingFlow: Streaming Occupancy Forecasting with Asynchronous Multi-modal Data Streams via Neural Ordinary Differential Equation

SAFDNet: A Simple and Effective Network for Fully Sparse 3D Object Detection

View From Above: Orthogonal-View aware Cross-view Localization

Improving Distant 3D Object Detection Using 2D Box Supervision

Is Ego Status All You Need for Open-Loop End-to-End Autonomous Driving?

CaDeT: a Causal Disentanglement Approach for Robust Trajectory Prediction in Autonomous Driving

Adversarial Backdoor Attack by Naturalistic Data Poisoning on Trajectory Prediction in Autonomous Driving

NeuRAD: Neural Rendering for Autonomous Driving

IS-Fusion: Instance-Scene Collaborative Fusion for Multimodal 3D Object Detection

LSK3DNet: Towards Effective and Efficient 3D Perception with Large Sparse Kernels

RCBEVDet: Radar-camera Fusion in Bird's Eye View for 3D Object Detection

PTT: Point-Trajectory Transformer for Efficient Temporal 3D Object Detection

Driving Everywhere with Large Language Model Policy Adaptation

Text2Loc: 3D Point Cloud Localization from Natural Language

Commonsense Prototype for Outdoor Unsupervised 3D Object Detection

A-Teacher: Asymmetric Network for 3D Semi-Supervised Object Detection

MoST: Multi-Modality Scene Tokenization for Motion Prediction

Feedback-Guided Autonomous Driving

Bootstrapping Autonomous Driving Radars with Self-Supervised Learning

UnO: Unsupervised Occupancy Fields for Perception and Forecasting

SIRA: Scalable Inter-frame Relation and Association for Radar Perception

SparseOcc: Rethinking Sparse Latent Representation for Vision-Based Semantic Occupancy Prediction

DiffLoc: Diffusion Model for Outdoor LiDAR Localization

Weak-to-Strong 3D Object Detection with X-Ray Distillation

T4P: Test-Time Training of Trajectory Prediction via Masked Autoencoder and Actor-specific Token Memory

Editable Scene Simulation for Autonomous Driving via Collaborative LLM-Agents

Uncertainty-Guided Never-Ending Learning to Drive

On the Road to Portability: Compressing End-to-End Motion Planner for Autonomous Driving

DifFlow3D: Toward Robust Uncertainty-Aware Scene Flow Estimation with Iterative Diffusion-Based Refinement

LMDrive: Closed-Loop End-to-End Driving with Large Language Models

SOAC: Spatio-Temporal Overlap-Aware Multi-Sensor Calibration using Neural Radiance Fields

LaMPilot: An Open Benchmark Dataset for Autonomous Driving with Language Model Programs

GLiDR: Topologically Regularized Graph Generative Network for Sparse LiDAR Point Clouds

Towards Robust 3D Object Detection with LiDAR and 4D Radar Fusion in Various Weather Conditions

3DSFLabelling: Boosting 3D Scene Flow Estimation by Pseudo Auto-labelling

ADA-Track: End-to-End Multi-Camera 3D Multi-Object Tracking with Alternating Detection and Association

PointBeV: A Sparse Approach for BeV Predictions

Light the Night: A Multi-Condition Diffusion Framework for Unpaired Low-Light Enhancement in Autonomous Driving

CLIP-BEVFormer: Enhancing Multi-View Image-Based BEV Detector with Ground Truth Flow

Adapting to Length Shift: FlexiLength Network for Trajectory Prediction

UniPAD: A Universal Pre-training Paradigm for Autonomous Driving

Higher-order Relational Reasoning for Pedestrian Trajectory Prediction

HPNet: Dynamic Trajectory Forecasting with Historical Prediction Attention

LiSA: LiDAR Localization with Semantic Awareness

SmartRefine: A Scenario-Adaptive Refinement Framework for Efficient Motion Prediction

Pseudo Label Refinery for Unsupervised Domain Adaptation on Cross-dataset 3D Object Detection

Multi-agent Collaborative Perception via Motion-aware Robust Communication Network

TASeg: Temporal Aggregation Network for LiDAR Semantic Segmentation

HINTED: Hard Instance Enhanced Detector with Mixed-Density Feature Fusion for Sparsely-Supervised 3D Object Detection

CaKDP: Category-aware Knowledge Distillation and Pruning Framework for Lightweight 3D Object Detection

Diffusion-ES: Gradient-free Planning with Diffusion for Autonomous and Instruction-guided Driving

TULIP: Transformer for Upsampling of LiDAR Point Clouds

Bézier Everywhere All at Once: Learning Drivable Lanes as Bézier Graphs

Flow-Guided Online Stereo Rectification for Wide Baseline Stereo

LASIL: Learner-Aware Supervised Imitation Learning For Long-term Microscopic Traffic Simulation

HIMap: HybrId Representation Learning for End-to-end Vectorized HD Map Construction

RadSimReal: Bridging the Gap Between Synthetic and Real Data in Radar Object Detection With Simulation

3D LiDAR Mapping in Dynamic Environments using a 4D Implicit Neural Representation

Quantifying Uncertainty in Motion Prediction with Variational Bayesian Mixture

Continual Learning for Motion Prediction Model via Meta-Representation Learning and Optimal Memory Buffer Retention Strategy

PARA-Drive: Parallelized Architecture for Real-time Autonomous Driving

ChatScene: Knowledge-Enabled Safety-Critical Scenario Generation for Autonomous Vehicles

CRKD: Enhanced Camera-Radar Object Detection with Cross-modality Knowledge Distillation

Communication-Efficient Collaborative Perception via Information Filling with Codebook

RadarDistill: Boosting Radar-based Object Detection Performance via Knowledge Distillation from LiDAR Features

ICP-Flow: LiDAR Scene Flow Estimation with ICP

Improving Bird's Eye View Semantic Segmentation by Task Decomposition

DriveWorld: 4D Pre-trained Scene Understanding via World Models for Autonomous Driving

Producing and Leveraging Online Map Uncertainty in Trajectory Prediction

HRVDA: High-Resolution Visual Document Assistant

Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models

RoDLA: Benchmarking the Robustness of Document Layout Analysis Models

Multi-modal In-Context Learning Makes an Ego-evolving Scene Text Recognizer

CMA: A Chromaticity Map Adapter for Robust Detection of Screen-Recapture Document Images

ODM: A Text-Image Further Alignment Pre-training Approach for Scene Text Detection and Spotting

GRAM: Global Reasoning for Multi-Page VQA

Bridging the Gap Between End-to-End and Two-Step Text Spotting

An Empirical Study of Scaling Law for Scene Text Recognition

LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding

OmniParser: A Unified Framework for Text Spotting, Key Information Extraction and Table Recognition

DocRes: A Generalist Model Toward Unifying Document Image Restoration Tasks

LayoutFormer: Hierarchical Text Detection Towards Scene Text Understanding

Generating Handwritten Mathematical Expressions From Symbol Graphs: An End-to-End Pipeline

OpenESS: Event-based Semantic Scene Understanding with Open Vocabularies

PELA: Learning Parameter-Efficient Models with Low-Rank Approximation

MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer

VkD: Improving Knowledge Distillation using Orthogonal Projections

Logit Standardization in Knowledge Distillation

Multi-criteria Token Fusion with One-step-ahead Attention for Efficient Vision Transformers

ParameterNet: Parameters Are All You Need for Large-scale Visual Pretraining of Mobile Networks

DeepCache: Accelerating Diffusion Models for Free

ALGM: Adaptive Local-then-Global Token Merging for Efficient Semantic Segmentation with Plain Vision Transformers

A General and Efficient Training for Transformer via Token Expansion

Efficient Dataset Distillation via Minimax Diffusion

PEM: Prototype-based Efficient MaskFormer for Image Segmentation

Transferable and Principled Efficiency for Open-Vocabulary Segmentation

Dense Vision Transformer Compression with Few Samples

Dr2Net: Dynamic Reversible Dual-Residual Networks for Memory-Efficient Finetuning

MaxQ: Multi-Axis Query for N:M Sparsity Network

Retraining-Free Model Quantization via One-Shot Weight-Coupling Learning

LORS: Low-rank Residual Structure for Parameter-Efficient Network Stacking

Towards High-fidelity Artistic Image Vectorization via Texture-Encapsulated Shape Parameterization

Learning Vision from Models Rivals Learning Vision from Data

Efficient Multitask Dense Predictor via Binarization

RepViT: Revisiting Mobile CNN From ViT Perspective

Enhancing Post-training Quantization Calibration through Contrastive Learning

FreeKD: Knowledge Distillation via Semantic Frequency Prompt

PTQ4SAM: Post-Training Quantization for Segment Anything

CLIP-KD: An Empirical Study of CLIP Model Distillation

MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training

Scaled Decoupled Distillation

Self-Distilled Masked Auto-Encoders are Efficient Video Anomaly Detectors

PikeLPN: Mitigating Overlooked Inefficiencies of Low-Precision Neural Networks

C2KD: Bridging the Modality Gap for Cross-Modal Knowledge Distillation

KD-DETR: Knowledge Distillation for Detection Transformer with Consistent Distillation Points Sampling

Towards Accurate Post-training Quantization for Diffusion Models

CURSOR: Scalable Mixed-Order Hypergraph Matching with CUR Decomposition

Frozen Feature Augmentation for Few-Shot Image Classification

Jointly Training and Pruning CNNs via Learnable Agent Guidance and Alignment

Zero-TPrune: Zero-Shot Token Pruning through Leveraging of the Attention Graph in Pre-Trained Transformers

Attention-Driven Training-Free Efficiency Enhancement of Diffusion Models

BilevelPruning: Unified Dynamic and Static Channel Pruning for Convolutional Neural Networks

Low-Rank Rescaled Vision Transformer Fine-Tuning: A Residual Design Approach

EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything

FlashEval: Towards Fast and Accurate Evaluation of Text-to-image Diffusion Generative Models

Instance-Aware Group Quantization for Vision Transformers

Finding Lottery Tickets in Vision Models via Data-driven Spectral Foresight Pruning

Joint-Task Regularization for Partially Labeled Multi-Task Learning

Auto-Train-Once: Controller Network Guided Automatic Network Pruning from Scratch

Reg-PTQ: Regression-specialized Post-training Quantization for Fully Quantized Object Detector

MULTIFLOW: Shifting Towards Task-Agnostic Vision-Language Pruning

MTLoRA: Low-Rank Adaptation Approach for Efficient Multi-Task Learning

Resource-Efficient Transformer Pruning for Finetuning of Large Models

Promptable Behaviors: Personalizing Multi-Objective Rewards from Human Preferences

Holodeck: Language Guided Generation of 3D Embodied AI Environments

SPOC: Imitating Shortest Paths in Simulation Enables Effective Navigation and Manipulation in the Real World

RILA: Reflective and Imaginative Language Agent for Zero-Shot Semantic Audio-Visual Navigation

PhyScene: Physically Interactable 3D Scene Synthesis for Embodied AI

Seeing the Unseen: Visual Common Sense for Semantic Placement

LEMON: Learning 3D Human-Object Interaction Relation from 2D Images

OVER-NAV: Elevating Iterative Vision-and-Language Navigation with Open-Vocabulary Detection and StructurEd Representation

MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception

Volumetric Environment Representation for Vision-Language Navigation

Instance-aware Exploration-Verification-Exploitation for Instance ImageGoal Navigation

UniGarmentManip: A Unified Framework for Category-Level Garment Manipulation via Dense Visual Correspondence

Evidential Active Recognition: Intelligent and Prudent Open-World Embodied Perception

GenH2R: Learning Generalizable Human-to-Robot Handover via Scalable Simulation Demonstration and Imitation

GOAT-Bench: A Benchmark for Multi-Modal Lifelong Navigation

Habitat Synthetic Scenes Dataset (HSSD-200): An Analysis of 3D Scene Scale and Realism Tradeoffs for ObjectGoal Navigation

Active Open-Vocabulary Recognition: Let Intelligent Moving Mitigate CLIP Limitations

Rapid Motor Adaptation for Robotic Manipulator Arms

Imagine Before Go: Self-Supervised Generative Map for Object Goal Navigation

Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft

GenNBV: Generalizable Next-Best-View Policy for Active 3D Reconstruction

An Interactive Navigation Method with Effect-oriented Affordance

A Category Agnostic Model for Visual Rearrangment

SkillDiffuser: Interpretable Hierarchical Planning via Skill Abstractions in Diffusion-Based Task Execution

Fusing Personal and Environmental Cues for Identification and Segmentation of First-Person Camera Wearers in Third-Person Views

OpenEQA: Embodied Question Answering in the Era of Foundation Models

Model Adaptation for Time Constrained Embodied Control

EgoGen: An Egocentric Synthetic Data Generator

RoHM: Robust Human Motion Reconstruction via Diffusion

An N-Point Linear Solver for Line and Motion Estimation with Event Cameras

A Subspace-Constrained Tyler's Estimator and its Applications to Structure from Motion

SpiderMatch: 3D Shape Matching with Global Optimality and Geometric Consistency

You'll Never Walk Alone: A Sketch and Text Duet for Fine-Grained Image Retrieval

CrossKD: Cross-Head Knowledge Distillation for Object Detection

ProTeCt: Prompt Tuning for Taxonomic Open Set Classification

CAT: Exploiting Inter-Class Dynamics for Domain Adaptive Object Detection

Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval

UniMODE: Unified Monocular 3D Object Detection

OVMR: Open-Vocabulary Recognition with Multi-Modal References

From Isolated Islands to Pangea: Unifying Semantic Space for Human Action Understanding

Language-conditioned Detection Transformer

Distribution-aware Knowledge Prototyping for Non-exemplar Lifelong Person Re-identification

Learning Continual Compatible Representation for Re-indexing Free Lifelong Person Re-identification

Active Object Detection with Knowledge Aggregation and Distillation from Large Models

SHiNe: Semantic Hierarchy Nexus for Open-vocabulary Object Detection

Object Recognition as Next Token Prediction

Exploring the Potential of Large Foundation Models for Open-Vocabulary HOI Detection

Gradient Reweighting: Towards Imbalanced Class-Incremental Learning

Learning Background Prompts to Discover Implicit Knowledge for Open Vocabulary Object Detection

Multi-View Attentive Contextualization for Multi-View 3D Object Detection

RealNet: A Feature Selection Network with Realistic Synthetic Anomaly for Anomaly Detection

Generalized Large-Scale Data Condensation via Various Backbone and Statistical Matching

Unleashing Unlabeled Data: A Paradigm for Cross-View Geo-Localization

PointOBB: Learning Oriented Object Detection via Single Point Supervision

Scene-adaptive and Region-aware Multi-modal Prompt for Open Vocabulary Object Detection

Revisiting the Domain Shift and Sample Uncertainty in Multi-source Active Domain Transfer

Hyperbolic Learning with Synthetic Captions for Open-World Detection

CricaVPR: Cross-image Correlation-aware Representation Learning for Visual Place Recognition

Point2RBox: Combine Knowledge from Synthetic Visual Patterns for End-to-end Oriented Object Detection with Single Point Supervision

Scene Adaptive Sparse Transformer for Event-based Object Detection

Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image Retrieval

Preserving Fairness Generalization in Deepfake Detection

Text-to-Image Diffusion Models are Great Sketch-Photo Matchmakers

PromptAD: Learning Prompts with only Normal Samples for Few-Shot Anomaly Detection

Structured Model Probing: Empowering Efficient Transfer Learning by Structured Regularization

How to Handle Sketch-Abstraction in Sketch-Based Image Retrieval?

Shallow-Deep Collaborative Learning for Unsupervised Visible-Infrared Person Re-Identification

Solving the Catastrophic Forgetting Problem in Generalized Category Discovery

Active Generalized Category Discovery

YOLO-World: Real-Time Open-Vocabulary Object Detection

Theoretically Achieving Continuous Representation of Oriented Bounding Boxes

Decoupled Pseudo-labeling for Semi-Supervised Monocular 3D Object Detection

LEOD: Label-Efficient Object Detection for Event Cameras

Lane2Seq: Towards Unified Lane Detection via Sequence Generation

Open-World Human-Object Interaction Detection via Multi-modal Prompts

DETRs Beat YOLOs on Real-time Object Detection

Exploring Region-Word Alignment in Built-in Detector for Open-Vocabulary Object Detection

Referring Expression Counting

ActiveDC: Distribution Calibration for Active Finetuning

LaRE^2: Latent Reconstruction Error Based Method for Diffusion-Generated Image Detection

Fine-grained Prototypical Voting with Heterogeneous Mixup for Semi-supervised 2D-3D Cross-modal Retrieval

MS-DETR: Efficient DETR Training with Mixed Supervision

Context-based and Diversity-driven Specificity in Compositional Zero-Shot Learning

Pixel-level Semantic Correspondence through Layout-aware Representation Learning and Multi-scale Matching Integration

Exploiting Inter-sample and Inter-feature Relations in Dataset Distillation

Point, Segment and Count: A Generalized Framework for Object Counting

Dual Pose-invariant Embeddings: Learning Category and Object-specific Discriminative Representations for Recognition and Retrieval

Riemannian Multinomial Logistics Regression for SPD Neural Networks

Learning for Transductive Threshold Calibration in Open-World Recognition

Region-Based Representations Revisited

Magic Tokens: Select Diverse Tokens for Multi-modal Object Re-Identification

Harnessing the Power of MLLMs for Transferable Text-to-Image Person ReID

Holistic Features are almost Sufficient for Text-to-Video Retrieval

Enhancing the Power of OOD Detection via Sample-Aware Model Selection

PanoOcc: Unified Occupancy Representation for Camera-based 3D Panoptic Segmentation

VSCode: General Visual Salient and Camouflaged Object Detection with 2D Prompt Learning

D3still: Decoupled Differential Distillation for Asymmetric Image Retrieval

SFOD: Spiking Fusion Object Detector

Depth-Aware Concealed Crop Detection in Dense Agricultural Scenes

Extreme Point Supervised Instance Segmentation

Enhance Image Classification via Inter-Class Image Mixup with Diffusion Model

Multimodal Industrial Anomaly Detection by Crossmodal Feature Mapping

Effective Video Mirror Detection with Inconsistent Motion Cues

Multi-Attribute Interactions Matter for 3D Visual Grounding

Looking 3D: Anomaly Detection with 2D-3D Alignment

Characteristics Matching Based Hash Codes Generation for Efficient Fine-grained Image Retrieval

EASE-DETR: Easing the Competition among Object Queries

ProS: Prompting-to-simulate Generalized knowledge for Universal Cross-Domain Retrieval

Exploring Orthogonality in Open World Object Detection

A Generative Approach for Wikipedia-Scale Visual Entity Recognition

Unleashing Channel Potential: Space-Frequency Selection Convolution for SAR Object Detection

Hyperspherical Classification with Dynamic Label-to-Prototype Assignment

A Pedestrian is Worth One Prompt: Towards Language Guidance Person Re-Identification

VSRD: Instance-Aware Volumetric Silhouette Rendering for Weakly Supervised 3D Object Detection

Improving Visual Recognition with Hyperbolical Visual Hierarchy Mapping

On Train-Test Class Overlap and Detection for Image Retrieval

Multi-Scale Video Anomaly Detection by Multi-Grained Spatio-Temporal Representation Learning

LAA-Net: Localized Artifact Attention Network for Quality-Agnostic and Generalizable Deepfake Detection

Rethinking Boundary Discontinuity Problem for Oriented Object Detection

Hybrid Proposal Refiner: Revisiting DETR Series from the Faster R-CNN Perspective

Retrieval-Augmented Open-Vocabulary Object Detection

LiDAR-based Person Re-identification

EventDance: Unsupervised Source-free Cross-modal Adaptation for Event-based Object Recognition

All in One Framework for Multimodal Re-identification in the Wild

Logarithmic Lenses: Exploring Log RGB Data for Image Classification

ID-like Prompt Learning for Few-Shot Out-of-Distribution Detection

Infrared Small Target Detection with Scale and Location Sensitivity

SURE: SUrvey REcipes for building reliable and robust deep networks

Hyperbolic Anomaly Detection

Instruct-ReID: A Multi-purpose Person Re-identification Task with Instructions

CA-Jaccard: Camera-aware Jaccard Distance for Person Re-identification

Improved Zero-Shot Classification by Adapting VLMs with Text Descriptions

Modeling Collaborator: Enabling Subjective Vision Classification With Minimal Human Effort via LLM Tool-Use

Neural Exposure Fusion for High-Dynamic Range Object Detection

Salience DETR: Enhancing Detection Transformer with Hierarchical Salience Filtering Refinement

Learning Transferable Negative Prompts for Out-of-Distribution Detection

TransLoc4D: Transformer-based 4D Radar Place Recognition

Prompt-Driven Dynamic Object-Centric Learning for Single Domain Generalization

Anomaly Heterogeneity Learning for Open-set Supervised Anomaly Detection

Contrastive Learning for DeepFake Classification and Localization via Multi-Label Ranking

Adaptive Softassign via Hadamard-Equipped Sinkhorn

An Asymmetric Augmented Self-Supervised Learning Method for Unsupervised Fine-Grained Image Hashing

Optimal Transport Aggregation for Visual Place Recognition

Atom-Level Optical Chemical Structure Recognition with Limited Supervision

Novel Class Discovery for Ultra-Fine-Grained Visual Categorization

Attribute-Guided Pedestrian Retrieval: Bridging Person Re-ID with Internal Attribute Variability

Robust Noisy Correspondence Learning with Equivariant Similarity Consistency

Bootstrapping SparseFormers from Vision Foundation Models

Not All Classes Stand on Same Embeddings: Calibrating a Semantic Distance with Metric Tensor

Improving Single Domain-Generalized Object Detection: A Focus on Diversification and Alignment

On the Estimation of Image-matching Uncertainty in Visual Place Recognition

Supervised Anomaly Detection for Complex Industrial Images

Fourier-basis Functions to Bridge Augmentation Gap: Rethinking Frequency Augmentation in Image Classification

TransNeXt: Robust Foveal Visual Perception for Vision Transformers

Plug and Play Active Learning for Object Detection

BoQ: A Place is Worth a Bag of Learnable Queries

From Coarse to Fine-Grained Open-Set Recognition

Exploring Pose-Aware Human-Object Interaction via Hybrid Learning

Toward Generalist Anomaly Detection via In-context Residual Learning with Few-shot Sample Prompts

Learning to Navigate Efficiently and Precisely in Real Environments

Task-Conditioned Adaptation of Visual Features in Multi-Task Policy Learning

FastMAC: Stochastic Spectral Sampling of Correspondence Graph

FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects

CAGE: Controllable Articulation GEneration

SingularTrajectory: Universal Trajectory Predictor Using Diffusion Model

Language-driven Grasp Detection

MemoNav: Working Memory Model for Visual Navigation

NOPE: Novel Object Pose Estimation from a Single Image

Dexterous Grasp Transformer

Versatile Navigation Under Partial Observability via Value-guided Diffusion Policy

CyberDemo: Augmenting Simulated Human Demonstration for Real-World Dexterous Manipulation

SchurVINS: Schur Complement-Based Lightweight Visual Inertial Navigation System

READ: Retrieval-Enhanced Asymmetric Diffusion for Motion Planning

Retrieval-Augmented Embodied Agents

Collaborative Semantic Occupancy Prediction with Hybrid Feature Fusion in Connected Automated Vehicles

Diffusion-EDFs: Bi-equivariant Denoising Generative Modeling on SE(3) for Visual Robotic Manipulation

Adaptive VIO: Deep Visual-Inertial Odometry with Online Continual Learning

F3Loc: Fusion and Filtering for Floorplan Localization

Gaussian Splatting SLAM

SUGAR: Pre-training 3D Visual Representations for Robotics

ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation

Open-Vocabulary Object 6D Pose Estimation

Hierarchical Diffusion Policy for Kinematics-Aware Multi-Task Robotic Manipulation

Smart Help: Strategic Opponent Modeling for Proactive and Adaptive Robot Assistance in Households

Generalizing 6-DoF Grasp Detection via Domain Prior Knowledge

A Simple and Effective Point-based Network for Event Camera 6-DOFs Pose Relocalization

Neural Visibility Field for Uncertainty-Driven Active Mapping

SPIN: Simultaneous Perception, Interaction and Navigation

SceneFun3D: Fine-Grained Functionality and Affordance Understanding in 3D Scenes

PredToken: Predicting Unknown Tokens and Beyond with Coarse-to-Fine Iterative Decoding

TIM: A Time Interval Machine for Audio-Visual Action Recognition

AutoAD III: The Prequel – Back to the Pixels

FACT: Frame-Action Cross-Attention Temporal Modeling for Efficient Action Segmentation

Progress-Aware Online Action Segmentation for Egocentric Procedural Task Videos

Video ReCap: Recursive Captioning of Hour-Long Videos

OmniViD: A Generative Framework for Universal Video Understanding

MovieChat: From Dense Token to Sparse Memory for Long Video Understanding

Learning Group Activity Features Through Person Attribute Prediction

Streaming Dense Video Captioning

Efficient and Effective Weakly-Supervised Action Segmentation via Action-Transition-Aware Boundary Alignment

Benchmarking the Robustness of Temporal Action Detection Models Against Temporal Corruptions

A Backpack Full of Skills: Egocentric Video Understanding with Diverse Task Perspectives

Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction Anticipation

Open-Vocabulary Video Anomaly Detection

Task-Driven Exploration: Decoupling and Inter-Task Feedback for Joint Moment Retrieval and Highlight Detection

Prompt-Enhanced Multiple Instance Learning for Weakly Supervised Video Anomaly Detection

Context-Guided Spatio-Temporal Video Grounding

Just Add π! Pose Induced Video Transformers for Understanding Activities of Daily Living

Action Detection via an Image Diffusion Process

LLMs are Good Sign Language Translators

End-to-End Spatio-Temporal Action Localisation with Video Transformers

HIG: Hierarchical Interlacement Graph Approach to Scene Graph Generation in Video Understanding

LLMs are Good Action Recognizers

VideoLLM-online: Online Video Large Language Model for Streaming Video

What, When, and Where? Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions

Narrative Action Evaluation with Prompt-Guided Multimodal Interaction

Realigning Confidence with Temporal Saliency Information for Point-Level Weakly-Supervised Temporal Action Localization

Action-slot: Visual Action-centric Representations for Multi-label Atomic Activity Recognition in Traffic Scenes

LoCoNet: Long-Short Context Network for Active Speaker Detection

Neighbor Relations Matter in Video Scene Detection

PREGO: Online Mistake Detection in PRocedural EGOcentric Videos

Learning Object State Changes in Videos: An Open-World Perspective

Enhanced Motion-Text Alignment for Image-to-Video Transfer Learning

Asymmetric Masked Distillation for Pre-Training Small Foundation Models

Harnessing Large Language Models for Training-free Video Anomaly Detection

SportsHHI: A Dataset for Human-Human Interaction Detection in Sports Videos

VicTR: Video-conditioned Text Representations for Activity Recognition

Dual DETRs for Multi-Label Temporal Action Detection

Adapting Short-Term Transformers for Action Detection in Untrimmed Videos

Can’t Make an Omelette Without Breaking Some Eggs: Plausible Action Anticipation Using Large Video-Language Models

End-to-End Temporal Action Detection with 1B Parameters Across 1000 Frames

RMem: Restricted Memory Banks Improve Video Object Segmentation

Low-power Continuous Remote Behavioral Localization with Event Cameras

Action Scene Graphs for Long-Form Understanding of Egocentric Videos

ExACT: Language-guided Conceptual Reasoning and Uncertainty Estimation for Event-based Action Recognition and More

Uncertainty-aware Action Decoupling Transformer for Action Anticipation

Error Detection in Egocentric Procedural Task Videos

Learning to Predict Activity Progress by Self-Supervised Video Alignment

MaskCLR: Attention-Guided Contrastive Learning for Robust Action Representation Learning

Align Before Adapt: Leveraging Entity-to-Region Alignments for Generalizable Video Action Recognition

DIBS: Enhancing Dense Video Captioning with Unlabeled Videos via Pseudo Boundary Enrichment and Online Refinement

Bridging the Gap: A Unified Video Comprehension Framework for Moment Retrieval and Highlight Detection

Test-Time Zero-Shot Temporal Action Localization

Selective, Interpretable, and Motion Consistent Privacy Attribute Obfuscation for Action Recognition

Step Differences in Instructional Video

Compositional Video Understanding with Spatiotemporal Structure-based Transformers

Temporally Consistent Unbalanced Optimal Transport for Unsupervised Action Segmentation

FineParser: A Fine-grained Spatio-temporal Action Parser for Human-centric Action Quality Assessment

Part-aware Unified Representation of Language and Skeleton for Zero-shot Action Recognition

vid-TLDR: Training Free Token Merging for Light-weight Video Transformer

CPR-Coach: Recognizing Composite Error Actions based on Single-class Training

Uncovering What, Why and How: A Comprehensive Benchmark for Causation Understanding of Video Anomaly

Detours for Navigating Instructional Videos

Why Not Use Your Textbook? Knowledge-Enhanced Procedure Planning of Instructional Videos

Multiscale Vision Transformers Meet Bipartite Matching for Efficient Single-stage Action Localization

TE-TAD: Towards Full End-to-End Temporal Action Detection via Time-Aligned Coordinate Expression

CSTA: CNN-based Spatiotemporal Attention for Video Summarization

PeVL: Pose-Enhanced Vision-Language Model for Fine-Grained Human Action Recognition

MULDE: Multiscale Log-Density Estimation via Denoising Score Matching for Video Anomaly Detection

Language Model Guided Interpretable Video Action Reasoning

OST: Refining Text Knowledge with Optimal Spatio-Temporal Descriptor for General Video Recognition

Text Prompt with Normality Guidance for Weakly Supervised Video Anomaly Detection

VideoGrounding-DINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding

Unsupervised Video Domain Adaptation with Masked Pre-Training and Collaborative Self-Training

SnAG: Scalable and Accurate Video Grounding

Learning Correlation Structures for Vision Transformers

Weakly-Supervised Audio-Visual Video Parsing with Prototype-based Pseudo-Labeling

Matching Anything by Segmenting Anything

3D Feature Tracking via Event Camera

Frequency Decoupling for Motion Magnification via Multi-Level Isomorphic Architecture

Towards Generalizable Multi-Object Tracking

SocialCircle: Learning the Angle-based Social Interaction Representation for Pedestrian Trajectory Prediction

Self-Supervised Multi-Object Tracking with Path Consistency

UnSAMFlow: Unsupervised Optical Flow Guided by Segment Anything Model

RTracker: Recoverable Tracking via PN Tree Structured Memory

ARTrackV2: Prompting Autoregressive Tracker Where to Look and How to Describe

Endow SAM with Keen Eyes: Temporal-spatial Prompt Learning for Video Camouflaged Object Detection

MemFlow: Optical Flow Estimation and Prediction with Memory

OneTracker: Unifying Visual Object Tracking with Foundation Models and Efficient Tuning

Learned Trajectory Embedding for Subspace Clustering

PNeRV: Enhancing Spatial Consistency via Pyramidal Neural Representation for Videos

DiffusionTrack: Point Set Diffusion Model for Visual Object Tracking

Sparse Global Matching for Video Frame Interpolation with Large Motion

iKUN: Speak to Trackers without Retraining

NetTrack: Tracking Highly Dynamic Objects with a Net

Single-Model and Any-Modality for Video Object Tracking

FlowDiffuser: Advancing Optical Flow Estimation with Diffusion Models

Video Harmonization with Triplet Spatio-Temporal Variation Patterns

Dense Optical Tracking: Connecting the Dots

Efficient Meshflow and Optical Flow Estimation from Event Cameras

Context-Aware Integration of Language and Visual References for Natural Language Tracking

Depth-aware Test-Time Training for Zero-shot Video Object Segmentation

Weakly Supervised Video Individual Counting

Dual Prototype Attention for Unsupervised Video Object Segmentation

Event Stream-based Visual Object Tracking: A High-Resolution Benchmark Dataset and A Novel Baseline

HIPTrack: Visual Tracking with Historical Prompts

FlowTrack: Revisiting Optical Flow for Long-Range Dense Tracking

Implicit Motion Function

DeconfuseTrack: Dealing with Confusion for Multi-Object Tracking

Autoregressive Queries for Adaptive Tracking with Spatio-Temporal Transformers

ExtDM: Distribution Extrapolation Diffusion Model for Video Prediction

DiffMOT: A Real-time Diffusion-based Multiple Object Tracker with Non-linear Prediction

GigaTraj: Predicting Long-term Trajectories of Hundreds of Pedestrians in Gigapixel Complex Scenes

Delving into the Trajectory Long-tail Distribution for Muti-object Tracking

OCAI: Improving Optical Flow Estimation by Occlusion and Consistency Aware Interpolation

Modeling Multimodal Social Interactions: New Challenges and Baselines with Densely Aligned Representations

Learning to Segment Referred Objects from Narrated Egocentric Videos

(ends 6:30 PM)

7 p.m.

Reception:

Reception & Musical Performances

(ends 9:00 PM)

FRI 21 JUN

8 a.m.

Registration / Badge Pickup

(ends 2:00 PM)

Break:

Breakfast

(ends 9:30 AM)

9 a.m.

Expo Track Keynote:

Phase Transition in AI: Opportunities and Gaps Towards Making AI Real

Ece Kamar

(ends 10:00 AM)

Orals 5A Datasets and evaluation [9:00-10:30]

[9:00] Deep Generative Model based Rate-Distortion for Image Downscaling Assessment

[9:18] 360+x: A Panoptic Multi-modal Scene Understanding Dataset

[9:36] Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

[9:54] Rich Human Feedback for Text-to-Image Generation

[10:12] BioCLIP: A Vision Foundation Model for the Tree of Life

(ends 10:30 AM)

Orals 5B 3D from multiview and sensors [9:00-10:30]

[9:00] Grounding and Enhancing Grid-based Models for Neural Fields

[9:18] NeRF-HuGS: Improved Neural Radiance Fields in Non-static Scenes Using Heuristics-Guided Segmentation

[9:36] Mip-Splatting: Alias-free 3D Gaussian Splatting

[9:54] pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction

[10:12] Learning to Produce Semi-dense Correspondences for Visual Localization

(ends 10:30 AM)

Orals 5C Low-shot, self-supervised, semi-supervised learning [9:00-10:30]

[9:00] CroSel: Cross Selection of Confident Pseudo Labels for Partial-Label Learning

[9:18] MLP Can Be A Good Transformer Learner

[9:36] From SAM to CAMs: Exploring Segment Anything Model for Weakly Supervised Semantic Segmentation

[9:54] LTGC: Long-tail Recognition via Leveraging LLMs-driven Generated Content

[10:12] Improving Semantic Correspondence with Viewpoint-Guided Spherical Maps

(ends 10:30 AM)

10:30 a.m.

Art Program [10:30-6:45]

(ends 6:45 PM)

Poster Session 5 & Exhibit Hall [10:30-12:00]

TeMO: Towards Text-Driven 3D Stylization for Multi-Object Meshes

Event-based Structure-from-Orbit

Towards Large-scale 3D Representation Learning with Multi-dataset Point Prompt Training

LidaRF: Delving into Lidar for Neural Radiance Field on Street Scenes

Instantaneous Perception of Moving Objects in 3D

Implicit Event-RGBD Neural SLAM

GS-SLAM: Dense Visual SLAM with 3D Gaussian Splatting

Learning Instance-Aware Correspondences for Robust Multi-Instance Point Cloud Registration in Cluttered Scenes

MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers

Multi-Session SLAM with Differentiable Wide-Baseline Pose Optimization

SHINOBI: Shape and Illumination using Neural Object Decomposition via BRDF Optimization In-the-wild

HybridNeRF: Efficient Neural Rendering via Adaptive Volumetric Surfaces

PLGSLAM: Progressive Neural Scene Represenation with Local to Global Bundle Adjustment

Gear-NeRF: Free-Viewpoint Rendering and Tracking with Motion-aware Spatio-Temporal Sampling

GPS-Gaussian: Generalizable Pixel-wise 3D Gaussian Splatting for Real-time Human Novel View Synthesis

HyperSDFusion: Bridging Hierarchical Structures in Language and Geometry for Enhanced 3D Text2Shape Generation

Selective-Stereo: Adaptive Frequency Information Selection for Stereo Matching

Animatable Gaussians: Learning Pose-dependent Gaussian Maps for High-fidelity Human Avatar Modeling

Global Latent Neural Rendering

HiFi4G: High-Fidelity Human Performance Rendering via Compact Gaussian Splatting

LoS: Local Structure-Guided Stereo Matching

EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI

Masked Spatial Propagation Network for Sparsity-Adaptive Depth Refinement

CausalPC: Improving the Robustness of Point Cloud Classification by Causal Effect Identification

RoMa: Robust Dense Feature Matching

MVHumanNet: A Large-scale Dataset of Multi-view Daily Dressing Human Captures

GES: Generalized Exponential Splatting for Efficient Radiance Field Rendering

RegionPLC: Regional Point-Language Contrastive Learning for Open-World 3D Scene Understanding

NeLF-Pro: Neural Light Field Probes for Multi-Scale Novel View Synthesis

LEAP-VO: Long-term Effective Any Point Tracking for Visual Odometry

FAR: Flexible, Accurate and Robust 6DoF Relative Camera Pose Estimation

OmniGlue: Generalizable Feature Matching with Foundation Model Guidance

GART: Gaussian Articulated Template Models

CG-HOI: Contact-Guided 3D Human-Object Interaction Generation

FutureHuman3D: Forecasting Complex Long-Term 3D Human Behavior from Video Observations

PI3D: Efficient Text-to-3D Generation with Pseudo-Image Diffusion

Building a Strong Pre-Training Baseline for Universal 3D Large-Scale Perception

COTR: Compact Occupancy TRansformer for Vision-based 3D Occupancy Prediction

SelfOcc: Self-Supervised Vision-Based 3D Occupancy Prediction

UnScene3D: Unsupervised 3D Instance Segmentation for Indoor Scenes

NEAT: Distilling 3D Wireframes from Neural Attraction Fields

NeRF-HuGS: Improved Neural Radiance Fields in Non-static Scenes Using Heuristics-Guided Segmentation

3DInAction: Understanding Human Actions in 3D Point Clouds

Dynamic LiDAR Re-simulation using Compositional Neural Fields

Inverse Rendering of Glossy Objects via the Neural Plenoptic Function and Radiance Fields

PanoPose: Self-supervised Relative Pose Estimation for Panoramic Images

GeoAuxNet: Towards Universal 3D Representation Learning for Multi-sensor Point Clouds

4K4D: Real-Time 4D View Synthesis at 4K Resolution

MuRF: Multi-Baseline Radiance Fields

LangSplat: 3D Language Gaussian Splatting

Bayes' Rays: Uncertainty Quantification for Neural Radiance Fields

Accelerating Neural Field Training via Soft Mining

CORE-MPI: Consistency Object Removal with Embedding MultiPlane Image

NECA: Neural Customizable Human Avatar

S-DyRF: Reference-Based Stylized Radiance Fields for Dynamic Scenes

BEVNeXt: Reviving Dense BEV Frameworks for 3D Object Detection

Bi-SSC: Geometric-Semantic Bidirectional Fusion for Camera-based 3D Semantic Scene Completion

Learning to Select Views for Efficient Multi-View Understanding

Outdoor Scene Extrapolation with Hierarchical Generative Cellular Automata

Spectrum AUC Difference (SAUCD): Human-aligned 3D Shape Evaluation

Federated Online Adaptation for Deep Stereo

Instruct 4D-to-4D: Editing 4D Scenes as Pseudo-3D Scenes Using 2D Diffusion

Real-time Acquisition and Reconstruction of Dynamic Volumes with Neural Structured Illumination

Unifying Correspondence, Pose and NeRF for Generalized Pose-Free Novel View Synthesis

GoMVS: Geometrically Consistent Cost Aggregation for Multi-View Stereo

MESA: Matching Everything by Segmenting Anything

OmniSDF: Scene Reconstruction using Omnidirectional Signed Distance Functions and Adaptive Binoctrees

MirageRoom: 3D Scene Segmentation with 2D Pre-trained Models by Mirage Projection

Robust Synthetic-to-Real Transfer for Stereo Matching

Symphonize 3D Semantic Scene Completion with Contextual Instance Queries

Differentiable Neural Surface Refinement for Modeling Transparent Objects

DeMatch: Deep Decomposition of Motion Field for Two-View Correspondence Learning

Is Vanilla MLP in Neural Radiance Field Enough for Few-shot View Synthesis?

GaussianAvatars: Photorealistic Head Avatars with Rigged 3D Gaussians

4D Gaussian Splatting for Real-Time Dynamic Scene Rendering

How Far Can We Compress Instant-NGP-Based NeRF?

Deformable 3D Gaussians for High-Fidelity Monocular Dynamic Scene Reconstruction

Learning with Unreliability: Fast Few-shot Voxel Radiance Fields with Relative Geometric Consistency

NTO3D: Neural Target Object 3D Reconstruction with Segment Anything

Loopy-SLAM: Dense Neural SLAM with Loop Closures

BSNet: Box-Supervised Simulation-assisted Mean Teacher for 3D Instance Segmentation

ExtraNeRF: Visibility-Aware View Extrapolation of Neural Radiance Fields with Diffusion Models

Alpha Invariance: On Inverse Scaling Between Distance and Volume Density in Neural Radiance Fields

SpatialTracker: Tracking Any 2D Pixels in 3D Space

GauHuman: Articulated Gaussian Splatting from Monocular Human Videos

IPoD: Implicit Field Learning with Point Diffusion for Generalizable 3D Object Reconstruction from Single RGB-D Images

GOV-NeSF: Generalizable Open-Vocabulary Neural Semantic Fields

LASA: Instance Reconstruction from Real Scans using A Large-scale Aligned Shape Annotation Dataset

GenZI: Zero-Shot 3D Human-Scene Interaction Generation

MVCPS-NeuS: Multi-view Constrained Photometric Stereo for Neural Surface Reconstruction

DVMNet: Computing Relative Pose for Unseen Objects Beyond Hypotheses

Motion2VecSets: 4D Latent Vector Set Diffusion for Non-rigid Shape Reconstruction and Tracking

DiffuScene: Denoising Diffusion Models for Generative Indoor Scene Synthesis

Test-Time Adaptation for Depth Completion

Global and Hierarchical Geometry Consistency Priors for Few-shot NeRFs in Indoor Scenes

KP-RED: Exploiting Semantic Keypoints for Joint 3D Shape Retrieval and Deformation

Unsigned Orthogonal Distance Fields: An Accurate Neural Implicit Representation for Diverse 3D Shapes

DiSR-NeRF: Diffusion-Guided View-Consistent Super-Resolution NeRF

BANF: Band-Limited Neural Fields for Levels of Detail Reconstruction

SuperNormal: Neural Surface Reconstruction via Multi-View Normal Integration

ADFactory: An Effective Framework for Generalizing Optical Flow with NeRF

Dr.Hair: Reconstructing Scalp-Connected Hair Strands without Pre-Training via Differentiable Rendering of Line Segments

OmniSeg3D: Omniversal 3D Segmentation via Hierarchical Contrastive Learning

Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding

GEARS: Local Geometry-aware Hand-object Interaction Synthesis

Edge-Aware 3D Instance Segmentation Network with Intelligent Semantic Prior

Scaffold-GS: Structured 3D Gaussians for View-Adaptive Rendering

Map-Relative Pose Regression for Visual Re-Localization

3DGStream: On-the-Fly Training of 3D Gaussians for Efficient Streaming of Photo-Realistic Free-Viewpoint Videos

Revisiting Global Translation Estimation with Feature Tracks

DUSt3R: Geometric 3D Vision Made Easy

Robust Depth Enhancement via Polarization Prompt Fusion Tuning

StraightPCF: Straight Point Cloud Filtering

NeRFiller: Completing Scenes via Generative 3D Inpainting

NeRF Director: Revisiting View Selection in Neural Volume Rendering

Learning Intra-view and Cross-view Geometric Knowledge for Stereo Matching

Sherpa3D: Boosting High-Fidelity Text-to-3D Generation via Coarse 3D Prior

DNGaussian: Optimizing Sparse-View 3D Gaussian Radiance Fields with Global-Local Depth Normalization

A Conditional Denoising Diffusion Probabilistic Model for Point Cloud Upsampling

COLMAP-Free 3D Gaussian Splatting

GSNeRF: Generalizable Semantic Neural Radiance Fields with Enhanced 3D Scene Understanding

Extend Your Own Correspondences: Unsupervised Distant Point Cloud Registration by Progressive Distance Extension

Fully Geometric Panoramic Localization

Multiway Point Cloud Mosaicking with Diffusion and Global Optimization

Mip-Splatting: Alias-free 3D Gaussian Splatting

Generative 3D Part Assembly via Part-Whole-Hierarchy Message Passing

Total-Decom: Decomposed 3D Scene Reconstruction with Minimal Interaction

Absolute Pose from One or Two Scaled and Oriented Features

DGC-GNN: Leveraging Geometry and Color Cues for Visual Descriptor-Free 2D-3D Matching

Entity-NeRF: Detecting and Removing Moving Entities in Urban Scenes

GaussianEditor: Editing 3D Gaussians Delicately with Text Instructions

The More You See in 2D, the More You Perceive in 3D

Multi-Scale 3D Gaussian Splatting for Anti-Aliased Rendering

Practical Measurements of Translucent Materials with Inter-Pixel Translucency Prior

OneFormer3D: One Transformer for Unified Point Cloud Segmentation

General Point Model Pretraining with Autoencoding and Autoregressive

MorpheuS: Neural Dynamic 360° Surface Reconstruction from Monocular RGB-D Video

pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction

Object Dynamics Modeling with Hierarchical Point Cloud-based Representations

Neural Refinement for Absolute Pose Regression with Feature Synthesis

Gaussian Shadow Casting for Neural Characters

PAPR in Motion: Seamless Point-level 3D Scene Interpolation

ShapeMatcher: Self-Supervised Joint Shape Canonicalization, Segmentation, Retrieval and Deformation

XScale-NVS: Cross-Scale Novel View Synthesis with Hash Featurized Manifold

Instance-Adaptive and Geometric-Aware Keypoint Learning for Category-Level 6D Object Pose Estimation

RepKPU: Point Cloud Upsampling with Kernel Point Representation and Deformation

ColorPCR: Color Point Cloud Registration with Multi-Stage Geometric-Color Fusion

ConsistDreamer: 3D-Consistent 2D Diffusion for High-Fidelity Scene Editing

SceneTex: High-Quality Texture Synthesis for Indoor Scenes via Diffusion Priors

Aerial Lifting: Neural Urban Semantic and Building Instance Lifting from Aerial Imagery

Improving Depth Completion via Depth Feature Upsampling

ZeroRF: Fast Sparse View 360° Reconstruction with Zero Pretraining

Multi-Level Neural Scene Graphs for Dynamic Urban Environments

Gaussian-Flow: 4D Reconstruction with Dynamic 3D Gaussian Particle

L4D-Track: Language-to-4D Modeling Towards 6-DoF Tracking and Shape Reconstruction in 3D Point Cloud Stream

Neural Directional Encoding for Efficient and Accurate View-Dependent Appearance Modeling

SNI-SLAM: Semantic Neural Implicit SLAM

Enhancing 3D Object Detection with 2D Detection-Guided Query Anchors

SpecNeRF: Gaussian Directional Encoding for Specular Reflections

Correspondence-Free Non-Rigid Point Set Registration Using Unsupervised Clustering Analysis

GAFusion: Adaptive Fusing LiDAR and Camera with Multiple Guidance for 3D Object Detection

3D Neural Edge Reconstruction

AlignMiF: Geometry-Aligned Multimodal Implicit Field for LiDAR-Camera Joint Synthesis

Polarization Wavefront Lidar: Learning Large Scene Reconstruction from Polarized Wavefronts

A Unified Diffusion Framework for Scene-aware Human Motion Estimation from Sparse Signals

FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models

NeRFCodec: Neural Feature Compression Meets Neural Radiance Fields for Memory-Efficient Scene Representation

Open-Vocabulary 3D Semantic Segmentation with Foundation Models

GraphDreamer: Compositional 3D Scene Synthesis from Scene Graphs

OA-CNNs: Omni-Adaptive Sparse CNNs for 3D Semantic Segmentation

Efficient Solution of Point-Line Absolute Pose

CN-RMA: Combined Network with Ray Marching Aggregation for 3D Indoor Object Detection from Multi-view Images

HUGS: Holistic Urban 3D Scene Understanding via Gaussian Splatting

Benchmarking Implicit Neural Representation and Geometric Rendering in Real-Time RGB-D SLAM

SplaTAM: Splat Track & Map 3D Gaussians for Dense RGB-D SLAM

Lift3D: Zero-Shot Lifting of Any 2D Vision Model to 3D

TutteNet: Injective 3D Deformations by Composition of 2D Mesh Deformations

L0-Sampler: An L0 Model Guided Volume Sampling for NeRF

Text-to-3D using Gaussian Splatting

TAMM: TriAdapter Multi-Modal Learning for 3D Shape Understanding

FreGS: 3D Gaussian Splatting with Progressive Frequency Regularization

NeISF: Neural Incident Stokes Field for Geometry and Material Estimation

Non-Rigid Structure-from-Motion: Temporally-Smooth Procrustean Alignment and Spatially-Variant Deformation Modeling

Small Steps and Level Sets: Fitting Neural Surface Models with Point Guidance

CVT-xRF: Contrastive In-Voxel Transformer for 3D Consistent Radiance Fields from Sparse Inputs

GaussianEditor: Swift and Controllable 3D Editing with Gaussian Splatting

Cam4DOcc: Benchmark for Camera-Only 4D Occupancy Forecasting in Autonomous Driving Applications

UDiFF: Generating Conditional Unsigned Distance Fields with Optimal Wavelet Diffusion

PanoRecon: Real-Time Panoptic 3D Reconstruction from Monocular Video

Three Pillars Improving Vision Foundation Model Distillation for Lidar

GARField: Group Anything with Radiance Fields

Flexible Depth Completion for Sparse and Varying Point Densities

ReconFusion: 3D Reconstruction with Diffusion Priors

GLACE: Global Local Accelerated Coordinate Encoding

NARUTO: Neural Active Reconstruction from Uncertain Target Observations

Photo-SLAM: Real-time Simultaneous Localization and Photorealistic Mapping for Monocular, Stereo, and RGB-D Cameras

Detector-Free Structure from Motion

Memory-based Adapters for Online 3D Scene Perception

SurroundSDF: Implicit 3D Scene Understanding Based on Signed Distance Field

CoGS: Controllable Gaussian Splatting

DrivingGaussian: Composite Gaussian Splatting for Surrounding Dynamic Autonomous Driving Scenes

GS-IR: 3D Gaussian Splatting for Inverse Rendering

Cross-spectral Gated-RGB Stereo Depth Estimation

Efficient LoFTR: Semi-Dense Local Feature Matching with Sparse-Like Speed

Feature 3DGS: Supercharging 3D Gaussian Splatting to Enable Distilled Feature Fields

VGGSfM: Visual Geometry Grounded Deep Structure From Motion

Dynamic Cues-Assisted Transformer for Robust Point Cloud Registration

Learning to Produce Semi-dense Correspondences for Visual Localization

GP-NeRF: Generalized Perception NeRF for Context-Aware 3D Scene Understanding

Compact 3D Gaussian Representation for Radiance Field

Unsupervised Occupancy Learning from Sparse Point Cloud

Grounding and Enhancing Grid-based Models for Neural Fields

TACO: Benchmarking Generalizable Bimanual Tool-ACtion-Object Understanding

ImageNet-D: Benchmarking Neural Network Robustness on Diffusion Synthetic Object

SynFog: A Photo-realistic Synthetic Fog Dataset based on End-to-end Imaging Simulation for Advancing Real-World Defogging in Autonomous Driving

FineSports: A Multi-person Hierarchical Sports Video Dataset for Fine-grained Action Understanding

Infinigen Indoors: Photorealistic Indoor Scenes using Procedural Generation

Probing the 3D Awareness of Visual Foundation Models

VBench: Comprehensive Benchmark Suite for Video Generative Models

MAPLM: A Real-World Large-Scale Vision-Language Benchmark for Map and Traffic Scene Understanding

Video Recognition in Portrait Mode

MMVP: A Multimodal MoCap Dataset with Vision and Pressure Sensors

What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models

COCONut: Modernizing COCO Segmentation

Traffic Scene Parsing through the TSP6K Dataset

Real Acoustic Fields: An Audio-Visual Room Acoustics Dataset and Benchmark

Rethinking the Evaluation Protocol of Domain Generalization

MMSum: A Dataset for Multimodal Summarization and Thumbnail Generation of Videos

Learning from Synthetic Human Group Activities

Instance Tracking in 3D Scenes from Egocentric Videos

Insect-Foundation: A Foundation Model and Large-scale 1M Dataset for Visual Insect Understanding

Low-Resource Vision Challenges for Foundation Models

OpenStreetView-5M: The Many Roads to Global Visual Geolocation

FreeMan: Towards Benchmarking 3D Human Pose Estimation under Real-World Conditions

LiDAR-Net: A Real-scanned 3D Point Cloud Dataset for Indoor Scenes

View-decoupled Transformer for Person Re-identification under Aerial-ground Camera Network

UFineBench: Towards Text-based Person Retrieval with Ultra-fine Granularity

Towards Automatic Power Battery Detection: New Challenge, Benchmark Dataset and Baseline

Abductive Ego-View Accident Video Understanding for Safe Driving Perception

Multiagent Multitraversal Multimodal Self-Driving: Open MARS Dataset

Towards Surveillance Video-and-Language Understanding: New Dataset, Baselines, and Challenges

Pre-training Vision Models with Mandelbulb Variations

EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World

JRDB-Social: A Multifaceted Robotic Dataset for Understanding of Context and Dynamics of Human Interactions Within Social Groups

Spectral and Polarization Vision: Spectro-polarimetric Real-world Dataset

MatSynth: A Modern PBR Materials Dataset

When Visual Grounding Meets Gigapixel-level Large-scale Scenes: Benchmark and Approach

HoloVIC: Large-scale Dataset and Benchmark for Multi-Sensor Holographic Intersection and Vehicle-Infrastructure Cooperative

EvalCrafter: Benchmarking and Evaluating Large Video Generation Models

Localization Is All You Evaluate: Data Leakage in Online Mapping Datasets and How to Fix It

DL3DV-10K: A Large-Scale Scene Dataset for Deep Learning-based 3D Vision

OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM

Can Biases in ImageNet Models Explain Generalization?

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

Towards Scalable 3D Anomaly Detection and Localization: A Benchmark via 3D Anomaly Synthesis and A Self-Supervised Learning Network

Point-VOS: Pointing Up Video Object Segmentation

GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation

ConCon-Chi: Concept-Context Chimera Benchmark for Personalized Vision-Language Tasks

FISBe: A Real-World Benchmark Dataset for Instance Segmentation of Long-Range Thin Filamentous Structures

Inter-X: Towards Versatile Human-Human Interaction Analysis

TextNeRF: A Novel Scene-Text Image Synthesis Method based on Neural Radiance Fields

Systematic Comparison of Semi-supervised and Self-supervised Learning for Medical Image Classification

Unexplored Faces of Robustness and Out-of-Distribution: Covariate Shifts in Environment and Sensor Domains

MCD: Diverse Large-Scale Multi-Campus Dataset for Robot Perception

360Loc: A Dataset and Benchmark for Omnidirectional Visual Localization with Cross-device Queries

Deep Generative Model based Rate-Distortion for Image Downscaling Assessment

JRDB-PanoTrack: An Open-world Panoptic Segmentation and Tracking Robotic Dataset in Crowded Human Environments

MTMMC: A Large-Scale Real-World Multi-Modal Camera Tracking Benchmark

RCooper: A Real-world Large-scale Dataset for Roadside Cooperative Perception

UVEB: A Large-scale Benchmark and Baseline Towards Real-World Underwater Video Enhancement

Real-World Mobile Image Denoising Dataset with Efficient Baselines

RGBD Objects in the Wild: Scaling Real-World 3D Object Learning from RGB-D Videos

Evaluating Transferability in Retrieval Tasks: An Approach Using MMD and Kernel Methods

BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation

MULAN: A Multi Layer Annotated Dataset for Controllable Text-to-Image Generation

Sieve: Multimodal Dataset Pruning using Image Captioning Models

Perceptual Assessment and Optimization of HDR Image Rendering

GlitchBench: Can Large Multimodal Models Detect Video Game Glitches?

WinSyn: A High Resolution Testbed for Synthetic Data

DiVa-360: The Dynamic Visual Dataset for Immersive Neural Fields

Learning Discriminative Dynamics with Label Corruption for Noisy Label Detection

DriveTrack: A Benchmark for Long-Range Point Tracking in Real-World Videos

HouseCat6D - A Large-Scale Multi-Modal Category Level 6D Object Perception Dataset with Household Objects in Realistic Scenarios

Benchmarking Segmentation Models with Mask-Preserved Attribute Editing

The Devil is in the Fine-Grained Details: Evaluating Open-Vocabulary Object Detectors for Fine-Grained Understanding

PKU-DyMVHumans: A Multi-View Video Benchmark for High-Fidelity Dynamic Human Modeling

Insights from the Use of Previously Unseen Neural Architecture Search Datasets

TULIP: Multi-camera 3D Precision Assessment of Parkinson’s Disease

LUWA Dataset: Learning Lithic Use-Wear Analysis on Microscopic Images

ShapeWalk: Compositional Shape Editing Through Language-Guided Chains

360+x: A Panoptic Multi-modal Scene Understanding Dataset

Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

Rich Human Feedback for Text-to-Image Generation

TRINS: Towards Multimodal Language Models that Can Read

MAGICK: A Large-scale Captioned Dataset from Matting Generated Images using Chroma Keying

EFHQ: Multi-purpose ExtremePose-Face-HQ dataset

How to Train Neural Field Representations: A Comprehensive Study and Benchmark

BioCLIP: A Vision Foundation Model for the Tree of Life

A Noisy Elephant in the Room: Is Your Out-of-Distribution Detector Robust to Label Noise?

eTraM: Event-based Traffic Monitoring Dataset

SubT-MRS Dataset: Pushing SLAM Towards All-weather Environments

MSU-4S - The Michigan State University Four Seasons Dataset

TUMTraf V2X Cooperative Perception Dataset

Multiview Aerial Visual RECognition (MAVREC): Can Multi-view Improve Aerial Visual Perception?

Towards Co-Evaluation of Cameras, HDR, and Algorithms for Industrial-Grade 6DoF Pose Estimation

Scaling Laws for Data Filtering -- Data Curation cannot be Compute Agnostic

Benchmarking Audio Visual Segmentation for Long-Untrimmed Videos

MLP Can Be A Good Transformer Learner

From SAM to CAMs: Exploring Segment Anything Model for Weakly Supervised Semantic Segmentation

Domain-Specific Block Selection and Paired-View Pseudo-Labeling for Online Test-Time Adaptation

VideoMAC: Video Masked Autoencoders Meet ConvNets

Unsupervised Universal Image Segmentation

VideoCutLER: Surprisingly Simple Unsupervised Video Instance Segmentation

What You See is What You GAN: Rendering Every Pixel for High-Fidelity Geometry in 3D GANs

SPOT: Self-Training with Patch-Order Permutation for Object-Centric Learning with Autoregressive Transformers

Unsupervised Learning of Category-Level 3D Pose from Object-Centric Videos

Distributionally Generative Augmentation for Fair Facial Attribute Classification

Estimating Noisy Class Posterior with Part-level Labels for Noisy Label Learning

Unsupervised Keypoints from Pretrained Diffusion Models

Learning to Rank Patches for Unbiased Image Redundancy Reduction

Rethinking the Representation in Federated Unsupervised Learning with Non-IID Data

GLID: Pre-training a Generalist Encoder-Decoder Vision Model

Sequential Modeling Enables Scalable Learning for Large Vision Models

VoCo: A Simple-yet-Effective Volume Contrastive Learning Framework for 3D Medical Image Analysis

Real-IAD: A Real-World Multi-View Dataset for Benchmarking Versatile Industrial Anomaly Detection

CroSel: Cross Selection of Confident Pseudo Labels for Partial-Label Learning

BEM: Balanced and Entropy-based Mix for Long-Tailed Semi-Supervised Learning

ReCoRe: Regularized Contrastive Representation Learning of World Model

Universal Novelty Detection Through Adaptive Contrastive Learning

Learning to Count without Annotations

Point Cloud Pre-training with Diffusion Models

Improving Unsupervised Hierarchical Representation with Reinforcement Learning

Investigating and Mitigating the Side Effects of Noisy Views for Self-Supervised Clustering Algorithms in Practical Multi-View Scenarios

Self-Supervised Representation Learning from Arbitrary Scenarios

Learning SO(3)-Invariant Semantic Correspondence via Local Shape Transform

A Bayesian Approach to OOD Robustness in Image Classification

Sculpting Holistic 3D Representation in Contrastive Language-Image-3D Pre-training

Solving Masked Jigsaw Puzzles with Diffusion Vision Transformers

DS-NeRV: Implicit Neural Video Representation with Decomposed Static and Dynamic Codes

Brain Decodes Deep Nets

Pose-Guided Self-Training with Two-Stage Clustering for Unsupervised Landmark Discovery

Mitigating Object Dependencies: Improving Point Cloud Self-Supervised Learning through Object Exchange

Adaptive Slot Attention: Object Discovery with Dynamic Slot Number

Targeted Representation Alignment for Open-World Semi-Supervised Learning

Hierarchical Correlation Clustering and Tree Preserving Embedding

Contrastive Mean-Shift Learning for Generalized Category Discovery

CuVLER: Enhanced Unsupervised Object Discoveries through Exhaustive Self-Supervised Transformers

SODA: Bottleneck Diffusion Models for Representation Learning

HPL-ESS: Hybrid Pseudo-Labeling for Unsupervised Event-based Semantic Segmentation

Positive-Unlabeled Learning by Latent Group-Aware Meta Disambiguation

Aligning Logits Generatively for Principled Black-Box Knowledge Distillation

Improving Semantic Correspondence with Viewpoint-Guided Spherical Maps

Neural Modes: Self-supervised Learning of Nonlinear Modal Subspaces

Decentralized Directed Collaboration for Personalized Federated Learning

Improving Graph Contrastive Learning via Adaptive Positive Sampling

Integrating Efficient Optimal Transport and Functional Maps For Unsupervised Shape Correspondence Learning

Unsupervised Feature Learning with Emergent Data-Driven Prototypicality

Label Propagation for Zero-shot Classification with Vision-Language Models

Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters

Backpropagation-free Network for 3D Test-time Adaptation

GDA: Generalized Diffusion for Robust Test-time Adaptation

Semantically-Shifted Incremental Adapter-Tuning is A Continual ViTransformer

Few-shot Learner Parameterization by Diffusion Time-steps

FREE: Faster and Better Data-Free Meta-Learning

Classes Are Not Equal: An Empirical Study on Image Recognition Fairness

DAVE - A Detect-and-Verify Paradigm for Low-Shot Counting

Density-guided Translator Boosts Synthetic-to-Real Unsupervised Domain Adaptive Segmentation of 3D Point Clouds

D3T: Distinctive Dual-Domain Teacher Zigzagging Across RGB-Thermal Gap for Domain-Adaptive Object Detection

AMU-Tuning: Effective Logit Bias for CLIP-based Few-shot Learning

LEAD: Learning Decomposition for Source-free Universal Domain Adaptation

Improving Generalized Zero-Shot Learning by Exploring the Diverse Semantics from External Class Names

What, How, and When Should Object Detectors Update in Continually Changing Test Domains?

Split to Merge: Unifying Separated Modalities for Unsupervised Domain Adaptation

Domain-Agnostic Mutual Prompting for Unsupervised Domain Adaptation

Improving the Generalization of Segmentation Foundation Model under Distribution Shift via Weakly Supervised Adaptation

DeiT-LT: Distillation Strikes Back for Vision Transformer Training on Long-Tailed Datasets

Unified Language-driven Zero-shot Domain Adaptation

Stable Neighbor Denoising for Source-free Domain Adaptive Segmentation

A Simple Recipe for Language-guided Domain Generalized Segmentation

TCP: Textual-based Class-aware Prompt tuning for Visual-Language Model

Adapters Strike Back

Improving Plasticity in Online Continual Learning via Collaborative Learning

Visual Prompting for Generalized Few-shot Segmentation: A Multi-scale Approach

Adaptive Random Feature Regularization on Fine-tuning Deep Neural Networks

ESCAPE: Encoding Super-keypoints for Category-Agnostic Pose Estimation

PracticalDG: Perturbation Distillation on Vision-Language Models for Hybrid Domain Generalization

Rethinking Multi-domain Generalization with A General Learning Objective

L2B: Learning to Bootstrap Robust Models for Combating Label Noise

Meta-Point Learning and Refining for Category-Agnostic Pose Estimation

A2XP: Towards Private Domain Generalization

Expandable Subspace Ensemble for Pre-Trained Model-Based Class-Incremental Learning

VRP-SAM: SAM with Visual Reference Prompt

Flatten Long-Range Loss Landscapes for Cross-Domain Few-Shot Learning

MAP: MAsk-Pruning for Source-Free Model Intellectual Property Protection

Disentangled Prompt Representation for Domain Generalization

Adapt Before Comparison: A New Perspective on Cross-Domain Few-Shot Segmentation

Convolutional Prompting meets Language Models for Continual Learning

Visual-Augmented Dynamic Semantic Prototype for Generative Zero-Shot Learning

InfLoRA: Interference-Free Low-Rank Adaptation for Continual Learning

Discriminative Pattern Calibration Mechanism for Source-Free Domain Adaptation

NICE: Neurogenesis Inspired Contextual Encoding for Replay-free Class Incremental Learning

Orchestrate Latent Expertise: Advancing Online Continual Learning with Multi-Level Supervision and Reverse Self-Distillation

A Closer Look at the Few-Shot Adaptation of Large Vision-Language Models

Towards Generalizing to Unseen Domains with Few Labels

Improved Self-Training for Test-Time Adaptation

Source-Free Domain Adaptation with Frozen Multimodal Foundation Model

Deep Imbalanced Regression via Hierarchical Classification Adjustment

A Versatile Framework for Continual Test-Time Domain Adaptation: Balancing Discriminability and Generalizability

DYSON: Dynamic Feature Space Self-Organization for Online Task-Free Class Incremental Learning

Test-Time Linear Out-of-Distribution Detection

LTGC: Long-tail Recognition via Leveraging LLMs-driven Generated Content

APSeg: Auto-Prompt Network for Cross-Domain Few-Shot Semantic Segmentation

LP++: A Surprisingly Strong Linear Probe for Few-Shot CLIP

On the Test-Time Zero-Shot Generalization of Vision-Language Models: Do We Really Need Prompt Learning?

Discriminative Sample-Guided and Parameter-Efficient Feature Space Adaptation for Cross-Domain Few-Shot Learning

Regularized Parameter Uncertainty for Improving Generalization in Reinforcement Learning

An Empirical Study of the Generalization Ability of Lidar 3D Object Detectors to Unseen Domains

MMA: Multi-Modal Adapter for Vision-Language Models

PerAda: Parameter-Efficient Federated Learning Personalization with Generalization Guarantees

Bayesian Exploration of Pre-trained Models for Low-shot Image Classification

NAYER: Noisy Layer Data Generation for Efficient and Effective Data-free Knowledge Distillation

Text-Enhanced Data-free Approach for Federated Class-Incremental Learning

Pre-trained Vision and Language Transformers Are Few-Shot Incremental Learners

CDMAD: Class-Distribution-Mismatch-Aware Debiasing for Class-Imbalanced Semi-Supervised Learning

TEA: Test-time Energy Adaptation

Universal Semi-Supervised Domain Adaptation by Mitigating Common-Class Bias

Leveraging Vision-Language Models for Improving Domain Generalization in Image Classification

Learning Equi-angular Representations for Online Continual Learning

Open-Set Domain Adaptation for Semantic Segmentation

Task-Adaptive Saliency Guidance for Exemplar-free Class Incremental Learning

Progressive Semantic-Guided Vision Transformer for Zero-Shot Learning

Unified Entropy Optimization for Open-Set Test-Time Adaptation

FedSelect: Personalized Federated Learning with Customized Selection of Parameters for Fine-Tuning

Dual-Enhanced Coreset Selection with Class-wise Collaboration for Online Blurry Class Incremental Learning

Troika: Multi-Path Cross-Modal Traction for Compositional Zero-Shot Learning

Unveiling the Unknown: Unleashing the Power of Unknown to Known in Open-Set Source-Free Domain Adaptation

Dual-Consistency Model Inversion for Non-Exemplar Class Incremental Learning

Domain-Rectifying Adapter for Cross-Domain Few-Shot Segmentation

Overcoming Generic Knowledge Loss with Selective Parameter Update

BrainWash: A Poisoning Attack to Forget in Continual Learning

Enhancing Visual Continual Learning with Language-Guided Supervision

(ends 12:00 PM)

Demonstration:

Demos

(ends 6:45 PM)

noon

Break:

Lunch

(ends 2:00 PM)

1 p.m.

Orals 6A Low-level vision and remote sensing [1:00-2:30]

Orals 1:00-2:30

[1:00] LDP: Language-driven Dual-Pixel Image Defocus Deblurring Network

[1:18] S2MAE: A Spatial-Spectral Pretraining Foundation Model for Spectral Remote Sensing Data

[1:36] Task-Driven Wavelets using Constrained Empirical Risk Minimization

[1:54] Image Processing GNN: Breaking Rigidity in Super-Resolution

[2:12] DART: Implicit Doppler Tomography for Radar Novel View Synthesis

(ends 2:30 PM)

Orals 6B Image & Video Synthesis [1:00-2:30]

Orals 1:00-2:30

[1:00] Alchemist: Parametric Control of Material Properties with Diffusion Models

[1:18] Generative Image Dynamics

[1:36] Visual Anagrams: Generating Multi-View Optical Illusions with Diffusion Models

[1:54] MonoHair: High-Fidelity Hair Modeling from a Monocular Video

[2:12] Analyzing and Improving the Training Dynamics of Diffusion Models

(ends 2:30 PM)

Orals 6C Multi-modal learning [1:00-2:30]

Orals 1:00-2:30

[1:00] InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

[1:18] Describing Differences in Image Sets with Natural Language

[1:36] NoiseCLR: A Contrastive Learning Approach for Unsupervised Discovery of Interpretable Directions in Diffusion Models

[1:54] MetaCloak: Preventing Unauthorized Subject-driven Text-to-image Diffusion-based Synthesis via Meta-learning

[2:12] EGTR: Extracting Graph from Transformer for Scene Graph Generation

(ends 2:30 PM)

2:30 p.m.

Break:

Courtesy Break

(ends 2:45 PM)

2:45 p.m.

Keynote:

Entanglements, Exploring Artificial Biodiversity

Sofia Crespo

(ends 3:45 PM)

3:45 p.m.

Break:

Courtesy Break

(ends 4:00 PM)

4 p.m.

Panel:

CVPR: past, present, and future

Dima Damen · Cordelia Schmid · Ranjay Krishna

(ends 5:00 PM)

5 p.m.

Poster Session 6 & Exhibit Hall [5:00-6:30]

Posters 5:00-6:30

MonoHair: High-Fidelity Hair Modeling from a Monocular Video

BadCLIP: Trigger-Aware Prompt Learning for Backdoor Attacks on CLIP

Semantic-Aware Multi-Label Adversarial Attacks

Defense without Forgetting: Continual Adversarial Defense with Anisotropic & Isotropic Pseudo Replay

Learning to Transform Dynamically for Better Adversarial Transferability

Infrared Adversarial Car Stickers

Unsegment Anything by Simulating Deformation

Efficient Model Stealing Defense with Noise Transition Matrix

Fully Exploiting Every Real Sample: SuperPixel Sample Gradient Model Stealing

Hide in Thicket: Generating Imperceptible and Rational Adversarial Perturbations on 3D Point Clouds

Boosting Adversarial Transferability by Block Shuffle and Rotation

Robust Overfitting Does Matter: Test-Time Adversarial Purification With FGSM

Data Poisoning based Backdoor Attacks to Contrastive Learning

NAPGuard: Towards Detecting Naturalistic Adversarial Patches

Ensemble Diversity Facilitates Adversarial Transferability

Revamping Federated Learning Security from a Defender's Perspective: A Unified Defense with Homomorphic Encrypted Data Space

Can Protective Perturbation Safeguard Personal Data from Being Exploited by Stable Diffusion?

One Prompt Word is Enough to Boost Adversarial Robustness for Pre-trained Vision-Language Models

Watermark-embedded Adversarial Examples for Copyright Protection against Diffusion Models

Not All Prompts Are Secure: A Switchable Backdoor Attack Against Pre-trained Vision Transformers

Focus on Hiders: Exploring Hidden Threats for Enhancing Adversarial Training

Physical 3D Adversarial Attacks against Monocular Depth Estimation in Autonomous Driving

Distraction is All You Need: Memory-Efficient Image Immunization against Diffusion-Based Image Editing

PAD: Patch-Agnostic Defense against Adversarial Patch Attacks

PeerAiD: Improving Adversarial Distillation from a Specialized Peer Tutor

Revisiting Adversarial Training Under Long-Tailed Distributions

Pre-trained Model Guided Fine-Tuning for Zero-Shot Adversarial Robustness

Towards Transferable Targeted 3D Adversarial Attack in the Physical World

Nearest is Not Dearest: Towards Practical Defense against Quantization-conditioned Backdoor Attacks

Perturbing Attention Gives You More Bang for the Buck: Subtle Imaging Perturbations That Efficiently Fool Customized Diffusion Models

Boosting Adversarial Training via Fisher-Rao Norm-based Regularization

Random Entangled Tokens for Adversarially Robust Vision Transformer

Backdoor Defense via Test-Time Detecting and Repairing

1-Lipschitz Layers Compared: Memory, Speed, and Certifiable Robustness

DiffAM: Diffusion-based Adversarial Makeup Transfer for Facial Privacy Protection

DAP: A Dynamic Adversarial Patch for Evading Person Detectors

Adversarial Distillation Based on Slack Matching and Attribution Region Alignment

Improving Transferable Targeted Adversarial Attacks with Model Self-Enhancement

On the Robustness of Large Multimodal Models Against Image Adversarial Attacks

Intriguing Properties of Diffusion Models: An Empirical Study of the Natural Attack Capability in Text-to-Image Generative Models

BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive Learning

MMCert: Provable Defense against Adversarial Attacks to Multi-modal Models

MimicDiffusion: Purifying Adversarial Perturbation via Mimicking Clean Diffusion Model

Revisiting Adversarial Training at Scale

Language-Driven Anchors for Zero-Shot Adversarial Robustness

Transferable Structural Sparse Adversarial Attack Via Exact Group Sparsity Training

Fooling Polarization-Based Vision using Locally Controllable Polarizing Projection

Overload: Latency Attacks on Object Detection for Edge Devices

Attack To Defend: Exploiting Adversarial Attacks for Detecting Poisoned Models

Towards Understanding and Improving Adversarial Robustness of Vision Transformers

Towards Fairness-Aware Adversarial Learning

Byzantine-robust Decentralized Federated Learning via Dual-domain Clustering and Trust Bootstrapping

Towards General Robustness Verification of MaxPool-based Convolutional Neural Networks via Tightening Linear Approximation

Soften to Defend: Towards Adversarial Robustness via Self-Guided Label Refinement

SlowFormer: Adversarial Attack on Compute and Energy Consumption of Efficient Vision Transformers

LOTUS: Evasive and Resilient Backdoor Attacks through Sub-Partitioning

Deep-TROJ: An Inference Stage Trojan Insertion Algorithm through Efficient Weight Replacement Attack

Semantic Shield: Defending Vision-Language Models Against Backdooring and Poisoning via Fine-grained Knowledge Alignment

Initialization Matters for Adversarial Transfer Learning

Strong Transferable Adversarial Attacks via Ensembled Asymptotically Normal Distribution Learning

HDRFlow: Real-Time HDR Video Reconstruction with Large Motions

A Physics-informed Low-rank Deep Neural Network for Blind and Universal Lens Aberration Correction

Super-Resolution Reconstruction from Bayer-Pattern Spike Streams

In2SET: Intra-Inter Similarity Exploiting Transformer for Dual-Camera Compressive Hyperspectral Imaging

SuperSVG: Superpixel-based Scalable Vector Graphics Synthesis

Language-driven All-in-one Adverse Weather Removal

LDP: Language-driven Dual-Pixel Image Defocus Deblurring Network

Language-guided Image Reflection Separation

Time-Efficient Light-Field Acquisition Using Coded Aperture and Events

NB-GTR: Narrow-Band Guided Turbulence Removal

Complementing Event Streams and RGB Frames for Hand Mesh Reconstruction

Boosting Spike Camera Image Reconstruction from a Perspective of Dealing with Spike Fluctuations

Frequency-aware Event-based Video Deblurring for Real-World Motion Blur

Latency Correction for Event-guided Deblurring and Frame Interpolation

Learning to Remove Wrinkled Transparent Film with Polarized Prior

Dispersed Structured Light for Hyperspectral 3D Imaging

Generalized Event Cameras

Intensity-Robust Autofocus for Spike Camera

Selective Nonlinearities Removal from Digital Signals

Close Imitation of Expert Retouching for Black-and-White Photography

Spike-guided Motion Deblurring with Unknown Modal Spatiotemporal Alignment

Coherence As Texture – Passive Textureless 3D Reconstruction by Self-interference

TurboSL: Dense, Accurate and Fast 3D by Neural Inverse Structured Light

SPIDeRS: Structured Polarization for Invisible Depth and Reflectance Sensing

CPP-Net: Embracing Multi-Scale Feature Fusion into Deep Unfolding CP-PPA Network for Compressive Sensing

SwitchLight: Co-design of Physics-driven Architecture and Pre-training Framework for Human Portrait Relighting

Diffeomorphic Template Registration for Atmospheric Turbulence Mitigation

Towards HDR and HFR Video from Rolling-Mixed-Bit Spikings

Progressive Divide-and-Conquer via Subsampling Decomposition for Accelerated MRI

Generative Quanta Color Imaging

UFC-Net: Unrolling Fixed-point Continuous Network for Deep Compressive Sensing

Batch Normalization Alleviates the Spectral Bias in Coordinate Networks

EVS-assisted Joint Deblurring, Rolling-Shutter Correction and Video Frame Interpolation through Sensor Inverse Modeling

Unsupervised Deep Unrolling Networks for Phase Unwrapping

LAN: Learning to Adapt Noise for Image Denoising

Snapshot Lidar: Fourier Embedding of Amplitude and Phase for Single-Image Depth Reconstruction

FC-GNN: Recovering Reliable and Accurate Correspondences from Interferences

Projecting Trackable Thermal Patterns for Dynamic Computer Vision

PixelRNN: In-pixel Recurrent Neural Networks for End-to-end–optimized Perception with Neural Sensors

Image Restoration by Denoising Diffusion Models with Iteratively Preconditioned Guidance

DART: Implicit Doppler Tomography for Radar Novel View Synthesis

Equivariant Plug-and-Play Image Reconstruction

CodedEvents: Optimal Point-Spread-Function Engineering for 3D-Tracking with Event Cameras

WaveMo: Learning Wavefront Modulations to See Through Scattering

Turb-Seg-Res: A Segment-then-Restore Pipeline for Dynamic Videos with Atmospheric Turbulence

DiffSCI: Zero-Shot Snapshot Compressive Imaging via Iterative Spectral Diffusion Model

Resolution Limit of Single-Photon LiDAR

QN-Mixer: A Quasi-Newton MLP-Mixer Model for Sparse-View CT Reconstruction

Dual-Scale Transformer for Large-Scale Single-Pixel Imaging

Rolling Shutter Correction with Intermediate Distortion Flow Estimation

Passive Snapshot Coded Aperture Dual-Pixel RGB-D Imaging

Single View Refractive Index Tomography with Neural Fields

SPECAT: SPatial-spEctral Cumulative-Attention Transformer for High-Resolution Hyperspectral Image Reconstruction

Task-Driven Wavelets using Constrained Empirical Risk Minimization

Describing Differences in Image Sets with Natural Language

Alchemist: Parametric Control of Material Properties with Diffusion Models

Generative Image Dynamics

Visual Anagrams: Generating Multi-View Optical Illusions with Diffusion Models

NoiseCLR: A Contrastive Learning Approach for Unsupervised Discovery of Interpretable Directions in Diffusion Models

Analyzing and Improving the Training Dynamics of Diffusion Models

Fourier Priors-Guided Diffusion for Zero-Shot Joint Low-Light Enhancement and Deblurring

Color Shift Estimation-and-Correction for Image Enhancement

Video Super-Resolution Transformer with Masked Inter&Intra-Frame Attention

Distilling Semantic Priors from SAM to Efficient Image Restoration Models

Beyond Average: Individualized Visual Scanpath Prediction

Multimodal Prompt Perceiver: Empower Adaptiveness, Generalizability and Fidelity for All-in-One Image Restoration

Selective Hourglass Mapping for Universal Image Restoration Based on Diffusion Model

SeeSR: Towards Semantics-Aware Real-World Image Super-Resolution

Revisiting Single Image Reflection Removal In the Wild

ODCR: Orthogonal Decoupling Contrastive Regularization for Unpaired Image Dehazing

Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models

Enhancing Quality of Compressed Images by Mitigating Enhancement Bias Towards Compression Domain

Attentive Illumination Decomposition Model for Multi-Illuminant White Balancing

NightCC: Nighttime Color Constancy via Adaptive Channel Masking

Navigating Beyond Dropout: An Intriguing Solution towards Generalizable Image Super Resolution

Learning Inclusion Matching for Animation Paint Bucket Colorization

Defense Against Adversarial Attacks on No-Reference Image Quality Models with Gradient Norm Regularization

Towards Backward-Compatible Continual Learning of Image Compression

APISR: Anime Production Inspired Real-World Anime Super-Resolution

Unifying Automatic and Interactive Matting with Pretrained ViTs

Motion-adaptive Separable Collaborative Filters for Blind Motion Deblurring

Genuine Knowledge from Practice: Diffusion Test-Time Adaptation for Video Adverse Weather Removal

HomoFormer: Homogenized Transformer for Image Shadow Removal

Bidirectional Multi-Scale Implicit Neural Representations for Image Deraining

LED: A Large-scale Real-world Paired Dataset for Event Camera Denoising

Seeing Motion at Nighttime with an Event Camera

Leveraging Frame Affinity for sRGB-to-RAW Video De-rendering

Scaling Up to Excellence: Practicing Model Scaling for Photo-Realistic Image Restoration In the Wild

AdaRevD: Adaptive Patch Exiting Reversible Decoder Pushes the Limit of Image Deblurring

Unsupervised Blind Image Deblurring Based on Self-Enhancement

TTA-EVF: Test-Time Adaptation for Event-based Video Frame Interpolation via Reliable Pixel and Sample Estimation

Learning Coupled Dictionaries from Unpaired Data for Image Super-Resolution

Empowering Resampling Operation for Ultra-High-Definition Image Enhancement with Model-Aware Guidance

Generating Content for HDR Deghosting from Frequency View

Dual Prior Unfolding for Snapshot Compressive Imaging

Binarized Low-light Raw Video Enhancement

Neural Spline Fields for Burst Image Fusion and Layer Separation

Learning Degradation-Independent Representations for Camera ISP Pipelines

SeD: Semantic-Aware Discriminator for Image Super-Resolution

SinSR: Diffusion-Based Image Super-Resolution in a Single Step

Self-Adaptive Reality-Guided Diffusion for Artifact-Free Super-Resolution

Improving Spectral Snapshot Reconstruction with Spectral-Spatial Rectification

Diffusion-based Blind Text Image Super-Resolution

CAMixerSR: Only Details Need More "Attention"

ID-Blau: Image Deblurring by Implicit Diffusion-based reBLurring AUgmentation

Low-Res Leads the Way: Improving Generalization for Super-Resolution by Self-Supervised Learning

CoSeR: Bridging Image and Language for Cognitive Super-Resolution

Real-World Efficient Blind Motion Deblurring via Blur Pixel Discretization

SeNM-VAE: Semi-Supervised Noise Modeling with Hierarchical Variational Autoencoder

Text-guided Explorable Image Super-resolution

Equivariant Multi-Modality Image Fusion

Revisiting Spatial-Frequency Information Integration from a Hierarchical Perspective for Panchromatic and Multi-Spectral Image Fusion

MCNet: Rethinking the Core Ingredients for Accurate and Efficient Homography Estimation

Contrastive Pre-Training with Multi-View Fusion for No-Reference Point Cloud Quality Assessment

MuGE: Multiple Granularity Edge Detection

KVQ: Kwai Video Quality Assessment for Short-form Videos

Transfer CLIP for Generalizable Image Denoising

Improved Implicit Neural Representation with Fourier Reparameterized Training

Deep Video Inverse Tone Mapping Based on Temporal Clues

Boosting Flow-based Generative Super-Resolution Models via Learned Prior

Look-Up Table Compression for Efficient Image Restoration

Latent Modulated Function for Computational Optimal Continuous Image Representation

Task-Aware Encoder Control for Deep Video Compression

A Dynamic Kernel Prior Model for Unsupervised Blind Image Super-Resolution

Zero-Reference Low-Light Enhancement via Physical Quadruple Priors

ParamISP: Learned Forward and Inverse ISPs using Camera Parameters

FSC: Few-point Shape Completion

Generative Latent Coding for Ultra-Low Bitrate Image Compression

Neural Video Compression with Feature Modulation

Driving-Video Dehazing with Non-Aligned Regularization for Safety Assistance

Image Processing GNN: Breaking Rigidity in Super-Resolution

CFAT: Unleashing Triangular Windows for Image Super-resolution

Zero-Shot Structure-Preserving Diffusion Model for High Dynamic Range Tone Mapping

Calibrating Multi-modal Representations: A Pursuit of Group Robustness without Annotations

Learn from View Correlation: An Anchor Enhancement Strategy for Multi-view Clustering

Circuit Design and Efficient Simulation of Quantum Inner Product and Empirical Studies of Its Effect on Near-Term Hybrid Quantum-Classic Machine Learning

Discriminability-Driven Channel Selection for Out-of-Distribution Detection

Efficient Hyperparameter Optimization with Adaptive Fidelity Identification

Probabilistic Sampling of Balanced K-Means using Adiabatic Quantum Computing

Online Task-Free Continual Generative and Discriminative Learning via Dynamic Cluster Memory

S²MVTC: a Simple yet Efficient Scalable Multi-View Tensor Clustering

Spanning Training Progress: Temporal Dual-Depth Scoring (TDDS) for Enhanced Dataset Pruning

An Aggregation-Free Federated Learning for Tackling Data Heterogeneity

POCE: Primal Policy Optimization with Conservative Estimation for Multi-constraint Offline Reinforcement Learning

SVDinsTN: A Tensor Network Paradigm for Efficient Structure Search from Regularized Modeling Perspective

Fine-Grained Bipartite Concept Factorization for Clustering

Embodied Multi-Modal Agent trained by an LLM from a Parallel TextWorld

The Mirrored Influence Hypothesis: Efficient Data Influence Estimation by Harnessing Forward Passes

Improved Baselines with Visual Instruction Tuning

Linguistic-Aware Patch Slimming Framework for Fine-grained Cross-Modal Alignment

FlowVQTalker: High-Quality Emotional Talking Face Generation through Normalizing Flow and Quantization

Audio-Visual Segmentation via Unlabeled Frame Exploitation

Binding Touch to Everything: Learning Unified Multimodal Tactile Representations

MoDE: CLIP Data Experts via Clustering

X-MIC: Cross-Modal Instance Conditioning for Egocentric Action Generalization

PixelLM: Pixel Reasoning with Large Multimodal Model

Probing Synergistic High-Order Interaction in Infrared and Visible Image Fusion

The Audio-Visual Conversational Graph: From an Egocentric-Exocentric Perspective

MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World

GPT4Point: A Unified Framework for Point-Language Understanding and Generation

LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning

Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action

SHAP-EDITOR: Instruction-Guided Latent 3D Editing in Seconds

Learning to Visually Localize Sound Sources from Mixtures without Prior Source Knowledge

Bring Event into RGB and LiDAR: Hierarchical Visual-Motion Fusion for Scene Flow

Dispel Darkness for Better Fusion: A Controllable Visual Enhancer based on Cross-modal Conditional Adversarial Learning

Unraveling Instance Associations: A Closer Look for Audio-Visual Segmentation

DMR: Decomposed Multi-Modality Representations for Frames and Events Fusion in Visual Reinforcement Learning

Text-Guided Variational Image Generation for Industrial Anomaly Detection and Segmentation

Tactile-Augmented Radiance Fields

LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge

SDSTrack: Self-Distillation Symmetric Adapter Learning for Multi-Modal Visual Object Tracking

Exploring the Transferability of Visual Prompting for Multimodal Large Language Models

Mask Grounding for Referring Image Segmentation

OneLLM: One Framework to Align All Modalities with Language

EmoVIT: Revolutionizing Emotion Insights with Visual Instruction Tuning

ModaVerse: Efficiently Transforming Modalities with LLMs

PromptKD: Unsupervised Prompt Distillation for Vision-Language Models

Dynamic Prompt Optimizing for Text-to-Image Generation

Domain Prompt Learning with Quaternion Networks

ViT-Lens: Towards Omni-modal Representations

Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation

Cyclic Learning for Binaural Audio Generation and Localization

Learning to Rematch Mismatched Pairs for Robust Cross-Modal Retrieval

VILA: On Pre-training for Visual Language Models

A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions

How to Configure Good In-Context Sequence for Visual Question Answering

CrossMAE: Cross-Modality Masked Autoencoders for Region-Aware Audio-Visual Pre-Training

Modality-Collaborative Test-Time Adaptation for Action Recognition

T-VSL: Text-Guided Visual Sound Source Localization in Mixtures

UniBind: LLM-Augmented Unified and Balanced Representation Space to Bind Them All

Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models

Rethinking Multi-view Representation Learning via Distilled Disentangling

Causal Mode Multiplexer: A Novel Framework for Unbiased Multispectral Pedestrian Detection

Image-Text Co-Decomposition for Text-Supervised Semantic Segmentation

Mirasol3B: A Multimodal Autoregressive Model for Time-Aligned and Contextual Modalities

Efficient Vision-Language Pre-training by Cluster Masking

MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models

Weakly Misalignment-free Adaptive Feature Alignment for UAVs-based Multimodal Object Detection

DiVAS: Video and Audio Synchronization with Dynamic Frame Rates

Querying as Prompt: Parameter-Efficient Learning for Multimodal Language Model

SonicVisionLM: Playing Sound with Vision Language Models

Embracing Unimodal Aleatoric Uncertainty for Robust Multimodal Fusion

C3Net: Compound Conditioned ControlNet for Multimodal Content Generation

Composed Video Retrieval via Enriched Context and Discriminative Embeddings

Looking Similar, Sounding Different: Leveraging Counterfactual Cross-Modal Pairs for Audiovisual Representation Learning

Anchor-based Robust Finetuning of Vision-Language Models

Event-based Visible and Infrared Fusion via Multi-task Collaboration

Prompt Learning via Meta-Regularization

Knowledge-Enhanced Dual-stream Zero-shot Composed Image Retrieval

Contextual Augmented Global Contrast for Multimodal Intent Recognition

MRFS: Mutually Reinforcing Image Fusion and Segmentation

POPDG: Popular 3D Dance Generation with PopDanceSet

How to Make Cross Encoder a Good Teacher for Efficient Image-Text Retrieval?

Active Prompt Learning in Vision Language Models

Descriptor and Word Soups: Overcoming the Parameter Efficiency Accuracy Tradeoff for Out-of-Distribution Few-shot Learning

Text-IF: Leveraging Semantic Text Guidance for Degradation-Aware and Interactive Image Fusion

Hallucination Augmented Contrastive Learning for Multimodal Large Language Model

Beyond Text: Frozen Large Language Models in Visual Signal Comprehension

Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos

ES³: Evolving Self-Supervised Learning of Robust Audio-Visual Speech Representations

PortraitBooth: A Versatile Portrait Model for Fast Identity-preserved Personalization

ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding

AVFF: Audio-Visual Feature Fusion for Video Deepfake Detection

Language-aware Visual Semantic Distillation for Video Question Answering

PerceptionGPT: Effectively Fusing Visual Perception into LLM

Cooperation Does Matter: Exploring Multi-Order Bilateral Relations for Audio-Visual Segmentation

MV-Adapter: Multimodal Video Transfer Learning for Video Text Retrieval

Draw Step by Step: Reconstructing CAD Construction Sequences from Point Clouds via Multimodal Diffusion

AV-RIR: Audio-Visual Room Impulse Response Estimation

Link-Context Learning for Multimodal LLMs

Unveiling the Power of Audio-Visual Early Fusion Transformers with Dense Interactions through Masked Modeling

Noisy-Correspondence Learning for Text-to-Image Person Re-identification

Mind Artist: Creating Artistic Snapshots with Human Thought

VTQA: Visual Text Question Answering via Entity Alignment and Cross-Media Reasoning

THRONE: An Object-based Hallucination Benchmark for the Free-form Generations of Large Vision-Language Models

Data-Efficient Multimodal Fusion on a Single GPU

SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos

Accept the Modality Gap: An Exploration in the Hyperbolic Space

DiffSal: Joint Audio and Video Learning for Diffusion Saliency Prediction

DiPrompT: Disentangled Prompt Tuning for Multiple Latent Domain Generalization in Federated Learning

Probabilistic Speech-Driven 3D Facial Motion Synthesis: New Benchmarks, Methods, and Applications

DIEM: Decomposition-Integration Enhancing Multimodal Insights

MAFA: Managing False Negatives for Vision-Language Pre-training

AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation

Enhancing Multimodal Cooperation via Sample-level Modality Valuation

Diff-BGM: A Diffusion Model for Video Background Music Generation

SaCo Loss: Sample-wise Affinity Consistency for Vision-Language Pre-training

MoPE-CLIP: Structured Pruning for Efficient Vision-Language Models with Module-wise Pruning Error Metric

Mitigating Noisy Correspondence by Geometrical Structure Consistency Learning

DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection

Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification

OmniVec2 - A Novel Transformer based Network for Large Scale Multimodal and Multitask Learning

CoDi-2: In-Context Interleaved and Interactive Any-to-Any Generation

Differentiable Information Bottleneck for Deterministic Multi-view Clustering

A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition

Multimodal Representation Learning by Alternating Unimodal Adaptation

View-Category Interactive Sharing Transformer for Incomplete Multi-View Multi-Label Learning

Scalable 3D Registration via Truncated Entry-wise Absolute Residuals

Partial-to-Partial Shape Matching with Geometric Consistency

Towards Robust Learning to Optimize with Theoretical Guarantees

From Variance to Veracity: Unbundling and Mitigating Gradient Variance in Differentiable Bundle Adjustment Layers

DIMAT: Decentralized Iterative Merging-And-Training for Deep Learning Models

Ink Dot-Oriented Differentiable Optimization for Neural Image Halftoning

Are Conventional SNNs Really Efficient? A Perspective from Network Quantization

FedMef: Towards Memory-efficient Federated Dynamic Pruning

SD4Match: Learning to Prompt Stable Diffusion Model for Semantic Matching

Purified and Unified Steganographic Network

Learned Lossless Image Compression based on Bit Plane Slicing

Towards Calibrated Multi-label Deep Neural Networks

Improving Generalization via Meta-Learning on Hard Samples

Learning with Structural Labels for Learning with Noisy Labels

DiffuseMix: Label-Preserving Data Augmentation with Diffusion Models

Improving Out-of-Distribution Generalization in Graphs via Hierarchical Semantic Environments

Patch2Self2: Self-supervised Denoising on Coresets via Matrix Sketching

G-FARS: Gradient-Field-based Auto-Regressive Sampling for 3D Part Grouping

Decompose-and-Compose: A Compositional Approach to Mitigating Spurious Correlation

SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery

Building Bridges across Spatial and Temporal Resolutions: Reference-Based Super-Resolution via Change Priors and Conditional Diffusion Model

SatSynth: Augmenting Image-Mask Pairs through Diffusion Models for Aerial Semantic Segmentation

S2MAE: A Spatial-Spectral Pretraining Foundation Model for Spectral Remote Sensing Data

Poly Kernel Inception Network for Remote Sensing Detection

Learning without Exact Guidance: Updating Large-scale High-resolution Land Cover Maps from Low-resolution Historical Labels

3D Building Reconstruction from Monocular Remote Sensing Images with Multi-level Supervisions

Content-Adaptive Non-Local Convolution for Remote Sensing Pansharpening

SG-BEV: Satellite-Guided BEV Fusion for Cross-View Semantic Segmentation

DiffCast: A Unified Framework via Residual Diffusion for Precipitation Nowcasting

MoCha-Stereo: Motif Channel Attention Network for Stereo Matching

PBWR: Parametric-Building-Wireframe Reconstruction from Aerial LiDAR Point Clouds

Multi-modal Learning for Geospatial Vegetation Forecasting

Relational Matching for Weakly Semi-Supervised Oriented Object Detection

Rethinking Transformers Pre-training for Multi-Spectral Satellite Imagery

Unmixing Diffusion for Self-Supervised Hyperspectral Image Denoising

GeoChat: Grounded Large Vision-Language Model for Remote Sensing

Parameter Efficient Self-Supervised Geospatial Domain Adaptation

Bridging Remote Sensors with Multisensor Geospatial Foundation Models

CLIP-Driven Open-Vocabulary 3D Scene Graph Generation via Cross-Modality Contrastive Learning

Learnable Earth Parser: Discovering 3D Prototypes in Aerial Scans

Semantics Distortion and Style Matter: Towards Source-free UDA for Panoramic Segmentation

Geometrically-driven Aggregation for Zero-shot 3D Point Cloud Understanding

SAM-6D: Segment Anything Model Meets Zero-Shot 6D Object Pose Estimation

Construct to Associate: Cooperative Context Learning for Domain Adaptive Point Cloud Segmentation

Multi-Task Dense Prediction via Mixture of Low-Rank Experts

OED: Towards One-stage End-to-End Dynamic Scene Graph Generation

OMG-Seg: Is One Model Good Enough For All Segmentation?

DiffusionMTL: Learning Multi-Task Denoising Diffusion Model from Partially Annotated Data

Bilateral Adaptation for Human-Object Interaction Detection with Occlusion-Robustness

CurveCloudNet: Processing Point Clouds with 1D Structure

VCoder: Versatile Vision Encoders for Multimodal Large Language Models

Amodal Ground Truth and Completion in the Wild

Living Scenes: Multi-object Relocalization and Reconstruction in Changing 3D Environments

Single Domain Generalization for Crowd Counting

LTA-PCS: Learnable Task-Agnostic Point Cloud Sampling

Prompt3D: Random Prompt Assisted Weakly-Supervised 3D Object Detection

No More Ambiguity in 360° Room Layout via Bi-Layout Estimation

Semantic Line Combination Detector

From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models

PanoContext-Former: Panoramic Total Scene Understanding with a Transformer

DiffAssemble: A Unified Graph-Diffusion Model for 2D and 3D Reassembly

ProMotion: Prototypes As Motion Learners

HUNTER: Unsupervised Human-centric 3D Detection via Transferring Knowledge from Synthetic Instances to Real Scenes

Rethinking the Up-Sampling Operations in CNN-based Generative Network for Generalizable Deepfake Detection

Shadows Don't Lie and Lines Can't Bend! Generative Models don't know Projective Geometry...for now

Text Grouping Adapter: Adapting Pre-trained Text Detector for Layout Analysis

Groupwise Query Specialization and Quality-Aware Multi-Assignment for Transformer-based Visual Relationship Detection

CoralSCOP: Segment any COral Image on this Planet

Going Beyond Multi-Task Dense Prediction with Synergy Embedding Models

Disentangled Pre-training for Human-Object Interaction Detection

Osprey: Pixel Understanding with Visual Instruction Tuning

Discovering Syntactic Interaction Clues for Human-Object Interaction Detection

Flattening the Parent Bias: Hierarchical Semantic Segmentation in the Poincaré Ball

HiKER-SGG: Hierarchical Knowledge Enhanced Robust Scene Graph Generation

Hierarchical Intra-modal Correlation Learning for Label-free 3D Semantic Segmentation

FreePoint: Unsupervised Point Cloud Instance Segmentation

GoodSAM: Bridging Domain and Capacity Gaps via Segment Anything Model for Distortion-aware Panoramic Semantic Segmentation

MaskClustering: View Consensus based Mask Graph Clustering for Open-Vocabulary 3D Instance Segmentation

ECoDepth: Effective Conditioning of Diffusion Models for Monocular Depth Estimation

Physical Property Understanding from Language-Embedded Feature Fields

LLM4SGG: Large Language Models for Weakly Supervised Scene Graph Generation

DSGG: Dense Relation Transformer for an End-to-end Scene Graph Generation

OTE: Exploring Accurate Scene Text Recognition Using One Token

SemCity: Semantic Scene Generation with Triplane Diffusion

Advancing Saliency Ranking with Human Fixations: Dataset, Models and Benchmarks

Choose What You Need: Disentangled Representation Learning for Scene Text Recognition, Removal and Editing

Leveraging Predicate and Triplet Learning for Scene Graph Generation

Regressor-Segmenter Mutual Prompt Learning for Crowd Counting

Learning from Observer Gaze: Zero-Shot Attention Prediction Oriented by Human-Object Interaction Recognition

EGTR: Extracting Graph from Transformer for Scene Graph Generation

SG-PGM: Partial Graph Matching Network with Semantic Geometric Fusion for 3D Scene Graph Alignment and Its Downstream Tasks

Open-Vocabulary Semantic Segmentation with Image Embedding Balancing

Bridging the Synthetic-to-Authentic Gap: Distortion-Guided Unsupervised Domain Adaptation for Blind Image Quality Assessment

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Robust Distillation via Untargeted and Targeted Intermediate Adversarial Samples

Class Incremental Learning with Multi-Teacher Distillation

Large Language Models are Good Prompt Learners for Low-Shot Image Classification

Consistent Prompting for Rehearsal-Free Continual Learning

Tuning Stable Rank Shrinkage: Aiming at the Overlooked Structural Risk in Fine-tuning

Coherent Temporal Synthesis for Incremental Action Segmentation

FCS: Feature Calibration and Separation for Non-Exemplar Class Incremental Learning

DeIL: Direct-and-Inverse CLIP for Open-World Few-Shot Learning

Understanding and Improving Source-free Domain Adaptation from a Theoretical Perspective

Resurrecting Old Classes with New Data for Exemplar-Free Continual Learning

Adversarially Robust Few-shot Learning via Parameter Co-distillation of Similarity and Class Concept Learners

Learning CNN on ViT: A Hybrid Model to Explicitly Class-specific Boundaries for Domain Adaptation

Efficient Stitchable Task Adaptation

Gradient-based Parameter Selection for Efficient Fine-Tuning

ArGue: Attribute-Guided Prompt Tuning for Vision-Language Models

Simple Semantic-Aided Few-Shot Learning

Long-Tail Class Incremental Learning via Independent Sub-prototype Construction

Few-Shot Object Detection with Foundation Models

Stronger, Fewer, & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation

Continual Forgetting for Pre-trained Vision Models

AETTA: Label-Free Accuracy Estimation for Test-Time Adaptation

Continual-MAE: Adaptive Distribution Masked Autoencoders for Continual Test-Time Adaptation

LEAD: Exploring Logit Space Evolution for Model Selection

Instance-based Max-margin for Practical Few-shot Recognition

Domain Gap Embeddings for Generative Dataset Augmentation

JoAPR: Cleaning the Lens of Prompt Learning for Vision-Language Models

Generative Multi-modal Models are Good Class Incremental Learners

Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models

UniPT: Universal Parallel Tuning for Transfer Learning with Efficient Parameter and Memory

Federated Generalized Category Discovery

Learning from One Continuous Video Stream

OrCo: Towards Better Generalization via Orthogonality and Contrast for Few-Shot Class-Incremental Learning

SDDGR: Stable Diffusion-based Deep Generative Replay for Class Incremental Object Detection

Active Domain Adaptation with False Negative Prediction for Object Detection

Stationary Representations: Optimally Approximating Compatibility and Implications for Improved Model Replacements

Your Transferability Barrier is Fragile: Free-Lunch for Transferring the Non-Transferable Learning

Transductive Zero-Shot and Few-Shot CLIP

Task2Box: Box Embeddings for Modeling Asymmetric Task Relationships

Unbiased Faster R-CNN for Single-source Domain Generalized Object Detection

MetaCloak: Preventing Unauthorized Subject-driven Text-to-image Diffusion-based Synthesis via Meta-learning

(ends 6:30 PM)