CVPR 2024 Events with Videos
Art Programs
Expo Track Keynotes
Keynotes
Posters
- MMM: Generative Masked Motion Model
- On the Robustness of Language Guidance for Low-Level Vision Tasks: Findings from Depth Estimation
- Forecasting of 3D Whole-body Human Poses with Grasping Objects
- BodyMAP - Jointly Predicting Body Mesh and 3D Applied Pressure Map for People in Bed
- Unsupervised Semantic Segmentation Through Depth-Guided Feature Correlation and Sampling
- Real-Time Neural BRDF with Spherically Distributed Primitives
- QUADify: Extracting Meshes with Pixel-level Details and Materials from Images
- KTPFormer: Kinematics and Trajectory Prior Knowledge-Enhanced Transformer for 3D Human Pose Estimation
- Makeup Prior Models for 3D Facial Makeup Estimation and Applications
- XFeat: Accelerated Features for Lightweight Image Matching
- GSVA: Generalized Segmentation via Multimodal Large Language Models
- AdaBM: On-the-Fly Adaptive Bit Mapping for Image Super-Resolution
- FlashAvatar: High-fidelity Head Avatar with Efficient Gaussian Embedding
- MultiPhys: Multi-Person Physics-aware 3D Motion Estimation
- Activity-Biometrics: Person Identification from Daily Activities
- Eclipse: Disambiguating Illumination and Materials using Unintended Shadows
- Diffuse Attend and Segment: Unsupervised Zero-Shot Segmentation using Stable Diffusion
- Infer from What You Have Seen Before: Temporally-dependent Classifier for Semi-supervised Video Segmentation
- GoMAvatar: Efficient Animatable Human Modeling from Monocular Video Using Gaussians-on-Mesh
- MS-MANO: Enabling Hand Pose Tracking with Biomechanical Constraints
- FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio
- Open-World Semantic Segmentation Including Class Similarity
- OMG: Towards Open-vocabulary Motion Generation via Mixture of Controllers
- HAVE-FUN: Human Avatar Reconstruction from Few-Shot Unconstrained Images
- Training-Free Open-Vocabulary Segmentation with Offline Diffusion-Augmented Prototype Generation
- Laplacian-guided Entropy Model in Neural Codec with Blur-dissipated Synthesis
- Neural Sign Actors: A Diffusion Model for 3D Sign Language Production from Text
- OVFoodSeg: Elevating Open-Vocabulary Food Image Segmentation via Image-Informed Textual Representation
- HHMR: Holistic Hand Mesh Recovery by Enhancing the Multimodal Controllability of Graph Diffusion Models
- DreamPropeller: Supercharge Text-to-3D Generation with Parallel Sampling
- OAKINK2: A Dataset of Bimanual Hands-Object Manipulation in Complex Task Completion
- TexOct: Generating Textures of 3D Models with Octree-based Diffusion
- OHTA: One-shot Hand Avatar via Data-driven Implicit Priors
- CPGA: Coding Priors-Guided Aggregation Network for Compressed Video Quality Enhancement
- Text-Conditioned Generative Model of 3D Strand-based Human Hairstyles
- FINER: Flexible Spectral-bias Tuning in Implicit NEural Representation by Variable-periodic Activation Functions
- BA-SAM: Scalable Bias-Mode Attention Mask for Segment Anything Model
- BigGait: Learning Gait Representation You Want by Large Vision Models
- Generating Human Motion in 3D Scenes from Text Descriptions
- Finsler-Laplace-Beltrami Operators with Application to Shape Analysis
- 4D-DRESS: A 4D Dataset of Real-World Human Clothing With Semantic Annotations
- HOIST-Former: Hand-held Objects Identification Segmentation and Tracking in the Wild
- 3D Human Pose Perception from Egocentric Stereo Videos
- ProxyCap: Real-time Monocular Full-body Capture in World Space via Human-Centric Proxy-to-Motion Learning
- Stratified Avatar Generation from Sparse Observations
- DUDF: Differentiable Unsigned Distance Fields with Hyperbolic Scaling
- MultiPly: Reconstruction of Multiple People from Monocular Video in the Wild
- AllSpark: Reborn Labeled Features from Unlabeled in Transformer for Semi-Supervised Semantic Segmentation
- PhysGaussian: Physics-Integrated 3D Gaussians for Generative Dynamics
- SC-GS: Sparse-Controlled Gaussian Splatting for Editable Dynamic Scenes
- SEAS: ShapE-Aligned Supervision for Person Re-Identification
- HOLD: Category-agnostic 3D Reconstruction of Interacting Hands and Objects from Video
- Cross-Domain Few-Shot Segmentation via Iterative Support-Query Correspondence Mining
- Towards Variable and Coordinated Holistic Co-Speech Motion Generation
- Human Gaussian Splatting: Real-time Rendering of Animatable Avatars
- SfmCAD: Unsupervised CAD Reconstruction by Learning Sketch-based Feature Modeling Operations
- SCE-MAE: Selective Correspondence Enhancement with Masked Autoencoder for Self-Supervised Landmark Estimation
- DiffusionPoser: Real-time Human Motion Reconstruction From Arbitrary Sparse Sensors Using Autoregressive Diffusion
- InterHandGen: Two-Hand Interaction Generation via Cascaded Reverse Diffusion
- Mosaic-SDF for 3D Generative Models
- A Semi-supervised Nighttime Dehazing Baseline with Spatial-Frequency Aware and Realistic Brightness Constraint
- MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model
- Digital Life Project: Autonomous 3D Characters with Social Intelligence
- WANDR: Intention-guided Human Motion Generation
- Molecular Data Programming: Towards Molecule Pseudo-labeling with Systematic Weak Supervision
- Normalizing Flows on the Product Space of SO(3) Manifolds for Probabilistic Human Pose Modeling
- Human Motion Prediction Under Unexpected Perturbation
- Memory-Scalable and Simplified Functional Map Learning
- When StyleGAN Meets Stable Diffusion: a W+ Adapter for Personalized Image Generation
- Lodge: A Coarse to Fine Diffusion Network for Long Dance Generation Guided by the Characteristic Dance Primitives
- AnySkill: Learning Open-Vocabulary Physical Skill for Interactive Agents
- Loose Inertial Poser: Motion Capture with IMU-attached Loose-Wear Jacket
- Diffusion 3D Features (Diff3F): Decorating Untextured Shapes with Distilled Semantic Features
- Score-Guided Diffusion for 3D Human Recovery
- AAMDM: Accelerated Auto-regressive Motion Diffusion Model
- Capturing Closely Interacted Two-Person Motions with Reaction Priors
- VINECS: Video-based Neural Character Skinning
- Semantics-aware Motion Retargeting with Vision-Language Models
- Video-Based Human Pose Regression via Decoupled Space-Time Aggregation
- Test-Time Domain Generalization for Face Anti-Spoofing
- A Unified and Interpretable Emotion Representation and Expression Generation
- Locally Adaptive Neural 3D Morphable Models
- Closely Interactive Human Reconstruction with Proxemics and Physics-Guided Adaption
- Fun with Flags: Robust Principal Directions via Flag Manifolds
- I'M HOI: Inertia-aware Monocular Capture of 3D Human-Object Interactions
- Automatic Controllable Colorization via Imagination
- SplattingAvatar: Realistic Real-Time Human Avatars with Mesh-Embedded Gaussian Splatting
- Differentiable Point-based Inverse Rendering
- Video2Game: Real-time Interactive Realistic and Browser-Compatible Environment from a Single Video
- GAvatar: Animatable 3D Gaussian Avatars with Implicit Mesh Learning
- DSL-FIQA: Assessing Facial Image Quality via Dual-Set Degradation Learning and Landmark-Guided Transformer
- Modular Blind Video Quality Assessment
- Attention-Propagation Network for Egocentric Heatmap to 3D Pose Lifting
- Rethinking Few-shot 3D Point Cloud Semantic Segmentation
- MAS: Multi-view Ancestral Sampling for 3D Motion Generation Using 2D Diffusion
- Quantifying Task Priority for Multi-Task Optimization
- From a Bird's Eye View to See: Joint Camera and Subject Registration without the Camera Calibration
- Exploring Vision Transformers for 3D Human Motion-Language Models with Motion Patches
- HumanNorm: Learning Normal Diffusion Model for High-quality and Realistic 3D Human Generation
- DIOD: Self-Distillation Meets Object Discovery
- PhysPT: Physics-aware Pretrained Transformer for Estimating Human Dynamics from Monocular Videos
- Cross-view and Cross-pose Completion for 3D Human Understanding
- SynSP: Synergy of Smoothness and Precision in Pose Sequences Refinement
- Enhancing Video Super-Resolution via Implicit Resampling-based Alignment
- Flexible Biometrics Recognition: Bridging the Multimodality Gap through Attention Alignment and Prompt Tuning
- Perception-Oriented Video Frame Interpolation via Asymmetric Blending
- Move as You Say Interact as You Can: Language-guided Human Motion Generation with Scene Affordance
- Masked and Shuffled Blind Spot Denoising for Real-World Images
- CorrMatch: Label Propagation via Correlation Matching for Semi-Supervised Semantic Segmentation
- Joint2Human: High-Quality 3D Human Generation via Compact Spherical Embedding of 3D Joints
- DPHMs: Diffusion Parametric Head Models for Depth-based Tracking
- Anatomically Constrained Implicit Face Models
- Adaptive Bidirectional Displacement for Semi-Supervised Medical Image Segmentation
- CLIPtone: Unsupervised Learning for Text-based Image Tone Adjustment
- Self-Calibrating Vicinal Risk Minimisation for Model Calibration
- LeGO: Leveraging a Surface Deformation Network for Animatable Stylized Face Generation with One Example
- Prompt-Driven Referring Image Segmentation with Instance Contrasting
- HashPoint: Accelerated Point Searching and Sampling for Neural Rendering
- MoST: Motion Style Transformer Between Diverse Action Contents
- CADTalk: An Algorithm and Benchmark for Semantic Commenting of CAD Programs
- Unbiased Estimator for Distorted Conics in Camera Calibration
- Towards a Perceptual Evaluation Framework for Lighting Estimation
- General Object Foundation Model for Images and Videos at Scale
- Design2Cloth: 3D Cloth Generation from 2D Masks
- Authentic Hand Avatar from a Phone Scan via Universal Hand Model
- HMD-Poser: On-Device Real-time Human Motion Tracking from Scalable Sparse Observations
- DiffusionLight: Light Probes for Free by Painting a Chrome Ball
- EventEgo3D: 3D Human Motion Capture from Egocentric Event Streams
- GALA: Generating Animatable Layered Assets from a Single Scan
- Frozen CLIP: A Strong Backbone for Weakly Supervised Semantic Segmentation
- GaussianAvatar: Towards Realistic Human Avatar Modeling from a Single Video via Animatable 3D Gaussians
- Neural Implicit Representation for Building Digital Twins of Unknown Articulated Objects
- 3D Paintbrush: Local Stylization of 3D Shapes with Cascaded Score Distillation
- BlockGCN: Redefine Topology Awareness for Skeleton-Based Action Recognition
- Artist-Friendly Relightable and Animatable Neural Heads
- AUEditNet: Dual-Branch Facial Action Unit Intensity Manipulation with Implicit Disentanglement
- Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model
- TexTile: A Differentiable Metric for Texture Tileability
- Guided Slot Attention for Unsupervised Video Object Segmentation
- From Correspondences to Pose: Non-minimal Certifiably Optimal Relative Pose without Disambiguation
- SiTH: Single-view Textured Human Reconstruction with Image-Conditioned Diffusion
- GraCo: Granularity-Controllable Interactive Segmentation
- MoMask: Generative Masked Modeling of 3D Human Motions
- Breathing Life Into Sketches Using Text-to-Video Priors
- Exploiting Style Latent Flows for Generalizing Deepfake Video Detection
- Gaussian Head Avatar: Ultra High-fidelity Head Avatar via Dynamic Gaussians
- Open Vocabulary Semantic Scene Sketch Understanding
- Unifying Top-down and Bottom-up Scanpath Prediction Using Transformers
- OmniMotionGPT: Animal Motion Generation with Limited Data
- Learning to Control Camera Exposure via Reinforcement Learning
- TexVocab: Texture Vocabulary-conditioned Human Avatars
- ZERO-IG: Zero-Shot Illumination-Guided Joint Denoising and Adaptive Enhancement for Low-Light Images
- HumMUSS: Human Motion Understanding using State Space Models
- From Feature to Gaze: A Generalizable Replacement of Linear Layer for Gaze Estimation
- G-HOP: Generative Hand-Object Prior for Interaction Reconstruction and Grasp Synthesis
- Boosting Image Restoration via Priors from Pre-trained Models
- UniVS: Unified and Universal Video Segmentation with Prompts as Queries
- PolarMatte: Fully Computational Ground-Truth-Quality Alpha Matte Extraction for Images and Video using Polarized Screen Matting
- One-Class Face Anti-spoofing via Spoof Cue Map-Guided Feature Learning
- HumanRef: Single Image to 3D Human Generation via Reference-Guided Diffusion
- Blur2Blur: Blur Conversion for Unsupervised Image Deblurring on Unknown Domains
- NIVeL: Neural Implicit Vector Layers for Text-to-Vector Generation
- IntrinsicAvatar: Physically Based Inverse Rendering of Dynamic Humans from Monocular Videos via Explicit Ray Tracing
- Weakly Supervised Point Cloud Semantic Segmentation via Artificial Oracle
- Multi-agent Long-term 3D Human Pose Forecasting via Interaction-aware Trajectory Conditioning
- KeyPoint Relative Position Encoding for Face Recognition
- Rethinking Prior Information Generation with CLIP for Few-Shot Segmentation
- BOTH2Hands: Inferring 3D Hands from Both Text Prompts and Body Dynamics
- FaceLift: Semi-supervised 3D Facial Landmark Localization
- SCULPT: Shape-Conditioned Unpaired Learning of Pose-dependent Clothed and Textured Human Meshes
- Robust Image Denoising through Adversarial Frequency Mixup
- Estimating Extreme 3D Image Rotations using Cascaded Attention
- Collaborating Foundation Models for Domain Generalized Semantic Segmentation
- Dynamic Inertial Poser (DynaIP): Part-Based Motion Dynamics Learning for Enhanced Human Pose Estimation with Sparse Inertial Sensors
- A Dual-Augmentor Framework for Domain Generalization in 3D Human Pose Estimation
- Sharingan: A Transformer Architecture for Multi-Person Gaze Following
- Optimizing Diffusion Noise Can Serve As Universal Motion Priors
- CAM Back Again: Large Kernel CNNs from a Weakly Supervised Object Localization Perspective
- Differentiable Micro-Mesh Construction
- UniHuman: A Unified Model For Editing Human Images in the Wild
- ContextSeg: Sketch Semantic Segmentation by Querying the Context with Attention
- Segment Every Out-of-Distribution Object
- NeRF Analogies: Example-Based Visual Attribute Transfer for NeRFs
- ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis
- SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation
- SVDTree: Semantic Voxel Diffusion for Single Image Tree Reconstruction
- A Call to Reflect on Evaluation Practices for Age Estimation: Comparative Analysis of the State-of-the-Art and a Unified Benchmark
- TokenHMR: Advancing Human Mesh Recovery with a Tokenized Pose Representation
- VideoRF: Rendering Dynamic Radiance Fields as 2D Feature Video Streams
- HanDiffuser: Text-to-Image Generation With Realistic Hand Appearances
- Neural Super-Resolution for Real-time Rendering with Radiance Demodulation
- Towards a Simultaneous and Granular Identity-Expression Control in Personalized Face Generation
- Self-Supervised Dual Contouring
- Segment Any Event Streams via Weighted Adaptation of Pivotal Tokens
- Telling Left from Right: Identifying Geometry-Aware Semantic Correspondence
- Beyond Image Super-Resolution for Image Recognition with Task-Driven Perceptual Loss
- MANUS: Markerless Grasp Capture using Articulated 3D Gaussians
- GPLD3D: Latent Diffusion of 3D Shape Generative Models by Enforcing Geometric and Physical Priors
- Deep Equilibrium Diffusion Restoration with Parallel Sampling
- Emotional Speech-driven 3D Body Animation via Disentangled Latent Diffusion
- Efficient 3D Implicit Head Avatar with Mesh-anchored Hash Table Blendshapes
- Point2CAD: Reverse Engineering CAD Models from 3D Point Clouds
- NRDF: Neural Riemannian Distance Fields for Learning Articulated Pose Priors
- Scaling Up Dynamic Human-Scene Interaction Modeling
- GenesisTex: Adapting Image Denoising Diffusion to Texture Space
- Spectral Meets Spatial: Harmonising 3D Shape Matching and Interpolation
- PartDistill: 3D Shape Part Segmentation by Vision-Language Model Distillation
- MatFuse: Controllable Material Generation with Diffusion Models
- Egocentric Whole-Body Motion Capture with FisheyeViT and Diffusion-Based Motion Refinement
- Garment Recovery with Shape and Deformation Priors
- LPSNet: End-to-End Human Pose and Shape Estimation with Lensless Imaging
- Spherical Mask: Coarse-to-Fine 3D Point Cloud Instance Segmentation with Spherical Representation
- Holo-Relighting: Controllable Volumetric Portrait Relighting from a Single Image
- EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling
- Blur-aware Spatio-temporal Sparse Transformer for Video Deblurring
- Single-to-Dual-View Adaptation for Egocentric 3D Hand Pose Estimation
- UltrAvatar: A Realistic Animatable 3D Avatar Diffusion Model with Authenticity Guided Textures
- Putting the Object Back into Video Object Segmentation
- Unlocking the Potential of Pre-trained Vision Transformers for Few-Shot Semantic Segmentation through Relationship Descriptors
- RAM-Avatar: Real-time Photo-Realistic Avatar from Monocular Videos with Full-body Control
- Misalignment-Robust Frequency Distribution Loss for Image Transformation
- Programmable Motion Generation for Open-Set Motion Control Tasks
- RELI11D: A Comprehensive Multimodal Human Motion Dataset and Method
- Person-in-WiFi 3D: End-to-End Multi-Person 3D Pose Estimation with Wi-Fi
- Can Language Beat Numerical Regression? Language-Based Multimodal Trajectory Prediction
- Open3DIS: Open-Vocabulary 3D Instance Segmentation with 2D Mask Guidance
- Relightable Gaussian Codec Avatars
- Arbitrary Motion Style Transfer with Multi-condition Motion Latent Diffusion Model
- From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations
- SyncTalk: The Devil is in the Synchronization for Talking Head Synthesis
- Depth Information Assisted Collaborative Mutual Promotion Network for Single Image Dehazing
- USE: Universal Segment Embeddings for Open-Vocabulary Image Segmentation
- High-Quality Facial Geometry and Appearance Capture at Home
- HOIAnimator: Generating Text-prompt Human-object Animations using Novel Perceptive Diffusion Models
- Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models
- SD2Event: Self-supervised Learning of Dynamic Detectors and Contextual Descriptors for Event Cameras
- HOI-M^3: Capture Multiple Humans and Objects Interaction within Contextual Environment
- RecDiffusion: Rectangling for Image Stitching with Diffusion Models
- Learn to Rectify the Bias of CLIP for Unsupervised Semantic Segmentation
- PIE-NeRF: Physics-based Interactive Elastodynamics with NeRF
- XCube: Large-Scale 3D Generative Modeling using Sparse Voxel Hierarchies
- No Time to Train: Empowering Non-Parametric Networks for Few-shot 3D Scene Segmentation
- HIT: Estimating Internal Human Implicit Tissues from the Body Surface
- RankED: Addressing Imbalance and Uncertainty in Edge Detection Using Ranking-based Losses
- EAGLE: Eigen Aggregation Learning for Object-Centric Unsupervised Semantic Segmentation
- ChatPose: Chatting about 3D Human Pose
- PEGASUS: Personalized Generative 3D Avatars with Composable Attributes
- Retrieval-Augmented Layout Transformer for Content-Aware Layout Generation
- Relightable and Animatable Neural Avatar from Sparse-View Video
- ECLIPSE: Efficient Continual Learning in Panoptic Segmentation with Visual Prompt Tuning
- Unsupervised Gaze Representation Learning from Multi-view Face Images
- A Unified Framework for Human-centric Point Cloud Video Understanding
- KITRO: Refining Human Mesh by 2D Clues and Kinematic-tree Rotation
- PFStorer: Personalized Face Restoration and Super-Resolution
- MoSAR: Monocular Semi-Supervised Model for Avatar Reconstruction using Differentiable Shading
- Real-Time Exposure Correction via Collaborative Transformations and Adaptive Sampling
- HUGS: Human Gaussian Splats
- 3D Face Tracking from 2D Video through Iterative Dense UV to Image Flow
- Deciphering ‘What’ and ‘Where’ Visual Pathways from Spectral Clustering of Layer-Distributed Neural Representations
- Vanishing-Point-Guided Video Semantic Segmentation of Driving Scenes
- AvatarGPT: All-in-One Framework for Motion Understanding Planning Generation and Beyond
- Degrees of Freedom Matter: Inferring Dynamics from Point Trajectories
- As-Plausible-As-Possible: Plausibility-Aware Mesh Deformation Using 2D Diffusion Priors
- Diff-Plugin: Revitalizing Details for Diffusion-based Low-level Tasks
- Unsupervised Salient Instance Detection
- MeshPose: Unifying DensePose and 3D Body Mesh Reconstruction
- SelfPose3d: Self-Supervised Multi-Person Multi-View 3d Pose Estimation
- DiffHuman: Probabilistic Photorealistic 3D Reconstruction of Humans
- Semantic-aware SAM for Point-Prompted Instance Segmentation
- One-Shot Open Affordance Learning with Foundation Models
- Monocular Identity-Conditioned Facial Reflectance Reconstruction
- RobustSAM: Segment Anything Robustly on Degraded Images
- Confronting Ambiguity in 6D Object Pose Estimation via Score-Based Diffusion on SE(3)
- Self-Supervised Facial Representation Learning with Facial Region Awareness
- Towards Robust Event-guided Low-Light Image Enhancement: A Large-Scale Real-World Event-Image Dataset and Novel Approach
- 3D Facial Expressions through Analysis-by-Neural-Synthesis
- LiveHPS: LiDAR-based Scene-level Human Pose and Shape Estimation in Free Environment
- Rethinking Interactive Image Segmentation with Low Latency High Quality and Diverse Prompts
- ScoreHypo: Probabilistic Human Mesh Estimation with Hypothesis Scoring
- Hierarchical Histogram Threshold Segmentation – Auto-terminating High-detail Oversegmentation
- Image Sculpting: Precise Object Editing with 3D Geometry Control
- FMA-Net: Flow-Guided Dynamic Filtering and Iterative Feature Refinement with Multi-Attention for Joint Video Super-Resolution and Deblurring
- Functional Diffusion
- Unleashing the Potential of SAM for Medical Adaptation via Hierarchical Decoding
- From Activation to Initialization: Scaling Insights for Optimizing Neural Fields
- Objects as Volumes: A Stochastic Geometry View of Opaque Solids
- Residual Denoising Diffusion Models
- LAFS: Landmark-based Facial Self-supervised Learning for Face Recognition
- Hybrid Functional Maps for Crease-Aware Non-Isometric Shape Matching
- Text2HOI: Text-guided 3D Motion Generation for Hand-Object Interaction
- PoNQ: a Neural QEM-based Mesh Representation
- Holoported Characters: Real-time Free-viewpoint Rendering of Humans from Sparse RGB Cameras
- ASH: Animatable Gaussian Splats for Efficient and Photoreal Human Rendering
- Task-aligned Part-aware Panoptic Segmentation through Joint Object-Part Representations
- URHand: Universal Relightable Hands
- Learned Scanpaths Aid Blind Panoramic Video Quality Assessment
- Bidirectional Autoregressive Diffusion Model for Dance Generation
- DiffAvatar: Simulation-Ready Garment Optimization with Differentiable Simulation
- CAD-SIGNet: CAD Language Inference from Point Clouds using Layer-wise Sketch Instance Guided Attention
- PTM-VQA: Efficient Video Quality Assessment Leveraging Diverse PreTrained Models from the Wild
- Gradient Alignment for Cross-Domain Face Anti-Spoofing
- Paint-it: Text-to-Texture Synthesis via Deep Convolutional Texture Map Optimization and Physically-Based Rendering
- TextureDreamer: Image-Guided Texture Synthesis Through Geometry-Aware Diffusion
- Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution
- Multimodal Sense-Informed Forecasting of 3D Human Motions
- M&M VTO: Multi-Garment Virtual Try-On and Editing
- MFP: Making Full Use of Probability Maps for Interactive Image Segmentation
- Separate and Conquer: Decoupling Co-occurrence via Decomposition and Representation for Weakly Supervised Semantic Segmentation
- Identifying Important Group of Pixels using Interactions
- MaskPLAN: Masked Generative Layout Planning from Partial Input
- Data-Free Quantization via Pseudo-label Filtering
- Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models
- Generating Non-Stationary Textures using Self-Rectification
- MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers
- Content-Style Decoupling for Unsupervised Makeup Transfer without Generating Pseudo Ground Truth
- HumanGaussian: Text-Driven 3D Human Generation with Gaussian Splatting
- Rethinking the Objectives of Vector-Quantized Tokenizers for Image Synthesis
- Pose Adapted Shape Learning for Large-Pose Face Reenactment
- Open-Vocabulary Attention Maps with Token Optimization for Semantic Segmentation in Diffusion Models
- Unsupervised Template-assisted Point Cloud Shape Correspondence Network
- Adaptive Multi-Modal Cross-Entropy Loss for Stereo Matching
- MR-VNet: Media Restoration using Volterra Networks
- Readout Guidance: Learning Control from Diffusion Features
- Layout-Agnostic Scene Text Image Synthesis with Diffusion Models
- DiffMorpher: Unleashing the Capability of Diffusion Models for Image Morphing
- Your Image is My Video: Reshaping the Receptive Field via Image-To-Video Differentiable AutoAugmentation and Fusion
- AEROBLADE: Training-Free Detection of Latent Diffusion Images Using Autoencoder Reconstruction Error
- Learning Degradation-unaware Representation with Prior-based Latent Transformations for Blind Face Restoration
- Geometry Transfer for Stylizing Radiance Fields
- Scaling Laws of Synthetic Images for Model Training ... for Now
- SHViT: Single-Head Vision Transformer with Memory Efficient Macro Design
- Unmixing Before Fusion: A Generalized Paradigm for Multi-Source-based Hyperspectral Image Synthesis
- WOUAF: Weight Modulation for User Attribution and Fingerprinting in Text-to-Image Diffusion Models
- Panacea: Panoramic and Controllable Video Generation for Autonomous Driving
- DITTO: Dual and Integrated Latent Topologies for Implicit 3D Reconstruction
- Mixed-Precision Quantization for Federated Learning on Resource-Constrained Heterogeneous Devices
- NoiseCollage: A Layout-Aware Text-to-Image Diffusion Model Based on Noise Cropping and Merging
- Doubly Abductive Counterfactual Inference for Text-based Image Editing
- Text2QR: Harmonizing Aesthetic Customization and Scanning Robustness for Text-Guided QR Code Generation
- VidToMe: Video Token Merging for Zero-Shot Video Editing
- PAIR Diffusion: A Comprehensive Multimodal Object-Level Image Editor
- REACTO: Reconstructing Articulated Objects from a Single Video
- Fairy: Fast Parallelized Instruction-Guided Video-to-Video Synthesis
- Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
- Diversity-aware Channel Pruning for StyleGAN Compression
- Intelligent Grimm - Open-ended Visual Storytelling via Latent Diffusion Models
- Steerers: A Framework for Rotation Equivariant Keypoint Descriptors
- PIA: Your Personalized Image Animator via Plug-and-Play Modules in Text-to-Image Models
- Diffusion Handles Enabling 3D Edits for Diffusion Models by Lifting Activations to 3D
- VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence
- FaceChain-SuDe: Building Derived Class to Inherit Category Attributes for One-shot Subject-Driven Generation
- SwiftBrush: One-Step Text-to-Image Diffusion Model with Variational Score Distillation
- Exact Fusion via Feature Distribution Matching for Few-shot Image Generation
- Structure Matters: Tackling the Semantic Discrepancy in Diffusion Models for Image Inpainting
- DiffusionAvatars: Deferred Diffusion for High-fidelity 3D Head Avatars
- Single Mesh Diffusion Models with Field Latents for Texture Generation
- Sparse Semi-DETR: Sparse Learnable Queries for Semi-Supervised Object Detection
- LightIt: Illumination Modeling and Control for Diffusion Models
- GreedyViG: Dynamic Axial Graph Construction for Efficient Vision GNNs
- Towards 3D Vision with Low-Cost Single-Photon Cameras
- PICTURE: PhotorealistIC virtual Try-on from UnconstRained dEsigns
- VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models
- NeRF On-the-go: Exploiting Uncertainty for Distractor-free NeRFs in the Wild
- Fitting Flats to Flats
- LAENeRF: Local Appearance Editing for Neural Radiance Fields
- Contrastive Denoising Score for Text-guided Latent Diffusion Image Editing
- Prompt Augmentation for Self-supervised Text-guided Image Manipulation
- Universal Robustness via Median Randomized Smoothing for Real-World Super-Resolution
- In-N-Out: Faithful 3D GAN Inversion with Volumetric Decomposition for Face Editing
- Permutation Equivariance of Transformers and Its Applications
- SD-DiT: Unleashing the Power of Self-supervised Discrimination in Diffusion Transformer
- TextCraftor: Your Text Encoder Can be Image Quality Controller
- SIGNeRF: Scene Integrated Generation for Neural Radiance Fields
- FedUV: Uniformity and Variance for Heterogeneous Federated Learning
- LeftRefill: Filling Right Canvas based on Left Reference through Generalized Text-to-Image Diffusion Model
- Dancing with Still Images: Video Distillation via Static-Dynamic Disentanglement
- Denoising Point Clouds in Latent Space via Graph Convolution and Invertible Neural Network
- Learning Spatial Adaptation and Temporal Coherence in Diffusion Models for Video Super-Resolution
- InstructVideo: Instructing Video Diffusion Models with Human Feedback
- Fast ODE-based Sampling for Diffusion Models in Around 5 Steps
- DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing
- Continuous Pose for Monocular Cameras in Neural Implicit Representation
- Local-consistent Transformation Learning for Rotation-invariant Point Cloud Analysis
- Grid Diffusion Models for Text-to-Video Generation
- Observation-Guided Diffusion Probabilistic Models
- Drag Your Noise: Interactive Point-based Editing via Diffusion Semantic Propagation
- Carve3D: Improving Multi-view Reconstruction Consistency for Diffusion Models with RL Finetuning
- DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models
- Learning Continuous 3D Words for Text-to-Image Generation
- FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation
- Style Injection in Diffusion: A Training-free Approach for Adapting Large-scale Diffusion Models for Style Transfer
- HyperDreamBooth: HyperNetworks for Fast Personalization of Text-to-Image Models
- Exploiting Diffusion Prior for Generalizable Dense Prediction
- Neural Lineage
- Style Aligned Image Generation via Shared Attention
- NC-TTT: A Noise Contrastive Approach for Test-Time Training
- GenHowTo: Learning to Generate Actions and State Transformations from Instructional Videos
- AirPlanes: Accurate Plane Estimation via 3D-Consistent Embeddings
- Training-Free Pretrained Model Merging
- A Recipe for Scaling up Text-to-Video Generation with Text-free Videos
- MRFP: Learning Generalizable Semantic Segmentation from Sim-2-Real with Multi-Resolution Feature Perturbation
- Don't Look into the Dark: Latent Codes for Pluralistic Image Inpainting
- AnyDoor: Zero-shot Object-level Image Customization
- Focus on Your Instruction: Fine-grained and Multi-instruction Image Editing by Attention Modulation
- CCEdit: Creative and Controllable Video Editing via Diffusion Models
- Matching 2D Images in 3D: Metric Relative Pose from Metric Correspondences
- Mitigating Motion Blur in Neural Radiance Fields with Events and Frames
- Making Vision Transformers Truly Shift-Equivariant
- Leveraging Camera Triplets for Efficient and Accurate Structure-from-Motion
- Video-P2P: Video Editing with Cross-attention Control
- Faces that Speak: Jointly Synthesising Talking Face and Speech from Text
- SportsSloMo: A New Benchmark and Baselines for Human-centric Video Frame Interpolation
- DreamVideo: Composing Your Dream Videos with Customized Subject and Motion
- Texture-Preserving Diffusion Models for High-Fidelity Virtual Try-On
- Learned Representation-Guided Diffusion Models for Large-Image Generation
- WonderJourney: Going from Anywhere to Everywhere
- A Video is Worth 256 Bases: Spatial-Temporal Expectation-Maximization Inversion for Zero-Shot Video Editing
- Predicated Diffusion: Predicate Logic-Based Attention Guidance for Text-to-Image Diffusion Models
- Personalized Residuals for Concept-Driven Text-to-Image Generation
- PRDP: Proximal Reward Difference Prediction for Large-Scale Reward Finetuning of Diffusion Models
- Gaussian Shell Maps for Efficient 3D Human Generation
- AdaShift: Learning Discriminative Self-Gated Neural Feature Activation With an Adaptive Shift Factor
- StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On
- InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning
- Accurate Training Data for Occupancy Map Prediction in Automated Driving Using Evidence Theory
- CAT-DM: Controllable Accelerated Virtual Try-on with Diffusion Model
- Cache Me if You Can: Accelerating Diffusion Models through Block Caching
- Training Generative Image Super-Resolution Models by Wavelet-Domain Losses Enables Better Control of Artifacts
- Generative Unlearning for Any Identity
- Towards Understanding Cross and Self-Attention in Stable Diffusion for Text-Guided Image Editing
- Instruct-Imagen: Image Generation with Multi-modal Instruction
- CDFormer: When Degradation Prediction Embraces Diffusion Model for Blind Image Super-Resolution
- FSRT: Facial Scene Representation Transformer for Face Reenactment from Factorized Appearance Head-pose and Facial Expression Features
- Boosting Diffusion Models with Moving Average Sampling in Frequency Domain
- Tackling the Singularities at the Endpoints of Time Intervals in Diffusion Models
- Towards Effective Usage of Human-Centric Priors in Diffusion Models for Text-based Human Image Generation
- Towards Memorization-Free Diffusion Models
- Face2Diffusion for Fast and Editable Face Personalization
- RNb-NeuS: Reflectance and Normal-based Multi-View 3D Reconstruction
- MotionEditor: Editing Video Motion via Content-Aware Diffusion
- GeneAvatar: Generic Expression-Aware Volumetric Head Avatar Editing from a Single Image
- GaussianDreamer: Fast Generation from Text to 3D Gaussians by Bridging 2D and 3D Diffusion Models
- DreamComposer: Controllable 3D Object Generation via Multi-View Conditions
- Deformable One-shot Face Stylization via DINO Semantic Guidance
- 2S-UDF: A Novel Two-stage UDF Learning Method for Robust Non-watertight Model Reconstruction from Multi-view Images
- SNED: Superposition Network Architecture Search for Efficient Video Diffusion Model
- OmniLocalRF: Omnidirectional Local Radiance Fields from Dynamic Videos
- 3D Multi-frame Fusion for Video Stabilization
- Neural Point Cloud Diffusion for Disentangled 3D Shape and Appearance Generation
- In Search of a Data Transformation That Accelerates Neural Field Training
- Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer
- Transcending Forgery Specificity with Latent Space Augmentation for Generalizable Deepfake Detection
- TeTriRF: Temporal Tri-Plane Radiance Fields for Efficient Free-Viewpoint Video
- Your Student is Better Than Expected: Adaptive Teacher-Student Collaboration for Text-Conditional Diffusion Models
- Named Entity Driven Zero-Shot Image Manipulation
- StyleCineGAN: Landscape Cinemagraph Generation using a Pre-trained StyleGAN
- Orthogonal Adaptation for Modular Customization of Diffusion Models
- SSR-Encoder: Encoding Selective Subject Representation for Subject-Driven Generation
- Intrinsic Image Diffusion for Indoor Single-view Material Estimation
- Concept Weaver: Enabling Multi-Concept Fusion in Text-to-Image Models
- RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models
- TRIP: Temporal Residual Learning with Image Noise Prior for Image-to-Video Diffusion Models
- Generating Illustrated Instructions
- LiDAR4D: Dynamic Neural Fields for Novel Space-time View LiDAR Synthesis
- Anomaly Score: Evaluating Generative Models and Individual Generated Images based on Complexity and Vulnerability
- Clockwork Diffusion: Efficient Generation With Model-Step Distillation
- Condition-Aware Neural Network for Controlled Image Generation
- VAREN: Very Accurate and Realistic Equine Network
- Time- Memory- and Parameter-Efficient Visual Adaptation
- It's All About Your Sketch: Democratising Sketch Control in Diffusion Models
- Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis
- MedBN: Robust Test-Time Adaptation against Malicious Test Samples
- CAMEL: CAusal Motion Enhancement Tailored for Lifting Text-driven Video Editing
- Seeing the World through Your Eyes
- Rethinking the Spatial Inconsistency in Classifier-Free Diffusion Guidance
- Friendly Sharpness-Aware Minimization
- LucidDreamer: Towards High-Fidelity Text-to-3D Generation via Interval Score Matching
- DanceCamera3D: 3D Camera Movement Synthesis with Music and Dance
- TiNO-Edit: Timestep and Noise Optimization for Robust Diffusion-Based Image Editing
- Low-Latency Neural Stereo Streaming
- MVIP-NeRF: Multi-view 3D Inpainting on NeRF Scenes via Diffusion Prior
- Learning Structure-from-Motion with Graph Attention Networks
- VideoBooth: Diffusion-based Video Generation with Image Prompts
- TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models
- Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities
- TetraSphere: A Neural Descriptor for O(3)-Invariant Point Cloud Analysis
- AVID: Any-Length Video Inpainting with Diffusion Model
- Generalizable Novel-View Synthesis using a Stereo Camera
- Grounded Text-to-Image Synthesis with Attention Refocusing
- Robust Self-calibration of Focal Lengths from the Fundamental Matrix
- 4D-fy: Text-to-4D Generation Using Hybrid Score Distillation Sampling
- Geometry-aware Reconstruction and Fusion-refined Rendering for Generalizable Neural Radiance Fields
- Beyond Textual Constraints: Learning Novel Diffusion Conditions with Fewer Examples
- Prompting Hard or Hardly Prompting: Prompt Inversion for Text-to-Image Diffusion Models
- FlowerFormer: Empowering Neural Architecture Encoding using a Flow-aware Graph Transformer
- AnyScene: Customized Image Synthesis with Composited Foreground
- Vlogger: Make Your Dream A Vlog
- FreeControl: Training-Free Spatial Control of Any Text-to-Image Diffusion Model with Any Condition
- ControlRoom3D: Room Generation using Semantic Proxy Rooms
- Correcting Diffusion Generation through Resampling
- AZ-NAS: Assembling Zero-Cost Proxies for Network Architecture Search
- Ranni: Taming Text-to-Image Diffusion for Accurate Instruction Following
- Hierarchical Patch Diffusion Models for High-Resolution Video Generation
- YolOOD: Utilizing Object Detection Concepts for Multi-Label Out-of-Distribution Detection
- SuGaR: Surface-Aligned Gaussian Splatting for Efficient 3D Mesh Reconstruction and High-Quality Mesh Rendering
- GenN2N: Generative NeRF2NeRF Translation
- Portrait4D: Learning One-Shot 4D Head Avatar Synthesis using Synthetic Data
- CosmicMan: A Text-to-Image Foundation Model for Humans
- Sat2Scene: 3D Urban Scene Generation from Satellite Images with Diffusion
- SuperPrimitive: Scene Reconstruction at a Primitive Level
- ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Image
- HEAL-SWIN: A Vision Transformer On The Sphere
- VecFusion: Vector Font Generation with Diffusion
- StegoGAN: Leveraging Steganography for Non-Bijective Image-to-Image Translation
- The Devil is in the Details: StyleFeatureEditor for Detail-Rich StyleGAN Inversion and High Quality Image Editing
- Text-Driven Image Editing via Learnable Regions
- DyBluRF: Dynamic Neural Radiance Fields from Blurry Monocular Video
- KPConvX: Modernizing Kernel Point Convolution with Kernel Attention
- ANIM: Accurate Neural Implicit Model for Human Reconstruction from a single RGB-D Image
- Mean-Shift Feature Transformer
- Improving Subject-Driven Image Synthesis with Subject-Agnostic Guidance
- State Space Models for Event Cameras
- Relightful Harmonization: Lighting-aware Portrait Background Replacement
- SurMo: Surface-based 4D Motion Modeling for Dynamic Human Rendering
- Don’t Drop Your Samples! Coherence-Aware Training Benefits Conditional Diffusion
- Revisiting Sampson Approximations for Geometric Estimation Problems
- Wired Perspectives: Multi-View Wire Art Embraces Generative AI
- InceptionNeXt: When Inception Meets ConvNeXt
- CHAIN: Enhancing Generalization in Data-Efficient GANs via lipsCHitz continuity constrAIned Normalization
- Language-driven Object Fusion into Neural Radiance Fields with Pose-Conditioned Dataset Updates
- Video Interpolation with Diffusion Models
- JeDi: Joint-Image Diffusion Models for Finetuning-Free Personalized Text-to-Image Generation
- Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model
- LTM: Lightweight Textured Mesh Extraction and Refinement of Large Unbounded Scenes for Efficient Storage and Real-time Rendering
- DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation
- On Exact Inversion of DPM-Solvers
- DaReNeRF: Direction-aware Representation for Dynamic Scenes
- Total Selfie: Generating Full-Body Selfies
- Efficient Detection of Long Consistent Cycles and its Application to Distributed Synchronization
- Improving Physics-Augmented Continuum Neural Radiance Field-Based Geometry-Agnostic System Identification with Lagrangian Particle Optimization
- DreamControl: Control-Based Text-to-3D Generation with 3D Self-Prior
- Self-correcting LLM-controlled Diffusion Models
- DreamMatcher: Appearance Matching Self-Attention for Semantically-Consistent Text-to-Image Personalization
- Person in Place: Generating Associative Skeleton-Guidance Maps for Human-Object Interaction Image Editing
- Building Optimal Neural Architectures using Interpretable Knowledge
- A&B BNN: Add&Bit-Operation-Only Hardware-Friendly Binary Neural Network
- SCEdit: Efficient and Controllable Image Diffusion Generation via Skip Connection Editing
- You Only Need Less Attention at Each Stage in Vision Transformers
- FreeU: Free Lunch in Diffusion U-Net
- HOIDiffusion: Generating Realistic 3D Hand-Object Interaction Data
- Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation
- Towards Accurate and Robust Architectures via Neural Architecture Search
- Coarse-to-Fine Latent Diffusion for Pose-Guided Person Image Synthesis
- InteractDiffusion: Interaction Control in Text-to-Image Diffusion Models
- ConsistNet: Enforcing 3D Consistency for Multi-view Images Diffusion
- Dysen-VDM: Empowering Dynamics-aware Text-to-Video Diffusion with LLMs
- Dynamic Policy-Driven Adaptive Multi-Instance Learning for Whole Slide Image Classification
- PLACE: Adaptive Layout-Semantic Fusion for Semantic Image Synthesis
- 3D Geometry-Aware Deformable Gaussian Splatting for Dynamic View Synthesis
- SpikeNeRF: Learning Neural Radiance Fields from Continuous Spike Stream
- Generate Like Experts: Multi-Stage Font Generation by Incorporating Font Transfer Process into Diffusion Models
- ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models
- 3DGS-Avatar: Animatable Avatars via Deformable 3D Gaussian Splatting
- Balancing Act: Distribution-Guided Debiasing in Diffusion Models
- Customization Assistant for Text-to-Image Generation
- Emu Edit: Precise Image Editing via Recognition and Generation Tasks
- Spacetime Gaussian Feature Splatting for Real-Time Dynamic View Synthesis
- Customize your NeRF: Adaptive Source Driven 3D Scene Editing via Local-Global Iterative Training
- One-Shot Structure-Aware Stylized Image Synthesis
- Puff-Net: Efficient Style Transfer with Pure Content and Style Feature Fusion Network
- IReNe: Instant Recoloring of Neural Radiance Fields
- Relation Rectification in Diffusion Model
- Template Free Reconstruction of Human-object Interaction with Procedural Interaction Generation
- Long-Tailed Anomaly Detection with Learnable Class Names
- Boosting Object Detection with Zero-Shot Day-Night Domain Adaptation
- Learning Occupancy for Monocular 3D Object Detection
- PatchFusion: An End-to-End Tile-Based Framework for High-Resolution Monocular Metric Depth Estimation
- Bootstrapping Chest CT Image Understanding by Distilling Knowledge from X-ray Expert Models
- ZePT: Zero-Shot Pan-Tumor Segmentation via Query-Disentangling and Self-Prompting
- MRC-Net: 6-DoF Pose Estimation with MultiScale Residual Correlation
- Explaining the Implicit Neural Canvas: Connecting Pixels to Neurons by Tracing their Contributions
- Learning Triangular Distribution in Visual World
- Depth Prompting for Sensor-Agnostic Depth Estimation
- NeRFDeformer: NeRF Transformation from a Single View via 3D Scene Flows
- Understanding Video Transformers via Universal Concept Discovery
- EscherNet: A Generative Model for Scalable View Synthesis
- Beyond Seen Primitive Concepts and Attribute-Object Compositional Learning
- Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation
- Weakly Supervised Monocular 3D Detection with a Single-View Image
- A Stealthy Wrongdoer: Feature-Oriented Reconstruction Attack against Split Learning
- Modality-agnostic Domain Generalizable Medical Image Segmentation by Multi-Frequency in Multi-Scale Attention
- Efficient Test-Time Adaptation of Vision-Language Models
- FairRAG: Fair Human Generation via Fair Retrieval Augmentation
- HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data
- SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities
- Open3DSG: Open-Vocabulary 3D Scene Graphs from Point Clouds with Queryable Objects and Open-Set Relationships
- CPLIP: Zero-Shot Learning for Histopathology with Comprehensive Vision-Language Alignment
- Explaining CLIP's Performance Disparities on Data from Blind/Low Vision Users
- OpenBias: Open-set Bias Detection in Text-to-Image Generative Models
- Compositional Chain-of-Thought Prompting for Large Multimodal Models
- Behind the Veil: Enhanced Indoor 3D Scene Reconstruction with Occluded Surfaces Completion
- MuseChat: A Conversational Music Recommendation System for Videos
- ViewFusion: Towards Multi-View Consistency via Interpolated Denoising
- DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback
- HandBooster: Boosting 3D Hand-Mesh Reconstruction by Conditional Synthesis and Sampling of Hand-Object Interactions
- HOISDF: Constraining 3D Hand-Object Pose Estimation with Global Signed Distance Fields
- CrowdDiff: Multi-hypothesis Crowd Density Estimation using Diffusion Models
- VTimeLLM: Empower LLM to Grasp Video Moments
- MICap: A Unified Model for Identity-Aware Movie Descriptions
- Structured Gradient-based Interpretations via Norm-Regularized Adversarial Training
- SketchINR: A First Look into Sketches as Implicit Neural Representations
- 6D-Diff: A Keypoint Diffusion Framework for 6D Object Pose Estimation
- Referring Image Editing: Object-level Image Editing via Referring Expressions
- SocialCounterfactuals: Probing and Mitigating Intersectional Social Biases in Vision-Language Models with Counterfactual Examples
- SAOR: Single-View Articulated Object Reconstruction
- RegionGPT: Towards Region Understanding Vision Language Model
- AttriHuman-3D: Editable 3D Human Avatar Generation with Attribute Decomposition and Indexing
- The Manga Whisperer: Automatically Generating Transcriptions for Comics
- VOODOO 3D: Volumetric Portrait Disentanglement For One-Shot 3D Head Reenactment
- Physics-guided Shape-from-Template: Monocular Video Perception through Neural Surrogate Models
- Previously on ... From Recaps to Story Summarization
- See Say and Segment: Teaching LMMs to Overcome False Premises
- Multi-Modal Hallucination Control by Visual Information Grounding
- 3DFIRES: Few Image 3D REconstruction for Scenes with Hidden Surfaces
- Koala: Key Frame-Conditioned Long Video-LLM
- Prompting Vision Foundation Models for Pathology Image Analysis
- A Unified Framework for Microscopy Defocus Deblur with Multi-Pyramid Transformer and Contrastive Learning
- E-GPS: Explainable Geometry Problem Solving via Top-Down Solver and Bottom-Up Generator
- HiPose: Hierarchical Binary Surface Encoding and Correspondence Pruning for RGB-D 6DoF Object Pose Estimation
- SleepVST: Sleep Staging from Near-Infrared Video Signals using Pre-Trained Transformers
- Correlation-aware Coarse-to-fine MLPs for Deformable Medical Image Registration
- Tumor Micro-environment Interactions Guided Graph Learning for Survival Analysis of Human Cancers from Whole-slide Pathological Images
- An Upload-Efficient Scheme for Transferring Knowledge From a Server-Side Pre-trained Generator to Clients in Heterogeneous Federated Learning
- Token Transformation Matters: Towards Faithful Post-hoc Explanation for Vision Transformer
- MonoCD: Monocular 3D Object Detection with Complementary Depths
- Versatile Medical Image Segmentation Learned from Multi-Source Datasets via Model Self-Disambiguation
- VS: Reconstructing Clothed 3D Human from Single Image via Vertex Shift
- Rethinking Inductive Biases for Surface Normal Estimation
- ConTex-Human: Free-View Rendering of Human from a Single Image with Texture-Consistent Synthesis
- Masked Autoencoders for Microscopy are Scalable Learners of Cellular Biology
- Enhancing Intrinsic Features for Debiasing via Investigating Class-Discerning Common Attributes in Bias-Contrastive Pair
- Think Twice Before Selection: Federated Evidential Active Learning for Medical Image Analysis with Domain Shifts
- Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models
- Neural Parametric Gaussians for Monocular Non-Rigid Object Reconstruction
- Multi-Modal Proxy Learning Towards Personalized Visual Multiple Clustering
- Modality-Agnostic Structural Image Representation Learning for Deformable Multi-Modality Medical Image Registration
- Unleashing Network Potentials for Semantic Scene Completion
- Continual Self-supervised Learning: Towards Universal Multi-modal Medical Data Representation Learning
- Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions
- Physical Backdoor: Towards Temperature-based Backdoor Attacks in the Physical World
- CAPE: CAM as a Probabilistic Ensemble for Enhanced DNN Interpretation
- Honeybee: Locality-enhanced Projector for Multimodal LLM
- A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames
- Differentiable Display Photometric Stereo
- Quilt-LLaVA: Visual Instruction Tuning by Extracting Localized Narratives from Open-Source Histopathology Videos
- Learning Large-Factor EM Image Super-Resolution with Generative Priors
- Comparing the Decision-Making Mechanisms by Transformers and CNNs via Explanation Methods
- Tune-An-Ellipse: CLIP Has Potential to Find What You Want
- Doodle Your 3D: From Abstract Freehand Sketches to Precise 3D Shapes
- Constructing and Exploring Intermediate Domains in Mixed Domain Semi-supervised Medical Image Segmentation
- 3D-LFM: Lifting Foundation Model
- Unsupervised 3D Structure Inference from Category-Specific Image Collections
- Tyche: Stochastic In-Context Learning for Medical Image Segmentation
- Towards Better Vision-Inspired Vision-Language Models
- SeaBird: Segmentation in Bird’s View with Dice Loss Improves Monocular 3D Detection of Large Objects
- UniDepth: Universal Monocular Metric Depth Estimation
- CAD: Photorealistic 3D Generation via Adversarial Distillation
- GeoReF: Geometric Alignment Across Shape Variation for Category-level Object Pose Refinement
- Intraoperative 2D/3D Image Registration via Differentiable X-ray Rendering
- Model Inversion Robustness: Can Transfer Learning Help?
- Instance-aware Contrastive Learning for Occluded Human Mesh Reconstruction
- DiffPortrait3D: Controllable Diffusion for Zero-Shot Portrait View Synthesis
- Navigate Beyond Shortcuts: Debiased Learning Through the Lens of Neural Collapse
- Diffusion Reflectance Map: Single-Image Stochastic Inverse Rendering of Illumination and Reflectance
- Collaborative Learning of Anomalies with Privacy (CLAP) for Unsupervised Video Anomaly Detection: A New Baseline
- Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
- EventPS: Real-Time Photometric Stereo Using an Event Camera
- Validating Privacy-Preserving Face Recognition under a Minimum Assumption
- SignGraph: A Sign Sequence is Worth Graphs of Nodes
- Know Your Neighbors: Improving Single-View Reconstruction via Spatial Vision-Language Reasoning
- MiKASA: Multi-Key-Anchor & Scene-Aware Transformer for 3D Visual Grounding
- Hearing Anything Anywhere
- Fair-VPT: Fair Visual Prompt Tuning for Image Classification
- Towards Efficient Replay in Federated Incremental Learning
- Label-Efficient Group Robustness via Out-of-Distribution Concept Curation
- Diversified and Personalized Multi-rater Medical Image Segmentation
- BiTT: Bi-directional Texture Reconstruction of Interacting Two Hands from a Single Image
- SCoFT: Self-Contrastive Fine-Tuning for Equitable Image Generation
- HiLo: Detailed and Robust 3D Clothed Human Reconstruction with High- and Low-Frequency Information of Parametric Models
- Do Vision and Language Encoders Represent the World Similarly?
- DiffusionGAN3D: Boosting Text-guided 3D Generation and Domain Adaptation by Combining 3D GANs and Diffusion Priors
- LoSh: Long-Short Text Joint Prediction Network for Referring Video Object Segmentation
- Naturally Supervised 3D Visual Grounding with Language-Regularized Concept Learners
- EpiDiff: Enhancing Multi-View Synthesis via Localized Epipolar-Constrained Diffusion
- Facial Identity Anonymization via Intrinsic and Extrinsic Attention Distraction
- Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
- Situational Awareness Matters in 3D Vision Language Reasoning
- Brush2Prompt: Contextual Prompt Generator for Object Inpainting
- MemSAM: Taming Segment Anything Model for Echocardiography Video Segmentation
- DiG-IN: Diffusion Guidance for Investigating Networks - Uncovering Classifier Differences Neuron Visualisations and Visual Counterfactual Explanations
- WorDepth: Variational Language Prior for Monocular Depth Estimation
- LaneCPP: Continuous 3D Lane Detection using Physical Priors
- Efficient Privacy-Preserving Visual Localization Using 3D Ray Clouds
- Dynamic Graph Representation with Knowledge-aware Attention for Histopathology Whole Slide Image Analysis
- Slice3D: Multi-Slice Occlusion-Revealing Single View 3D Reconstruction
- What Sketch Explainability Really Means for Downstream Tasks?
- FocusMAE: Gallbladder Cancer Detection from Ultrasound Videos with Focused Masked Autoencoders
- Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation
- Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding
- CycleINR: Cycle Implicit Neural Representation for Arbitrary-Scale Volumetric Super-Resolution of Medical Data
- SEED-Bench: Benchmarking Multimodal Large Language Models
- MCPNet: An Interpretable Classifier via Multi-Level Concept Prototypes
- XFibrosis: Explicit Vessel-Fiber Modeling for Fibrosis Staging from Liver Pathology Images
- GROUNDHOG: Grounding Large Language Models to Holistic Segmentation
- PointInfinity: Resolution-Invariant Point Diffusion Models
- Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning
- LQMFormer: Language-aware Query Mask Transformer for Referring Image Segmentation
- R-Cyclic Diffuser: Reductive and Cyclic Latent Diffusion for 3D Clothed Human Digitalization
- Revisiting Counterfactual Problems in Referring Expression Comprehension
- Visual Concept Connectome (VCC): Open World Concept Discovery and their Interlayer Connections in Deep Models
- Leak and Learn: An Attacker's Cookbook to Train Using Leaked Data from Federated Learning
- ProMark: Proactive Diffusion Watermarking for Causal Attribution
- Boosting Self-Supervision for Single-View Scene Completion via Knowledge Distillation
- NeRSP: Neural 3D Reconstruction for Reflective Objects with Sparse Polarized Images
- CNC-Net: Self-Supervised Learning for CNC Machining Operations
- Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language
- Language-only Training of Zero-shot Composed Image Retrieval
- In-distribution Public Data Synthesis with Diffusion Models for Differentially Private Image Classification
- Text-Image Alignment for Diffusion-Based Perception
- ExMap: Leveraging Explainability Heatmaps for Unsupervised Group Robustness to Spurious Correlations
- Plug-and-Play Diffusion Distillation
- Incorporating Geo-Diverse Knowledge into Prompting for Increased Geographical Robustness in Object Recognition
- CORES: Convolutional Response-based Score for Out-of-distribution Detection
- Joint Reconstruction of 3D Human and Object via Contact-Based Refinement Transformer
- WWW: A Unified Framework for Explaining What Where and Why of Neural Networks by Interpretation of Neuron Concepts
- Neural Underwater Scene Representation
- MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
- Mind The Edge: Refining Depth Edges in Sparsely-Supervised Monocular Depth Estimation
- Global and Local Prompts Cooperation via Optimal Transport for Federated Learning
- On the Faithfulness of Vision Transformer Explanations
- The Unreasonable Effectiveness of Pre-Trained Features for Camera Pose Refinement
- Communication-Efficient Federated Learning with Accelerated Client Gradient
- Make Me a BNN: A Simple Strategy for Estimating Bayesian Uncertainty from Pre-trained Models
- Sparse Views Near Light: A Practical Paradigm for Uncalibrated Point-light Photometric Stereo
- Decomposing Disease Descriptions for Enhanced Pathology Detection: A Multi-Aspect Vision-Language Pre-training Framework
- Towards Learning a Generalist Model for Embodied Navigation
- Bi-level Learning of Task-Specific Decoders for Joint Registration and One-Shot Medical Image Segmentation
- Incremental Residual Concept Bottleneck Models
- Unknown Prompt the only Lacuna: Unveiling CLIP's Potential for Open Domain Generalization
- V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs
- Uncertainty Visualization via Low-Dimensional Posterior Projections
- The STVchrono Dataset: Towards Continuous Change Recognition in Time
- 3DToonify: Creating Your High-Fidelity 3D Stylized Avatar Easily from 2D Portrait Images
- Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding
- Enhancing 3D Fidelity of Text-to-3D using Cross-View Correspondences
- Each Test Image Deserves A Specific Prompt: Continual Test-Time Adaptation for 2D Medical Image Segmentation
- RichDreamer: A Generalizable Normal-Depth Diffusion Model for Detail Richness in Text-to-3D
- FLHetBench: Benchmarking Device and State Heterogeneity in Federated Learning
- GenFlow: Generalizable Recurrent Flow for 6D Pose Refinement of Novel Objects
- Language Models as Black-Box Optimizers for Vision-Language Models
- Interpretable Measures of Conceptual Similarity by Complexity-Constrained Descriptive Auto-Encoding
- BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning
- VA3: Virtually Assured Amplification Attack on Probabilistic Copyright Protection for Text-to-Image Generative Models
- ScanFormer: Referring Expression Comprehension by Iteratively Scanning
- EMCAD: Efficient Multi-scale Convolutional Attention Decoding for Medical Image Segmentation
- PrPSeg: Universal Proposition Learning for Panoramic Renal Pathology Segmentation
- Steganographic Passport: An Owner and User Verifiable Credential for Deep Model IP Protection Without Retraining
- Discovering and Mitigating Visual Biases through Keyword Explanation
- Seg2Reg: Differentiable 2D Segmentation to 1D Regression Rendering for 360 Room Layout Reconstruction
- InstaGen: Enhancing Object Detection by Training on Synthetic Dataset
- FairDeDup: Detecting and Mitigating Vision-Language Fairness Disparities in Semantic Dataset Deduplication
- Building Vision-Language Models on Solid Foundations with Masked Distillation
- From-Ground-To-Objects: Coarse-to-Fine Self-supervised Monocular Depth Estimation of Dynamic Objects with Ground Contact Prior
- Posterior Distillation Sampling
- CoG-DQA: Chain-of-Guiding Learning with Large Language Models for Diagram Question Answering
- SNIFFER: Multimodal Large Language Model for Explainable Out-of-Context Misinformation Detection
- FairCLIP: Harnessing Fairness in Vision-Language Learning
- Weakly-Supervised Emotion Transition Learning for Diverse 3D Co-speech Gesture Generation
- Privacy-Preserving Optics for Enhancing Protection in Face De-Identification
- WateRF: Robust Watermarks in Radiance Fields for Protection of Copyrights
- VMINer: Versatile Multi-view Inverse Rendering with Near- and Far-field Light Sources
- ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
- Data-Efficient Unsupervised Interpolation Without Any Intermediate Frame for 4D Medical Images
- Towards Language-Driven Video Inpainting via Multimodal Large Language Models
- Discontinuity-preserving Normal Integration with Auxiliary Edges
- Mining Supervision for Dynamic Regions in Self-Supervised Monocular Depth Estimation
- Would Deep Generative Models Amplify Bias in Future Models?
- SRTube: Video-Language Pre-Training with Action-Centric Video Tube Features and Semantic Role Labeling
- Dual-View Visual Contextualization for Web Navigation
- Deep Single Image Camera Calibration by Heatmap Regression to Recover Fisheye Images Under Manhattan World Assumption
- WildlifeMapper: Aerial Image Analysis for Multi-Species Detection and Identification
- Bayesian Differentiable Physics for Cloth Digitalization
- Pixel-Aligned Language Model
- Diffusion-driven GAN Inversion for Multi-Modal Face Image Generation
- CityDreamer: Compositional Generative Model of Unbounded 3D Cities
- Visual Objectification in Films: Towards a New AI Task for Video Interpretation
- A Theory of Joint Light and Heat Transport for Lambertian Scenes
- Cross-Dimension Affinity Distillation for 3D EM Neuron Segmentation
- Instance-level Expert Knowledge and Aggregate Discriminative Attention for Radiology Report Generation
- WALT3D: Generating Realistic Training Data from Time-Lapse Imagery for Reconstructing Dynamic Objects Under Occlusion
- DeCoTR: Enhancing Depth Completion with 2D and 3D Attentions
- Holistic Autonomous Driving Understanding by Bird’s-Eye-View Injected Multi-Modal Large Models
- G-NeRF: Geometry-enhanced Novel View Synthesis from Single-View Images
- PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs
- Learning the 3D Fauna of the Web
- SCINeRF: Neural Radiance Fields from a Snapshot Compressive Image
- AHIVE: Anatomy-aware Hierarchical Vision Encoding for Interactive Radiology Report Retrieval
- EarthLoc: Astronaut Photography Localization by Indexing Earth from Space
- En3D: An Enhanced Generative Model for Sculpting 3D Humans from 2D Synthetic Data
- Blind Image Quality Assessment Based on Geometric Order Learning
- Question Aware Vision Transformer for Multimodal Reasoning
- Transcriptomics-guided Slide Representation Learning in Computational Pathology
- Self-Discovering Interpretable Diffusion Latent Directions for Responsible Text-to-Image Generation
- IDGuard: Robust General Identity-centric POI Proactive Defense Against Face Editing Abuse
- Free3D: Consistent Novel View Synthesis without 3D Representation
- PH-Net: Semi-Supervised Breast Lesion Segmentation via Patch-wise Hardness
- EvDiG: Event-guided Direct and Global Components Separation
- Incremental Nuclei Segmentation from Histopathological Images via Future-class Awareness and Compatibility-inspired Distillation
- IBD-SLAM: Learning Image-Based Depth Fusion for Generalizable SLAM
- Viewpoint-Aware Visual Grounding in 3D Scenes
- SI-MIL: Taming Deep MIL for Self-Interpretability in Gigapixel Histopathology
- SyncMask: Synchronized Attentional Masking for Fashion-centric Vision-Language Pretraining
- Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval
- An Edit Friendly DDPM Noise Space: Inversion and Manipulations
- ZeroShape: Regression-based Zero-shot Shape Reconstruction
- Retrieval-Augmented Egocentric Video Captioning
- MonoDiff: Monocular 3D Object Detection and Pose Estimation with Diffusion Models
- MonoNPHM: Dynamic Head Reconstruction from Monocular Videos
- Improved Visual Grounding through Self-Consistent Explanations
- Spin-UP: Spin Light for Natural Light Uncalibrated Photometric Stereo
- MicroDiffusion: Implicit Representation-Guided Diffusion for 3D Reconstruction from Limited 2D Microscopy Projections
- Fully Convolutional Slice-to-Volume Reconstruction for Single-Stack MRI
- InNeRF360: Text-Guided 3D-Consistent Object Inpainting on 360-degree Neural Radiance Fields
- Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
- Structure-Aware Sparse-View X-ray 3D Reconstruction
- MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning
- Habitat Synthetic Scenes Dataset (HSSD-200): An Analysis of 3D Scene Scale and Realism Tradeoffs for ObjectGoal Navigation
- Multi-modal In-Context Learning Makes an Ego-evolving Scene Text Recognizer
- Holistic Features are almost Sufficient for Text-to-Video Retrieval
- Streaming Dense Video Captioning
- PhyScene: Physically Interactable 3D Scene Synthesis for Embodied AI
- EASE-DETR: Easing the Competition among Object Queries
- 3D Feature Tracking via Event Camera
- EgoGen: An Egocentric Synthetic Data Generator
- RadSimReal: Bridging the Gap Between Synthetic and Real Data in Radar Object Detection With Simulation
- OCAI: Improving Optical Flow Estimation by Occlusion and Consistency Aware Interpolation
- Rethinking Boundary Discontinuity Problem for Oriented Object Detection
- Diffusion-ES: Gradient-free Planning with Diffusion for Autonomous and Instruction-guided Driving
- Improved Zero-Shot Classification by Adapting VLMs with Text Descriptions
- Test-Time Zero-Shot Temporal Action Localization
- CRKD: Enhanced Camera-Radar Object Detection with Cross-modality Knowledge Distillation
- Unsupervised Video Domain Adaptation with Masked Pre-Training and Collaborative Self-Training
- D3still: Decoupled Differential Distillation for Asymmetric Image Retrieval
- What When and Where? Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions
- CURSOR: Scalable Mixed-Order Hypergraph Matching with CUR Decomposition
- FlowDiffuser: Advancing Optical Flow Estimation with Diffusion Models
- Effective Video Mirror Detection with Inconsistent Motion Cues
- A Category Agnostic Model for Visual Rearrangment
- Uncovering What Why and How: A Comprehensive Benchmark for Causation Understanding of Video Anomaly
- Atom-Level Optical Chemical Structure Recognition with Limited Supervision
- Multimodal Industrial Anomaly Detection by Crossmodal Feature Mapping
- Region-Based Representations Revisited
- How to Handle Sketch-Abstraction in Sketch-Based Image Retrieval?
- Versatile Navigation Under Partial Observability via Value-guided Diffusion Policy
- CaKDP: Category-aware Knowledge Distillation and Pruning Framework for Lightweight 3D Object Detection
- LORS: Low-rank Residual Structure for Parameter-Efficient Network Stacking
- Smart Help: Strategic Opponent Modeling for Proactive and Adaptive Robot Assistance in Households
- GenH2R: Learning Generalizable Human-to-Robot Handover via Scalable Simulation Demonstration and Imitation
- Fourier-basis Functions to Bridge Augmentation Gap: Rethinking Frequency Augmentation in Image Classification
- Bézier Everywhere All at Once: Learning Drivable Lanes as Bézier Graphs
- Learning Vision from Models Rivals Learning Vision from Data
- Dual Pose-invariant Embeddings: Learning Category and Object-specific Discriminative Representations for Recognition and Retrieval
- VicTR: Video-conditioned Text Representations for Activity Recognition
- VLP: Vision Language Planning for Autonomous Driving
- Efficient Meshflow and Optical Flow Estimation from Event Cameras
- MemFlow: Optical Flow Estimation and Prediction with Memory
- OmniViD: A Generative Framework for Universal Video Understanding
- VSRD: Instance-Aware Volumetric Silhouette Rendering for Weakly Supervised 3D Object Detection
- Self-Distilled Masked Auto-Encoders are Efficient Video Anomaly Detectors
- Compositional Video Understanding with Spatiotemporal Structure-based Transformers
- UnO: Unsupervised Occupancy Fields for Perception and Forecasting
- Generalizing 6-DoF Grasp Detection via Domain Prior Knowledge
- LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding
- PELA: Learning Parameter-Efficient Models with Low-Rank Approximation
- An N-Point Linear Solver for Line and Motion Estimation with Event Cameras
- Neural Visibility Field for Uncertainty-Driven Active Mapping
- Logit Standardization in Knowledge Distillation
- Selective Interpretable and Motion Consistent Privacy Attribute Obfuscation for Action Recognition
- PlatoNeRF: 3D Reconstruction in Plato's Cave via Single-View Two-Bounce Lidar
- Joint-Task Regularization for Partially Labeled Multi-Task Learning
- 3D LiDAR Mapping in Dynamic Environments using a 4D Implicit Neural Representation
- CaDeT: a Causal Disentanglement Approach for Robust Trajectory Prediction in Autonomous Driving
- Hyperspherical Classification with Dynamic Label-to-Prototype Assignment
- LASIL: Learner-Aware Supervised Imitation Learning For Long-term Microscopic Traffic Simulation
- PNeRV: Enhancing Spatial Consistency via Pyramidal Neural Representation for Videos
- CyberDemo: Augmenting Simulated Human Demonstration for Real-World Dexterous Manipulation
- Exploring Orthogonality in Open World Object Detection
- Novel Class Discovery for Ultra-Fine-Grained Visual Categorization
- Frozen Feature Augmentation for Few-Shot Image Classification
- Text-to-Image Diffusion Models are Great Sketch-Photo Matchmakers
- Finding Lottery Tickets in Vision Models via Data-driven Spectral Foresight Pruning
- RadarDistill: Boosting Radar-based Object Detection Performance via Knowledge Distillation from LiDAR Features
- Driving Everywhere with Large Language Model Policy Adaptation
- LSK3DNet: Towards Effective and Efficient 3D Perception with Large Sparse Kernels
- Point Segment and Count: A Generalized Framework for Object Counting
- Zero-TPrune: Zero-Shot Token Pruning through Leveraging of the Attention Graph in Pre-Trained Transformers
- Hyperbolic Learning with Synthetic Captions for Open-World Detection
- Harnessing the Power of MLLMs for Transferable Text-to-Image Person ReID
- Towards Generalizable Multi-Object Tracking
- Why Not Use Your Textbook? Knowledge-Enhanced Procedure Planning of Instructional Videos
- RealNet: A Feature Selection Network with Realistic Synthetic Anomaly for Anomaly Detection
- Open-Vocabulary Object 6D Pose Estimation
- Visual Point Cloud Forecasting enables Scalable Autonomous Driving
- Step Differences in Instructional Video
- Dr2Net: Dynamic Reversible Dual-Residual Networks for Memory-Efficient Finetuning
- End-to-End Spatio-Temporal Action Localisation with Video Transformers
- Gaussian Splatting SLAM
- Not All Voxels Are Equal: Hardness-Aware Semantic Scene Completion with Self-Distillation
- TransLoc4D: Transformer-based 4D Radar Place Recognition
- ExACT: Language-guided Conceptual Reasoning and Uncertainty Estimation for Event-based Action Recognition and More
- Scaled Decoupled Distillation
- Learning Transferable Negative Prompts for Out-of-Distribution Detection
- UnSAMFlow: Unsupervised Optical Flow Guided by Segment Anything Model
- Pixel-level Semantic Correspondence through Layout-aware Representation Learning and Multi-scale Matching Integration
- NeuRAD: Neural Rendering for Autonomous Driving
- Improving Distant 3D Object Detection Using 2D Box Supervision
- EventDance: Unsupervised Source-free Cross-modal Adaptation for Event-based Object Recognition
- CSTA: CNN-based Spatiotemporal Attention for Video Summarization
- Attention-Driven Training-Free Efficiency Enhancement of Diffusion Models
- Holodeck: Language Guided Generation of 3D Embodied AI Environments
- FlashEval: Towards Fast and Accurate Evaluation of Text-to-image Diffusion Generative Models
- MaxQ: Multi-Axis Query for N:M Sparsity Network
- Multiscale Vision Transformers Meet Bipartite Matching for Efficient Single-stage Action Localization
- RepViT: Revisiting Mobile CNN From ViT Perspective
- CLIP-KD: An Empirical Study of CLIP Model Distillation
- Modeling Collaborator: Enabling Subjective Vision Classification With Minimal Human Effort via LLM Tool-Use
- Weak-to-Strong 3D Object Detection with X-Ray Distillation
- ADA-Track: End-to-End Multi-Camera 3D Multi-Object Tracking with Alternating Detection and Association
- ProTeCt: Prompt Tuning for Taxonomic Open Set Classification
- Supervised Anomaly Detection for Complex Industrial Images
- From Isolated Islands to Pangea: Unifying Semantic Space for Human Action Understanding
- GOAT-Bench: A Benchmark for Multi-Modal Lifelong Navigation
- LoCoNet: Long-Short Context Network for Active Speaker Detection
- CMA: A Chromaticity Map Adapter for Robust Detection of Screen-Recapture Document Images
- OVER-NAV: Elevating Iterative Vision-and-Language Navigation with Open-Vocabulary Detection and StructurEd Representation
- LLMs are Good Action Recognizers
- LAA-Net: Localized Artifact Attention Network for Quality-Agnostic and Generalizable Deepfake Detection
- StreamingFlow: Streaming Occupancy Forecasting with Asynchronous Multi-modal Data Streams via Neural Ordinary Differential Equation
- SOAC: Spatio-Temporal Overlap-Aware Multi-Sensor Calibration using Neural Radiance Fields
- SingularTrajectory: Universal Trajectory Predictor Using Diffusion Model
- Is Ego Status All You Need for Open-Loop End-to-End Autonomous Driving?
- DualAD: Disentangling the Dynamic and Static World for End-to-End Driving
- Object Recognition as Next Token Prediction
- MULDE: Multiscale Log-Density Estimation via Denoising Score Matching for Video Anomaly Detection
- Enhancing the Power of OOD Detection via Sample-Aware Model Selection
- Bridging the Gap: A Unified Video Comprehension Framework for Moment Retrieval and Highlight Detection
- Generalized Large-Scale Data Condensation via Various Backbone and Statistical Matching
- Harnessing Large Language Models for Training-free Video Anomaly Detection
- MULTIFLOW: Shifting Towards Task-Agnostic Vision-Language Pruning
- Generalized Predictive Model for Autonomous Driving
- Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction Anticipation
- Extreme Point Supervised Instance Segmentation
- Video ReCap: Recursive Captioning of Hour-Long Videos
- SeMoLi: What Moves Together Belongs Together
- HIG: Hierarchical Interlacement Graph Approach to Scene Graph Generation in Video Understanding
- A Subspace-Constrained Tyler's Estimator and its Applications to Structure from Motion
- Adaptive Softassign via Hadamard-Equipped Sinkhorn
- Event Stream-based Visual Object Tracking: A High-Resolution Benchmark Dataset and A Novel Baseline
- Commonsense Prototype for Outdoor Unsupervised 3D Object Detection
- UniMix: Towards Domain Adaptive and Generalizable LiDAR Semantic Segmentation in Adverse Weather
- ICP-Flow: LiDAR Scene Flow Estimation with ICP
- NetTrack: Tracking Highly Dynamic Objects with a Net
- Dual Prototype Attention for Unsupervised Video Object Segmentation
- SHiNe: Semantic Hierarchy Nexus for Open-vocabulary Object Detection
- GLiDR: Topologically Regularized Graph Generative Network for Sparse LiDAR Point Clouds
- Preserving Fairness Generalization in Deepfake Detection
- TASeg: Temporal Aggregation Network for LiDAR Semantic Segmentation
- Active Object Detection with Knowledge Aggregation and Distillation from Large Models
- SnAG: Scalable and Accurate Video Grounding
- Optimal Transport Aggregation for Visual Place Recognition
- On the Estimation of Image-matching Uncertainty in Visual Place Recognition
- PanoOcc: Unified Occupancy Representation for Camera-based 3D Panoptic Segmentation
- CLIP-BEVFormer: Enhancing Multi-View Image-Based BEV Detector with Ground Truth Flow
- CrossKD: Cross-Head Knowledge Distillation for Object Detection
- SFOD: Spiking Fusion Object Detector
- Rapid Motor Adaptation for Robotic Manipulator Arms
- Temporally Consistent Unbalanced Optimal Transport for Unsupervised Action Segmentation
- Dense Vision Transformer Compression with Few Samples
- MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
- Depth-Aware Concealed Crop Detection in Dense Agricultural Scenes
- UniMODE: Unified Monocular 3D Object Detection
- Producing and Leveraging Online Map Uncertainty in Trajectory Prediction
- CAT: Exploiting Inter-Class Dynamics for Domain Adaptive Object Detection
- UniPAD: A Universal Pre-training Paradigm for Autonomous Driving
- Model Adaptation for Time Constrained Embodied Control
- Learning Group Activity Features Through Person Attribute Prediction
- Towards High-fidelity Artistic Image Vectorization via Texture-Encapsulated Shape Parameterization
- Single-Model and Any-Modality for Video Object Tracking
- View From Above: Orthogonal-View aware Cross-view Localization
- Pseudo Label Refinery for Unsupervised Domain Adaptation on Cross-dataset 3D Object Detection
- Resource-Efficient Transformer Pruning for Finetuning of Large Models
- A Generative Approach for Wikipedia-Scale Visual Entity Recognition
- Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval
- Video Harmonization with Triplet Spatio-Temporal Variation Patterns
- Detours for Navigating Instructional Videos
- A Backpack Full of Skills: Egocentric Video Understanding with Diverse Task Perspectives
- Autoregressive Queries for Adaptive Tracking with Spatio-Temporal Transformers
- Multi-View Attentive Contextualization for Multi-View 3D Object Detection
- Fusing Personal and Environmental Cues for Identification and Segmentation of First-Person Camera Wearers in Third-Person Views
- On Train-Test Class Overlap and Detection for Image Retrieval
- Exploring Region-Word Alignment in Built-in Detector for Open-Vocabulary Object Detection
- Riemannian Multinomial Logistics Regression for SPD Neural Networks
- SIRA: Scalable Inter-frame Relation and Association for Radar Perception
- PREGO: Online Mistake Detection in PRocedural EGOcentric Videos
- Dynamic Adapter Meets Prompt Tuning: Parameter-Efficient Transfer Learning for Point Cloud Analysis
- SPIN: Simultaneous Perception Interaction and Navigation
- MoST: Multi-Modality Scene Tokenization for Motion Prediction
- CAGE: Controllable Articulation GEneration
- Seeing the Unseen: Visual Common Sense for Semantic Placement
- GenNBV: Generalizable Next-Best-View Policy for Active 3D Reconstruction
- Continual Learning for Motion Prediction Model via Meta-Representation Learning and Optimal Memory Buffer Retention Strategy
- GRAM: Global Reasoning for Multi-Page VQA
- Learning Object State Changes in Videos: An Open-World Perspective
- vid-TLDR: Training Free Token Merging for Light-weight Video Transformer
- Learning for Transductive Threshold Calibration in Open-World Recognition
- Feedback-Guided Autonomous Driving
- BEVSpread: Spread Voxel Pooling for Bird’s-Eye-View Representation in Vision-based Roadside 3D Object Detection
- SkillDiffuser: Interpretable Hierarchical Planning via Skill Abstractions in Diffusion-Based Task Execution
- Adversarial Backdoor Attack by Naturalistic Data Poisoning on Trajectory Prediction in Autonomous Driving
- PikeLPN: Mitigating Overlooked Inefficiencies of Low-Precision Neural Networks
- RoHM: Robust Human Motion Reconstruction via Diffusion
- Multi-criteria Token Fusion with One-step-ahead Attention for Efficient Vision Transformers
- TIM: A Time Interval Machine for Audio-Visual Action Recognition
- DiffLoc: Diffusion Model for Outdoor LiDAR Localization
- Higher-order Relational Reasoning for Pedestrian Trajectory Prediction
- Prompt-Enhanced Multiple Instance Learning for Weakly Supervised Video Anomaly Detection
- Auto-Train-Once: Controller Network Guided Automatic Network Pruning from Scratch
- Transferable and Principled Efficiency for Open-Vocabulary Segmentation
- TransNeXt: Robust Foveal Visual Perception for Vision Transformers
- Generating Handwritten Mathematical Expressions From Symbol Graphs: An End-to-End Pipeline
- LLMs are Good Sign Language Translators
- Language-driven Grasp Detection
- Context-based and Diversity-driven Specificity in Compositional Zero-Shot Learning
- Dense Optical Tracking: Connecting the Dots
- Scaling Diffusion Models to Real-World 3D LiDAR Scene Completion
- Gradient Reweighting: Towards Imbalanced Class-Incremental Learning
- SceneFun3D: Fine-Grained Functionality and Affordance Understanding in 3D Scenes
- Part-aware Unified Representation of Language and Skeleton for Zero-shot Action Recognition
- Attribute-Guided Pedestrian Retrieval: Bridging Person Re-ID with Internal Attribute Variability
- From Coarse to Fine-Grained Open-Set Recognition
- SpiderMatch: 3D Shape Matching with Global Optimality and Geometric Consistency
- SPOC: Imitating Shortest Paths in Simulation Enables Effective Navigation and Manipulation in the Real World
- Learning Correlation Structures for Vision Transformers
- OneTracker: Unifying Visual Object Tracking with Foundation Models and Efficient Tuning
- YOLO-World: Real-Time Open-Vocabulary Object Detection
- Learning to Navigate Efficiently and Precisely in Real Environments
- Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving
- You'll Never Walk Alone: A Sketch and Text Duet for Fine-Grained Image Retrieval
- Matching Anything by Segmenting Anything
- Action-slot: Visual Action-centric Representations for Multi-label Atomic Activity Recognition in Traffic Scenes
- Low-power Continuous Remote Behavioral Localization with Event Cameras
- MGMap: Mask-Guided Learning for Online Vectorized HD Map Construction
- Action Detection via an Image Diffusion Process
- PaSCo: Urban 3D Panoptic Scene Completion with Uncertainty Awareness
- FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects
- Towards Realistic Scene Generation with LiDAR Diffusion Models
- TULIP: Transformer for Upsampling of LiDAR Point Clouds
- Towards Robust 3D Object Detection with LiDAR and 4D Radar Fusion in Various Weather Conditions
- Benchmarking the Robustness of Temporal Action Detection Models Against Temporal Corruptions
- Exploring the Potential of Large Foundation Models for Open-Vocabulary HOI Detection
- Task-Conditioned Adaptation of Visual Features in Multi-Task Policy Learning
- Hybrid Proposal Refiner: Revisiting DETR Series from the Faster R-CNN Perspective
- MTLoRA: Low-Rank Adaptation Approach for Efficient Multi-Task Learning
- PTQ4SAM: Post-Training Quantization for Segment Anything
- Characteristics Matching Based Hash Codes Generation for Efficient Fine-grained Image Retrieval
- Delving into the Trajectory Long-tail Distribution for Muti-object Tracking
- ALGM: Adaptive Local-then-Global Token Merging for Efficient Semantic Segmentation with Plain Vision Transformers
- Hyperbolic Anomaly Detection
- Implicit Motion Function
- Modeling Multimodal Social Interactions: New Challenges and Baselines with Densely Aligned Representations
- Promptable Behaviors: Personalizing Multi-Objective Rewards from Human Preferences
- Referring Expression Counting
- Improving Visual Recognition with Hyperbolical Visual Hierarchy Mapping
- ExtDM: Distribution Extrapolation Diffusion Model for Video Prediction
- Hierarchical Diffusion Policy for Kinematics-Aware Multi-Task Robotic Manipulation
- Contrastive Learning for DeepFake Classification and Localization via Multi-Label Ranking
- Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image Retrieval
- PeVL: Pose-Enhanced Vision-Language Model for Fine-Grained Human Action Recognition
- Diffusion-EDFs: Bi-equivariant Denoising Generative Modeling on SE(3) for Visual Robotic Manipulation
- DETRs Beat YOLOs on Real-time Object Detection
- MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception
- Evidential Active Recognition: Intelligent and Prudent Open-World Embodied Perception
- Sparse Global Matching for Video Frame Interpolation with Large Motion
- Context-Aware Integration of Language and Visual References for Natural Language Tracking
- Asymmetric Masked Distillation for Pre-Training Small Foundation Models
- PEM: Prototype-based Efficient MaskFormer for Image Segmentation
- Looking 3D: Anomaly Detection with 2D-3D Alignment
- Active Open-Vocabulary Recognition: Let Intelligent Moving Mitigate CLIP Limitations
- IS-Fusion: Instance-Scene Collaborative Fusion for Multimodal 3D Object Detection
- Low-Rank Rescaled Vision Transformer Fine-Tuning: A Residual Design Approach
- Depth-aware Test-Time Training for Zero-shot Video Object Segmentation
- Retrieval-Augmented Open-Vocabulary Object Detection
- OmniParser: A Unified Framework for Text Spotting Key Information Extraction and Table Recognition
- LiDAR-based Person Re-identification
- Instance-Aware Group Quantization for Vision Transformers
- BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation
- NeLF-Pro: Neural Light Field Probes for Multi-Scale Novel View Synthesis
- CORE-MPI: Consistency Object Removal with Embedding MultiPlane Image
- RCooper: A Real-world Large-scale Dataset for Roadside Cooperative Perception
- Unified Language-driven Zero-shot Domain Adaptation
- CausalPC: Improving the Robustness of Point Cloud Classification by Causal Effect Identification
- DiffuScene: Denoising Diffusion Models for Generative Indoor Scene Synthesis
- Improving Plasticity in Online Continual Learning via Collaborative Learning
- Domain-Rectifying Adapter for Cross-Domain Few-Shot Segmentation
- FutureHuman3D: Forecasting Complex Long-Term 3D Human Behavior from Video Observations
- MULAN: A Multi Layer Annotated Dataset for Controllable Text-to-Image Generation
- A Versatile Framework for Continual Test-Time Domain Adaptation: Balancing Discriminability and Generalizability
- Label Propagation for Zero-shot Classification with Vision-Language Models
- MorpheuS: Neural Dynamic 360° Surface Reconstruction from Monocular RGB-D Video
- CN-RMA: Combined Network with Ray Marching Aggregation for 3D Indoor Object Detection from Multi-view Images
- Spectrum AUC Difference (SAUCD): Human-aligned 3D Shape Evaluation
- VGGSfM: Visual Geometry Grounded Deep Structure From Motion
- Insect-Foundation: A Foundation Model and Large-scale 1M Dataset for Visual Insect Understanding
- NeRF Director: Revisiting View Selection in Neural Volume Rendering
- TutteNet: Injective 3D Deformations by Composition of 2D Mesh Deformations
- Unveiling the Unknown: Unleashing the Power of Unknown to Known in Open-Set Source-Free Domain Adaptation
- Multi-Scale 3D Gaussian Splatting for Anti-Aliased Rendering
- GPS-Gaussian: Generalizable Pixel-wise 3D Gaussian Splatting for Real-time Human Novel View Synthesis
- PKU-DyMVHumans: A Multi-View Video Benchmark for High-Fidelity Dynamic Human Modeling
- A Closer Look at the Few-Shot Adaptation of Large Vision-Language Models
- GauHuman: Articulated Gaussian Splatting from Monocular Human Videos
- FreeMan: Towards Benchmarking 3D Human Pose Estimation under Real-World Conditions
- Improving Generalized Zero-Shot Learning by Exploring the Diverse Semantics from External Class Names
- Compact 3D Gaussian Representation for Radiance Field
- Photo-SLAM: Real-time Simultaneous Localization and Photorealistic Mapping for Monocular Stereo and RGB-D Cameras
- Visual-Augmented Dynamic Semantic Prototype for Generative Zero-Shot Learning
- 4D Gaussian Splatting for Real-Time Dynamic Scene Rendering
- GEARS: Local Geometry-aware Hand-object Interaction Synthesis
- What How and When Should Object Detectors Update in Continually Changing Test Domains?
- Feature 3DGS: Supercharging 3D Gaussian Splatting to Enable Distilled Feature Fields
- Animatable Gaussians: Learning Pose-dependent Gaussian Maps for High-fidelity Human Avatar Modeling
- L2B: Learning to Bootstrap Robust Models for Combating Label Noise
- Real Acoustic Fields: An Audio-Visual Room Acoustics Dataset and Benchmark
- Brain Decodes Deep Nets
- Adaptive Slot Attention: Object Discovery with Dynamic Slot Number
- ESCAPE: Encoding Super-keypoints for Category-Agnostic Pose Estimation
- 4K4D: Real-Time 4D View Synthesis at 4K Resolution
- GraphDreamer: Compositional 3D Scene Synthesis from Scene Graphs
- PAPR in Motion: Seamless Point-level 3D Scene Interpolation
- A Bayesian Approach to OOD Robustness in Image Classification
- SurroundSDF: Implicit 3D Scene Understanding Based on Signed Distance Field
- Integrating Efficient Optimal Transport and Functional Maps For Unsupervised Shape Correspondence Learning
- VBench: Comprehensive Benchmark Suite for Video Generative Models
- MSU-4S - The Michigan State University Four Seasons Dataset
- ConCon-Chi: Concept-Context Chimera Benchmark for Personalized Vision-Language Tasks
- ColorPCR: Color Point Cloud Registration with Multi-Stage Geometric-Color Fusion
- NeRF-HuGS: Improved Neural Radiance Fields in Non-static Scenes Using Heuristics-Guided Segmentation
- Positive-Unlabeled Learning by Latent Group-Aware Meta Disambiguation
- NeISF: Neural Incident Stokes Field for Geometry and Material Estimation
- Split to Merge: Unifying Separated Modalities for Unsupervised Domain Adaptation
- GaussianAvatars: Photorealistic Head Avatars with Rigged 3D Gaussians
- Test-Time Linear Out-of-Distribution Detection
- Total-Decom: Decomposed 3D Scene Reconstruction with Minimal Interaction
- MVCPS-NeuS: Multi-view Constrained Photometric Stereo for Neural Surface Reconstruction
- GlitchBench: Can Large Multimodal Models Detect Video Game Glitches?
- TAMM: TriAdapter Multi-Modal Learning for 3D Shape Understanding
- ExtraNeRF: Visibility-Aware View Extrapolation of Neural Radiance Fields with Diffusion Models
- Gaussian Shadow Casting for Neural Characters
- GenZI: Zero-Shot 3D Human-Scene Interaction Generation
- TeMO: Towards Text-Driven 3D Stylization for Multi-Object Meshes
- Robust Synthetic-to-Real Transfer for Stereo Matching
- Unexplored Faces of Robustness and Out-of-Distribution: Covariate Shifts in Environment and Sensor Domains
- Evaluating Transferability in Retrieval Tasks: An Approach Using MMD and Kernel Methods
- Efficient Solution of Point-Line Absolute Pose
- What You See is What You GAN: Rendering Every Pixel for High-Fidelity Geometry in 3D GANs
- DYSON: Dynamic Feature Space Self-Organization for Online Task-Free Class Incremental Learning
- Improving Graph Contrastive Learning via Adaptive Positive Sampling
- DS-NeRV: Implicit Neural Video Representation with Decomposed Static and Dynamic Codes
- Pre-trained Vision and Language Transformers Are Few-Shot Incremental Learners
- Learning with Unreliability: Fast Few-shot Voxel Radiance Fields with Relative Geometric Consistency
- MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers
- InfLoRA: Interference-Free Low-Rank Adaptation for Continual Learning
- Neural Refinement for Absolute Pose Regression with Feature Synthesis
- How to Train Neural Field Representations: A Comprehensive Study and Benchmark
- Differentiable Neural Surface Refinement for Modeling Transparent Objects
- Multi-Level Neural Scene Graphs for Dynamic Urban Environments
- Gear-NeRF: Free-Viewpoint Rendering and Tracking with Motion-aware Spatio-Temporal Sampling
- Learning SO(3)-Invariant Semantic Correspondence via Local Shape Transform
- Spectral and Polarization Vision: Spectro-polarimetric Real-world Dataset
- TCP: Textual-based Class-aware Prompt tuning for Visual-Language Model
- MuRF: Multi-Baseline Radiance Fields
- A Unified Diffusion Framework for Scene-aware Human Motion Estimation from Sparse Signals
- Deformable 3D Gaussians for High-Fidelity Monocular Dynamic Scene Reconstruction
- Task-Adaptive Saliency Guidance for Exemplar-free Class Incremental Learning
- Hierarchical Correlation Clustering and Tree Preserving Embedding
- CuVLER: Enhanced Unsupervised Object Discoveries through Exhaustive Self-Supervised Transformers
- Neural Modes: Self-supervised Learning of Nonlinear Modal Subspaces
- GSNeRF: Generalizable Semantic Neural Radiance Fields with Enhanced 3D Scene Understanding
- HybridNeRF: Efficient Neural Rendering via Adaptive Volumetric Surfaces
- LangSplat: 3D Language Gaussian Splatting
- Source-Free Domain Adaptation with Frozen Multimodal Foundation Model
- Distributionally Generative Augmentation for Fair Facial Attribute Classification
- CG-HOI: Contact-Guided 3D Human-Object Interaction Generation
- MAP: MAsk-Pruning for Source-Free Model Intellectual Property Protection
- MLP Can Be A Good Transformer Learner
- Traffic Scene Parsing through the TSP6K Dataset
- Entity-NeRF: Detecting and Removing Moving Entities in Urban Scenes
- HiFi4G: High-Fidelity Human Performance Rendering via Compact Gaussian Splatting
- Semantically-Shifted Incremental Adapter-Tuning is A Continual ViTransformer
- FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models
- Symphonize 3D Semantic Scene Completion with Contextual Instance Queries
- DeMatch: Deep Decomposition of Motion Field for Two-View Correspondence Learning
- Text-Enhanced Data-free Approach for Federated Class-Incremental Learning
- UnScene3D: Unsupervised 3D Instance Segmentation for Indoor Scenes
- Multiway Point Cloud Mosaicking with Diffusion and Global Optimization
- Learning Equi-angular Representations for Online Continual Learning
- Text-to-3D using Gaussian Splatting
- MVHumanNet: A Large-scale Dataset of Multi-view Daily Dressing Human Captures
- Convolutional Prompting meets Language Models for Continual Learning
- Pre-training Vision Models with Mandelbulb Variations
- CoGS: Controllable Gaussian Splatting
- A2XP: Towards Private Domain Generalization
- Progressive Semantic-Guided Vision Transformer for Zero-Shot Learning
- Universal Novelty Detection Through Adaptive Contrastive Learning
- OmniSDF: Scene Reconstruction using Omnidirectional Signed Distance Functions and Adaptive Binoctrees
- GS-IR: 3D Gaussian Splatting for Inverse Rendering
- MTMMC: A Large-Scale Real-World Multi-Modal Camera Tracking Benchmark
- Deep Generative Model based Rate-Distortion for Image Downscaling Assessment
- Mitigating Object Dependencies: Improving Point Cloud Self-Supervised Learning through Object Exchange
- EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World
- OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM
- Disentangled Prompt Representation for Domain Generalization
- Deep Imbalanced Regression via Hierarchical Classification Adjustment
- StraightPCF: Straight Point Cloud Filtering
- Classes Are Not Equal: An Empirical Study on Image Recognition Fairness
- HouseCat6D - A Large-Scale Multi-Modal Category Level 6D Object Perception Dataset with Household Objects in Realistic Scenarios
- A Noisy Elephant in the Room: Is Your Out-of-Distribution Detector Robust to Label Noise?
- ImageNet-D: Benchmarking Neural Network Robustness on Diffusion Synthetic Object
- Perceptual Assessment and Optimization of HDR Image Rendering
- Improving Depth Completion via Depth Feature Upsampling
- eTraM: Event-based Traffic Monitoring Dataset
- CroSel: Cross Selection of Confident Pseudo Labels for Partial-Label Learning
- GAFusion: Adaptive Fusing LiDAR and Camera with Multiple Guidance for 3D Object Detection
- Bayes' Rays: Uncertainty Quantification for Neural Radiance Fields
- Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives
- Absolute Pose from One or Two Scaled and Oriented Features
- LoS: Local Structure-Guided Stereo Matching
- TACO: Benchmarking Generalizable Bimanual Tool-ACtion-Object Understanding
- Instantaneous Perception of Moving Objects in 3D
- HPL-ESS: Hybrid Pseudo-Labeling for Unsupervised Event-based Semantic Segmentation
- Dr.Hair: Reconstructing Scalp-Connected Hair Strands without Pre-Training via Differentiable Rendering of Line Segments
- ReCoRe: Regularized Contrastive Representation Learning of World Model
- Insights from the Use of Previously Unseen Neural Architecture Search Datasets
- GaussianEditor: Editing 3D Gaussians Delicately with Text Instructions
- Masked Spatial Propagation Network for Sparsity-Adaptive Depth Refinement
- GARField: Group Anything with Radiance Fields
- IPoD: Implicit Field Learning with Point Diffusion for Generalizable 3D Object Reconstruction from Single RGB-D Images
- DUSt3R: Geometric 3D Vision Made Easy
- GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation
- SpecNeRF: Gaussian Directional Encoding for Specular Reflections
- LEAP-VO: Long-term Effective Any Point Tracking for Visual Odometry
- EFHQ: Multi-purpose ExtremePose-Face-HQ dataset
- Learning to Rank Patches for Unbiased Image Redundancy Reduction
- MatSynth: A Modern PBR Materials Dataset
- Non-Rigid Structure-from-Motion: Temporally-Smooth Procrustean Alignment and Spatially-Variant Deformation Modeling
- DrivingGaussian: Composite Gaussian Splatting for Surrounding Dynamic Autonomous Driving Scenes
- HUGS: Holistic Urban 3D Scene Understanding via Gaussian Splatting
- Dynamic Cues-Assisted Transformer for Robust Point Cloud Registration
- EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI
- GART: Gaussian Articulated Template Models
- Low-Resource Vision Challenges for Foundation Models
- Unsigned Orthogonal Distance Fields: An Accurate Neural Implicit Representation for Diverse 3D Shapes
- Loopy-SLAM: Dense Neural SLAM with Loop Closures
- FAR: Flexible Accurate and Robust 6DoF Relative Camera Pose Estimation
- NeRFiller: Completing Scenes via Generative 3D Inpainting
- 3D Neural Edge Reconstruction
- TUMTraf V2X Cooperative Perception Dataset
- 3DInAction: Understanding Human Actions in 3D Point Clouds
- S-DyRF: Reference-Based Stylized Radiance Fields for Dynamic Scenes
- Alpha Invariance: On Inverse Scaling Between Distance and Volume Density in Neural Radiance Fields
- pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction
- AlignMiF: Geometry-Aligned Multimodal Implicit Field for LiDAR-Camera Joint Synthesis
- Real-World Mobile Image Denoising Dataset with Efficient Baselines
- LidaRF: Delving into Lidar for Neural Radiance Field on Street Scenes
- LiDAR-Net: A Real-scanned 3D Point Cloud Dataset for Indoor Scenes
- Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding
- Rich Human Feedback for Text-to-Image Generation
- FISBe: A Real-World Benchmark Dataset for Instance Segmentation of Long-Range Thin Filamentous Structures
- MMVP: A Multimodal MoCap Dataset with Vision and Pressure Sensors
- Adapters Strike Back
- 360Loc: A Dataset and Benchmark for Omnidirectional Visual Localization with Cross-device Queries
- SplaTAM: Splat Track & Map 3D Gaussians for Dense RGB-D SLAM
- Open-Set Domain Adaptation for Semantic Segmentation
- 360+x: A Panoptic Multi-modal Scene Understanding Dataset
- Improving Semantic Correspondence with Viewpoint-Guided Spherical Maps
- Benchmarking Implicit Neural Representation and Geometric Rendering in Real-Time RGB-D SLAM
- The Devil is in the Fine-Grained Details: Evaluating Open-Vocabulary Object Detectors for Fine-Grained Understanding
- Neural Directional Encoding for Efficient and Accurate View-Dependent Appearance Modeling
- GES: Generalized Exponential Splatting for Efficient Radiance Field Rendering
- Pose-Guided Self-Training with Two-Stage Clustering for Unsupervised Landmark Discovery
- Learning to Produce Semi-dense Correspondences for Visual Localization
- DiSR-NeRF: Diffusion-Guided View-Consistent Super-Resolution NeRF
- D3T: Distinctive Dual-Domain Teacher Zigzagging Across RGB-Thermal Gap for Domain-Adaptive Object Detection
- DAVE - A Detect-and-Verify Paradigm for Low-Shot Counting
- Enhancing Visual Continual Learning with Language-Guided Supervision
- Orchestrate Latent Expertise: Advancing Online Continual Learning with Multi-Level Supervision and Reverse Self-Distillation
- Robust Depth Enhancement via Polarization Prompt Fusion Tuning
- NEAT: Distilling 3D Wireframes from Neural Attraction Fields
- Instance Tracking in 3D Scenes from Egocentric Videos
- NARUTO: Neural Active Reconstruction from Uncertain Target Observations
- NeRFCodec: Neural Feature Compression Meets Neural Radiance Fields for Memory-Efficient Scene Representation
- SHINOBI: Shape and Illumination using Neural Object Decomposition via BRDF Optimization In-the-wild
- Localization Is All You Evaluate: Data Leakage in Online Mapping Datasets and How to Fix It
- COLMAP-Free 3D Gaussian Splatting
- Visual Prompting for Generalized Few-shot Segmentation: A Multi-scale Approach
- Learning from Synthetic Human Group Activities
- CDMAD: Class-Distribution-Mismatch-Aware Debiasing for Class-Imbalanced Semi-Supervised Learning
- Mip-Splatting: Alias-free 3D Gaussian Splatting
- Outdoor Scene Extrapolation with Hierarchical Generative Cellular Automata
- PI3D: Efficient Text-to-3D Generation with Pseudo-Image Diffusion
- Map-Relative Pose Regression for Visual Re-Localization
- JRDB-Social: A Multifaceted Robotic Dataset for Understanding of Context and Dynamics of Human Interactions Within Social Groups
- OneFormer3D: One Transformer for Unified Point Cloud Segmentation
- WinSyn: A High Resolution Testbed for Synthetic Data
- DiVa-360: The Dynamic Visual Dataset for Immersive Neural Fields
- DNGaussian: Optimizing Sparse-View 3D Gaussian Radiance Fields with Global-Local Depth Normalization
- SubT-MRS Dataset: Pushing SLAM Towards All-weather Environments
- NICE: Neurogenesis Inspired Contextual Encoding for Replay-free Class Incremental Learning
- Can Biases in ImageNet Models Explain Generalization?
- Towards Generalizing to Unseen Domains with Few Labels
- A Simple Recipe for Language-guided Domain Generalized Segmentation
- Unsupervised Learning of Category-Level 3D Pose from Object-Centric Videos
- RoMa: Robust Dense Feature Matching
- MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
- Three Pillars Improving Vision Foundation Model Distillation for Lidar
- OpenStreetView-5M: The Many Roads to Global Visual Geolocation
- Universal Semi-Supervised Domain Adaptation by Mitigating Common-Class Bias
- Real-time Acquisition and Reconstruction of Dynamic Volumes with Neural Structured Illumination
- Improving the Generalization of Segmentation Foundation Model under Distribution Shift via Weakly Supervised Adaptation
- Edge-Aware 3D Instance Segmentation Network with Intelligent Semantic Prior
- Density-guided Translator Boosts Synthetic-to-Real Unsupervised Domain Adaptive Segmentation of 3D Point Clouds
- Systematic Comparison of Semi-supervised and Self-supervised Learning for Medical Image Classification
- NAYER: Noisy Layer Data Generation for Efficient and Effective Data-free Knowledge Distillation
- Small Steps and Level Sets: Fitting Neural Surface Models with Point Guidance
- Federated Online Adaptation for Deep Stereo
- Dynamic LiDAR Re-simulation using Compositional Neural Fields
- Motion2VecSets: 4D Latent Vector Set Diffusion for Non-rigid Shape Reconstruction and Tracking
- XScale-NVS: Cross-Scale Novel View Synthesis with Hash Featurized Manifold
- Backpropagation-free Network for 3D Test-time Adaptation
- From SAM to CAMs: Exploring Segment Anything Model for Weakly Supervised Semantic Segmentation
- Solving Masked Jigsaw Puzzles with Diffusion Vision Transformers
- Aligning Logits Generatively for Principled Black-Box Knowledge Distillation
- TULIP: Multi-camera 3D Precision Assessment of Parkinson’s Disease
- Domain-Specific Block Selection and Paired-View Pseudo-Labeling for Online Test-Time Adaptation
- Grounding and Enhancing Grid-based Models for Neural Fields
- GaussianEditor: Swift and Controllable 3D Editing with Gaussian Splatting
- FreGS: 3D Gaussian Splatting with Progressive Frequency Regularization
- SceneTex: High-Quality Texture Synthesis for Indoor Scenes via Diffusion Priors
- Fully Geometric Panoramic Localization
- How Far Can We Compress Instant-NGP-Based NeRF?
- Object Dynamics Modeling with Hierarchical Point Cloud-based Representations
- GLACE: Global Local Accelerated Coordinate Encoding
- Shadows Don't Lie and Lines Can't Bend! Generative Models don't know Projective Geometry...for now
- S2MAE: A Spatial-Spectral Pretraining Foundation Model for Spectral Remote Sensing Data
- ParamISP: Learned Forward and Inverse ISPs using Camera Parameters
- LLM4SGG: Large Language Models for Weakly Supervised Scene Graph Generation
- Osprey: Pixel Understanding with Visual Instruction Tuning
- LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge
- SatSynth: Augmenting Image-Mask Pairs through Diffusion Models for Aerial Semantic Segmentation
- Stronger Fewer & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation
- SHAP-EDITOR: Instruction-Guided Latent 3D Editing in Seconds
- Long-Tail Class Incremental Learning via Independent Sub-prototype Construction
- NAPGuard: Towards Detecting Naturalistic Adversarial Patches
- AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation
- Active Prompt Learning in Vision Language Models
- Probabilistic Speech-Driven 3D Facial Motion Synthesis: New Benchmarks Methods and Applications
- Partial-to-Partial Shape Matching with Geometric Consistency
- Boosting Adversarial Training via Fisher-Rao Norm-based Regularization
- OrCo: Towards Better Generalization via Orthogonality and Contrast for Few-Shot Class-Incremental Learning
- ECoDepth: Effective Conditioning of Diffusion Models for Monocular Depth Estimation
- MoDE: CLIP Data Experts via Clustering
- Not All Prompts Are Secure: A Switchable Backdoor Attack Against Pre-trained Vision Transfomers
- Multi-modal Learning for Geospatial Vegetation Forecasting
- Image Restoration by Denoising Diffusion Models with Iteratively Preconditioned Guidance
- Learned Lossless Image Compression based on Bit Plane Slicing
- VILA: On Pre-training for Visual Language Models
- Fine-Grained Bipartite Concept Factorization for Clustering
- Linguistic-Aware Patch Slimming Framework for Fine-grained Cross-Modal Alignment
- Noisy-Correspondence Learning for Text-to-Image Person Re-identification
- FlowVQTalker: High-Quality Emotional Talking Face Generation through Normalizing Flow and Quantization
- Neural Spline Fields for Burst Image Fusion and Layer Separation
- Data-Efficient Multimodal Fusion on a Single GPU
- Single Domain Generalization for Crowd Counting
- Efficient Hyperparameter Optimization with Adaptive Fidelity Identification
- Draw Step by Step: Reconstructing CAD Construction Sequences from Point Clouds via Multimodal Diffusion
- CurveCloudNet: Processing Point Clouds with 1D Structure
- Unraveling Instance Associations: A Closer Look for Audio-Visual Segmentation
- Unmixing Diffusion for Self-Supervised Hyperspectral Image Denoising
- KVQ: Kwai Video Quality Assessment for Short-form Videos
- MaskClustering: View Consensus based Mask Graph Clustering for Open-Vocabulary 3D Instance Segmentation
- DiffAssemble: A Unified Graph-Diffusion Model for 2D and 3D Reassembly
- Text-IF: Leveraging Semantic Text Guidance for Degradation-Aware and Interactive Image Fusion
- MAFA: Managing False Negatives for Vision-Language Pre-training
- Intriguing Properties of Diffusion Models: An Empirical Study of the Natural Attack Capability in Text-to-Image Generative Models
- Content-Adaptive Non-Local Convolution for Remote Sensing Pansharpening
- DiffSCI: Zero-Shot Snapshot Compressive Imaging via Iterative Spectral Diffusion Model
- Bring Event into RGB and LiDAR: Hierarchical Visual-Motion Fusion for Scene Flow
- VCoder: Versatile Vision Encoders for Multimodal Large Language Models
- CFAT: Unleashing Triangular Windows for Image Super-resolution
- PanoContext-Former: Panoramic Total Scene Understanding with a Transformer
- UniBind: LLM-Augmented Unified and Balanced Representation Space to Bind Them All
- Living Scenes: Multi-object Relocalization and Reconstruction in Changing 3D Environments
- One Prompt Word is Enough to Boost Adversarial Robustness for Pre-trained Vision-Language Models
- Latent Modulated Function for Computational Optimal Continuous Image Representation
- SPECAT: SPatial-spEctral Cumulative-Attention Transformer for High-Resolution Hyperspectral Image Reconstruction
- OmniVec2 - A Novel Transformer based Network for Large Scale Multimodal and Multitask Learning
- SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery
- LEAD: Exploring Logit Space Evolution for Model Selection
- Distilling Semantic Priors from SAM to Efficient Image Restoration Models
- Text-Guided Variational Image Generation for Industrial Anomaly Detection and Segmentation
- Seeing Motion at Nighttime with an Event Camera
- Calibrating Multi-modal Representations: A Pursuit of Group Robustness without Annotations
- Deep Video Inverse Tone Mapping Based on Temporal Clues
- ID-Blau: Image Deblurring by Implicit Diffusion-based reBLurring AUgmentation
- Beyond Average: Individualized Visual Scanpath Prediction
- Initialization Matters for Adversarial Transfer Learning
- Active Domain Adaptation with False Negative Prediction for Object Detection
- Motion-adaptive Separable Collaborative Filters for Blind Motion Deblurring
- Accept the Modality Gap: An Exploration in the Hyperbolic Space
- Projecting Trackable Thermal Patterns for Dynamic Computer Vision
- Robust Distillation via Untargeted and Targeted Intermediate Adversarial Samples
- Improving Generalization via Meta-Learning on Hard Samples
- Towards Robust Learning to Optimize with Theoretical Guarantees
- Learning to Remove Wrinkled Transparent Film with Polarized Prior
- PeerAiD: Improving Adversarial Distillation from a Specialized Peer Tutor
- Spanning Training Progress: Temporal Dual-Depth Scoring (TDDS) for Enhanced Dataset Pruning
- EVS-assisted Joint Deblurring Rolling-Shutter Correction and Video Frame Interpolation through Sensor Inverse Modeling
- WaveMo: Learning Wavefront Modulations to See Through Scattering
- A Physics-informed Low-rank Deep Neural Network for Blind and Universal Lens Aberration Correction
- UFC-Net: Unrolling Fixed-point Continuous Network for Deep Compressive Sensing
- SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos
- SG-PGM: Partial Graph Matching Network with Semantic Geometric Fusion for 3D Scene Graph Alignment and Its Downstream Tasks
- SeNM-VAE: Semi-Supervised Noise Modeling with Hierarchical Variational Autoencoder
- Learnable Earth Parser: Discovering 3D Prototypes in Aerial Scans
- Analyzing and Improving the Training Dynamics of Diffusion Models
- MuGE: Multiple Granularity Edge Detection
- Coherence As Texture – Passive Textureless 3D Reconstruction by Self-interference
- Audio-Visual Segmentation via Unlabeled Frame Exploitation
- Understanding and Improving Source-free Domain Adaptation from a Theoretical Perspective
- SwitchLight: Co-design of Physics-driven Architecture and Pre-training Framework for Human Portrait Relighting
- Large Language Models are Good Prompt Learners for Low-Shot Image Classification
- Diffeomorphic Template Registration for Atmospheric Turbulence Mitigation
- PixelRNN: In-pixel Recurrent Neural Networks for End-to-end–optimized Perception with Neural Sensors
- PAD: Patch-Agnostic Defense against Adversarial Patch Attacks
- No More Ambiguity in 360° Room Layout via Bi-Layout Estimation
- Efficient Model Stealing Defense with Noise Transition Matrix
- ArGue: Attribute-Guided Prompt Tuning for Vision-Language Models
- DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection
- Regressor-Segmenter Mutual Prompt Learning for Crowd Counting
- Going Beyond Multi-Task Dense Prediction with Synergy Embedding Models
- EGTR: Extracting Graph from Transformer for Scene Graph Generation
- Federated Generalized Category Discovery
- X-MIC: Cross-Modal Instance Conditioning for Egocentric Action Generalization
- Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision Language Audio and Action
- Dynamic Prompt Optimizing for Text-to-Image Generation
- DMR: Decomposed Multi-Modality Representations for Frames and Events Fusion in Visual Reinforcement Learning
- Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models
- Close Imitation of Expert Retouching for Black-and-White Photography
- Attack To Defend: Exploiting Adversarial Attacks for Detecting Poisoned Models
- SAM-6D: Segment Anything Model Meets Zero-Shot 6D Object Pose Estimation
- Transfer CLIP for Generalizable Image Denoising
- Revisiting Adversarial Training at Scale
- Defense Against Adversarial Attacks on No-Reference Image Quality Models with Gradient Norm Regularization
- HUNTER: Unsupervised Human-centric 3D Detection via Transferring Knowledge from Synthetic Instances to Real Scenes
- A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition
- G-FARS: Gradient-Field-based Auto-Regressive Sampling for 3D Part Grouping
- AdaRevD: Adaptive Patch Exiting Reversible Decoder Pushes the Limit of Image Deblurring
- 1-Lipschitz Layers Compared: Memory Speed and Certifiable Robustness
- Probabilistic Sampling of Balanced K-Means using Adiabatic Quantum Computing
- Decompose-and-Compose: A Compositional Approach to Mitigating Spurious Correlation
- Parameter Efficient Self-Supervised Geospatial Domain Adaptation
- Empowering Resampling Operation for Ultra-High-Definition Image Enhancement with Model-Aware Guidance
- Deep-TROJ: An Inference Stage Trojan Insertion Algorithm through Efficient Weight Replacement Attack
- LDP: Language-driven Dual-Pixel Image Defocus Deblurring Network
- Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos
- Task-Driven Wavelets using Constrained Empirical Risk Minimization
- Bilateral Adaptation for Human-Object Interaction Detection with Occlusion-Robustness
- FedMef: Towards Memory-efficient Federated Dynamic Pruning
- Dual-Scale Transformer for Large-Scale Single-Pixel Imaging
- Multi-Task Dense Prediction via Mixture of Low-Rank Experts
- Boosting Flow-based Generative Super-Resolution Models via Learned Prior
- Relational Matching for Weakly Semi-Supervised Oriented Object Detection
- Semantics Distortion and Style Matter: Towards Source-free UDA for Panoramic Segmentation
- Learning Inclusion Matching for Animation Paint Bucket Colorization
- PerceptionGPT: Effectively Fusing Visual Perception into LLM
- Single View Refractive Index Tomography with Neural Fields
- Hallucination Augmented Contrastive Learning for Multimodal Large Language Model
- OED: Towards One-stage End-to-End Dynamic Scene Graph Generation
- Improving Transferable Targeted Adversarial Attacks with Model Self-Enhancement
- SeD: Semantic-Aware Discriminator for Image Super-Resolution
- CLIP-Driven Open-Vocabulary 3D Scene Graph Generation via Cross-Modality Contrastive Learning
- LOTUS: Evasive and Resilient Backdoor Attacks through Sub-Partitioning
- From Variance to Veracity: Unbundling and Mitigating Gradient Variance in Differentiable Bundle Adjustment Layers
- Snapshot Lidar: Fourier Embedding of Amplitude and Phase for Single-Image Depth Reconstruction
- CoSeR: Bridging Image and Language for Cognitive Super-Resolution
- Discovering Syntactic Interaction Clues for Human-Object Interaction Detection
- Instance-based Max-margin for Practical Few-shot Recognition
- Are Conventional SNNs Really Efficient? A Perspective from Network Quantization
- Time-Efficient Light-Field Acquisition Using Coded Aperture and Events
- Learning CNN on ViT: A Hybrid Model to Explicitly Class-specific Boundaries for Domain Adaptation
- Generative Image Dynamics
- Towards Calibrated Multi-label Deep Neural Networks
- Amodal Ground Truth and Completion in the Wild
- Language-driven All-in-one Adverse Weather Removal
- AETTA: Label-Free Accuracy Estimation for Test-Time Adaptation
- MRFS: Mutually Reinforcing Image Fusion and Segmentation
- Navigating Beyond Dropout: An Intriguing Solution towards Generalizable Image Super Resolution
- Disentangled Pre-training for Human-Object Interaction Detection
- TurboSL: Dense Accurate and Fast 3D by Neural Inverse Structured Light
- T-VSL: Text-Guided Visual Sound Source Localization in Mixtures
- DiffuseMix: Label-Preserving Data Augmentation with Diffusion Models
- Coherent Temporal Synthesis for Incremental Action Segmentation
- SinSR: Diffusion-Based Image Super-Resolution in a Single Step
- SD4Match: Learning to Prompt Stable Diffusion Model for Semantic Matching
- Physical Property Understanding from Language-Embedded Feature Fields
- Gradient-based Parameter Selection for Efficient Fine-Tuning
- Language-guided Image Reflection Separation
- Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models
- Dispel Darkness for Better Fusion: A Controllable Visual Enhancer based on Cross-modal Conditional Adversarial Learning
- MoPE-CLIP: Structured Pruning for Efficient Vision-Language Models with Module-wise Pruning Error Metric
- Towards General Robustness Verification of MaxPool-based Convolutional Neural Networks via Tightening Linear Approximation
- DAP: A Dynamic Adversarial Patch for Evading Person Detectors
- Revamping Federated Learning Security from a Defender's Perspective: A Unified Defense with Homomorphic Encrypted Data Space
- Data Poisoning based Backdoor Attacks to Contrastive Learning
- NoiseCLR: A Contrastive Learning Approach for Unsupervised Discovery of Interpretable Directions in Diffusion Models
- Overload: Latency Attacks on Object Detection for Edge Devices
- MimicDiffusion: Purifying Adversarial Perturbation via Mimicking Clean Diffusion Model
- Turb-Seg-Res: A Segment-then-Restore Pipeline for Dynamic Videos with Atmospheric Turbulence
- Distraction is All You Need: Memory-Efficient Image Immunization against Diffusion-Based Image Editing
- Task2Box: Box Embeddings for Modeling Asymmetric Task Relationships
- DART: Implicit Doppler Tomography for Radar Novel View Synthesis
- Generative Multi-modal Models are Good Class Incremental Learners
- LAN: Learning to Adapt Noise for Image Denoising
- Semantic Shield: Defending Vision-Language Models Against Backdooring and Poisoning via Fine-grained Knowledge Alignment
- MCNet: Rethinking the Core Ingredients for Accurate and Efficient Homography Estimation
- MonoHair: High-Fidelity Hair Modeling from a Monocular Video
- Real-World Efficient Blind Motion Deblurring via Blur Pixel Discretization
- Unsupervised Blind Image Deblurring Based on Self-Enhancement
- Simple Semantic-Aided Few-Shot Learning
- Progressive Divide-and-Conquer via Subsampling Decomposition for Accelerated MRI
- APISR: Anime Production Inspired Real-World Anime Super-Resolution
- Towards Backward-Compatible Continual Learning of Image Compression
- Text Grouping Adapter: Adapting Pre-trained Text Detector for Layout Analysis
- AV-RIR: Audio-Visual Room Impulse Response Estimation
- Cyclic Learning for Binaural Audio Generation and Localization
- Mirasol3B: A Multimodal Autoregressive Model for Time-Aligned and Contextual Modalities
- Equivariant Plug-and-Play Image Reconstruction
- SonicVisionLM: Playing Sound with Vision Language Models
- Geometrically-driven Aggregation for Zero-shot 3D Point Cloud Understanding
- DiPrompT: Disentangled Prompt Tuning for Multiple Latent Domain Generalization in Federated Learning
- Rethinking Transformers Pre-training for Multi-Spectral Satellite Imagery
- Look-Up Table Compression for Efficient Image Restoration
- From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models
- Unbiased Faster R-CNN for Single-source Domain Generalized Object Detection
- AVFF: Audio-Visual Feature Fusion for Video Deepfake Detection
- Looking Similar Sounding Different: Leveraging Counterfactual Cross-Modal Pairs for Audiovisual Representation Learning
- MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models
- An Aggregation-Free Federated Learning for Tackling Data Heterogeneity
- Soften to Defend: Towards Adversarial Robustness via Self-Guided Label Refinement
- Pre-trained Model Guided Fine-Tuning for Zero-Shot Adversarial Robustness
- Driving-Video Dehazing with Non-Aligned Regularization for Safety Assistance
- Diff-BGM: A Diffusion Model for Video Background Music Generation
- HiKER-SGG: Hierarchical Knowledge Enhanced Robust Scene Graph Generation
- OneLLM: One Framework to Align All Modalities with Language
- Latency Correction for Event-guided Deblurring and Frame Interpolation
- Resurrecting Old Classes with New Data for Exemplar-Free Continual Learning
- Online Task-Free Continual Generative and Discriminative Learning via Dynamic Cluster Memory
- Self-Adaptive Reality-Guided Diffusion for Artifact-Free Super-Resolution
- Attentive Illumination Decomposition Model for Multi-Illuminant White Balancing
- SDDGR: Stable Diffusion-based Deep Generative Replay for Class Incremental Object Detection
- Transferable Structural Sparse Adversarial Attack Via Exact Group Sparsity Training
- CAMixerSR: Only Details Need More "Attention"
- Flattening the Parent Bias: Hierarchical Semantic Segmentation in the Poincaré Ball
- OMG-Seg: Is One Model Good Enough For All Segmentation?
- Open-Vocabulary Semantic Segmentation with Image Embedding Balancing
- SlowFormer: Adversarial Attack on Compute and Energy Consumption of Efficient Vision Transformers
- Groupwise Query Specialization and Quality-Aware Multi-Assignment for Transformer-based Visual Relationship Detection
- Unsupervised Deep Unrolling Networks for Phase Unwrapping
- Revisiting Adversarial Training Under Long-Tailed Distributions
- Dispersed Structured Light for Hyperspectral 3D Imaging
- Learning with Structural Labels for Learning with Noisy Labels
- CPP-Net: Embracing Multi-Scale Feature Fusion into Deep Unfolding CP-PPA Network for Compressive Sensing
- MetaCloak: Preventing Unauthorized Subject-driven Text-to-image Diffusion-based Synthesis via Meta-learning
- NB-GTR: Narrow-Band Guided Turbulence Removal
- Prompt3D: Random Prompt Assisted Weakly-Supervised 3D Object Detection
- Prompt Learning via Meta-Regularization
- Byzantine-robust Decentralized Federated Learning via Dual-domain Clustering and Trust Bootstrapping
- Scalable 3D Registration via Truncated Entry-wise Absolute Residuals
- On the Robustness of Large Multimodal Models Against Image Adversarial Attacks
- Discriminability-Driven Channel Selection for Out-of-Distribution Detection
- View-Category Interactive Sharing Transformer for Incomplete Multi-View Multi-Label Learning
- CoDi-2: In-Context Interleaved and Interactive Any-to-Any Generation
- Text-guided Explorable Image Super-resolution
- Towards HDR and HFR Video from Rolling-Mixed-Bit Spikings
- Adversarially Robust Few-shot Learning via Parameter Co-distillation of Similarity and Class Concept Learners
- DIMAT: Decentralized Iterative Merging-And-Training for Deep Learning Models
- In2SET: Intra-Inter Similarity Exploiting Transformer for Dual-Camera Compressive Hyperspectral Imaging
- Beyond Text: Frozen Large Language Models in Visual Signal Comprehension
- SPIDeRS: Structured Polarization for Invisible Depth and Reflectance Sensing
- Watermark-embedded Adversarial Examples for Copyright Protection against Diffusion Models
- GoodSAM: Bridging Domain and Capacity Gaps via Segment Anything Model for Distortion-aware Panoramic Semantic Segmentation
- Passive Snapshot Coded Aperture Dual-Pixel RGB-D Imaging
- Generalized Event Cameras
- Event-based Visible and Infrared Fusion via Multi-task Collaboration
- Alchemist: Parametric Control of Material Properties with Diffusion Models
Receptions
Remarks
Tutorials
- Deep Stereo Matching in the Twenties
- Disentanglement and Compositionality in Computer Vision
- Machine Unlearning in Computer Vision: Foundations and Applications
- SCENIC: An Open-Source Probabilistic Programming System for Data Generation and Safety in AI-Based Autonomy
- Recent Advances in Vision Foundation Models
- Object-centric Representations in Computer Vision
- Robustness at Inference: Towards Explainability, Uncertainty, and Intervenability
- Efficient Homotopy Continuation for Solving Polynomial Systems in Computer Vision Applications
- Geospatial Computer Vision and Machine Learning for Large-Scale Earth Observation Data
- Edge AI in Action: Practical Approaches to Developing and Deploying Optimized Models
- Edge-Optimized Deep Learning: Harnessing Generative AI and Computer Vision with Open-Source Libraries
- 3D/4D Generation and Modeling with Generative Priors
- Contactless AI Healthcare using Cameras and Wireless Sensors
- Computational Design of Diverse Morphologies and Sensors for Vision and Robotics
- Learning Deep Low-dimensional Models from High-Dimensional Data: From Theory to Practice
- All You Need To Know About Point Cloud Understanding
- All You Need to Know about Self-Driving
- Towards Building AGI in Autonomy and Robotics
- End-to-End Autonomy: A New Era of Self-Driving
- From Multimodal LLM to Human-level AI: Modality, Instruction, Reasoning and Beyond
- Full-Stack, GPU-based Acceleration of Deep Learning
- Diffusion-based Video Generative Models
- Unifying Graph Neural Networks across Spatial and Spectral Domains
Workshops
- Computer Vision for Mixed Reality
- Domain adaptation, Explainability and Fairness in AI for Medical Image Analysis (DEF-AI-MIA)
- Efficient Large Vision Models
- 8th AI City Challenge
- Multimodal Algorithmic Reasoning Workshop
- SyntaGen: Harnessing Generative Models for Synthetic Visual Datasets
- The 5th Face Anti-Spoofing Workshop
- The 7th Workshop and Challenge Bridging the Gap between Computational Photography and Visual Recognition (UG2+)
- 4th Workshop on CV4Animals: Computer Vision for Animal Behavior Tracking and Modeling
- The Fifth Workshop on Fair, Data-efficient, and Trusted Computer Vision
- 2nd Workshop on Multimodal Content Moderation
- MetaFood Workshop (MTF)
- AI for 3D Generation
- 2nd Workshop on Scene Graphs and Graph Representation Learning
- ViLMa – Visual Localization and Mapping
- 1st Workshop on Dataset Distillation for Computer Vision
- VAND 2.0: Visual Anomaly and Novelty Detection
- Workshop on Computer Vision for Fashion, Art, and Design
- AI for Content Creation (AI4CC)
- New Challenges in 3D Human Understanding
- First Joint Egocentric Vision (EgoVis) Workshop
- First Workshop on Efficient and On-Device Generation (EDGE)
- 2nd Workshop on Foundation Models
- The 4th Workshop of Adversarial Machine Learning on Computer Vision: Robustness of Foundation Models
- 1st Workshop on Urban Scene Modeling: Where Vision Meets Photogrammetry and Graphics
- Second Workshop for Learning 3D with Multi-View Supervision
- AI4Space 2024
- 7th Workshop on Autonomous Driving (WAD)
- Foundation Models for Autonomous Systems
- Image Matching: Local Features and Beyond
- Populating Empty Cities – Virtual Humans for Robotics and Autonomous Driving
- Data Curation and Augmentation in Enhancing Medical Imaging Applications
- GenAI Media Generation Challenge for Computer Vision Workshop
- The Seventh International Workshop on Computer Vision for Physiological Measurement (CVPM)
- Workshop on Virtual Try-On
- Workshop on Graphic Design Understanding and Generation (GDUG)
- Fifth Workshop on Neural Architecture Search
- Agriculture-Vision: Challenges & Opportunities for Computer Vision in Agriculture
- VizWiz Grand Challenge: Describing Images and Videos Taken by Blind People
- Computer Vision for Materials Science Workshop
- The Future of Generative Visual Art
- Women in Computer Vision
- LatinX in Computer Vision Research Workshop
- The 5th Omnidirectional Computer Vision Workshop
- Third Workshop of Mobile Intelligent Photography & Imaging
- The 3rd Explainable AI for Computer Vision (XAI4CV) Workshop
- Workshop on Responsible Data
- GAZE 2024: The 6th International Workshop on Gaze Estimation and Prediction in the Wild
- RetailVision – Field Overview and Amazon Deep Dive
- 10th IEEE International Workshop on Computer Vision in Sports (CVsports)
- Equivariant Vision: From Theory to Practice
- 5th Workshop on Continual Learning in Computer Vision (CLVISION)
- 2nd Workshop on Compositional 3D Vision
- Visual Perception via Learning in an Open World
- 7th International Workshop on Visual Odometry and Computer Vision Applications Based on Location Clues
- Data-Driven Autonomous Driving Simulation (DDASD)
- Synthetic Data for Computer Vision
- Workshop on Human Motion Generation
- 9th Workshop on Computer Vision for Microscopy Image Analysis
- 2nd Workshop on Generative Models for Computer Vision
- ReGenAI: First Workshop on Responsible Generative AI
- The 5th Annual Embodied AI Workshop
- Towards 3D Foundation Models: Progress and Prospects
- 7th MUltimodal Learning and Applications
- Vision and Language for Autonomous Driving and Robotics (VLADR)
- 4th Workshop and Challenge on Computer Vision in the Built Environment for the Design, Construction, and Operation of Buildings
- The Sixth Workshop on Deep Learning for Geometric Computing (DLGC 2024)
- The First Workshop on the Evaluation of Generative Foundation Models
- 20th Workshop on Perception Beyond the Visible Spectrum
- Embedded Vision Workshop
- 5th Workshop on Robot Visual Perception in Human Crowded Environments
- 6th Workshop and Competition on Affective Behavior Analysis in-the-wild
- (3rd) Monocular Depth Estimation Challenge
- Learning from Procedural Videos and Language: What is Next?