CVPR 2023 Papers
Defending Against Patch-Based Backdoor Attacks on Self-Supervised Learning
Gradient-Based Uncertainty Attribution for Explainable Bayesian Deep Learning
Learning Neural Parametric Head Models
Delivering Arbitrary-Modal Semantic Segmentation
High-Fidelity 3D Human Digitization From Single 2K Resolution Images
Panoptic Video Scene Graph Generation
FFCV: Accelerating Training by Removing Data Bottlenecks
A Data-Based Perspective on Transfer Learning
GLIGEN: Open-Set Grounded Text-to-Image Generation
Patch-Craft Self-Supervised Training for Correlated Image Denoising
Co-Salient Object Detection With Uncertainty-Aware Group Exchange-Masking
GamutMLP: A Lightweight MLP for Color Loss Recovery
ReDirTrans: Latent-to-Latent Translation for Gaze and Head Redirection
Is BERT Blind? Exploring the Effect of Vision-and-Language Pretraining on Visual Language Understanding
Why Is the Winner the Best?
RangeViT: Towards Vision Transformers for 3D Semantic Segmentation in Autonomous Driving
ZegCLIP: Towards Adapting CLIP for Zero-Shot Semantic Segmentation
MaskSketch: Unpaired Structure-Guided Masked Image Generation
Video Probabilistic Diffusion Models in Projected Latent Space
Prefix Conditioning Unifies Language and Label Supervision
Pic2Word: Mapping Pictures to Words for Zero-Shot Composed Image Retrieval
Visual Prompt Tuning for Generative Transfer Learning
GEN: Pushing the Limits of Softmax-Based Out-of-Distribution Detection
Masked Wavelet Representation for Compact Neural Radiance Fields
MIME: Human-Aware 3D Scene Generation
BITE: Beyond Priors for Improved Three-D Dog Pose Estimation
3D Human Pose Estimation via Intuitive Physics
Heat Diffusion Based Multi-Scale and Geometric Structure-Aware Transformer for Mesh Segmentation
DKM: Dense Kernelized Feature Matching for Geometry Estimation
Balanced Product of Calibrated Experts for Long-Tailed Recognition
SLOPER4D: A Scene-Aware Dataset for Global 4D Human Pose Estimation in Urban Environments
CIMI4D: A Large Multimodal Climbing Motion Dataset Under Human-Scene Interactions
FIANCEE: Faster Inference of Adversarial Networks via Conditional Early Exits
Connecting Vision and Language With Video Localized Narratives
All-in-Focus Imaging From Event Focal Stack
Semi-Supervised Stereo-Based 3D Object Detection via Cross-View Consensus
Improved Test-Time Adaptation for Domain Generalization
Learning Common Rationale To Improve Self-Supervised Representation for Fine-Grained Visual Recognition Problems
METransformer: Radiology Report Generation by Transformer With Multiple Learnable Expert Tokens
Exact-NeRF: An Exploration of a Precise Volumetric Parameterization for Neural Radiance Fields
X3KD: Knowledge Distillation Across Modalities, Tasks and Stages for Multi-Camera 3D Object Detection
Learning Partial Correlation Based Deep Visual Representation for Image Classification
Implicit 3D Human Mesh Recovery Using Consistency With Pose and Shape From Unseen-View
CLIPPING: Distilling CLIP-Based Models With a Student Base for Video-Language Retrieval
Guided Recommendation for Model Fine-Tuning
Implicit View-Time Interpolation of Stereo Videos Using Multi-Plane Disparities and Non-Uniform Coordinates
FeatureBooster: Boosting Feature Descriptors With a Lightweight Neural Network
PlenVDB: Memory Efficient VDB-Based Radiance Fields for Fast Training and Rendering
MED-VT: Multiscale Encoder-Decoder Video Transformer With Application To Object Segmentation
Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation
Handwritten Text Generation From Visual Archetypes
Hierarchical B-Frame Video Coding Using Two-Layer CANF Without Motion Coding
Multi-Modal Learning With Missing Modality via Shared-Specific Feature Modelling
Spatial-Temporal Concept Based Explanation of 3D ConvNets
Robust Model-Based Face Reconstruction Through Weakly-Supervised Outlier Segmentation
PoseExaminer: Automated Testing of Out-of-Distribution Robustness in Human Pose and Shape Estimation
Learned Image Compression With Mixed Transformer-CNN Architectures
Weakly-Supervised Domain Adaptive Semantic Segmentation With Prototypical Contrastive Learning
Exploring and Exploiting Uncertainty for Incomplete Multi-View Classification
Seeing Through the Glass: Neural 3D Reconstruction of Object Inside a Transparent Container
Zero-Shot Everything Sketch-Based Image Retrieval, and in Explainable Style
Correspondence Transformers With Asymmetric Feature Learning and Matching Flow Super-Resolution
CiCo: Domain-Aware Sign Language Retrieval via Cross-Lingual Contrastive Learning
Castling-ViT: Compressing Self-Attention via Switching Towards Linear-Angular Attention at Vision Transformer Inference
S3C: Semi-Supervised VQA Natural Language Explanation via Self-Critical Learning
Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training
ESLAM: Efficient Dense SLAM System Based on Hybrid Representation of Signed Distance Fields
Continuous Intermediate Token Learning With Implicit Motion Manifold for Keyframe Based Motion Interpolation
Prompt-Guided Zero-Shot Anomaly Action Recognition Using Pretrained Deep Skeleton Features
A New Path: Scaling Vision-and-Language Navigation With Synthetic Instructions and Imitation Learning
Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image Inpainting
BiFormer: Vision Transformer With Bi-Level Routing Attention
Dense Distinct Query for End-to-End Object Detection
Consistent-Teacher: Towards Reducing Inconsistent Pseudo-Targets in Semi-Supervised Object Detection
Learning Locally Editable Virtual Humans
PCT-Net: Full Resolution Image Harmonization Using Pixel-Wise Color Transformations
All-in-One Image Restoration for Unknown Degradations Using Adaptive Discriminative Filters for Specific Degradations
ALSO: Automotive Lidar Self-Supervision by Occupancy Estimation
Unsupervised Object Localization: Observing the Background To Discover Objects
SCPNet: Semantic Scene Completion on Point Cloud
UMat: Uncertainty-Aware Single Image High Resolution Material Capture
Unified Keypoint-Based Action Recognition Framework via Structured Keypoint Pooling
Teleidoscopic Imaging System for Microscale 3D Shape Reconstruction
Fresnel Microfacet BRDF: Unification of Polari-Radiometric Surface-Body Reflection
RODIN: A Generative Model for Sculpting 3D Digital Avatars Using Diffusion
Learning 3D-Aware Image Synthesis With Unknown Pose Distribution
DynaFed: Tackling Client Data Heterogeneity With Global Dynamics
Enlarging Instance-Specific and Class-Specific Information for Open-Set Action Recognition
High-Fidelity 3D GAN Inversion by Pseudo-Multi-View Optimization
Blind Video Deflickering by Neural Filtering With a Flawed Atlas
Delving StyleGAN Inversion for Image Editing: A Foundation Latent Space Viewpoint
Multi-View Azimuth Stereo via Tangent Space Consistency
PosterLayout: A New Benchmark and Approach for Content-Aware Visual-Textual Presentation Layout
Evolved Part Masking for Self-Supervised Learning
PanoHead: Geometry-Aware 3D Full-Head Synthesis in 360°
Scaling Up GANs for Text-to-Image Synthesis
Instant Domain Augmentation for LiDAR Semantic Segmentation
FaceLit: Neural 3D Relightable Faces
Deep Graph-Based Spatial Consistency for Robust Non-Rigid Point Cloud Registration
Paired-Point Lifting for Enhanced Privacy-Preserving Visual Localization
Global Vision Transformer Pruning With Hessian-Aware Saliency
Beyond mAP: Towards Better Evaluation of Instance Segmentation
3D Shape Reconstruction of Semi-Transparent Worms
Language-Guided Audio-Visual Source Separation via Trimodal Consistency
Bias Mimicking: A Simple Sampling Approach for Bias Mitigation
Robust Dynamic Radiance Fields
MIANet: Aggregating Unbiased Instance and General Information for Few-Shot Semantic Segmentation
Detection of Out-of-Distribution Samples Using Binary Neuron Activation Patterns
Recovering 3D Hand Mesh Sequence From a Single Blurry Image: A New Dataset and Temporal Unfolding
Rotation-Invariant Transformer for Point Cloud Matching
Propagate and Calibrate: Real-Time Passive Non-Line-of-Sight Tracking
3D Neural Field Generation Using Triplane Diffusion
GLeaD: Improving GANs With a Generator-Leading Task
Training Debiased Subnetworks With Contrastive Weight Pruning
ViPLO: Vision Transformer Based Pose-Conditioned Self-Loop Graph for Human-Object Interaction Detection
Restoration of Hand-Drawn Architectural Drawings Using Latent Space Mapping With Degradation Generator
Distilling Self-Supervised Vision Transformers for Weakly-Supervised Few-Shot Classification & Segmentation
EFEM: Equivariant Neural Field Expectation Maximization for 3D Object Segmentation Without Scene Supervision
Learning Decorrelated Representations Efficiently Using Fast Fourier Transform
V2V4Real: A Real-World Large-Scale Dataset for Vehicle-to-Vehicle Cooperative Perception
Towards Robust Tampered Text Detection in Document Image: New Dataset and New Solution
Make Landscape Flatter in Differentially Private Federated Learning
Re-Thinking Model Inversion Attacks Against Deep Neural Networks
GeoMVSNet: Learning Multi-View Stereo With Geometry Perception
ProxyFormer: Proxy Alignment Assisted Point Cloud Completion With Missing Part Sensitive Transformer
Instant-NVR: Instant Neural Volumetric Rendering for Human-Object Interactions From Monocular RGBD Stream
A Large-Scale Homography Benchmark
Self-Correctable and Adaptable Inference for Generalizable Human Pose Estimation
Neuro-Modulated Hebbian Learning for Fully Test-Time Adaptation
Learning Distortion Invariant Representation for Image Restoration From a Causality Perspective
Token Boosting for Robust Self-Supervised Visual Transformer Pre-Training
Feature Representation Learning With Adaptive Displacement Generation and Transformer Fusion for Micro-Expression Recognition
FJMP: Factorized Joint Multi-Agent Motion Prediction Over Learned Directed Acyclic Interaction Graphs
Generalizing Dataset Distillation via Deep Generative Prior
Learning on Gradients: Generalized Artifacts Representation for GAN-Generated Images Detection
NLOST: Non-Line-of-Sight Imaging With Transformer
Few-Shot Referring Relationships in Videos
Next3D: Generative Neural Texture Rasterization for 3D-Aware Head Avatars
Tensor4D: Efficient Neural 4D Decomposition for High-Fidelity Dynamic Reconstruction and Rendering
CloSET: Modeling Clothed Humans on Continuous Surface With Explicit Template Decomposition
Task Residual for Tuning Vision-Language Models
JRDB-Pose: A Large-Scale Dataset for Multi-Person Pose Estimation and Tracking
Best of Both Worlds: Multimodal Contrastive Learning With Tabular and Imaging Data
Learning Semantic-Aware Disentangled Representation for Flexible 3D Human Body Editing
Crowd3D: Towards Hundreds of People Reconstruction From a Single Image
CARTO: Category and Joint Agnostic Reconstruction of ARTiculated Objects
MOVES: Manipulated Objects in Video Enable Segmentation
OReX: Object Reconstruction From Planar Cross-Sections Using Neural Fields
An Erudite Fine-Grained Visual Classification Model
On-the-Fly Category Discovery
Confidence-Aware Personalized Federated Learning via Variational Expectation Maximization
Spatio-Focal Bidirectional Disparity Estimation From a Dual-Pixel Image
ECON: Explicit Clothed Humans Optimized via Normal Integration
Class Adaptive Network Calibration
STDLens: Model Hijacking-Resilient Federated Learning for Object Detection
Samples With Low Loss Curvature Improve Data Efficiency
A Practical Stereo Depth System for Smart Glasses
Quantum-Inspired Spectral-Spatial Pyramid Network for Hyperspectral Image Classification
ABCD: Arbitrary Bitwise Coefficient for De-Quantization
ScaleDet: A Scalable Multi-Dataset Object Detector
A Meta-Learning Approach to Predicting Performance and Data Requirements
Multi-View Stereo Representation Revist: Region-Aware MVSNet
Neural Intrinsic Embedding for Non-Rigid Point Cloud Matching
DeltaEdit: Exploring Text-Free Training for Text-Driven Image Manipulation
DisCo-CLIP: A Distributed Contrastive Loss for Memory Efficient CLIP Training
I2-SDF: Intrinsic Indoor Scene Reconstruction and Editing via Raytracing in Neural SDFs
Two-Stage Co-Segmentation Network Based on Discriminative Representation for Recovering Human Mesh From Videos
Learning Semantic-Aware Knowledge Guidance for Low-Light Image Enhancement
EfficientViT: Memory Efficient Vision Transformer With Cascaded Group Attention
LANIT: Language-Driven Image-to-Image Translation for Unlabeled Data
MonoHuman: Animatable Human Neural Field From Monocular Video
EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual Grounding
ACSeg: Adaptive Conceptualization for Unsupervised Semantic Segmentation
Out-of-Candidate Rectification for Weakly Supervised Semantic Segmentation
Real-Time Evaluation in Online Continual Learning: A New Hope
NeurOCS: Neural NOCS Supervision for Monocular 3D Object Localization
DP-NeRF: Deblurred Neural Radiance Field With Physical Scene Priors
Demystifying Causal Features on Adversarial Examples and Causal Inoculation for Robust Network by Adversarial Instrumental Variable Regression
A New Dataset Based on Images Taken by Blind People for Testing the Robustness of Image Classification Models Trained for ImageNet Categories
What Can Human Sketches Do for Object Detection?
Occlusion-Free Scene Recovery via Neural Radiance Fields
Incremental 3D Semantic Scene Graph Prediction From RGB Sequences
The Dark Side of Dynamic Routing Neural Networks: Towards Efficiency Backdoor Injection
SMOC-Net: Leveraging Camera Pose for Self-Supervised Monocular Object Pose Estimation
E2PN: Efficient SE(3)-Equivariant Point Network
Deep Fair Clustering via Maximizing and Minimizing Mutual Information: Theory, Algorithm and Metric
Visual DNA: Representing and Comparing Images Using Distributions of Neuron Activations
LayoutDM: Discrete Diffusion Model for Controllable Layout Generation
Towards Flexible Multi-Modal Document Models
Toward Verifiable and Reproducible Human Evaluation for Text-to-Image Generation
TRACE: 5D Temporal Regression of Avatars With Dynamic Cameras in 3D Environments
Variational Distribution Learning for Unsupervised Text-to-Image Generation
ShapeTalk: A Language Dataset and Framework for 3D Shape Edits and Deformations
Im2Hands: Learning Attentive Implicit Representation of Interacting Two-Hand Shapes
Painting 3D Nature in 2D: View Synthesis of Natural Scenes From a Single Semantic Mask
Semi-Supervised Learning Made Simple With Self-Supervised Clustering
GrowSP: Unsupervised Semantic Segmentation of 3D Point Clouds
NoPe-NeRF: Optimising Neural Radiance Field With No Pose Prior
Tell Me What Happened: Unifying Text-Guided Video Completion via Multimodal Masked Video Generation
MetaFusion: Infrared and Visible Image Fusion via Meta-Feature Embedding From Object Detection
MethaneMapper: Spectral Absorption Aware Hyperspectral Transformer for Methane Detection
Source-Free Adaptive Gaze Estimation by Uncertainty Reduction
Structural Multiplane Image: Bridging Neural View Synthesis and 3D Reconstruction
DeCo: Decomposition and Reconstruction for Compositional Temporal Grounding via Coarse-To-Fine Contrastive Ranking
Weakly Supervised Temporal Sentence Grounding With Uncertainty-Guided Self-Training
Light Source Separation and Intrinsic Image Decomposition Under AC Illumination
Sketch2Saliency: Learning To Detect Salient Objects From Human Drawings
Exploiting Unlabelled Photos for Stronger Fine-Grained SBIR
Picture That Sketch: Photorealistic Image Generation From Abstract Sketches
CLIP for All Things Zero-Shot Sketch-Based Image Retrieval, Fine-Grained or Not
SceneTrilogy: On Human Scene-Sketch and Its Complementarity With Photo and Text
Angelic Patches for Improving Third-Party Object Detector Performance
NeuralField-LDM: Scene Generation With Hierarchical Latent Diffusion Models
DANI-Net: Uncalibrated Photometric Stereo by Differentiable Shadow Handling, Anisotropic Reflectance Modeling, and Neural Inverse Rendering
Adversarially Masking Synthetic To Mimic Real: Adaptive Noise Injection for Point Cloud Segmentation Adaptation
Minimizing the Accumulated Trajectory Error To Improve Dataset Distillation
Phone2Proc: Bringing Robust Robots Into Our Chaotic World
Objaverse: A Universe of Annotated 3D Objects
Supervised Masked Knowledge Distillation for Few-Shot Transformers
Class-Incremental Exemplar Compression for Class-Incremental Learning
Continual Detection Transformer for Incremental Object Detection
Overcoming the Trade-Off Between Accuracy and Plausibility in 3D Hand Shape Reconstruction
Analyzing and Diagnosing Pose Estimation With Attributions
CREPE: Can Vision-Language Foundation Models Reason Compositionally?
AnyFlow: Arbitrary Scale Optical Flow With Implicit Neural Representation
On Distillation of Guided Diffusion Models
Are Data-Driven Explanations Robust Against Out-of-Distribution Data?
T-SEA: Transfer-Based Self-Ensemble Attack on Object Detection
ActMAD: Activation Matching To Align Distributions for Test-Time-Training
Video Test-Time Adaptation for Action Recognition
Mask DINO: Towards a Unified Transformer-Based Framework for Object Detection and Segmentation
Neural Congealing: Aligning Images to a Joint Semantic Atlas
Modality-Invariant Visual Odometry for Embodied Vision
Improving Selective Visual Question Answering by Learning From Your Peers
Real-Time 6K Image Rescaling With Rate-Distortion Optimization
Distilling Neural Fields for Real-Time Articulated Shape Reconstruction
MaPLe: Multi-Modal Prompt Learning
Visibility Aware Human-Object Interaction Tracking From Single RGB Camera
X-Avatar: Expressive Human Avatars
Rethinking Gradient Projection Continual Learning: Stability / Plasticity Feature Space Decoupling
Unsupervised Space-Time Network for Temporally-Consistent Segmentation of Multiple Motions
Hierarchical Semantic Correspondence Networks for Video Paragraph Grounding
1000 FPS HDR Video With a Spike-RGB Hybrid Camera
CLIP the Gap: A Single Domain Generalization Approach for Object Detection
Learning Transformations To Reduce the Geometric Shift in Object Detection
Music-Driven Group Choreography
Structured 3D Features for Reconstructing Controllable Avatars
Backdoor Cleansing With Unlabeled Data
PET-NeuS: Positional Encoding Tri-Planes for Neural Surfaces
Single Domain Generalization for LiDAR Semantic Segmentation
Distilling Focal Knowledge From Imperfect Expert for 3D Object Detection
Neural Map Prior for Autonomous Driving
Dynamic Coarse-To-Fine Learning for Oriented Tiny Object Detection
“Seeing” Electric Network Frequency From Events
Recurrent Homography Estimation Using Homography-Guided Image Warping and Focus Transformer
Efficient Robust Principal Component Analysis via Block Krylov Iteration and CUR Decomposition
Reliable and Interpretable Personalized Federated Learning
Inverting the Imaging Process by Learning an Implicit Camera Model
WildLight: In-the-Wild Inverse Rendering With a Flashlight
Wide-Angle Rectification via Content-Aware Conformal Mapping
MEGANE: Morphable Eyeglass and Avatar Network
Hierarchical Discriminative Learning Improves Visual Representations of Biomedical Microscopy
Generalized Decoding for Pixel, Image, and Language
ScanDMM: A Deep Markov Model of Scanpath Prediction for 360° Images
CAMS: CAnonicalized Manipulation Spaces for Category-Level Functional Hand-Object Manipulation Synthesis
Self-Supervised Geometry-Aware Encoder for Style-Based 3D GAN Inversion
AutoFocusFormer: Image Segmentation off the Grid
VIVE3D: Viewpoint-Independent Video Editing Using 3D-Aware GANs
Multi-View Reconstruction Using Signed Ray Distance Functions (SRDF)
OTAvatar: One-Shot Talking Face Avatar With Controllable Tri-Plane Rendering
SketchXAI: A First Look at Explainability for Human Sketches
MV-JAR: Masked Voxel Jigsaw and Reconstruction for LiDAR-Based Self-Supervised Pre-Training
Post-Processing Temporal Action Detection
SliceMatch: Geometry-Guided Aggregation for Cross-View Pose Estimation
M6Doc: A Large-Scale Multi-Format, Multi-Type, Multi-Layout, Multi-Language, Multi-Annotation Category Dataset for Modern Document Layout Analysis
Affordance Diffusion: Synthesizing Hand-Object Interactions
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
Meta-Personalizing Vision-Language Models To Find Named Instances in Video
Language-Guided Music Recommendation for Video via Prompt Analogies
Ranking Regularization for Critical Rare Classes: Minimizing False Positives at a High True Positive Rate
Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation
ACR: Attention Collaboration-Based Regressor for Arbitrary Two-Hand Reconstruction
ShapeClipper: Scalable 3D Shape Learning From Single-View Images via Geometric and CLIP-Based Consistency
Learning Situation Hyper-Graphs for Video Question Answering
TarViS: A Unified Approach for Target-Based Video Segmentation
StepFormer: Self-Supervised Step Discovery and Localization in Instructional Videos
CODA-Prompt: COntinual Decomposed Attention-Based Prompting for Rehearsal-Free Continual Learning
Generating Part-Aware Editable 3D Shapes Without 3D Supervision
AligNeRF: High-Fidelity Neural Radiance Fields via Alignment-Aware Training
Self-Supervised Blind Motion Deblurring With Deep Expectation Maximization
Clover: Towards a Unified Video-Language Alignment and Fusion Model
High-Fidelity Clothed Avatar Reconstruction From a Single Image
Topology-Guided Multi-Class Cell Context Generation for Digital Pathology
Gazeformer: Scalable, Effective and Fast Prediction of Goal-Directed Human Attention
Detecting Everything in the Open World: Towards Universal Object Detection
CLIP2Protect: Protecting Facial Privacy Using Text-Guided Makeup via Adversarial Latent Search
Evading Forensic Classifiers With Attribute-Conditioned Adversarial Faces
Token Turing Machines
Temporally Consistent Online Depth Estimation Using Point-Based Fusion
SparsePose: Sparse-View Camera Pose Regression and Refinement
K-Planes: Explicit Radiance Fields in Space, Time, and Appearance
On the Benefits of 3D Pose and Tracking for Human Action Recognition
How You Feelin’? Learning Emotions and Mental States in Movie Scenes
GIVL: Improving Geographical Inclusivity of Vision-Language Models With Pre-Training Methods
A Whac-a-Mole Dilemma: Shortcuts Come in Multiples Where Mitigating One Amplifies Others
HyperCUT: Video Sequence From a Single Blurry Image Using Unsupervised Ordering
DejaVu: Conditional Regenerative Learning To Enhance Dense Prediction
NIRVANA: Neural Implicit Representations of Videos With Adaptive Networks and Autoregressive Patch-Wise Modeling
SimpSON: Simplifying Photo Cleanup With Single-Click Distracting Object Segmentation Network
Align and Attend: Multimodal Summarization With Dual Contrastive Losses
HNeRV: A Hybrid Neural Representation for Videos
FlexNeRF: Photorealistic Free-Viewpoint Rendering of Moving Humans From Sparse Views
Teaching Matters: Investigating the Role of Supervision in Vision Transformers
Towards Scalable Neural Representation for Diverse Videos
Score Jacobian Chaining: Lifting Pretrained 2D Diffusion Models for 3D Generation
Learning Customized Visual Models With Retrieval-Augmented Knowledge
Probing Neural Representations of Scene Perception in a Hippocampally Dependent Task Using Artificial Neural Networks
ConStruct-VL: Data-Free Continual Structured VL Concepts Learning
Invertible Neural Skinning
Transductive Few-Shot Learning With Prototype-Based Label Propagation by Iterative Graph Refinement
3Mformer: Multi-Order Multi-Mode Transformer for Skeletal Action Recognition
LINe: Out-of-Distribution Detection by Leveraging Important Neurons
Open-Vocabulary Panoptic Segmentation With Text-to-Image Diffusion Models
Modernizing Old Photos Using Multiple References via Photorealistic Style Transfer
SUDS: Scalable Urban Dynamic Scenes
Octree Guided Unoriented Surface Reconstruction
Bayesian Posterior Approximation With Stochastic Ensembles
PROB: Probabilistic Objectness for Open World Object Detection
Consistent View Synthesis With Pose-Guided Diffusion Models
Guided Depth Super-Resolution by Deep Anisotropic Diffusion
Robust Mean Teacher for Continual and Gradual Test-Time Adaptation
itKD: Interchange Transfer-Based Knowledge Distillation for 3D Object Detection
Synthesizing Photorealistic Virtual Humans Through Cross-Modal Disentanglement
EXCALIBUR: Encouraging and Evaluating Embodied Exploration
Freestyle Layout-to-Image Synthesis
Marching-Primitives: Shape Abstraction From Signed Distance Function
3D Concept Learning and Reasoning From Multi-View Images
Towards End-to-End Generative Modeling of Long Videos With Memory-Efficient Bidirectional Transformers
Discriminative Co-Saliency and Background Mining Transformer for Co-Salient Object Detection
Gated Multi-Resolution Transfer Network for Burst Restoration and Enhancement
Vita-CLIP: Video and Text Adaptive CLIP via Multimodal Prompting
Burstormer: Burst Image Restoration and Enhancement Transformer
Bridging Precision and Confidence: A Train-Time Loss for Calibrating Object Detection
PromptCAL: Contrastive Affinity Learning via Auxiliary Prompts for Generalized Novel Category Discovery
DeepMapping2: Self-Supervised Large-Scale LiDAR Map Optimization
OPE-SR: Orthogonal Position Encoding for Designing a Parameter-Free Upsampling Module in Arbitrary-Scale Image Super-Resolution
RelightableHands: Efficient Neural Relighting of Articulated Hand Models
Query-Centric Trajectory Prediction
NICO++: Towards Better Benchmarking for Domain Generalization
Not All Image Regions Matter: Masked Vector Quantization for Autoregressive Image Generation
Fix the Noise: Disentangling Source Feature for Controllable Domain Translation
NS3D: Neuro-Symbolic Grounding of 3D Objects and Relations
All in One: Exploring Unified Video-Language Pre-Training
MarS3D: A Plug-and-Play Motion-Aware Model for Semantic Segmentation on Multi-Scan 3D Point Clouds
PyPose: A Library for Robot Learning With Physics-Based Optimization
Directional Connectivity-Based Segmentation of Medical Images
AMT: All-Pairs Multi-Field Transforms for Efficient Frame Interpolation
DNF: Decouple and Feedback Network for Seeing in the Dark
Observation-Centric SORT: Rethinking SORT for Robust Multi-Object Tracking
Trace and Pace: Controllable Pedestrian Animation via Guided Trajectory Diffusion
Azimuth Super-Resolution for FMCW Radar in Autonomous Driving
Bitstream-Corrupted JPEG Images Are Restorable: Two-Stage Compensation and Alignment Framework for Image Restoration
MagicNet: Semi-Supervised Multi-Organ Segmentation via Magic-Cube Partition and Recovery
Neural Residual Radiance Fields for Streamably Free-Viewpoint Videos
TTA-COPE: Test-Time Adaptation for Category-Level Object Pose Estimation
Masked and Adaptive Transformer for Exemplar Based Image Translation
NAR-Former: Neural Architecture Representation Learning Towards Holistic Attributes Prediction
Boosting Weakly-Supervised Temporal Action Localization With Text Information
Imagic: Text-Based Real Image Editing With Diffusion Models
MEDIC: Remove Model Backdoors via Importance Driven Cloning
VDN-NeRF: Resolving Shape-Radiance Ambiguity via View-Dependence Normalization
Feature Separation and Recalibration for Adversarial Robustness
Learning Visual Representations via Language-Guided Sampling
Cascaded Local Implicit Transformer for Arbitrary-Scale Super-Resolution
NIPQ: Noise Proxy-Based Integrated Pseudo-Quantization
Stimulus Verification Is a Universal and Effective Sampler in Multi-Modal Human Trajectory Prediction
3DAvatarGAN: Bridging Domains for Personalized Editable Avatars
STMT: A Spatial-Temporal Mesh Transformer for MoCap-Based Action Recognition
Probabilistic Prompt Learning for Dense Prediction
Visibility Constrained Wide-Band Illumination Spectrum Design for Seeing-in-the-Dark
Leapfrog Diffusion Model for Stochastic Trajectory Prediction
EqMotion: Equivariant Multi-Agent Motion Prediction With Invariant Interaction Reasoning
Multiscale Tensor Decomposition and Rendering Equation Encoding for View Synthesis
X-Pruner: eXplainable Pruning for Vision Transformers
MDL-NAS: A Joint Multi-Domain Learning Framework for Vision Transformer
Poly-PC: A Polyhedral Network for Multiple Point Cloud Tasks at Once
Causally-Aware Intraoperative Imputation for Overall Survival Time Prediction
Virtual Sparse Convolution for Multimodal 3D Object Detection
Learning Human-to-Robot Handovers From Point Clouds
UniDexGrasp: Universal Robotic Dexterous Grasping via Learning Diverse Proposal Generation and Goal-Conditioned Policy
PartManip: Learning Cross-Category Generalizable Part Manipulation Policy From Point Cloud Observations
GAPartNet: Cross-Category Domain-Generalizable Object Perception and Manipulation via Generalizable and Actionable Parts
Progressive Disentangled Representation Learning for Fine-Grained Controllable Talking Head Synthesis
UDE: A Unified Driving Engine for Human Motion Generation
Learning Detailed Radiance Manifolds for High-Fidelity and 3D-Consistent Portrait Synthesis From Monocular Image
Hand Avatar: Free-Pose Hand Animation and Rendering From Monocular Video
Spectral Bayesian Uncertainty for Image Super-Resolution
Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline
Biomechanics-Guided Facial Action Unit Detection Through Force Modeling
Understanding and Improving Visual Prompting: A Label-Mapping Perspective
Egocentric Audio-Visual Object Localization
MIC: Masked Image Consistency for Context-Enhanced Domain Adaptation
GFPose: Learning 3D Human Pose Prior With Gradient Fields
Quantum Multi-Model Fitting
EventNeRF: Neural Radiance Fields From a Single Colour Event Camera
Self-Supervised Pre-Training With Masked Shape Prediction for 3D Scene Understanding
Mofusion: A Framework for Denoising-Diffusion-Based Motion Synthesis
CCuantuMM: Cycle-Consistent Quantum-Hybrid Matching of Multiple Shapes
PivoTAL: Prior-Driven Supervision for Weakly-Supervised Temporal Action Localization
Bias-Eliminating Augmentation Learning for Debiased Federated Learning
GALIP: Generative Adversarial CLIPs for Text-to-Image Synthesis
Indiscernible Object Counting in Underwater Scenes
Event-Based Frame Interpolation With Ad-Hoc Deblurring
iDisc: Internal Discretization for Monocular Depth Estimation
Quantitative Manipulation of Custom Attributes on 3D-Aware Image Synthesis
IFSeg: Image-Free Semantic Segmentation via Vision-Language Model
Histopathology Whole Slide Image Analysis With Heterogeneous Graph Representation Learning
Towards Unified Scene Text Spotting Based on Sequence Generation
Meta-Explore: Exploratory Hierarchical Vision-and-Language Navigation Using Scene Object Spectrum Grounding
ReCo: Region-Controlled Text-to-Image Generation
An Empirical Study of End-to-End Video-Language Transformers With Masked Visual Modeling
LAVENDER: Unifying Video-Language Understanding As Masked Language Modeling
Polarimetric iToF: Measuring High-Fidelity Depth Through Scattering Media
Meta-Tuning Loss Functions and Data Augmentation for Few-Shot Object Detection
KD-DLGAN: Data Limited Image Generation via Knowledge Distillation
Multi-Object Manipulation via Object-Centric Neural Scattering Functions
Accidental Light Probes
Randomized Adversarial Training via Taylor Expansion
R2Former: Unified Retrieval and Reranking Transformer for Place Recognition
TopNet: Transformer-Based Object Placement Network for Image Compositing
Natural Language-Assisted Sign Language Recognition
Siamese DETR
Generating Aligned Pseudo-Supervision From Non-Aligned Data for Image Restoration in Under-Display Camera
Aligning Bag of Regions for Open-Vocabulary Object Detection
Nighttime Smartphone Reflective Flare Removal Using Optical Center Symmetry Prior
CelebV-Text: A Large-Scale Facial Text-Video Dataset
Correlational Image Modeling for Self-Supervised Visual Pre-Training
Learning Generative Structure Prior for Blind Text Image Super-Resolution
Cross-Image-Attention for Conditional Embeddings in Deep Metric Learning
Robot Structure Prior Guided Temporal Attention for Camera-to-Robot Pose Estimation From Image Sequence
Distribution Shift Inversion for Out-of-Distribution Prediction
NeuralDome: A Neural Modeling Pipeline on Multi-View Human-Object Interactions
Fine-Tuned CLIP Models Are Efficient Video Learners
Pointersect: Neural Rendering With Cloud-Ray Intersection
Viewpoint Equivariance for Multi-View 3D Object Detection
MobileOne: An Improved One Millisecond Mobile Backbone
Conditional Image-to-Video Generation With Latent Flow Diffusion Models
Scene-Aware Egocentric 3D Human Pose Estimation
ObjectStitch: Object Compositing With Diffusion Model
Towards Open-World Segmentation of Parts
SkyEye: Self-Supervised Bird’s-Eye-View Semantic Mapping Using Monocular Frontal View Images
Semantic-Promoted Debiasing and Background Disambiguation for Zero-Shot Instance Segmentation
DeSTSeg: Segmentation Guided Denoising Student-Teacher for Anomaly Detection
FitMe: Deep Photorealistic 3D Morphable Model Avatars
Regularization of Polynomial Networks for Image Recognition
Improving Fairness in Facial Albedo Estimation via Visual-Textual Cues
MagicPony: Learning Articulated 3D Animals in the Wild
A Strong Baseline for Generalized Few-Shot Semantic Segmentation
Open-Set Likelihood Maximization for Few-Shot Learning
Recognizing Rigid Patterns of Unlabeled Point Clouds by Complete and Continuous Isometry Invariants With No False Negatives and No False Positives
Hierarchical Prompt Learning for Multi-Task Learning
Teaching Structured Vision & Language Concepts to Vision & Language Models
Few-Shot Learning With Visual Distribution Calibration and Cross-Modal Distribution Alignment
Explaining Image Classifiers With Multiscale Directional Image Representation
UTM: A Unified Multiple Object Tracking Model With Identity-Aware Feature Enhancement
PlaneDepth: Self-Supervised Depth Estimation via Orthogonal Planes
Guiding Pseudo-Labels With Uncertainty Estimation for Source-Free Unsupervised Domain Adaptation
Learning To Exploit Temporal Structure for Biomedical Vision–Language Processing
BEV@DC: Bird’s-Eye View Assisted Training for Depth Completion
Robust Single Image Reflection Removal Against Adversarial Attacks
Dynamic Neural Network for Multi-Task Learning Searching Across Diverse Network Topologies
Change-Aware Sampling and Contrastive Learning for Satellite Images
PaCa-ViT: Learning Patch-to-Cluster Attention in Vision Transformers
Level-S$^2$fM: Structure From Motion on Neural Level Set of Implicit Surfaces
How To Prevent the Continuous Damage of Noises To Model Training?
SMAE: Few-Shot Learning for HDR Deghosting With Saturation-Aware Masked Autoencoders
A Unified HDR Imaging Method With Pixel and Patch Level
Boosting Semi-Supervised Learning by Exploiting All Unlabeled Data
Learning Open-Vocabulary Semantic Segmentation Models From Natural Language Supervision
Knowledge Combination To Learn Rotated Detection Without Rotated Annotation
Reliability in Semantic Segmentation: Are We on the Right Track?
K3DN: Disparity-Aware Kernel Estimation for Dual-Pixel Defocus Deblurring
A New Benchmark: On the Utility of Synthetic Data With Blender for Bare Supervised Learning and Downstream Domain Adaptation
DATID-3D: Diversity-Preserved Domain Adaptation Using Text-to-Image Diffusion for 3D Generative Model
Unbiased Multiple Instance Learning for Weakly Supervised Video Anomaly Detection
Bootstrap Your Own Prior: Towards Distribution-Agnostic Novel Class Discovery
Semantic Scene Completion With Cleaner Self
Visual Prompt Multi-Modal Tracking
AstroNet: When Astrocyte Meets Artificial Neural Network
MCF: Mutual Correction Framework for Semi-Supervised Medical Image Segmentation
3D Cinemagraphy From a Single Image
Data-Free Knowledge Distillation via Feature Exchange and Activation Region Constraint
Decomposed Cross-Modal Distillation for RGB-Based Temporal Action Detection
DiffusionRig: Learning Personalized Priors for Facial Appearance Editing
BiasBed – Rigorous Texture Bias Evaluation
Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens
Ref-NPR: Reference-Based Non-Photorealistic Radiance Fields for Controllable Scene Stylization
CodeTalker: Speech-Driven 3D Facial Animation With Discrete Motion Prior
A Bag-of-Prototypes Representation for Dataset-Level Applications
Focused and Collaborative Feedback Integration for Interactive Image Segmentation
(ML)$^2$P-Encoder: On Exploration of Channel-Class Correlation for Multi-Label Zero-Shot Learning
Decomposed Soft Prompt Guided Fusion Enhancing for Compositional Zero-Shot Learning
Sampling Is Matter: Point-Guided 3D Human Mesh Reconstruction
Super-Resolution Neural Operator
RefTeacher: A Strong Baseline for Semi-Supervised Referring Expression Comprehension
JacobiNeRF: NeRF Shaping With Mutual Information Gradients
Category Query Learning for Human-Object Interaction Classification
Collaboration Helps Camera Overtake LiDAR in 3D Detection
Meta Omnium: A Benchmark for General-Purpose Learning-To-Learn
ARCTIC: A Dataset for Dexterous Bimanual Hand-Object Manipulation
Hi4D: 4D Instance Segmentation of Close Human Interaction
Towards Better Stability and Adaptability: Improve Online Self-Training for Model Adaptation in Semantic Segmentation
MixSim: A Hierarchical Framework for Mixed Reality Traffic Simulation
Implicit Occupancy Flow Fields for Perception and Prediction in Self-Driving
Joint Appearance and Motion Learning for Efficient Rolling Shutter Correction
A Generalized Framework for Video Instance Segmentation
Soft-Landing Strategy for Alleviating the Task Discrepancy Problem in Temporal Action Localization Tasks
Class Prototypes Based Contrastive Learning for Classifying Multi-Label and Fine-Grained Educational Videos
Revisiting Residual Networks for Adversarial Robustness
Mitigating Task Interference in Multi-Task Learning via Explicit Task Routing With Non-Learnable Primitives
Two-Shot Video Object Segmentation
HouseDiffusion: Vector Floorplan Generation via a Diffusion Model With Discrete and Continuous Denoising
Panoptic Compositional Feature Field for Editable Scene Rendering With Network-Inferred Labels via Metric Learning
A Characteristic Function-Based Method for Bottom-Up Human Pose Estimation
Energy-Efficient Adaptive 3D Sensing
Consistent Direct Time-of-Flight Video Depth Super-Resolution
DETRs With Hybrid Matching
Gradient Norm Aware Minimization Seeks First-Order Flatness and Improves Generalization
EcoTTA: Memory-Efficient Continual Test-Time Adaptation via Self-Distilled Regularization
Progressive Random Convolutions for Single Domain Generalization
AutoLabel: CLIP-Based Framework for Open-Set Video Domain Adaptation
3D Line Mapping Revisited
DeepLSD: Line Segment Detection and Refinement With Deep Image Gradients
Multi-View Adversarial Discriminator: Mine the Non-Causal Factors for Object Detection in Unseen Domains
SegLoc: Learning Segmentation-Based Representations for Privacy-Preserving Visual Localization
RiDDLE: Reversible and Diversified De-Identification With Latent Encryptor
OpenScene: 3D Scene Understanding With Open Vocabularies
Nerflets: Local Radiance Fields for Efficient Structure-Aware 3D Scene Representation From 2D Supervision
MobileNeRF: Exploiting the Polygon Rasterization Pipeline for Efficient Neural Field Rendering on Mobile Architectures
Learning Emotion Representations From Verbal and Nonverbal Communication
Understanding Masked Autoencoders via Hierarchical Latent Variable Models
Iterative Vision-and-Language Navigation
Relational Context Learning for Human-Object Interaction Detection
ViewNet: A Novel Projection-Based Backbone With View Pooling for Few-Shot Point Cloud Classification
SDFusion: Multimodal 3D Shape Completion, Reconstruction, and Generation
Understanding the Robustness of 3D Object Detection With Bird’s-Eye-View Representations in Autonomous Driving
Human Pose Estimation in Extremely Low-Light Conditions
Boundary Unlearning: Rapid Forgetting of Deep Networks via Shifting the Decision Boundary
Sliced Optimal Partial Transport
TruFor: Leveraging All-Round Clues for Trustworthy Image Forgery Detection and Localization
Pseudo-Label Guided Contrastive Learning for Semi-Supervised Medical Image Segmentation
SINE: SINgle Image Editing With Text-to-Image Diffusion Models
Leveraging per Image-Token Consistency for Vision-Language Pre-Training
SeSDF: Self-Evolved Signed Distance Field for Implicit 3D Clothed Human Reconstruction
Block Selection Method for Using Feature Norm in Out-of-Distribution Detection
Relightable Neural Human Assets From Multi-View Gradient Illuminations
Markerless Camera-to-Robot Pose Estimation via Self-Supervised Sim-to-Real Transfer
DA-DETR: Domain Adaptive Detection Transformer With Information Fusion
Towards Efficient Use of Multi-Scale Features in Transformer-Based Object Detectors
Understanding and Constructing Latent Modality Structures in Multi-Modal Representation Learning
NeRF in the Palm of Your Hand: Corrective Augmentation for Robotics via Novel-View Synthesis
HaLP: Hallucinating Latent Positives for Skeleton-Based Self-Supervised Learning of Actions
Better “CMOS” Produces Clearer Images: Learning Space-Variant Blur Estimation for Blind Image Super-Resolution
Finding Geometric Models by Clustering in the Consensus Space
3D-POP – An Automated Annotation Approach to Facilitate Markerless 2D-3D Tracking of Freely Moving Birds With Marker-Based Motion Capture
TimeBalance: Temporally-Invariant and Temporally-Distinctive Video Representations for Semi-Supervised Action Recognition
3D Human Mesh Estimation From Virtual Markers
Rethinking Feature-Based Knowledge Distillation for Face Recognition
Chat2Map: Efficient Scene Mapping From Multi-Ego Conversations
Novel-View Acoustic Synthesis
High-Resolution Image Reconstruction With Latent Diffusion Models From Human Brain Activity
Zero-Shot Referring Image Segmentation With Global-Local Context Features
AVFormer: Injecting Vision Into Frozen Speech Models for Zero-Shot AV-ASR
Learning Attention As Disentangler for Compositional Zero-Shot Learning
Mask-Free OVIS: Open-Vocabulary Instance Segmentation Without Manual Mask Annotations
SE-ORNet: Self-Ensembling Orientation-Aware Network for Unsupervised Point Cloud Shape Correspondence
Adaptive Spot-Guided Transformer for Consistent Local Feature Matching
D2Former: Jointly Learning Hierarchical Detectors and Contextual Descriptors via Agent-Based Transformers
Neural Koopman Pooling: Control-Inspired Temporal Dynamics Encoding for Skeleton-Based Action Recognition
StyleGAN Salon: Multi-View Latent Optimization for Pose-Invariant Hairstyle Transfer
Box-Level Active Detection
Neural Scene Chronology
DynIBaR: Neural Dynamic Image-Based Rendering
Omnimatte3D: Associating Objects and Their Effects in Unconstrained Monocular Video
Controllable Light Diffusion for Portraits
TrojViT: Trojan Insertion in Vision Transformers
WinCLIP: Zero-/Few-Shot Anomaly Classification and Segmentation
VolRecon: Volume Rendering of Signed Ray Distance Functions for Generalizable Multi-View Reconstruction
DSVT: Dynamic Sparse Voxel Transformer With Rotated Sets
Improving Robustness of Vision Transformers by Reducing Sensitivity To Patch Corruptions
PolyFormer: Referring Image Segmentation As Sequential Polygon Generation
Affordances From Human Videos as a Versatile Representation for Robotics
Improving Visual Grounding by Encouraging Consistent Gradient-Based Explanations
The Differentiable Lens: Compound Lens Search Over Glass Surfaces and Materials for Object Detection
Thermal Spread Functions (TSF): Physics-Guided Material Classification
WIRE: Wavelet Implicit Neural Representations
BEVFormer v2: Adapting Modern Image Backbones to Bird’s-Eye-View Recognition via Perspective Supervision
DrapeNet: Garment Generation and Self-Supervised Draping
3D Spatial Multimodal Knowledge Accumulation for Scene Graph Prediction in Point Cloud
Integrally Pre-Trained Transformer Pyramid Networks
DeepSolo: Let Transformer Decoder With Explicit Points Solo for Text Spotting
Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning
Optimal Transport Minimization: Crowd Localization on Density Maps for Semi-Supervised Counting
SAP-DETR: Bridging the Gap Between Salient Points and Queries-Based Transformer Detector for Fast Model Convergency
REVEAL: Retrieval-Augmented Visual-Language Pre-Training With Multi-Source Multimodal Knowledge Memory
Blur Interpolation Transformer for Real-World Motion From Blur
High-Fidelity Event-Radiance Recovery via Transient Event Frequency
Learning Event Guided High Dynamic Range Video Reconstruction
Bi-LRFusion: Bi-Directional LiDAR-Radar Fusion for 3D Dynamic Object Detection
Diffusion Video Autoencoders: Toward Temporally Consistent Face Video Editing via Disentangled Video Encoding
DexArt: Benchmarking Generalizable Dexterous Manipulation With Articulated Objects
Collaborative Static and Dynamic Vision-Language Streams for Spatio-Temporal Video Grounding
PersonNeRF: Personalized Reconstruction From Photo Collections
Teacher-Generated Spatial-Attention Labels Boost Robustness and Accuracy of Contrastive Models
Superclass Learning With Representation Enhancement
3D-Aware Multi-Class Image-to-Image Translation With NeRFs
Towards Unsupervised Object Detection From LiDAR Point Clouds
Unbalanced Optimal Transport: A Unified Framework for Object Detection
ORCa: Glossy Objects As Radiance-Field Cameras
Role of Transients in Two-Bounce Non-Line-of-Sight Imaging
Improving Table Structure Recognition With Visual-Alignment Sequential Coordinate Modeling
Class Relationship Embedded Learning for Source-Free Unsupervised Domain Adaptation
A Rotation-Translation-Decoupled Solution for Robust and Efficient Visual-Inertial Initialization
Towards Domain Generalization for Multi-View 3D Object Detection in Bird-Eye-View
Visual Recognition-Driven Image Restoration for Multiple Degradation With Intrinsic Semantics Recovery
Ingredient-Oriented Multi-Degradation Learning for Image Restoration
Ultra-High Resolution Segmentation With Ultra-Rich Context: A Novel Benchmark
Learning Sample Relationship for Exposure Correction
gSDF: Geometry-Driven Signed Distance Functions for 3D Hand-Object Reconstruction
EVA: Exploring the Limits of Masked Visual Representation Learning at Scale
PointConvFormer: Revenge of the Point-Based Convolution
Compression-Aware Video Super-Resolution
Mask-Free Video Instance Segmentation
Learning Accurate 3D Shape Based on Stereo Polarimetric Imaging
MARLIN: Masked Autoencoder for Facial Video Representation LearnINg
CafeBoost: Causal Feature Boost To Eliminate Task-Induced Bias for Class Incremental Learning
EfficientSCI: Densely Connected Network With Space-Time Factorization for Large-Scale Video Snapshot Compressive Imaging
Auto-CARD: Efficient and Robust Codec Avatar Driving for Real-Time Mobile Telepresence
Siamese Image Modeling for Self-Supervised Vision Representation Learning
Learning To Retain While Acquiring: Combating Distribution-Shift in Adversarial Data-Free Knowledge Distillation
Ground-Truth Free Meta-Learning for Deep Compressive Sampling
Neumann Network With Recursive Kernels for Single Image Defocus Deblurring
Context-Aware Relative Object Queries To Unify Video Instance and Panoptic Segmentation
Aligning Step-by-Step Instructional Diagrams to Video Demonstrations
High-Fidelity Guided Image Synthesis With Latent Diffusion Models
Procedure-Aware Pretraining for Instructional Video Understanding
Neural Part Priors: Learning To Optimize Part-Based Object Completion in RGB-D Scans
Hierarchical Video-Moment Retrieval and Step-Captioning
Generative Semantic Segmentation
FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks
Efficient Movie Scene Detection Using State-Space Transformers
Neuralangelo: High-Fidelity Neural Surface Reconstruction
Sparse Multi-Modal Graph Transformer With Shared-Context Processing for Representation Learning of Giga-Pixel Images
Spider GAN: Leveraging Friendly Neighbors To Accelerate GAN Training
ASPnet: Action Segmentation With Shared-Private Representation of Multiple Data Sources
Overlooked Factors in Concept-Based Explanations: Dataset Choice, Concept Learnability, and Human Capability
Multiple Instance Learning via Iterative Self-Paced Supervised Contrastive Learning
Model Barrier: A Compact Un-Transferable Isolation Domain for Model Intellectual Property Protection
FLEX: Full-Body Grasping Without Full-Body Grasps
A Soma Segmentation Benchmark in Full Adult Fly Brain
NeuMap: Neural Coordinate Mapping by Auto-Transdecoder for Camera Localization
Doubly Right Object Recognition: A Why Prompt for Visual Rationales
Contrastive Semi-Supervised Learning for Underwater Image Restoration via Reliable Bank
Adaptive Human Matting for Dynamic Videos
Hi-LASSIE: High-Fidelity Articulated Shape and Skeleton Discovery From Sparse Image Ensemble
Pix2map: Cross-Modal Retrieval for Inferring Street Maps From Images
NewsNet: A Novel Dataset for Hierarchical Temporal Segmentation
Large-Capacity and Flexible Video Steganography via Invertible Neural Network
PVO: Panoptic Visual Odometry
Infinite Photorealistic Worlds Using Procedural Generation
3D Semantic Segmentation in the Wild: Learning Generalized Models for Adverse-Condition Point Clouds
Virtual Occlusions Through Implicit Depth
Improving Zero-Shot Generalization and Robustness of Multi-Modal Models
StarCraftImage: A Dataset for Prototyping Spatial Reasoning Methods for Multi-Agent Environments
DistilPose: Tokenized Pose Regression With Heatmap Distillation
LayoutDiffusion: Controllable Diffusion Model for Layout-to-Image Generation
VectorFloorSeg: Two-Stream Graph Attention Network for Vectorized Roughcast Floorplan Segmentation
Progressive Transformation Learning for Leveraging Virtual Images in Training
OCELOT: Overlapped Cell on Tissue Dataset for Histopathology
Self-Supervised Image-to-Point Distillation via Semantically Tolerant Contrastive Loss
BlackVIP: Black-Box Visual Prompting for Robust Transfer Learning
Super-CLEVR: A Virtual Benchmark To Diagnose Domain Robustness in Visual Reasoning
A-Cap: Anticipation Captioning With Commonsense Knowledge
NoisyTwins: Class-Consistent and Diverse Image Generation Through StyleGANs
Semi-Supervised Parametric Real-World Image Harmonization
ALTO: Alternating Latent Topologies for Implicit 3D Reconstruction
LEGO-Net: Learning Regular Rearrangements of Objects in Rooms
SinGRAF: Learning a 3D Generative Radiance Field for a Single Scene
Camouflaged Instance Segmentation via Explicit De-Camouflaging
DualRel: Semi-Supervised Mitochondria Segmentation From a Prototype Perspective
Rethinking the Correlation in Few-Shot Segmentation: A Buoys View
Proposal-Based Multiple Instance Learning for Weakly-Supervised Temporal Action Localization
Dynamic Generative Targeted Attacks With Pattern Injection
SCOOP: Self-Supervised Correspondence and Optimization-Based Scene Flow
Sound to Visual Scene Generation by Audio-to-Visual Latent Alignment
AVFace: Towards Detailed Audio-Visual 4D Face Reconstruction
Towards Compositional Adversarial Robustness: Generalizing Adversarial Training to Composite Semantic Perturbations
How to Backdoor Diffusion Models?
Heterogeneous Continual Learning
Minimizing Maximum Model Discrepancy for Transferable Black-Box Targeted Attacks
DyNCA: Real-Time Dynamic Texture Synthesis Using Neural Cellular Automata
Learning Correspondence Uncertainty via Differentiable Nonlinear Least Squares
Novel Class Discovery for 3D Point Cloud Semantic Segmentation
Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models
Generative Bias for Robust Visual Question Answering
DF-Platter: Multi-Face Heterogeneous Deepfake Dataset
Scalable, Detailed and Mask-Free Universal Photometric Stereo
Scaling Language-Image Pre-Training via Masking
TempSAL – Uncovering Temporal Information for Deep Saliency Prediction
Spatiotemporal Self-Supervised Learning for Point Clouds in the Wild
LOGO: A Long-Form Video Dataset for Group Action Quality Assessment
Learning Compact Representations for LiDAR Completion and Generation
Divide and Conquer: Answering Questions With Object Factorization and Compositional Reasoning
StyleRF: Zero-Shot 3D Style Transfer of Neural Radiance Fields
Conditional Generation of Audio From Video via Foley Analogies
Learning Semantic Relationship Among Instances for Image-Text Matching
Shepherding Slots to Objects: Towards Stable and Robust Object-Centric Learning
Unsupervised Continual Semantic Adaptation Through Neural Rendering
OrienterNet: Visual Localization in 2D Public Maps With Neural Matching
CASP-Net: Rethinking Video Saliency Prediction From an Audio-Visual Consistency Perceptual Perspective
OpenGait: Revisiting Gait Recognition Towards Better Practicality
LidarGait: Benchmarking 3D Gait Recognition With Point Clouds
OneFormer: One Transformer To Rule Universal Image Segmentation
Graph Transformer GANs for Graph-Constrained House Generation
Text-Guided Unsupervised Latent Transformation for Multi-Attribute Image Manipulation
A Unified Knowledge Distillation Framework for Deep Directed Graphical Models
GANHead: Towards Generative Animatable Neural Head Avatars
MD-VQA: Multi-Dimensional Quality Assessment for UGC Live Videos
Blind Image Quality Assessment via Vision-Language Correspondence: A Multitask Learning Perspective
Real-Time Neural Light Field on Mobile Devices
Unsupervised Volumetric Animation
Make-a-Story: Visual Memory Conditioned Consistent Story Generation
Unknown Sniffer for Object Detection: Don’t Turn a Blind Eye to Unknown Objects
CF-Font: Content Fusion for Few-Shot Font Generation
Combining Implicit-Explicit View Correlation for Light Field Semantic Segmentation
Local Connectivity-Based Density Estimation for Face Clustering
BAAM: Monocular 3D Pose and Shape Reconstruction With Bi-Contextual Attention Module and Attention-Guided Modeling
Autonomous Manipulation Learning for Similar Deformable Objects via Only One Demonstration
Efficient Loss Function by Minimizing the Detrimental Effect of Floating-Point Errors on Gradient-Based Attacks
Efficient RGB-T Tracking via Cross-Modality Distillation
Fair Federated Medical Image Segmentation via Client Contribution Estimation
Revisiting Temporal Modeling for CLIP-Based Image-to-Video Knowledge Transferring
Turning a CLIP Model Into a Scene Text Detector
Canonical Fields: Self-Supervised Learning of Pose-Canonicalized Neural Fields
BUOL: A Bottom-Up Framework With Occupancy-Aware Lifting for Panoptic 3D Scene Reconstruction From a Single Image
Cloud-Device Collaborative Adaptation to Continual Changing Environments in the Real-World
Implicit Diffusion Models for Continuous Super-Resolution
AttriCLIP: A Non-Incremental Learner for Incremental Knowledge Learning
SmartAssign: Learning a Smart Knowledge Assignment Strategy for Deraining and Desnowing
Actionlet-Dependent Contrastive Learning for Unsupervised Skeleton-Based Action Recognition
Multiclass Confidence and Localization Calibration for Object Detection
Long-Term Visual Localization With Mobile Sensors
Efficient and Explicit Modelling of Image Hierarchies for Image Restoration
Bidirectional Copy-Paste for Semi-Supervised Medical Image Segmentation
Class Balanced Adaptive Pseudo Labeling for Federated Semi-Supervised Learning
AutoRecon: Automated 3D Object Discovery and Reconstruction
PiMAE: Point Cloud and Image Interactive Masked Autoencoders for 3D Object Detection
Open-Vocabulary Point-Cloud Object Detection Without 3D Annotation
TensoIR: Tensorial Inverse Rendering
Look Around for Anomalies: Weakly-Supervised Anomaly Detection via Context-Motion Relational Learning
RepMode: Learning to Re-Parameterize Diverse Experts for Subcellular Structure Prediction
NeUDF: Leaning Neural Unsigned Distance Fields With Volume Rendering
NeAT: Learning Neural Implicit Surfaces With Arbitrary Topologies From Multi-View Images
On the Convergence of IRLS and Its Variants in Outlier-Robust Estimation
Metadata-Based RAW Reconstruction via Implicit Neural Functions
Adaptive Patch Deformation for Textureless-Resilient Multi-View Stereo
RobustNeRF: Ignoring Distractors With Robust Losses
DiffCollage: Parallel Generation of Large Content With Diffusion Models
Point Cloud Forecasting as a Proxy for 4D Occupancy Forecasting
Improving Cross-Modal Retrieval With Set of Diverse Embeddings
PointCMP: Contrastive Mask Prediction for Self-Supervised Learning on Point Cloud Videos
3D Human Pose Estimation With Spatio-Temporal Criss-Cross Attention
Mapping Degeneration Meets Label Evolution: Learning Infrared Small Target Detection With Single Point Supervision
Object Pop-Up: Can We Infer 3D Objects and Their Poses From Human Interactions Alone?
Real-Time Controllable Denoising for Image and Video
Probabilistic Debiasing of Scene Graphs
Weak-Shot Object Detection Through Mutual Knowledge Transfer
Beyond Appearance: A Semantic Controllable Self-Supervised Learning Framework for Human-Centric Visual Tasks
Recurrence Without Recurrence: Stable Video Landmark Detection With Deep Equilibrium Models
SPIn-NeRF: Multiview Segmentation and Perceptual Inpainting With Neural Radiance Fields
DeepMAD: Mathematical Architecture Design for Deep Convolutional Neural Network
CutMIB: Boosting Light Field Super-Resolution via Multi-View Image Blending
Adaptive Channel Sparsity for Federated Learning Under System Heterogeneity
Uncertainty-Aware Optimal Transport for Semantically Coherent Out-of-Distribution Detection
ScaleFL: Resource-Adaptive Federated Learning With Heterogeneous Clients
NVTC: Nonlinear Vector Transform Coding
Slimmable Dataset Condensation
HRDFuse: Monocular 360° Depth Estimation by Collaboratively Learning Holistic-With-Regional Depth Distributions
Exploring Intra-Class Variation Factors With Learnable Cluster Prompts for Semi-Supervised Image Synthesis
MetaCLUE: Towards Comprehensive Visual Metaphors Research
Unsupervised 3D Shape Reconstruction by Part Retrieval and Assembly
Event-Based Video Frame Interpolation With Cross-Modal Asymmetric Bidirectional Motion Fields
Weakly Supervised Semantic Segmentation via Adversarial Learning of Classifier and Reconstructor
Cross-Guided Optimization of Radiance Fields With Multi-View Image Super-Resolution for High-Resolution Novel View Synthesis
Reconstructing Animatable Categories From Videos
Removing Objects From Neural Radiance Fields
Planning-Oriented Autonomous Driving
BEDLAM: A Synthetic Dataset of Bodies Exhibiting Detailed Lifelike Animated Motion
Detecting Backdoors in Pre-Trained Encoders
Data-Efficient Large Scale Place Recognition With Graded Similarity Supervision
Back to the Source: Diffusion-Driven Adaptation To Test-Time Corruption
Where Is My Wallet? Modeling Object Proposal Sets for Egocentric Visual Query Localization
VisFusion: Visibility-Aware Online 3D Scene Reconstruction From Videos
CoMFormer: Continual Learning in Semantic and Panoptic Segmentation
NeMo: Learning 3D Neural Motion Fields From Multiple Video Instances of the Same Action
Seeing Beyond the Brain: Conditional Diffusion Model With Sparse Masked Modeling for Vision Decoding
Boosting Video Object Segmentation via Space-Time Correspondence Learning
Exploring and Utilizing Pattern Imbalance
TransFlow: Transformer As Flow Learner
Detecting and Grounding Multi-Modal Media Manipulation
Learning and Aggregating Lane Graphs for Urban Automated Driving
ULIP: Learning a Unified Representation of Language, Images, and Point Clouds for 3D Understanding
FeatER: An Efficient Network for Human Reconstruction via Feature Map-Based TransformER
Text2Scene: Text-Driven Indoor Scene Stylization With Part-Aware Details
GANmouflage: 3D Object Nondetection With Texture Fields
Vision Transformer With Super Token Sampling
Reproducible Scaling Laws for Contrastive Language-Image Learning
Interactive Segmentation of Radiance Fields
V2X-Seq: A Large-Scale Sequential Dataset for Vehicle-Infrastructure Cooperative Perception and Forecasting
GeoMAE: Masked Geometric Target Prediction for Self-Supervised Point Cloud Pre-Training
GP-VTON: Towards General Purpose Virtual Try-On via Collaborative Local-Flow Global-Parsing Learning
One-Shot High-Fidelity Talking-Head Synthesis With Deformable Neural Radiance Field
LSTFE-Net: Long Short-Term Feature Enhancement Network for Video Small Object Detection
Decoupled Multimodal Distilling for Emotion Recognition
Discrete Point-Wise Attack Is Not Enough: Generalized Manifold Adversarial Attack for Face Recognition
Towards All-in-One Pre-Training via Maximizing Multi-Modal Mutual Information
Bi-Directional Distribution Alignment for Transductive Zero-Shot Learning
AGAIN: Adversarial Training With Attribution Span Enlargement and Hybrid Feature Fusion
Promoting Semantic Connectivity: Dual Nearest Neighbors Contrastive Learning for Unsupervised Domain Generalization
Generalized Relation Modeling for Transformer Tracking
3D Video Object Detection With Learnable Object-Centric Global Optimization
Flexible-Cm GAN: Towards Precise 3D Dose Prediction in Radiotherapy
CR-FIQA: Face Image Quality Assessment by Learning Sample Relative Classifiability
Exploiting Completeness and Uncertainty of Pseudo Labels for Weakly Supervised Video Anomaly Detection
Deep Dive Into Gradients: Better Optimization for 3D Object Detection With Gradient-Corrected IoU Supervision
PeakConv: Learning Peak Receptive Field for Radar Semantic Segmentation
TWINS: A Fine-Tuning Framework for Improved Transferability of Adversarial Robustness and Generalization
Learning Video Representations From Large Language Models
Cut and Learn for Unsupervised Object Detection and Instance Segmentation
ImageBind: One Embedding Space To Bind Them All
OmniMAE: Single Model Masked Pretraining on Images and Videos
Universal Instance Perception As Object Discovery and Retrieval
GM-NeRF: Learning Generalizable Model-Based Neural Radiance Fields From Multi-View Images
SeqTrack: Sequence to Sequence Learning for Visual Object Tracking
ARKitTrack: A New Diverse Dataset for Tracking Using Mobile RGB-D Data
Recognizability Embedding Enhancement for Very Low-Resolution Face Recognition and Quality Estimation
Common Pets in 3D: Dynamic New-View Synthesis of Real-Life Deformable Categories
Improving Generalization of Meta-Learning With Inverted Regularization at Inner-Level
Visual Language Pretrained Multiple Instance Zero-Shot Transfer for Histopathology Images
Zero-Shot Noise2Noise: Efficient Image Denoising Without Any Data
Egocentric Auditory Attention Localization in Conversations
Therbligs in Action: Video Understanding Through Motion Primitives
Learning Analytical Posterior Probability for Human Mesh Recovery
Vision Transformers Are Parameter-Efficient Audio-Visual Learners
Perspective Fields for Single Image Camera Calibration
CoralStyleCLIP: Co-Optimized Region and Layer Selection for Image Editing
LEMaRT: Label-Efficient Masked Region Transform for Image Harmonization
Adversarial Normalization: I Can Visualize Everything (ICE)
Gated Stereo: Joint Depth Estimation From Gated and Wide-Baseline Active Stereo Cues
Building Rearticulable Models for Arbitrary 3D Objects From 4D Point Clouds
GazeNeRF: 3D-Aware Gaze Redirection With Neural Radiance Fields
Disentangling Writer and Character Styles for Handwriting Generation
MP-Former: Mask-Piloted Transformer for Image Segmentation
Lite DETR: An Interleaved Multi-Scale Encoder for Efficient DETR
OmniCity: Omnipotent City Understanding With Multi-Level and Multi-View Images
YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors
Learning Procedure-Aware Video Representation From Instructional Videos and Their Narrations
Affordance Grounding From Demonstration Video To Target Image
A Large-Scale Robustness Analysis of Video Action Recognition Models
Solving 3D Inverse Problems Using Pre-Trained 2D Diffusion Models
Parallel Diffusion Models of Operator and Image for Blind Inverse Problems
Transformer-Based Unified Recognition of Two Hands Manipulating Objects
ARO-Net: Learning Implicit Fields From Anchored Radial Observations
PIVOT: Prompting for Video Continual Learning
Rate Gradient Approximation Attack Threats Deep Spiking Neural Networks
ISBNet: A 3D Point Cloud Instance Segmentation Network With Instance-Aware Sampling and Box-Aware Dynamic Convolution
Object Detection With Self-Supervised Scene Adaptation
Megahertz Light Steering Without Moving Parts
SurfelNeRF: Neural Surfel Radiance Fields for Online Photorealistic Reconstruction of Indoor Scenes
Dream3D: Zero-Shot Text-to-3D Synthesis Using 3D Shape Prior and Text-to-Image Diffusion Models
ObjectMatch: Robust Registration Using Canonical Object Correspondences
PanelNet: Understanding 360 Indoor Environment via Panel Representation
Selective Structured State-Spaces for Long-Form Video Understanding
Movies2Scenes: Using Movie Metadata To Learn Scene Representation
PMatch: Paired Masked Image Modeling for Dense Geometric Matching
TeSLA: Test-Time Self-Learning With Automatic Adversarial Augmentation
RIFormer: Keep Your Vision Backbone Effective but Removing Token Mixer
Masked Scene Contrast: A Scalable Framework for Unsupervised 3D Representation Learning
AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning With Masked Autoencoders
3D Human Keypoints Estimation From Point Clouds in the Wild Without Human Labels
ViP3D: End-to-End Visual Trajectory Prediction via 3D Agent Queries
SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer
Neural Pixel Composition for 3D-4D View Synthesis From Multi-Views
Self-Supervised Representation Learning for CAD
Vision Transformers Are Good Mask Auto-Labelers
VoxFormer: Sparse Voxel Transformer for Camera-Based 3D Semantic Scene Completion
Knowledge Distillation for 6D Pose Estimation by Aligning Distributions of Local Predictions
Benchmarking Self-Supervised Learning on Diverse Pathology Datasets
Learning To Fuse Monocular and Multi-View Cues for Multi-Frame Depth Estimation in Dynamic Scenes
VILA: Learning Image Aesthetics From User Comments With Vision-Language Pretraining
Are Deep Neural Networks SMARTer Than Second Graders?
C-SFDA: A Curriculum Learning Aided Self-Training Framework for Efficient Source Free Domain Adaptation
Re2TAL: Rewiring Pretrained Video Backbones for Reversible Temporal Action Localization
Detecting Backdoors During the Inference Stage Based on Corruption Robustness Consistency
BEVHeight: A Robust Framework for Vision-Based Roadside 3D Object Detection
expOSE: Accurate Initialization-Free Projective Factorization Using Exponential Regularization
Unite and Conquer: Plug & Play Multi-Modal Synthesis Using Diffusion Models
Open-Vocabulary Attribute Detection
Preserving Linear Separability in Continual Learning by Backward Feature Projection
GINA-3D: Learning To Generate Implicit Neural Assets in the Wild
Affection: Learning Affective Explanations for Real-World Visual Data
SCADE: NeRFs from Space Carving With Ambiguity-Aware Depth Estimates
Visual Programming: Compositional Visual Reasoning Without Training
Multi-Realism Image Compression With a Conditional Generator
Clothed Human Performance Capture With a Double-Layer Neural Radiance Fields
Learning Weather-General and Weather-Specific Features for Image Restoration Under Multiple Adverse Weather Conditions
H2ONet: Hand-Occlusion-and-Orientation-Aware Network for Real-Time 3D Hand Mesh Reconstruction
Event-Based Blurry Frame Interpolation Under Blind Exposure
Achieving a Better Stability-Plasticity Trade-Off via Auxiliary Networks in Continual Learning
Re-Basin via Implicit Sinkhorn Differentiation
Efficient Scale-Invariant Generator With Column-Row Entangled Pixel Synthesis
3D Video Loops From Asynchronous Input
BASiS: Batch Aligned Spectral Embedding Space
Temporal Interpolation Is All You Need for Dynamic Neural Radiance Fields
DiGA: Distil To Generalize and Then Adapt for Domain Adaptive Semantic Segmentation
Open-Vocabulary Semantic Segmentation With Mask-Adapted CLIP
Mask3D: Pre-Training 2D Vision Transformers by Learning Masked 3D Priors
Privacy-Preserving Representations Are Not Enough: Recovering Scene Content From Camera Poses
PointCert: Point Cloud Classification With Deterministic Certified Robustness Guarantees
Revealing the Dark Secrets of Masked Image Modeling
Human Pose As Compositional Tokens
Fusing Pre-Trained Language Models With Multimodal Prompts Through Reinforcement Learning
Meta Compositional Referring Expression Segmentation
SDC-UDA: Volumetric Unsupervised Domain Adaptation Framework for Slice-Direction Continuous Cross-Modality Medical Image Segmentation
Balanced Spherical Grid for Egocentric View Synthesis
OvarNet: Towards Open-Vocabulary Object Attribute Recognition
AutoAD: Movie Description in Context
Visual Recognition by Request
Wavelet Diffusion Models Are Fast and Scalable Image Generators
HOLODIFFUSION: Training a 3D Diffusion Model Using 2D Images
TopDiG: Class-Agnostic Topological Directional Graph Extraction From Remote Sensing Images
Progressive Backdoor Erasing via Connecting Backdoor and Adversarial Attacks
Normalizing Flow Based Feature Synthesis for Outlier-Aware Object Detection
Side Adapter Network for Open-Vocabulary Semantic Segmentation
TOPLight: Lightweight Neural Networks With Task-Oriented Pretraining for Visible-Infrared Recognition
PIP-Net: Patch-Based Intuitive Prototypes for Interpretable Image Classification
Blemish-Aware and Progressive Face Retouching With Limited Paired Data
Self-Guided Diffusion Models
Leveraging Temporal Context in Low Representational Power Regimes
Multi-Granularity Archaeological Dating of Chinese Bronze Dings Based on a Knowledge-Guided Relation Graph
Depth Estimation From Indoor Panoramas With Neural Scene Representation
Progressive Spatio-Temporal Alignment for Efficient Event-Based Motion Estimation
Learning Expressive Prompting With Residuals for Vision Transformers
Sharpness-Aware Gradient Matching for Domain Generalization
UV Volumes for Real-Time Rendering of Editable Free-View Human Performance
Local-to-Global Registration for Bundle-Adjusting Neural Radiance Fields
Weakly Supervised Video Emotion Detection and Prediction via Cross-Modal Temporal Erasing Network
BEV-LaneDet: An Efficient 3D Lane Detection Based on Virtual Camera via Key-Points
Text-Visual Prompting for Efficient 2D Temporal Video Grounding
NIKI: Neural Inverse Kinematics With Invertible Neural Networks for 3D Human Pose and Shape Estimation
Learning Transferable Spatiotemporal Representations From Natural Script Knowledge
Diffusion-Based Signed Distance Fields for 3D Shape Generation
HDR Imaging With Spatially Varying Signal-to-Noise Ratios
ConvNeXt V2: Co-Designing and Scaling ConvNets With Masked Autoencoders
Audio-Visual Grouping Network for Sound Localization From Mixtures
Self-Supervised Learning From Images With a Joint-Embedding Predictive Architecture
Shortcomings of Top-Down Randomization-Based Sanity Checks for Evaluations of Deep Neural Network Explanations
Structured Kernel Estimation for Photon-Limited Deconvolution
Hard Patches Mining for Masked Image Modeling
Multi-Mode Online Knowledge Distillation for Self-Supervised Visual Representation Learning
Decentralized Learning With Multi-Headed Distillation
Open-World Multi-Task Control Through Goal-Aware Representation Learning and Adaptive Horizon Prediction
Learning Transformation-Predictive Representations for Detection and Description of Local Features
Graph Representation for Order-Aware Visual Transformation
MoDi: Unconditional Motion Synthesis From Diverse Data
PSVT: End-to-End Multi-Person 3D Pose and Shape Estimation With Progressive Video Transformers
Style Projected Clustering for Domain Generalized Semantic Segmentation
Learning Steerable Function for Efficient Image Resampling
Enhanced Multimodal Representation Learning With Cross-Modal KD
Improving Graph Representation for Point Cloud Segmentation via Attentive Filtering
BoxTeacher: Exploring High-Quality Pseudo Labels for Weakly Supervised Instance Segmentation
Zero-Shot Dual-Lens Super-Resolution
Toward RAW Object Detection: A New Benchmark and a New Model
MAGVIT: Masked Generative Video Transformer
Continuous Landmark Detection With 3D Queries
ToThePoint: Efficient Contrastive Learning of 3D Point Clouds via Recycling
FashionSAP: Symbols and Attributes Prompt for Fine-Grained Fashion Vision-Language Pre-Training
Neural Voting Field for Camera-Space 3D Hand Pose Estimation
On Calibrating Semantic Segmentation Models: Analyses and an Algorithm
Multimodal Prompting With Missing Modalities for Visual Recognition
The Dialog Must Go On: Improving Visual Dialog via Generative Self-Training
MetaPortrait: Identity-Preserving Talking Head Generation With Fast Personalized Adaptation
Perception and Semantic Aware Regularization for Sequential Confidence Calibration
Trainable Projected Gradient Method for Robust Fine-Tuning
HAAV: Hierarchical Aggregation of Augmented Views for Image Captioning
Masked Images Are Counterfactual Samples for Robust Fine-Tuning
SparseFusion: Distilling View-Conditioned Diffusion for 3D Reconstruction
One-Shot Model for Mixed-Precision Quantization
The Resource Problem of Using Linear Layer Leakage Attack in Federated Learning
OCTET: Object-Aware Counterfactual Explanations
MixMAE: Mixed and Masked Autoencoder for Efficient Pretraining of Hierarchical Vision Transformers
Shakes on a Plane: Unsupervised Depth Estimation From Unstabilized Photography
Hubs and Hyperspheres: Reducing Hubness and Improving Transductive Few-Shot Learning With Hyperspherical Embeddings
On the Effects of Self-Supervision and Contrastive Alignment in Deep Multi-View Clustering
DeepVecFont-v2: Exploiting Transformers To Synthesize Vector Fonts With Higher Quality
Blowing in the Wind: CycleNet for Human Cinemagraphs From Still Images
Reconstructing Signing Avatars From Video Using Linguistic Priors
Four-View Geometry With Unknown Radial Distortion
Both Style and Distortion Matter: Dual-Path Unsupervised Domain Adaptation for Panoramic Semantic Segmentation
PoseFormerV2: Exploring Frequency Domain for Efficient and Robust 3D Human Pose Estimation
DiGeo: Discriminative Geometry-Aware Learning for Generalized Few-Shot Object Detection
Fuzzy Positive Learning for Semi-Supervised Semantic Segmentation
Out-of-Distributed Semantic Pruning for Robust Semi-Supervised Learning
Domain Generalized Stereo Matching via Hierarchical Visual Transformation
Quality-Aware Pre-Trained Models for Blind Image Quality Assessment
Fine-Grained Audible Video Description
Modeling the Distributional Uncertainty for Salient Object Detection Models
Masked Representation Learning for Domain Generalized Stereo Matching
Physically Realizable Natural-Looking Clothing Textures Evade Person Detectors via 3D Modeling
Decoupling MaxLogit for Out-of-Distribution Detection
Federated Learning With Data-Agnostic Distribution Fusion
OVTrack: Open-Vocabulary Multiple Object Tracking
CaPriDe Learning: Confidential and Private Decentralized Learning Based on Encryption-Friendly Distillation Loss
StyLess: Boosting the Transferability of Adversarial Examples
HOICLIP: Efficient Knowledge Transfer for HOI Detection With Vision-Language Models
HandsOff: Labeled Dataset Generation With No Additional Human Annotations
Feature Shrinkage Pyramid for Camouflaged Object Detection With Transformers
Improving Visual Representation Learning Through Perceptual Understanding
Automatic High Resolution Wire Segmentation and Removal
PixHt-Lab: Pixel Height Based Light Effect Generation for Image Compositing
Visual Atoms: Pre-Training Vision Transformers With Sinusoidal Waves
Zero-Shot Generative Model Adaptation via Image-Specific Prompt Learning
CoWs on Pasture: Baselines and Benchmarks for Language-Driven Zero-Shot Object Navigation
Enhancing Deformable Local Features by Jointly Learning To Detect and Describe Keypoints
PRISE: Demystifying Deep Lucas-Kanade With Strongly Star-Convex Constraints for Multimodel Image Alignment
G-MSM: Unsupervised Multi-Shape Matching With Graph-Based Affinity Priors
Power Bundle Adjustment for Large-Scale 3D Reconstruction
Behind the Scenes: Density Fields for Single View Reconstruction
Rawgment: Noise-Accounted RAW Augmentation Enables Recognition in a Wide Variety of Environments
Relational Space-Time Query in Long-Form Videos
Semidefinite Relaxations for Robust Multiview Triangulation
Adjustment and Alignment for Unbiased Open Set Domain Adaptation
Learning Federated Visual Prompt in Null Space for MRI Reconstruction
Domain Expansion of Image Generators
NULL-Text Inversion for Editing Real Images Using Guided Diffusion Models
Backdoor Defense via Deconfounded Representation Learning
Complementary Intrinsics From Neural Radiance Fields and CNNs for Outdoor Scene Relighting
HumanGen: Generating Human Radiance Fields With Explicit Priors
NeRDi: Single-View NeRF Synthesis With Language-Guided Diffusion As General Image Priors
SPARF: Neural Radiance Fields From Sparse and Noisy Poses
Devil’s on the Edges: Selective Quad Attention for Scene Graph Generation
Finetune Like You Pretrain: Improved Finetuning of Zero-Shot Vision Models
REC-MV: REconstructing 3D Dynamic Cloth From Monocular Videos
MM-3DScene: 3D Scene Understanding by Customizing Masked Modeling With Informative-Preserved Reconstruction and Self-Distilled Consistency
MVImgNet: A Large-Scale Dataset of Multi-View Images
UniSim: A Neural Closed-Loop Sensor Simulator
SFD2: Semantic-Guided Feature Detection and Description
Towards Effective Visual Representations for Partial-Label Learning
ReLight My NeRF: A Dataset for Novel View Synthesis and Relighting of Real World Objects
HARP: Personalized Hand Reconstruction From a Monocular RGB Video
Making Vision Transformers Efficient From a Token Sparsification View
MIST: Multi-Modal Iterative Spatial-Temporal Transformer for Long-Form Video Question Answering
Position-Guided Text Prompt for Vision-Language Pre-Training
Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning With Multimodal Models
Polarized Color Image Denoising
Multi Domain Learning for Motion Magnification
SeaThru-NeRF: Neural Radiance Fields in Scattering Media
DIFu: Depth-Guided Implicit Function for Clothed Human Reconstruction
Panoptic Lifting for 3D Scene Understanding With Neural Fields
DualVector: Unsupervised Vector Font Synthesis With Dual-Part Representation
SVGformer: Representation Learning for Continuous Vector Graphics Using Transformers
GeoVLN: Learning Geometry-Enhanced Visual Representation With Slot Attention for Vision-and-Language Navigation
MAIR: Multi-View Attention Inverse Rendering With 3D Spatially-Varying Lighting Estimation
Learning 3D Scene Priors With 2D Supervision
ProTéGé: Untrimmed Pretraining for Video Temporal Grounding by Video Temporal Grounding
Video Compression With Entropy-Constrained Neural Representations
Learning From Unique Perspectives: User-Aware Saliency Modeling
Learning 3D Representations From 2D Pre-Trained Models via Image-to-Point Masked Autoencoders
Starting From Non-Parametric Networks for 3D Point Cloud Analysis
NeRFLix: High-Quality Neural View Synthesis by Learning a Degradation-Driven Inter-Viewpoint MiXer
Point2Pix: Photo-Realistic Point Cloud Rendering via Neural Radiance Fields
TriVol: Point Cloud Rendering via Triple Volumes
DR2: Diffusion-Based Robust Degradation Remover for Blind Face Restoration
ABLE-NeRF: Attention-Based Rendering With Learnable Embeddings for Neural Radiance Field
Exploring Structured Semantic Prior for Multi Label Recognition With Incomplete Labels
LG-BPN: Local and Global Blind-Patch Network for Self-Supervised Real-World Denoising
CLIP2Scene: Towards Label-Efficient 3D Scene Understanding by CLIP
FlexiViT: One Model for All Patch Sizes
CLIPPO: Image-and-Language Understanding From Pixels Only
DistractFlow: Improving Optical Flow Estimation via Realistic Distractions and Pseudo-Labeling
BUFFER: Balancing Accuracy, Efficiency, and Generalizability in Point Cloud Registration
DivClust: Controlling Diversity in Deep Clustering
On Data Scaling in Masked Image Modeling
Masked Image Training for Generalizable Deep Image Denoising
ScaleKD: Distilling Scale-Aware Knowledge in Small Object Detector
Shape-Aware Text-Driven Layered Video Editing
Generalizable Implicit Neural Representations via Instance Pattern Composers
Behavioral Analysis of Vision-and-Language Navigation Agents
HierVL: Learning Hierarchical Video-Language Embeddings
Learning Geometry-Aware Representations by Sketching
Pruning Parameterization With Bi-Level Optimization for Efficient Semantic Segmentation on the Edge
Unsupervised Deep Probabilistic Approach for Partial Point Cloud Registration
StyleIPSB: Identity-Preserving Semantic Basis of StyleGAN for High Fidelity Face Swapping
Federated Domain Generalization With Generalization Adjustment
STMixer: A One-Stage Sparse Action Detector
Learning Discriminative Representations for Skeleton Based Action Recognition
On the Effectiveness of Partial Variance Reduction in Federated Learning With Heterogeneous Data
Seeing With Sound: Long-range Acoustic Beamforming for Multimodal Scene Understanding
Unicode Analogies: An Anti-Objectivist Visual Reasoning Challenge
Slide-Transformer: Hierarchical Vision Transformer With Local Self-Attention
Dual-Bridging With Adversarial Noise Generation for Domain Adaptive rPPG Estimation
L-CoIns: Language-Based Colorization With Instance Awareness
Diversity-Aware Meta Visual Prompting
Tunable Convolutions With Parametric Multi-Loss Optimization
Efficient View Synthesis and 3D-Based Multi-Frame Denoising With Multiplane Feature Representations
An Image Quality Assessment Dataset for Portraits
FREDOM: Fairness Domain Adaptation Approach to Semantic Scene Understanding
High-Fidelity Generalized Emotional Talking Face Generation With Multi-Modal Emotion Space Learning
MonoATT: Online Monocular 3D Object Detection With Adaptive Token Transformer
Learning Visibility Field for Detailed 3D Human Reconstruction and Relighting
Complexity-Guided Slimmable Decoder for Efficient Deep Video Compression
Glocal Energy-Based Learning for Few-Shot Open-Set Recognition
MACARONS: Mapping and Coverage Anticipation With RGB Online Self-Supervision
Conjugate Product Graphs for Globally Optimal 2D-3D Shape Matching
Real-Time Multi-Person Eyeblink Detection in the Wild for Untrimmed Video
RUST: Latent Neural Scene Representations From Unposed Imagery
Humans As Light Bulbs: 3D Human Reconstruction From Thermal Reflection
What You Can Reconstruct From a Shadow
Uncertainty-Aware Vision-Based Metric Cross-View Geolocalization
Stare at What You See: Masked Image Modeling Without Reconstruction
Network-Free, Unsupervised Semantic Segmentation With Synthetic Images
Q-DETR: An Efficient Low-Bit Quantized Detection Transformer
Progressively Optimized Local Radiance Fields for Robust View Synthesis
Hierarchical Neural Memory Network for Low Latency Event Processing
Attention-Based Point Cloud Edge Sampling
Initialization Noise in Image Gradients and Saliency Maps
A Light Touch Approach to Teaching Transformers Multi-View Geometry
Active Finetuning: Exploiting Annotation Budget in the Pretraining-Finetuning Paradigm
DynamicStereo: Consistent Dynamic Depth From Stereo Videos
RealFusion: 360° Reconstruction of Any Object From a Single Image
PC2: Projection-Conditioned Point Cloud Diffusion for Single-Image 3D Reconstruction
Learning Conditional Attributes for Compositional Zero-Shot Learning
Masked Autoencoders Enable Efficient Knowledge Distillers
DINN360: Deformable Invertible Neural Network for Latitude-Aware 360° Image Rescaling
One-Stage 3D Whole-Body Mesh Recovery With Component Aware Transformer
Human-Art: A Versatile Human-Centric Dataset Bridging Natural and Artificial Scenes
Optimization-Inspired Cross-Attention Transformer for Compressive Sensing
Understanding Imbalanced Semantic Segmentation Through Neural Collapse
Hierarchical Dense Correlation Distillation for Few-Shot Segmentation
Transformer-Based Learned Optimization
NeuFace: Realistic 3D Neural Face Rendering From Multi-View Images
Residual Degradation Learning Unfolding Framework With Mixing Priors Across Spectral and Spatial for Compressive Spectral Imaging
Revisiting Rolling Shutter Bundle Adjustment: Toward Accurate and Fast Solution
SMPConv: Self-Moving Point Representations for Continuous Convolution
Diffusion-Based Generation, Optimization, and Planning in 3D Scenes
LayoutDM: Transformer-Based Diffusion Model for Layout Generation
Decoupling-and-Aggregating for Image Exposure Correction
JAWS: Just a Wild Shot for Cinematic Transfer in Neural Radiance Fields
SplineCam: Exact Visualization and Characterization of Deep Network Geometry and Decision Boundaries
Learning To Measure the Point Cloud Reconstruction Loss in a Representation Space
Learning Neural Proto-Face Field for Disentangled 3D Face Modeling in the Wild
Alias-Free Convnets: Fractional Shift Invariance via Polynomial Activations
ACL-SPC: Adaptive Closed-Loop System for Self-Supervised Point Cloud Completion
FEND: A Future Enhanced Distribution-Aware Contrastive Learning Framework for Long-Tail Trajectory Prediction
Annealing-Based Label-Transfer Learning for Open World Object Detection
Instance-Aware Domain Generalization for Face Anti-Spoofing
Geometric Visual Similarity Learning in 3D Medical Image Self-Supervised Pre-Training
Unsupervised Deep Asymmetric Stereo Matching With Spatially-Adaptive Self-Similarity
No One Left Behind: Improving the Worst Categories in Long-Tailed Learning
FlatFormer: Flattened Window Attention for Efficient Point Cloud Transformer
Sample-Level Multi-View Graph Clustering
Unlearnable Clusters: Towards Label-Agnostic Unlearnable Examples
Multi-Label Compound Expression Recognition: C-EXPR Database & Network
Multi-Concept Customization of Text-to-Image Diffusion
Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction
Fine-Grained Image-Text Matching by Cross-Modal Hard Aligning Network
Parameter Efficient Local Implicit Image Function Network for Face Segmentation
Revisiting Reverse Distillation for Anomaly Detection
Perception-Oriented Single Image Super-Resolution Using Optimal Objective Estimation
VGFlow: Visibility Guided Flow Network for Human Reposing
Turning Strengths Into Weaknesses: A Certified Robustness Inspired Attack Framework Against Graph Neural Networks
Center Focusing Network for Real-Time LiDAR Panoptic Segmentation
Harmonious Teacher for Cross-Domain Object Detection
SelfME: Self-Supervised Motion Learning for Micro-Expression Recognition
Mask-Guided Matting in the Wild
Self-Positioning Point-Based Transformer for Point Cloud Understanding
Few-Shot Geometry-Aware Keypoint Localization
Instant Multi-View Head Capture Through Learnable Registration
Trade-Off Between Robustness and Accuracy of Vision Transformers
A Loopback Network for Explainable Microvascular Invasion Classification
Implicit Identity Leakage: The Stumbling Block to Improving Deepfake Detection Generalization
Joint Token Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers
KERM: Knowledge Enhanced Reasoning for Vision-and-Language Navigation
Image Quality-Aware Diagnosis via Meta-Knowledge Co-Embedding
Masked Jigsaw Puzzle: A Versatile Position Embedding for Vision Transformers
Search-Map-Search: A Frame Selection Paradigm for Action Recognition
DISC: Learning From Noisy Labels via Dynamic Instance-Specific Selection and Correction
Renderable Neural Radiance Map for Visual Navigation
Learning the Distribution of Errors in Stereo Matching for Joint Disparity and Uncertainty Estimation
Learning To Generate Image Embeddings With User-Level Differential Privacy
Persistent Nature: A Generative Model of Unbounded 3D Worlds
Improving Vision-and-Language Navigation by Generating Future-View Image Semantics
Galactic: Scaling End-to-End Reinforcement Learning for Rearrangement at 100k Steps-per-Second
Deep Semi-Supervised Metric Learning With Mixed Label Propagation
Unbiased Scene Graph Generation in Videos
Neural Rate Estimator and Unsupervised Learning for Efficient Distributed Image Analytics in Split-DNN Models
RealImpact: A Dataset of Impact Sound Fields for Real Objects
RMLVQA: A Margin Loss Approach for Visual Question Answering With Language Biases
Lookahead Diffusion Probabilistic Models for Refining Mean Estimation
Learning Articulated Shape With Keypoint Pseudo-Labels From Web Images
Modular Memorability: Tiered Representations for Video Memorability Prediction
Shifted Diffusion for Text-to-Image Generation
CLIP-S4: Language-Guided Self-Supervised Semantic Segmentation
Attribute-Preserving Face Dataset Anonymization via Latent Code Optimization
MetaViewer: Towards a Unified Multi-View Representation
Sequential Training of GANs Against GAN-Classifiers Reveals Correlated “Knowledge Gaps” Present Among Independently Trained GAN Instances
Improving Robustness of Semantic Segmentation to Motion-Blur Using Class-Centric Augmentation
Instance-Specific and Model-Adaptive Supervision for Semi-Supervised Semantic Segmentation
MotionDiffuser: Controllable Multi-Agent Motion Prediction Using Diffusion
Train-Once-for-All Personalization
Seasoning Model Soups for Robustness to Adversarial and Natural Distribution Shifts
You Do Not Need Additional Priors or Regularizers in Retinex-Based Low-Light Image Enhancement
Uncovering the Disentanglement Capability in Text-to-Image Diffusion Models
Semantic-Conditional Diffusion Networks for Image Captioning
Co-SLAM: Joint Coordinate and Sparse Parametric Encodings for Neural Real-Time SLAM
CAP: Robust Point Cloud Classification via Semantic and Structural Modeling
Jedi: Entropy-Based Localization and Removal of Adversarial Patches
Generating Features With Increased Crop-Related Diversity for Few-Shot Object Detection
iCLIP: Bridging Image Classification and Contrastive Language-Image Pre-Training for Visual Recognition
Coreset Sampling From Open-Set for Fine-Grained Self-Supervised Learning
Adaptive Data-Free Quantization
High-Frequency Stereo Matching Network
Masked Autoencoding Does Not Help Natural Language Supervision at Scale
Diffusion Art or Digital Forgery? Investigating Data Replication in Diffusion Models
Principles of Forgetting in Domain-Incremental Semantic Segmentation in Adverse Weather Conditions
Two-Way Multi-Label Loss
Transferable Adversarial Attacks on Vision Transformers With Token Gradient Regularization
Robust 3D Shape Classification via Non-Local Graph Attention Network
Single View Scene Scale Estimation Using Scale Field
Solving Relaxations of MAP-MRF Problems: Combinatorial In-Face Frank-Wolfe Directions
AUNet: Learning Relations Between Action Units for Face Forgery Detection
Learning a 3D Morphable Face Reflectance Model From Low-Cost Data
Frame-Event Alignment and Fusion Network for High Frame Rate Tracking
You Need Multiple Exiting: Dynamic Early Exiting for Accelerating Unified Vision Language Model
CLIP2: Contrastive Language-Image-Point Pretraining From Real-World Point Cloud Data
Learning To Generate Language-Supervised and Open-Vocabulary Scene Graph Using Pre-Trained Visual-Semantic Space
Introducing Competition To Boost the Transferability of Targeted Adversarial Examples Through Clean Feature Mixup
DINER: Disorder-Invariant Implicit Neural Representation
DualRefine: Self-Supervised Depth and Pose Estimation Through Iterative Epipolar Sampling and Refinement Toward Equilibrium
Manipulating Transfer Learning for Property Inference
Image as a Foreign Language: BEiT Pretraining for Vision and Vision-Language Tasks
MaskCon: Masked Contrastive Learning for Coarse-Labelled Dataset
Leveraging Inter-Rater Agreement for Classification in the Presence of Noisy Labels
Logical Implications for Visual Question Answering Consistency
Independent Component Alignment for Multi-Task Learning
Open Vocabulary Semantic Segmentation With Patch Aligned Contrastive Learning
MMG-Ego4D: Multimodal Generalization in Egocentric Action Recognition
Deep Deterministic Uncertainty: A New Simple Baseline
SViTT: Temporal Learning of Sparse Video-Text Transformers
ERNIE-ViLG 2.0: Improving Text-to-Image Diffusion Model With Knowledge-Enhanced Mixture-of-Denoising-Experts
Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image Person Retrieval
Open-Set Representation Learning Through Combinatorial Embedding
DSFNet: Dual Space Fusion Network for Occlusion-Robust 3D Dense Face Alignment
Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert
Hierarchical Semantic Contrast for Scene-Aware Video Anomaly Detection
HypLiLoc: Towards Effective LiDAR Pose Regression With Hyperbolic Fusion
Learning Orthogonal Prototypes for Generalized Few-Shot Semantic Segmentation
Local-Guided Global: Paired Similarity Representation for Visual Reinforcement Learning
Boosting Accuracy and Robustness of Student Models via Adaptive Adversarial Distillation
Hyperspherical Embedding for Point Cloud Completion
Efficient Hierarchical Entropy Model for Learned Point Cloud Compression
Improving the Transferability of Adversarial Samples by Path-Augmented Method
SIEDOB: Semantic Image Editing by Disentangling Object and Background
GradMA: A Gradient-Memory-Based Accelerated Federated Learning With Alleviated Catastrophic Forgetting
Few-Shot Class-Incremental Learning via Class-Aware Bilateral Distillation
Neural Lens Modeling
A Probabilistic Framework for Lifelong Test-Time Adaptation
ZBS: Zero-Shot Background Subtraction via Instance-Level Background Modeling and Foreground Selection
DeAR: Debiasing Vision-Language Models With Additive Residuals
Deep Depth Estimation From Thermal Image
3D GAN Inversion With Facial Symmetry Prior
You Are Catching My Attention: Are Vision Transformers Bad Learners Under Backdoor Attacks?
Generalization Matters: Loss Minima Flattening via Parameter Hybridization for Efficient Online Knowledge Distillation
BiasAdv: Bias-Adversarial Augmentation for Model Debiasing
PartMix: Regularization Strategy To Learn Part Discovery for Visible-Infrared Person Re-Identification
DART: Diversify-Aggregate-Repeat Training Improves Generalization of Neural Networks
Towards High-Quality and Efficient Video Super-Resolution via Spatial-Temporal Data Overfitting
Towards Practical Plug-and-Play Diffusion Models
PMR: Prototypical Modal Rebalance for Multimodal Learning
Decoupling Learning and Remembering: A Bilevel Memory Framework With Knowledge Projection for Task-Incremental Learning
Exemplar-FreeSOLO: Enhancing Unsupervised Instance Segmentation With Exemplars
Visual Query Tuning: Towards Effective Usage of Intermediate Representations for Parameter and Memory Efficient Transfer Learning
PEFAT: Boosting Semi-Supervised Medical Image Classification via Pseudo-Loss Estimation and Feature Adversarial Training
Privacy-Preserving Adversarial Facial Features
MAGVLT: Masked Generative Vision-and-Language Transformer
Deep Random Projector: Accelerated Deep Image Prior
BEV-Guided Multi-Modality Fusion for Driving Perception
Dealing With Cross-Task Class Discrimination in Online Continual Learning
Tree Instance Segmentation With Temporal Contour Graph
Rethinking Few-Shot Medical Segmentation: A Vector Quantization View
NeFII: Inverse Rendering for Reflectance Decomposition With Near-Field Indirect Illumination
Bridging the Gap Between Model Explanations in Partially Annotated Multi-Label Classification
SlowLiDAR: Increasing the Latency of LiDAR-Based Detection Using Adversarial Examples
Unsupervised Intrinsic Image Decomposition With LiDAR Intensity
RIATIG: Reliable and Imperceptible Adversarial Text-to-Image Generation With Natural Prompts
Single Image Backdoor Inversion via Robust Smoothed Classifiers
SynthVSR: Scaling Up Visual Speech Recognition With Synthetic Supervision
Train/Test-Time Adaptation With Retrieval
Hierarchical Fine-Grained Image Forgery Detection and Localization
MammalNet: A Large-Scale Video Benchmark for Mammal Recognition and Behavior Understanding
Towards Fast Adaptation of Pretrained Contrastive Models for Multi-Channel Video-Language Retrieval
Contrastive Mean Teacher for Domain Adaptive Object Detectors
TMO: Textured Mesh Acquisition of Objects With a Mobile Device by Using Differentiable Rendering
InstantAvatar: Learning Avatars From Monocular Video in 60 Seconds
Neural Volumetric Memory for Visual Locomotion Control
Efficient On-Device Training via Gradient Filtering
SmartBrush: Text and Shape Guided Object Inpainting With Diffusion Model
NIFF: Alleviating Forgetting in Generalized Few-Shot Object Detection via Neural Instance Feature Forging
Unpaired Image-to-Image Translation With Shortest Path Regularization
NeRFLight: Fast and Light Neural Radiance Fields Using a Shared Feature Grid
PEAL: Prior-Embedded Explicit Attention Learning for Low-Overlap Point Cloud Registration
Transforming Radiance Field With Lipschitz Network for Photorealistic 3D Scene Stylization
Test Time Adaptation With Regularized Loss for Weakly Supervised Salient Object Detection
Fantastic Breaks: A Dataset of Paired 3D Scans of Real-World Broken Objects and Their Complete Counterparts
InstructPix2Pix: Learning To Follow Image Editing Instructions
Cross-Domain 3D Hand Pose Estimation With Dual Modalities
Source-Free Video Domain Adaptation With Spatial-Temporal-Historical Consistency Learning
PIDNet: A Real-Time Semantic Segmentation Network Inspired by PID Controllers
SmallCap: Lightweight Image Captioning Prompted With Retrieval Augmentation
Hidden Gems: 4D Radar Scene Flow Learning Using Cross-Modal Supervision
Learning To Render Novel Views From Wide-Baseline Stereo Pairs
Neural Texture Synthesis With Guided Correspondence
AShapeFormer: Semantics-Guided Object-Level Active Shape Encoding for 3D Object Detection via Transformers
Robust Test-Time Adaptation in Dynamic Scenarios
AnchorFormer: Point Cloud Completion From Discriminative Nodes
Bias in Pruned Vision Models: In-Depth Analysis and Countermeasures
Transformer Scale Gate for Semantic Segmentation
AccelIR: Task-Aware Image Compression for Accelerating Neural Restoration
A Unified Spatial-Angular Structured Light for Single-View Acquisition of Shape and Reflectance
SCOTCH and SODA: A Transformer Video Shadow Detection Framework
MobileVOS: Real-Time Video Object Segmentation Contrastive Learning Meets Knowledge Distillation
Neuralizer: General Neuroimage Analysis Without Re-Training
MarginMatch: Improving Semi-Supervised Learning With Pseudo-Margins
Detecting Human-Object Contact in Images
Efficient Verification of Neural Networks Against LVM-Based Specifications
Recurrent Vision Transformers for Object Detection With Event Cameras
SLACK: Stable Learning of Augmentations With Cold-Start and KL Regularization
SCConv: Spatial and Channel Reconstruction Convolution for Feature Redundancy
Diversity-Measurable Anomaly Detection
Visual Localization Using Imperfect 3D Models From the Internet
LANA: A Language-Capable Navigator for Instruction Following and Generation
MAESTER: Masked Autoencoder Guided Segmentation at Pixel Resolution for Accurate, Self-Supervised Subcellular Structure Recognition
HOTNAS: Hierarchical Optimal Transport for Neural Architecture Search
Co-Training 2L Submodels for Visual Recognition
Learning Rotation-Equivariant Features for Visual Correspondence
CFA: Class-Wise Calibrated Fair Adversarial Training
VectorFusion: Text-to-SVG by Abstracting Pixel-Based Diffusion Models
Hint-Aug: Drawing Hints From Foundation Vision Transformers Towards Boosted Few-Shot Parameter-Efficient Tuning
Fine-Grained Classification With Noisy Labels
Learning a Practical SDR-to-HDRTV Up-Conversion Using New Dataset and Degradation Models
BBDM: Image-to-Image Translation With Brownian Bridge Diffusion Models
Regularize Implicit Neural Representation by Itself
Multivariate, Multi-Frequency and Multimodal: Rethinking Graph Neural Networks for Emotion Recognition in Conversation
Elastic Aggregation for Federated Optimization
Learning a Deep Color Difference Metric for Photographic Images
Learning Debiased Representations via Conditional Attribute Interpolation
Cooperation or Competition: Avoiding Player Domination for Multi-Target Robustness via Adaptive Budgets
Re-GAN: Data-Efficient GANs Training via Architectural Reconfiguration
Patch-Mix Transformer for Unsupervised Domain Adaptation: A Game Perspective
CHMATCH: Contrastive Hierarchical Matching and Robust Adaptive Threshold Boosted Semi-Supervised Learning
Co-Speech Gesture Synthesis by Reinforcement Learning With Contrastive Pre-Trained Rewards
Deformable Mesh Transformer for 3D Human Mesh Recovery
WINNER: Weakly-Supervised hIerarchical decompositioN and aligNment for Spatio-tEmporal Video gRounding
Bit-Shrinking: Limiting Instantaneous Sharpness for Improving Post-Training Quantization
NÜWA-LIP: Language-Guided Image Inpainting With Defect-Free VQGAN
A Practical Upper Bound for the Worst-Case Attribution Deviations
Understanding Masked Image Modeling via Learning Occlusion Invariant Feature
Robust Generalization Against Photon-Limited Corruptions via Worst-Case Sharpness Minimization
Imitation Learning As State Matching via Differentiable Physics
Improving Generalization With Domain Convex Game
Learning To Generate Text-Grounded Mask for Open-World Semantic Segmentation From Only Image-Text Pairs
CLIP-Sculptor: Zero-Shot Generation of High-Fidelity and Diverse Shapes From Natural Language
Enhancing Multiple Reliability Measures via Nuisance-Extended Information Bottleneck
Where We Are and What We’re Looking At: Query Based Worldwide Image Geo-Localization Using Hierarchies and Scenes
Boosting Transductive Few-Shot Fine-Tuning With Margin-Based Uncertainty Weighting and Probability Regularization
How To Prevent the Poor Performance Clients for Personalized Federated Learning?
Generalist: Decoupling Natural and Robust Generalization
Sparsifiner: Learning Sparse Instance-Dependent Attention for Efficient Vision Transformers
FedDM: Iterative Distribution Matching for Communication-Efficient Federated Learning
From Images to Textual Prompts: Zero-Shot Visual Question Answering With Frozen Large Language Models
Architectural Backdoors in Neural Networks
CUDA: Convolution-Based Unlearnable Datasets
Simulated Annealing in Early Layers Leads to Better Generalization
Critical Learning Periods for Multisensory Integration in Deep Networks
Boosting Low-Data Instance Segmentation by Unsupervised Pre-Training With Saliency Prompt
Active Exploration of Multimodal Complementarity for Few-Shot Action Recognition
Foundation Model Drives Weakly Incremental Learning for Semantic Segmentation
Learning Neural Volumetric Representations of Dynamic Humans in Minutes
Frame Interpolation Transformer and Uncertainty Guidance
Interventional Bag Multi-Instance Learning on Whole-Slide Pathological Images
Enhanced Stable View Synthesis
Video Event Restoration Based on Keyframes for Video Anomaly Detection
Towards Transferable Targeted Adversarial Examples
Leverage Interactive Affinity for Affordance Learning
Interactive and Explainable Region-Guided Radiology Report Generation
PHA: Patch-Wise High-Frequency Augmentation for Transformer-Based Person Re-Identification
Towards Unbiased Volume Rendering of Neural Implicit Surfaces With Geometry Priors
MixNeRF: Modeling a Ray With Mixture Density for Novel View Synthesis From Sparse Inputs
StyleGene: Crossover and Mutation of Region-Level Facial Genes for Kinship Face Synthesis
Fair Scratch Tickets: Finding Fair Sparse Networks Without Weight Training
PermutoSDF: Fast Multi-View Reconstruction With Implicit Surfaces Using Permutohedral Lattices
PCR: Proxy-Based Contrastive Replay for Online Class-Incremental Continual Learning
Boundary-Aware Backward-Compatible Representation via Adversarial Learning in Image Retrieval
PointClustering: Unsupervised Point Cloud Pre-Training Using Transformation Invariance in Clustering
Beyond Attentive Tokens: Incorporating Token Importance and Diversity for Efficient Vision Transformers
Deep Polarization Reconstruction With PDAVIS Events
NeuDA: Neural Deformable Anchor for High-Fidelity Implicit Surface Reconstruction
PDPP: Projected Diffusion for Procedure Planning in Instructional Videos
CP3: Channel Pruning Plug-In for Point-Based Networks
ERM-KTP: Knowledge-Level Machine Unlearning via Knowledge Transfer
Few-Shot Semantic Image Synthesis With Class Affinity Transfer
Differentiable Architecture Search With Random Features
GKEAL: Gaussian Kernel Embedded Analytic Learning for Few-Shot Class Incremental Task
Extracting Class Activation Maps From Non-Discriminative Features As Well
A Simple Framework for Text-Supervised Semantic Segmentation
Region-Aware Pretraining for Open-Vocabulary Object Detection With Vision Transformers
Can’t Steal? Cont-Steal! Contrastive Stealing Attacks Against Image Encoders
Distilling Cross-Temporal Contexts for Continuous Sign Language Recognition
sRGB Real Noise Synthesizing With Neighboring Correlation-Aware Noise Model
Streaming Video Model
Orthogonal Annotation Benefits Barely-Supervised Medical Image Segmentation
PanoSwin: A Pano-Style Swin Transformer for Panorama Understanding
All Are Worth Words: A ViT Backbone for Diffusion Models
CVT-SLR: Contrastive Visual-Textual Transformation for Sign Language Recognition With Variational Alignment
Language Adaptive Weight Generation for Multi-Task Visual Grounding
VideoMAE V2: Scaling Video Masked Autoencoders With Dual Masking
GeoLayoutLM: Geometric Pre-Training for Visual Information Extraction
Align Your Latents: High-Resolution Video Synthesis With Latent Diffusion Models
Modeling Entities As Semantic Points for Visual Information Extraction in the Wild
Single Image Depth Prediction Made Better: A Multivariate Gaussian Take
DaFKD: Domain-Aware Federated Knowledge Distillation
Re-IQA: Unsupervised Learning for Image Quality Assessment in the Wild
Revisiting the P3P Problem
Shape-Erased Feature Learning for Visible-Infrared Person Re-Identification
DropKey for Vision Transformer
BiCro: Noisy Correspondence Rectification for Multi-Modality Data via Bi-Directional Cross-Modal Similarity Consistency
DA Wand: Distortion-Aware Selection Using Neural Mesh Parameterization
Similarity Maps for Self-Training Weakly-Supervised Phrase Grounding
Unsupervised Visible-Infrared Person Re-Identification via Progressive Graph Matching and Alternate Learning
GCFAgg: Global and Cross-View Feature Aggregation for Multi-View Clustering
An Actor-Centric Causality Graph for Asynchronous Temporal Inference in Group Activity
À-La-Carte Prompt Tuning (APT): Combining Distinct Data via Composable Prompting
Equiangular Basis Vectors
Q: How To Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images!
Hybrid Active Learning via Deep Clustering for Video Action Detection
Constrained Evolutionary Diffusion Filter for Monocular Endoscope Tracking
MMVC: Learned Multi-Mode Video Compression With Block-Based Prediction Mode Selection and Density-Adaptive Entropy Coding
Extracting Motion and Appearance via Inter-Frame Attention for Efficient Video Frame Interpolation
Improving Commonsense in Vision-Language Models via Knowledge Graph Riddles
Adaptive Graph Convolutional Subspace Clustering
LightedDepth: Video Depth Estimation in Light of Limited Inference View Angles
Two-Stream Networks for Weakly-Supervised Temporal Action Localization With Semantic-Aware Mechanisms
OpenMix: Exploring Outlier Samples for Misclassification Detection
DyLiN: Making Light Field Networks Dynamic
ProphNet: Efficient Agent-Centric Motion Forecasting With Anchor-Informed Proposals
Meta Architecture for Point Cloud Analysis
Generating Anomalies for Video Anomaly Detection With Prompt-Based Feature Mapping
RWSC-Fusion: Region-Wise Style-Controlled Fusion Network for the Prohibited X-Ray Security Image Synthesis
Robust Outlier Rejection for 3D Registration With Variational Bayes
Architecture, Dataset and Model-Scale Agnostic Data-Free Meta-Learning
BKinD-3D: Self-Supervised 3D Keypoint Discovery From Multi-View Videos
Dual-Path Adaptation From Image to Video Transformers
RA-CLIP: Retrieval Augmented Contrastive Language-Image Pre-Training
Meta-Learning With a Geometry-Adaptive Preconditioner
Passive Micron-Scale Time-of-Flight With Sunlight Interferometry
Swept-Angle Synthetic Wavelength Interferometry
Indescribable Multi-Modal Spatial Evaluator
Abstract Visual Reasoning: An Algebraic Approach for Solving Raven’s Progressive Matrices
Decoupling Human and Camera Motion From Videos in the Wild
Unifying Vision, Text, and Layout for Universal Document Processing
Flow Supervision for Deformable NeRF
Learning From Noisy Labels With Decoupled Meta Label Purifier
Multi-Sensor Large-Scale Dataset for Multi-View 3D Reconstruction
OmniAvatar: Geometry-Guided Controllable 3D Head Synthesis
Latency Matters: Real-Time Action Forecasting Transformer
ViTs for SITS: Vision Transformers for Satellite Image Time Series
Learning Joint Latent Space EBM Prior Model for Multi-Layer Generator
Efficient Map Sparsification Based on 2D and 3D Discretized Grids
LVQAC: Lattice Vector Quantization Coupled With Spatially Adaptive Companding for Efficient Learned Image Compression
Pose-Disentangled Contrastive Learning for Self-Supervised Facial Representation
Probabilistic Knowledge Distillation of Face Ensembles
Multi-Modal Gait Recognition via Effective Spatial-Temporal Feature Fusion
DKT: Diverse Knowledge Transfer Transformer for Class Incremental Learning
Kernel Aware Resampler
Document Image Shadow Removal Guided by Color-Aware Background
Temporal Consistent 3D LiDAR Representation Learning for Semantic Perception in Autonomous Driving
EMT-NAS: Transferring Architectural Knowledge Between Tasks From Different Datasets
CompletionFormer: Depth Completion With Convolutions and Vision Transformers
Discovering the Real Association: Multimodal Causal Reasoning in Video Question Answering
Re-Thinking Federated Active Learning Based on Inter-Class Diversity
Physical-World Optical Adversarial Attacks on 3D Face Recognition
DATE: Domain Adaptive Product Seeker for E-Commerce
Trap Attention: Monocular Depth Estimation With Manual Traps
Integral Neural Networks
Full or Weak Annotations? An Adaptive Strategy for Budget-Constrained Annotation Campaigns
Solving Oscillation Problem in Post-Training Quantization Through a Theoretical Perspective
Switchable Representation Learning Framework With Self-Compatibility
Neural Fourier Filter Bank
Exploring Data Geometry for Continual Learning
QPGesture: Quantization-Based and Phase-Guided Motion Matching for Natural Speech-Driven Gesture Generation
Learning Neural Duplex Radiance Fields for Real-Time View Synthesis
FlowGrad: Controlling the Output of Generative ODEs With Gradients
PointVector: A Vector Representation in Point Cloud Analysis
Data-Driven Feature Tracking for Event Cameras
Deep Curvilinear Editing: Commutative and Nonlinear Image Manipulation for Pretrained Deep Generative Model
ContraNeRF: Generalizable Neural Radiance Fields for Synthetic-to-Real Novel View Synthesis via Contrastive Learning
Multi-Agent Automated Machine Learning
Inversion-Based Style Transfer With Diffusion Models
Computational Flash Photography Through Intrinsics
Range-Nullspace Video Frame Interpolation With Focalized Motion Estimation
Robust and Scalable Gaussian Process Regression and Its Applications
OmniVidar: Omnidirectional Depth Estimation From Multi-Fisheye Images
Semi-Weakly Supervised Object Kinematic Motion Prediction
VNE: An Effective Method for Improving Deep Representation by Manipulating Eigenvalue Distribution
Task-Specific Fine-Tuning via Variational Information Bottleneck for Weakly-Supervised Pathology Whole Slide Image Classification
DynamicDet: A Unified Dynamic Architecture for Object Detection
Toward Stable, Interpretable, and Lightweight Hyperspectral Super-Resolution
Optimal Proposal Learning for Deployable End-to-End Pedestrian Detection
IDGI: A Framework To Eliminate Explanation Noise From Integrated Gradients
DPE: Disentanglement of Pose and Expression for General Video Portrait Editing
Fast Monocular Scene Reconstruction With Global-Sparse Local-Dense Grids
Bi-Level Meta-Learning for Few-Shot Domain Generalization
Class-Balancing Diffusion Models
Difficulty-Based Sampling for Debiased Contrastive Representation Learning
The Treasure Beneath Multiple Annotations: An Uncertainty-Aware Edge Detector
Towards Trustable Skin Cancer Diagnosis via Rewriting Model’s Decision
DCFace: Synthetic Face Generation With Dual Condition Diffusion Model
Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks
Frame Flexible Network
Visual Dependency Transformers: Dependency Tree Emerges From Reversed Attention
Sparsely Annotated Semantic Segmentation With Adaptive Gaussian Mixtures
Diverse 3D Hand Gesture Prediction From Body Dynamics by Bilateral Hand Disentanglement
STAR Loss: Reducing Semantic Ambiguity in Facial Landmark Detection
Spatial-Frequency Mutual Learning for Face Super-Resolution
Inverse Rendering of Translucent Objects Using Physical and Neural Renderers
Bi3D: Bi-Domain Active Learning for Cross-Domain 3D Object Detection
MOT: Masked Optimal Transport for Partial Domain Adaptation
Spatio-Temporal Pixel-Level Contrastive Learning-Based Source-Free Domain Adaptation for Video Semantic Segmentation
Rethinking Federated Learning With Domain Shift: A Prototype View
Learning Attribute and Class-Specific Representation Duet for Fine-Grained Fashion Analysis
NaQ: Leveraging Narrations As Queries To Supervise Episodic Memory
Ensemble-Based Blackbox Attacks on Dense Prediction
Implicit Neural Head Synthesis via Controllable Local Deformation Fields
Realistic Saliency Guided Image Enhancement
CIRCLE: Capture in Rich Contextual Environments
Reducing the Label Bias for Timestamp Supervised Temporal Action Segmentation
Modality-Agnostic Debiasing for Single Domain Generalization
Learning Action Changes by Measuring Verb-Adverb Textual Relationships
DeGPR: Deep Guided Posterior Regularization for Multi-Class Cell Detection and Counting
Seeing What You Miss: Vision-Language Pre-Training With Semantic Completion Learning
On the Importance of Accurate Geometry Data for Dense 3D Vision Tasks
Learning To Name Classes for Vision and Language Models
NoisyQuant: Noisy Bias-Enhanced Post-Training Activation Quantization for Vision Transformers
Comprehensive and Delicate: An Efficient Transformer for Image Restoration
MoStGAN-V: Video Generation With Temporal Motion Styles
Weakly Supervised Class-Agnostic Motion Prediction for Autonomous Driving
Avatars Grow Legs: Generating Smooth Human Motion From Sparse Tracking Inputs With Diffusion Model
Referring Image Matting
Rethinking Image Super Resolution From Long-Tailed Distribution Learning Perspective
DiffusioNeRF: Regularizing Neural Radiance Fields With Denoising Diffusion Models
Self-Supervised Super-Plane for Neural 3D Reconstruction
Implicit Surface Contrastive Clustering for LiDAR Point Clouds
BlendFields: Few-Shot Example-Driven Facial Modeling
Fast Point Cloud Generation With Straight Flows
Leveraging Hidden Positives for Unsupervised Semantic Segmentation
BundleSDF: Neural 6-DoF Tracking and 3D Reconstruction of Unknown Objects
PLIKS: A Pseudo-Linear Inverse Kinematic Solver for 3D Human Body Estimation
Test of Time: Instilling Video-Language Models With a Sense of Time
Incrementer: Transformer for Class-Incremental Semantic Segmentation With Knowledge Distillation Focusing on Old Class
vMAP: Vectorised Object Mapping for Neural Field SLAM
POEM: Reconstructing Hand in a Point Embedded Multi-View Stereo
NeRF-Supervised Deep Stereo
High-Fidelity 3D Face Generation From Natural Language Descriptions
Spatial-Then-Temporal Self-Supervised Learning for Video Correspondence
Adaptive Plasticity Improvement for Continual Learning
Query-Dependent Video Representation for Moment Retrieval and Highlight Detection
CIGAR: Cross-Modality Graph Reasoning for Domain Adaptive Object Detection
MHPL: Minimum Happy Points Learning for Active Source Free Domain Adaptation
Defining and Quantifying the Emergence of Sparse Concepts in DNNs
LiDAR-in-the-Loop Hyperparameter Optimization
Revisiting Rotation Averaging: Uncertainties and Robust Losses
Mod-Squad: Designing Mixtures of Experts As Modular Multi-Task Learners
A Probabilistic Attention Model With Occlusion-Aware Texture Regression for 3D Hand Reconstruction From a Single RGB Image
Accelerated Coordinate Encoding: Learning To Relocalize in Minutes Using RGB and Poses
Label Information Bottleneck for Label Enhancement
MOSO: Decomposing MOtion, Scene and Object for Video Prediction
Edges to Shapes to Concepts: Adversarial Augmentation for Robust Vision
EVAL: Explainable Video Anomaly Localization
Grounding Counterfactual Explanation of Image Classifiers to Textual Concept Space
CAPE: Camera View Position Embedding for Multi-View 3D Object Detection
VLPD: Context-Aware Pedestrian Detection via Vision-Language Semantic Self-Supervision
Noisy Correspondence Learning With Meta Similarity Correction
Robust Multiview Point Cloud Registration With Reliable Pose Graph Initialization and History Reweighting
MeMaHand: Exploiting Mesh-Mano Interaction for Single Image Two-Hand Reconstruction
Towards Effective Adversarial Textured 3D Meshes on Physical Face Recognition
BiFormer: Learning Bilateral Motion Estimation via Bilateral Transformer for 4K Video Frame Interpolation
Memory-Friendly Scalable Super-Resolution via Rewinding Lottery Ticket Hypothesis
SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation
High-Fidelity and Freely Controllable Talking Head Video Generation
On the Stability-Plasticity Dilemma of Class-Incremental Learning
Multilateral Semantic Relations Modeling for Image Text Retrieval
Practical Network Acceleration With Tiny Sets
Learning To Exploit the Sequence-Specific Prior Knowledge for Image Processing Pipelines Optimization
On the Pitfall of Mixup for Uncertainty Calibration
Few-Shot Non-Line-of-Sight Imaging With Signal-Surface Collaborative Regularization
Differentiable Shadow Mapping for Efficient Inverse Graphics
FFHQ-UV: Normalized Facial UV-Texture Dataset for 3D Face Reconstruction
Texture-Guided Saliency Distilling for Unsupervised Salient Object Detection
PointDistiller: Structured Knowledge Distillation Towards Efficient and Compact 3D Detection
Matching Is Not Enough: A Two-Stage Framework for Category-Agnostic Pose Estimation
Sphere-Guided Training of Neural Implicit Surfaces
Color Backdoor: A Robust Poisoning Attack in Color Space
Explicit Visual Prompting for Low-Level Structure Segmentations
VQACL: A Novel Visual Question Answering Continual Learning Setting
Non-Line-of-Sight Imaging With Signal Superresolution Network
Unsupervised Contour Tracking of Live Cells by Mechanical and Cycle Consistency Losses
Context-Based Trit-Plane Coding for Progressive Image Compression
Adaptive Sparse Convolutional Networks With Global Context Enhancement for Faster Object Detection on Drone Images
Deep Frequency Filtering for Domain Generalization
Self-Supervised AutoFlow
ScarceNet: Animal Pose Estimation With Scarce Annotations
MELTR: Meta Loss Transformer for Learning To Fine-Tune Video Foundation Models
MixPHM: Redundancy-Aware Parameter-Efficient Tuning for Low-Resource Visual Question Answering
Crossing the Gap: Domain Generalization for Image Captioning
Semantic Ray: Learning a Generalizable Semantic Field With Cross-Reprojection Attention
Generalized UAV Object Detection via Frequency Domain Disentanglement
Text With Knowledge Graph Augmented Transformer for Video Captioning
StyleAdv: Meta Style Adversarial Training for Cross-Domain Few-Shot Learning
FlowFormer++: Masked Cost Volume Autoencoding for Pretraining Optical Flow Estimation
Physically Adversarial Infrared Patches With Learnable Shapes and Locations
Multi-Level Logit Distillation
TriDet: Temporal Action Detection With Relative Boundary Modeling
Dimensionality-Varying Diffusion Process
Fast Contextual Scene Graph Generation With Unbiased Context Augmentation
Uni3D: A Unified Baseline for Multi-Dataset 3D Object Detection
CapDet: Unifying Dense Captioning and Open-World Detection Pretraining
PillarNeXt: Rethinking Network Designs for 3D Object Detection in LiDAR Point Clouds
Conditional Text Image Generation With Diffusion Models
Compacting Binary Neural Networks by Sparse Kernel Selection
A General Regret Bound of Preconditioned Gradient Method for DNN Training
Open-Category Human-Object Interaction Pre-Training via Language Modeling Framework
Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding
Neural Video Compression With Diverse Contexts
Controllable Mesh Generation Through Sparse Latent Point Diffusion Models
Balanced Energy Regularization Loss for Out-of-Distribution Detection
Private Image Generation With Dual-Purpose Auxiliary Classifier
Pixels, Regions, and Objects: Multiple Enhancement for Salient Object Detection
AdaptiveMix: Improving GAN Training via Feature Space Shrinkage
CiaoSR: Continuous Implicit Attention-in-Attention Network for Arbitrary-Scale Image Super-Resolution
Black-Box Sparse Adversarial Attack via Multi-Objective Optimisation
Bidirectional Cross-Modal Knowledge Exploration for Video Recognition With Pre-Trained Vision-Language Models
Person Image Synthesis via Denoising Diffusion Model
Policy Adaptation From Foundation Model Feedback
Augmentation Matters: A Simple-Yet-Effective Approach to Semi-Supervised Semantic Segmentation
Neural Fields Meet Explicit Geometric Representations for Inverse Rendering of Urban Scenes
Masked Video Distillation: Rethinking Masked Feature Modeling for Self-Supervised Video Representation Learning
Learning Dynamic Style Kernels for Artistic Style Transfer
Robust Unsupervised StyleGAN Image Restoration
Localized Semantic Feature Mixers for Efficient Pedestrian Detection in Autonomous Driving
Conflict-Based Cross-View Consistency for Semi-Supervised Semantic Segmentation
Learning To Detect Mirrors From Videos via Dual Correspondences
Cross-Domain Image Captioning With Discriminative Finetuning
SfM-TTR: Using Structure From Motion for Test-Time Refinement of Single-View Depth Networks
TINC: Tree-Structured Implicit Neural Compression
Lite-Mono: A Lightweight CNN and Transformer Architecture for Self-Supervised Monocular Depth Estimation
Language in a Bottle: Language Model Guided Concept Bottlenecks for Interpretable Image Classification
TBP-Former: Learning Temporal Bird’s-Eye-View Pyramid for Joint Perception and Prediction in Vision-Centric Autonomous Driving
MAP: Multimodal Uncertainty-Aware Vision-Language Pre-Training Model
A Hierarchical Representation Network for Accurate and Detailed Face Reconstruction From In-the-Wild Images
HS-Pose: Hybrid Scope Feature Extraction for Category-Level Object Pose Estimation
PAniC-3D: Stylized Single-View 3D Reconstruction From Portraits of Anime Characters
SGLoc: Scene Geometry Encoding for Outdoor LiDAR Localization
SuperDisco: Super-Class Discovery Improves Visual Recognition for the Long-Tail
Resource-Efficient RGBD Aerial Tracking
Density-Insensitive Unsupervised Domain Adaption on 3D Object Detection
Neural Transformation Fields for Arbitrary-Styled Font Generation
Physics-Driven Diffusion Models for Impact Sound Synthesis From Videos
Ham2Pose: Animating Sign Language Notation Into Pose Sequences
Towards Modality-Agnostic Person Re-Identification With Descriptive Query
Backdoor Attacks Against Deep Image Compression via Adaptive Frequency Trigger
Master: Meta Style Transformer for Controllable Zero-Shot and Few-Shot Artistic Style Transfer
Hard Sample Matters a Lot in Zero-Shot Quantization
Constructing Deep Spiking Neural Networks From Artificial Neural Networks With Knowledge Distillation
Class Attention Transfer Based Knowledge Distillation
Global and Local Mixture Consistency Cumulative Learning for Long-Tailed Visual Recognitions
Egocentric Video Task Translation
3D Highlighter: Localizing Regions on 3D Shapes via Text Descriptions
High Fidelity 3D Hand Shape Reconstruction via Scalable Graph Frequency Decomposition
Look, Radiate, and Learn: Self-Supervised Localisation via Radio-Visual Correspondence
Instance Relation Graph Guided Source-Free Domain Adaptive Object Detection
Dynamic Conceptional Contrastive Learning for Generalized Category Discovery
Local 3D Editing via 3D Distillation of CLIP Knowledge
EvShutter: Transforming Events for Unconstrained Rolling Shutter Correction
Learning Personalized High Quality Volumetric Head Avatars From Monocular RGB Videos
HGNet: Learning Hierarchical Geometry From Points, Edges, and Surfaces
DINER: Depth-Aware Image-Based NEural Radiance Fields
A New Comprehensive Benchmark for Semi-Supervised Video Anomaly Detection and Anticipation
Multi-Modal Representation Learning With Text-Driven Soft Masks
Visual Exemplar Driven Task-Prompting for Unified Perception in Autonomous Driving
Linking Garment With Person via Semantically Associated Landmarks for Virtual Try-On
2PCNet: Two-Phase Consistency Training for Day-to-Night Unsupervised Domain Adaptive Object Detection
Discriminating Known From Unknown Objects via Structure-Enhanced Recurrent Variational AutoEncoder
Generative Diffusion Prior for Unified Image Restoration and Enhancement
OmniAL: A Unified CNN Framework for Unsupervised Anomaly Localization
Revisiting the Stack-Based Inverse Tone Mapping
Towards Realistic Long-Tailed Semi-Supervised Learning: Consistency Is All You Need
Lift3D: Synthesize 3D Training Data by Lifting 2D GAN to 3D Generative Radiance Field
AeDet: Azimuth-Invariant Multi-View 3D Object Detection
HyperMatch: Noise-Tolerant Semi-Supervised Learning via Relaxed Contrastive Constraint
Feature Alignment and Uniformity for Test Time Adaptation
Unifying Layout Generation With a Decoupled Diffusion Model
Good Is Bad: Causality Inspired Cloth-Debiasing for Cloth-Changing Person Re-Identification
ViLEM: Visual-Language Error Modeling for Image-Text Retrieval
Semi-Supervised 2D Human Pose Estimation Driven by Position Inconsistency Pseudo Label Correction Module
Multiplicative Fourier Level of Detail
Temporal Attention Unit: Towards Efficient Spatiotemporal Predictive Learning
Run, Don’t Walk: Chasing Higher FLOPS for Faster Neural Networks
Open-Set Fine-Grained Retrieval via Prompting Vision-Language Evaluator
SteerNeRF: Accelerating NeRF Rendering via Smooth Viewpoint Trajectory
Collaborative Noisy Label Cleaner: Learning Scene-Aware Trailers for Multi-Modal Highlight Detection in Movies
FreeNeRF: Improving Few-Shot Neural Rendering With Free Frequency Regularization
MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation
Hyperbolic Contrastive Learning for Visual Representations Beyond Objects
MotionTrack: Learning Robust Short-Term and Long-Term Motions for Multi-Object Tracking
CLIP Is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation
B-Spline Texture Coefficients Estimator for Screen Content Image Super-Resolution
Multimodal Industrial Anomaly Detection via Hybrid Fusion
GD-MAE: Generative Decoder for MAE Pre-Training on LiDAR Point Clouds
Exploring Motion Ambiguity and Alignment for High-Quality Video Frame Interpolation
RONO: Robust Discriminative Learning With Noisy Labels for 2D-3D Cross-Modal Retrieval
ANetQA: A Large-Scale Benchmark for Fine-Grained Compositional Reasoning Over Untrimmed Videos
Parts2Words: Learning Joint Embedding of Point Clouds and Texts by Bidirectional Matching Between Parts and Words
Efficient Second-Order Plane Adjustment
Deep Hashing With Minimal-Distance-Separated Hash Centers
RefCLIP: A Universal Teacher for Weakly Supervised Referring Expression Comprehension
Adaptive Assignment for Geometry Aware Local Feature Matching
ConZIC: Controllable Zero-Shot Image Captioning by Sampling-Based Polishing
Curricular Object Manipulation in LiDAR-Based Object Detection
Fully Self-Supervised Depth Estimation From Defocus Clue
Post-Training Quantization on Diffusion Models
Rebalancing Batch Normalization for Exemplar-Based Class-Incremental Learning
Latent-NeRF for Shape-Guided Generation of 3D Shapes and Textures
HGFormer: Hierarchical Grouping Transformer for Domain Generalized Semantic Segmentation
Improving Robust Generalization by Direct PAC-Bayesian Bound Minimization
AssemblyHands: Towards Egocentric Activity Understanding via 3D Hand Pose Estimation
EXIF As Language: Learning Cross-Modal Associations Between Images and Camera Metadata
High-Res Facial Appearance Capture From Polarized Smartphone Images
Feature Aggregated Queries for Transformer-Based Video Object Detectors
Ambiguous Medical Image Segmentation Using Diffusion Models
Twin Contrastive Learning With Noisy Labels
Partial Network Cloning
Zero-Shot Text-to-Parameter Translation for Game Character Auto-Creation
Understanding Deep Generative Models With Generalized Empirical Likelihoods
Spectral Enhanced Rectangle Transformer for Hyperspectral Image Denoising
InternImage: Exploring Large-Scale Vision Foundation Models With Deformable Convolutions
Adaptive Annealing for Robust Geometric Estimation
Self-Supervised 3D Scene Flow Estimation Guided by Superpoints
Learning Optical Expansion From Scale Matching
Watch or Listen: Robust Audio-Visual Speech Recognition With Visual Corruption Modeling and Reliability Scoring
Grid-Guided Neural Radiance Fields for Large Urban Scenes
SpaText: Spatio-Textual Representation for Controllable Image Generation
Local Implicit Ray Function for Generalizable Radiance Field Representation
Generalized Deep 3D Shape Prior via Part-Discretized Diffusion Process
Weakly Supervised Video Representation Learning With Unaligned Text for Sequential Videos
The Best Defense Is a Good Offense: Adversarial Augmentation Against Adversarial Attacks
Handy: Towards a High Fidelity 3D Hand Shape and Appearance Model
Learning a Depth Covariance Function
TokenHPE: Learning Orientation Tokens for Efficient Head Pose Estimation via Transformers
DiffRF: Rendering-Guided 3D Radiance Field Diffusion
Clothing-Change Feature Augmentation for Person Re-Identification
Learnable Skeleton-Aware 3D Point Cloud Sampling
TrojDiff: Trojan Attacks on Diffusion Models With Diverse Targets
DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-Training via Word-Region Alignment
Modeling Inter-Class and Intra-Class Constraints in Novel Class Discovery
GarmentTracking: Category-Level Garment Pose Tracking
Benchmarking Robustness of 3D Object Detection to Common Corruptions
Generic-to-Specific Distillation of Masked Autoencoders
Dynamic Focus-Aware Positional Queries for Semantic Segmentation
Physics-Guided ISO-Dependent Sensor Noise Modeling for Extreme Low-Light Photography
Compositor: Bottom-Up Clustering and Compositing for Robust Part and Object Segmentation
Visual-Language Prompt Tuning With Knowledge-Guided Context Optimization
Context-Aware Pretraining for Efficient Blind Image Decomposition
VecFontSDF: Learning To Reconstruct and Synthesize High-Quality Vector Fonts via Signed Distance Functions
Towards Better Gradient Consistency for Neural Signed Distance Functions via Level Set Alignment
PACO: Parts and Attributes of Common Objects
Are Binary Annotations Sufficient? Video Moment Retrieval via Hierarchical Uncertainty-Based Active Learning
MetaMix: Towards Corruption-Robust Continual Learning With Temporally Self-Adaptive Data Transformation
UniHCP: A Unified Model for Human-Centric Perceptions
Learning To Zoom and Unzoom
Improving Weakly Supervised Temporal Action Localization by Bridging Train-Test Gap in Pseudo Labels
Implicit Identity Driven Deepfake Face Swapping Detection
Prototypical Residual Networks for Anomaly Detection and Localization
Bridging Search Region Interaction With Template for RGB-T Tracking
COT: Unsupervised Domain Adaptation With Clustering and Optimal Transport
Tangentially Elongated Gaussian Belief Propagation for Event-Based Incremental Optical Flow Estimation
HuManiFlow: Ancestor-Conditioned Normalising Flows on SO(3) Manifolds for Human Pose and Shape Distribution Estimation
Photo Pre-Training, but for Sketch
PREIM3D: 3D Consistent Precise Image Attribute Editing From a Single Image
ConQueR: Query Contrast Voxel-DETR for 3D Object Detection
OcTr: Octree-Based Transformer for 3D Object Detection
RILS: Masked Visual Reconstruction in Language Semantic Space
Image Cropping With Spatial-Aware Feature and Rank Consistency
Towards Benchmarking and Assessing Visual Naturalness of Physical World Adversarial Attacks
DropMAE: Masked Autoencoders With Spatial-Attention Dropout for Tracking Tasks
Delving Into Discrete Normalizing Flows on SO(3) Manifold for Probabilistic Rotation Modeling
3D-Aware Object Goal Navigation via Simultaneous Exploration and Identification
GeoNet: Benchmarking Unsupervised Adaptation Across Geographies
Joint HDR Denoising and Fusion: A Real-World Mobile HDR Image Dataset
PATS: Patch Area Transportation With Subdivision for Local Feature Matching
SINE: Semantic-Driven Image-Based NeRF Editing With Prior-Guided Editing Field
Mutual Information-Based Temporal Difference Learning for Human Pose Estimation in Video
Learning To Detect and Segment for Open Vocabulary Object Detection
Hierarchical Supervision and Shuffle Data Augmentation for 3D Semi-Supervised Object Detection
Dynamic Graph Learning With Content-Guided Spatial-Frequency Relation Reasoning for Deepfake Detection
OSAN: A One-Stage Alignment Network To Unify Multimodal Alignment and Unsupervised Domain Adaptation
Ultrahigh Resolution Image/Video Matting With Spatio-Temporal Sparsity
MobileBrick: Building LEGO for 3D Reconstruction on Mobile Devices
WeatherStream: Light Transport Automation of Single Image Deweathering
Normal-Guided Garment UV Prediction for Human Re-Texturing
Depth Estimation From Camera Image and mmWave Radar Point Cloud
RIAV-MVS: Recurrent-Indexing an Asymmetric Volume for Multi-View Stereo
pCON: Polarimetric Coordinate Networks for Neural Scene Representations
Deep Factorized Metric Learning
Improving Image Recognition by Retrieving From Web-Scale Image-Text Data
Video-Text As Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning
Semantic Prompt for Few-Shot Image Recognition
SVFormer: Semi-Supervised Video Transformer for Action Recognition
Exploring the Relationship Between Architectural Design and Adversarially Robust Generalization
Self-Supervised Learning for Multimodal Non-Rigid 3D Shape Matching
Federated Incremental Semantic Segmentation
Revisiting Prototypical Network for Cross Domain Few-Shot Learning
Task Difficulty Aware Parameter Allocation & Regularization for Lifelong Learning
Two-View Geometry Scoring Without Correspondences
AltFreezing for More General Video Face Forgery Detection
CAP-VSTNet: Content Affinity Preserved Versatile Style Transfer
Uncurated Image-Text Datasets: Shedding Light on Demographic Bias
Decompose More and Aggregate Better: Two Closer Looks at Frequency Representation Learning for Human Motion Prediction
Motion Information Propagation for Neural Video Compression
Hierarchical Temporal Transformer for 3D Hand Pose Estimation and Action Recognition From Egocentric RGB Videos
Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?
GFIE: A Dataset and Baseline for Gaze-Following From 2D to 3D in Indoor Environments
Complete-to-Partial 4D Distillation for Self-Supervised Point Cloud Sequence Representation Learning
NeuralPCI: Spatio-Temporal Neural Field for 3D Point Cloud Multi-Frame Non-Linear Interpolation
LayoutFormer++: Conditional Graphic Layout Generation via Constraint Serialization and Decoding Space Restriction
Model-Agnostic Gender Debiased Image Captioning
Bootstrapping Objectness From Videos by Relaxed Common Fate and Visual Grouping
Towards a Smaller Student: Capacity Dynamic Distillation for Efficient Image Retrieval
Delving Into Shape-Aware Zero-Shot Semantic Segmentation
Cascade Evidential Learning for Open-World Weakly-Supervised Temporal Action Localization
NeuralEditor: Editing Neural Radiance Fields via Manipulating Point Clouds
ProtoCon: Pseudo-Label Refinement via Online Clustering and Prototypical Consistency for Efficient Semi-Supervised Learning
Mixed Autoencoder for Self-Supervised Visual Representation Learning
DPF: Learning Dense Prediction Fields With Weak Supervision
MISC210K: A Large-Scale Dataset for Multi-Instance Semantic Correspondence
Similarity Metric Learning for RGB-Infrared Group Re-Identification
Exploring Discontinuity for Video Frame Interpolation
GradICON: Approximate Diffeomorphisms via Gradient Inverse Consistency
DNeRV: Modeling Inherent Dynamics via Difference Neural Representation for Videos
Polynomial Implicit Neural Representations for Large Diverse Datasets
Towards Better Decision Forests: Forest Alternating Optimization
CrOC: Cross-View Online Clustering for Dense Visual Representation Learning
Shape, Pose, and Appearance From a Single Image via Bootstrapped Radiance Field Inversion
Continuous Pseudo-Label Rectified Domain Adaptive Semantic Segmentation With Implicit Neural Representations
Target-Referenced Reactive Grasping for Dynamic Objects
ReVISE: Self-Supervised Speech Resynthesis With Visual Input for Universal and Generalized Speech Regeneration
Structured Sparsity Learning for Efficient Video Super-Resolution
Non-Contrastive Unsupervised Learning of Physiological Signals From Video
Weakly-Supervised Single-View Image Relighting
Learning Audio-Visual Source Localization via False Negative Aware Contrastive Learning
End-to-End Video Matting With Trimap Propagation
Human Body Shape Completion With Implicit Shape and Flow Learning
TinyMIM: An Empirical Study of Distilling MIM Pre-Trained Models
Plateau-Reduced Differentiable Path Tracing
Computationally Budgeted Continual Learning: What Does Matter?
Event-Based Shape From Polarization
Adversarially Robust Neural Architecture Search for Graph Neural Networks
An In-Depth Exploration of Person Re-Identification and Gait Recognition in Cloth-Changing Conditions
From Node Interaction To Hop Interaction: New Effective and Scalable Graph Learning Paradigm
Transfer4D: A Framework for Frugal Motion Capture and Deformation Transfer
HumanBench: Towards General Human-Centric Perception With Projector Assisted Pretraining
SECAD-Net: Self-Supervised CAD Reconstruction by Learning Sketch-Extrude Operations
Learning Multi-Modal Class-Specific Tokens for Weakly Supervised Dense Object Localization
Instant Volumetric Head Avatars
Boundary-Enhanced Co-Training for Weakly Supervised Semantic Segmentation
CRAFT: Concept Recursive Activation FacTorization for Explainability
Don’t Lie to Me! Robust and Efficient Explainability With Verified Perturbation Analysis
HOOD: Hierarchical Graphs for Generalized Modelling of Clothing Dynamics
Learning To Predict Scene-Level Implicit 3D From Posed RGBD Data
Prompting Large Language Models With Answer Heuristics for Knowledge-Based Visual Question Answering
Harmonious Feature Learning for Interactive Hand-Object Pose Estimation
Decompose, Adjust, Compose: Effective Normalization by Playing With Frequency for Domain Generalization
Habitat-Matterport 3D Semantics Dataset
Reinforcement Learning-Based Black-Box Model Inversion Attacks
PIRLNav: Pretraining With Imitation and RL Finetuning for ObjectNav
DC2: Dual-Camera Defocus Control by Learning To Refocus
The ObjectFolder Benchmark: Multisensory Learning With Neural and Real Objects
Rethinking Domain Generalization for Face Anti-Spoofing: Separability and Alignment
Grad-PU: Arbitrary-Scale Point Cloud Upsampling via Gradient Descent With Learned Distance Functions
Texts as Images in Prompt Tuning for Multi-Label Image Recognition
Unsupervised Inference of Signed Distance Functions From Single Sparse Point Clouds Without Learning Priors
Neural Vector Fields: Implicit Representation by Explicit Learning
Zero-Shot Pose Transfer for Unrigged Stylized 3D Characters
Complete 3D Human Reconstruction From a Single Incomplete Image
PartDistillation: Learning Parts From Instance Segmentation
EDICT: Exact Diffusion Inversion via Coupled Transformations
PartSLIP: Low-Shot Part Segmentation for 3D Point Clouds via Pretrained Image-Language Models
Regularized Vector Quantization for Tokenized Image Synthesis
EDGE: Editable Dance Generation From Music
Low-Light Image Enhancement via Structure Modeling and Guidance
Where Is My Spot? Few-Shot Image Generation via Latent Subspace Optimization
Bilateral Memory Consolidation for Continual Learning
Structure Aggregation for Cross-Spectral Stereo Image Guided Denoising
What Happened 3 Seconds Ago? Inferring the Past With Thermal Imaging
Contrastive Grouping With Transformer for Referring Image Segmentation
Learning To Segment Every Referring Object Point by Point
Event-Guided Person Re-Identification via Sparse-Dense Complementary Learning
RIDCP: Revitalizing Real Image Dehazing via High-Quality Codebook Priors
Dynamic Graph Enhanced Contrastive Learning for Chest X-Ray Report Generation
GlassesGAN: Eyewear Personalization Using Synthetic Appearance Discovery and Targeted Subspace Modeling
Gaussian Label Distribution Learning for Spherical Image Object Detection
Long Range Pooling for 3D Large-Scale Scene Understanding
Learning Spatial-Temporal Implicit Neural Representations for Event-Guided Video Super-Resolution
PaletteNeRF: Palette-Based Appearance Editing of Neural Radiance Fields
Ego-Body Pose Estimation via Ego-Head Pose Estimation
Towards Stable Human Pose Estimation via Cross-View Fusion and Foot Stabilization
Command-Driven Articulated Object Understanding and Manipulation
ReasonNet: End-to-End Driving With Temporal and Global Reasoning
Learning Instance-Level Representation for Large-Scale Multi-Modal Pretraining in E-Commerce
Adaptive Sparse Pairwise Loss for Object Re-Identification
FedSeg: Class-Heterogeneous Federated Learning for Semantic Segmentation
UniDAformer: Unified Domain Adaptive Panoptic Segmentation Transformer via Hierarchical Mask Calibration
Iterative Next Boundary Detection for Instance Segmentation of Tree Rings in Microscopy Images of Shrub Cross Sections
HyperReel: High-Fidelity 6-DoF Video With Ray-Conditioned Sampling
Compressing Volumetric Radiance Fields to 1 MB
EC2: Emergent Communication for Embodied Control
Joint Visual Grounding and Tracking With Natural Language Specification
TranSG: Transformer-Based Skeleton Graph Prototype Contrastive Learning With Structure-Trajectory Prompted Reconstruction for Person Re-Identification
DAA: A Delta Age AdaIN Operation for Age Estimation via Binary Code Transformer
Evading DeepFake Detectors via Adversarial Statistical Consistency
DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation
Disentangled Representation Learning for Unsupervised Neural Quantization
Zero-Shot Model Diagnosis
Omni3D: A Large Benchmark and Model for 3D Object Detection in the Wild
Learning Bottleneck Concepts in Image Classification
Multi-Centroid Task Descriptor for Dynamic Class Incremental Inference
Explicit Boundary Guided Semi-Push-Pull Contrastive Learning for Supervised Anomaly Detection
CORA: Adapting CLIP for Open-Vocabulary Detection With Region Prompting and Anchor Pre-Matching
CXTrack: Improving 3D Point Cloud Tracking With Contextual Information
Efficient Multimodal Fusion via Interactive Prompting
MOTRv2: Bootstrapping End-to-End Multi-Object Tracking by Pretrained Object Detectors
AsyFOD: An Asymmetric Adaptation Paradigm for Few-Shot Domain Adaptive Object Detection
NerVE: Neural Volumetric Edges for Parametric Curve Extraction From Point Cloud
Referring Multi-Object Tracking
Paint by Example: Exemplar-Based Image Editing With Diffusion Models
Interactive Cartoonization With Controllable Perceptual Factors
Rethinking the Approximation Error in 3D Surface Fitting for Point Cloud Normal Estimation
Weakly Supervised Monocular 3D Object Detection Using Multi-View Projection and Direction Consistency
Representing Volumetric Videos As Dynamic MLP Maps
3D-Aware Face Swapping
Exploring the Effect of Primitives for Compositional Generalization in Vision-and-Language
NeuWigs: A Neural Dynamic Model for Volumetric Hair Capture and Animation
Proximal Splitting Adversarial Attack for Semantic Segmentation
Data-Free Sketch-Based Image Retrieval
CDDFuse: Correlation-Driven Dual-Branch Feature Decomposition for Multi-Modality Image Fusion
Spherical Transformer for LiDAR-Based 3D Recognition
Adaptive Global Decay Process for Event Cameras
Adapting Shortcut With Normalizing Flow: An Efficient Tuning Framework for Visual Recognition
IPCC-TP: Utilizing Incremental Pearson Correlation Coefficient for Joint Multi-Agent Trajectory Prediction
SOOD: Towards Semi-Supervised Oriented Object Detection
Towards Artistic Image Aesthetics Assessment: A Large-Scale Dataset and a New Method
Standing Between Past and Future: Spatio-Temporal Modeling for Multi-Camera 3D Multi-Object Tracking
Three Guidelines You Should Know for Universally Slimmable Self-Supervised Learning
Spatially Adaptive Self-Supervised Learning for Real-World Image Denoising
IterativePFN: True Iterative Point Cloud Filtering
3D-Aware Facial Landmark Detection via Multi-View Consistent Training on Synthetic Data
Semi-Supervised Domain Adaptation With Source Label Adaptation
Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training
Self-Supervised Video Forensics by Audio-Visual Anomaly Detection
Visual-Tactile Sensing for In-Hand Object Reconstruction
MaLP: Manipulation Localization Using a Proactive Scheme
Logical Consistency and Greater Descriptive Power for Facial Hair Attribute Learning
N-Gram in Swin Transformers for Efficient Lightweight Image Super-Resolution
Object-Aware Distillation Pyramid for Open-Vocabulary Object Detection
ShadowDiffusion: When Degradation Prior Meets Diffusion Model for Shadow Removal
Learning Human Mesh Recovery in 3D Scenes
Adaptive Zone-Aware Hierarchical Planner for Vision-Language Navigation
Curricular Contrastive Regularization for Physics-Aware Single Image Dehazing
VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation
Raw Image Reconstruction With Learned Compact Metadata
End-to-End 3D Dense Captioning With Vote2Cap-DETR
Generating Human Motion From Textual Descriptions With Discrete Representations
RefSR-NeRF: Towards High Fidelity and Super Resolution View Synthesis
Towards Building Self-Aware Object Detectors via Reliable Uncertainty Quantification and Calibration
Human Guided Ground-Truth Generation for Realistic Image Super-Resolution
DiffPose: Toward More Reliable 3D Pose Estimation
SQUID: Deep Feature In-Painting for Unsupervised Anomaly Detection
DegAE: A New Pretraining Paradigm for Low-Level Vision
RankMix: Data Augmentation for Weakly Supervised Learning of Classifying Whole Slide Images With Diverse Sizes and Imbalanced Categories
Trajectory-Aware Body Interaction Transformer for Multi-Person Pose Forecasting
Revisiting Self-Similarity: Structural Embedding for Image Retrieval
Bringing Inputs to Shared Domains for 3D Interacting Hands Recovery in the Wild
Towards Bridging the Performance Gaps of Joint Energy-Based Models
FFF: Fragment-Guided Flexible Fitting for Building Complete Protein Structures
Layout-Based Causal Inference for Object Navigation
POTTER: Pooling Attention Transformer for Efficient Human Mesh Recovery
Coaching a Teachable Student
Shape-Constraint Recurrent Flow for 6D Object Pose Estimation
Catch Missing Details: Image Reconstruction With Frequency Augmented Variational Autoencoder
Rigidity-Aware Detection for 6D Object Pose Estimation
ShadowNeuS: Neural SDF Reconstruction by Shadow Ray Supervision
ALOFT: A Lightweight MLP-Like Architecture With Dynamic Low-Frequency Transform for Domain Generalization
The Enemy of My Enemy Is My Friend: Exploring Inverse Adversaries for Improving Adversarial Training
Specialist Diffusion: Plug-and-Play Sample-Efficient Fine-Tuning of Text-to-Image Diffusion Models To Learn Any Unseen Style
Primitive Generation and Semantic-Related Alignment for Universal Zero-Shot Segmentation
Semi-Supervised Hand Appearance Recovery via Structure Disentanglement and Dual Adversarial Discrimination
TryOnDiffusion: A Tale of Two UNets
Breaking the “Object” in Video Object Segmentation
SunStage: Portrait Reconstruction and Relighting Using the Sun as a Light Stage
Object Discovery From Motion-Guided Tokens
Batch Model Consolidation: A Multi-Task Model Consolidation Framework
Dense Network Expansion for Class Incremental Learning
IMP: Iterative Matching and Pose Estimation With Adaptive Pooling
LightPainter: Interactive Portrait Relighting With Freehand Scribble
Unified Pose Sequence Modeling
VindLU: A Recipe for Effective Video-and-Language Pretraining
MAGE: MAsked Generative Encoder To Unify Representation Learning and Image Synthesis
I2MVFormer: Large Language Model Generated Multi-View Document Supervision for Zero-Shot Image Classification
FreeSeg: Unified, Universal and Open-Vocabulary Image Segmentation
Continual Semantic Segmentation With Automatic Memory Sample Selection
Regularizing Second-Order Influences for Continual Learning
Boost Vision Transformer With GPU-Friendly Sparsity and Quantization
Progressive Semantic-Visual Mutual Adaption for Generalized Zero-Shot Learning
NeRFInvertor: High Fidelity NeRF-GAN Inversion for Single-Shot Real Image Animation
HairStep: Transfer Synthetic to Real Using Strand and Depth Maps for Single-View 3D Hair Modeling
VL-SAT: Visual-Linguistic Semantics Assisted Training for 3D Semantic Scene Graph Prediction in Point Cloud
F2-NeRF: Fast Neural Radiance Field Training With Free Camera Trajectories
Frustratingly Easy Regularization on Representation Can Boost Deep Reinforcement Learning
Context-Aware Alignment and Mutual Masking for 3D-Language Pre-Training
Geometry and Uncertainty-Aware 3D Point Cloud Class-Incremental Semantic Segmentation
UniDistill: A Universal Cross-Modality Knowledge Distillation Framework for 3D Object Detection in Bird’s-Eye View
ImageNet-E: Benchmarking Neural Network Robustness via Attribute Editing
Image Super-Resolution Using T-Tetromino Pixels
Highly Confident Local Structure Based Consensus Graph Learning for Incomplete Multi-View Clustering
Non-Contrastive Learning Meets Language-Image Pre-Training
Dynamic Inference With Grounding Based Vision and Language Models
A Simple Baseline for Video Restoration With Grouped Spatial-Temporal Shift
Cross-GAN Auditing: Unsupervised Identification of Attribute Level Similarities and Differences Between Pretrained Generative Models
Analyzing Physical Impacts Using Transient Surface Wave Imaging
Deep Learning of Partial Graph Matching via Differentiable Top-K
LargeKernel3D: Scaling Up Kernels in 3D Sparse CNNs
Collecting Cross-Modal Presence-Absence Evidence for Weakly-Supervised Audio-Visual Event Perception
Binarizing Sparse Convolutional Networks for Efficient Point Cloud Analysis
Hunting Sparsity: Density-Guided Contrastive Learning for Semi-Supervised Semantic Segmentation
Autoregressive Visual Tracking
LinK: Linear Kernel for LiDAR-Based 3D Perception
Weakly Supervised Segmentation With Point Annotations for Histopathology Images via Contrast-Based Variational Model
KiUT: Knowledge-Injected U-Transformer for Radiology Report Generation
DARE-GRAM: Unsupervised Domain Adaptation Regression by Aligning Inverse Gram Matrices
FCC: Feature Clusters Compression for Long-Tailed Visual Recognition
DartBlur: Privacy Preservation With Detection Artifact Suppression
Efficient Frequency Domain-Based Transformers for High-Quality Image Deblurring
Focus on Details: Online Multi-Object Tracking With Diverse Fine-Grained Representation
Transfer Knowledge From Head to Tail: Uncertainty Calibration Under Long-Tailed Distribution
Deep Stereo Video Inpainting
Open-Set Semantic Segmentation for Point Clouds via Adversarial Prototype Framework
CLAMP: Prompt-Based Contrastive Learning for Connecting Language and Animal Pose
Video Dehazing via a Multi-Range Temporal Alignment Network With Physical Prior
RGBD2: Generative Scene Synthesis via Incremental View Inpainting Using RGBD Diffusion Models
Prototype-Based Embedding Network for Scene Graph Generation
Backdoor Defense via Adaptively Splitting Poisoned Dataset
GaitGCI: Generative Counterfactual Intervention for Gait Recognition
Bi-Directional Feature Fusion Generative Adversarial Network for Ultra-High Resolution Pathological Image Virtual Re-Staining
Vector Quantization With Self-Attention for Quality-Independent Representation Learning
Fine-Grained Face Swapping via Regional GAN Inversion
VoxelNeXt: Fully Sparse VoxelNet for 3D Object Detection and Tracking
Frequency-Modulated Point Cloud Rendering With Easy Editing
TAPS3D: Text-Guided 3D Textured Shape Generation From Pseudo Supervision
Semantic Human Parsing via Scalable Semantic Transfer Over Multiple Label Domains
Enhancing the Self-Universality for Transferable Targeted Attacks
Multi-View Inverse Rendering for Large-Scale Real-World Indoor Scenes
A Unified Pyramid Recurrent Network for Video Frame Interpolation
SemiCVT: Semi-Supervised Convolutional Vision Transformer for Semantic Segmentation
Images Speak in Images: A Generalist Painter for In-Context Visual Learning
FastInst: A Simple Query-Based Model for Real-Time Instance Segmentation
Rethinking Optical Flow From Geometric Matching Consistent Perspective
MMANet: Margin-Aware Distillation and Modality-Aware Regularization for Incomplete Multimodal Learning
Self-Supervised Implicit Glyph Attention for Text Recognition
Semi-Supervised Video Inpainting With Cycle Consistency Constraints
Patch-Based 3D Natural Scene Generation From a Single Example
Listening Human Behavior: 3D Human Pose Estimation With Acoustic Signals
Mobile User Interface Element Detection via Adaptively Prompt Tuning
High-Fidelity Facial Avatar Reconstruction From Monocular Video With Generative Priors
Anchor3DLane: Learning To Regress 3D Anchors for Monocular 3D Lane Detection
BioNet: A Biologically-Inspired Network for Face Recognition
PyramidFlow: High-Resolution Defect Contrastive Localization Using Pyramid Normalizing Flow
Neural Kaleidoscopic Space Sculpting
Accelerating Dataset Distillation via Model Augmentation
ProD: Prompting-To-Disentangle Domain Knowledge for Cross-Domain Few-Shot Image Classification
Diverse Embedding Expansion Network and Low-Light Cross-Modality Benchmark for Visible-Infrared Person Re-Identification
Neural Dependencies Emerging From Learning Massive Categories
DeFeeNet: Consecutive 3D Human Motion Prediction With Deviation Feedback
LP-DIF: Learning Local Pattern-Specific Deep Implicit Function for 3D Objects and Scenes
A Dynamic Multi-Scale Voxel Flow Network for Video Prediction
Dionysus: Recovering Scene Structures by Dividing Into Semantic Pieces
Being Comes From Not-Being: Open-Vocabulary Text-to-Motion Generation With Wordless Training
Unsupervised Cumulative Domain Adaptation for Foggy Scene Optical Flow
DynaMask: Dynamic Mask Selection for Instance Segmentation
HandNeRF: Neural Radiance Fields for Animatable Interacting Hands
Simultaneously Short- and Long-Term Temporal Modeling for Semi-Supervised Video Semantic Segmentation
Devil Is in the Queries: Advancing Mask Transformers for Real-World Medical Image Segmentation and Out-of-Distribution Localization
Rethinking the Learning Paradigm for Dynamic Facial Expression Recognition
PLA: Language-Driven Open-Vocabulary 3D Scene Understanding
Learning With Fantasy: Semantic-Aware Virtual Contrastive Constraint for Few-Shot Class-Incremental Learning
Towards Accurate Image Coding: Improved Autoregressive Image Generation With Dynamic Vector Quantization
MSMDFusion: Fusing LiDAR and Camera at Multiple Scales With Multi-Depth Seeds for 3D Object Detection
Learning Anchor Transformations for 3D Garment Animation
LaserMix for Semi-Supervised LiDAR Semantic Segmentation
Enhanced Training of Query-Based Object Detection via Selective Query Recollection
SCoDA: Domain Adaptive Shape Completion for Real Scans
VideoTrack: Learning To Track Objects via Video Transformer
Probing Sentiment-Oriented Pre-Training Inspired by Human Sentiment Perception Mechanism
Towards Professional Level Crowd Annotation of Expert Domain Data
MoDAR: Using Motion Forecasting for 3D Object Detection in Point Cloud Sequences
iQuery: Instruments As Queries for Audio-Visual Sound Separation
RGB No More: Minimally-Decoded JPEG Vision Transformers
Label-Free Liver Tumor Segmentation
Zero-Shot Object Counting
Exploring Incompatible Knowledge Transfer in Few-Shot Image Generation
DepGraph: Towards Any Structural Pruning
GRES: Generalized Referring Expression Segmentation
Tracking Through Containers and Occluders in the Wild
Ambiguity-Resistant Semi-Supervised Learning for Dense Object Detection
Neural Preset for Color Style Transfer
Vid2Avatar: 3D Avatar Reconstruction From Videos in the Wild via Self-Supervised Scene Decomposition
CNVid-3.5M: Build, Filter, and Pre-Train the Large-Scale Public Chinese Video-Text Dataset
NeRF-RPN: A General Framework for Object Detection in NeRFs
Diffusion-SDF: Text-To-Shape via Voxelized Diffusion
PointAvatar: Deformable Point-Based Head Avatars From Videos
Connecting the Dots: Floorplan Reconstruction Using Two-Level Queries
DLBD: A Self-Supervised Direct-Learned Binary Descriptor
MSF: Motion-Guided Sequential Fusion for Efficient 3D Object Detection From Point Cloud Sequences
Multi-Space Neural Radiance Fields
HelixSurf: A Robust and Efficient Neural Implicit Surface Learning of Indoor Scenes With Iterative Intertwined Regularization
Learning With Noisy Labels via Self-Supervised Adversarial Noisy Masking
Toward Accurate Post-Training Quantization for Image Super Resolution
Generating Holistic 3D Human Motion From Speech
TexPose: Neural Texture Learning for Self-Supervised 6D Object Pose Estimation
End-to-End Vectorized HD-Map Construction With Piecewise Bezier Curve
Accelerating Vision-Language Pretraining With Free Language Modeling
OSRT: Omnidirectional Image Super-Resolution With Distortion-Aware Transformer
Interactive Segmentation As Gaussian Process Classification
Probability-Based Global Cross-Modal Upsampling for Pansharpening
Boosting Verified Training for Robust Image Classifications via Abstraction
PVT-SSD: Single-Stage 3D Object Detector With Point-Voxel Transformer
StyleRes: Transforming the Residuals for Real Image Editing With StyleGAN
Breaching FedMD: Image Recovery via Paired-Logits Inversion Attack
FrustumFormer: Adaptive Instance-Aware Resampling for Multi-View 3D Detection
Edge-Aware Regional Message Passing Controller for Image Forgery Localization
PD-Quant: Post-Training Quantization Based on Prediction Difference Metric
Learning Geometric-Aware Properties in 2D Representation Using Lightweight CAD Models, or Zero Real 3D Pairs
Learning To Dub Movies via Hierarchical Prosody Models
Binary Latent Diffusion
NeRFVS: Neural Radiance Fields for Free View Synthesis via Geometry Scaffolds
Neural Kernel Surface Reconstruction
You Can Ground Earlier Than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos
Hybrid Neural Rendering for Large-Scale Scenes With Motion Blur
Self-Supervised Non-Uniform Kernel Estimation With Flow-Based Motion Prior for Blind Image Deblurring
Are We Ready for Vision-Centric Driving Streaming Perception? The ASAP Benchmark
The Devil Is in the Points: Weakly Semi-Supervised Instance Segmentation via Point-Guided Mask Representation
Unified Mask Embedding and Correspondence Learning for Self-Supervised Video Segmentation
OmniObject3D: Large-Vocabulary 3D Object Dataset for Realistic Perception, Reconstruction and Generation
Skinned Motion Retargeting With Residual Perception of Motion Semantics & Geometry
Multispectral Video Semantic Segmentation: A Benchmark Dataset and Baseline
3D Registration With Maximal Cliques
Inferring and Leveraging Parts From Object Shape for Improving Semantic Image Synthesis
LoGoNet: Towards Accurate 3D Object Detection With Local-to-Global Cross-Modal Fusion
Modeling Video As Stochastic Processes for Fine-Grained Video Representation Learning
A2J-Transformer: Anchor-to-Joint Transformer Network for 3D Interacting Hand Pose Estimation From a Single RGB Image
Distilling Vision-Language Pre-Training To Collaborate With Weakly-Supervised Temporal Action Localization
Discriminator-Cooperated Feature Map Distillation for GAN Compression
Balancing Logit Variation for Long-Tailed Semantic Segmentation
Efficient Mask Correction for Click-Based Interactive Image Segmentation
Dual Alignment Unsupervised Domain Adaptation for Video-Text Retrieval
Simple Cues Lead to a Strong Multi-Object Tracker
MixTeacher: Mining Promising Labels With Mixed Scale Teacher for Semi-Supervised Object Detection
Masked Auto-Encoders Meet Generative Adversarial Networks and Beyond
Unifying Short and Long-Term Tracking With Graph Hierarchies
Spring: A High-Resolution High-Detail Dataset and Benchmark for Scene Flow, Optical Flow and Stereo
Unsupervised 3D Point Cloud Representation Learning by Triangle Constrained Contrast for Autonomous Driving
NeRF-DS: Neural Radiance Fields for Dynamic Specular Objects
CABM: Content-Aware Bit Mapping for Single Image Super-Resolution Network With Large Input
System-Status-Aware Adaptive Network for Online Streaming Video Understanding
On the Difficulty of Unpaired Infrared-to-Visible Video Translation: Fine-Grained Content-Rich Patches Transfer
Gloss Attention for Gloss-Free Sign Language Translation
FAC: 3D Representation Learning via Foreground Aware Feature Contrast
InstMove: Instance Motion for Object-Centric Video Segmentation
Less Is More: Reducing Task and Model Complexity for 3D Point Cloud Semantic Segmentation
ResFormer: Scaling ViTs With Multi-Resolution Training
CrowdCLIP: Unsupervised Crowd Counting via Vision-Language Model
Sibling-Attack: Rethinking Transferable Adversarial Attacks Against Face Recognition
Identity-Preserving Talking Face Generation With Landmark and Appearance Priors
Divide and Adapt: Active Domain Adaptation via Customized Learning
Putting People in Their Place: Affordance-Aware Human Insertion Into Scenes
Advancing Visual Grounding With Scene Knowledge: Benchmark and Method
Parametric Implicit Face Representation for Audio-Driven Facial Reenactment
Improved Distribution Matching for Dataset Condensation
Semi-DETR: Semi-Supervised Object Detection With Detection Transformers
SHS-Net: Learning Signed Hyper Surfaces for Oriented Normal Estimation of Point Clouds
DBARF: Deep Bundle-Adjusting Generalizable Neural Radiance Fields
Stitchable Neural Networks
Equivalent Transformation and Dual Stream Network Construction for Mobile Image Super-Resolution
Dynamic Aggregated Network for Gait Recognition
MoLo: Motion-Augmented Long-Short Contrastive Learning for Few-Shot Action Recognition
Omni Aggregation Networks for Lightweight Image Super-Resolution
Masked Image Modeling With Local Multi-Scale Reconstruction
RaBit: Parametric Modeling of 3D Biped Cartoon Characters With a Topological-Consistent Dataset
Disentangling Orthogonal Planes for Indoor Panoramic Room Layout Estimation With Cross-Scale Distortion Awareness
Global-to-Local Modeling for Video-Based 3D Human Pose and Shape Estimation
Top-Down Visual Attention From Analysis by Synthesis
Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation
SimpleNet: A Simple Network for Image Anomaly Detection and Localization
DisCoScene: Spatially Disentangled Generative Radiance Fields for Controllable 3D-Aware Scene Synthesis
IS-GGT: Iterative Scene Graph Generation With Generative Transformers
MSINet: Twins Contrastive Search of Multi-Scale Interaction for Object ReID
Executing Your Commands via Motion Diffusion in Latent Space
NEF: Neural Edge Fields for 3D Parametric Curve Reconstruction From Multi-View Images
FLAG3D: A 3D Fitness Activity Dataset With Language Instruction
Towards Universal Fake Image Detectors That Generalize Across Generative Models
NeuralUDF: Learning Unsigned Distance Fields for Multi-View Reconstruction of Surfaces With Arbitrary Topologies
Context De-Confounded Emotion Recognition
PA&DA: Jointly Sampling Path and Data for Consistent NAS
You Only Segment Once: Towards Real-Time Panoptic Segmentation
Activating More Pixels in Image Super-Resolution Transformer
DisWOT: Student Architecture Search for Distillation WithOut Training
Effective Ambiguity Attack Against Passport-Based DNN Intellectual Property Protection Schemes Through Fully Connected Layer Substitution
Efficient Semantic Segmentation by Altering Resolutions for Compressed Videos
Object Pose Estimation With Statistical Guarantees: Conformal Keypoint Detection and Geometric Uncertainty Propagation
Dynamically Instance-Guided Adaptation: A Backward-Free Approach for Test-Time Domain Adaptive Semantic Segmentation
Pose Synchronization Under Multiple Pair-Wise Relative Poses
Open Set Action Recognition via Multi-Label Evidential Learning
Micron-BERT: BERT-Based Facial Micro-Expression Recognition
Genie: Show Me the Data for Quantization
Deep Graph Reprogramming
Generalizable Local Feature Pre-Training for Deformable Shape Analysis
Collaborative Diffusion for Multi-Modal Face Generation and Editing
Diffusion Probabilistic Model Made Slim
Unsupervised Domain Adaption With Pixel-Level Discriminator for Image-Aware Layout Generation
Token Contrast for Weakly-Supervised Semantic Segmentation
VoP: Text-Video Co-Operative Prompt Tuning for Cross-Modal Retrieval
BAD-NeRF: Bundle Adjusted Deblur Neural Radiance Fields
Class-Conditional Sharpness-Aware Minimization for Deep Long-Tailed Recognition
Local Implicit Normalizing Flow for Arbitrary-Scale Image Super-Resolution
Think Twice Before Driving: Towards Scalable Decoders for End-to-End Autonomous Driving
DoNet: Deep De-Overlapping Network for Cytology Instance Segmentation
PointListNet: Deep Learning on 3D Point Lists
NeuralLift-360: Lifting an In-the-Wild 2D Photo to a 3D Object With 360° Views
Representation Learning for Visual Object Tracking by Masked Appearance Transfer
Boosting Detection in Crowd Analysis via Underutilized Output Features
Endpoints Weight Fusion for Class Incremental Semantic Segmentation
Symmetric Shape-Preserving Autoencoder for Unsupervised Real Scene Point Cloud Completion
LiDAR2Map: In Defense of LiDAR-Based Semantic Map Construction Using Online Camera Distillation
StyleSync: High-Fidelity Generalized and Personalized Lip Sync in Style-Based Generator
Masked Motion Encoding for Self-Supervised Video Representation Learning
In-Hand 3D Object Scanning From an RGB Sequence
SceneComposer: Any-Level Semantic Image Synthesis
QuantArt: Quantizing Image Style Transfer Towards High Visual Fidelity
Magic3D: High-Resolution Text-to-3D Content Creation
Uncovering the Missing Pattern: Unified Framework Towards Trajectory Imputation and Prediction
3D-Aware Conditional Image Synthesis
GeneCIS: A Benchmark for General Conditional Image Similarity
Neighborhood Attention Transformer
DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation
LOCATE: Localize and Transfer Object Parts for Weakly Supervised Affordance Grounding
Camouflaged Object Detection With Feature Decomposition and Edge Reconstruction
Iterative Proposal Refinement for Weakly-Supervised Video Grounding
SIM: Semantic-Aware Instance Mask Generation for Box-Supervised Instance Segmentation
Deep Discriminative Spatial and Temporal Network for Efficient Video Deblurring
Continuous Sign Language Recognition With Correlation Network
Learning a Sparse Transformer Network for Effective Image Deraining
Iterative Geometry Encoding Volume for Stereo Matching
Look Before You Match: Instance Understanding Matters in Video Object Segmentation
Decoupled Semantic Prototypes Enable Learning From Diverse Annotation Types for Semi-Weakly Segmentation in Expert-Driven Domains
Neuron Structure Modeling for Generalizable Remote Physiological Measurement
Long-Tailed Visual Recognition via Self-Heterogeneous Integration With Knowledge Excavation
MDQE: Mining Discriminative Query Embeddings To Segment Occluded Instances on Challenging Videos
One-to-Few Label Assignment for End-to-End Dense Detection
Intrinsic Physical Concepts Discovery With Object-Centric Predictive Models
Uncertainty-Aware Unsupervised Image Deblurring With Deep Residual Prior
EditableNeRF: Editing Topologically Varying Neural Radiance Fields by Key Points
Progressive Open Space Expansion for Open-Set Model Attribution
Seeing a Rose in Five Thousand Ways
Unsupervised Sampling Promoting for Stochastic Human Trajectory Prediction
Object-Goal Visual Navigation via Effective Exploration of Relations Among Historical Navigation States
Large-Scale Training Data Search for Object Re-Identification
AttentionShift: Iteratively Estimated Part-Based Attention Map for Pointly Supervised Instance Segmentation
Use Your Head: Improving Long-Tail Video Recognition
The Wisdom of Crowds: Temporal Progressive Attention for Early Action Prediction
Looking Through the Glass: Neural Surface Reconstruction Against High Specular Reflections
MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining
Deep Arbitrary-Scale Image Super-Resolution via Scale-Equivariance Pursuit
Phase-Shifting Coder: Predicting Accurate Orientation in Oriented Object Detection
HexPlane: A Fast Representation for Dynamic Scenes
AdamsFormer for Spatial Action Localization in the Future
HIER: Metric Learning Beyond Class Labels via Hierarchical Regularization
Content-Aware Token Sharing for Efficient Semantic Segmentation With Vision Transformers
Revisiting Weak-to-Strong Consistency in Semi-Supervised Semantic Segmentation
Multiview Compressive Coding for 3D Reconstruction
Prompt, Generate, Then Cache: Cascade of Foundation Models Makes Strong Few-Shot Learners
Fake It Till You Make It: Learning Transferable Representations From Synthetic ImageNet Clones
How Can Objects Help Action Recognition?
Understanding and Improving Features Learned in Deep Functional Maps
Soft Augmentation for Image Classification
GraVoS: Voxel Selection for 3D Point-Cloud Detection
RenderDiffusion: Image Diffusion for 3D Reconstruction, Inpainting and Generation
Learned Two-Plane Perspective Prior Based Image Resampling for Efficient Object Detection
Adversarial Counterfactual Visual Explanations
Tracking Multiple Deformable Objects in Egocentric Videos
CUF: Continuous Upsampling Filters
Curvature-Balanced Feature Manifold Learning for Long-Tailed Classification
DIP: Dual Incongruity Perceiving Network for Sarcasm Detection
LipFormer: High-Fidelity and Generalizable Talking Face Generation With a Pre-Learned Facial Codebook
Meta-Causal Learning for Single Domain Generalization
Mind the Label Shift of Augmentation-Based Graph OOD Generalization
BAEFormer: Bi-Directional and Early Interaction Transformers for Bird’s Eye View Semantic Segmentation
Learning a Simple Low-Light Image Enhancer From Paired Low-Light Instances
MSeg3D: Multi-Modal 3D Semantic Segmentation for Autonomous Driving
Weakly Supervised Posture Mining for Fine-Grained Classification
A Light Weight Model for Active Speaker Detection
Graphics Capsule: Learning Hierarchical 3D Face Representations From 2D Images
Network Expansion for Practical Training Acceleration
Upcycling Models Under Domain and Category Shift
CAT: LoCalization and IdentificAtion Cascade Detection Transformer for Open-World Object Detection
TIPI: Test Time Adaptation With Transformation Invariance
BEV-SAN: Accurate BEV 3D Object Detection via Slice Attention Networks
Joint Video Multi-Frame Interpolation and Deblurring Under Unknown Exposure Time
CLOTH4D: A Dataset for Clothed Human Reconstruction
Deep Incomplete Multi-View Clustering With Cross-View Partial Sample and Prototype Alignment
Rethinking Out-of-Distribution (OOD) Detection: Masked Image Modeling Is All You Need
DETR With Additional Global Aggregation for Cross-Domain Weakly Supervised Object Detection
Learning Imbalanced Data With Vision Transformers
StructVPR: Distill Structural Knowledge With Weighting Samples for Visual Place Recognition
Asymmetric Feature Fusion for Image Retrieval
DiffSwap: High-Fidelity and Controllable Face Swapping via 3D-Aware Masked Diffusion
1% VS 100%: Parameter-Efficient Low Rank Adapter for Dense Predictions
LASP: Text-to-Text Optimization for Language-Aware Soft Prompting of Vision & Language Models
OT-Filter: An Optimal Transport Filter for Learning With Noisy Labels
Learning Adaptive Dense Event Stereo From the Image Domain
Progressive Neighbor Consistency Mining for Correspondence Pruning
Adversarial Robustness via Random Projection Filters