CVPR 2023 Papers

Defending Against Patch-Based Backdoor Attacks on Self-Supervised Learning

Gradient-Based Uncertainty Attribution for Explainable Bayesian Deep Learning

Learning Neural Parametric Head Models

Delivering Arbitrary-Modal Semantic Segmentation

High-Fidelity 3D Human Digitization From Single 2K Resolution Images

Panoptic Video Scene Graph Generation

FFCV: Accelerating Training by Removing Data Bottlenecks

A Data-Based Perspective on Transfer Learning

GLIGEN: Open-Set Grounded Text-to-Image Generation

Patch-Craft Self-Supervised Training for Correlated Image Denoising

Co-Salient Object Detection With Uncertainty-Aware Group Exchange-Masking

GamutMLP: A Lightweight MLP for Color Loss Recovery

ReDirTrans: Latent-to-Latent Translation for Gaze and Head Redirection

Is BERT Blind? Exploring the Effect of Vision-and-Language Pretraining on Visual Language Understanding

Why Is the Winner the Best?

RangeViT: Towards Vision Transformers for 3D Semantic Segmentation in Autonomous Driving

ZegCLIP: Towards Adapting CLIP for Zero-Shot Semantic Segmentation

MaskSketch: Unpaired Structure-Guided Masked Image Generation

Video Probabilistic Diffusion Models in Projected Latent Space

Prefix Conditioning Unifies Language and Label Supervision

Pic2Word: Mapping Pictures to Words for Zero-Shot Composed Image Retrieval

Visual Prompt Tuning for Generative Transfer Learning

GEN: Pushing the Limits of Softmax-Based Out-of-Distribution Detection

Masked Wavelet Representation for Compact Neural Radiance Fields

MIME: Human-Aware 3D Scene Generation

BITE: Beyond Priors for Improved Three-D Dog Pose Estimation

3D Human Pose Estimation via Intuitive Physics

Heat Diffusion Based Multi-Scale and Geometric Structure-Aware Transformer for Mesh Segmentation

DKM: Dense Kernelized Feature Matching for Geometry Estimation

Balanced Product of Calibrated Experts for Long-Tailed Recognition

SLOPER4D: A Scene-Aware Dataset for Global 4D Human Pose Estimation in Urban Environments

CIMI4D: A Large Multimodal Climbing Motion Dataset Under Human-Scene Interactions

FIANCEE: Faster Inference of Adversarial Networks via Conditional Early Exits

Connecting Vision and Language With Video Localized Narratives

All-in-Focus Imaging From Event Focal Stack

Semi-Supervised Stereo-Based 3D Object Detection via Cross-View Consensus

Improved Test-Time Adaptation for Domain Generalization

Learning Common Rationale To Improve Self-Supervised Representation for Fine-Grained Visual Recognition Problems

METransformer: Radiology Report Generation by Transformer With Multiple Learnable Expert Tokens

Exact-NeRF: An Exploration of a Precise Volumetric Parameterization for Neural Radiance Fields

X3KD: Knowledge Distillation Across Modalities, Tasks and Stages for Multi-Camera 3D Object Detection

Learning Partial Correlation Based Deep Visual Representation for Image Classification

Implicit 3D Human Mesh Recovery Using Consistency With Pose and Shape From Unseen-View

CLIPPING: Distilling CLIP-Based Models With a Student Base for Video-Language Retrieval

Guided Recommendation for Model Fine-Tuning

Implicit View-Time Interpolation of Stereo Videos Using Multi-Plane Disparities and Non-Uniform Coordinates

FeatureBooster: Boosting Feature Descriptors With a Lightweight Neural Network

PlenVDB: Memory Efficient VDB-Based Radiance Fields for Fast Training and Rendering

MED-VT: Multiscale Encoder-Decoder Video Transformer With Application To Object Segmentation

Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation

Handwritten Text Generation From Visual Archetypes

Hierarchical B-Frame Video Coding Using Two-Layer CANF Without Motion Coding

Multi-Modal Learning With Missing Modality via Shared-Specific Feature Modelling

Spatial-Temporal Concept Based Explanation of 3D ConvNets

Robust Model-Based Face Reconstruction Through Weakly-Supervised Outlier Segmentation

PoseExaminer: Automated Testing of Out-of-Distribution Robustness in Human Pose and Shape Estimation

Learned Image Compression With Mixed Transformer-CNN Architectures

Weakly-Supervised Domain Adaptive Semantic Segmentation With Prototypical Contrastive Learning

Exploring and Exploiting Uncertainty for Incomplete Multi-View Classification

Seeing Through the Glass: Neural 3D Reconstruction of Object Inside a Transparent Container

Zero-Shot Everything Sketch-Based Image Retrieval, and in Explainable Style

Correspondence Transformers With Asymmetric Feature Learning and Matching Flow Super-Resolution

CiCo: Domain-Aware Sign Language Retrieval via Cross-Lingual Contrastive Learning

Castling-ViT: Compressing Self-Attention via Switching Towards Linear-Angular Attention at Vision Transformer Inference

S3C: Semi-Supervised VQA Natural Language Explanation via Self-Critical Learning

Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training

ESLAM: Efficient Dense SLAM System Based on Hybrid Representation of Signed Distance Fields

Continuous Intermediate Token Learning With Implicit Motion Manifold for Keyframe Based Motion Interpolation

Prompt-Guided Zero-Shot Anomaly Action Recognition Using Pretrained Deep Skeleton Features

A New Path: Scaling Vision-and-Language Navigation With Synthetic Instructions and Imitation Learning

Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image Inpainting

BiFormer: Vision Transformer With Bi-Level Routing Attention

Dense Distinct Query for End-to-End Object Detection

Consistent-Teacher: Towards Reducing Inconsistent Pseudo-Targets in Semi-Supervised Object Detection

Learning Locally Editable Virtual Humans

PCT-Net: Full Resolution Image Harmonization Using Pixel-Wise Color Transformations

All-in-One Image Restoration for Unknown Degradations Using Adaptive Discriminative Filters for Specific Degradations

ALSO: Automotive Lidar Self-Supervision by Occupancy Estimation

Unsupervised Object Localization: Observing the Background To Discover Objects

SCPNet: Semantic Scene Completion on Point Cloud

UMat: Uncertainty-Aware Single Image High Resolution Material Capture

Unified Keypoint-Based Action Recognition Framework via Structured Keypoint Pooling

Teleidoscopic Imaging System for Microscale 3D Shape Reconstruction

Fresnel Microfacet BRDF: Unification of Polari-Radiometric Surface-Body Reflection

RODIN: A Generative Model for Sculpting 3D Digital Avatars Using Diffusion

Learning 3D-Aware Image Synthesis With Unknown Pose Distribution

DynaFed: Tackling Client Data Heterogeneity With Global Dynamics

Enlarging Instance-Specific and Class-Specific Information for Open-Set Action Recognition

High-Fidelity 3D GAN Inversion by Pseudo-Multi-View Optimization

Blind Video Deflickering by Neural Filtering With a Flawed Atlas

Delving StyleGAN Inversion for Image Editing: A Foundation Latent Space Viewpoint

Multi-View Azimuth Stereo via Tangent Space Consistency

PosterLayout: A New Benchmark and Approach for Content-Aware Visual-Textual Presentation Layout

Evolved Part Masking for Self-Supervised Learning

PanoHead: Geometry-Aware 3D Full-Head Synthesis in 360°

Scaling Up GANs for Text-to-Image Synthesis

Instant Domain Augmentation for LiDAR Semantic Segmentation

FaceLit: Neural 3D Relightable Faces

Deep Graph-Based Spatial Consistency for Robust Non-Rigid Point Cloud Registration

Paired-Point Lifting for Enhanced Privacy-Preserving Visual Localization

Global Vision Transformer Pruning With Hessian-Aware Saliency

Beyond mAP: Towards Better Evaluation of Instance Segmentation

3D Shape Reconstruction of Semi-Transparent Worms

Language-Guided Audio-Visual Source Separation via Trimodal Consistency

Bias Mimicking: A Simple Sampling Approach for Bias Mitigation

Robust Dynamic Radiance Fields

MIANet: Aggregating Unbiased Instance and General Information for Few-Shot Semantic Segmentation

Detection of Out-of-Distribution Samples Using Binary Neuron Activation Patterns

Recovering 3D Hand Mesh Sequence From a Single Blurry Image: A New Dataset and Temporal Unfolding

Rotation-Invariant Transformer for Point Cloud Matching

Propagate and Calibrate: Real-Time Passive Non-Line-of-Sight Tracking

3D Neural Field Generation Using Triplane Diffusion

GLeaD: Improving GANs With a Generator-Leading Task

Training Debiased Subnetworks With Contrastive Weight Pruning

ViPLO: Vision Transformer Based Pose-Conditioned Self-Loop Graph for Human-Object Interaction Detection

Restoration of Hand-Drawn Architectural Drawings Using Latent Space Mapping With Degradation Generator

Distilling Self-Supervised Vision Transformers for Weakly-Supervised Few-Shot Classification & Segmentation

EFEM: Equivariant Neural Field Expectation Maximization for 3D Object Segmentation Without Scene Supervision

Learning Decorrelated Representations Efficiently Using Fast Fourier Transform

V2V4Real: A Real-World Large-Scale Dataset for Vehicle-to-Vehicle Cooperative Perception

Towards Robust Tampered Text Detection in Document Image: New Dataset and New Solution

Make Landscape Flatter in Differentially Private Federated Learning

Re-Thinking Model Inversion Attacks Against Deep Neural Networks

GeoMVSNet: Learning Multi-View Stereo With Geometry Perception

ProxyFormer: Proxy Alignment Assisted Point Cloud Completion With Missing Part Sensitive Transformer

Instant-NVR: Instant Neural Volumetric Rendering for Human-Object Interactions From Monocular RGBD Stream

A Large-Scale Homography Benchmark

Self-Correctable and Adaptable Inference for Generalizable Human Pose Estimation

Neuro-Modulated Hebbian Learning for Fully Test-Time Adaptation

Learning Distortion Invariant Representation for Image Restoration From a Causality Perspective

Token Boosting for Robust Self-Supervised Visual Transformer Pre-Training

Feature Representation Learning With Adaptive Displacement Generation and Transformer Fusion for Micro-Expression Recognition

FJMP: Factorized Joint Multi-Agent Motion Prediction Over Learned Directed Acyclic Interaction Graphs

Generalizing Dataset Distillation via Deep Generative Prior

Learning on Gradients: Generalized Artifacts Representation for GAN-Generated Images Detection

NLOST: Non-Line-of-Sight Imaging With Transformer

Few-Shot Referring Relationships in Videos

Next3D: Generative Neural Texture Rasterization for 3D-Aware Head Avatars

Tensor4D: Efficient Neural 4D Decomposition for High-Fidelity Dynamic Reconstruction and Rendering

CloSET: Modeling Clothed Humans on Continuous Surface With Explicit Template Decomposition

Task Residual for Tuning Vision-Language Models

JRDB-Pose: A Large-Scale Dataset for Multi-Person Pose Estimation and Tracking

Best of Both Worlds: Multimodal Contrastive Learning With Tabular and Imaging Data

Learning Semantic-Aware Disentangled Representation for Flexible 3D Human Body Editing

Crowd3D: Towards Hundreds of People Reconstruction From a Single Image

CARTO: Category and Joint Agnostic Reconstruction of ARTiculated Objects

MOVES: Manipulated Objects in Video Enable Segmentation

OReX: Object Reconstruction From Planar Cross-Sections Using Neural Fields

An Erudite Fine-Grained Visual Classification Model

On-the-Fly Category Discovery

Confidence-Aware Personalized Federated Learning via Variational Expectation Maximization

Spatio-Focal Bidirectional Disparity Estimation From a Dual-Pixel Image

ECON: Explicit Clothed Humans Optimized via Normal Integration

Class Adaptive Network Calibration

STDLens: Model Hijacking-Resilient Federated Learning for Object Detection

Samples With Low Loss Curvature Improve Data Efficiency

A Practical Stereo Depth System for Smart Glasses

Quantum-Inspired Spectral-Spatial Pyramid Network for Hyperspectral Image Classification

ABCD: Arbitrary Bitwise Coefficient for De-Quantization

ScaleDet: A Scalable Multi-Dataset Object Detector

A Meta-Learning Approach to Predicting Performance and Data Requirements

Multi-View Stereo Representation Revisit: Region-Aware MVSNet

Neural Intrinsic Embedding for Non-Rigid Point Cloud Matching

DeltaEdit: Exploring Text-Free Training for Text-Driven Image Manipulation

DisCo-CLIP: A Distributed Contrastive Loss for Memory Efficient CLIP Training

I2-SDF: Intrinsic Indoor Scene Reconstruction and Editing via Raytracing in Neural SDFs

Two-Stage Co-Segmentation Network Based on Discriminative Representation for Recovering Human Mesh From Videos

Learning Semantic-Aware Knowledge Guidance for Low-Light Image Enhancement

EfficientViT: Memory Efficient Vision Transformer With Cascaded Group Attention

LANIT: Language-Driven Image-to-Image Translation for Unlabeled Data

MonoHuman: Animatable Human Neural Field From Monocular Video

EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual Grounding

ACSeg: Adaptive Conceptualization for Unsupervised Semantic Segmentation

Out-of-Candidate Rectification for Weakly Supervised Semantic Segmentation

Real-Time Evaluation in Online Continual Learning: A New Hope

NeurOCS: Neural NOCS Supervision for Monocular 3D Object Localization

DP-NeRF: Deblurred Neural Radiance Field With Physical Scene Priors

Demystifying Causal Features on Adversarial Examples and Causal Inoculation for Robust Network by Adversarial Instrumental Variable Regression

A New Dataset Based on Images Taken by Blind People for Testing the Robustness of Image Classification Models Trained for ImageNet Categories

What Can Human Sketches Do for Object Detection?

Occlusion-Free Scene Recovery via Neural Radiance Fields

Incremental 3D Semantic Scene Graph Prediction From RGB Sequences

The Dark Side of Dynamic Routing Neural Networks: Towards Efficiency Backdoor Injection

SMOC-Net: Leveraging Camera Pose for Self-Supervised Monocular Object Pose Estimation

E2PN: Efficient SE(3)-Equivariant Point Network

Deep Fair Clustering via Maximizing and Minimizing Mutual Information: Theory, Algorithm and Metric

Visual DNA: Representing and Comparing Images Using Distributions of Neuron Activations

LayoutDM: Discrete Diffusion Model for Controllable Layout Generation

Towards Flexible Multi-Modal Document Models

Toward Verifiable and Reproducible Human Evaluation for Text-to-Image Generation

TRACE: 5D Temporal Regression of Avatars With Dynamic Cameras in 3D Environments

Variational Distribution Learning for Unsupervised Text-to-Image Generation

ShapeTalk: A Language Dataset and Framework for 3D Shape Edits and Deformations

Im2Hands: Learning Attentive Implicit Representation of Interacting Two-Hand Shapes

Painting 3D Nature in 2D: View Synthesis of Natural Scenes From a Single Semantic Mask

Semi-Supervised Learning Made Simple With Self-Supervised Clustering

GrowSP: Unsupervised Semantic Segmentation of 3D Point Clouds

NoPe-NeRF: Optimising Neural Radiance Field With No Pose Prior

Tell Me What Happened: Unifying Text-Guided Video Completion via Multimodal Masked Video Generation

MetaFusion: Infrared and Visible Image Fusion via Meta-Feature Embedding From Object Detection

MethaneMapper: Spectral Absorption Aware Hyperspectral Transformer for Methane Detection

Source-Free Adaptive Gaze Estimation by Uncertainty Reduction

Structural Multiplane Image: Bridging Neural View Synthesis and 3D Reconstruction

DeCo: Decomposition and Reconstruction for Compositional Temporal Grounding via Coarse-To-Fine Contrastive Ranking

Weakly Supervised Temporal Sentence Grounding With Uncertainty-Guided Self-Training

Light Source Separation and Intrinsic Image Decomposition Under AC Illumination

Sketch2Saliency: Learning To Detect Salient Objects From Human Drawings

Exploiting Unlabelled Photos for Stronger Fine-Grained SBIR

Picture That Sketch: Photorealistic Image Generation From Abstract Sketches

CLIP for All Things Zero-Shot Sketch-Based Image Retrieval, Fine-Grained or Not

SceneTrilogy: On Human Scene-Sketch and Its Complementarity With Photo and Text

Angelic Patches for Improving Third-Party Object Detector Performance

NeuralField-LDM: Scene Generation With Hierarchical Latent Diffusion Models

DANI-Net: Uncalibrated Photometric Stereo by Differentiable Shadow Handling, Anisotropic Reflectance Modeling, and Neural Inverse Rendering

Adversarially Masking Synthetic To Mimic Real: Adaptive Noise Injection for Point Cloud Segmentation Adaptation

Minimizing the Accumulated Trajectory Error To Improve Dataset Distillation

Phone2Proc: Bringing Robust Robots Into Our Chaotic World

Objaverse: A Universe of Annotated 3D Objects

Supervised Masked Knowledge Distillation for Few-Shot Transformers

Class-Incremental Exemplar Compression for Class-Incremental Learning

Continual Detection Transformer for Incremental Object Detection

Overcoming the Trade-Off Between Accuracy and Plausibility in 3D Hand Shape Reconstruction

Analyzing and Diagnosing Pose Estimation With Attributions

CREPE: Can Vision-Language Foundation Models Reason Compositionally?

AnyFlow: Arbitrary Scale Optical Flow With Implicit Neural Representation

On Distillation of Guided Diffusion Models

Are Data-Driven Explanations Robust Against Out-of-Distribution Data?

T-SEA: Transfer-Based Self-Ensemble Attack on Object Detection

ActMAD: Activation Matching To Align Distributions for Test-Time-Training

Video Test-Time Adaptation for Action Recognition

Mask DINO: Towards a Unified Transformer-Based Framework for Object Detection and Segmentation

Neural Congealing: Aligning Images to a Joint Semantic Atlas

Modality-Invariant Visual Odometry for Embodied Vision

Improving Selective Visual Question Answering by Learning From Your Peers

Real-Time 6K Image Rescaling With Rate-Distortion Optimization

Distilling Neural Fields for Real-Time Articulated Shape Reconstruction

MaPLe: Multi-Modal Prompt Learning

Visibility Aware Human-Object Interaction Tracking From Single RGB Camera

X-Avatar: Expressive Human Avatars

Rethinking Gradient Projection Continual Learning: Stability / Plasticity Feature Space Decoupling

Unsupervised Space-Time Network for Temporally-Consistent Segmentation of Multiple Motions

Hierarchical Semantic Correspondence Networks for Video Paragraph Grounding

1000 FPS HDR Video With a Spike-RGB Hybrid Camera

CLIP the Gap: A Single Domain Generalization Approach for Object Detection

Learning Transformations To Reduce the Geometric Shift in Object Detection

Music-Driven Group Choreography

Structured 3D Features for Reconstructing Controllable Avatars

Backdoor Cleansing With Unlabeled Data

PET-NeuS: Positional Encoding Tri-Planes for Neural Surfaces

Single Domain Generalization for LiDAR Semantic Segmentation

Distilling Focal Knowledge From Imperfect Expert for 3D Object Detection

Neural Map Prior for Autonomous Driving

Dynamic Coarse-To-Fine Learning for Oriented Tiny Object Detection

“Seeing” Electric Network Frequency From Events

Recurrent Homography Estimation Using Homography-Guided Image Warping and Focus Transformer

Efficient Robust Principal Component Analysis via Block Krylov Iteration and CUR Decomposition

Reliable and Interpretable Personalized Federated Learning

Inverting the Imaging Process by Learning an Implicit Camera Model

WildLight: In-the-Wild Inverse Rendering With a Flashlight

Wide-Angle Rectification via Content-Aware Conformal Mapping

MEGANE: Morphable Eyeglass and Avatar Network

Hierarchical Discriminative Learning Improves Visual Representations of Biomedical Microscopy

Generalized Decoding for Pixel, Image, and Language

ScanDMM: A Deep Markov Model of Scanpath Prediction for 360° Images

CAMS: CAnonicalized Manipulation Spaces for Category-Level Functional Hand-Object Manipulation Synthesis

Self-Supervised Geometry-Aware Encoder for Style-Based 3D GAN Inversion

AutoFocusFormer: Image Segmentation off the Grid

VIVE3D: Viewpoint-Independent Video Editing Using 3D-Aware GANs

Multi-View Reconstruction Using Signed Ray Distance Functions (SRDF)

OTAvatar: One-Shot Talking Face Avatar With Controllable Tri-Plane Rendering

SketchXAI: A First Look at Explainability for Human Sketches

MV-JAR: Masked Voxel Jigsaw and Reconstruction for LiDAR-Based Self-Supervised Pre-Training

Post-Processing Temporal Action Detection

SliceMatch: Geometry-Guided Aggregation for Cross-View Pose Estimation

M6Doc: A Large-Scale Multi-Format, Multi-Type, Multi-Layout, Multi-Language, Multi-Annotation Category Dataset for Modern Document Layout Analysis

Affordance Diffusion: Synthesizing Hand-Object Interactions

Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning

Meta-Personalizing Vision-Language Models To Find Named Instances in Video

Language-Guided Music Recommendation for Video via Prompt Analogies

Ranking Regularization for Critical Rare Classes: Minimizing False Positives at a High True Positive Rate

Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation

ACR: Attention Collaboration-Based Regressor for Arbitrary Two-Hand Reconstruction

ShapeClipper: Scalable 3D Shape Learning From Single-View Images via Geometric and CLIP-Based Consistency

Learning Situation Hyper-Graphs for Video Question Answering

TarViS: A Unified Approach for Target-Based Video Segmentation

StepFormer: Self-Supervised Step Discovery and Localization in Instructional Videos

CODA-Prompt: COntinual Decomposed Attention-Based Prompting for Rehearsal-Free Continual Learning

Generating Part-Aware Editable 3D Shapes Without 3D Supervision

AligNeRF: High-Fidelity Neural Radiance Fields via Alignment-Aware Training

Self-Supervised Blind Motion Deblurring With Deep Expectation Maximization

Clover: Towards a Unified Video-Language Alignment and Fusion Model

High-Fidelity Clothed Avatar Reconstruction From a Single Image

Topology-Guided Multi-Class Cell Context Generation for Digital Pathology

Gazeformer: Scalable, Effective and Fast Prediction of Goal-Directed Human Attention

Detecting Everything in the Open World: Towards Universal Object Detection

CLIP2Protect: Protecting Facial Privacy Using Text-Guided Makeup via Adversarial Latent Search

Evading Forensic Classifiers With Attribute-Conditioned Adversarial Faces

Token Turing Machines

Temporally Consistent Online Depth Estimation Using Point-Based Fusion

SparsePose: Sparse-View Camera Pose Regression and Refinement

K-Planes: Explicit Radiance Fields in Space, Time, and Appearance

On the Benefits of 3D Pose and Tracking for Human Action Recognition

How You Feelin’? Learning Emotions and Mental States in Movie Scenes

GIVL: Improving Geographical Inclusivity of Vision-Language Models With Pre-Training Methods

A Whac-a-Mole Dilemma: Shortcuts Come in Multiples Where Mitigating One Amplifies Others

HyperCUT: Video Sequence From a Single Blurry Image Using Unsupervised Ordering

DejaVu: Conditional Regenerative Learning To Enhance Dense Prediction

NIRVANA: Neural Implicit Representations of Videos With Adaptive Networks and Autoregressive Patch-Wise Modeling

SimpSON: Simplifying Photo Cleanup With Single-Click Distracting Object Segmentation Network

Align and Attend: Multimodal Summarization With Dual Contrastive Losses

HNeRV: A Hybrid Neural Representation for Videos

FlexNeRF: Photorealistic Free-Viewpoint Rendering of Moving Humans From Sparse Views

Teaching Matters: Investigating the Role of Supervision in Vision Transformers

Towards Scalable Neural Representation for Diverse Videos

Score Jacobian Chaining: Lifting Pretrained 2D Diffusion Models for 3D Generation

Learning Customized Visual Models With Retrieval-Augmented Knowledge

Probing Neural Representations of Scene Perception in a Hippocampally Dependent Task Using Artificial Neural Networks

ConStruct-VL: Data-Free Continual Structured VL Concepts Learning

Invertible Neural Skinning

Transductive Few-Shot Learning With Prototype-Based Label Propagation by Iterative Graph Refinement

3Mformer: Multi-Order Multi-Mode Transformer for Skeletal Action Recognition

LINe: Out-of-Distribution Detection by Leveraging Important Neurons

Open-Vocabulary Panoptic Segmentation With Text-to-Image Diffusion Models

Modernizing Old Photos Using Multiple References via Photorealistic Style Transfer

SUDS: Scalable Urban Dynamic Scenes

Octree Guided Unoriented Surface Reconstruction

Bayesian Posterior Approximation With Stochastic Ensembles

PROB: Probabilistic Objectness for Open World Object Detection

Consistent View Synthesis With Pose-Guided Diffusion Models

Guided Depth Super-Resolution by Deep Anisotropic Diffusion

Robust Mean Teacher for Continual and Gradual Test-Time Adaptation

itKD: Interchange Transfer-Based Knowledge Distillation for 3D Object Detection

Synthesizing Photorealistic Virtual Humans Through Cross-Modal Disentanglement

EXCALIBUR: Encouraging and Evaluating Embodied Exploration

Freestyle Layout-to-Image Synthesis

Marching-Primitives: Shape Abstraction From Signed Distance Function

3D Concept Learning and Reasoning From Multi-View Images

Towards End-to-End Generative Modeling of Long Videos With Memory-Efficient Bidirectional Transformers

Discriminative Co-Saliency and Background Mining Transformer for Co-Salient Object Detection

Gated Multi-Resolution Transfer Network for Burst Restoration and Enhancement

Vita-CLIP: Video and Text Adaptive CLIP via Multimodal Prompting

Burstormer: Burst Image Restoration and Enhancement Transformer

Bridging Precision and Confidence: A Train-Time Loss for Calibrating Object Detection

PromptCAL: Contrastive Affinity Learning via Auxiliary Prompts for Generalized Novel Category Discovery

DeepMapping2: Self-Supervised Large-Scale LiDAR Map Optimization

OPE-SR: Orthogonal Position Encoding for Designing a Parameter-Free Upsampling Module in Arbitrary-Scale Image Super-Resolution

RelightableHands: Efficient Neural Relighting of Articulated Hand Models

Query-Centric Trajectory Prediction

NICO++: Towards Better Benchmarking for Domain Generalization

Not All Image Regions Matter: Masked Vector Quantization for Autoregressive Image Generation

Fix the Noise: Disentangling Source Feature for Controllable Domain Translation

NS3D: Neuro-Symbolic Grounding of 3D Objects and Relations

All in One: Exploring Unified Video-Language Pre-Training

MarS3D: A Plug-and-Play Motion-Aware Model for Semantic Segmentation on Multi-Scan 3D Point Clouds

PyPose: A Library for Robot Learning With Physics-Based Optimization

Directional Connectivity-Based Segmentation of Medical Images

AMT: All-Pairs Multi-Field Transforms for Efficient Frame Interpolation

DNF: Decouple and Feedback Network for Seeing in the Dark

Observation-Centric SORT: Rethinking SORT for Robust Multi-Object Tracking

Trace and Pace: Controllable Pedestrian Animation via Guided Trajectory Diffusion

Azimuth Super-Resolution for FMCW Radar in Autonomous Driving

Bitstream-Corrupted JPEG Images Are Restorable: Two-Stage Compensation and Alignment Framework for Image Restoration

MagicNet: Semi-Supervised Multi-Organ Segmentation via Magic-Cube Partition and Recovery

Neural Residual Radiance Fields for Streamably Free-Viewpoint Videos

TTA-COPE: Test-Time Adaptation for Category-Level Object Pose Estimation

Masked and Adaptive Transformer for Exemplar Based Image Translation

NAR-Former: Neural Architecture Representation Learning Towards Holistic Attributes Prediction

Boosting Weakly-Supervised Temporal Action Localization With Text Information

Imagic: Text-Based Real Image Editing With Diffusion Models

MEDIC: Remove Model Backdoors via Importance Driven Cloning

VDN-NeRF: Resolving Shape-Radiance Ambiguity via View-Dependence Normalization

Feature Separation and Recalibration for Adversarial Robustness

Learning Visual Representations via Language-Guided Sampling

Cascaded Local Implicit Transformer for Arbitrary-Scale Super-Resolution

NIPQ: Noise Proxy-Based Integrated Pseudo-Quantization

Stimulus Verification Is a Universal and Effective Sampler in Multi-Modal Human Trajectory Prediction

3DAvatarGAN: Bridging Domains for Personalized Editable Avatars

STMT: A Spatial-Temporal Mesh Transformer for MoCap-Based Action Recognition

Probabilistic Prompt Learning for Dense Prediction

Visibility Constrained Wide-Band Illumination Spectrum Design for Seeing-in-the-Dark

Leapfrog Diffusion Model for Stochastic Trajectory Prediction

EqMotion: Equivariant Multi-Agent Motion Prediction With Invariant Interaction Reasoning

Multiscale Tensor Decomposition and Rendering Equation Encoding for View Synthesis

X-Pruner: eXplainable Pruning for Vision Transformers

MDL-NAS: A Joint Multi-Domain Learning Framework for Vision Transformer

Poly-PC: A Polyhedral Network for Multiple Point Cloud Tasks at Once

Causally-Aware Intraoperative Imputation for Overall Survival Time Prediction

Virtual Sparse Convolution for Multimodal 3D Object Detection

Learning Human-to-Robot Handovers From Point Clouds

UniDexGrasp: Universal Robotic Dexterous Grasping via Learning Diverse Proposal Generation and Goal-Conditioned Policy

PartManip: Learning Cross-Category Generalizable Part Manipulation Policy From Point Cloud Observations

GAPartNet: Cross-Category Domain-Generalizable Object Perception and Manipulation via Generalizable and Actionable Parts

Progressive Disentangled Representation Learning for Fine-Grained Controllable Talking Head Synthesis

UDE: A Unified Driving Engine for Human Motion Generation

Learning Detailed Radiance Manifolds for High-Fidelity and 3D-Consistent Portrait Synthesis From Monocular Image

Hand Avatar: Free-Pose Hand Animation and Rendering From Monocular Video

Spectral Bayesian Uncertainty for Image Super-Resolution

Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline

Biomechanics-Guided Facial Action Unit Detection Through Force Modeling

Understanding and Improving Visual Prompting: A Label-Mapping Perspective

Egocentric Audio-Visual Object Localization

MIC: Masked Image Consistency for Context-Enhanced Domain Adaptation

GFPose: Learning 3D Human Pose Prior With Gradient Fields

Quantum Multi-Model Fitting

EventNeRF: Neural Radiance Fields From a Single Colour Event Camera

Self-Supervised Pre-Training With Masked Shape Prediction for 3D Scene Understanding

Mofusion: A Framework for Denoising-Diffusion-Based Motion Synthesis

CCuantuMM: Cycle-Consistent Quantum-Hybrid Matching of Multiple Shapes

PivoTAL: Prior-Driven Supervision for Weakly-Supervised Temporal Action Localization

Bias-Eliminating Augmentation Learning for Debiased Federated Learning

GALIP: Generative Adversarial CLIPs for Text-to-Image Synthesis

Indiscernible Object Counting in Underwater Scenes

Event-Based Frame Interpolation With Ad-Hoc Deblurring

iDisc: Internal Discretization for Monocular Depth Estimation

Quantitative Manipulation of Custom Attributes on 3D-Aware Image Synthesis

IFSeg: Image-Free Semantic Segmentation via Vision-Language Model

Histopathology Whole Slide Image Analysis With Heterogeneous Graph Representation Learning

Towards Unified Scene Text Spotting Based on Sequence Generation

Meta-Explore: Exploratory Hierarchical Vision-and-Language Navigation Using Scene Object Spectrum Grounding

ReCo: Region-Controlled Text-to-Image Generation

An Empirical Study of End-to-End Video-Language Transformers With Masked Visual Modeling

LAVENDER: Unifying Video-Language Understanding As Masked Language Modeling

Polarimetric iToF: Measuring High-Fidelity Depth Through Scattering Media

Meta-Tuning Loss Functions and Data Augmentation for Few-Shot Object Detection

KD-DLGAN: Data Limited Image Generation via Knowledge Distillation

Multi-Object Manipulation via Object-Centric Neural Scattering Functions

Accidental Light Probes

Randomized Adversarial Training via Taylor Expansion

R2Former: Unified Retrieval and Reranking Transformer for Place Recognition

TopNet: Transformer-Based Object Placement Network for Image Compositing

Natural Language-Assisted Sign Language Recognition

Siamese DETR

Generating Aligned Pseudo-Supervision From Non-Aligned Data for Image Restoration in Under-Display Camera

Aligning Bag of Regions for Open-Vocabulary Object Detection

Nighttime Smartphone Reflective Flare Removal Using Optical Center Symmetry Prior

CelebV-Text: A Large-Scale Facial Text-Video Dataset

Correlational Image Modeling for Self-Supervised Visual Pre-Training

Learning Generative Structure Prior for Blind Text Image Super-Resolution

Cross-Image-Attention for Conditional Embeddings in Deep Metric Learning

Robot Structure Prior Guided Temporal Attention for Camera-to-Robot Pose Estimation From Image Sequence

Distribution Shift Inversion for Out-of-Distribution Prediction

NeuralDome: A Neural Modeling Pipeline on Multi-View Human-Object Interactions

Fine-Tuned CLIP Models Are Efficient Video Learners

Pointersect: Neural Rendering With Cloud-Ray Intersection

Viewpoint Equivariance for Multi-View 3D Object Detection

MobileOne: An Improved One Millisecond Mobile Backbone

Conditional Image-to-Video Generation With Latent Flow Diffusion Models

Scene-Aware Egocentric 3D Human Pose Estimation

ObjectStitch: Object Compositing With Diffusion Model

Towards Open-World Segmentation of Parts

SkyEye: Self-Supervised Bird’s-Eye-View Semantic Mapping Using Monocular Frontal View Images

Semantic-Promoted Debiasing and Background Disambiguation for Zero-Shot Instance Segmentation

DeSTSeg: Segmentation Guided Denoising Student-Teacher for Anomaly Detection

FitMe: Deep Photorealistic 3D Morphable Model Avatars

Regularization of Polynomial Networks for Image Recognition

Improving Fairness in Facial Albedo Estimation via Visual-Textual Cues

MagicPony: Learning Articulated 3D Animals in the Wild

A Strong Baseline for Generalized Few-Shot Semantic Segmentation

Open-Set Likelihood Maximization for Few-Shot Learning

Recognizing Rigid Patterns of Unlabeled Point Clouds by Complete and Continuous Isometry Invariants With No False Negatives and No False Positives

Hierarchical Prompt Learning for Multi-Task Learning

Teaching Structured Vision & Language Concepts to Vision & Language Models

Few-Shot Learning With Visual Distribution Calibration and Cross-Modal Distribution Alignment

Explaining Image Classifiers With Multiscale Directional Image Representation

UTM: A Unified Multiple Object Tracking Model With Identity-Aware Feature Enhancement

PlaneDepth: Self-Supervised Depth Estimation via Orthogonal Planes

Guiding Pseudo-Labels With Uncertainty Estimation for Source-Free Unsupervised Domain Adaptation

Learning To Exploit Temporal Structure for Biomedical Vision–Language Processing

BEV@DC: Bird’s-Eye View Assisted Training for Depth Completion

Robust Single Image Reflection Removal Against Adversarial Attacks

Dynamic Neural Network for Multi-Task Learning Searching Across Diverse Network Topologies

Change-Aware Sampling and Contrastive Learning for Satellite Images

PaCa-ViT: Learning Patch-to-Cluster Attention in Vision Transformers

Level-S$^2$fM: Structure From Motion on Neural Level Set of Implicit Surfaces

How To Prevent the Continuous Damage of Noises To Model Training?

SMAE: Few-Shot Learning for HDR Deghosting With Saturation-Aware Masked Autoencoders

A Unified HDR Imaging Method With Pixel and Patch Level

Boosting Semi-Supervised Learning by Exploiting All Unlabeled Data

Learning Open-Vocabulary Semantic Segmentation Models From Natural Language Supervision

Knowledge Combination To Learn Rotated Detection Without Rotated Annotation

Reliability in Semantic Segmentation: Are We on the Right Track?

K3DN: Disparity-Aware Kernel Estimation for Dual-Pixel Defocus Deblurring

A New Benchmark: On the Utility of Synthetic Data With Blender for Bare Supervised Learning and Downstream Domain Adaptation

DATID-3D: Diversity-Preserved Domain Adaptation Using Text-to-Image Diffusion for 3D Generative Model

Unbiased Multiple Instance Learning for Weakly Supervised Video Anomaly Detection

Bootstrap Your Own Prior: Towards Distribution-Agnostic Novel Class Discovery

Semantic Scene Completion With Cleaner Self

Visual Prompt Multi-Modal Tracking

AstroNet: When Astrocyte Meets Artificial Neural Network

MCF: Mutual Correction Framework for Semi-Supervised Medical Image Segmentation

3D Cinemagraphy From a Single Image

Data-Free Knowledge Distillation via Feature Exchange and Activation Region Constraint

Decomposed Cross-Modal Distillation for RGB-Based Temporal Action Detection

DiffusionRig: Learning Personalized Priors for Facial Appearance Editing

BiasBed – Rigorous Texture Bias Evaluation

Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens

Ref-NPR: Reference-Based Non-Photorealistic Radiance Fields for Controllable Scene Stylization

CodeTalker: Speech-Driven 3D Facial Animation With Discrete Motion Prior

A Bag-of-Prototypes Representation for Dataset-Level Applications

Focused and Collaborative Feedback Integration for Interactive Image Segmentation

(ML)$^2$P-Encoder: On Exploration of Channel-Class Correlation for Multi-Label Zero-Shot Learning

Decomposed Soft Prompt Guided Fusion Enhancing for Compositional Zero-Shot Learning

Sampling Is Matter: Point-Guided 3D Human Mesh Reconstruction

Super-Resolution Neural Operator

RefTeacher: A Strong Baseline for Semi-Supervised Referring Expression Comprehension

JacobiNeRF: NeRF Shaping With Mutual Information Gradients

Category Query Learning for Human-Object Interaction Classification

Collaboration Helps Camera Overtake LiDAR in 3D Detection

Meta Omnium: A Benchmark for General-Purpose Learning-To-Learn

ARCTIC: A Dataset for Dexterous Bimanual Hand-Object Manipulation

Hi4D: 4D Instance Segmentation of Close Human Interaction

Towards Better Stability and Adaptability: Improve Online Self-Training for Model Adaptation in Semantic Segmentation

MixSim: A Hierarchical Framework for Mixed Reality Traffic Simulation

Implicit Occupancy Flow Fields for Perception and Prediction in Self-Driving

Joint Appearance and Motion Learning for Efficient Rolling Shutter Correction

A Generalized Framework for Video Instance Segmentation

Soft-Landing Strategy for Alleviating the Task Discrepancy Problem in Temporal Action Localization Tasks

Class Prototypes Based Contrastive Learning for Classifying Multi-Label and Fine-Grained Educational Videos

Revisiting Residual Networks for Adversarial Robustness

Mitigating Task Interference in Multi-Task Learning via Explicit Task Routing With Non-Learnable Primitives

Two-Shot Video Object Segmentation

HouseDiffusion: Vector Floorplan Generation via a Diffusion Model With Discrete and Continuous Denoising

Panoptic Compositional Feature Field for Editable Scene Rendering With Network-Inferred Labels via Metric Learning

A Characteristic Function-Based Method for Bottom-Up Human Pose Estimation

Energy-Efficient Adaptive 3D Sensing

Consistent Direct Time-of-Flight Video Depth Super-Resolution

DETRs With Hybrid Matching

Gradient Norm Aware Minimization Seeks First-Order Flatness and Improves Generalization

EcoTTA: Memory-Efficient Continual Test-Time Adaptation via Self-Distilled Regularization

Progressive Random Convolutions for Single Domain Generalization

AutoLabel: CLIP-Based Framework for Open-Set Video Domain Adaptation

3D Line Mapping Revisited

DeepLSD: Line Segment Detection and Refinement With Deep Image Gradients

Multi-View Adversarial Discriminator: Mine the Non-Causal Factors for Object Detection in Unseen Domains

SegLoc: Learning Segmentation-Based Representations for Privacy-Preserving Visual Localization

RiDDLE: Reversible and Diversified De-Identification With Latent Encryptor

OpenScene: 3D Scene Understanding With Open Vocabularies

Nerflets: Local Radiance Fields for Efficient Structure-Aware 3D Scene Representation From 2D Supervision

MobileNeRF: Exploiting the Polygon Rasterization Pipeline for Efficient Neural Field Rendering on Mobile Architectures

Learning Emotion Representations From Verbal and Nonverbal Communication

Understanding Masked Autoencoders via Hierarchical Latent Variable Models

Iterative Vision-and-Language Navigation

Relational Context Learning for Human-Object Interaction Detection

ViewNet: A Novel Projection-Based Backbone With View Pooling for Few-Shot Point Cloud Classification

SDFusion: Multimodal 3D Shape Completion, Reconstruction, and Generation

Understanding the Robustness of 3D Object Detection With Bird’s-Eye-View Representations in Autonomous Driving

Human Pose Estimation in Extremely Low-Light Conditions

Boundary Unlearning: Rapid Forgetting of Deep Networks via Shifting the Decision Boundary

Sliced Optimal Partial Transport

TruFor: Leveraging All-Round Clues for Trustworthy Image Forgery Detection and Localization

Pseudo-Label Guided Contrastive Learning for Semi-Supervised Medical Image Segmentation

SINE: SINgle Image Editing With Text-to-Image Diffusion Models

Leveraging per Image-Token Consistency for Vision-Language Pre-Training

SeSDF: Self-Evolved Signed Distance Field for Implicit 3D Clothed Human Reconstruction

Block Selection Method for Using Feature Norm in Out-of-Distribution Detection

Relightable Neural Human Assets From Multi-View Gradient Illuminations

Markerless Camera-to-Robot Pose Estimation via Self-Supervised Sim-to-Real Transfer

DA-DETR: Domain Adaptive Detection Transformer With Information Fusion

Towards Efficient Use of Multi-Scale Features in Transformer-Based Object Detectors

Understanding and Constructing Latent Modality Structures in Multi-Modal Representation Learning

NeRF in the Palm of Your Hand: Corrective Augmentation for Robotics via Novel-View Synthesis

HaLP: Hallucinating Latent Positives for Skeleton-Based Self-Supervised Learning of Actions

Better “CMOS” Produces Clearer Images: Learning Space-Variant Blur Estimation for Blind Image Super-Resolution

Finding Geometric Models by Clustering in the Consensus Space

3D-POP – An Automated Annotation Approach to Facilitate Markerless 2D-3D Tracking of Freely Moving Birds With Marker-Based Motion Capture

TimeBalance: Temporally-Invariant and Temporally-Distinctive Video Representations for Semi-Supervised Action Recognition

3D Human Mesh Estimation From Virtual Markers

Rethinking Feature-Based Knowledge Distillation for Face Recognition

Chat2Map: Efficient Scene Mapping From Multi-Ego Conversations

Novel-View Acoustic Synthesis

High-Resolution Image Reconstruction With Latent Diffusion Models From Human Brain Activity

Zero-Shot Referring Image Segmentation With Global-Local Context Features

AVFormer: Injecting Vision Into Frozen Speech Models for Zero-Shot AV-ASR

Learning Attention As Disentangler for Compositional Zero-Shot Learning

Mask-Free OVIS: Open-Vocabulary Instance Segmentation Without Manual Mask Annotations

SE-ORNet: Self-Ensembling Orientation-Aware Network for Unsupervised Point Cloud Shape Correspondence

Adaptive Spot-Guided Transformer for Consistent Local Feature Matching

D2Former: Jointly Learning Hierarchical Detectors and Contextual Descriptors via Agent-Based Transformers

Neural Koopman Pooling: Control-Inspired Temporal Dynamics Encoding for Skeleton-Based Action Recognition

StyleGAN Salon: Multi-View Latent Optimization for Pose-Invariant Hairstyle Transfer

Box-Level Active Detection

Neural Scene Chronology

DynIBaR: Neural Dynamic Image-Based Rendering

Omnimatte3D: Associating Objects and Their Effects in Unconstrained Monocular Video

Controllable Light Diffusion for Portraits

TrojViT: Trojan Insertion in Vision Transformers

WinCLIP: Zero-/Few-Shot Anomaly Classification and Segmentation

VolRecon: Volume Rendering of Signed Ray Distance Functions for Generalizable Multi-View Reconstruction

DSVT: Dynamic Sparse Voxel Transformer With Rotated Sets

Improving Robustness of Vision Transformers by Reducing Sensitivity To Patch Corruptions

PolyFormer: Referring Image Segmentation As Sequential Polygon Generation

Affordances From Human Videos as a Versatile Representation for Robotics

Improving Visual Grounding by Encouraging Consistent Gradient-Based Explanations

The Differentiable Lens: Compound Lens Search Over Glass Surfaces and Materials for Object Detection

Thermal Spread Functions (TSF): Physics-Guided Material Classification

WIRE: Wavelet Implicit Neural Representations

BEVFormer v2: Adapting Modern Image Backbones to Bird’s-Eye-View Recognition via Perspective Supervision

DrapeNet: Garment Generation and Self-Supervised Draping

3D Spatial Multimodal Knowledge Accumulation for Scene Graph Prediction in Point Cloud

Integrally Pre-Trained Transformer Pyramid Networks

DeepSolo: Let Transformer Decoder With Explicit Points Solo for Text Spotting

Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning

Optimal Transport Minimization: Crowd Localization on Density Maps for Semi-Supervised Counting

SAP-DETR: Bridging the Gap Between Salient Points and Queries-Based Transformer Detector for Fast Model Convergency

REVEAL: Retrieval-Augmented Visual-Language Pre-Training With Multi-Source Multimodal Knowledge Memory

Blur Interpolation Transformer for Real-World Motion From Blur

High-Fidelity Event-Radiance Recovery via Transient Event Frequency

Learning Event Guided High Dynamic Range Video Reconstruction

Bi-LRFusion: Bi-Directional LiDAR-Radar Fusion for 3D Dynamic Object Detection

Diffusion Video Autoencoders: Toward Temporally Consistent Face Video Editing via Disentangled Video Encoding

DexArt: Benchmarking Generalizable Dexterous Manipulation With Articulated Objects

Collaborative Static and Dynamic Vision-Language Streams for Spatio-Temporal Video Grounding

PersonNeRF: Personalized Reconstruction From Photo Collections

Teacher-Generated Spatial-Attention Labels Boost Robustness and Accuracy of Contrastive Models

Superclass Learning With Representation Enhancement

3D-Aware Multi-Class Image-to-Image Translation With NeRFs

Towards Unsupervised Object Detection From LiDAR Point Clouds

Unbalanced Optimal Transport: A Unified Framework for Object Detection

ORCa: Glossy Objects As Radiance-Field Cameras

Role of Transients in Two-Bounce Non-Line-of-Sight Imaging

Improving Table Structure Recognition With Visual-Alignment Sequential Coordinate Modeling

Class Relationship Embedded Learning for Source-Free Unsupervised Domain Adaptation

A Rotation-Translation-Decoupled Solution for Robust and Efficient Visual-Inertial Initialization

Towards Domain Generalization for Multi-View 3D Object Detection in Bird-Eye-View

Visual Recognition-Driven Image Restoration for Multiple Degradation With Intrinsic Semantics Recovery

Ingredient-Oriented Multi-Degradation Learning for Image Restoration

Ultra-High Resolution Segmentation With Ultra-Rich Context: A Novel Benchmark

Learning Sample Relationship for Exposure Correction

gSDF: Geometry-Driven Signed Distance Functions for 3D Hand-Object Reconstruction

EVA: Exploring the Limits of Masked Visual Representation Learning at Scale

PointConvFormer: Revenge of the Point-Based Convolution

Compression-Aware Video Super-Resolution

Mask-Free Video Instance Segmentation

Learning Accurate 3D Shape Based on Stereo Polarimetric Imaging

MARLIN: Masked Autoencoder for Facial Video Representation LearnINg

CafeBoost: Causal Feature Boost To Eliminate Task-Induced Bias for Class Incremental Learning

EfficientSCI: Densely Connected Network With Space-Time Factorization for Large-Scale Video Snapshot Compressive Imaging

Auto-CARD: Efficient and Robust Codec Avatar Driving for Real-Time Mobile Telepresence

Siamese Image Modeling for Self-Supervised Vision Representation Learning

Learning To Retain While Acquiring: Combating Distribution-Shift in Adversarial Data-Free Knowledge Distillation

Ground-Truth Free Meta-Learning for Deep Compressive Sampling

Neumann Network With Recursive Kernels for Single Image Defocus Deblurring

Context-Aware Relative Object Queries To Unify Video Instance and Panoptic Segmentation

Aligning Step-by-Step Instructional Diagrams to Video Demonstrations

High-Fidelity Guided Image Synthesis With Latent Diffusion Models

Procedure-Aware Pretraining for Instructional Video Understanding

Neural Part Priors: Learning To Optimize Part-Based Object Completion in RGB-D Scans

Hierarchical Video-Moment Retrieval and Step-Captioning

Generative Semantic Segmentation

FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks

Efficient Movie Scene Detection Using State-Space Transformers

Neuralangelo: High-Fidelity Neural Surface Reconstruction

Sparse Multi-Modal Graph Transformer With Shared-Context Processing for Representation Learning of Giga-Pixel Images

Spider GAN: Leveraging Friendly Neighbors To Accelerate GAN Training

ASPnet: Action Segmentation With Shared-Private Representation of Multiple Data Sources

Overlooked Factors in Concept-Based Explanations: Dataset Choice, Concept Learnability, and Human Capability

Multiple Instance Learning via Iterative Self-Paced Supervised Contrastive Learning

Model Barrier: A Compact Un-Transferable Isolation Domain for Model Intellectual Property Protection

FLEX: Full-Body Grasping Without Full-Body Grasps

A Soma Segmentation Benchmark in Full Adult Fly Brain

NeuMap: Neural Coordinate Mapping by Auto-Transdecoder for Camera Localization

Doubly Right Object Recognition: A Why Prompt for Visual Rationales

Contrastive Semi-Supervised Learning for Underwater Image Restoration via Reliable Bank

Adaptive Human Matting for Dynamic Videos

Hi-LASSIE: High-Fidelity Articulated Shape and Skeleton Discovery From Sparse Image Ensemble

Pix2map: Cross-Modal Retrieval for Inferring Street Maps From Images

NewsNet: A Novel Dataset for Hierarchical Temporal Segmentation

Large-Capacity and Flexible Video Steganography via Invertible Neural Network

PVO: Panoptic Visual Odometry

Infinite Photorealistic Worlds Using Procedural Generation

3D Semantic Segmentation in the Wild: Learning Generalized Models for Adverse-Condition Point Clouds

Virtual Occlusions Through Implicit Depth

Improving Zero-Shot Generalization and Robustness of Multi-Modal Models

StarCraftImage: A Dataset for Prototyping Spatial Reasoning Methods for Multi-Agent Environments

DistilPose: Tokenized Pose Regression With Heatmap Distillation

LayoutDiffusion: Controllable Diffusion Model for Layout-to-Image Generation

VectorFloorSeg: Two-Stream Graph Attention Network for Vectorized Roughcast Floorplan Segmentation

Progressive Transformation Learning for Leveraging Virtual Images in Training

OCELOT: Overlapped Cell on Tissue Dataset for Histopathology

Self-Supervised Image-to-Point Distillation via Semantically Tolerant Contrastive Loss

BlackVIP: Black-Box Visual Prompting for Robust Transfer Learning

Super-CLEVR: A Virtual Benchmark To Diagnose Domain Robustness in Visual Reasoning

A-Cap: Anticipation Captioning With Commonsense Knowledge

NoisyTwins: Class-Consistent and Diverse Image Generation Through StyleGANs

Semi-Supervised Parametric Real-World Image Harmonization

ALTO: Alternating Latent Topologies for Implicit 3D Reconstruction

LEGO-Net: Learning Regular Rearrangements of Objects in Rooms

SinGRAF: Learning a 3D Generative Radiance Field for a Single Scene

Camouflaged Instance Segmentation via Explicit De-Camouflaging

DualRel: Semi-Supervised Mitochondria Segmentation From a Prototype Perspective

Rethinking the Correlation in Few-Shot Segmentation: A Buoys View

Proposal-Based Multiple Instance Learning for Weakly-Supervised Temporal Action Localization

Dynamic Generative Targeted Attacks With Pattern Injection

SCOOP: Self-Supervised Correspondence and Optimization-Based Scene Flow

Sound to Visual Scene Generation by Audio-to-Visual Latent Alignment

AVFace: Towards Detailed Audio-Visual 4D Face Reconstruction

Towards Compositional Adversarial Robustness: Generalizing Adversarial Training to Composite Semantic Perturbations

How to Backdoor Diffusion Models?

Heterogeneous Continual Learning

Minimizing Maximum Model Discrepancy for Transferable Black-Box Targeted Attacks

DyNCA: Real-Time Dynamic Texture Synthesis Using Neural Cellular Automata

Learning Correspondence Uncertainty via Differentiable Nonlinear Least Squares

Novel Class Discovery for 3D Point Cloud Semantic Segmentation

Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models

Generative Bias for Robust Visual Question Answering

DF-Platter: Multi-Face Heterogeneous Deepfake Dataset

Scalable, Detailed and Mask-Free Universal Photometric Stereo

Scaling Language-Image Pre-Training via Masking

TempSAL – Uncovering Temporal Information for Deep Saliency Prediction

Spatiotemporal Self-Supervised Learning for Point Clouds in the Wild

LOGO: A Long-Form Video Dataset for Group Action Quality Assessment

Learning Compact Representations for LiDAR Completion and Generation

Divide and Conquer: Answering Questions With Object Factorization and Compositional Reasoning

StyleRF: Zero-Shot 3D Style Transfer of Neural Radiance Fields

Conditional Generation of Audio From Video via Foley Analogies

Learning Semantic Relationship Among Instances for Image-Text Matching

Shepherding Slots to Objects: Towards Stable and Robust Object-Centric Learning

Unsupervised Continual Semantic Adaptation Through Neural Rendering

OrienterNet: Visual Localization in 2D Public Maps With Neural Matching

CASP-Net: Rethinking Video Saliency Prediction From an Audio-Visual Consistency Perceptual Perspective

OpenGait: Revisiting Gait Recognition Towards Better Practicality

LidarGait: Benchmarking 3D Gait Recognition With Point Clouds

OneFormer: One Transformer To Rule Universal Image Segmentation

Graph Transformer GANs for Graph-Constrained House Generation

Text-Guided Unsupervised Latent Transformation for Multi-Attribute Image Manipulation

A Unified Knowledge Distillation Framework for Deep Directed Graphical Models

GANHead: Towards Generative Animatable Neural Head Avatars

MD-VQA: Multi-Dimensional Quality Assessment for UGC Live Videos

Blind Image Quality Assessment via Vision-Language Correspondence: A Multitask Learning Perspective

Real-Time Neural Light Field on Mobile Devices

Unsupervised Volumetric Animation

Make-a-Story: Visual Memory Conditioned Consistent Story Generation

Unknown Sniffer for Object Detection: Don’t Turn a Blind Eye to Unknown Objects

CF-Font: Content Fusion for Few-Shot Font Generation

Combining Implicit-Explicit View Correlation for Light Field Semantic Segmentation

Local Connectivity-Based Density Estimation for Face Clustering

BAAM: Monocular 3D Pose and Shape Reconstruction With Bi-Contextual Attention Module and Attention-Guided Modeling

Autonomous Manipulation Learning for Similar Deformable Objects via Only One Demonstration

Efficient Loss Function by Minimizing the Detrimental Effect of Floating-Point Errors on Gradient-Based Attacks

Efficient RGB-T Tracking via Cross-Modality Distillation

Fair Federated Medical Image Segmentation via Client Contribution Estimation

Revisiting Temporal Modeling for CLIP-Based Image-to-Video Knowledge Transferring

Turning a CLIP Model Into a Scene Text Detector

Canonical Fields: Self-Supervised Learning of Pose-Canonicalized Neural Fields

BUOL: A Bottom-Up Framework With Occupancy-Aware Lifting for Panoptic 3D Scene Reconstruction From a Single Image

Cloud-Device Collaborative Adaptation to Continual Changing Environments in the Real-World

Implicit Diffusion Models for Continuous Super-Resolution

AttriCLIP: A Non-Incremental Learner for Incremental Knowledge Learning

SmartAssign: Learning a Smart Knowledge Assignment Strategy for Deraining and Desnowing

Actionlet-Dependent Contrastive Learning for Unsupervised Skeleton-Based Action Recognition

Multiclass Confidence and Localization Calibration for Object Detection

Long-Term Visual Localization With Mobile Sensors

Efficient and Explicit Modelling of Image Hierarchies for Image Restoration

Bidirectional Copy-Paste for Semi-Supervised Medical Image Segmentation

Class Balanced Adaptive Pseudo Labeling for Federated Semi-Supervised Learning

AutoRecon: Automated 3D Object Discovery and Reconstruction

PiMAE: Point Cloud and Image Interactive Masked Autoencoders for 3D Object Detection

Open-Vocabulary Point-Cloud Object Detection Without 3D Annotation

TensoIR: Tensorial Inverse Rendering

Look Around for Anomalies: Weakly-Supervised Anomaly Detection via Context-Motion Relational Learning

RepMode: Learning to Re-Parameterize Diverse Experts for Subcellular Structure Prediction

NeUDF: Leaning Neural Unsigned Distance Fields With Volume Rendering

NeAT: Learning Neural Implicit Surfaces With Arbitrary Topologies From Multi-View Images

On the Convergence of IRLS and Its Variants in Outlier-Robust Estimation

Metadata-Based RAW Reconstruction via Implicit Neural Functions

Adaptive Patch Deformation for Textureless-Resilient Multi-View Stereo

RobustNeRF: Ignoring Distractors With Robust Losses

DiffCollage: Parallel Generation of Large Content With Diffusion Models

Point Cloud Forecasting as a Proxy for 4D Occupancy Forecasting

Improving Cross-Modal Retrieval With Set of Diverse Embeddings

PointCMP: Contrastive Mask Prediction for Self-Supervised Learning on Point Cloud Videos

3D Human Pose Estimation With Spatio-Temporal Criss-Cross Attention

Mapping Degeneration Meets Label Evolution: Learning Infrared Small Target Detection With Single Point Supervision

Object Pop-Up: Can We Infer 3D Objects and Their Poses From Human Interactions Alone?

Real-Time Controllable Denoising for Image and Video

Probabilistic Debiasing of Scene Graphs

Weak-Shot Object Detection Through Mutual Knowledge Transfer

Beyond Appearance: A Semantic Controllable Self-Supervised Learning Framework for Human-Centric Visual Tasks

Recurrence Without Recurrence: Stable Video Landmark Detection With Deep Equilibrium Models

SPIn-NeRF: Multiview Segmentation and Perceptual Inpainting With Neural Radiance Fields

DeepMAD: Mathematical Architecture Design for Deep Convolutional Neural Network

CutMIB: Boosting Light Field Super-Resolution via Multi-View Image Blending

Adaptive Channel Sparsity for Federated Learning Under System Heterogeneity

Uncertainty-Aware Optimal Transport for Semantically Coherent Out-of-Distribution Detection

ScaleFL: Resource-Adaptive Federated Learning With Heterogeneous Clients

NVTC: Nonlinear Vector Transform Coding

Slimmable Dataset Condensation

HRDFuse: Monocular 360° Depth Estimation by Collaboratively Learning Holistic-With-Regional Depth Distributions

Exploring Intra-Class Variation Factors With Learnable Cluster Prompts for Semi-Supervised Image Synthesis

MetaCLUE: Towards Comprehensive Visual Metaphors Research

Unsupervised 3D Shape Reconstruction by Part Retrieval and Assembly

Event-Based Video Frame Interpolation With Cross-Modal Asymmetric Bidirectional Motion Fields

Weakly Supervised Semantic Segmentation via Adversarial Learning of Classifier and Reconstructor

Cross-Guided Optimization of Radiance Fields With Multi-View Image Super-Resolution for High-Resolution Novel View Synthesis

Reconstructing Animatable Categories From Videos

Removing Objects From Neural Radiance Fields

Planning-Oriented Autonomous Driving

BEDLAM: A Synthetic Dataset of Bodies Exhibiting Detailed Lifelike Animated Motion

Detecting Backdoors in Pre-Trained Encoders

Data-Efficient Large Scale Place Recognition With Graded Similarity Supervision

Back to the Source: Diffusion-Driven Adaptation To Test-Time Corruption

Where Is My Wallet? Modeling Object Proposal Sets for Egocentric Visual Query Localization

VisFusion: Visibility-Aware Online 3D Scene Reconstruction From Videos

CoMFormer: Continual Learning in Semantic and Panoptic Segmentation

NeMo: Learning 3D Neural Motion Fields From Multiple Video Instances of the Same Action

Seeing Beyond the Brain: Conditional Diffusion Model With Sparse Masked Modeling for Vision Decoding

Boosting Video Object Segmentation via Space-Time Correspondence Learning

Exploring and Utilizing Pattern Imbalance

TransFlow: Transformer As Flow Learner

Detecting and Grounding Multi-Modal Media Manipulation

Learning and Aggregating Lane Graphs for Urban Automated Driving

ULIP: Learning a Unified Representation of Language, Images, and Point Clouds for 3D Understanding

FeatER: An Efficient Network for Human Reconstruction via Feature Map-Based TransformER

Text2Scene: Text-Driven Indoor Scene Stylization With Part-Aware Details

GANmouflage: 3D Object Nondetection With Texture Fields

Vision Transformer With Super Token Sampling

Reproducible Scaling Laws for Contrastive Language-Image Learning

Interactive Segmentation of Radiance Fields

V2X-Seq: A Large-Scale Sequential Dataset for Vehicle-Infrastructure Cooperative Perception and Forecasting

GeoMAE: Masked Geometric Target Prediction for Self-Supervised Point Cloud Pre-Training

GP-VTON: Towards General Purpose Virtual Try-On via Collaborative Local-Flow Global-Parsing Learning

One-Shot High-Fidelity Talking-Head Synthesis With Deformable Neural Radiance Field

LSTFE-Net: Long Short-Term Feature Enhancement Network for Video Small Object Detection

Decoupled Multimodal Distilling for Emotion Recognition

Discrete Point-Wise Attack Is Not Enough: Generalized Manifold Adversarial Attack for Face Recognition

Towards All-in-One Pre-Training via Maximizing Multi-Modal Mutual Information

Bi-Directional Distribution Alignment for Transductive Zero-Shot Learning

AGAIN: Adversarial Training With Attribution Span Enlargement and Hybrid Feature Fusion

Promoting Semantic Connectivity: Dual Nearest Neighbors Contrastive Learning for Unsupervised Domain Generalization

Generalized Relation Modeling for Transformer Tracking

3D Video Object Detection With Learnable Object-Centric Global Optimization

Flexible-C$^m$ GAN: Towards Precise 3D Dose Prediction in Radiotherapy

CR-FIQA: Face Image Quality Assessment by Learning Sample Relative Classifiability

Exploiting Completeness and Uncertainty of Pseudo Labels for Weakly Supervised Video Anomaly Detection

Deep Dive Into Gradients: Better Optimization for 3D Object Detection With Gradient-Corrected IoU Supervision

PeakConv: Learning Peak Receptive Field for Radar Semantic Segmentation

TWINS: A Fine-Tuning Framework for Improved Transferability of Adversarial Robustness and Generalization

Learning Video Representations From Large Language Models

Cut and Learn for Unsupervised Object Detection and Instance Segmentation

ImageBind: One Embedding Space To Bind Them All

OmniMAE: Single Model Masked Pretraining on Images and Videos

Universal Instance Perception As Object Discovery and Retrieval

GM-NeRF: Learning Generalizable Model-Based Neural Radiance Fields From Multi-View Images

SeqTrack: Sequence to Sequence Learning for Visual Object Tracking

ARKitTrack: A New Diverse Dataset for Tracking Using Mobile RGB-D Data

Recognizability Embedding Enhancement for Very Low-Resolution Face Recognition and Quality Estimation

Common Pets in 3D: Dynamic New-View Synthesis of Real-Life Deformable Categories

Improving Generalization of Meta-Learning With Inverted Regularization at Inner-Level

Visual Language Pretrained Multiple Instance Zero-Shot Transfer for Histopathology Images

Zero-Shot Noise2Noise: Efficient Image Denoising Without Any Data

Egocentric Auditory Attention Localization in Conversations

Therbligs in Action: Video Understanding Through Motion Primitives

Learning Analytical Posterior Probability for Human Mesh Recovery

Vision Transformers Are Parameter-Efficient Audio-Visual Learners

Perspective Fields for Single Image Camera Calibration

CoralStyleCLIP: Co-Optimized Region and Layer Selection for Image Editing

LEMaRT: Label-Efficient Masked Region Transform for Image Harmonization

Adversarial Normalization: I Can Visualize Everything (ICE)

Gated Stereo: Joint Depth Estimation From Gated and Wide-Baseline Active Stereo Cues

Building Rearticulable Models for Arbitrary 3D Objects From 4D Point Clouds

GazeNeRF: 3D-Aware Gaze Redirection With Neural Radiance Fields

Disentangling Writer and Character Styles for Handwriting Generation

MP-Former: Mask-Piloted Transformer for Image Segmentation

Lite DETR: An Interleaved Multi-Scale Encoder for Efficient DETR

OmniCity: Omnipotent City Understanding With Multi-Level and Multi-View Images

YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors

Learning Procedure-Aware Video Representation From Instructional Videos and Their Narrations

Affordance Grounding From Demonstration Video To Target Image

A Large-Scale Robustness Analysis of Video Action Recognition Models

Solving 3D Inverse Problems Using Pre-Trained 2D Diffusion Models

Parallel Diffusion Models of Operator and Image for Blind Inverse Problems

Transformer-Based Unified Recognition of Two Hands Manipulating Objects

ARO-Net: Learning Implicit Fields From Anchored Radial Observations

PIVOT: Prompting for Video Continual Learning

Rate Gradient Approximation Attack Threats Deep Spiking Neural Networks

ISBNet: A 3D Point Cloud Instance Segmentation Network With Instance-Aware Sampling and Box-Aware Dynamic Convolution

Object Detection With Self-Supervised Scene Adaptation

Megahertz Light Steering Without Moving Parts

SurfelNeRF: Neural Surfel Radiance Fields for Online Photorealistic Reconstruction of Indoor Scenes

Dream3D: Zero-Shot Text-to-3D Synthesis Using 3D Shape Prior and Text-to-Image Diffusion Models

ObjectMatch: Robust Registration Using Canonical Object Correspondences

PanelNet: Understanding 360 Indoor Environment via Panel Representation

Selective Structured State-Spaces for Long-Form Video Understanding

Movies2Scenes: Using Movie Metadata To Learn Scene Representation

PMatch: Paired Masked Image Modeling for Dense Geometric Matching

TeSLA: Test-Time Self-Learning With Automatic Adversarial Augmentation

RIFormer: Keep Your Vision Backbone Effective but Removing Token Mixer

Masked Scene Contrast: A Scalable Framework for Unsupervised 3D Representation Learning

AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning With Masked Autoencoders

3D Human Keypoints Estimation From Point Clouds in the Wild Without Human Labels

ViP3D: End-to-End Visual Trajectory Prediction via 3D Agent Queries

SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer

Neural Pixel Composition for 3D-4D View Synthesis From Multi-Views

Self-Supervised Representation Learning for CAD

Vision Transformers Are Good Mask Auto-Labelers

VoxFormer: Sparse Voxel Transformer for Camera-Based 3D Semantic Scene Completion

Knowledge Distillation for 6D Pose Estimation by Aligning Distributions of Local Predictions

Benchmarking Self-Supervised Learning on Diverse Pathology Datasets

Learning To Fuse Monocular and Multi-View Cues for Multi-Frame Depth Estimation in Dynamic Scenes

VILA: Learning Image Aesthetics From User Comments With Vision-Language Pretraining

Are Deep Neural Networks SMARTer Than Second Graders?

C-SFDA: A Curriculum Learning Aided Self-Training Framework for Efficient Source Free Domain Adaptation

Re2TAL: Rewiring Pretrained Video Backbones for Reversible Temporal Action Localization

Detecting Backdoors During the Inference Stage Based on Corruption Robustness Consistency

BEVHeight: A Robust Framework for Vision-Based Roadside 3D Object Detection

expOSE: Accurate Initialization-Free Projective Factorization Using Exponential Regularization

Unite and Conquer: Plug & Play Multi-Modal Synthesis Using Diffusion Models

Open-Vocabulary Attribute Detection

Preserving Linear Separability in Continual Learning by Backward Feature Projection

GINA-3D: Learning To Generate Implicit Neural Assets in the Wild

Affection: Learning Affective Explanations for Real-World Visual Data

SCADE: NeRFs From Space Carving With Ambiguity-Aware Depth Estimates

Visual Programming: Compositional Visual Reasoning Without Training

Multi-Realism Image Compression With a Conditional Generator

Clothed Human Performance Capture With a Double-Layer Neural Radiance Fields

Learning Weather-General and Weather-Specific Features for Image Restoration Under Multiple Adverse Weather Conditions

H2ONet: Hand-Occlusion-and-Orientation-Aware Network for Real-Time 3D Hand Mesh Reconstruction

Event-Based Blurry Frame Interpolation Under Blind Exposure

Achieving a Better Stability-Plasticity Trade-Off via Auxiliary Networks in Continual Learning

Re-Basin via Implicit Sinkhorn Differentiation

Efficient Scale-Invariant Generator With Column-Row Entangled Pixel Synthesis

3D Video Loops From Asynchronous Input

BASiS: Batch Aligned Spectral Embedding Space

Temporal Interpolation Is All You Need for Dynamic Neural Radiance Fields

DiGA: Distil To Generalize and Then Adapt for Domain Adaptive Semantic Segmentation

Open-Vocabulary Semantic Segmentation With Mask-Adapted CLIP

Mask3D: Pre-Training 2D Vision Transformers by Learning Masked 3D Priors

Privacy-Preserving Representations Are Not Enough: Recovering Scene Content From Camera Poses

PointCert: Point Cloud Classification With Deterministic Certified Robustness Guarantees

Revealing the Dark Secrets of Masked Image Modeling

Human Pose As Compositional Tokens

Fusing Pre-Trained Language Models With Multimodal Prompts Through Reinforcement Learning

Meta Compositional Referring Expression Segmentation

SDC-UDA: Volumetric Unsupervised Domain Adaptation Framework for Slice-Direction Continuous Cross-Modality Medical Image Segmentation

Balanced Spherical Grid for Egocentric View Synthesis

OvarNet: Towards Open-Vocabulary Object Attribute Recognition

AutoAD: Movie Description in Context

Visual Recognition by Request

Wavelet Diffusion Models Are Fast and Scalable Image Generators

HOLODIFFUSION: Training a 3D Diffusion Model Using 2D Images

TopDiG: Class-Agnostic Topological Directional Graph Extraction From Remote Sensing Images

Progressive Backdoor Erasing via Connecting Backdoor and Adversarial Attacks

Normalizing Flow Based Feature Synthesis for Outlier-Aware Object Detection

Side Adapter Network for Open-Vocabulary Semantic Segmentation

TOPLight: Lightweight Neural Networks With Task-Oriented Pretraining for Visible-Infrared Recognition

PIP-Net: Patch-Based Intuitive Prototypes for Interpretable Image Classification

Blemish-Aware and Progressive Face Retouching With Limited Paired Data

Self-Guided Diffusion Models

Leveraging Temporal Context in Low Representational Power Regimes

Multi-Granularity Archaeological Dating of Chinese Bronze Dings Based on a Knowledge-Guided Relation Graph

Depth Estimation From Indoor Panoramas With Neural Scene Representation

Progressive Spatio-Temporal Alignment for Efficient Event-Based Motion Estimation

Learning Expressive Prompting With Residuals for Vision Transformers

Sharpness-Aware Gradient Matching for Domain Generalization

UV Volumes for Real-Time Rendering of Editable Free-View Human Performance

Local-to-Global Registration for Bundle-Adjusting Neural Radiance Fields

Weakly Supervised Video Emotion Detection and Prediction via Cross-Modal Temporal Erasing Network

BEV-LaneDet: An Efficient 3D Lane Detection Based on Virtual Camera via Key-Points

Text-Visual Prompting for Efficient 2D Temporal Video Grounding

NIKI: Neural Inverse Kinematics With Invertible Neural Networks for 3D Human Pose and Shape Estimation

Learning Transferable Spatiotemporal Representations From Natural Script Knowledge

Diffusion-Based Signed Distance Fields for 3D Shape Generation

HDR Imaging With Spatially Varying Signal-to-Noise Ratios

ConvNeXt V2: Co-Designing and Scaling ConvNets With Masked Autoencoders

Audio-Visual Grouping Network for Sound Localization From Mixtures

Self-Supervised Learning From Images With a Joint-Embedding Predictive Architecture

Shortcomings of Top-Down Randomization-Based Sanity Checks for Evaluations of Deep Neural Network Explanations

Structured Kernel Estimation for Photon-Limited Deconvolution

Hard Patches Mining for Masked Image Modeling

Multi-Mode Online Knowledge Distillation for Self-Supervised Visual Representation Learning

Decentralized Learning With Multi-Headed Distillation

Open-World Multi-Task Control Through Goal-Aware Representation Learning and Adaptive Horizon Prediction

Learning Transformation-Predictive Representations for Detection and Description of Local Features

Graph Representation for Order-Aware Visual Transformation

MoDi: Unconditional Motion Synthesis From Diverse Data

PSVT: End-to-End Multi-Person 3D Pose and Shape Estimation With Progressive Video Transformers

Style Projected Clustering for Domain Generalized Semantic Segmentation

Learning Steerable Function for Efficient Image Resampling

Enhanced Multimodal Representation Learning With Cross-Modal KD

Improving Graph Representation for Point Cloud Segmentation via Attentive Filtering

BoxTeacher: Exploring High-Quality Pseudo Labels for Weakly Supervised Instance Segmentation

Zero-Shot Dual-Lens Super-Resolution

Toward RAW Object Detection: A New Benchmark and a New Model

MAGVIT: Masked Generative Video Transformer

Continuous Landmark Detection With 3D Queries

ToThePoint: Efficient Contrastive Learning of 3D Point Clouds via Recycling

FashionSAP: Symbols and Attributes Prompt for Fine-Grained Fashion Vision-Language Pre-Training

Neural Voting Field for Camera-Space 3D Hand Pose Estimation

On Calibrating Semantic Segmentation Models: Analyses and an Algorithm

Multimodal Prompting With Missing Modalities for Visual Recognition

The Dialog Must Go On: Improving Visual Dialog via Generative Self-Training

MetaPortrait: Identity-Preserving Talking Head Generation With Fast Personalized Adaptation

Perception and Semantic Aware Regularization for Sequential Confidence Calibration

Trainable Projected Gradient Method for Robust Fine-Tuning

HAAV: Hierarchical Aggregation of Augmented Views for Image Captioning

Masked Images Are Counterfactual Samples for Robust Fine-Tuning

SparseFusion: Distilling View-Conditioned Diffusion for 3D Reconstruction

One-Shot Model for Mixed-Precision Quantization

The Resource Problem of Using Linear Layer Leakage Attack in Federated Learning

OCTET: Object-Aware Counterfactual Explanations

MixMAE: Mixed and Masked Autoencoder for Efficient Pretraining of Hierarchical Vision Transformers

Shakes on a Plane: Unsupervised Depth Estimation From Unstabilized Photography

Hubs and Hyperspheres: Reducing Hubness and Improving Transductive Few-Shot Learning With Hyperspherical Embeddings

On the Effects of Self-Supervision and Contrastive Alignment in Deep Multi-View Clustering

DeepVecFont-v2: Exploiting Transformers To Synthesize Vector Fonts With Higher Quality

Blowing in the Wind: CycleNet for Human Cinemagraphs From Still Images

Reconstructing Signing Avatars From Video Using Linguistic Priors

Four-View Geometry With Unknown Radial Distortion

Both Style and Distortion Matter: Dual-Path Unsupervised Domain Adaptation for Panoramic Semantic Segmentation

PoseFormerV2: Exploring Frequency Domain for Efficient and Robust 3D Human Pose Estimation

DiGeo: Discriminative Geometry-Aware Learning for Generalized Few-Shot Object Detection

Fuzzy Positive Learning for Semi-Supervised Semantic Segmentation

Out-of-Distributed Semantic Pruning for Robust Semi-Supervised Learning

Domain Generalized Stereo Matching via Hierarchical Visual Transformation

Quality-Aware Pre-Trained Models for Blind Image Quality Assessment

Fine-Grained Audible Video Description

Modeling the Distributional Uncertainty for Salient Object Detection Models

Masked Representation Learning for Domain Generalized Stereo Matching

Physically Realizable Natural-Looking Clothing Textures Evade Person Detectors via 3D Modeling

Decoupling MaxLogit for Out-of-Distribution Detection

Federated Learning With Data-Agnostic Distribution Fusion

OVTrack: Open-Vocabulary Multiple Object Tracking

CaPriDe Learning: Confidential and Private Decentralized Learning Based on Encryption-Friendly Distillation Loss

StyLess: Boosting the Transferability of Adversarial Examples

HOICLIP: Efficient Knowledge Transfer for HOI Detection With Vision-Language Models

HandsOff: Labeled Dataset Generation With No Additional Human Annotations

Feature Shrinkage Pyramid for Camouflaged Object Detection With Transformers

Improving Visual Representation Learning Through Perceptual Understanding

Automatic High Resolution Wire Segmentation and Removal

PixHt-Lab: Pixel Height Based Light Effect Generation for Image Compositing

Visual Atoms: Pre-Training Vision Transformers With Sinusoidal Waves

Zero-Shot Generative Model Adaptation via Image-Specific Prompt Learning

CoWs on Pasture: Baselines and Benchmarks for Language-Driven Zero-Shot Object Navigation

Enhancing Deformable Local Features by Jointly Learning To Detect and Describe Keypoints

PRISE: Demystifying Deep Lucas-Kanade With Strongly Star-Convex Constraints for Multimodel Image Alignment

G-MSM: Unsupervised Multi-Shape Matching With Graph-Based Affinity Priors

Power Bundle Adjustment for Large-Scale 3D Reconstruction

Behind the Scenes: Density Fields for Single View Reconstruction

Rawgment: Noise-Accounted RAW Augmentation Enables Recognition in a Wide Variety of Environments

Relational Space-Time Query in Long-Form Videos

Semidefinite Relaxations for Robust Multiview Triangulation

Adjustment and Alignment for Unbiased Open Set Domain Adaptation

Learning Federated Visual Prompt in Null Space for MRI Reconstruction

Domain Expansion of Image Generators

NULL-Text Inversion for Editing Real Images Using Guided Diffusion Models

Backdoor Defense via Deconfounded Representation Learning

Complementary Intrinsics From Neural Radiance Fields and CNNs for Outdoor Scene Relighting

HumanGen: Generating Human Radiance Fields With Explicit Priors

NeRDi: Single-View NeRF Synthesis With Language-Guided Diffusion As General Image Priors

SPARF: Neural Radiance Fields From Sparse and Noisy Poses

Devil’s on the Edges: Selective Quad Attention for Scene Graph Generation

Finetune Like You Pretrain: Improved Finetuning of Zero-Shot Vision Models

REC-MV: REconstructing 3D Dynamic Cloth From Monocular Videos

MM-3DScene: 3D Scene Understanding by Customizing Masked Modeling With Informative-Preserved Reconstruction and Self-Distilled Consistency

MVImgNet: A Large-Scale Dataset of Multi-View Images

UniSim: A Neural Closed-Loop Sensor Simulator

SFD2: Semantic-Guided Feature Detection and Description

Towards Effective Visual Representations for Partial-Label Learning

ReLight My NeRF: A Dataset for Novel View Synthesis and Relighting of Real World Objects

HARP: Personalized Hand Reconstruction From a Monocular RGB Video

Making Vision Transformers Efficient From a Token Sparsification View

MIST: Multi-Modal Iterative Spatial-Temporal Transformer for Long-Form Video Question Answering

Position-Guided Text Prompt for Vision-Language Pre-Training

Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning With Multimodal Models

Polarized Color Image Denoising

Multi Domain Learning for Motion Magnification

SeaThru-NeRF: Neural Radiance Fields in Scattering Media

DIFu: Depth-Guided Implicit Function for Clothed Human Reconstruction

Panoptic Lifting for 3D Scene Understanding With Neural Fields

DualVector: Unsupervised Vector Font Synthesis With Dual-Part Representation

SVGformer: Representation Learning for Continuous Vector Graphics Using Transformers

GeoVLN: Learning Geometry-Enhanced Visual Representation With Slot Attention for Vision-and-Language Navigation

MAIR: Multi-View Attention Inverse Rendering With 3D Spatially-Varying Lighting Estimation

Learning 3D Scene Priors With 2D Supervision

ProTéGé: Untrimmed Pretraining for Video Temporal Grounding by Video Temporal Grounding

Video Compression With Entropy-Constrained Neural Representations

Learning From Unique Perspectives: User-Aware Saliency Modeling

Learning 3D Representations From 2D Pre-Trained Models via Image-to-Point Masked Autoencoders

Starting From Non-Parametric Networks for 3D Point Cloud Analysis

NeRFLix: High-Quality Neural View Synthesis by Learning a Degradation-Driven Inter-Viewpoint MiXer

Point2Pix: Photo-Realistic Point Cloud Rendering via Neural Radiance Fields

TriVol: Point Cloud Rendering via Triple Volumes

DR2: Diffusion-Based Robust Degradation Remover for Blind Face Restoration

ABLE-NeRF: Attention-Based Rendering With Learnable Embeddings for Neural Radiance Field

Exploring Structured Semantic Prior for Multi Label Recognition With Incomplete Labels

LG-BPN: Local and Global Blind-Patch Network for Self-Supervised Real-World Denoising

CLIP2Scene: Towards Label-Efficient 3D Scene Understanding by CLIP

FlexiViT: One Model for All Patch Sizes

CLIPPO: Image-and-Language Understanding From Pixels Only

DistractFlow: Improving Optical Flow Estimation via Realistic Distractions and Pseudo-Labeling

BUFFER: Balancing Accuracy, Efficiency, and Generalizability in Point Cloud Registration

DivClust: Controlling Diversity in Deep Clustering

On Data Scaling in Masked Image Modeling

Masked Image Training for Generalizable Deep Image Denoising

ScaleKD: Distilling Scale-Aware Knowledge in Small Object Detector

Shape-Aware Text-Driven Layered Video Editing

Generalizable Implicit Neural Representations via Instance Pattern Composers

Behavioral Analysis of Vision-and-Language Navigation Agents

HierVL: Learning Hierarchical Video-Language Embeddings

Learning Geometry-Aware Representations by Sketching

Pruning Parameterization With Bi-Level Optimization for Efficient Semantic Segmentation on the Edge

Unsupervised Deep Probabilistic Approach for Partial Point Cloud Registration

StyleIPSB: Identity-Preserving Semantic Basis of StyleGAN for High Fidelity Face Swapping

Federated Domain Generalization With Generalization Adjustment

STMixer: A One-Stage Sparse Action Detector

Learning Discriminative Representations for Skeleton Based Action Recognition

On the Effectiveness of Partial Variance Reduction in Federated Learning With Heterogeneous Data

Seeing With Sound: Long-Range Acoustic Beamforming for Multimodal Scene Understanding

Unicode Analogies: An Anti-Objectivist Visual Reasoning Challenge

Slide-Transformer: Hierarchical Vision Transformer With Local Self-Attention

Dual-Bridging With Adversarial Noise Generation for Domain Adaptive rPPG Estimation

L-CoIns: Language-Based Colorization With Instance Awareness

Diversity-Aware Meta Visual Prompting

Tunable Convolutions With Parametric Multi-Loss Optimization

Efficient View Synthesis and 3D-Based Multi-Frame Denoising With Multiplane Feature Representations

An Image Quality Assessment Dataset for Portraits

FREDOM: Fairness Domain Adaptation Approach to Semantic Scene Understanding

High-Fidelity Generalized Emotional Talking Face Generation With Multi-Modal Emotion Space Learning

MonoATT: Online Monocular 3D Object Detection With Adaptive Token Transformer

Learning Visibility Field for Detailed 3D Human Reconstruction and Relighting

Complexity-Guided Slimmable Decoder for Efficient Deep Video Compression

Glocal Energy-Based Learning for Few-Shot Open-Set Recognition

MACARONS: Mapping and Coverage Anticipation With RGB Online Self-Supervision

Conjugate Product Graphs for Globally Optimal 2D-3D Shape Matching

Real-Time Multi-Person Eyeblink Detection in the Wild for Untrimmed Video

RUST: Latent Neural Scene Representations From Unposed Imagery

Humans As Light Bulbs: 3D Human Reconstruction From Thermal Reflection

What You Can Reconstruct From a Shadow

Uncertainty-Aware Vision-Based Metric Cross-View Geolocalization

Stare at What You See: Masked Image Modeling Without Reconstruction

Network-Free, Unsupervised Semantic Segmentation With Synthetic Images

Q-DETR: An Efficient Low-Bit Quantized Detection Transformer

Progressively Optimized Local Radiance Fields for Robust View Synthesis

Hierarchical Neural Memory Network for Low Latency Event Processing

Attention-Based Point Cloud Edge Sampling

Initialization Noise in Image Gradients and Saliency Maps

A Light Touch Approach to Teaching Transformers Multi-View Geometry

Active Finetuning: Exploiting Annotation Budget in the Pretraining-Finetuning Paradigm

DynamicStereo: Consistent Dynamic Depth From Stereo Videos

RealFusion: 360° Reconstruction of Any Object From a Single Image

PC2: Projection-Conditioned Point Cloud Diffusion for Single-Image 3D Reconstruction

Learning Conditional Attributes for Compositional Zero-Shot Learning

Masked Autoencoders Enable Efficient Knowledge Distillers

DINN360: Deformable Invertible Neural Network for Latitude-Aware 360° Image Rescaling

One-Stage 3D Whole-Body Mesh Recovery With Component Aware Transformer

Human-Art: A Versatile Human-Centric Dataset Bridging Natural and Artificial Scenes

Optimization-Inspired Cross-Attention Transformer for Compressive Sensing

Understanding Imbalanced Semantic Segmentation Through Neural Collapse

Hierarchical Dense Correlation Distillation for Few-Shot Segmentation

Transformer-Based Learned Optimization

NeuFace: Realistic 3D Neural Face Rendering From Multi-View Images

Residual Degradation Learning Unfolding Framework With Mixing Priors Across Spectral and Spatial for Compressive Spectral Imaging

Revisiting Rolling Shutter Bundle Adjustment: Toward Accurate and Fast Solution

SMPConv: Self-Moving Point Representations for Continuous Convolution

Diffusion-Based Generation, Optimization, and Planning in 3D Scenes

LayoutDM: Transformer-Based Diffusion Model for Layout Generation

Decoupling-and-Aggregating for Image Exposure Correction

JAWS: Just a Wild Shot for Cinematic Transfer in Neural Radiance Fields

SplineCam: Exact Visualization and Characterization of Deep Network Geometry and Decision Boundaries

Learning To Measure the Point Cloud Reconstruction Loss in a Representation Space

Learning Neural Proto-Face Field for Disentangled 3D Face Modeling in the Wild

Alias-Free Convnets: Fractional Shift Invariance via Polynomial Activations

ACL-SPC: Adaptive Closed-Loop System for Self-Supervised Point Cloud Completion

FEND: A Future Enhanced Distribution-Aware Contrastive Learning Framework for Long-Tail Trajectory Prediction

Annealing-Based Label-Transfer Learning for Open World Object Detection

Instance-Aware Domain Generalization for Face Anti-Spoofing

Geometric Visual Similarity Learning in 3D Medical Image Self-Supervised Pre-Training

Unsupervised Deep Asymmetric Stereo Matching With Spatially-Adaptive Self-Similarity

No One Left Behind: Improving the Worst Categories in Long-Tailed Learning

FlatFormer: Flattened Window Attention for Efficient Point Cloud Transformer

Sample-Level Multi-View Graph Clustering

Unlearnable Clusters: Towards Label-Agnostic Unlearnable Examples

Multi-Label Compound Expression Recognition: C-EXPR Database & Network

Multi-Concept Customization of Text-to-Image Diffusion

Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction

Fine-Grained Image-Text Matching by Cross-Modal Hard Aligning Network

Parameter Efficient Local Implicit Image Function Network for Face Segmentation

Revisiting Reverse Distillation for Anomaly Detection

Perception-Oriented Single Image Super-Resolution Using Optimal Objective Estimation

VGFlow: Visibility Guided Flow Network for Human Reposing

Turning Strengths Into Weaknesses: A Certified Robustness Inspired Attack Framework Against Graph Neural Networks

Center Focusing Network for Real-Time LiDAR Panoptic Segmentation

Harmonious Teacher for Cross-Domain Object Detection

SelfME: Self-Supervised Motion Learning for Micro-Expression Recognition

Mask-Guided Matting in the Wild

Self-Positioning Point-Based Transformer for Point Cloud Understanding

Few-Shot Geometry-Aware Keypoint Localization

Instant Multi-View Head Capture Through Learnable Registration

Trade-Off Between Robustness and Accuracy of Vision Transformers

A Loopback Network for Explainable Microvascular Invasion Classification

Implicit Identity Leakage: The Stumbling Block to Improving Deepfake Detection Generalization

Joint Token Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers

KERM: Knowledge Enhanced Reasoning for Vision-and-Language Navigation

Image Quality-Aware Diagnosis via Meta-Knowledge Co-Embedding

Masked Jigsaw Puzzle: A Versatile Position Embedding for Vision Transformers

Search-Map-Search: A Frame Selection Paradigm for Action Recognition

DISC: Learning From Noisy Labels via Dynamic Instance-Specific Selection and Correction

Renderable Neural Radiance Map for Visual Navigation

Learning the Distribution of Errors in Stereo Matching for Joint Disparity and Uncertainty Estimation

Learning To Generate Image Embeddings With User-Level Differential Privacy

Persistent Nature: A Generative Model of Unbounded 3D Worlds

Improving Vision-and-Language Navigation by Generating Future-View Image Semantics

Galactic: Scaling End-to-End Reinforcement Learning for Rearrangement at 100k Steps-per-Second

Deep Semi-Supervised Metric Learning With Mixed Label Propagation

Unbiased Scene Graph Generation in Videos

Neural Rate Estimator and Unsupervised Learning for Efficient Distributed Image Analytics in Split-DNN Models

RealImpact: A Dataset of Impact Sound Fields for Real Objects

RMLVQA: A Margin Loss Approach for Visual Question Answering With Language Biases

Lookahead Diffusion Probabilistic Models for Refining Mean Estimation

Learning Articulated Shape With Keypoint Pseudo-Labels From Web Images

Modular Memorability: Tiered Representations for Video Memorability Prediction

Shifted Diffusion for Text-to-Image Generation

CLIP-S4: Language-Guided Self-Supervised Semantic Segmentation

Attribute-Preserving Face Dataset Anonymization via Latent Code Optimization

MetaViewer: Towards a Unified Multi-View Representation

Sequential Training of GANs Against GAN-Classifiers Reveals Correlated “Knowledge Gaps” Present Among Independently Trained GAN Instances

Improving Robustness of Semantic Segmentation to Motion-Blur Using Class-Centric Augmentation

Instance-Specific and Model-Adaptive Supervision for Semi-Supervised Semantic Segmentation

MotionDiffuser: Controllable Multi-Agent Motion Prediction Using Diffusion

Train-Once-for-All Personalization

Seasoning Model Soups for Robustness to Adversarial and Natural Distribution Shifts

You Do Not Need Additional Priors or Regularizers in Retinex-Based Low-Light Image Enhancement

Uncovering the Disentanglement Capability in Text-to-Image Diffusion Models

Semantic-Conditional Diffusion Networks for Image Captioning

Co-SLAM: Joint Coordinate and Sparse Parametric Encodings for Neural Real-Time SLAM

CAP: Robust Point Cloud Classification via Semantic and Structural Modeling

Jedi: Entropy-Based Localization and Removal of Adversarial Patches

Generating Features With Increased Crop-Related Diversity for Few-Shot Object Detection

iCLIP: Bridging Image Classification and Contrastive Language-Image Pre-Training for Visual Recognition

Coreset Sampling From Open-Set for Fine-Grained Self-Supervised Learning

Adaptive Data-Free Quantization

High-Frequency Stereo Matching Network

Masked Autoencoding Does Not Help Natural Language Supervision at Scale

Diffusion Art or Digital Forgery? Investigating Data Replication in Diffusion Models

Principles of Forgetting in Domain-Incremental Semantic Segmentation in Adverse Weather Conditions

Two-Way Multi-Label Loss

Transferable Adversarial Attacks on Vision Transformers With Token Gradient Regularization

Robust 3D Shape Classification via Non-Local Graph Attention Network

Single View Scene Scale Estimation Using Scale Field

Solving Relaxations of MAP-MRF Problems: Combinatorial In-Face Frank-Wolfe Directions

AUNet: Learning Relations Between Action Units for Face Forgery Detection

Learning a 3D Morphable Face Reflectance Model From Low-Cost Data

Frame-Event Alignment and Fusion Network for High Frame Rate Tracking

You Need Multiple Exiting: Dynamic Early Exiting for Accelerating Unified Vision Language Model

CLIP2: Contrastive Language-Image-Point Pretraining From Real-World Point Cloud Data

Learning To Generate Language-Supervised and Open-Vocabulary Scene Graph Using Pre-Trained Visual-Semantic Space

Introducing Competition To Boost the Transferability of Targeted Adversarial Examples Through Clean Feature Mixup

DINER: Disorder-Invariant Implicit Neural Representation

DualRefine: Self-Supervised Depth and Pose Estimation Through Iterative Epipolar Sampling and Refinement Toward Equilibrium

Manipulating Transfer Learning for Property Inference

Image as a Foreign Language: BEiT Pretraining for Vision and Vision-Language Tasks

MaskCon: Masked Contrastive Learning for Coarse-Labelled Dataset

Leveraging Inter-Rater Agreement for Classification in the Presence of Noisy Labels

Logical Implications for Visual Question Answering Consistency

Independent Component Alignment for Multi-Task Learning

Open Vocabulary Semantic Segmentation With Patch Aligned Contrastive Learning

MMG-Ego4D: Multimodal Generalization in Egocentric Action Recognition

Deep Deterministic Uncertainty: A New Simple Baseline

SViTT: Temporal Learning of Sparse Video-Text Transformers

ERNIE-ViLG 2.0: Improving Text-to-Image Diffusion Model With Knowledge-Enhanced Mixture-of-Denoising-Experts

Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image Person Retrieval

Open-Set Representation Learning Through Combinatorial Embedding

DSFNet: Dual Space Fusion Network for Occlusion-Robust 3D Dense Face Alignment

Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert

Hierarchical Semantic Contrast for Scene-Aware Video Anomaly Detection

HypLiLoc: Towards Effective LiDAR Pose Regression With Hyperbolic Fusion

Learning Orthogonal Prototypes for Generalized Few-Shot Semantic Segmentation

Local-Guided Global: Paired Similarity Representation for Visual Reinforcement Learning

Boosting Accuracy and Robustness of Student Models via Adaptive Adversarial Distillation

Hyperspherical Embedding for Point Cloud Completion

Efficient Hierarchical Entropy Model for Learned Point Cloud Compression

Improving the Transferability of Adversarial Samples by Path-Augmented Method

SIEDOB: Semantic Image Editing by Disentangling Object and Background

GradMA: A Gradient-Memory-Based Accelerated Federated Learning With Alleviated Catastrophic Forgetting

Few-Shot Class-Incremental Learning via Class-Aware Bilateral Distillation

Neural Lens Modeling

A Probabilistic Framework for Lifelong Test-Time Adaptation

ZBS: Zero-Shot Background Subtraction via Instance-Level Background Modeling and Foreground Selection

DeAR: Debiasing Vision-Language Models With Additive Residuals

Deep Depth Estimation From Thermal Image

3D GAN Inversion With Facial Symmetry Prior

You Are Catching My Attention: Are Vision Transformers Bad Learners Under Backdoor Attacks?

Generalization Matters: Loss Minima Flattening via Parameter Hybridization for Efficient Online Knowledge Distillation

BiasAdv: Bias-Adversarial Augmentation for Model Debiasing

PartMix: Regularization Strategy To Learn Part Discovery for Visible-Infrared Person Re-Identification

DART: Diversify-Aggregate-Repeat Training Improves Generalization of Neural Networks

Towards High-Quality and Efficient Video Super-Resolution via Spatial-Temporal Data Overfitting

Towards Practical Plug-and-Play Diffusion Models

PMR: Prototypical Modal Rebalance for Multimodal Learning

Decoupling Learning and Remembering: A Bilevel Memory Framework With Knowledge Projection for Task-Incremental Learning

Exemplar-FreeSOLO: Enhancing Unsupervised Instance Segmentation With Exemplars

Visual Query Tuning: Towards Effective Usage of Intermediate Representations for Parameter and Memory Efficient Transfer Learning

PEFAT: Boosting Semi-Supervised Medical Image Classification via Pseudo-Loss Estimation and Feature Adversarial Training

Privacy-Preserving Adversarial Facial Features

MAGVLT: Masked Generative Vision-and-Language Transformer

Deep Random Projector: Accelerated Deep Image Prior

BEV-Guided Multi-Modality Fusion for Driving Perception

Dealing With Cross-Task Class Discrimination in Online Continual Learning

Tree Instance Segmentation With Temporal Contour Graph

Rethinking Few-Shot Medical Segmentation: A Vector Quantization View

NeFII: Inverse Rendering for Reflectance Decomposition With Near-Field Indirect Illumination

Bridging the Gap Between Model Explanations in Partially Annotated Multi-Label Classification

SlowLiDAR: Increasing the Latency of LiDAR-Based Detection Using Adversarial Examples

Unsupervised Intrinsic Image Decomposition With LiDAR Intensity

RIATIG: Reliable and Imperceptible Adversarial Text-to-Image Generation With Natural Prompts

Single Image Backdoor Inversion via Robust Smoothed Classifiers

SynthVSR: Scaling Up Visual Speech Recognition With Synthetic Supervision

Train/Test-Time Adaptation With Retrieval

Hierarchical Fine-Grained Image Forgery Detection and Localization

MammalNet: A Large-Scale Video Benchmark for Mammal Recognition and Behavior Understanding

Towards Fast Adaptation of Pretrained Contrastive Models for Multi-Channel Video-Language Retrieval

Contrastive Mean Teacher for Domain Adaptive Object Detectors

TMO: Textured Mesh Acquisition of Objects With a Mobile Device by Using Differentiable Rendering

InstantAvatar: Learning Avatars From Monocular Video in 60 Seconds

Neural Volumetric Memory for Visual Locomotion Control

Efficient On-Device Training via Gradient Filtering

SmartBrush: Text and Shape Guided Object Inpainting With Diffusion Model

NIFF: Alleviating Forgetting in Generalized Few-Shot Object Detection via Neural Instance Feature Forging

Unpaired Image-to-Image Translation With Shortest Path Regularization

NeRFLight: Fast and Light Neural Radiance Fields Using a Shared Feature Grid

PEAL: Prior-Embedded Explicit Attention Learning for Low-Overlap Point Cloud Registration

Transforming Radiance Field With Lipschitz Network for Photorealistic 3D Scene Stylization

Test Time Adaptation With Regularized Loss for Weakly Supervised Salient Object Detection

Fantastic Breaks: A Dataset of Paired 3D Scans of Real-World Broken Objects and Their Complete Counterparts

InstructPix2Pix: Learning To Follow Image Editing Instructions

Cross-Domain 3D Hand Pose Estimation With Dual Modalities

Source-Free Video Domain Adaptation With Spatial-Temporal-Historical Consistency Learning

PIDNet: A Real-Time Semantic Segmentation Network Inspired by PID Controllers

SmallCap: Lightweight Image Captioning Prompted With Retrieval Augmentation

Hidden Gems: 4D Radar Scene Flow Learning Using Cross-Modal Supervision

Learning To Render Novel Views From Wide-Baseline Stereo Pairs

Neural Texture Synthesis With Guided Correspondence

AShapeFormer: Semantics-Guided Object-Level Active Shape Encoding for 3D Object Detection via Transformers

Robust Test-Time Adaptation in Dynamic Scenarios

AnchorFormer: Point Cloud Completion From Discriminative Nodes

Bias in Pruned Vision Models: In-Depth Analysis and Countermeasures

Transformer Scale Gate for Semantic Segmentation

AccelIR: Task-Aware Image Compression for Accelerating Neural Restoration

A Unified Spatial-Angular Structured Light for Single-View Acquisition of Shape and Reflectance

SCOTCH and SODA: A Transformer Video Shadow Detection Framework

MobileVOS: Real-Time Video Object Segmentation Contrastive Learning Meets Knowledge Distillation

Neuralizer: General Neuroimage Analysis Without Re-Training

MarginMatch: Improving Semi-Supervised Learning With Pseudo-Margins

Detecting Human-Object Contact in Images

Efficient Verification of Neural Networks Against LVM-Based Specifications

Recurrent Vision Transformers for Object Detection With Event Cameras

SLACK: Stable Learning of Augmentations With Cold-Start and KL Regularization

SCConv: Spatial and Channel Reconstruction Convolution for Feature Redundancy

Diversity-Measurable Anomaly Detection

Visual Localization Using Imperfect 3D Models From the Internet

LANA: A Language-Capable Navigator for Instruction Following and Generation

MAESTER: Masked Autoencoder Guided Segmentation at Pixel Resolution for Accurate, Self-Supervised Subcellular Structure Recognition

HOTNAS: Hierarchical Optimal Transport for Neural Architecture Search

Co-Training 2^L Submodels for Visual Recognition

Learning Rotation-Equivariant Features for Visual Correspondence

CFA: Class-Wise Calibrated Fair Adversarial Training

VectorFusion: Text-to-SVG by Abstracting Pixel-Based Diffusion Models

Hint-Aug: Drawing Hints From Foundation Vision Transformers Towards Boosted Few-Shot Parameter-Efficient Tuning

Fine-Grained Classification With Noisy Labels

Learning a Practical SDR-to-HDRTV Up-Conversion Using New Dataset and Degradation Models

BBDM: Image-to-Image Translation With Brownian Bridge Diffusion Models

Regularize Implicit Neural Representation by Itself

Multivariate, Multi-Frequency and Multimodal: Rethinking Graph Neural Networks for Emotion Recognition in Conversation

Elastic Aggregation for Federated Optimization

Learning a Deep Color Difference Metric for Photographic Images

Learning Debiased Representations via Conditional Attribute Interpolation

Cooperation or Competition: Avoiding Player Domination for Multi-Target Robustness via Adaptive Budgets

Re-GAN: Data-Efficient GANs Training via Architectural Reconfiguration

Patch-Mix Transformer for Unsupervised Domain Adaptation: A Game Perspective

CHMATCH: Contrastive Hierarchical Matching and Robust Adaptive Threshold Boosted Semi-Supervised Learning

Co-Speech Gesture Synthesis by Reinforcement Learning With Contrastive Pre-Trained Rewards

Deformable Mesh Transformer for 3D Human Mesh Recovery

WINNER: Weakly-Supervised hIerarchical decompositioN and aligNment for Spatio-tEmporal Video gRounding

Bit-Shrinking: Limiting Instantaneous Sharpness for Improving Post-Training Quantization

NÜWA-LIP: Language-Guided Image Inpainting With Defect-Free VQGAN

A Practical Upper Bound for the Worst-Case Attribution Deviations

Understanding Masked Image Modeling via Learning Occlusion Invariant Feature

Robust Generalization Against Photon-Limited Corruptions via Worst-Case Sharpness Minimization

Imitation Learning As State Matching via Differentiable Physics

Improving Generalization With Domain Convex Game

Learning To Generate Text-Grounded Mask for Open-World Semantic Segmentation From Only Image-Text Pairs

CLIP-Sculptor: Zero-Shot Generation of High-Fidelity and Diverse Shapes From Natural Language

Enhancing Multiple Reliability Measures via Nuisance-Extended Information Bottleneck

Where We Are and What We’re Looking At: Query Based Worldwide Image Geo-Localization Using Hierarchies and Scenes

Boosting Transductive Few-Shot Fine-Tuning With Margin-Based Uncertainty Weighting and Probability Regularization

How To Prevent the Poor Performance Clients for Personalized Federated Learning?

Generalist: Decoupling Natural and Robust Generalization

Sparsifiner: Learning Sparse Instance-Dependent Attention for Efficient Vision Transformers

FedDM: Iterative Distribution Matching for Communication-Efficient Federated Learning

From Images to Textual Prompts: Zero-Shot Visual Question Answering With Frozen Large Language Models

Architectural Backdoors in Neural Networks

CUDA: Convolution-Based Unlearnable Datasets

Simulated Annealing in Early Layers Leads to Better Generalization

Critical Learning Periods for Multisensory Integration in Deep Networks

Boosting Low-Data Instance Segmentation by Unsupervised Pre-Training With Saliency Prompt

Active Exploration of Multimodal Complementarity for Few-Shot Action Recognition

Foundation Model Drives Weakly Incremental Learning for Semantic Segmentation

Learning Neural Volumetric Representations of Dynamic Humans in Minutes

Frame Interpolation Transformer and Uncertainty Guidance

Interventional Bag Multi-Instance Learning on Whole-Slide Pathological Images

Enhanced Stable View Synthesis

Video Event Restoration Based on Keyframes for Video Anomaly Detection

Towards Transferable Targeted Adversarial Examples

Leverage Interactive Affinity for Affordance Learning

Interactive and Explainable Region-Guided Radiology Report Generation

PHA: Patch-Wise High-Frequency Augmentation for Transformer-Based Person Re-Identification

Towards Unbiased Volume Rendering of Neural Implicit Surfaces With Geometry Priors

MixNeRF: Modeling a Ray With Mixture Density for Novel View Synthesis From Sparse Inputs

StyleGene: Crossover and Mutation of Region-Level Facial Genes for Kinship Face Synthesis

Fair Scratch Tickets: Finding Fair Sparse Networks Without Weight Training

PermutoSDF: Fast Multi-View Reconstruction With Implicit Surfaces Using Permutohedral Lattices

PCR: Proxy-Based Contrastive Replay for Online Class-Incremental Continual Learning

Boundary-Aware Backward-Compatible Representation via Adversarial Learning in Image Retrieval

PointClustering: Unsupervised Point Cloud Pre-Training Using Transformation Invariance in Clustering

Beyond Attentive Tokens: Incorporating Token Importance and Diversity for Efficient Vision Transformers

Deep Polarization Reconstruction With PDAVIS Events

NeuDA: Neural Deformable Anchor for High-Fidelity Implicit Surface Reconstruction

PDPP: Projected Diffusion for Procedure Planning in Instructional Videos

CP3: Channel Pruning Plug-In for Point-Based Networks

ERM-KTP: Knowledge-Level Machine Unlearning via Knowledge Transfer

Few-Shot Semantic Image Synthesis With Class Affinity Transfer

Differentiable Architecture Search With Random Features

GKEAL: Gaussian Kernel Embedded Analytic Learning for Few-Shot Class Incremental Task

Extracting Class Activation Maps From Non-Discriminative Features As Well

A Simple Framework for Text-Supervised Semantic Segmentation

Region-Aware Pretraining for Open-Vocabulary Object Detection With Vision Transformers

Can’t Steal? Cont-Steal! Contrastive Stealing Attacks Against Image Encoders

Distilling Cross-Temporal Contexts for Continuous Sign Language Recognition

sRGB Real Noise Synthesizing With Neighboring Correlation-Aware Noise Model

Streaming Video Model

Orthogonal Annotation Benefits Barely-Supervised Medical Image Segmentation

PanoSwin: A Pano-Style Swin Transformer for Panorama Understanding

All Are Worth Words: A ViT Backbone for Diffusion Models

CVT-SLR: Contrastive Visual-Textual Transformation for Sign Language Recognition With Variational Alignment

Language Adaptive Weight Generation for Multi-Task Visual Grounding

VideoMAE V2: Scaling Video Masked Autoencoders With Dual Masking

GeoLayoutLM: Geometric Pre-Training for Visual Information Extraction

Align Your Latents: High-Resolution Video Synthesis With Latent Diffusion Models

Modeling Entities As Semantic Points for Visual Information Extraction in the Wild

Single Image Depth Prediction Made Better: A Multivariate Gaussian Take

DaFKD: Domain-Aware Federated Knowledge Distillation

Re-IQA: Unsupervised Learning for Image Quality Assessment in the Wild

Revisiting the P3P Problem

Shape-Erased Feature Learning for Visible-Infrared Person Re-Identification

DropKey for Vision Transformer

BiCro: Noisy Correspondence Rectification for Multi-Modality Data via Bi-Directional Cross-Modal Similarity Consistency

DA Wand: Distortion-Aware Selection Using Neural Mesh Parameterization

Similarity Maps for Self-Training Weakly-Supervised Phrase Grounding

Unsupervised Visible-Infrared Person Re-Identification via Progressive Graph Matching and Alternate Learning

GCFAgg: Global and Cross-View Feature Aggregation for Multi-View Clustering

An Actor-Centric Causality Graph for Asynchronous Temporal Inference in Group Activity

À-La-Carte Prompt Tuning (APT): Combining Distinct Data via Composable Prompting

Equiangular Basis Vectors

Q: How To Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images!

Hybrid Active Learning via Deep Clustering for Video Action Detection

Constrained Evolutionary Diffusion Filter for Monocular Endoscope Tracking

MMVC: Learned Multi-Mode Video Compression With Block-Based Prediction Mode Selection and Density-Adaptive Entropy Coding

Extracting Motion and Appearance via Inter-Frame Attention for Efficient Video Frame Interpolation

Improving Commonsense in Vision-Language Models via Knowledge Graph Riddles

Adaptive Graph Convolutional Subspace Clustering

LightedDepth: Video Depth Estimation in Light of Limited Inference View Angles

Two-Stream Networks for Weakly-Supervised Temporal Action Localization With Semantic-Aware Mechanisms

OpenMix: Exploring Outlier Samples for Misclassification Detection

DyLiN: Making Light Field Networks Dynamic

ProphNet: Efficient Agent-Centric Motion Forecasting With Anchor-Informed Proposals

Meta Architecture for Point Cloud Analysis

Generating Anomalies for Video Anomaly Detection With Prompt-Based Feature Mapping

RWSC-Fusion: Region-Wise Style-Controlled Fusion Network for the Prohibited X-Ray Security Image Synthesis

Robust Outlier Rejection for 3D Registration With Variational Bayes

Architecture, Dataset and Model-Scale Agnostic Data-Free Meta-Learning

BKinD-3D: Self-Supervised 3D Keypoint Discovery From Multi-View Videos

Dual-Path Adaptation From Image to Video Transformers

RA-CLIP: Retrieval Augmented Contrastive Language-Image Pre-Training

Meta-Learning With a Geometry-Adaptive Preconditioner

Passive Micron-Scale Time-of-Flight With Sunlight Interferometry

Swept-Angle Synthetic Wavelength Interferometry

Indescribable Multi-Modal Spatial Evaluator

Abstract Visual Reasoning: An Algebraic Approach for Solving Raven’s Progressive Matrices

Decoupling Human and Camera Motion From Videos in the Wild

Unifying Vision, Text, and Layout for Universal Document Processing

Flow Supervision for Deformable NeRF

Learning From Noisy Labels With Decoupled Meta Label Purifier

Multi-Sensor Large-Scale Dataset for Multi-View 3D Reconstruction

OmniAvatar: Geometry-Guided Controllable 3D Head Synthesis

Latency Matters: Real-Time Action Forecasting Transformer

ViTs for SITS: Vision Transformers for Satellite Image Time Series

Learning Joint Latent Space EBM Prior Model for Multi-Layer Generator

Efficient Map Sparsification Based on 2D and 3D Discretized Grids

LVQAC: Lattice Vector Quantization Coupled With Spatially Adaptive Companding for Efficient Learned Image Compression

Pose-Disentangled Contrastive Learning for Self-Supervised Facial Representation

Probabilistic Knowledge Distillation of Face Ensembles

Multi-Modal Gait Recognition via Effective Spatial-Temporal Feature Fusion

DKT: Diverse Knowledge Transfer Transformer for Class Incremental Learning

Kernel Aware Resampler

Document Image Shadow Removal Guided by Color-Aware Background

Temporal Consistent 3D LiDAR Representation Learning for Semantic Perception in Autonomous Driving

EMT-NAS: Transferring Architectural Knowledge Between Tasks From Different Datasets

CompletionFormer: Depth Completion With Convolutions and Vision Transformers

Discovering the Real Association: Multimodal Causal Reasoning in Video Question Answering

Re-Thinking Federated Active Learning Based on Inter-Class Diversity

Physical-World Optical Adversarial Attacks on 3D Face Recognition

DATE: Domain Adaptive Product Seeker for E-Commerce

Trap Attention: Monocular Depth Estimation With Manual Traps

Integral Neural Networks

Full or Weak Annotations? An Adaptive Strategy for Budget-Constrained Annotation Campaigns

Solving Oscillation Problem in Post-Training Quantization Through a Theoretical Perspective

Switchable Representation Learning Framework With Self-Compatibility

Neural Fourier Filter Bank

Exploring Data Geometry for Continual Learning

QPGesture: Quantization-Based and Phase-Guided Motion Matching for Natural Speech-Driven Gesture Generation

Learning Neural Duplex Radiance Fields for Real-Time View Synthesis

FlowGrad: Controlling the Output of Generative ODEs With Gradients

PointVector: A Vector Representation in Point Cloud Analysis

Data-Driven Feature Tracking for Event Cameras

Deep Curvilinear Editing: Commutative and Nonlinear Image Manipulation for Pretrained Deep Generative Model

ContraNeRF: Generalizable Neural Radiance Fields for Synthetic-to-Real Novel View Synthesis via Contrastive Learning

Multi-Agent Automated Machine Learning

Inversion-Based Style Transfer With Diffusion Models

Computational Flash Photography Through Intrinsics

Range-Nullspace Video Frame Interpolation With Focalized Motion Estimation

Robust and Scalable Gaussian Process Regression and Its Applications

OmniVidar: Omnidirectional Depth Estimation From Multi-Fisheye Images

Semi-Weakly Supervised Object Kinematic Motion Prediction

VNE: An Effective Method for Improving Deep Representation by Manipulating Eigenvalue Distribution

Task-Specific Fine-Tuning via Variational Information Bottleneck for Weakly-Supervised Pathology Whole Slide Image Classification

DynamicDet: A Unified Dynamic Architecture for Object Detection

Toward Stable, Interpretable, and Lightweight Hyperspectral Super-Resolution

Optimal Proposal Learning for Deployable End-to-End Pedestrian Detection

IDGI: A Framework To Eliminate Explanation Noise From Integrated Gradients

DPE: Disentanglement of Pose and Expression for General Video Portrait Editing

Fast Monocular Scene Reconstruction With Global-Sparse Local-Dense Grids

Bi-Level Meta-Learning for Few-Shot Domain Generalization

Class-Balancing Diffusion Models

Difficulty-Based Sampling for Debiased Contrastive Representation Learning

The Treasure Beneath Multiple Annotations: An Uncertainty-Aware Edge Detector

Towards Trustable Skin Cancer Diagnosis via Rewriting Model’s Decision

DCFace: Synthetic Face Generation With Dual Condition Diffusion Model

Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks

Frame Flexible Network

Visual Dependency Transformers: Dependency Tree Emerges From Reversed Attention

Sparsely Annotated Semantic Segmentation With Adaptive Gaussian Mixtures

Diverse 3D Hand Gesture Prediction From Body Dynamics by Bilateral Hand Disentanglement

STAR Loss: Reducing Semantic Ambiguity in Facial Landmark Detection

Spatial-Frequency Mutual Learning for Face Super-Resolution

Inverse Rendering of Translucent Objects Using Physical and Neural Renderers

Bi3D: Bi-Domain Active Learning for Cross-Domain 3D Object Detection

MOT: Masked Optimal Transport for Partial Domain Adaptation

Spatio-Temporal Pixel-Level Contrastive Learning-Based Source-Free Domain Adaptation for Video Semantic Segmentation

Rethinking Federated Learning With Domain Shift: A Prototype View

Learning Attribute and Class-Specific Representation Duet for Fine-Grained Fashion Analysis

NaQ: Leveraging Narrations As Queries To Supervise Episodic Memory

Ensemble-Based Blackbox Attacks on Dense Prediction

Implicit Neural Head Synthesis via Controllable Local Deformation Fields

Realistic Saliency Guided Image Enhancement

CIRCLE: Capture in Rich Contextual Environments

Reducing the Label Bias for Timestamp Supervised Temporal Action Segmentation

Modality-Agnostic Debiasing for Single Domain Generalization

Learning Action Changes by Measuring Verb-Adverb Textual Relationships

DeGPR: Deep Guided Posterior Regularization for Multi-Class Cell Detection and Counting

Seeing What You Miss: Vision-Language Pre-Training With Semantic Completion Learning

On the Importance of Accurate Geometry Data for Dense 3D Vision Tasks

Learning To Name Classes for Vision and Language Models

NoisyQuant: Noisy Bias-Enhanced Post-Training Activation Quantization for Vision Transformers

Comprehensive and Delicate: An Efficient Transformer for Image Restoration

MoStGAN-V: Video Generation With Temporal Motion Styles

Weakly Supervised Class-Agnostic Motion Prediction for Autonomous Driving

Avatars Grow Legs: Generating Smooth Human Motion From Sparse Tracking Inputs With Diffusion Model

Referring Image Matting

Rethinking Image Super Resolution From Long-Tailed Distribution Learning Perspective

DiffusioNeRF: Regularizing Neural Radiance Fields With Denoising Diffusion Models

Self-Supervised Super-Plane for Neural 3D Reconstruction

Implicit Surface Contrastive Clustering for LiDAR Point Clouds

BlendFields: Few-Shot Example-Driven Facial Modeling

Fast Point Cloud Generation With Straight Flows

Leveraging Hidden Positives for Unsupervised Semantic Segmentation

BundleSDF: Neural 6-DoF Tracking and 3D Reconstruction of Unknown Objects

PLIKS: A Pseudo-Linear Inverse Kinematic Solver for 3D Human Body Estimation

Test of Time: Instilling Video-Language Models With a Sense of Time

Incrementer: Transformer for Class-Incremental Semantic Segmentation With Knowledge Distillation Focusing on Old Class

vMAP: Vectorised Object Mapping for Neural Field SLAM

POEM: Reconstructing Hand in a Point Embedded Multi-View Stereo

NeRF-Supervised Deep Stereo

High-Fidelity 3D Face Generation From Natural Language Descriptions

Spatial-Then-Temporal Self-Supervised Learning for Video Correspondence

Adaptive Plasticity Improvement for Continual Learning

Query-Dependent Video Representation for Moment Retrieval and Highlight Detection

CIGAR: Cross-Modality Graph Reasoning for Domain Adaptive Object Detection

MHPL: Minimum Happy Points Learning for Active Source Free Domain Adaptation

Defining and Quantifying the Emergence of Sparse Concepts in DNNs

LiDAR-in-the-Loop Hyperparameter Optimization

Revisiting Rotation Averaging: Uncertainties and Robust Losses

Mod-Squad: Designing Mixtures of Experts As Modular Multi-Task Learners

A Probabilistic Attention Model With Occlusion-Aware Texture Regression for 3D Hand Reconstruction From a Single RGB Image

Accelerated Coordinate Encoding: Learning To Relocalize in Minutes Using RGB and Poses

Label Information Bottleneck for Label Enhancement

MOSO: Decomposing MOtion, Scene and Object for Video Prediction

Edges to Shapes to Concepts: Adversarial Augmentation for Robust Vision

EVAL: Explainable Video Anomaly Localization

Grounding Counterfactual Explanation of Image Classifiers to Textual Concept Space

CAPE: Camera View Position Embedding for Multi-View 3D Object Detection

VLPD: Context-Aware Pedestrian Detection via Vision-Language Semantic Self-Supervision

Noisy Correspondence Learning With Meta Similarity Correction

Robust Multiview Point Cloud Registration With Reliable Pose Graph Initialization and History Reweighting

MeMaHand: Exploiting Mesh-Mano Interaction for Single Image Two-Hand Reconstruction

Towards Effective Adversarial Textured 3D Meshes on Physical Face Recognition

BiFormer: Learning Bilateral Motion Estimation via Bilateral Transformer for 4K Video Frame Interpolation

Memory-Friendly Scalable Super-Resolution via Rewinding Lottery Ticket Hypothesis

SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation

High-Fidelity and Freely Controllable Talking Head Video Generation

On the Stability-Plasticity Dilemma of Class-Incremental Learning

Multilateral Semantic Relations Modeling for Image Text Retrieval

Practical Network Acceleration With Tiny Sets

Learning To Exploit the Sequence-Specific Prior Knowledge for Image Processing Pipelines Optimization

On the Pitfall of Mixup for Uncertainty Calibration

Few-Shot Non-Line-of-Sight Imaging With Signal-Surface Collaborative Regularization

Differentiable Shadow Mapping for Efficient Inverse Graphics

FFHQ-UV: Normalized Facial UV-Texture Dataset for 3D Face Reconstruction

Texture-Guided Saliency Distilling for Unsupervised Salient Object Detection

PointDistiller: Structured Knowledge Distillation Towards Efficient and Compact 3D Detection

Matching Is Not Enough: A Two-Stage Framework for Category-Agnostic Pose Estimation

Sphere-Guided Training of Neural Implicit Surfaces

Color Backdoor: A Robust Poisoning Attack in Color Space

Explicit Visual Prompting for Low-Level Structure Segmentations

VQACL: A Novel Visual Question Answering Continual Learning Setting

Non-Line-of-Sight Imaging With Signal Superresolution Network

Unsupervised Contour Tracking of Live Cells by Mechanical and Cycle Consistency Losses

Context-Based Trit-Plane Coding for Progressive Image Compression

Adaptive Sparse Convolutional Networks With Global Context Enhancement for Faster Object Detection on Drone Images

Deep Frequency Filtering for Domain Generalization

Self-Supervised AutoFlow

ScarceNet: Animal Pose Estimation With Scarce Annotations

MELTR: Meta Loss Transformer for Learning To Fine-Tune Video Foundation Models

MixPHM: Redundancy-Aware Parameter-Efficient Tuning for Low-Resource Visual Question Answering

Crossing the Gap: Domain Generalization for Image Captioning

Semantic Ray: Learning a Generalizable Semantic Field With Cross-Reprojection Attention

Generalized UAV Object Detection via Frequency Domain Disentanglement

Text With Knowledge Graph Augmented Transformer for Video Captioning

StyleAdv: Meta Style Adversarial Training for Cross-Domain Few-Shot Learning

FlowFormer++: Masked Cost Volume Autoencoding for Pretraining Optical Flow Estimation

Physically Adversarial Infrared Patches With Learnable Shapes and Locations

Multi-Level Logit Distillation

TriDet: Temporal Action Detection With Relative Boundary Modeling

Dimensionality-Varying Diffusion Process

Fast Contextual Scene Graph Generation With Unbiased Context Augmentation

Uni3D: A Unified Baseline for Multi-Dataset 3D Object Detection

CapDet: Unifying Dense Captioning and Open-World Detection Pretraining

PillarNeXt: Rethinking Network Designs for 3D Object Detection in LiDAR Point Clouds

Conditional Text Image Generation With Diffusion Models

Compacting Binary Neural Networks by Sparse Kernel Selection

A General Regret Bound of Preconditioned Gradient Method for DNN Training

Open-Category Human-Object Interaction Pre-Training via Language Modeling Framework

Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding

Neural Video Compression With Diverse Contexts

Controllable Mesh Generation Through Sparse Latent Point Diffusion Models

Balanced Energy Regularization Loss for Out-of-Distribution Detection

Private Image Generation With Dual-Purpose Auxiliary Classifier

Pixels, Regions, and Objects: Multiple Enhancement for Salient Object Detection

AdaptiveMix: Improving GAN Training via Feature Space Shrinkage

CiaoSR: Continuous Implicit Attention-in-Attention Network for Arbitrary-Scale Image Super-Resolution

Black-Box Sparse Adversarial Attack via Multi-Objective Optimisation

Bidirectional Cross-Modal Knowledge Exploration for Video Recognition With Pre-Trained Vision-Language Models

Person Image Synthesis via Denoising Diffusion Model

Policy Adaptation From Foundation Model Feedback

Augmentation Matters: A Simple-Yet-Effective Approach to Semi-Supervised Semantic Segmentation

Neural Fields Meet Explicit Geometric Representations for Inverse Rendering of Urban Scenes

Masked Video Distillation: Rethinking Masked Feature Modeling for Self-Supervised Video Representation Learning

Learning Dynamic Style Kernels for Artistic Style Transfer

Robust Unsupervised StyleGAN Image Restoration

Localized Semantic Feature Mixers for Efficient Pedestrian Detection in Autonomous Driving

Conflict-Based Cross-View Consistency for Semi-Supervised Semantic Segmentation

Learning To Detect Mirrors From Videos via Dual Correspondences

Cross-Domain Image Captioning With Discriminative Finetuning

SfM-TTR: Using Structure From Motion for Test-Time Refinement of Single-View Depth Networks

TINC: Tree-Structured Implicit Neural Compression

Lite-Mono: A Lightweight CNN and Transformer Architecture for Self-Supervised Monocular Depth Estimation

Language in a Bottle: Language Model Guided Concept Bottlenecks for Interpretable Image Classification

TBP-Former: Learning Temporal Bird’s-Eye-View Pyramid for Joint Perception and Prediction in Vision-Centric Autonomous Driving

MAP: Multimodal Uncertainty-Aware Vision-Language Pre-Training Model

A Hierarchical Representation Network for Accurate and Detailed Face Reconstruction From In-the-Wild Images

HS-Pose: Hybrid Scope Feature Extraction for Category-Level Object Pose Estimation

PAniC-3D: Stylized Single-View 3D Reconstruction From Portraits of Anime Characters

SGLoc: Scene Geometry Encoding for Outdoor LiDAR Localization

SuperDisco: Super-Class Discovery Improves Visual Recognition for the Long-Tail

Resource-Efficient RGBD Aerial Tracking

Density-Insensitive Unsupervised Domain Adaption on 3D Object Detection

Neural Transformation Fields for Arbitrary-Styled Font Generation

Physics-Driven Diffusion Models for Impact Sound Synthesis From Videos

Ham2Pose: Animating Sign Language Notation Into Pose Sequences

Towards Modality-Agnostic Person Re-Identification With Descriptive Query

Backdoor Attacks Against Deep Image Compression via Adaptive Frequency Trigger

Master: Meta Style Transformer for Controllable Zero-Shot and Few-Shot Artistic Style Transfer

Hard Sample Matters a Lot in Zero-Shot Quantization

Constructing Deep Spiking Neural Networks From Artificial Neural Networks With Knowledge Distillation

Class Attention Transfer Based Knowledge Distillation

Global and Local Mixture Consistency Cumulative Learning for Long-Tailed Visual Recognitions

Egocentric Video Task Translation

3D Highlighter: Localizing Regions on 3D Shapes via Text Descriptions

High Fidelity 3D Hand Shape Reconstruction via Scalable Graph Frequency Decomposition

Look, Radiate, and Learn: Self-Supervised Localisation via Radio-Visual Correspondence

Instance Relation Graph Guided Source-Free Domain Adaptive Object Detection

Dynamic Conceptional Contrastive Learning for Generalized Category Discovery

Local 3D Editing via 3D Distillation of CLIP Knowledge

EvShutter: Transforming Events for Unconstrained Rolling Shutter Correction

Learning Personalized High Quality Volumetric Head Avatars From Monocular RGB Videos

HGNet: Learning Hierarchical Geometry From Points, Edges, and Surfaces

DINER: Depth-Aware Image-Based NEural Radiance Fields

A New Comprehensive Benchmark for Semi-Supervised Video Anomaly Detection and Anticipation

Multi-Modal Representation Learning With Text-Driven Soft Masks

Visual Exemplar Driven Task-Prompting for Unified Perception in Autonomous Driving

Linking Garment With Person via Semantically Associated Landmarks for Virtual Try-On

2PCNet: Two-Phase Consistency Training for Day-to-Night Unsupervised Domain Adaptive Object Detection

Discriminating Known From Unknown Objects via Structure-Enhanced Recurrent Variational AutoEncoder

Generative Diffusion Prior for Unified Image Restoration and Enhancement

OmniAL: A Unified CNN Framework for Unsupervised Anomaly Localization

Revisiting the Stack-Based Inverse Tone Mapping

Towards Realistic Long-Tailed Semi-Supervised Learning: Consistency Is All You Need

Lift3D: Synthesize 3D Training Data by Lifting 2D GAN to 3D Generative Radiance Field

AeDet: Azimuth-Invariant Multi-View 3D Object Detection

HyperMatch: Noise-Tolerant Semi-Supervised Learning via Relaxed Contrastive Constraint

Feature Alignment and Uniformity for Test Time Adaptation

Unifying Layout Generation With a Decoupled Diffusion Model

Good Is Bad: Causality Inspired Cloth-Debiasing for Cloth-Changing Person Re-Identification

ViLEM: Visual-Language Error Modeling for Image-Text Retrieval

Semi-Supervised 2D Human Pose Estimation Driven by Position Inconsistency Pseudo Label Correction Module

Multiplicative Fourier Level of Detail

Temporal Attention Unit: Towards Efficient Spatiotemporal Predictive Learning

Run, Don’t Walk: Chasing Higher FLOPS for Faster Neural Networks

Open-Set Fine-Grained Retrieval via Prompting Vision-Language Evaluator

SteerNeRF: Accelerating NeRF Rendering via Smooth Viewpoint Trajectory

Collaborative Noisy Label Cleaner: Learning Scene-Aware Trailers for Multi-Modal Highlight Detection in Movies

FreeNeRF: Improving Few-Shot Neural Rendering With Free Frequency Regularization

MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation

Hyperbolic Contrastive Learning for Visual Representations Beyond Objects

MotionTrack: Learning Robust Short-Term and Long-Term Motions for Multi-Object Tracking

CLIP Is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation

B-Spline Texture Coefficients Estimator for Screen Content Image Super-Resolution

Multimodal Industrial Anomaly Detection via Hybrid Fusion

GD-MAE: Generative Decoder for MAE Pre-Training on LiDAR Point Clouds

Exploring Motion Ambiguity and Alignment for High-Quality Video Frame Interpolation

RONO: Robust Discriminative Learning With Noisy Labels for 2D-3D Cross-Modal Retrieval

ANetQA: A Large-Scale Benchmark for Fine-Grained Compositional Reasoning Over Untrimmed Videos

Parts2Words: Learning Joint Embedding of Point Clouds and Texts by Bidirectional Matching Between Parts and Words

Efficient Second-Order Plane Adjustment

Deep Hashing With Minimal-Distance-Separated Hash Centers

RefCLIP: A Universal Teacher for Weakly Supervised Referring Expression Comprehension

Adaptive Assignment for Geometry Aware Local Feature Matching

ConZIC: Controllable Zero-Shot Image Captioning by Sampling-Based Polishing

Curricular Object Manipulation in LiDAR-Based Object Detection

Fully Self-Supervised Depth Estimation From Defocus Clue

Post-Training Quantization on Diffusion Models

Rebalancing Batch Normalization for Exemplar-Based Class-Incremental Learning

Latent-NeRF for Shape-Guided Generation of 3D Shapes and Textures

HGFormer: Hierarchical Grouping Transformer for Domain Generalized Semantic Segmentation

Improving Robust Generalization by Direct PAC-Bayesian Bound Minimization

AssemblyHands: Towards Egocentric Activity Understanding via 3D Hand Pose Estimation

EXIF As Language: Learning Cross-Modal Associations Between Images and Camera Metadata

High-Res Facial Appearance Capture From Polarized Smartphone Images

Feature Aggregated Queries for Transformer-Based Video Object Detectors

Ambiguous Medical Image Segmentation Using Diffusion Models

Twin Contrastive Learning With Noisy Labels

Partial Network Cloning

Zero-Shot Text-to-Parameter Translation for Game Character Auto-Creation

Understanding Deep Generative Models With Generalized Empirical Likelihoods

Spectral Enhanced Rectangle Transformer for Hyperspectral Image Denoising

InternImage: Exploring Large-Scale Vision Foundation Models With Deformable Convolutions

Adaptive Annealing for Robust Geometric Estimation

Self-Supervised 3D Scene Flow Estimation Guided by Superpoints

Learning Optical Expansion From Scale Matching

Watch or Listen: Robust Audio-Visual Speech Recognition With Visual Corruption Modeling and Reliability Scoring

Grid-Guided Neural Radiance Fields for Large Urban Scenes

SpaText: Spatio-Textual Representation for Controllable Image Generation

Local Implicit Ray Function for Generalizable Radiance Field Representation

Generalized Deep 3D Shape Prior via Part-Discretized Diffusion Process

Weakly Supervised Video Representation Learning With Unaligned Text for Sequential Videos

The Best Defense Is a Good Offense: Adversarial Augmentation Against Adversarial Attacks

Handy: Towards a High Fidelity 3D Hand Shape and Appearance Model

Learning a Depth Covariance Function

TokenHPE: Learning Orientation Tokens for Efficient Head Pose Estimation via Transformers

DiffRF: Rendering-Guided 3D Radiance Field Diffusion

Clothing-Change Feature Augmentation for Person Re-Identification

Learnable Skeleton-Aware 3D Point Cloud Sampling

TrojDiff: Trojan Attacks on Diffusion Models With Diverse Targets

DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-Training via Word-Region Alignment

Modeling Inter-Class and Intra-Class Constraints in Novel Class Discovery

GarmentTracking: Category-Level Garment Pose Tracking

Benchmarking Robustness of 3D Object Detection to Common Corruptions

Generic-to-Specific Distillation of Masked Autoencoders

Dynamic Focus-Aware Positional Queries for Semantic Segmentation

Physics-Guided ISO-Dependent Sensor Noise Modeling for Extreme Low-Light Photography

Compositor: Bottom-Up Clustering and Compositing for Robust Part and Object Segmentation

Visual-Language Prompt Tuning With Knowledge-Guided Context Optimization

Context-Aware Pretraining for Efficient Blind Image Decomposition

VecFontSDF: Learning To Reconstruct and Synthesize High-Quality Vector Fonts via Signed Distance Functions

Towards Better Gradient Consistency for Neural Signed Distance Functions via Level Set Alignment

PACO: Parts and Attributes of Common Objects

Are Binary Annotations Sufficient? Video Moment Retrieval via Hierarchical Uncertainty-Based Active Learning

MetaMix: Towards Corruption-Robust Continual Learning With Temporally Self-Adaptive Data Transformation

UniHCP: A Unified Model for Human-Centric Perceptions

Learning To Zoom and Unzoom

Improving Weakly Supervised Temporal Action Localization by Bridging Train-Test Gap in Pseudo Labels

Implicit Identity Driven Deepfake Face Swapping Detection

Prototypical Residual Networks for Anomaly Detection and Localization

Bridging Search Region Interaction With Template for RGB-T Tracking

COT: Unsupervised Domain Adaptation With Clustering and Optimal Transport

Tangentially Elongated Gaussian Belief Propagation for Event-Based Incremental Optical Flow Estimation

HuManiFlow: Ancestor-Conditioned Normalising Flows on SO(3) Manifolds for Human Pose and Shape Distribution Estimation

Photo Pre-Training, but for Sketch

PREIM3D: 3D Consistent Precise Image Attribute Editing From a Single Image

ConQueR: Query Contrast Voxel-DETR for 3D Object Detection

OcTr: Octree-Based Transformer for 3D Object Detection

RILS: Masked Visual Reconstruction in Language Semantic Space

Image Cropping With Spatial-Aware Feature and Rank Consistency

Towards Benchmarking and Assessing Visual Naturalness of Physical World Adversarial Attacks

DropMAE: Masked Autoencoders With Spatial-Attention Dropout for Tracking Tasks

Delving Into Discrete Normalizing Flows on SO(3) Manifold for Probabilistic Rotation Modeling

3D-Aware Object Goal Navigation via Simultaneous Exploration and Identification

GeoNet: Benchmarking Unsupervised Adaptation Across Geographies

Joint HDR Denoising and Fusion: A Real-World Mobile HDR Image Dataset

PATS: Patch Area Transportation With Subdivision for Local Feature Matching

SINE: Semantic-Driven Image-Based NeRF Editing With Prior-Guided Editing Field

Mutual Information-Based Temporal Difference Learning for Human Pose Estimation in Video

Learning To Detect and Segment for Open Vocabulary Object Detection

Hierarchical Supervision and Shuffle Data Augmentation for 3D Semi-Supervised Object Detection

Dynamic Graph Learning With Content-Guided Spatial-Frequency Relation Reasoning for Deepfake Detection

OSAN: A One-Stage Alignment Network To Unify Multimodal Alignment and Unsupervised Domain Adaptation

Ultrahigh Resolution Image/Video Matting With Spatio-Temporal Sparsity

MobileBrick: Building LEGO for 3D Reconstruction on Mobile Devices

WeatherStream: Light Transport Automation of Single Image Deweathering

Normal-Guided Garment UV Prediction for Human Re-Texturing

Depth Estimation From Camera Image and mmWave Radar Point Cloud

RIAV-MVS: Recurrent-Indexing an Asymmetric Volume for Multi-View Stereo

pCON: Polarimetric Coordinate Networks for Neural Scene Representations

Deep Factorized Metric Learning

Improving Image Recognition by Retrieving From Web-Scale Image-Text Data

Video-Text As Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning

Semantic Prompt for Few-Shot Image Recognition

SVFormer: Semi-Supervised Video Transformer for Action Recognition

Exploring the Relationship Between Architectural Design and Adversarially Robust Generalization

Self-Supervised Learning for Multimodal Non-Rigid 3D Shape Matching

Federated Incremental Semantic Segmentation

Revisiting Prototypical Network for Cross Domain Few-Shot Learning

Task Difficulty Aware Parameter Allocation & Regularization for Lifelong Learning

Two-View Geometry Scoring Without Correspondences

AltFreezing for More General Video Face Forgery Detection

CAP-VSTNet: Content Affinity Preserved Versatile Style Transfer

Uncurated Image-Text Datasets: Shedding Light on Demographic Bias

Decompose More and Aggregate Better: Two Closer Looks at Frequency Representation Learning for Human Motion Prediction

Motion Information Propagation for Neural Video Compression

Hierarchical Temporal Transformer for 3D Hand Pose Estimation and Action Recognition From Egocentric RGB Videos

Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?

GFIE: A Dataset and Baseline for Gaze-Following From 2D to 3D in Indoor Environments

Complete-to-Partial 4D Distillation for Self-Supervised Point Cloud Sequence Representation Learning

NeuralPCI: Spatio-Temporal Neural Field for 3D Point Cloud Multi-Frame Non-Linear Interpolation

LayoutFormer++: Conditional Graphic Layout Generation via Constraint Serialization and Decoding Space Restriction

Model-Agnostic Gender Debiased Image Captioning

Bootstrapping Objectness From Videos by Relaxed Common Fate and Visual Grouping

Towards a Smaller Student: Capacity Dynamic Distillation for Efficient Image Retrieval

Delving Into Shape-Aware Zero-Shot Semantic Segmentation

Cascade Evidential Learning for Open-World Weakly-Supervised Temporal Action Localization

NeuralEditor: Editing Neural Radiance Fields via Manipulating Point Clouds

ProtoCon: Pseudo-Label Refinement via Online Clustering and Prototypical Consistency for Efficient Semi-Supervised Learning

Mixed Autoencoder for Self-Supervised Visual Representation Learning

DPF: Learning Dense Prediction Fields With Weak Supervision

MISC210K: A Large-Scale Dataset for Multi-Instance Semantic Correspondence

Similarity Metric Learning for RGB-Infrared Group Re-Identification

Exploring Discontinuity for Video Frame Interpolation

GradICON: Approximate Diffeomorphisms via Gradient Inverse Consistency

DNeRV: Modeling Inherent Dynamics via Difference Neural Representation for Videos

Polynomial Implicit Neural Representations for Large Diverse Datasets

Towards Better Decision Forests: Forest Alternating Optimization

CrOC: Cross-View Online Clustering for Dense Visual Representation Learning

Shape, Pose, and Appearance From a Single Image via Bootstrapped Radiance Field Inversion

Continuous Pseudo-Label Rectified Domain Adaptive Semantic Segmentation With Implicit Neural Representations

Target-Referenced Reactive Grasping for Dynamic Objects

ReVISE: Self-Supervised Speech Resynthesis With Visual Input for Universal and Generalized Speech Regeneration

Structured Sparsity Learning for Efficient Video Super-Resolution

Non-Contrastive Unsupervised Learning of Physiological Signals From Video

Weakly-Supervised Single-View Image Relighting

Learning Audio-Visual Source Localization via False Negative Aware Contrastive Learning

End-to-End Video Matting With Trimap Propagation

Human Body Shape Completion With Implicit Shape and Flow Learning

TinyMIM: An Empirical Study of Distilling MIM Pre-Trained Models

Plateau-Reduced Differentiable Path Tracing

Computationally Budgeted Continual Learning: What Does Matter?

Event-Based Shape From Polarization

Adversarially Robust Neural Architecture Search for Graph Neural Networks

An In-Depth Exploration of Person Re-Identification and Gait Recognition in Cloth-Changing Conditions

From Node Interaction To Hop Interaction: New Effective and Scalable Graph Learning Paradigm

Transfer4D: A Framework for Frugal Motion Capture and Deformation Transfer

HumanBench: Towards General Human-Centric Perception With Projector Assisted Pretraining

SECAD-Net: Self-Supervised CAD Reconstruction by Learning Sketch-Extrude Operations

Learning Multi-Modal Class-Specific Tokens for Weakly Supervised Dense Object Localization

Instant Volumetric Head Avatars

Boundary-Enhanced Co-Training for Weakly Supervised Semantic Segmentation

CRAFT: Concept Recursive Activation FacTorization for Explainability

Don’t Lie to Me! Robust and Efficient Explainability With Verified Perturbation Analysis

HOOD: Hierarchical Graphs for Generalized Modelling of Clothing Dynamics

Learning To Predict Scene-Level Implicit 3D From Posed RGBD Data

Prompting Large Language Models With Answer Heuristics for Knowledge-Based Visual Question Answering

Harmonious Feature Learning for Interactive Hand-Object Pose Estimation

Decompose, Adjust, Compose: Effective Normalization by Playing With Frequency for Domain Generalization

Habitat-Matterport 3D Semantics Dataset

Reinforcement Learning-Based Black-Box Model Inversion Attacks

PIRLNav: Pretraining With Imitation and RL Finetuning for ObjectNav

DC2: Dual-Camera Defocus Control by Learning To Refocus

The ObjectFolder Benchmark: Multisensory Learning With Neural and Real Objects

Rethinking Domain Generalization for Face Anti-Spoofing: Separability and Alignment

Grad-PU: Arbitrary-Scale Point Cloud Upsampling via Gradient Descent With Learned Distance Functions

Texts as Images in Prompt Tuning for Multi-Label Image Recognition

Unsupervised Inference of Signed Distance Functions From Single Sparse Point Clouds Without Learning Priors

Neural Vector Fields: Implicit Representation by Explicit Learning

Zero-Shot Pose Transfer for Unrigged Stylized 3D Characters

Complete 3D Human Reconstruction From a Single Incomplete Image

PartDistillation: Learning Parts From Instance Segmentation

EDICT: Exact Diffusion Inversion via Coupled Transformations

PartSLIP: Low-Shot Part Segmentation for 3D Point Clouds via Pretrained Image-Language Models

Regularized Vector Quantization for Tokenized Image Synthesis

EDGE: Editable Dance Generation From Music

Low-Light Image Enhancement via Structure Modeling and Guidance

Where Is My Spot? Few-Shot Image Generation via Latent Subspace Optimization

Bilateral Memory Consolidation for Continual Learning

Structure Aggregation for Cross-Spectral Stereo Image Guided Denoising

What Happened 3 Seconds Ago? Inferring the Past With Thermal Imaging

Contrastive Grouping With Transformer for Referring Image Segmentation

Learning To Segment Every Referring Object Point by Point

Event-Guided Person Re-Identification via Sparse-Dense Complementary Learning

RIDCP: Revitalizing Real Image Dehazing via High-Quality Codebook Priors

Dynamic Graph Enhanced Contrastive Learning for Chest X-Ray Report Generation

GlassesGAN: Eyewear Personalization Using Synthetic Appearance Discovery and Targeted Subspace Modeling

Gaussian Label Distribution Learning for Spherical Image Object Detection

Long Range Pooling for 3D Large-Scale Scene Understanding

Learning Spatial-Temporal Implicit Neural Representations for Event-Guided Video Super-Resolution

PaletteNeRF: Palette-Based Appearance Editing of Neural Radiance Fields

Ego-Body Pose Estimation via Ego-Head Pose Estimation

Towards Stable Human Pose Estimation via Cross-View Fusion and Foot Stabilization

Command-Driven Articulated Object Understanding and Manipulation

ReasonNet: End-to-End Driving With Temporal and Global Reasoning

Learning Instance-Level Representation for Large-Scale Multi-Modal Pretraining in E-Commerce

Adaptive Sparse Pairwise Loss for Object Re-Identification

FedSeg: Class-Heterogeneous Federated Learning for Semantic Segmentation

UniDAformer: Unified Domain Adaptive Panoptic Segmentation Transformer via Hierarchical Mask Calibration

Iterative Next Boundary Detection for Instance Segmentation of Tree Rings in Microscopy Images of Shrub Cross Sections

HyperReel: High-Fidelity 6-DoF Video With Ray-Conditioned Sampling

Compressing Volumetric Radiance Fields to 1 MB

EC2: Emergent Communication for Embodied Control

Joint Visual Grounding and Tracking With Natural Language Specification

TranSG: Transformer-Based Skeleton Graph Prototype Contrastive Learning With Structure-Trajectory Prompted Reconstruction for Person Re-Identification

DAA: A Delta Age AdaIN Operation for Age Estimation via Binary Code Transformer

Evading DeepFake Detectors via Adversarial Statistical Consistency

DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation

Disentangled Representation Learning for Unsupervised Neural Quantization

Zero-Shot Model Diagnosis

Omni3D: A Large Benchmark and Model for 3D Object Detection in the Wild

Learning Bottleneck Concepts in Image Classification

Multi-Centroid Task Descriptor for Dynamic Class Incremental Inference

Explicit Boundary Guided Semi-Push-Pull Contrastive Learning for Supervised Anomaly Detection

CORA: Adapting CLIP for Open-Vocabulary Detection With Region Prompting and Anchor Pre-Matching

CXTrack: Improving 3D Point Cloud Tracking With Contextual Information

Efficient Multimodal Fusion via Interactive Prompting

MOTRv2: Bootstrapping End-to-End Multi-Object Tracking by Pretrained Object Detectors

AsyFOD: An Asymmetric Adaptation Paradigm for Few-Shot Domain Adaptive Object Detection

NerVE: Neural Volumetric Edges for Parametric Curve Extraction From Point Cloud

Referring Multi-Object Tracking

Paint by Example: Exemplar-Based Image Editing With Diffusion Models

Interactive Cartoonization With Controllable Perceptual Factors

Rethinking the Approximation Error in 3D Surface Fitting for Point Cloud Normal Estimation

Weakly Supervised Monocular 3D Object Detection Using Multi-View Projection and Direction Consistency

Representing Volumetric Videos As Dynamic MLP Maps

3D-Aware Face Swapping

Exploring the Effect of Primitives for Compositional Generalization in Vision-and-Language

NeuWigs: A Neural Dynamic Model for Volumetric Hair Capture and Animation

Proximal Splitting Adversarial Attack for Semantic Segmentation

Data-Free Sketch-Based Image Retrieval

CDDFuse: Correlation-Driven Dual-Branch Feature Decomposition for Multi-Modality Image Fusion

Spherical Transformer for LiDAR-Based 3D Recognition

Adaptive Global Decay Process for Event Cameras

Adapting Shortcut With Normalizing Flow: An Efficient Tuning Framework for Visual Recognition

IPCC-TP: Utilizing Incremental Pearson Correlation Coefficient for Joint Multi-Agent Trajectory Prediction

SOOD: Towards Semi-Supervised Oriented Object Detection

Towards Artistic Image Aesthetics Assessment: A Large-Scale Dataset and a New Method

Standing Between Past and Future: Spatio-Temporal Modeling for Multi-Camera 3D Multi-Object Tracking

Three Guidelines You Should Know for Universally Slimmable Self-Supervised Learning

Spatially Adaptive Self-Supervised Learning for Real-World Image Denoising

IterativePFN: True Iterative Point Cloud Filtering

3D-Aware Facial Landmark Detection via Multi-View Consistent Training on Synthetic Data

Semi-Supervised Domain Adaptation With Source Label Adaptation

Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training

Self-Supervised Video Forensics by Audio-Visual Anomaly Detection

Visual-Tactile Sensing for In-Hand Object Reconstruction

MaLP: Manipulation Localization Using a Proactive Scheme

Logical Consistency and Greater Descriptive Power for Facial Hair Attribute Learning

N-Gram in Swin Transformers for Efficient Lightweight Image Super-Resolution

Object-Aware Distillation Pyramid for Open-Vocabulary Object Detection

ShadowDiffusion: When Degradation Prior Meets Diffusion Model for Shadow Removal

Learning Human Mesh Recovery in 3D Scenes

Adaptive Zone-Aware Hierarchical Planner for Vision-Language Navigation

Curricular Contrastive Regularization for Physics-Aware Single Image Dehazing

VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation

Raw Image Reconstruction With Learned Compact Metadata

End-to-End 3D Dense Captioning With Vote2Cap-DETR

Generating Human Motion From Textual Descriptions With Discrete Representations

RefSR-NeRF: Towards High Fidelity and Super Resolution View Synthesis

Towards Building Self-Aware Object Detectors via Reliable Uncertainty Quantification and Calibration

Human Guided Ground-Truth Generation for Realistic Image Super-Resolution

DiffPose: Toward More Reliable 3D Pose Estimation

SQUID: Deep Feature In-Painting for Unsupervised Anomaly Detection

DegAE: A New Pretraining Paradigm for Low-Level Vision

RankMix: Data Augmentation for Weakly Supervised Learning of Classifying Whole Slide Images With Diverse Sizes and Imbalanced Categories

Trajectory-Aware Body Interaction Transformer for Multi-Person Pose Forecasting

Revisiting Self-Similarity: Structural Embedding for Image Retrieval

Bringing Inputs to Shared Domains for 3D Interacting Hands Recovery in the Wild

Towards Bridging the Performance Gaps of Joint Energy-Based Models

FFF: Fragment-Guided Flexible Fitting for Building Complete Protein Structures

Layout-Based Causal Inference for Object Navigation

POTTER: Pooling Attention Transformer for Efficient Human Mesh Recovery

Coaching a Teachable Student

Shape-Constraint Recurrent Flow for 6D Object Pose Estimation

Catch Missing Details: Image Reconstruction With Frequency Augmented Variational Autoencoder

Rigidity-Aware Detection for 6D Object Pose Estimation

ShadowNeuS: Neural SDF Reconstruction by Shadow Ray Supervision

ALOFT: A Lightweight MLP-Like Architecture With Dynamic Low-Frequency Transform for Domain Generalization

The Enemy of My Enemy Is My Friend: Exploring Inverse Adversaries for Improving Adversarial Training

Specialist Diffusion: Plug-and-Play Sample-Efficient Fine-Tuning of Text-to-Image Diffusion Models To Learn Any Unseen Style

Primitive Generation and Semantic-Related Alignment for Universal Zero-Shot Segmentation

Semi-Supervised Hand Appearance Recovery via Structure Disentanglement and Dual Adversarial Discrimination

TryOnDiffusion: A Tale of Two UNets

Breaking the “Object” in Video Object Segmentation

SunStage: Portrait Reconstruction and Relighting Using the Sun as a Light Stage

Object Discovery From Motion-Guided Tokens

Batch Model Consolidation: A Multi-Task Model Consolidation Framework

Dense Network Expansion for Class Incremental Learning

IMP: Iterative Matching and Pose Estimation With Adaptive Pooling

LightPainter: Interactive Portrait Relighting With Freehand Scribble

Unified Pose Sequence Modeling

VindLU: A Recipe for Effective Video-and-Language Pretraining

MAGE: MAsked Generative Encoder To Unify Representation Learning and Image Synthesis

I2MVFormer: Large Language Model Generated Multi-View Document Supervision for Zero-Shot Image Classification

FreeSeg: Unified, Universal and Open-Vocabulary Image Segmentation

Continual Semantic Segmentation With Automatic Memory Sample Selection

Regularizing Second-Order Influences for Continual Learning

Boost Vision Transformer With GPU-Friendly Sparsity and Quantization

Progressive Semantic-Visual Mutual Adaption for Generalized Zero-Shot Learning

NeRFInvertor: High Fidelity NeRF-GAN Inversion for Single-Shot Real Image Animation

HairStep: Transfer Synthetic to Real Using Strand and Depth Maps for Single-View 3D Hair Modeling

VL-SAT: Visual-Linguistic Semantics Assisted Training for 3D Semantic Scene Graph Prediction in Point Cloud

F2-NeRF: Fast Neural Radiance Field Training With Free Camera Trajectories

Frustratingly Easy Regularization on Representation Can Boost Deep Reinforcement Learning

Context-Aware Alignment and Mutual Masking for 3D-Language Pre-Training

Geometry and Uncertainty-Aware 3D Point Cloud Class-Incremental Semantic Segmentation

UniDistill: A Universal Cross-Modality Knowledge Distillation Framework for 3D Object Detection in Bird’s-Eye View

ImageNet-E: Benchmarking Neural Network Robustness via Attribute Editing

Image Super-Resolution Using T-Tetromino Pixels

Highly Confident Local Structure Based Consensus Graph Learning for Incomplete Multi-View Clustering

Non-Contrastive Learning Meets Language-Image Pre-Training

Dynamic Inference With Grounding Based Vision and Language Models

A Simple Baseline for Video Restoration With Grouped Spatial-Temporal Shift

Cross-GAN Auditing: Unsupervised Identification of Attribute Level Similarities and Differences Between Pretrained Generative Models

Analyzing Physical Impacts Using Transient Surface Wave Imaging

Deep Learning of Partial Graph Matching via Differentiable Top-K

LargeKernel3D: Scaling Up Kernels in 3D Sparse CNNs

Collecting Cross-Modal Presence-Absence Evidence for Weakly-Supervised Audio-Visual Event Perception

Binarizing Sparse Convolutional Networks for Efficient Point Cloud Analysis

Hunting Sparsity: Density-Guided Contrastive Learning for Semi-Supervised Semantic Segmentation

Autoregressive Visual Tracking

LinK: Linear Kernel for LiDAR-Based 3D Perception

Weakly Supervised Segmentation With Point Annotations for Histopathology Images via Contrast-Based Variational Model

KiUT: Knowledge-Injected U-Transformer for Radiology Report Generation

DARE-GRAM: Unsupervised Domain Adaptation Regression by Aligning Inverse Gram Matrices

FCC: Feature Clusters Compression for Long-Tailed Visual Recognition

DartBlur: Privacy Preservation With Detection Artifact Suppression

Efficient Frequency Domain-Based Transformers for High-Quality Image Deblurring

Focus on Details: Online Multi-Object Tracking With Diverse Fine-Grained Representation

Transfer Knowledge From Head to Tail: Uncertainty Calibration Under Long-Tailed Distribution

Deep Stereo Video Inpainting

Open-Set Semantic Segmentation for Point Clouds via Adversarial Prototype Framework

CLAMP: Prompt-Based Contrastive Learning for Connecting Language and Animal Pose

Video Dehazing via a Multi-Range Temporal Alignment Network With Physical Prior

RGBD2: Generative Scene Synthesis via Incremental View Inpainting Using RGBD Diffusion Models

Prototype-Based Embedding Network for Scene Graph Generation

Backdoor Defense via Adaptively Splitting Poisoned Dataset

GaitGCI: Generative Counterfactual Intervention for Gait Recognition

Bi-Directional Feature Fusion Generative Adversarial Network for Ultra-High Resolution Pathological Image Virtual Re-Staining

Vector Quantization With Self-Attention for Quality-Independent Representation Learning

Fine-Grained Face Swapping via Regional GAN Inversion

VoxelNeXt: Fully Sparse VoxelNet for 3D Object Detection and Tracking

Frequency-Modulated Point Cloud Rendering With Easy Editing

TAPS3D: Text-Guided 3D Textured Shape Generation From Pseudo Supervision

Semantic Human Parsing via Scalable Semantic Transfer Over Multiple Label Domains

Enhancing the Self-Universality for Transferable Targeted Attacks

Multi-View Inverse Rendering for Large-Scale Real-World Indoor Scenes

A Unified Pyramid Recurrent Network for Video Frame Interpolation

SemiCVT: Semi-Supervised Convolutional Vision Transformer for Semantic Segmentation

Images Speak in Images: A Generalist Painter for In-Context Visual Learning

FastInst: A Simple Query-Based Model for Real-Time Instance Segmentation

Rethinking Optical Flow From Geometric Matching Consistent Perspective

MMANet: Margin-Aware Distillation and Modality-Aware Regularization for Incomplete Multimodal Learning

Self-Supervised Implicit Glyph Attention for Text Recognition

Semi-Supervised Video Inpainting With Cycle Consistency Constraints

Patch-Based 3D Natural Scene Generation From a Single Example

Listening Human Behavior: 3D Human Pose Estimation With Acoustic Signals

Mobile User Interface Element Detection via Adaptively Prompt Tuning

High-Fidelity Facial Avatar Reconstruction From Monocular Video With Generative Priors

Anchor3DLane: Learning To Regress 3D Anchors for Monocular 3D Lane Detection

BioNet: A Biologically-Inspired Network for Face Recognition

PyramidFlow: High-Resolution Defect Contrastive Localization Using Pyramid Normalizing Flow

Neural Kaleidoscopic Space Sculpting

Accelerating Dataset Distillation via Model Augmentation

ProD: Prompting-To-Disentangle Domain Knowledge for Cross-Domain Few-Shot Image Classification

Diverse Embedding Expansion Network and Low-Light Cross-Modality Benchmark for Visible-Infrared Person Re-Identification

Neural Dependencies Emerging From Learning Massive Categories

DeFeeNet: Consecutive 3D Human Motion Prediction With Deviation Feedback

LP-DIF: Learning Local Pattern-Specific Deep Implicit Function for 3D Objects and Scenes

A Dynamic Multi-Scale Voxel Flow Network for Video Prediction

Dionysus: Recovering Scene Structures by Dividing Into Semantic Pieces

Being Comes From Not-Being: Open-Vocabulary Text-to-Motion Generation With Wordless Training

Unsupervised Cumulative Domain Adaptation for Foggy Scene Optical Flow

DynaMask: Dynamic Mask Selection for Instance Segmentation

HandNeRF: Neural Radiance Fields for Animatable Interacting Hands

Simultaneously Short- and Long-Term Temporal Modeling for Semi-Supervised Video Semantic Segmentation

Devil Is in the Queries: Advancing Mask Transformers for Real-World Medical Image Segmentation and Out-of-Distribution Localization

Rethinking the Learning Paradigm for Dynamic Facial Expression Recognition

PLA: Language-Driven Open-Vocabulary 3D Scene Understanding

Learning With Fantasy: Semantic-Aware Virtual Contrastive Constraint for Few-Shot Class-Incremental Learning

Towards Accurate Image Coding: Improved Autoregressive Image Generation With Dynamic Vector Quantization

MSMDFusion: Fusing LiDAR and Camera at Multiple Scales With Multi-Depth Seeds for 3D Object Detection

Learning Anchor Transformations for 3D Garment Animation

LaserMix for Semi-Supervised LiDAR Semantic Segmentation

Enhanced Training of Query-Based Object Detection via Selective Query Recollection

SCoDA: Domain Adaptive Shape Completion for Real Scans

VideoTrack: Learning To Track Objects via Video Transformer

Probing Sentiment-Oriented Pre-Training Inspired by Human Sentiment Perception Mechanism

Towards Professional Level Crowd Annotation of Expert Domain Data

MoDAR: Using Motion Forecasting for 3D Object Detection in Point Cloud Sequences

iQuery: Instruments As Queries for Audio-Visual Sound Separation

RGB No More: Minimally-Decoded JPEG Vision Transformers

Label-Free Liver Tumor Segmentation

Zero-Shot Object Counting

Exploring Incompatible Knowledge Transfer in Few-Shot Image Generation

DepGraph: Towards Any Structural Pruning

GRES: Generalized Referring Expression Segmentation

Tracking Through Containers and Occluders in the Wild

Ambiguity-Resistant Semi-Supervised Learning for Dense Object Detection

Neural Preset for Color Style Transfer

Vid2Avatar: 3D Avatar Reconstruction From Videos in the Wild via Self-Supervised Scene Decomposition

CNVid-3.5M: Build, Filter, and Pre-Train the Large-Scale Public Chinese Video-Text Dataset

NeRF-RPN: A General Framework for Object Detection in NeRFs

Diffusion-SDF: Text-To-Shape via Voxelized Diffusion

PointAvatar: Deformable Point-Based Head Avatars From Videos

Connecting the Dots: Floorplan Reconstruction Using Two-Level Queries

DLBD: A Self-Supervised Direct-Learned Binary Descriptor

MSF: Motion-Guided Sequential Fusion for Efficient 3D Object Detection From Point Cloud Sequences

Multi-Space Neural Radiance Fields

HelixSurf: A Robust and Efficient Neural Implicit Surface Learning of Indoor Scenes With Iterative Intertwined Regularization

Learning With Noisy Labels via Self-Supervised Adversarial Noisy Masking

Toward Accurate Post-Training Quantization for Image Super Resolution

Generating Holistic 3D Human Motion From Speech

TexPose: Neural Texture Learning for Self-Supervised 6D Object Pose Estimation

End-to-End Vectorized HD-Map Construction With Piecewise Bezier Curve

Accelerating Vision-Language Pretraining With Free Language Modeling

OSRT: Omnidirectional Image Super-Resolution With Distortion-Aware Transformer

Interactive Segmentation As Gaussian Process Classification

Probability-Based Global Cross-Modal Upsampling for Pansharpening

Boosting Verified Training for Robust Image Classifications via Abstraction

PVT-SSD: Single-Stage 3D Object Detector With Point-Voxel Transformer

StyleRes: Transforming the Residuals for Real Image Editing With StyleGAN

Breaching FedMD: Image Recovery via Paired-Logits Inversion Attack

FrustumFormer: Adaptive Instance-Aware Resampling for Multi-View 3D Detection

Edge-Aware Regional Message Passing Controller for Image Forgery Localization

PD-Quant: Post-Training Quantization Based on Prediction Difference Metric

Learning Geometric-Aware Properties in 2D Representation Using Lightweight CAD Models, or Zero Real 3D Pairs

Learning To Dub Movies via Hierarchical Prosody Models

Binary Latent Diffusion

NeRFVS: Neural Radiance Fields for Free View Synthesis via Geometry Scaffolds

Neural Kernel Surface Reconstruction

You Can Ground Earlier Than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos

Hybrid Neural Rendering for Large-Scale Scenes With Motion Blur

Self-Supervised Non-Uniform Kernel Estimation With Flow-Based Motion Prior for Blind Image Deblurring

Are We Ready for Vision-Centric Driving Streaming Perception? The ASAP Benchmark

The Devil Is in the Points: Weakly Semi-Supervised Instance Segmentation via Point-Guided Mask Representation

Unified Mask Embedding and Correspondence Learning for Self-Supervised Video Segmentation

OmniObject3D: Large-Vocabulary 3D Object Dataset for Realistic Perception, Reconstruction and Generation

Skinned Motion Retargeting With Residual Perception of Motion Semantics & Geometry

Multispectral Video Semantic Segmentation: A Benchmark Dataset and Baseline

3D Registration With Maximal Cliques

Inferring and Leveraging Parts From Object Shape for Improving Semantic Image Synthesis

LoGoNet: Towards Accurate 3D Object Detection With Local-to-Global Cross-Modal Fusion

Modeling Video As Stochastic Processes for Fine-Grained Video Representation Learning

A2J-Transformer: Anchor-to-Joint Transformer Network for 3D Interacting Hand Pose Estimation From a Single RGB Image

Distilling Vision-Language Pre-Training To Collaborate With Weakly-Supervised Temporal Action Localization

Discriminator-Cooperated Feature Map Distillation for GAN Compression

Balancing Logit Variation for Long-Tailed Semantic Segmentation

Efficient Mask Correction for Click-Based Interactive Image Segmentation

Dual Alignment Unsupervised Domain Adaptation for Video-Text Retrieval

Simple Cues Lead to a Strong Multi-Object Tracker

MixTeacher: Mining Promising Labels With Mixed Scale Teacher for Semi-Supervised Object Detection

Masked Auto-Encoders Meet Generative Adversarial Networks and Beyond

Unifying Short and Long-Term Tracking With Graph Hierarchies

Spring: A High-Resolution High-Detail Dataset and Benchmark for Scene Flow, Optical Flow and Stereo

Unsupervised 3D Point Cloud Representation Learning by Triangle Constrained Contrast for Autonomous Driving

NeRF-DS: Neural Radiance Fields for Dynamic Specular Objects

CABM: Content-Aware Bit Mapping for Single Image Super-Resolution Network With Large Input

System-Status-Aware Adaptive Network for Online Streaming Video Understanding

On the Difficulty of Unpaired Infrared-to-Visible Video Translation: Fine-Grained Content-Rich Patches Transfer

Gloss Attention for Gloss-Free Sign Language Translation

FAC: 3D Representation Learning via Foreground Aware Feature Contrast

InstMove: Instance Motion for Object-Centric Video Segmentation

Less Is More: Reducing Task and Model Complexity for 3D Point Cloud Semantic Segmentation

ResFormer: Scaling ViTs With Multi-Resolution Training

CrowdCLIP: Unsupervised Crowd Counting via Vision-Language Model

Sibling-Attack: Rethinking Transferable Adversarial Attacks Against Face Recognition

Identity-Preserving Talking Face Generation With Landmark and Appearance Priors

Divide and Adapt: Active Domain Adaptation via Customized Learning

Putting People in Their Place: Affordance-Aware Human Insertion Into Scenes

Advancing Visual Grounding With Scene Knowledge: Benchmark and Method

Parametric Implicit Face Representation for Audio-Driven Facial Reenactment

Improved Distribution Matching for Dataset Condensation

Semi-DETR: Semi-Supervised Object Detection With Detection Transformers

SHS-Net: Learning Signed Hyper Surfaces for Oriented Normal Estimation of Point Clouds

DBARF: Deep Bundle-Adjusting Generalizable Neural Radiance Fields

Stitchable Neural Networks

Equivalent Transformation and Dual Stream Network Construction for Mobile Image Super-Resolution

Dynamic Aggregated Network for Gait Recognition

MoLo: Motion-Augmented Long-Short Contrastive Learning for Few-Shot Action Recognition

Omni Aggregation Networks for Lightweight Image Super-Resolution

Masked Image Modeling With Local Multi-Scale Reconstruction

RaBit: Parametric Modeling of 3D Biped Cartoon Characters With a Topological-Consistent Dataset

Disentangling Orthogonal Planes for Indoor Panoramic Room Layout Estimation With Cross-Scale Distortion Awareness

Global-to-Local Modeling for Video-Based 3D Human Pose and Shape Estimation

Top-Down Visual Attention From Analysis by Synthesis

Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation

SimpleNet: A Simple Network for Image Anomaly Detection and Localization

DisCoScene: Spatially Disentangled Generative Radiance Fields for Controllable 3D-Aware Scene Synthesis

IS-GGT: Iterative Scene Graph Generation With Generative Transformers

MSINet: Twins Contrastive Search of Multi-Scale Interaction for Object ReID

Executing Your Commands via Motion Diffusion in Latent Space

NEF: Neural Edge Fields for 3D Parametric Curve Reconstruction From Multi-View Images

FLAG3D: A 3D Fitness Activity Dataset With Language Instruction

Towards Universal Fake Image Detectors That Generalize Across Generative Models

NeuralUDF: Learning Unsigned Distance Fields for Multi-View Reconstruction of Surfaces With Arbitrary Topologies

Context De-Confounded Emotion Recognition

PA&DA: Jointly Sampling Path and Data for Consistent NAS

You Only Segment Once: Towards Real-Time Panoptic Segmentation

Activating More Pixels in Image Super-Resolution Transformer

DisWOT: Student Architecture Search for Distillation WithOut Training

Effective Ambiguity Attack Against Passport-Based DNN Intellectual Property Protection Schemes Through Fully Connected Layer Substitution

Efficient Semantic Segmentation by Altering Resolutions for Compressed Videos

Object Pose Estimation With Statistical Guarantees: Conformal Keypoint Detection and Geometric Uncertainty Propagation

Dynamically Instance-Guided Adaptation: A Backward-Free Approach for Test-Time Domain Adaptive Semantic Segmentation

Pose Synchronization Under Multiple Pair-Wise Relative Poses

Open Set Action Recognition via Multi-Label Evidential Learning

Micron-BERT: BERT-Based Facial Micro-Expression Recognition

Genie: Show Me the Data for Quantization

Deep Graph Reprogramming

Generalizable Local Feature Pre-Training for Deformable Shape Analysis

Collaborative Diffusion for Multi-Modal Face Generation and Editing

Diffusion Probabilistic Model Made Slim

Unsupervised Domain Adaption With Pixel-Level Discriminator for Image-Aware Layout Generation

Token Contrast for Weakly-Supervised Semantic Segmentation

VoP: Text-Video Co-Operative Prompt Tuning for Cross-Modal Retrieval

BAD-NeRF: Bundle Adjusted Deblur Neural Radiance Fields

Class-Conditional Sharpness-Aware Minimization for Deep Long-Tailed Recognition

Local Implicit Normalizing Flow for Arbitrary-Scale Image Super-Resolution

Think Twice Before Driving: Towards Scalable Decoders for End-to-End Autonomous Driving

DoNet: Deep De-Overlapping Network for Cytology Instance Segmentation

PointListNet: Deep Learning on 3D Point Lists

NeuralLift-360: Lifting an In-the-Wild 2D Photo to a 3D Object With 360° Views

Representation Learning for Visual Object Tracking by Masked Appearance Transfer

Boosting Detection in Crowd Analysis via Underutilized Output Features

Endpoints Weight Fusion for Class Incremental Semantic Segmentation

Symmetric Shape-Preserving Autoencoder for Unsupervised Real Scene Point Cloud Completion

LiDAR2Map: In Defense of LiDAR-Based Semantic Map Construction Using Online Camera Distillation

StyleSync: High-Fidelity Generalized and Personalized Lip Sync in Style-Based Generator

Masked Motion Encoding for Self-Supervised Video Representation Learning

In-Hand 3D Object Scanning From an RGB Sequence

SceneComposer: Any-Level Semantic Image Synthesis

QuantArt: Quantizing Image Style Transfer Towards High Visual Fidelity

Magic3D: High-Resolution Text-to-3D Content Creation

Uncovering the Missing Pattern: Unified Framework Towards Trajectory Imputation and Prediction

3D-Aware Conditional Image Synthesis

GeneCIS: A Benchmark for General Conditional Image Similarity

Neighborhood Attention Transformer

DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation

LOCATE: Localize and Transfer Object Parts for Weakly Supervised Affordance Grounding

Camouflaged Object Detection With Feature Decomposition and Edge Reconstruction

Iterative Proposal Refinement for Weakly-Supervised Video Grounding

SIM: Semantic-Aware Instance Mask Generation for Box-Supervised Instance Segmentation

Deep Discriminative Spatial and Temporal Network for Efficient Video Deblurring

Continuous Sign Language Recognition With Correlation Network

Learning a Sparse Transformer Network for Effective Image Deraining

Iterative Geometry Encoding Volume for Stereo Matching

Look Before You Match: Instance Understanding Matters in Video Object Segmentation

Decoupled Semantic Prototypes Enable Learning From Diverse Annotation Types for Semi-Weakly Segmentation in Expert-Driven Domains

Neuron Structure Modeling for Generalizable Remote Physiological Measurement

Long-Tailed Visual Recognition via Self-Heterogeneous Integration With Knowledge Excavation

MDQE: Mining Discriminative Query Embeddings To Segment Occluded Instances on Challenging Videos

One-to-Few Label Assignment for End-to-End Dense Detection

Intrinsic Physical Concepts Discovery With Object-Centric Predictive Models

Uncertainty-Aware Unsupervised Image Deblurring With Deep Residual Prior

EditableNeRF: Editing Topologically Varying Neural Radiance Fields by Key Points

Progressive Open Space Expansion for Open-Set Model Attribution

Seeing a Rose in Five Thousand Ways

Unsupervised Sampling Promoting for Stochastic Human Trajectory Prediction

Object-Goal Visual Navigation via Effective Exploration of Relations Among Historical Navigation States

Large-Scale Training Data Search for Object Re-Identification

AttentionShift: Iteratively Estimated Part-Based Attention Map for Pointly Supervised Instance Segmentation

Use Your Head: Improving Long-Tail Video Recognition

The Wisdom of Crowds: Temporal Progressive Attention for Early Action Prediction

Looking Through the Glass: Neural Surface Reconstruction Against High Specular Reflections

MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining

Deep Arbitrary-Scale Image Super-Resolution via Scale-Equivariance Pursuit

Phase-Shifting Coder: Predicting Accurate Orientation in Oriented Object Detection

HexPlane: A Fast Representation for Dynamic Scenes

AdamsFormer for Spatial Action Localization in the Future

HIER: Metric Learning Beyond Class Labels via Hierarchical Regularization

Content-Aware Token Sharing for Efficient Semantic Segmentation With Vision Transformers

Revisiting Weak-to-Strong Consistency in Semi-Supervised Semantic Segmentation

Multiview Compressive Coding for 3D Reconstruction

Prompt, Generate, Then Cache: Cascade of Foundation Models Makes Strong Few-Shot Learners

Fake It Till You Make It: Learning Transferable Representations From Synthetic ImageNet Clones

How Can Objects Help Action Recognition?

Understanding and Improving Features Learned in Deep Functional Maps

Soft Augmentation for Image Classification

GraVoS: Voxel Selection for 3D Point-Cloud Detection

RenderDiffusion: Image Diffusion for 3D Reconstruction, Inpainting and Generation

Learned Two-Plane Perspective Prior Based Image Resampling for Efficient Object Detection

Adversarial Counterfactual Visual Explanations

Tracking Multiple Deformable Objects in Egocentric Videos

CUF: Continuous Upsampling Filters

Curvature-Balanced Feature Manifold Learning for Long-Tailed Classification

DIP: Dual Incongruity Perceiving Network for Sarcasm Detection

LipFormer: High-Fidelity and Generalizable Talking Face Generation With a Pre-Learned Facial Codebook

Meta-Causal Learning for Single Domain Generalization

Mind the Label Shift of Augmentation-Based Graph OOD Generalization

BAEFormer: Bi-Directional and Early Interaction Transformers for Bird’s Eye View Semantic Segmentation

Learning a Simple Low-Light Image Enhancer From Paired Low-Light Instances

MSeg3D: Multi-Modal 3D Semantic Segmentation for Autonomous Driving

Weakly Supervised Posture Mining for Fine-Grained Classification

A Light Weight Model for Active Speaker Detection

Graphics Capsule: Learning Hierarchical 3D Face Representations From 2D Images

Network Expansion for Practical Training Acceleration

Upcycling Models Under Domain and Category Shift

CAT: LoCalization and IdentificAtion Cascade Detection Transformer for Open-World Object Detection

TIPI: Test Time Adaptation With Transformation Invariance

BEV-SAN: Accurate BEV 3D Object Detection via Slice Attention Networks

Joint Video Multi-Frame Interpolation and Deblurring Under Unknown Exposure Time

CLOTH4D: A Dataset for Clothed Human Reconstruction

Deep Incomplete Multi-View Clustering With Cross-View Partial Sample and Prototype Alignment

Rethinking Out-of-Distribution (OOD) Detection: Masked Image Modeling Is All You Need

DETR With Additional Global Aggregation for Cross-Domain Weakly Supervised Object Detection

Learning Imbalanced Data With Vision Transformers

StructVPR: Distill Structural Knowledge With Weighting Samples for Visual Place Recognition

Asymmetric Feature Fusion for Image Retrieval

DiffSwap: High-Fidelity and Controllable Face Swapping via 3D-Aware Masked Diffusion

1% VS 100%: Parameter-Efficient Low Rank Adapter for Dense Predictions

LASP: Text-to-Text Optimization for Language-Aware Soft Prompting of Vision & Language Models

OT-Filter: An Optimal Transport Filter for Learning With Noisy Labels

Learning Adaptive Dense Event Stereo From the Image Domain

Progressive Neighbor Consistency Mining for Correspondence Pruning

Adversarial Robustness via Random Projection Filters