Poster Session
Poster Session 6 & Exhibit Hall
Arch 4A-E
MonoHair: High-Fidelity Hair Modeling from a Monocular Video
Keyu Wu · LINGCHEN YANG · Zhiyi Kuang · Yao Feng · Xutao Han · Yuefan Shen · Hongbo Fu · Kun Zhou · Youyi Zheng
Undoubtedly, high-fidelity 3D hair is crucial for achieving realism, artistic expression, and immersion in computer graphics. While existing 3D hair modeling methods have achieved impressive performance, the challenge of achieving high-quality hair reconstruction persists: they either require strict capture conditions, making practical applications difficult, or heavily rely on learned prior data, obscuring fine-grained details in images. To address these challenges, we propose MonoHair, a generic framework to achieve high-fidelity hair reconstruction from a monocular video, without specific requirements for environments. Our approach bifurcates the hair modeling process into two main stages: precise exterior reconstruction and interior structure inference. The exterior is meticulously crafted using our Patch-based Multi-View Optimization (PMVO). This method strategically collects and integrates hair information from multiple views, independent of prior data, to produce a high-fidelity exterior 3D line map. This map not only captures intricate details but also facilitates the inference of the hair’s inner structure. For the interior, we employ a data-driven, multi-view 3D hair reconstruction method. This method utilizes 2D structural renderings derived from the reconstructed exterior, mirroring the synthetic 2D inputs used during training. This alignment effectively bridges the domain gap between our training data and real-world data, thereby enhancing the accuracy and reliability of our interior structure inference. Lastly, we generate a strand model and resolve the directional ambiguity by our hair growth algorithm. Our experiments demonstrate that our method exhibits robustness across diverse hairstyles and achieves state-of-the-art performance. For more results, please refer to our project page https://keyuwu-cs.github.io/MonoHair/
BadCLIP: Trigger-Aware Prompt Learning for Backdoor Attacks on CLIP
Jiawang Bai · Kuofeng Gao · Shaobo Min · Shu-Tao Xia · Zhifeng Li · Wei Liu
Contrastive Vision-Language Pre-training, known as CLIP, has shown promising effectiveness in addressing downstream image recognition tasks. However, recent works revealed that the CLIP model can be implanted with a downstream-oriented backdoor. On downstream tasks, one victim model performs well on clean samples but predicts a specific target class whenever a specific trigger is present. For injecting a backdoor, existing attacks depend on a large amount of additional data to maliciously fine-tune the entire pre-trained CLIP model, which makes them inapplicable to data-limited scenarios. In this work, motivated by the recent success of learnable prompts, we address this problem by injecting a backdoor into the CLIP model in the prompt learning stage. Our method named BadCLIP is built on a novel and effective mechanism in backdoor attacks on CLIP, i.e., influencing both the image and text encoders with the trigger. It consists of a learnable trigger applied to images and a trigger-aware context generator, such that the trigger can change text features via trigger-aware prompts, resulting in a powerful and generalizable attack. Extensive experiments conducted on 11 datasets verify that the clean accuracy of BadCLIP is similar to those of advanced prompt learning methods and the attack success rate is higher than 99% in most cases. BadCLIP is also generalizable to unseen classes, and shows a strong generalization capability under cross-dataset and cross-domain settings. The code is available at https://github.com/jiawangbai/BadCLIP.
Despite its importance, generating attacks for multi-label learning (MLL) models has received much less attention compared to multi-class recognition. Attacking an MLL model by optimizing a loss on the target set of labels has often the undesired consequence of changing the predictions for other labels. On the other hand, adding a loss on the remaining labels to keep them fixed leads to highly negatively correlated gradient directions, reducing the attack effectiveness. In this paper, we develop a framework for crafting effective and semantic-aware adversarial attacks for MLL. First, to obtain an attack that leads to semantically consistent predictions across all labels, we find a minimal superset of the target labels, referred to as consistent target set. To do so, we develop an efficient search algorithm over a knowledge graph, which encodes label dependencies. Next, we propose an optimization that searches for an attack that modifies the predictions of labels in the consistent target set while ensuring other labels will not get affected. This leads to an efficient algorithm that projects the gradient of the consistent target set loss onto the orthogonal direction of the gradient of the loss on other labels. Our framework can generate attacks on different target set sizes and for MLL with thousands of labels (as in OpenImages). Finally, by extensive experiments on three datasets and several MLL models, we show that our method generates both successful and semantically consistent attacks.
Defense without Forgetting: Continual Adversarial Defense with Anisotropic & Isotropic Pseudo Replay
Yuhang Zhou · Zhongyun Hua
Deep neural networks have demonstrated susceptibility to adversarial attacks. Adversarial defense techniques often focus on one-shot setting to maintain robustness against attack. However, new attacks can emerge in sequences in real-world deployment scenarios. As a result, it is crucial for a defense model to constantly adapt to new attacks, but the adaptation process can lead to catastrophic forgetting of previously defended against attacks. In this paper, we discuss for the first time the concept of continual adversarial defense under a sequence of attacks, and propose a lifelong defense baseline called Anisotropic \& Isotropic Replay (AIR), which offers three advantages: (1) Isotropic replay ensures model consistency in the neighborhood distribution of new data, indirectly aligning the output preference between old and new tasks. (2) Anisotropic replay enables the model to learn a compromise data manifold with fresh mixed semantics for further replay constraints and potential future attacks. (3) A straightforward regularizer mitigates the 'plasticity-stability' trade-off by aligning model output between new and old tasks. Experiment results demonstrate that AIR can approximate or even exceed the empirical performance upper bounds achieved by Joint Training.
Learning to Transform Dynamically for Better Adversarial Transferability
Rongyi Zhu · Zeliang Zhang · Susan Liang · Zhuo Liu · Chenliang Xu
Adversarial examples, crafted by adding perturbations imperceptible to humans, can deceive neural networks. Recent studies identify the adversarial transferability across various models, i.e., the cross-model attack ability of adversarial samples. To enhance such adversarial transferability, existing input transformation-based methods diversify input data with transformation augmentation. However, their effectiveness is limited by the finite number of available transformations. In our study, we introduce a novel approach named Learning to Transform (L2T). L2T increases the diversity of transformed images by selecting the optimal combination of operations from a pool of candidates, consequently improving adversarial transferability. We conceptualize the selection of optimal transformation combinations as a trajectory optimization problem and employ a reinforcement learning strategy to effectively solve the problem. Comprehensive experiments on the ImageNet dataset, as well as practical tests with Google Vision and GPT-4V, reveal that L2T surpasses current methodologies in enhancing adversarial transferability, thereby confirming its effectiveness and practical significance.
Infrared Adversarial Car Stickers
Xiaopei Zhu · Yuqiu Liu · Zhanhao Hu · Jianmin Li · Xiaolin Hu
Infrared physical adversarial examples are of great significance for studying the security of infrared AI systems that are widely used in our lives such as autonomous driving. Previous infrared physical attacks mainly focused on 2D infrared pedestrian detection which may not fully manifest its destructiveness to AI systems. In this work, we propose a physical attack method against infrared detectors based on 3D modeling, which is applied to a real car. The goal is to design a set of infrared adversarial stickers to make cars invisible to infrared detectors at various viewing angles, distances, and scenes. We build a 3D infrared car model with real infrared characteristics and propose an infrared adversarial pattern generation method based on 3D mesh shadow. We propose a 3D control points-based mesh smoothing algorithm and use a set of smoothness loss functions to enhance the smoothness of adversarial meshes and facilitate the sticker implementation. Besides, We designed the aluminum stickers and conducted physical experiments on two real Mercedes-Benz A200L cars. Our adversarial stickers hid the cars from Faster RCNN, an object detector, at various viewing angles, distances, and scenes. The attack success rate (ASR) was 91.49% for real cars. In comparison, the ASRs of random stickers and no sticker were only 6.21% and 0.66%, respectively. In addition, the ASRs of the designed stickers against six unseen object detectors such as YOLOv3 and Deformable DETR were between 73.35%-95.80%, showing good transferability of the attack performance across detectors.
Foundation segmentation models, while powerful, pose a significant risk: they enable users to effortlessly extract any objects from any digital content with a single click, potentially leading to copyright infringement or malicious misuse. To mitigate this risk, we introduce a new task Anything Unsegmentable'' to grant any image
the right to be unsegmented''. The ambitious pursuit of the task is to achieve highly transferable adversarial attack against all prompt-based segmentation models, regardless of model parameterizations and prompts. Through observation and analysis, we found that prompt-specific adversarial attacks generate highly variant perturbations that transfer narrowly, due to the heterogeneous nature of prompts. To achieve prompt-agnostic attacks, we focus on manipulating the image encoder features. Surprisingly we found that targetted feature perturbations lead to more transferable attacks, suggesting the optimal direction of optimization should be along the image distribution. Based on the observations, we design a novel attack named Unsegment Anything by Simulating Deformation (UAD). Our attack optimizes a differentiable deformation function to create a target deformed image, which alters structural information while preserving achievable feature distance by adversarial example. The optimization objective seeks trade-off between structural deformation and the fidelity of adversarial noise in simulating this deformation. Extensive experiments verify the effectiveness of our approach, compromising a variety of promptable segmentation models with different architectures and prompt interfaces.
Efficient Model Stealing Defense with Noise Transition Matrix
Dong-Dong Wu · Chilin Fu · Weichang Wu · Wenwen Xia · Xiaolu Zhang · JUN ZHOU · Min-Ling Zhang
With the escalating complexity and investment cost of training deep neural networks, safeguarding them from unauthorized usage and intellectual property theft has become imperative. Especially the rampant misuse of prediction APIs to replicate models without access to the original data or architecture poses grave security threats. Diverse defense strategies have emerged to address these vulnerabilities, yet these defenses either incur heavy inference overheads or assume idealized attack scenarios. To address these challenges, we revisit the utilization of noise transition matrix as an efficient perturbation technique, which injects noise into predicted posteriors in a linear manner and integrates seamlessly into existing systems with minimal overhead, for model stealing defense. Provably, with such perturbed posteriors, the attacker's cloning process degrades into learning from noisy data. Toward optimizing the noise transition matrix, we proposed a novel bi-level optimization training framework, which performs fidelity on the victim model while the surrogate model adversarially. Comprehensive experimental results demonstrate that our method effectively thwarts model stealing attacks and achieves minimal utility trade-offs, outperforming existing state-of-the-art defenses.
Fully Exploiting Every Real Sample: SuperPixel Sample Gradient Model Stealing
Yunlong Zhao · Xiaoheng Deng · Yijing Liu · Xinjun Pei · Jiazhi Xia · Wei Chen
Model stealing (MS) involves querying and observing the output of a machine learning model to steal its capabilities. The quality of queried data is crucial, yet obtaining a large amount of real data for MS is often challenging. Recent works have reduced reliance on real data by using generative models. However, when high-dimensional query data is required, these methods are impractical due to the high costs of querying and the risk of model collapse. In this work, we propose using sample gradients (SG) to enhance the utility of each real sample, as SG provides crucial guidance on the decision boundaries of the victim model. However, utilizing SG in the model stealing scenario faces two challenges: 1. Pixel-level gradient estimation requires extensive query volume and is susceptible to defenses. 2. The estimation of sample gradients has a significant variance. This paper proposes Superpixel Sample Gradient stealing (SPSG) for model stealing under the constraint of limited real samples. With the basic idea of imitating the victim model's low-variance patch-level gradients instead of pixel-level gradients, SPSG achieves efficient sample gradient estimation through two steps. First, we perform patch-wise perturbations on query images to estimate the average gradient in different regions of the image. Then, we filter the gradients through a threshold strategy to reduce variance. Exhaustive experiments demonstrate that, with the same number of real samples, SPSG achieves accuracy, agreements, and adversarial success rate significantly surpassing the current state-of-the-art MS methods. Codes are available at https://github.com/zyl123456aB/SPSG_attack.
Hide in Thicket: Generating Imperceptible and Rational Adversarial Perturbations on 3D Point Clouds
Tianrui Lou · Xiaojun Jia · Jindong Gu · Li Liu · Siyuan Liang · Bangyan He · Xiaochun Cao
Adversarial attack methods based on point manipulation for 3D point cloud classification have revealed the fragility of 3D models, yet the adversarial examples they produce are easily perceived or defended against. The trade-off between the imperceptibility and adversarial strength leads most point attack methods to inevitably introduce easily detectable outlier points upon a successful attack. Another promising strategy, shape-based attack, can effectively eliminate outliers, but existing methods often suffer significant reductions in imperceptibility due to irrational deformations. We find that concealing deformation perturbations in areas insensitive to human eyes can achieve a better trade-off between imperceptibility and adversarial strength, specifically in parts of the object surface that are complex and exhibit drastic curvature changes. Therefore, we propose a novel shape-based adversarial attack method, HiT-ADV, which initially conducts a two-stage search for attack regions based on saliency and imperceptibility scores, and then adds deformation perturbations in each attack region using Gaussian kernel functions. Additionally, HiT-ADV is extendable to physical attack. We propose that by employing benign resampling and benign rigid transformations, we can further enhance physical adversarial strength with little sacrifice to imperceptibility. Extensive experiments have validated the superiority of our method in terms of adversarial and imperceptible properties in both digital and physical spaces.
Boosting Adversarial Transferability by Block Shuffle and Rotation
Kunyu Wang · he xuanran · Wenxuan Wang · Xiaosen Wang
Adversarial examples mislead deep neural networks with imperceptible perturbations and have brought significant threats to deep learning. An important aspect is their transferability, which refers to their ability to deceive other models, thus enabling attacks in the black-box setting. Though various methods have been proposed to boost transferability, the performance still falls short compared with white-box attacks. In this work, we observe that existing input transformation based attacks, one of the mainstream transfer-based attacks, result in different attention heatmaps on various models, which might limit the transferability. We also find that breaking the intrinsic relation of the image can disrupt the attention heatmap of the original image. Based on this finding, we propose a novel input transformation based attack called block shuffle and rotation (BSR). Specifically, BSR splits the input image into several blocks, then randomly shuffles and rotates these blocks to construct a set of new images for gradient calculation. Empirical evaluations on the ImageNet dataset demonstrate that BSR could achieve significantly better transferability than the existing input transformation based methods under single-model and ensemble-model settings. Combining BSR with the current input transformation method can further improve the transferability, which significantly outperforms the state-of-the-art methods.
Robust Overfitting Does Matter: Test-Time Adversarial Purification With FGSM
Linyu Tang · Lei Zhang
Numerous studies have demonstrated the susceptibility of deep neural networks (DNNs) to subtle adversarial perturbations, prompting the development of many advanced adversarial defense methods aimed at mitigating adversarial attacks. Current defense strategies usually train DNNs for a specific adversarial attack method and can achieve good defense results in defense against this type of adversarial attacks. Nevertheless, when subjected to evaluations involving unfamiliar attack modalities, empirical evidence reveals a pronounced deterioration in the robustness of DNNs. Meanwhile, there is a trade-off between the classification accuracy of clean examples and adversarial examples. Most defense methods often sacrifice the accuracy of clean examples in order to improve the adversarial robustness of DNNs. To alleviate these problems and enhance the overall robustness and generalization of DNNs, we proposed the Test-Time Pixel-Level Adversarial Purification (TPAP) method. This approach is based on the robust overfitting characteristic of DNNs to the fast gradient sign method (FGSM) on training and test datasets. It utilizes FGSM for adversarial purification, to process images for purifying unknown adversarial perturbations from pixels at testing phase time in a "counter changes with changelessness" manner, thereby enhancing the defense capability of DNNs against various unknown adversarial attacks. Extensive experimental results show that our method can effectively improves both overall robustness and generalization of DNNs, notably over previous methods.
Data Poisoning based Backdoor Attacks to Contrastive Learning
Jinghuai Zhang · Hongbin Liu · Jinyuan Jia · Neil Zhenqiang Gong
Contrastive learning (CL) pre-trains general-purpose encoders using an unlabeled pre-training dataset, which consists of images or image-text pairs. CL is vulnerable to data poisoning based backdoor attacks (DPBAs), in which an attacker injects poisoned inputs into the pre-training dataset so the encoder is backdoored. However, existing DPBAs achieve limited effectiveness. In this work, we take the first step to analyze the limitations of existing backdoor attacks and propose new DPBAs called CorruptEncoder to CL. CorruptEncoder introduces a new attack strategy to create poisoned inputs and uses a theory-guided method to maximize attack effectiveness. Our experiments show that CorruptEncoder substantially outperforms existing DPBAs. In particular, CorruptEncoder is the first DPBA that achieves more than 90% attack success rates with only a few (3) reference images and a small poisoning ratio (0.5%). Moreover, we also propose a defense, called localized cropping, to defend against DPBAs. Our results show that our defense can reduce the effectiveness of DPBAs, but it sacrifices the utility of the encoder, highlighting the need for new defenses.
NAPGuard: Towards Detecting Naturalistic Adversarial Patches
Siyang Wu · Jiakai Wang · Jiejie Zhao · Yazhe Wang · Xianglong Liu
Recently, the emergence of naturalistic adversarial patch (NAP), which possesses a deceptive appearance and various representations, underscores the necessity of developing robust detection strategies.However, existing approaches fail to differentiate the deep-seated natures in adversarial patches, i.e., aggressiveness and naturalness, leading to unsatisfactory precision and generalization against NAPs.To tackle this issue, we propose NAPGuard to provide strong detection capability against NAPs via the elaborated critical feature modulation framework.For improving precision, we propose the aggressive feature aligned learning to enhance the model's capability in capturing accurate aggressive patterns. Considering the challenge of inaccurate model learning caused by deceptive appearance, we align the aggressive features by the proposed pattern alignment loss during training. Since the model could learn more accurate aggressive patterns, it is able to detect deceptive patches more precisely.To enhance generalization, we design the natural feature suppressed inference to universally mitigate the disturbance from different NAPs. Since various representations arise in diverse disturbing forms to hinder generalization, we suppress the natural features in a unified approach via the feature shield module. Therefore, the models could recognize NAPs within less disturbance and activate the generalized detection ability.Extensive experiments show that our method surpasses state-of-the-art methods by large margins in detecting NAPs (improve 60.24% AP\@0.5 on average).
Ensemble Diversity Facilitates Adversarial Transferability
Bowen Tang · Zheng Wang · Yi Bin · Qi Dou · Yang Yang · Heng Tao Shen
With the advent of ensemble-based attacks, the transferability of generated adversarial examples is elevated by a noticeable margin despite many methods only employing superficial integration yet ignoring the diversity between ensemble models. However, most of them compromise the latent value of the diversity between generated perturbation from distinct models which we argue is also able to increase the adversarial transferability, especially heterogeneous attacks. To address the issues, we propose a novel method of Stochastic Mini-batch black-box attack with Ensemble Reweighing using reinforcement learning (SMER) to produce highly transferable adversarial examples. We emphasize the diversity between surrogate models achieving individual perturbation iteratively. In order to customize the individual effect between surrogates, ensemble reweighing is introduced to refine ensemble weights by maximizing attack loss based on reinforcement learning which functions on the ultimate transferability elevation. Extensive experiments demonstrate our superiority to recent ensemble attacks with a significant margin across different black-box attack scenarios, especially on heterogeneous conditions.
Revamping Federated Learning Security from a Defender's Perspective: A Unified Defense with Homomorphic Encrypted Data Space
Naveen Kumar Kummari · Reshmi Mitra · Krishna Mohan Chalavadi
Federated Learning (FL) facilitates clients to collaborate on training a shared machine learning model without exposing individual private data. Nonetheless, FL remains susceptible to utility and privacy attacks, notably evasion data poisoning and model inversion attacks, compromising the system's efficiency and data privacy. Existing FL defenses are often specialized to a particular single attack, lacking generality and a comprehensive defender's perspective. To address these challenges, we introduce \textbf{F}ederated \textbf{C}ryptography \textbf{D}efense (FCD), a unified single framework aligning with the defender's perspective. FCD employs row-wise transposition cipher based data encryption with a secret key to counter both evasion black-box data poisoning and model inversion attacks. The crux of FCD lies in transferring the entire learning process into an encrypted data space and using a novel distillation loss guided by the Kullback-Leibler (KL) divergence. This measure compares the probability distributions of the local pretrained teacher model's predictions on normal data and the local student model's predictions on the same data in FCD's encrypted form. By working within this encrypted space, FCD eliminates the need for decryption at the server, resulting in reduced computational complexity. We demonstrate the practical feasibility of FCD and apply it to defend against evasion utility attack on benchmark datasets (GTSRB, KBTS, CIFAR10, and EMNIST). We further extend FCD for defending against model inversion attack in split FL on the CIFAR100 dataset. Our experiments across the diverse attack and FL settings demonstrate practical feasibility and robustness against utility evasion (impact >30) and privacy attacks (MSE >73) compared to the second best method.
Can Protective Perturbation Safeguard Personal Data from Being Exploited by Stable Diffusion?
Zhengyue Zhao · Jinhao Duan · Kaidi Xu · Chenan Wang · Rui Zhang · Zidong Du · Qi Guo · Xing Hu
Stable Diffusion has established itself as a foundation model in generative AI artistic applications, receiving widespread research and application. Some recent fine-tuning methods have made it feasible for individuals to implant personalized concepts onto the basic Stable Diffusion model with minimal computational costs on small datasets. However, these innovations have also given rise to issues like facial privacy forgery and artistic copyright infringement. In recent studies, researchers have explored the addition of imperceptible adversarial perturbations to images to prevent potential unauthorized exploitation and infringements when personal data is used for fine-tuning Stable Diffusion. Although these studies have demonstrated the ability to protect images, it is essential to consider that these methods may not be entirely applicable in real-world scenarios. In this paper, we systematically evaluate the use of perturbations to protect images within a practical threat model. The results suggest that these approaches may not be sufficient to safeguard image privacy and copyright effectively. Furthermore, we introduce a purification method capable of removing protected perturbations while preserving the original image structure to the greatest extent possible. Experiments reveal that Stable Diffusion can effectively learn from purified images over all protective methods.
One Prompt Word is Enough to Boost Adversarial Robustness for Pre-trained Vision-Language Models
Lin Li · Haoyan Guan · Jianing Qiu · Michael Spratling
Large pre-trained Vision-Language Models (VLMs) like CLIP, despite having remarkable generalization ability, are highly vulnerable to adversarial examples. This work studies the adversarial robustness of VLMs from the novel perspective of the text prompt instead of the extensively studied model weights (frozen in this work). We first show that the effectiveness of both adversarial attack and defense are sensitive to the used text prompt.Inspired by this, we propose a method to improve resilience to adversarial attacks by learning a robust text prompt for VLMs. The proposed method, named Adversarial Prompt Tuning (APT), is effective while being both computationally and data efficient.Extensive experiments are conducted across 15 datasets and 4 data sparsity schemes (from 1-shot to full training data settings) to show APT's superiority over hand-engineered prompts and other state-of-the-art adaption methods. APT demonstrated excellent abilities in terms of the in-distribution performance and the generalization under input distribution shift and across datasets.Surprisingly, by simply adding one learned word to the prompts, APT can significantly boost the accuracy and robustness ($\epsilon=4/255$) over the hand-engineered prompts by +13% and +8.5% on average respectively. The improvement further increases, in our most effective setting, to +26.4% for accuracy and +16.7% for robustness.
Watermark-embedded Adversarial Examples for Copyright Protection against Diffusion Models
Peifei Zhu · Tsubasa Takahashi · Hirokatsu Kataoka
Diffusion Models (DMs) have shown remarkable capabilities in various image-generation tasks. However, there are growing concerns that DMs could be used to imitate unauthorized creations and thus raise copyright issues. To address this issue, we propose a novel framework that embeds personal watermarks in the generation of adversarial examples. Such examples can force DMs to generate images with visible watermarks and prevent DMs from imitating unauthorized images. We construct a generator based on conditional adversarial networks and design three losses (adversarial loss, GAN loss, and perturbation loss) to generate adversarial examples that have subtle perturbation but can effectively attack DMs to prevent copyright violations. Training a generator for a personal watermark by our method only requires 5-10 samples within 2-3 minutes, and once the generator is trained, it can generate adversarial examples with that watermark significantly fast (0.2s per image). We conduct extensive experiments in various conditional image-generation scenarios. Compared to existing methods that generate images with chaotic textures, our method adds visible watermarks on the generated images, which is a more straightforward way to indicate copyright violations. We also observe that our adversarial examples exhibit good transferability across unknown generative models. Therefore, this work provides a simple yet powerful way to protect copyright from DM-based imitation.
Not All Prompts Are Secure: A Switchable Backdoor Attack Against Pre-trained Vision Transfomers
Sheng Yang · Jiawang Bai · Kuofeng Gao · Yong Yang · Yiming Li · Shu-Tao Xia
Given the power of vision transformers, a new learning paradigm, pre-training and then prompting, makes it more efficient and effective to address downstream visual recognition tasks. In this paper, we identify a novel security threat towards such a paradigm from the perspective of backdoor attacks. Specifically, an extra prompt token, called the switch token in this work, can turn the backdoor mode on, i.e., converting a benign model into a backdoored one. Once under the backdoor mode, a specific trigger can force the model to predict a target class. It poses a severe risk to the users of cloud API, since the malicious behavior can not be activated and detected under the benign mode, thus making the attack very stealthy. To attack a pre-trained model, our proposed attack, named SWARM, learns a trigger and prompt tokens including a switch token. They are optimized with the clean loss which encourages the model always behaves normally even the trigger presents, and the backdoor loss that ensures the backdoor can be activated by the trigger when the switch is on. Besides, we utilize the cross-mode feature distillation to reduce the effect of the switch token on clean samples. The experiments on diverse visual recognition tasks confirm the success of our switchable backdoor attack, i.e., achieving 95%+ attack success rate, and also being hard to be detected and removed.
Focus on Hiders: Exploring Hidden Threats for Enhancing Adversarial Training
Qian Li · Yuxiao Hu · Yinpeng Dong · Dongxiao Zhang · Yuntian Chen
Adversarial training is often formulated as a min-max problem, however, concentrating only on the worst adversarial examples causes alternating repetitive confusion of the model, i.e., previously defended or correctly classified samples are not defensible or accurately classifiable in subsequent adversarial training. We characterize such non-ignorable samples as ``hiders'', which reveal the hidden high-risk regions within the secure area obtained through adversarial training and prevent the model from finding the real worst cases. We demand the model to prevent hiders when defending against adversarial examples for improving accuracy and robustness simultaneously. By rethinking and redefining the min-max optimization problem for adversarial training, we propose a generalized adversarial training algorithm called Hider-Focused Adversarial Training (HFAT). HFAT introduces the iterative evolution optimization strategy to simplify the optimization problem and employs an auxiliary model to reveal hiders, effectively combining the optimization directions of standard adversarial training and prevention hiders. Furthermore, we introduce an adaptive weighting mechanism that facilitates the model in adaptively adjusting its focus between adversarial examples and hiders during different training periods. We demonstrate the effectiveness of our method based on extensive experiments, and ensure that HFAT can provide higher robustness and accuracy. We will release the source code upon publication.
Physical 3D Adversarial Attacks against Monocular Depth Estimation in Autonomous Driving
Junhao Zheng · Chenhao Lin · Jiahao Sun · Zhengyu Zhao · Qian Li · Chao Shen
Deep learning-based monocular depth estimation (MDE), extensively applied in autonomous driving, is known to be vulnerable to adversarial attacks. Previous physical attacks against MDE models rely on 2D adversarial patches, so they only affect a small, localized region in the MDE map but fail under various viewpoints. To address these limitations, we propose 3D Depth Fool (3D$^2$Fool), the first 3D texture-based adversarial attack against MDE models. 3D$^2$Fool is specifically optimized to generate 3D adversarial textures agnostic to model types of vehicles and to have improved robustness in bad weather conditions, such as rain and fog. Experimental results validate the superior performance of our 3D$^2$Fool across various scenarios, including vehicles, MDE models, weather conditions, and viewpoints. Real-world experiments with printed 3D textures on physical vehicle models further demonstrate that our 3D$^2$Fool can cause an MDE error of over 10 meters.
Distraction is All You Need: Memory-Efficient Image Immunization against Diffusion-Based Image Editing
Ling Lo · Cheng Yeo · Hong-Han Shuai · Wen-Huang Cheng
Recent text-to-image (T2I) diffusion models have revolutionized image editing by empowering users to control outcomes using natural language. However, the ease of image manipulation has raised ethical concerns, with the potential for malicious use in generating deceptive or harmful content. To address the concerns, we propose an image immunization approach named semantic attack to protect our images from being manipulated by malicious agents using diffusion models. Our approach focuses on disrupting the semantic understanding of T2I diffusion models regarding specific content. By attacking the cross-attention mechanism that encodes image features with text messages during editing, we distract the model's attention regarding the content of our concern. Our semantic attack renders the model uncertain about the areas to edit, resulting in poorly edited images and contradicting the malicious editing attempts. In addition, by shifting the attack target towards intermediate attention maps from the final generated image, our approach substantially diminishes computational burden and alleviates GPU memory constraints in comparison to previous methods. Moreover, we introduce timestep universal gradient updating to create timestep-agnostic perturbations effective across different input noise levels. By treating the full diffusion process as discrete denoising timesteps during the attack, we achieve equivalent or even superior immunization efficacy with nearly half the memory consumption of the previous method. Our contributions include a practical and effective approach to safeguard images against malicious editing, and the proposed method offers robust immunization against various image inpainting and editing approaches, showcasing its potential for real-world applications.
PAD: Patch-Agnostic Defense against Adversarial Patch Attacks
Lihua Jing · Rui Wang · Wenqi Ren · Xin Dong · Cong Zou
Adversarial patch attacks present a significant threat to real-world object detectors due to their practical feasibility. Existing defense methods, which rely on attack data or prior knowledge, struggle to effectively address a wide range of adversarial patches. In this paper, we show two inherent characteristics of adversarial patches, semantic independence and spatial heterogeneity, independent of their appearance, shape, size, quantity, and location. Semantic independence indicates that adversarial patches operate autonomously within their semantic context, while spatial heterogeneity manifests as distinct image quality of the patch area that differs from original clean image due to the independent generation process. Based on these observations, we propose PAD, a novel adversarial patch localization and removal method that does not require prior knowledge or additional training. PAD offers patch-agnostic defense against various adversarial patches, compatible with any pre-trained object detectors. Our comprehensive digital and physical experiments involving diverse patch types, such as localized noise, printable, and naturalistic patches, exhibit notable improvements over state-of-the-art works. Our code is available at https://github.com/Lihua-Jing/PAD.
PeerAiD: Improving Adversarial Distillation from a Specialized Peer Tutor
Jaewon Jung · Hongsun Jang · Jaeyong Song · Jinho Lee
Adversarial robustness of the neural network is a significant concern when it is applied to security-critical domains.In this situation, adversarial distillation is a promising option which aims to distill the robustness of the teacher network to improve the robustness of a small student network.Previous works pretrain the teacher network to make it robust against the adversarial examples aimed at itself.However, the adversarial examples are dependent on the parameters of the target network.The fixed teacher network inevitably degrades its robustness against the unseen transferred adversarial examples which target the parameters of the student network in the adversarial distillation process.We propose PeerAiD to make a peer network learn the adversarial examples of the student network instead of adversarial examples aimed at itself.PeerAiD is an adversarial distillation that trains the peer network and the student network simultaneously in order to specialize the peer network for defending the student network.We observe that such peer networks surpass the robustness of the pretrained robust teacher model against adversarial examples aimed at the student network.With this peer network and adversarial distillation, PeerAiD achieves significantly higher robustness of the student network with AutoAttack (AA) accuracy by up to 1.66%p and improves the natural accuracy of the student network by up to 4.72%p with ResNet-18 on TinyImageNet dataset.Code is available at https://github.com/jaewonalive/PeerAiD.
Revisiting Adversarial Training Under Long-Tailed Distributions
Xinli Yue · Ningping Mou · Qian Wang · Lingchen Zhao
Deep neural networks are vulnerable to adversarial attacks, leading to erroneous outputs. Adversarial training has been recognized as one of the most effective methods to counter such attacks. However, existing adversarial training techniques have predominantly been evaluated on balanced datasets, whereas real-world data often exhibit a long-tailed distribution, casting doubt on the efficacy of these methods in practical scenarios. In this paper, we delve into the performance of adversarial training under long-tailed distributions. Through an analysis of the prior method "RoBal" (Wu et al., CVPR'21), we discover that utilizing Balanced Softmax Loss (BSL) alone can obtain comparable performance to the complete RoBal approach while significantly reducing the training overhead. Then, we reveal that adversarial training under long-tailed distributions also suffers from robust overfitting similar to uniform distributions. We explore utilizing data augmentation to mitigate this issue and unexpectedly discover that, unlike results obtained with balanced data, data augmentation not only effectively alleviates robust overfitting but also significantly improves robustness. We further identify that the improvement is attributed to the increased diversity of training data. Extensive experiments further corroborate that data augmentation alone can significantly improve robustness. Finally, building on these findings, we demonstrate that compared to RoBal, the combination of BSL and data augmentation leads to a +6.66% improvement in model robustness under AutoAttack on CIFAR-10-LT. Our code is available at: https://github.com/NISPLab/AT-BSL.
Pre-trained Model Guided Fine-Tuning for Zero-Shot Adversarial Robustness
Sibo Wang · Jie Zhang · Zheng Yuan · Shiguang Shan
Large-scale pre-trained vision-language models like CLIP have demonstrated impressive performance across various tasks, and exhibit remarkable zero-shot generalization capability, while they are also vulnerable to imperceptible adversarial examples. Existing works typically employ adversarial training (fine-tuning) as a defense method against adversarial examples. However, direct application to the CLIP model may result in overfitting, compromising the model's capacity for generalization.In this paper, we propose Pre-trained Model Guided Adversarial Fine-Tuning (PMG-AFT) method, which leverages supervision from the original pre-trained model by carefully designing an auxiliary branch, to enhance the model's zero-shot adversarial robustness.Specifically, PMG-AFT minimizes the distance between the features of adversarial examples in the target model and those in the pre-trained model, aiming to preserve the generalization features already captured by the pre-trained model.Extensive Experiments on 15 zero-shot datasets demonstrate that PMG-AFT significantly outperforms the state-of-the-art method, improving the top-1 robust accuracy by an average of 4.99\%.Furthermore, our approach consistently improves clean accuracy by an average of 8.72\%.Our code is available at \href{https://github.com/serendipity1122/Pre-trained-Model-Guided-Fine-Tuning-for-Zero-Shot-Adversarial-Robustness}{here}.\footnote{https://github.com/serendipity1122/Pre-trained-Model-Guided-Fine-Tuning-for-Zero-Shot-Adversarial-Robustness}
Towards Transferable Targeted 3D Adversarial Attack in the Physical World
Yao Huang · Yinpeng Dong · Shouwei Ruan · Xiao Yang · Hang Su · Xingxing Wei
Compared with transferable untargeted attacks, transferable targeted adversarial attacks could specify the misclassification categories of adversarial samples, posing a greater threat to security-critical tasks. In the meanwhile, 3D adversarial samples, due to their potential of multi-view robustness, can more comprehensively identify weaknesses in existing deep learning systems, possessing great application value. However, the field of transferable targeted 3D adversarial attacks remains vacant. The goal of this work is to develop a more effective technique that could generate transferable targeted 3D adversarial examples, filling the gap in this field. To achieve this goal, we design a novel framework named TT3D that could rapidly reconstruct from few multi-view images into Transferable Targeted 3D textured meshes. While existing mesh-based texture optimization methods compute gradients in the high-dimensional mesh space and easily fall into local optima, leading to unsatisfactory transferability and significant distortions in naturalness, TT3D innovatively performs dual optimization towards both feature grid and Multi-layer Perceptron (MLP) parameters in the grid-based NeRF space, which significantly enhances black-box transferability meanwhile enjoying the naturalness. Experimental results show that TT3D not only exhibits superior cross-model transferability but also maintains considerable adaptability across different renders and vision tasks. More importantly, we produce 3D adversarial textured meshes with 3D printing techniques in the real world and verify their robust performance under various scenarios.
Nearest is Not Dearest: Towards Practical Defense against Quantization-conditioned Backdoor Attacks
Boheng Li · Yishuo Cai · Haowei Li · Feng Xue · Zhifeng Li · Yiming Li
Model quantization is widely used to compress and accelerate deep neural networks. However, recent studies have revealed the feasibility of weaponizing model quantization via implanting quantization-conditioned backdoors (QCBs). These special backdoors stay dormant on released full-precision models but will come into effect after standard quantization. Due to the peculiarity of QCBs, existing defenses have minor effects on reducing their threats or are even infeasible. In this paper, we conduct the first in-depth analysis of QCB. We reveal that the activation of existing QCBs primarily stems from the nearest rounding operation and is closely related to the norms of neuron-wise truncation errors ($i.e.$, the difference between the continuous full-precision weights and its quantized version). Motivated by these insights, we propose \textbf{E}rror-guided \textbf{F}lipped \textbf{R}ounding with \textbf{A}ctivation \textbf{P}reservation (EFRAP), an effective and practical defense against QCBs. Specifically, EFRAP learns a non-nearest rounding strategy with neuron-wise error norm and layer-wise activation preservation guidance, flipping the rounding strategies of neurons crucial for backdoor effects but with minimal impact on clean accuracy. Extensive evaluations on benchmark datasets demonstrate that our EFRAP can defeat state-of-the-art QCB attacks under various settings.
Perturbing Attention Gives You More Bang for the Buck: Subtle Imaging Perturbations That Efficiently Fool Customized Diffusion Models
Jingyao Xu · Yuetong Lu · Yandong Li · Siyang Lu · Dongdong Wang · Xiang Wei
Diffusion models (DMs) embark a new era of generative modeling and offer more opportunities for efficient generating high-quality and realistic data samples. However, their widespread use has also brought forth new challenges in model security, which motivates the creation of more effective adversarial attackers on DMs to understand its vulnerability. We propose CAAT, a simple but generic and efficient approach that does not require costly training to effectively fool latent diffusion models (LDMs). The approach is based on the observation that cross-attention layers exhibits higher sensitivity to gradient change, allowing for leveraging subtle perturbations on published images to significantly corrupt the generated images. We show that a subtle perturbation on an image can significantly impact the cross-attention layers, thus changing the mapping between text and image during the fine-tuning of customized diffusion models. Extensive experiments demonstrate that CAAT is compatible with diverse diffusion models and outperforms baseline attack methods in a more effective (more noise) and efficient (twice as fast as Anti-DreamBooth and Mist) manner.
Boosting Adversarial Training via Fisher-Rao Norm-based Regularization
Xiangyu Yin · Wenjie Ruan
Adversarial training is extensively utilized to improve the adversarial robustness of deep neural networks. Yet, mitigating the degradation of standard generalization performance in adversarial-trained models remains an open problem. This paper attempts to resolve this issue through the lens of model complexity. First, We leverage the Fisher-Rao norm, a geometrically invariant metric for model complexity, to establish the non-trivial bounds of the Cross-Entropy Loss-based Rademacher complexity for a ReLU-activated Multi-Layer Perceptron. Then we generalize a complexity-related variable, which is sensitive to the changes in model width and the trade-off factors in adversarial training. Moreover, intensive empirical evidence validates that this variable highly correlates with the generalization gap of Cross-Entropy loss between adversarial-trained and standard-trained models, especially during the initial and final phases of the training process. Building upon this observation, we propose a novel regularization framework, called Logit-Oriented Adversarial Training (LOAT), which can mitigate the trade-off between robustness and accuracy while imposing only a negligible increase in computational overhead. Our extensive experiments demonstrate that the proposed regularization strategy can boost the performance of the prevalent adversarial training algorithms, including PGD-AT, TRADES, TRADES (LSE), MART, and DM-AT, across various network architectures. Our code will be available at https://github.com/TrustAI/LOAT.
Random Entangled Tokens for Adversarially Robust Vision Transformer
Huihui Gong · Minjing Dong · Siqi Ma · Seyit Camtepe · Surya Nepal · Chang Xu
Vision Transformers (ViTs) have emerged as a compelling alternative to Convolutional Neural Networks (CNNs) in the realm of computer vision, showcasing tremendous potential. However, recent research has unveiled a susceptibility of ViTs to adversarial attacks, akin to their CNN counterparts. Adversarial training and randomization are two representative effective defenses for CNNs. Some researchers have attempted to apply adversarial training to ViTs and achieved comparable robustness to CNNs, while it is not easy to directly apply randomization to ViTs because of the architecture difference between CNNs and ViTs. In this paper, we delve into the structural intricacies of ViTs and propose a novel defense mechanism termed Random entangled image Transformer (ReiT), which seamlessly integrates adversarial training and randomization to bolster the adversarial robustness of ViTs. Recognizing the challenge posed by the structural disparities between ViTs and CNNs, we introduce a novel module, input-independent random entangled self-attention (II-ReSA). This module optimizes random entangled tokens that lead to "dissimilar" self-attention outputs by leveraging model parameters and the sampled random tokens, thereby synthesizing the self-attention module outputs and random entangled tokens to diminish adversarial similarity. ReiT incorporates two distinct random entangled tokens and employs dual randomization, offering an effective countermeasure against adversarial examples while ensuring comprehensive deduction guarantees. Through extensive experiments conducted on various ViT variants and benchmarks, we substantiate the superiority of our proposed method in enhancing the adversarial robustness of Vision Transformers.
Deep neural networks have played a crucial part in many critical domains, such as autonomous driving, face recognition, and medical diagnosis. However, deep neural networks are facing security threats from backdoor attacks and can be manipulated into attacker-decided behaviors by the backdoor attacker. To defend the backdoor, prior research has focused on using clean data to remove backdoor attacks before model deployment. In this paper, we investigate the possibility of defending against backdoor attacks by utilizing test-time partially poisoned data to remove the backdoor from the model. To address the problem, a two-stage method TTBD is proposed. In the first stage, we propose a backdoor sample detection method DDP to identify poisoned samples from a batch of mixed, partially poisoned samples. Once the poisoned samples are detected, we employ Shapley estimation to calculate the contribution of each neuron's significance in the network, locate the poisoned neurons, and prune them to remove backdoor in the models. Our experiments demonstrate that TTBD removes the backdoor successfully with only a batch of partially poisoned data across different model architectures and datasets against different types of backdoor attacks.
1-Lipschitz Layers Compared: Memory Speed and Certifiable Robustness
Bernd Prach · Fabio Brau · Giorgio Buttazzo · Christoph Lampert
The robustness of neural networks against input perturbations with bounded magnitude represents a serious concern in the deployment of deep learning models in safety-critical systems. Recently, the scientific community has focused on enhancing certifiable robustness guarantees by crafting 1-Lipschitz neural networks that leverage Lipschitz bounded dense and convolutional layers. Although different methods have been proposed in the literature to achieve this goal, understanding the performance of such methods is not straightforward, since different metrics can be relevant (e.g., training time, memory usage, accuracy, certifiable robustness) for different applications. For this reason, this work provides a thorough theoretical and empirical comparison between methods by evaluating them in terms of memory usage, speed, and certifiable robust accuracy. The paper also provides some guidelines and recommendations to support the user in selecting the methods that work best depending on the available resources.
DiffAM: Diffusion-based Adversarial Makeup Transfer for Facial Privacy Protection
Yuhao Sun · Lingyun Yu · Hongtao Xie · Jiaming Li · Yongdong Zhang
With the rapid development of face recognition (FR) systems, the privacy of face images on social media is facing severe challenges due to the abuse of unauthorized FR systems. Some studies utilize adversarial attack techniques to defend against malicious FR systems by generating adversarial examples. However, the generated adversarial examples, i.e., the protected face images, tend to suffer from subpar visual quality and low transferability. In this paper, we propose a novel face protection approach, dubbed DiffAM, which leverages the powerful generative ability of diffusion models to generate high-quality protected face images with adversarial makeup transferred from reference images. To be specific, we first introduce a makeup removal module to generate non-makeup images utilizing a fine-tuned diffusion model with guidance of textual prompts in CLIP space. As the inverse process of makeup transfer, makeup removal can make it easier to establish the deterministic relationship between makeup domain and non-makeup domain regardless of elaborate text prompts. Then, with this relationship, a CLIP-based makeup loss along with an ensemble attack strategy is introduced to jointly guide the direction of adversarial makeup domain, achieving the generation of protected face images with natural-looking makeup and high black-box transferability. Extensive experiments demonstrate that DiffAM achieves higher visual quality and attack success rates with a gain of 13.14% under black-box setting compared with the state of the arts.
DAP: A Dynamic Adversarial Patch for Evading Person Detectors
Amira Guesmi · Ruitian Ding · Muhammad Abdullah Hanif · Ihsen Alouani · Muhammad Shafique
Patch-based adversarial attacks were proven to compromise the robustness and reliability of computer vision systems.However, their conspicuous and easily detectable nature challenge their practicality in real-world setting. To address this, recent work has proposed using Generative Adversarial Networks (GANs) to generate naturalistic patches that may not attract human attention. However, such approaches suffer from a limited latent space making it challenging to produce a patch that is efficient, stealthy, and robust to multiple real-world transformations.This paper introduces a novel approach that produces a Dynamic Adversarial Patch (DAP) designed to overcome these limitations. DAP maintains a naturalistic appearance while optimizing attack efficiency and robustness to real-world transformations.The approach involves redefining the optimization problem and introducing a novel objective function that incorporates a similarity metric to guide the patch's creation. Unlike GAN-based techniques, the DAP directly modifies pixel values within the patch, providing increased flexibility and adaptability to multiple transformations. Furthermore, most clothing-based physical attacks assume static objects and ignore the possible transformations caused by non-rigid deformation due to changes in a person’s pose. To address this limitation, a `Creases Transformation' (CT) block is introduced, enhancing the patch's resilience to a variety of real-world distortions.Experimental results demonstrate that the proposed approach outperforms state-of-the-art attacks, achieving a success rate of up to 82.28\% in the digital world when targeting the YOLOv7 detector and 65\% in the physical world when targeting YOLOv3tiny detector deployed in edge-based smart cameras.
Adversarial Distillation Based on Slack Matching and Attribution Region Alignment
Shenglin Yin · Zhen Xiao · Mingxuan Song · Jieyi Long
Adversarial distillation (AD) is a highly effective method for enhancing the robustness of small models.Contrary to expectations, a high-performing teacher model does not always result in a more robust student model.This is due to two main reasons. First, when there are significant differences in predictions between the teacher model and the student model, exact matching of predicted values using KL divergence interferes with training, leading to poor performance of existing methods. Second, matching solely based on the output prevents the student model from fully understanding the behavior of the teacher model.To address these challenges, this paper proposes a novel AD method named SmaraAD. During the training process, we facilitate the student model in better understanding the teacher model's behavior by aligning the attribution region that the student model focuses on with that of the teacher model. Concurrently, we relax the condition of exact matching in KL divergence and replace it with a more flexible matching criterion, thereby enhancing the model's robustness. Extensive experiments substantiate the effectiveness of our method in improving the robustness of small models, outperforming previous SOTA methods.
Improving Transferable Targeted Adversarial Attacks with Model Self-Enhancement
Han Wu · Guanyan Ou · Weibin Wu · Zibin Zheng
Various transfer attack methods have been proposed to evaluate the robustness of deep neural networks (DNNs). Although manifesting remarkable performance in generating untargeted adversarial perturbations, existing proposals still fail to achieve high targeted transferability. In this work, we discover that the adversarial perturbations' overfitting towards source models of mediocre generalization capability can hurt their targeted transferability. To address this issue, we focus on enhancing the source model's generalization capability to improve its ability to conduct transferable targeted adversarial attacks. In pursuit of this goal, we propose a novel model self-enhancement method that incorporates two major components: Sharpness-Aware Self-Distillation (SASD) and Weight Scaling (WS). Specifically, SASD distills a fine-tuned auxiliary model, which mirrors the source model's structure, into the source model while flattening the source model's loss landscape. WS obtains an approximate ensemble of numerous pruned models to perform model augmentation, which can be conveniently synergized with SASD to elevate the source model's generalization capability and thus improve the resultant targeted perturbations' transferability. Extensive experiments corroborate the effectiveness of the proposed method. Notably, under the black-box setting, our approach can outperform the state-of-the-art baselines by a significant margin of 12.2\% on average in terms of the obtained targeted transferability. Code is available at https://github.com/g4alllf/SASD.
On the Robustness of Large Multimodal Models Against Image Adversarial Attacks
Xuanming Cui · Alejandro Aparcedo · Young Kyun Jang · Ser-Nam Lim
Recent advances in instruction tuning have led to the development of State-of-the-Art Large Multimodal Models (LMMs). Given the novelty of these models, the impact of visual adversarial attacks on LMMs has not been thoroughly examined. We conduct a comprehensive study of the robustness of various LMMs against different adversarial attacks, evaluated across tasks including image classification, image captioning, and Visual Question Answer (VQA). We find that in general LMMs are not robust to visual adversarial inputs. However, our findings suggest that context provided to the model via prompts—such as questions in a QA pair—helps to mitigate the effects of visual adversarial inputs. Notably, the LMMs evaluated demonstrated remarkable resilience to such attacks on the ScienceQA task with only an 8.10% drop in performance compared to their visual counterparts which dropped 99.73%. We also propose a new approach to real-world image classification which we term query decomposition. By incorporating existence queries into our input prompt we observe diminished attack effectiveness and improvements in image classification accuracy. This research highlights a previously under explored facet of LMM robustness and sets the stage for future work aimed at strengthening the resilience of multimodal systems in adversarial environments.
Intriguing Properties of Diffusion Models: An Empirical Study of the Natural Attack Capability in Text-to-Image Generative Models
Takami Sato · Justin Yue · Nanze Chen · Ningfei Wang · Alfred Chen
Denoising probabilistic diffusion models have shown breakthrough performance to generate more photo-realistic images or human-level illustrations than the prior models such as GANs. This high image-generation capability has stimulated the creation of many downstream applications in various areas. However, we find that this technology is actually a double-edged sword: We identify a new type of attack, called the Natural Denoising Diffusion (NDD) attack based on the finding that state-of-the-art deep neural network (DNN) models still hold their prediction even if we intentionally remove their robust features, which are essential to the human visual system (HVS), through text prompts. The NDD attack shows a significantly high capability to generate low-cost, model-agnostic, and transferable adversarial attacks by exploiting the natural attack capability in diffusion models. To systematically evaluate the risk of the NDD attack, we perform a large-scale empirical study with our newly created dataset, the Natural Denoising Diffusion Attack (NDDA) dataset. We evaluate the natural attack capability by answering 6 research questions. Through a user study, we find that it can achieve an 88% detection rate while being stealthy to 93% of human subjects; we also find that the non-robust features embedded by diffusion models contribute to the natural attack capability. To confirm the model-agnostic and transferable attack capability, we perform the NDD attack against the Tesla Model 3 and find that 73% of the physically printed attacks can be detected as stop signs. Our hope is that the study and dataset can help our community be aware of the risks in diffusion models and facilitate further research toward robust DNN models.
BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive Learning
Siyuan Liang · Mingli Zhu · Aishan Liu · Baoyuan Wu · Xiaochun Cao · Ee-Chien Chang
While existing backdoor attacks have successfully infected multimodal contrastive learning models such as CLIP, they can be easily countered by specialized backdoor defenses for MCL models. This paper reveals the threats in this practical scenario and introduces the BadCLIP attack, which is resistant to backdoor detection and model fine-tuning defenses. To achieve this, we draw motivations from the perspective of the Bayesian rule and propose a dual-embedding guided framework for backdoor attacks. Specifically, we ensure that visual trigger patterns approximate the textual target semantics in the embedding space, making it challenging to detect the subtle parameter variations induced by backdoor learning on such natural trigger patterns. Additionally, we optimize the visual trigger patterns to align the poisoned samples with target vision features in order to hinder backdoor unlearning through clean fine-tuning. Our experiments show a significant improvement in attack success rate (+45.3 % ASR) over current leading methods, even against state-of-the-art backdoor defenses, highlighting our attack's effectiveness in various scenarios, including downstream tasks. Our codes can be found at https://github.com/LiangSiyuan21/BadCLIP.
MMCert: Provable Defense against Adversarial Attacks to Multi-modal Models
Yanting Wang · Hongye Fu · Wei Zou · Jinyuan Jia
Different from a unimodal model whose input is from a single modality, the input (called multi-modal input) of a multi-modal model is from multiple modalities such as image, 3D points, audio, text, etc. Similar to unimodal models, many existing studies show that a multi-modal model is also vulnerable to adversarial perturbation, where an attacker could add small perturbation to all modalities of a multi-modal input such that the multi-modal model makes incorrect predictions for it. Existing certified defenses are mainly designed for unimodal models. Our experimental results show they achieve sub-optimal certified robustness guarantees when extended to multi-modal models. In our work, we aim to bridge the gap. In particular, we propose MMCert, the first certified defense against adversarial attacks to a multi-modal model. We derive a lower bound on the performance of our MMCert under arbitrary adversarial attacks with bounded perturbations to both modalities (e.g., in the context of auto-driving, we bound the number of changed pixels in both RGB image and depth image). We evaluate our MMCert using two benchmark datasets: one for the multi-modal road segmentation task and the other for the multi-modal emotion recognition task. Moreover, we compare our MMCert with a state-of-the-art certified defense extended from unimodal models. Our experimental results show that our MMCert outperforms the baseline.
MimicDiffusion: Purifying Adversarial Perturbation via Mimicking Clean Diffusion Model
Kaiyu Song · Hanjiang Lai · Yan Pan · Jian Yin
Deep neural networks (DNNs) are vulnerable to adversarial perturbation, where an imperceptible perturbation is added to the image that can fool the DNNs. Diffusion-based adversarial purification uses the diffusion model to generate a clean image against such adversarial attacks. Unfortunately, the generative process of the diffusion model is also inevitably affected by adversarial perturbation since the diffusion model is also a deep neural network where its input has adversarial perturbation. In this work, we propose MimicDiffusion, a new diffusion-based adversarial purification technique that directly approximates the generative process of the diffusion model with the clean image as input. Concretely, we analyze the differences between the guided terms using the clean image and the adversarial sample. After that, we first implement MimicDiffusion based on Manhattan distance. Then, we propose two guidance to purify the adversarial perturbation and approximate the clean diffusion model. Extensive experiments on three image datasets, including CIFAR-10, CIFAR-100, and ImageNet, with three classifier backbones including WideResNet-70-16, WideResNet-28-10, and ResNet-50 demonstrate that MimicDiffusion significantly performs better than the state-of-the-art baselines. On CIFAR-10, CIFAR-100, and ImageNet, it achieves 92.67\%, 61.35\%, and 61.53\% average robust accuracy, which are 18.49\%, 13.23\%, and 17.64\% higher, respectively. The code is available at https://github.com/psky1111/MimicDiffusion.
Revisiting Adversarial Training at Scale
Zeyu Wang · Xianhang li · Hongru Zhu · Cihang Xie
The machine learning community has witnessed a drastic change in the training pipeline, pivoted by those ''foundation models'' with unprecedented scales. However, the field of adversarial training is lagging behind, predominantly centered around small model sizes like ResNet-50, and tiny and low-resolution datasets like CIFAR-10. To bridge this transformation gap, this paper provides a modern re-examination with adversarial training, investigating its potential benefits when applied at scale. Additionally, we introduce an efficient and effective training strategy to enable adversarial training with giant models and web-scale data at an affordable computing cost. We denote this newly introduced framework as AdvXL.Empirical results demonstrate that AdvXL establishes new state-of-the-art robust accuracy records under AutoAttack on ImageNet-1K. For example, by training on DataComp-1B dataset, our AdvXL empowers a vanilla ViT-g model to substantially surpass the previous records of $l_{\infty}$-, $l_{2}$-, and $l_{1}$-robust accuracy by margins of **11.4**, **14.2** and **12.9**, respectively. This achievement posits AdvXL as a pioneering approach, charting a new trajectory for the efficient training of robust visual representations at significantly larger scales. Our code is available at https://github.com/UCSC-VLAA/AdvXL.
Language-Driven Anchors for Zero-Shot Adversarial Robustness
Xiao Li · Wei Zhang · Yining Liu · Zhanhao Hu · Bo Zhang · Xiaolin Hu
Deep Neural Networks (DNNs) are known to be susceptible to adversarial attacks. Previous researches mainly focus on improving adversarial robustness in the fully supervised setting, leaving the challenging domain of zero-shot adversarial robustness an open question. In this work, we investigate this domain by leveraging the recent advances in large vision-language models, such as CLIP, to introduce zero-shot adversarial robustness to DNNs. We propose LAAT, a Language-driven, Anchor-based Adversarial Training strategy. LAAT utilizes the features of a text encoder for each category as fixed anchors (normalized feature embeddings) for each category, which are then employed for adversarial training. By leveraging the semantic consistency of the text encoders, LAAT aims to enhance the adversarial robustness of the image model on novel categories. However, naively using text encoders leads to poor results. Through analysis, we identified the issue to be the high cosine similarity between text encoders. We then design an expansion algorithm and an alignment cross-entropy loss to alleviate the problem. Our experimental results demonstrated that LAAT significantly improves zero-shot adversarial robustness over state-of-the-art methods. LAAT has the potential to enhance adversarial robustness by large-scale multimodal models, especially when labeled data is unavailable during training.
Transferable Structural Sparse Adversarial Attack Via Exact Group Sparsity Training
Di Ming · Peng Ren · Yunlong Wang · Xin Feng
Deep neural networks (DNNs) are vulnerable to highly transferable adversarial attacks. Especially, many studies have shown that sparse attacks pose a significant threat to DNNs on account of their exceptional imperceptibility. Current sparse attack methods mostly limit only the magnitude and number of perturbations while generally overlooking the location of the perturbations, resulting in decreased performances on attack transferability. A subset of studies indicates that perturbations existing in the significant regions with rich classification-relevant features are more effective. Leveraging this insight, we introduce the structural sparsity constraint in the framework of generative models to limit the perturbation positions. To ensure that the perturbations are generated towards classification-relevant regions, we propose an exact group sparsity training method to learn pixel-level and group-level sparsity. For purpose of improving the effectiveness of sparse training, we further put forward masked quantization network and multi-stage optimization algorithm in the training process. Utilizing CNNs as surrogate models, extensive experiments demonstrate that our method has higher transferability in image classification attack compared to state-of-the-art methods at approximately same sparsity levels. In cross-model ViT, object detection, and semantic segmentation attack tasks, we also achieve a better attack success rate. Code is available at https://github.com/MisterRpeng/EGS-TSSA.
Fooling Polarization-Based Vision using Locally Controllable Polarizing Projection
Zhuoxiao Li · Zhihang Zhong · Shohei Nobuhara · Ko Nishino · Yinqiang Zheng
Polarization is a fundamental property of light that encodes abundant information regarding surface shape, material, illumination and viewing geometry. The computer vision community has witnessed a blossom of polarization-based vision applications, such as reflection removal, shape-from-polarization (SfP), transparent object segmentation and color constancy, partially due to the emergence of single-chip mono/color polarization sensors that make polarization data acquisition easier than ever. However, is polarization-based vision vulnerable to adversarial attacks? If so, is that possible to realize these adversarial attacks in the physical world, without being perceived by human eyes? In this paper, we warn the community of the vulnerability of polarization-based vision, which can be more serious than RGB-based vision. By adapting a commercial LCD projector, we achieve locally controllable polarizing projection, which is successfully utilized to fool state-of-the-art polarization-based vision algorithms for glass segmentation and SfP. Compared with existing physical attacks on RGB-based vision, which always suffer from the trade-off between attack efficacy and eye conceivability, the adversarial attackers based on polarizing projection are contact-free and visually imperceptible, since naked human eyes can rarely perceive the difference of viciously manipulated polarizing light and ordinary illumination. This poses unprecedented risks on polarization-based vision, for which due attentions should be paid and counter measures be considered.
Overload: Latency Attacks on Object Detection for Edge Devices
Erh-Chung Chen · Pin-Yu Chen · I-Hsin Chung · Che-Rung Lee
Nowadays, the deployment of deep learning based applications is an essential task owing to the increasing demands on intelligent services. In this paper, we investigate latency attacks on deep learning applications. Unlike common adversarial attacks for misclassification, the goal of latency attacks is to increase the inference time, which may stop applications from responding to the requests within a reasonable time. This kind of attack is ubiquitous for various applications, and we use object detection to demonstrate how such kind of attacks work. We also design a framework named Overload to generate latency attacks at scale. Our method is based on a newly formulated optimization problem and a novel technique, called spatial attention. This attack serves to escalate the required computing costs during the inference time, consequently leading to an extended inference time for object detection. It presents a significant threat, especially to systems with limited computing resources. We have conducted experiments using YOLOv5 models on Nvidia NX. Compared to existing methods, our attacking method is simpler and more effective. The experimental results show that with latency attacks, the inference time of a single image can be increased ten times longer in reference to the normal setting. Moreover, our findings pose a potential new threat to all object detection tasks requiring non-maximum suppression (NMS), as our attack is NMS-agnostic.
Attack To Defend: Exploiting Adversarial Attacks for Detecting Poisoned Models
Samar Fares · Karthik Nandakumar
Poisoning (trojan/backdoor) attacks enable an adversary to train and deploy a corrupted machine learning (ML) model, which typically works well and achieves good accuracy on clean input samples but behaves maliciously on poisoned samples containing specific trigger patterns. Using such poisoned ML models as the foundation to build real-world systems can compromise application safety. Hence, there is a critical need for algorithms that detect whether a given target model has been poisoned. This work proposes a novel approach for detecting poisoned models called Attack To Defend (A2D), which is based on the observation that poisoned models are more sensitive to adversarial perturbations compared to benign models. We propose a metric called sensitivity to adversarial perturbations (SAP) to measure the sensitivity of a ML model to adversarial attacks at a specific perturbation bound. We then generate strong adversarial attacks against an unrelated reference model and estimate the SAP value of the target model by transferring the generated attacks. The target model is deemed to be a $trojan$ if its SAP value exceeds a decision threshold. The A2D framework requires only black-box access to the target model and a small clean set, while being computationally efficient. The A2D approach has been evaluated on four standard image datasets and its effectiveness under various types of poisoning attacks has been demonstrated.
Towards Understanding and Improving Adversarial Robustness of Vision Transformers
Samyak Jain · Tanima Dutta
Recent literature has demonstrated that vision transformers (VITs) exhibit superior performance compared to convolutional neural networks (CNNs). The majority of recent research on adversarial robustness, however, has predominantly focused on CNNs. In this work, we bridge this gap by analyzing the effectiveness of existing attacks on VITs. We demonstrate that due to the softmax computations in every attention block in VITs, they are inherently vulnerable to floating point underflow errors. This can lead to a gradient masking effect resulting in suboptimal attack strength of well-known attacks, like PGD, Carlini and Wagner (CW) and GAMA attacks. Motivated by this, we propose Adaptive Attention Scaling (AAS) attack that can automatically find the optimal scaling factors of pre-softmax outputs using gradient-based optimization. We show that the proposed simple strategy can be incorporated with any existing adversarial attacks as well as adversarial training methods and achieved improved performance. On VIT-B16, we demonstrate an improved attack strength of upto 2.2% on CIFAR10 and upto 2.9% on CIFAR100 by incorporating the proposed AAS attack with state-of-the-art single attack methods like GAMA attack. Further, we utilise the proposed AAS attack for every few epochs in existing adversarial training methods, which is termed as Adaptive Attention Scaling Adversarial Training (AAS-AT). On incorporating AAS-AT with existing methods, we outperform them on VITs over 1.3-3.5% on CIFAR10. We observe improved performance on ImageNet-100 as well.
Towards Fairness-Aware Adversarial Learning
Yanghao Zhang · Tianle Zhang · Ronghui Mu · Xiaowei Huang · Wenjie Ruan
Although adversarial training (AT) has proven effective in enhancing the model's robustness, the recently revealed issue of fairness in robustness has not been well addressed, i.e. the robust accuracy varies significantly among different categories. In this paper, instead of uniformly evaluating the model's average class performance, we delve into the issue of robust fairness, by considering the worst-case distribution across various classes. We propose a novel learning paradigm, named Fairness-Aware Adversarial Learning (FAAL). As a generalization of conventional AT, we re-define the problem of adversarial training as a min-max-max framework, to ensure both robustness and fairness of the trained model. Specifically, by taking advantage of distributional robust optimization, our method aims to find the worst distribution among different categories, and the solution is guaranteed to obtain the upper bound performance with high probability. In particular, FAAL can fine-tune an unfair robust model to be fair within only two epochs, without compromising the overall clean and robust accuracies. Extensive experiments on various image datasets validate the superior performance and efficiency of the proposed FAAL compared to other state-of-the-art methods.
Byzantine-robust Decentralized Federated Learning via Dual-domain Clustering and Trust Bootstrapping
Peng Sun · Xinyang Liu · Zhibo Wang · Bo Liu
Decentralized federated learning (DFL) facilitates collaborative model training across multiple connected clients without a central coordination server, thereby avoiding the single point of failure in traditional centralized federated learning (CFL). However, DFL exhibits heightened susceptibility to Byzantine attacks owing to the lack of a responsible central server. Furthermore, a benign client in DFL may be dominated by Byzantine clients (more than half of its neighbors are malicious), posing significant challenges for robust model training. In this work, we propose DFL-Dual, a novel Byzantine-robust DFL method through dual-domain client clustering and trust bootstrapping. Specifically, we first propose to leverage both data-domain and model-domain distance metrics to identify client discrepancies. Then, we design a trust evaluation mechanism centered on benign clients, which enables them to evaluate their neighbors. Building upon the dual-domain distance metric and trust evaluation mechanism, we further develop a two-stage clustering and trust bootstrapping technique to exclude Byzantine clients from local model aggregation. We extensively evaluate the proposed DFL-Dual method through rigorous experimentation, demonstrating its remarkable performance superiority over existing robust CFL and DFL schemes.
Towards General Robustness Verification of MaxPool-based Convolutional Neural Networks via Tightening Linear Approximation
Yuan Xiao · Shiqing Ma · Juan Zhai · Chunrong Fang · Jinyuan Jia · Zhenyu Chen
The robustness of convolutional neural networks (CNNs) is vital to modern AI-driven systems. It can be quantified by formal verification by providing a certified lower bound, within which any perturbation does not alter the original input's classification result. It is challenging due to nonlinear components, such as MaxPool. At present, efficient and scalable verification methods are sound but incomplete, and thus, a certified lower bound is a crucial criterion for evaluating the performance of verification tools. In this paper, we present MaxLin, a robustness verifier for maxpool-based CNNs with tight linear approximation. By tightening the linear approximation of the MaxPool function, we can certify larger certified lower bounds of CNNs. We evaluate MaxLin with open-sourced benchmarks, including LeNet and networks trained on the MNIST, CIFAR-10, and Tiny ImageNet datasets. The results show that MaxLin outperforms state-of-the-art tools with up to 110.60% improvement regarding the certified lower bound and 5.13x speedup for the same neural networks.
Soften to Defend: Towards Adversarial Robustness via Self-Guided Label Refinement
Daiwei Yu · Zhuorong Li · Lina Wei · Canghong Jin · Yun Zhang · Sixian Chan
Adversarial training (AT) is currently one of the most effective ways to obtain the robustness of deep neural networks against adversarial attacks. However, most AT methods suffer from robust overfitting, i.e., a significant generalization gap in adversarial robustness between the training and testing curves. In this paper, we first identify a connection between robust overfitting and the excessive memorization of noisy labels in AT from a view of gradient norm. As such label noise is mainly caused by a distribution mismatch and improper label assignments, we are motivated to propose a label refinement approach for AT. Specifically, our Self-Guided Label Refinement first self-refines a more accurate and informative label distribution from over-confident hard labels, and then it calibrates the training by dynamically incorporating knowledge from self-distilled models into the current model and thus requiring no external teachers. Empirical results demonstrate that our method can simultaneously boost the standard accuracy and robust performance across multiple benchmark datasets, attack types, and architectures. In addition, we also provide a set of analyses from the perspectives of information theory to dive into our method and suggest the importance of soft labels for robust generalization.
SlowFormer: Adversarial Attack on Compute and Energy Consumption of Efficient Vision Transformers
Navaneet K L · Soroush Abbasi Koohpayegani · Essam Sleiman · Hamed Pirsiavash
Recently, there has been a lot of progress in reducing the computation of deep models at inference time. These methods can reduce both the computational needs and power usage of deep models. Some of these approaches adaptively scale the compute based on the input instance. We show that such models can be vulnerable to a universal adversarial patch attack, where the attacker optimizes for a patch that when pasted on any image, can increase the compute and power consumption of the model. We run experiments with three different efficient vision transformer methods showing that in some cases, the attacker can increase the computation to the maximum possible level by simply pasting a patch that occupies only 8\% of the image area. We also show that a standard adversarial training defense method can reduce some of the attack's success. We believe adaptive efficient methods will be necessary for the future to lower the power usage of expensive deep models, so we hope our paper encourages the community to study the robustness of these methods and develop better defense methods for the proposed attack.
LOTUS: Evasive and Resilient Backdoor Attacks through Sub-Partitioning
Siyuan Cheng · Guanhong Tao · Yingqi Liu · Guangyu Shen · Shengwei An · Shiwei Feng · Xiangzhe Xu · Kaiyuan Zhang · Shiqing Ma · Xiangyu Zhang
Backdoor attack poses a significant security threat to Deep Learning applications. Existing attacks are often not evasive to established backdoor detection techniques. This susceptibility primarily stems from the fact that these attacks typically leverage a universal trigger pattern or transformation function, such that the trigger can cause misclassification for any input. In response to this, recent papers have introduced attacks using sample-specific invisible triggers crafted through special transformation functions. While these approaches manage to evade detection to some extent, they reveal vulnerability to existing backdoor mitigation techniques. To address and enhance both evasiveness and resilience, we introduce a novel backdoor attack LOTUS. Specifically, it leverages a secret function to separate samples in the victim class into a set of partitions and applies unique triggers to different partitions. Furthermore, LOTUS incorporates an effective trigger focusing mechanism, ensuring only the trigger corresponding to the partition can induce the backdoor behavior. Extensive experimental results show that LOTUS can achieve high attack success rate across 4 datasets and 7 model structures, and effectively evading 13 backdoor detection and mitigation techniques.
Deep-TROJ: An Inference Stage Trojan Insertion Algorithm through Efficient Weight Replacement Attack
Sabbir Ahmed · RANYANG ZHOU · Shaahin Angizi · Adnan Rakin Rakin
To insert Trojan into a Deep Neural Network (DNN), the existing attack assumes the attacker can access the victim's training facilities. However, a realistic threat model was recently developed by leveraging memory fault to inject Trojans at the inference stage. In this work, we develop a novel Trojan attack by adopting a unique memory fault injection technique that can inject bit-flip into the page table of the main memory. In the main memory, each weight block consists of a group of weights located at a specific address of a DRAM row. A bit-flip in the page frame number replaces a $\textbf{target}$ weight block of a DNN model with another $\textbf{replacement}$ weight block. To develop a successful Trojan attack leveraging this unique fault model, the attacker must solve three key challenges: i) how to identify a minimum set of target weight blocks to be modified? ii) how to identify the corresponding optimal replacement weight block? iii) how to optimize the trigger to maximize the attacker's objective given a target and replacement weight block set? We address them by proposing a novel Deep-TROJ attack algorithm that can identify a minimum set of vulnerable target and corresponding replacement weight blocks while optimizing the trigger at the same time. We evaluate the performance of our proposed Deep-TROJ on CIFAR-10, CIFAR-100, and ImageNet dataset for thirteen different DNN architectures, including vision transformers. Proposed Deep-TROJ is the most successful one to date that does not require access to training facilities while successfully bypassing the existing defenses.
Semantic Shield: Defending Vision-Language Models Against Backdooring and Poisoning via Fine-grained Knowledge Alignment
Alvi Md Ishmam · Chris Thomas
In recent years there has been enormous interest in vision-language models trained using self-supervised objectives. However, the use of large-scale datasets scraped from the web for training also makes these models vulnerable to potential security threats, such as backdooring and poisoning attacks. In this paper, we propose a method for mitigating such attacks on contrastively trained vision-language models. Our approach, Semantic Shield, leverages external knowledge extracted from a language model to prevent models from learning correlations between image regions which lack strong alignment with external knowledge. We do this by imposing constraints to enforce that attention paid by the model to visual regions is proportional to the alignment of those regions with external knowledge.We conduct extensive experiments using a variety of recent backdooring and poisoning attacks on multiple datasets and architectures. Our results clearly demonstrate that our proposed approach is highly effective at defending against such attacks across multiple settings, while maintaining model utility and without requiring any changes at inference time.
Initialization Matters for Adversarial Transfer Learning
Andong Hua · Jindong Gu · Zhiyu Xue · Nicholas Carlini · Eric Wong · Yao Qin
With the prevalence of the Pretraining-Finetuning paradigm in transfer learning, the robustness of downstream tasks has become a critical concern. In this work, we delve into adversarial robustness in transfer learning and reveal the critical role of initialization, including both the pretrained model and the linear head. First, we discover the necessity of an adversarially robust pretrained model. Specifically, we reveal that with a standard pretrained model, Parameter-Efficient Finetuning (PEFT) methods either fail to be adversarially robust or continue to exhibit significantly degraded adversarial robustness on downstream tasks, even with adversarial training during finetuning. Leveraging a robust pretrained model, surprisingly, we observe that a simple linear probing can outperform full finetuning and other PEFT methods with random initialization on certain datasets. We further identify that linear probing excels in preserving robustness from the robust pretraining. Based on this, we propose Robust Linear Initialization (RoLI) for adversarial finetuning, which initializes the linear head with the weights obtained by adversarial linear probing to maximally inherit the robustness from pretraining. Across five different image classification datasets, we demonstrate the effectiveness of RoLI and achieve new state-of-the-art results.
Strong Transferable Adversarial Attacks via Ensembled Asymptotically Normal Distribution Learning
Zhengwei Fang · Rui Wang · Tao Huang · Liping Jing
Strong adversarial examples are crucial for evaluating and enhancing the robustness of deep neural networks. However, the performance of popular attacks is usually sensitive, for instance, to minor image transformations, stemming from limited information — typically only one input example, a handful of white-box source models, and undefined defense strategies. Hence, the crafted adversarial examples are prone to overfit the source model, which hampers their transferability to unknown architectures. In this paper, we propose an approach named Multiple Asymptotically Normal Distribution Attacks (MultiANDA) which explicitly characterize adversarial perturbations from a learned distribution. Specifically, we approximate the posterior distribution over the perturbations by taking advantage of the asymptotic normality property of stochastic gradient ascent (SGA), then employ the deep ensemble strategy as an effective proxy for Bayesian marginalization in this process, aiming to estimate a mixture of Gaussians that facilitates a more thorough exploration of the potential optimization space. The approximated posterior essentially describes the stationary distribution of SGA iterations, which captures the geometric information around the local optimum. Thus, MultiANDA allows drawing an unlimited number of adversarial perturbations for each input and reliably maintains the transferability. Our proposed method outperforms ten state-of-the-art black-box attacks on deep learning models with or without defenses through extensive experiments on seven normally trained and seven defense models.
HDRFlow: Real-Time HDR Video Reconstruction with Large Motions
Gangwei Xu · Yujin Wang · Jinwei Gu · Tianfan Xue · Xin Yang
Reconstructing High Dynamic Range (HDR) video from image sequences captured with alternating exposures is a challenging task, especially in the presence of large camera or object motion. Existing methods typically align low dynamic range sequences using optical flow or attention mechanism for deghosting. However, they often struggle to handle large complex motions and are computationally expensive. To address these challenges, we propose a robust and efficient flow estimator tailored for real-time HDR video reconstruction, named HDRFlow. HDRFlow has three novel designs: an HDR-domain alignment loss (HALoss), an efficient flow network with a multi-size large kernel (MLK), and a new HDR flow training scheme. The HALoss supervises our flow network to learn an HDR-oriented flow for accurate alignment in saturated and dark regions. The MLK can effectively model large motions at a negligible cost. In addition, we incorporate synthetic data, Sintel, into our training dataset, utilizing both its provided forward flow and backward flow generated by us to supervise our flow network, enhancing our performance in large motion regions. Extensive experiments demonstrate that our HDRFlow outperforms previous methods on standard benchmarks. To the best of our knowledge, HDRFlow is the first real-time HDR video reconstruction method for video sequences captured with alternating exposures, capable of processing 720p resolution inputs at 25ms. We will release the source code upon the publication of the paper.
A Physics-informed Low-rank Deep Neural Network for Blind and Universal Lens Aberration Correction
Jin Gong · Runzhao Yang · Weihang Zhang · Jinli Suo · Qionghai Dai
High-end lenses, although offering high-quality images, suffer from both insufficient affordability and bulky design, which hamper their applications in low-budget scenarios or on low-payload platforms. A flexible scheme is to tackle the optical aberration of low-end lenses computationally. However, it is highly demanded but quite challenging to build a general model capable of handling non-stationary aberrations and covering diverse lenses, especially in a blind manner. To address this issue, we propose a universal solution by extensively utilizing the physical properties of camera lenses: (i) reducing the complexity of lens aberrations, i.e., lens-specific non-stationary blur, by warping annual-ring-shaped sub-images into rectangular stripes to transform non-uniform degenerations into a uniform one, (ii) building a low-dimensional non-negative orthogonal representation of lens blur kernels to cover diverse lenses; (iii) designing a decoupling network to decompose the input low-quality image into several components degenerated by above kernel bases, and applying corresponding pre-trained deconvolution networks to reverse the degeneration. Benefiting from the proper incorporation of lenses' physical properties and unique network design, the proposed method achieves superb imaging quality, wide applicability for various lenses, high running efficiency, and is totally free of kernel calibration. These advantages bring great potential for scenarios requiring lightweight high-quality photography.
Super-Resolution Reconstruction from Bayer-Pattern Spike Streams
Yanchen Dong · Ruiqin Xiong · Jian Zhang · Zhaofei Yu · Xiaopeng Fan · Shuyuan Zhu · Tiejun Huang
Spike camera is a neuromorphic vision sensor that can capture highly dynamic scenes by generating a continuous stream of binary spikes to represent the arrival of photons at very high temporal resolution. Equipped with Bayer color filter array (CFA), color spike camera (CSC) has been invented to capture color information. Although spike camera has already demonstrated great potential for high-speed imaging, its spatial resolution is limited compared with conventional digital cameras. This paper proposes a Color Spike Camera Super-Resolution (CSCSR) network to super-resolve higher-resolution color images from spike camera streams with Bayer CFA. To be specific, we first propose a representation for Bayer-pattern spike streams, exploring local temporal information with global perception to represent the binary data. Then we exploit the CFA layout and sub-pixel level motion to collect temporal pixels for the spatial super-resolution of each color channel. In particular, a residual-based module for feature refinement is developed to reduce the impact of motion estimation errors. Considering color correlation, we jointly utilize the multi-stage temporal-pixel features of color channels to reconstruct the high-resolution color image. Experimental results demonstrate that the proposed scheme can reconstruct satisfactory color images with both high temporal and spatial resolution from low-resolution Bayer-pattern spike streams. All the codes and datasets will be publicly available.
In2SET: Intra-Inter Similarity Exploiting Transformer for Dual-Camera Compressive Hyperspectral Imaging
Xin Wang · Lizhi Wang · Xiangtian Ma · Maoqing Zhang · Lin Zhu · Hua Huang
Dual-camera compressive hyperspectral imaging (DCCHI) offers the capability to reconstruct 3D hyperspectral image (HSI) by fusing compressive and panchromatic (PAN) image, which has shown great potential for snapshot hyperspectral imaging in practice. In this paper, we introduce a novel DCCHI reconstruction network, intra-inter similarity exploiting Transformer (In2SET). Our key insight is to make full use of the PAN image to assist the reconstruction. To this end, we propose to use the intra-similarity within the PAN image as a proxy for approximating the intra-similarity in the original HSI, thereby offering an enhanced content prior for more accurate HSI reconstruction. Furthermore, we propose to use the inter-similarity to align the features between HSI and PAN images, thereby maintaining semantic consistency between the two modalities during the reconstruction process. By integrating In2SET into a PAN-guided deep unrolling (PGDU) framework, our method substantially enhances the spatial-spectral fidelity and detail of the reconstructed images, providing a more comprehensive and accurate depiction of the scene. Experiments conducted on both real and simulated datasets demonstrate that our approach consistently outperforms existing state-of-the-art methods in terms of reconstruction quality and computational complexity. The code is available at https://github.com/2JONAS/In2SET.
SuperSVG: Superpixel-based Scalable Vector Graphics Synthesis
Teng Hu · Ran Yi · Baihong Qian · Jiangning Zhang · Paul L. Rosin · Yu-Kun Lai
SVG (Scalable Vector Graphics) is a widely used graphics format that possesses excellent scalability and editability. Image vectorization that aims to convert raster images to SVGs, is an important yet challenging problem in computer vision and graphics. Existing image vectorization methods either suffer from low reconstruction accuracy for complex images or require long computation time. To address this issue, we propose SuperSVG, a superpixel-based vectorization model that achieves fast and high-precision image vectorization. Specifically, we decompose the input image into superpixels to help the model focus on areas with similar colors and textures. Then, we propose a two-stage self-training framework, where a coarse-stage model is employed to reconstruct the main structure and a refinement-stage model is used for enriching the details. Moreover, we propose a novel dynamic path warping loss to help the refinement-stage model to inherit knowledge from the coarse-stage model. Extensive qualitative and quantitative experiments demonstrate the superior performance of our method in terms of reconstruction accuracy and inference time compared to state-of-the-art approaches.
Language-driven All-in-one Adverse Weather Removal
Hao Yang · Liyuan Pan · Yan Yang · Wei Liang
All-in-one (AiO) frameworks restore various adverse weather degradations with a single set of networks jointly. To handle various weather conditions, an AiO framework is expected to adaptively learn weather-specific knowledge for different degradations and shared knowledge for common patterns. However, existing method: 1) rely on extra supervision signals, which are usually unknown in real-world applications; 2) employ fixed network structures, which restrict the diversity of weather-specific knowledge. In this paper, we propose a Language-driven Restoration framework (LDR) to alleviate the aforementioned issues. First, we leverage the power of pre-trained vision-language (PVL) models to enrich the diversity of weather-specific knowledge by reasoning about the occurrence, type, and severity of degradation, generating description-based degradation priors. Then, with the guidance of degradation prior, we sparsely select restoration experts from a candidate list dynamically based on a Mixture-of-Experts (MoE) structure. This enables us to adaptively learn the weather-specific and shared knowledge to handle various weather conditions (e.g., unknown or mixed weather). Experiments on extensive restoration scenarios show our superior performance (see Fig. 1). The source code will be made available.
LDP: Language-driven Dual-Pixel Image Defocus Deblurring Network
Hao Yang · Liyuan Pan · Yan Yang · Richard Hartley · Miaomiao Liu
Recovering sharp images from dual-pixel (DP) pairs with disparity-dependent blur is a challenging task. Existing blur map-based deblurring methods have demonstrated promising results. In this paper, we propose, to the best of our knowledge, the first framework to introduce the contrastive language-image pre-training framework (CLIP) to achieve accurate blur map estimation from DP pairs unsupervisedly. To this end, we first carefully design text prompts to enable CLIP to understand blur-related geometric prior knowledge from the DP pair. Then, we propose a format to input stereo DP pair to the CLIP without any fine-tuning, where the CLIP is pre-trained on monocular images. Given the estimated blur map, we introduce a blur-prior attention block, a blur-weighting loss and a blur-aware loss to recover the all-in-focus image. Our method achieves state-of-the-art performance in extensive experiments (see Fig. 1).
Language-guided Image Reflection Separation
Haofeng Zhong · Yuchen Hong · Shuchen Weng · Jinxiu Liang · Boxin Shi
This paper studies the problem of language-guided reflection separation, which aims at addressing the ill-posed reflection separation problem by introducing language descriptions to provide layer content. We propose a unified framework to solve this problem, which leverages the cross-attention mechanism with contrastive learning strategies to construct the correspondence between language descriptions and image layers. A gated network design and a randomized training strategy are employed to tackle the recognizable layer ambiguity. The effectiveness of the proposed method is validated by the significant performance advantage over existing reflection separation methods on both quantitative and qualitative comparisons.
Time-Efficient Light-Field Acquisition Using Coded Aperture and Events
Shuji Habuchi · Keita Takahashi · Chihiro Tsutake · Toshiaki Fujii · Hajime Nagahara
We propose a computational imaging method for time-efficient light-field acquisition that combines a coded aperture with an event-based camera. Different from the conventional coded-aperture imaging method, our method applies a sequence of coding patterns during a single exposure for an image frame. The parallax information, which is related to the differences in coding patterns, is recorded as events. The image frame and events, all of which are measured in a single exposure, are jointly used to computationally reconstruct a light field. We also designed an algorithm pipeline for our method that is end-to-end trainable on the basis of deep optics and compatible with real camera hardware. We experimentally showed that our method can achieve more accurate reconstruction than several other imaging methods with a single exposure. We also developed a hardware prototype with the potential to complete the measurement on the camera within 22 msec and demonstrated that light fields from real 3-D scenes can be obtained with convincing visual quality. Our software and supplementary video are available from our project website.
NB-GTR: Narrow-Band Guided Turbulence Removal
Yifei Xia · Chu Zhou · Chengxuan Zhu · Minggui Teng · Chao Xu · Boxin Shi
The removal of atmospheric turbulence is crucial for long-distance imaging. Leveraging the stochastic nature of atmospheric turbulence, numerous algorithms have been developed that employ multi-frame input to mitigate the turbulence. However, when limited to a single frame, existing algorithms face substantial performance drops, particularly in diverse real-world scenes. In this paper, we propose a robust solution to turbulence removal from an RGB image under the guidance of an additional narrow-band image, broadening the applicability of turbulence mitigation techniques in real-world imaging scenarios. Our approach exhibits a substantial suppression in the magnitude of turbulence artifacts by using only a pair of images, thereby enhancing the clarity and fidelity of the captured scene.
Complementing Event Streams and RGB Frames for Hand Mesh Reconstruction
Jianping Jiang · xinyu zhou · Bingxuan Wang · Xiaoming Deng · Chao Xu · Boxin Shi
Reliable hand mesh reconstruction (HMR) from commonly-used color and depth sensors is challenging especially under scenarios with varied illuminations and fast motions. Event camera is a highly promising alternative for its high dynamic range and dense temporal resolution properties, but it lacks key texture appearance for hand mesh reconstruction. In this paper, we propose EvRGBHand -- the first approach for 3D hand mesh reconstruction with an event camera and an RGB camera compensating for each other. By fusing two modalities of data across time, space, and information dimensions,EvRGBHand can tackle overexposure and motion blur issues in RGB-based HMR and foreground scarcity and background overflow issues in event-based HMR. We further propose EvRGBDegrader, which allows our model to generalize effectively in challenging scenes, even when trained solely on standard scenes, thus reducing data acquisition costs. Experiments on real-world data demonstrate that EvRGBHand can effectively solve the challenging issues when using either type of camera alone via retaining the merits of both, and shows the potential of generalization to outdoor scenes and another type of event camera. Our code, models, and dataset will be made public after acceptance.
Boosting Spike Camera Image Reconstruction from a Perspective of Dealing with Spike Fluctuations
Rui Zhao · Ruiqin Xiong · Jing Zhao · Jian Zhang · Xiaopeng Fan · Zhaofei Yu · Tiejun Huang
As a bio-inspired vision sensor with ultra-high speed, spike cameras exhibit great potential in recording dynamic scenes with high-speed motion or drastic light changes. Different from traditional cameras, each pixel in spike cameras records the arrival of photons continuously by firing binary spikes at an ultra-fine temporal granularity. In this process, multiple factors impact the imaging, including the photons' Poisson arrival, thermal noises from circuits, and quantization effects in spike readout. These factors introduce fluctuations to spikes, making the recorded spike intervals unstable and unable to reflect accurate light intensities. In this paper, we present an approach to deal with spike fluctuations and boost spike camera image reconstruction. We first analyze the quantization effects and reveal the unbiased estimation attribute of the reciprocal of differential of spike firing time (DSFT). Based on this, we propose a spike representation module to use DSFT with multiple orders for fluctuation suppression, where DSFT with higher orders indicates spike integration duration between multiple spikes. We also propose a module for inter-moment feature alignment at multiple granularities. The coarser alignment is based on patch-level cross-attention with a local search strategy, and the finer alignment is based on deformable convolution at the pixel level. Experimental results demonstrate the effectiveness of our method on both synthetic and real-captured data. The source code and dataset are available at https://github.com/ruizhao26/BSF.
Frequency-aware Event-based Video Deblurring for Real-World Motion Blur
Taewoo Kim · Hoonhee Cho · Kuk-Jin Yoon
Video deblurring aims to restore sharp frames from blurred video clips. Despite notable progress in video deblurring works, it is still a challenging problem because of the loss of motion information during the duration of the exposure time. Since event cameras can capture clear motion information asynchronously with high temporal resolution, several works exploit the event camera for deblurring as they can provide abundant motion information. However, despite these approaches, there were few cases of actively exploiting the long-range temporal dependency of videos. To tackle these deficiencies, we present an event-based video deblurring framework by actively utilizing temporal information from videos. To be specific, we first introduce a frequency-based cross-modal feature enhancement module. Second, we propose event-guided video alignment modules by considering the valuable characteristics of the event and videos. In addition, we designed a hybrid camera system to collect the first real-world event-based video deblurring dataset. For the first time, we build a dataset containing synchronized high-resolution real-world blurred videos and corresponding sharp videos and event streams. Experimental results validate that our frameworks significantly outperform the state-of-the-art frame-based and event-based video deblurring works in the various datasets.
Latency Correction for Event-guided Deblurring and Frame Interpolation
Yixin Yang · Jinxiu Liang · Bohan Yu · Yan Chen · Jimmy S. Ren · Boxin Shi
Event cameras, with their high temporal resolution, dynamic range, and low power consumption, are particularly good at time-sensitive applications like deblurring and frame interpolation. However, their performance is hindered by latency variability, especially under low-light conditions and with fast-moving objects. This paper addresses the challenge of latency in event cameras — the temporal discrepancy between the actual occurrence of changes in the corresponding timestamp assigned by the sensor. Focusing on event-guided deblurring and frame interpolation tasks, we propose a latency correction method based on a parameterized latency model. To enable data-driven learning, we develop an event-based temporal fidelity to describe the sharpness of latent images reconstructed from events and the corresponding blurry images, and reformulate the event-based double integral model differentiable to latency. The proposed method is validated using synthetic and real-world datasets, demonstrating the benefits of latency correction for deblurring and interpolation across different lighting conditions.
Learning to Remove Wrinkled Transparent Film with Polarized Prior
Jiaqi Tang · RUIZHENG WU · Xiaogang Xu · Sixing Hu · Ying-Cong Chen
In this paper, we study a new problem, Film Removal (FR), which attempts to remove the interference of wrinkled transparent films and reconstruct the original information under films for industrial recognition systems. We first physically model the imaging of industrial materials covered by the film. Considering the specular highlight from the film can be effectively recorded by the polarized camera, we build a practical dataset with polarization information containing paired data with and without transparent film. We aim to remove interference from the film (specular highlights and other degradations) with an end-to-end framework. To locate the specular highlight, we use an angle estimation network to optimize the polarization angle with the minimized specular highlight. The image with minimized specular highlight is set as a prior for supporting the reconstruction network. Based on the prior and the polarized images, the reconstruction network can decouple all degradations from the film. Extensive experiments show that our framework achieves SOTA performance in both image reconstruction and industrial downstream tasks. Our code will be released at \url{https://github.com/jqtangust/FilmRemoval}.
Dispersed Structured Light for Hyperspectral 3D Imaging
Suhyun Shin · Seokjun Choi · Felix Heide · Seung-Hwan Baek
Hyperspectral 3D imaging aims to acquire both depth and spectral information of a scene. However, existing methods are either prohibitively expensive and bulky or compromise on spectral and depth accuracy. In this work, we present Dispersed Structured Light (DSL), a cost-effective and compact method for accurate hyperspectral 3D imaging. DSL modifies a traditional projector-camera system by placing a sub-millimeter thick diffraction grating film front of the projector.The grating disperses structured light based on light wavelength. To utilize the dispersed structured light, we devise a model for dispersive projection image formation and a per-pixel hyperspectral 3D reconstruction method. We validate DSL by instantiating a compact experimental prototype.DSL achieves spectral accuracy of 18.8nm full-width half-maximum (FWHM) and depth error of 1mm. We demonstrate that DSL outperforms prior work on practical hyperspectral 3D imaging.DSL promises accurate and practical hyperspectral 3D imaging for diverse application domains, including computer vision and graphics, cultural heritage, geology, and biology.
Generalized Event Cameras
Varun Sundar · Matthew Dutson · Andrei Ardelean · Claudio Bruschini · Edoardo Charbon · Mohit Gupta
Event cameras capture the world at high time resolution and with minimal bandwidth requirements.However, event streams, which only encode changes in brightness, do not contain sufficient scene information to support a wide variety of downstream tasks.In this work, we design generalized event cameras that inherently preserve scene intensity in a bandwidth-efficient manner.We generalize event cameras in terms of when an event is generated and what information is transmitted.To implement our designs, we turn to single-photon sensors that provide digital access to individual photon detections; this modality gives us the flexibility to realize a rich space of generalized event cameras.Our single-photon event cameras are capable of high-speed, high-fidelity imaging at low readout rates.Consequently, these event cameras can support plug-and-play downstream inference, without capturing new event datasets or designing specialized event-vision models.As a practical implication, our designs, which involve lightweight and near-sensor-compatible computations, provide a way to use single-photon sensors without exorbitant bandwidth costs.
Intensity-Robust Autofocus for Spike Camera
Changqing Su · Zhiyuan Ye · Yongsheng Xiao · You Zhou · Zhen Cheng · Bo Xiong · Zhaofei Yu · Tiejun Huang
Spike cameras, a novel neuromorphic visual sensor, can capture full-time spatial information through spike stream, offering ultra-high temporal resolution and an extensive dynamic range. Autofocus control (AC) plays a pivotal role in a camera to efficiently capture information in challenging real-world scenarios. Nevertheless, due to disparities in data modality and information characteristics compared to frame stream and event stream, the current lack of efficient AC methods has made it challenging for spike cameras to adapt to intricate real-world conditions. To address this challenge, we introduce a spike-based autofocus framework that includes a spike-specific focus measure called spike dispersion (SD), which effectively mitigates the influence of variations in scene light intensity during the focusing process by leveraging the spike camera's ability to record full-time spatial light intensity. Additionally, the framework integrates a fast search strategy called spike-based golden fast search (SGFS), allowing rapid focal positioning without the need for a complete focus range traversal. To validate the performance of our method, we have collected a spike-based autofocus dataset (SAD) containing synthetic data and real-world data under varying scene brightness and motion scenarios. Experimental results on these datasets demonstrate that our method offers state-of-the-art accuracy and efficiency. Furthermore, experiments with data captured under varying scene brightness levels illustrate the robustness of our method to changes in light intensity during the focusing process.
Selective Nonlinearities Removal from Digital Signals
Krzysztof Maliszewski · Magdalena Urbanska · Varvara Vetrova · Sylwia Kolenderska
Many instruments performing optical and non-optical imaging and sensing, such as Optical Coherence Tomography (OCT), Magnetic Resonance Imaging or Fourier-transform spectrometry, produce digital signals containing modulations, sine-like components, which only after Fourier transformation give information about the structure or characteristics of the investigated object. Due to the fundamental physics-related limitations of such methods, the distribution of these signal components is often nonlinear and, when not properly compensated, leads to the resolution, precision or quality drop in the final image. Here, we propose an innovative approach that has the potential to allow cleaning of the signal from the nonlinearities but most of all, it now allows to switch the given order off, leaving all others intact. The latter provides a tool for more in-depth analysis of the nonlinearity-inducing properties of the investigated object, which can lead to applications in early disease detection or more sensitive sensing of chemical compounds. We consider OCT signals and nonlinearities up to the third order. In our approach, we propose two neural networks: one to remove solely the second-order nonlinearity and the other for removing solely the third-order nonlinearity. The input of the networks is a novel two-dimensional data structure with all the information needed for the network to infer a nonlinearity-free signal. We describe the developed networks and present the results for second-order and third-order nonlinearity removal in OCT data representing the images of various objects: a mirror, glass, and fruits.
Close Imitation of Expert Retouching for Black-and-White Photography
Seunghyun Shin · Jisu Shin · Jihwan Bae · Inwook Shim · Hae-Gon Jeon
Since the widespread availability of cameras, black-and-white (BW) photography has been a popular choice for artistic and aesthetic expression. It highlights the main subject in varying tones of gray, creating various effects such as drama and contrast. However, producing BW photography often demands high-end cameras or photographic editing from experts. Even the experts have their own preferred styles, and may also favor different styles depending on the subject when taking gray-scale photos or converting color images to BW. It is thus questionable which approach is better. To imitate the artistic values of decolorized images, this paper introduces a deep metric learning framework with a novel subject-style specified proxy and a large-scale BW dataset. Our proxy-based decolorization utilizes a hierarchical proxy-based loss and a hierarchical bilateral grid network to mimic the experts' retouching scheme. The proxy-based loss captures both expert-discriminative and class-sharing characteristics, while the hierarchical bilateral grid network enables imitating spatially-variant retouching by considering both global and local scene contexts. Our dataset, including color and BW images edited by three experts, demonstrates the scalability of our method, which can be further enhanced by constructing additional proxies from any set of BW photos like Internet downloaded figures. Our Experiments show that our framework successfully produces visually-pleasing BW images from color ones, as evaluated by user preference with respect to artistry and aesthetics.
Spike-guided Motion Deblurring with Unknown Modal Spatiotemporal Alignment
Jiyuan Zhang · Shiyan Chen · Yajing Zheng · Zhaofei Yu · Tiejun Huang
The traditional frame-based cameras that rely on exposure windows for imaging experience motion blur in high-speed scenarios. Frame-based deblurring methods lack reliable motion cues to restore sharp images under extreme blur conditions. The spike camera is a novel neuromorphic visual sensor that outputs spike streams with ultra-high temporal resolution. It can supplement the temporal information lost in traditional cameras and guide motion deblurring. However, in real-world scenarios, aligning discrete RGB images and continuous spike streams along both temporal and spatial axes is challenging due to the complexity of calibrating their coordinates, device displacements in vibrations, and time deviations. Misalignment of pixels leads to severe degradation of deblurring. We introduce the first framework for spike-guided motion deblurring without knowing the spatiotemporal alignment between spikes and images. To address the problem, we first propose a novel three-stage network containing a basic deblurring net, a carefully designed bi-directional deformable aligning module, and a flow-based multi-scale fusion net. Experimental results demonstrate that our approach can effectively guide the image deblurring with unknown alignment, surpassing the performance of other methods. Public project page: https://github.com/Leozhangjiyuan/UaSDN.
Coherence As Texture – Passive Textureless 3D Reconstruction by Self-interference
Wei-Yu Chen · Aswin C. Sankaranarayanan · Anat Levin · Matthew O’Toole
Passive depth estimation based on stereo, defocus, or shading relies on the presence of the texture on an object to resolve its depth. Hence, recovering the depth of a textureless object---for example, a large white wall---is not just hard but perhaps even impossible.Or is it? We show that spatial coherence, a property of natural light sources, can be used to resolve the depth of a scene point even when it is textureless. Our approach relies on the idea that light scattered off a scene point is fully coherent with itself, while incoherent with others; we use this insight to design an optical setup that uses self-interference as a criterion for estimating depth. Our lab prototype is capable of resolving depths of textureless objects in sunlight as well as indoor lights.
TurboSL: Dense Accurate and Fast 3D by Neural Inverse Structured Light
Parsa Mirdehghan · Maxx Wu · Wenzheng Chen · David B. Lindell · Kiriakos Kutulakos
We show how to turn a noisy and fragile active triangulation technique—three-pattern structured light with a grayscale camera—into a fast and powerful tool for 3D capture: able to output sub-pixel accurate disparities at megapixel resolution, along with reflectance, normals, and a no-reference estimate of its own pixelwise 3D error. To achieve this, we formulate structured-light decoding as a neural inverse rendering problem. We show that despite having just three or four input images—all from the same viewpoint—this problem can be tractably solved by TurboSL, an algorithm that combines (1) a precise image formation model, (2) a signed distance field scene representation, and (3) projection pattern sequences optimized for accuracy instead of precision. We use TurboSL to reconstruct a variety of complex scenes from images captured at up to 60 fps with a camera and a common projector. Our experiments highlight TurboSL’s potential for dense and highly-accurate 3D acquisition from data captured in fractions of a second.
SPIDeRS: Structured Polarization for Invisible Depth and Reflectance Sensing
Tomoki Ichikawa · Shohei Nobuhara · Ko Nishino
Can we capture shape and reflectance in stealth? Such capability would be valuable for many application domains in vision, xR, robotics, and HCI. We introduce structured polarization for invisible depth and reflectance sensing (SPIDeRS), the first depth and reflectance sensing method using patterns of polarized light. The key idea is to modulate the angle of linear polarization (AoLP) of projected light at each pixel. The use of polarization makes it invisible and lets us recover not only depth but also directly surface normals and even reflectance. We implement SPIDeRS with a liquid crystal spatial light modulator (SLM) and a polarimetric camera. We derive a novel method for robustly extracting the projected structured polarization pattern from the polarimetric object appearance. We evaluate the effectiveness of SPIDeRS by applying it to a number of real-world objects. The results show that our method successfully reconstructs object shapes of various materials and is robust to diffuse reflection and ambient light. We also demonstrate relighting using recovered surface normals and reflectance. We believe SPIDeRS opens a new avenue of polarization use in visual sensing.
CPP-Net: Embracing Multi-Scale Feature Fusion into Deep Unfolding CP-PPA Network for Compressive Sensing
Zhen Guo · Hongping Gan
In the domain of compressive sensing (CS), deep unfolding networks (DUNs) have garnered attention for their good performance and certain degree of interpretability rooted in CS domain, achieved by marrying traditional optimization solvers with deep networks.However, current DUNs are ill-suited for the intricate task of capturing fine-grained image details, leading to perceptible distortions and blurriness in reconstructed images, particularly at low CS ratios, e.g., 0.10 and below. In this paper, we propose CPP-Net, a novel deep unfolding CS framework, inspired by the primal-dual hybrid strategy of the Chambolle and Pock Proximal Point Algorithm (CP-PPA). First, we derive three iteration submodules, $\mathbf{X}^{(k)}$, $\mathbf{V}^{(k)}$ and $\mathbf{Y}^{(k)}$, by incorporating customized deep learning modules to solve the sparse basis related proximal operator within CP-PPA. Second, we design the Dual Path Fusion Block (DPFB) to adeptly extract and fuse multi-scale feature information, enhancing sensitivity to feature information at different scales and improving detail reconstruction. Third, we introduce the Iteration Fusion Strategy (IFS) to effectively weight the fusion of outputs from diverse reconstruction stages, maximizing the utilization of feature information and mitigating the information loss during reconstruction stages. Extensive experiments demonstrate that CPP-Net effectively reduces distortion and blurriness while preserving richer image details, outperforming current state-of-the-art methods. Codes are available at https://github.com/ICSResearch/CPP-Net.
SwitchLight: Co-design of Physics-driven Architecture and Pre-training Framework for Human Portrait Relighting
Hoon Kim · Minje Jang · Wonjun Yoon · Jisoo Lee · Donghyun Na · Sanghyun Woo
We introduce a co-designed approach for human portrait relighting that combines a physics-guided architecture with a pre-training framework. Drawing on the Cook-Torrance reflectance model, we have meticulously configured the architecture design to precisely simulate light-surface interactions. Furthermore, to overcome the limitation of scarce high-quality lightstage data, we have developed a self-supervised pre-training strategy. This novel combination of accurate physical modeling and expanded training dataset establishes a new benchmark in relighting realism.
Diffeomorphic Template Registration for Atmospheric Turbulence Mitigation
Dong Lao · Congli Wang · Alex Wong · Stefano Soatto
We describe a method for recovering the irradiance underlying a collection of images corrupted by atmospheric turbulence. Since supervised data is often technically impossible to obtain, assumptions and biases have to be imposed, and we choose to model them explicitly. Rather than initializing a latent irradiance (``template'') by heuristics to estimate deformation, we select one of the images as a reference, and model the deformation in this image by the aggregation of the optical flow from it to other images, exploiting a prior imposed by Central Limit Theorem. Then with a novel flow inversion module, the model registers each image TO the template but WITHOUT the template, avoiding artifacts related to poor template initialization. To illustrate the simplicity and robustness of the method, we simply select the first frame as the reference and use the simplest optical flow to estimate the warpings, yet the improvement in registration is decisive in the final reconstruction, as we achieve state-of-the-art performance despite its simplicity. The method establishes a strong baseline that can be improved by integrating it with more sophisticated pipelines, or with domain-specific methods if so desired.
Towards HDR and HFR Video from Rolling-Mixed-Bit Spikings
Yakun Chang · Yeliduosi Xiaokaiti · Yujia Liu · Bin Fan · Zhaojun Huang · Tiejun Huang · Boxin Shi
The spiking cameras offer the benefits of high dynamic range (HDR), high temporal resolution, and low data redundancy. However, reconstructing HDR videos in high-speed conditions using single-bit spikings presents challenges due to the limited bit depth. Increasing the bit depth of the spikings is advantageous for boosting HDR performance, but the readout efficiency will be decreased, which is unfavorable for achieving a high frame rate (HFR) video. To address these challenges, we propose a readout mechanism to obtain rolling-mixed-bit (RMB) spikings, which involves interleaving multi-bit spikings within the single-bit spikings in a rolling manner, thereby combining the characteristics of high bit depth and efficient readout. Furthermore, we introduce RMB-Net for reconstructing HDR and HFR videos. RMB-Net comprises a cross-bit attention block for fusing mixed-bit spikings and a cross-time attention block for achieving temporal fusion. Extensive experiments conducted on synthetic and real-synthetic data demonstrate the superiority of our method. For instance, pure 3-bit spikings result in 3 times of data volume, whereas our method achieves comparable performance with less than 2% increase in data volume.
Progressive Divide-and-Conquer via Subsampling Decomposition for Accelerated MRI
Chong Wang · Lanqing Guo · Yufei Wang · Hao Cheng · Yi Yu · Bihan Wen
Deep unfolding networks (DUN) have emerged as a reliable iterative framework for accelerated magnetic resonance imaging (MRI) reconstruction.However, conventional DUN aims to reconstruct all the missing information within the entire null space in each iteration. Thus the reconstruction quality could be degraded due to the cumulative errors.In this work, we propose a Progressive Divide-And-Conquer (PDAC) strategy, aiming to break down the subsampling process in the actual severe degradation and thus perform reconstruction sequentially.Starting from decomposing the original maximum-a-posteriori problem of accelerated MRI, we present a rigorous derivation of the proposed PDAC framework, which could be further unfolded into an end-to-end trainable network.Specifically, each iterative stage in PDAC focuses on recovering a distinct moderate degradation according to the decomposition.Furthermore, as part of the PDAC iteration, such decomposition is adaptively learned as an auxiliary task through a degradation predictor which provides an estimation of the decomposed sampling mask.Following this prediction, the sampling mask is further integrated via a severity conditioning module to ensure awareness of the degradation severity at each stage.Extensive experiments demonstrate that our proposed method achieves superior performance on the publicly available fastMRI and Stanford2D FSE datasets in both single-coil and multi-coil settings.
Generative Quanta Color Imaging
Vishal Purohit · Junjie Luo · Yiheng Chi · Qi Guo · Stanley H. Chan · Qiang Qiu
The astonishing development of single-photon cameras has created an unprecedented opportunity for scientific and industrial imaging. However, the high data throughput generated by these 1-bit sensors creates a significant bottleneck for low-power applications. In this paper, we explore the possibility of generating a color image from a single binary frame of a single-photon camera. We evidently find this problem being particularly difficult to standard colorization approaches due to the substantial degree of exposure variation. The core innovation of our paper is an exposure synthesis model framed under a neural ordinary differential equation (NeuralODE) that allows us to generate a continuum of exposures from a single observation. This innovation ensures consistent exposure in binary images that colorizers take on, resulting in notably enhanced colorization. We demonstrate applications of the method in single-image and burst colorization and show superior generative performance over baselines.
UFC-Net: Unrolling Fixed-point Continuous Network for Deep Compressive Sensing
Xiaoyang Wang · Hongping Gan
Deep unfolding networks (DUNs), renowned for their interpretability and superior performance, have invigorated the realm of compressive sensing (CS). Nonetheless, existing DUNs frequently suffer from issues related to insufficient feature extraction and feature attrition during the iterative steps. In this paper, we propose Unrolling Fixed-point Continuous Network (UFC-Net), a novel deep CS framework motivated by the traditional fixed-point continuous optimization algorithm. Specifically, we introduce Convolution-guided Attention Module (CAM) to serve as a critical constituent within the reconstruction phase, encompassing tailored components such as Multi-head Attention Residual Block (MARB), Auxiliary Iterative Reconstruction Block (AIRB), etc. MARB effectively integrates multi-head attention mechanisms with convolution to reinforce feature extraction, transcending the confinement of localized attributes and facilitating the apprehension of long-range correlations. Meanwhile, AIRB introduces auxiliary variables, significantly bolstering the preservation of features within each iterative stage. Extensive experiments demonstrate that our proposed UFC-Net achieves remarkable performance both on image CS and CS-magnetic resonance imaging (CS-MRI) in contrast to state-of-the-art methods.
Batch Normalization Alleviates the Spectral Bias in Coordinate Networks
Zhicheng Cai · Hao Zhu · Qiu Shen · Xinran Wang · Xun Cao
Representing signals using coordinate networks dominates the area of inverse problems recently, and is widely applied in various scientific computing tasks. Still, there exists an issue of spectral bias in coordinate networks, limiting the capacity to learn high-frequency components. This problem is caused by the pathological distribution of the neural tangent kernel's (NTK's) eigenvalues of coordinate networks. We find that, this pathological distribution could be improved using the classical batch normalization (BN), which is a common deep learning technique but rarely used in coordinate networks. BN greatly reduces the maximum and variance of NTK's eigenvalues while slightly modifies the mean value, considering the max eigenvalue is much larger than the most, this variance change results in a shift of eigenvalues' distribution from a lower one to a higher one, therefore the spectral bias could be alleviated. This observation is substantiated by the significant improvements of applying BN-based coordinate networks to various tasks, including the image compression, computed tomography reconstruction, shape representation, magnetic resonance imaging and novel view synthesis.
EVS-assisted Joint Deblurring Rolling-Shutter Correction and Video Frame Interpolation through Sensor Inverse Modeling
Rui Jiang · Fangwen Tu · Yixuan Long · Aabhaas Vaish · Bowen Zhou · Qinyi Wang · Wei Zhang · Yuntan Fang · Luis Eduardo García Capel · Bo Mu · Tiejun Dai · Andreas Suess
Event-based Vision Sensors (EVS) gain popularity in enhancing CMOS Image Sensor (CIS) video capture. Nonidealities of EVS such as pixel or readout latency can significantly influence the quality of the enhanced images and warrant dedicated consideration in the design of fusion algorithms. A novel approach for jointly computing deblurred, rolling-shutter artifact corrected high-speed videos with frame rates up to 10000 FPS using inherently blurry rolling shutter CIS frames of 120 FPS to 150 FPS in conjunction with EVS data from a hybrid CIS-EVS sensor is presented. EVS pixel latency, readout latency and the sensor's refractory period are explicitly incorporated into the measurement model. This inverse function problem is solved on a per-pixel manner using an optimization-based framework. The interpolated images are subsequently processed by a novel refinement network. The proposed method is evaluated using simulated and measured datasets, under natural and controlled environments. Extensive experiments show reduced shadowing effect, a 4 dB increment in PSNR, and a 12% improvement in LPIPS score compared to state-of-the-art methods.
Unsupervised Deep Unrolling Networks for Phase Unwrapping
Zhile Chen · Yuhui Quan · Hui Ji
Phase unwrapping (PU) is a technique to reconstruct original phase images from their noisy wrapped counterparts, finding many applications in scientific imaging. Although supervised learning has shown promise in PU, its utility is limited in ground-truth (GT) scarce scenarios. This paper presents an unsupervised learning approach that eliminates the need for GTs during end-to-end training. Our approach leverages the insight that both the gradients and wrapped gradients of wrapped phases serve as noisy labels for GT phase gradients, along with sparse outliers induced by the wrapping operation. A recorruption-based self-reconstruction loss in the gradient domain is proposed to mitigate the adverse effects of label noise, complemented with a self-distillation loss for improved generalization. Additionally, by unfolding a variational model of PU that utilizes wrapped gradients of wrapped phases for its data-fitting term, we develop a deep unrolling network that encodes physics of phase wrapping and incorporates special treatments on outliers. In the experiments on three types of phase data, our approach outperforms existing GT-free methods and competes well against the supervised ones.
LAN: Learning to Adapt Noise for Image Denoising
Changjin Kim · Tae Hyun Kim · Sungyong Baik
Removing noise from images, a.k.a image denoising, can be a very challenging task since the type and amount of noise can greatly vary for each image due to many factors including a camera model and capturing environments. While there have been striking improvements in image denoising with the emergence of advanced deep learning architectures and real-world datasets, recent denoising networks struggle to maintain performance on images with noise that has not been seen during training. One typical approach to address the challenge would be to adapt a denoising network to new noise distribution. Instead, in this work, we shift our focus to adapting the input noise itself, rather than adapting a network. Thus, we keep a pretrained network frozen, and adapt an input noise to capture the finegrained deviations. As such, we propose a new denoising algorithm, dubbed Learning-to-Adapt-Noise (LAN), where a learnable noise offset is directly added to a given noisy image to bring a given input noise closer towards the noise distribution a denoising network is trained to handle. Consequently, the proposed framework exhibits performance improvement on images with unseen noise, displaying the potential of the proposed research direction.
Snapshot Lidar: Fourier Embedding of Amplitude and Phase for Single-Image Depth Reconstruction
Sarah Friday · Yunzi Shi · Yaswanth Kumar Cherivirala · Vishwanath Saragadam · Adithya Pediredla
Amplitude modulated continuous-wave time-of-flight (AMCW-ToF) cameras are finding applications as flash Lidars in autonomous navigation, robotics, and AR/VR applications. A conventional CW-ToF camera requires illuminating the scene with a temporally varying light source and demodulating a set of quadrature measurements to recover the scene's depth and intensity. Capturing the four measurements in sequence renders the system slow, invariably causing inaccuracies in depth estimates due to motion in the scene or the camera. To mitigate this problem, we propose a snapshot Lidar that captures amplitude and phase simultaneously as a single time-of-flight hologram. Uniquely, our approach requires minimal changes to existing CW-ToF imaging hardware. To demonstrate the efficacy of the proposed system, we design and build a lab prototype, and evaluate it under varying scene geometries, illumination conditions, and compare the reconstructed depth measurements against conventional techniques. We rigorously evaluate the robustness of our system on diverse real-world scenes to show that our technique results in a significant reduction in data bandwidth with minimal loss in reconstruction accuracy. As high-resolution CW-ToF cameras are becoming ubiquitous, increasing their temporal resolution by four times enables robust real-time capture of geometries of dynamic scenes.
FC-GNN: Recovering Reliable and Accurate Correspondences from Interferences
Haobo Xu · Jun Zhou · Hua Yang · Renjie Pan · Cunyan Li
Finding correspondences between images is essential for many computer vision tasks and sparse matching pipelines have been popular for decades. However, matching noise within and between images, along with inconsistent keypoint detection, frequently degrades the matching performance. We review these problems and thus propose: 1) a novel and unified Filtering and Calibrating (FC) approach that jointly rejects outliers and optimizes inliers, and 2) leveraging both the matching context and the underlying image texture to remove matching uncertainties. Under the guidance of the above innovations, we construct Filtering and Calibrating Graph Neural Network (FC-GNN), which follows the FC approach to recover reliable and accurate correspondences from various interferences. FC-GNN conducts an effectively combined inference of contextual and local information through careful embedding and multiple information aggregations, predicting confidence scores and calibration offsets for the input correspondences to jointly filter out outliers and improve pixel-level matching accuracy. Moreover, we exploit the local coherence of matches to perform inference on local graphs, thereby reducing computational complexity. Overall, FC-GNN operates at lightning speed and can greatly boost the performance of diverse matching pipelines across various tasks, showcasing the immense potential of such approaches to become standard and pivotal components of image matching. Code is available at https://github.com/xuy123456/fcgnn.
Projecting Trackable Thermal Patterns for Dynamic Computer Vision
Mark Sheinin · Aswin C. Sankaranarayanan · Srinivasa G. Narasimhan
Adding artificial patterns to objects, like QR codes, can ease tasks such as object tracking, robot navigation, and conveying information (e.g., a label or a website link). However, these patterns require a physical application, and they alter the object's appearance. Conversely, projected patterns can temporarily change the object's appearance, aiding tasks like 3D scanning and retrieving object textures and shading. However, projected patterns impede dynamic tasks like object tracking because they do not `stick' to the object's surface. Or do they? This paper introduces a novel approach combining the advantages of projected and persistent physical patterns. Our system projects heat patterns using a laser beam (similar in spirit to a LIDAR), which a thermal camera observes and tracks. Such thermal patterns enable tracking poorly-textured objects whose tracking is highly challenging with standard cameras while not affecting the object's appearance or physical properties. To avail these thermal patterns in existing vision frameworks, we train a network to reverse heat diffusion's effects and remove inconsistent pattern points between different thermal frames. We prototyped and tested this approach on dynamic vision tasks like structure from motion, optical flow, and object tracking of everyday textureless objects.
PixelRNN: In-pixel Recurrent Neural Networks for End-to-end–optimized Perception with Neural Sensors
Haley So · Laurie Bose · Piotr Dudek · Gordon Wetzstein
Conventional image sensors digitize high-resolution images at fast frame rates, producing a large amount of data that needs to be transmitted off the sensor for further processing. This is challenging for perception systems operating on edge devices, because communication is power inefficient and induces latency. Fueled by innovations in stacked image sensor fabrication, emerging sensor—processors offer programmability and minimal processing capabilities directly on the sensor. We exploit these capabilities by developing an efficient recurrent neural network architecture, PixelRNN, that encodes spatio-temporal features on the sensor using purely binary operations. PixelRNN reduces the amount of data to be transmitted off the sensor by factors up to 256 compared to the raw sensor data while offering competitive accuracy for hand gesture recognition and lip reading tasks. We experimentally validate PixelRNN using a prototype implementation on the SCAMP-5 sensor--processor platform.
Image Restoration by Denoising Diffusion Models with Iteratively Preconditioned Guidance
Tomer Garber · Tom Tirer
Training deep neural networks has become a common approach for addressing image restoration problems. An alternative for training a "task-specific" network for each observation model is to use pretrained deep denoisers for imposing only the signal's prior within iterative algorithms, without additional training. Recently, a sampling-based variant of this approach has become popular with the rise of diffusion/score-based generative models. Using denoisers for general purpose restoration requires guiding the iterations to ensure agreement of the signal with the observations. In low-noise settings, guidance that is based on back-projection (BP) has been shown to be a promising strategy (used recently also under the names "pseudoinverse" or "range/null-space" guidance). However, the presence of noise in the observations hinders the gains from this approach. In this paper, we propose a novel guidance technique, based on preconditioning that allows traversing from BP-based guidance to least squares based guidance along the restoration scheme. The proposed approach is robust to noise while still having much simpler implementation than alternative methods (e.g., it does not require SVD or a large number of iterations). We use it within both an optimization scheme and a sampling-based scheme, and demonstrate its advantages over existing methods for image deblurring and super-resolution.
DART: Implicit Doppler Tomography for Radar Novel View Synthesis
Tianshu Huang · John Miller · Akarsh Prabhakara · Tao Jin · Tarana Laroia · Zico Kolter · Anthony Rowe
Simulation is an invaluable tool for radio-frequency system designers that enables rapid prototyping of various algorithms for imaging, target detection, classification, and tracking. However, simulating realistic radar scans is a challenging task that requires an accurate model of the scene, radio frequency material properties, and a corresponding radar synthesis function. Rather than specifying these models explicitly, we propose DART --- Doppler Aided Radar Tomography, a Neural Radiance Field-inspired method which uses radar-specific physics to create a reflectance and transmittance-based rendering pipeline for range-Doppler images. We then evaluate DART by constructing a custom data collection platform and collecting a novel radar dataset together with accurate position and instantaneous velocity measurements from lidar-based localization. In comparison to state-of-the-art baselines, DART synthesizes superior radar range-Doppler images from novel views across all datasets and additionally can be used to generate high quality tomographic images.
Equivariant Plug-and-Play Image Reconstruction
Matthieu Terris · Thomas Moreau · Nelly Pustelnik · Julián Tachella
Plug-and-play algorithms constitute a popular framework for solving inverse imaging problems that rely on the implicit definition of an image prior via a denoiser. These algorithms can leverage powerful pre-trained denoisers to solve a wide range of imaging tasks, circumventing the necessity to train models on a per-task basis. Unfortunately, plug-and-play methods often show unstable behaviors, hampering their promise of versatility and leading to suboptimal quality of reconstructed images. In this work, we show that enforcing equivariance to certain groups of transformations (rotations, reflections and/or translations) on the denoiser strongly improves the stability of the algorithm as well as its reconstruction quality. We provide a theoretical analysis that illustrates the role of equivariance on better performance and stability. We present a simple algorithm that enforces equivariance on any existing denoiser by simply applying a random transformation to the input of the denoiser and the inverse transformation to the output at each iteration of the algorithm. Experiments on multiple imaging modalities and denoising networks show that the equivariant plug-and-play algorithm improves both the reconstruction performance and the stability compared to their non-equivariant counterparts.
CodedEvents: Optimal Point-Spread-Function Engineering for 3D-Tracking with Event Cameras
Sachin Shah · Matthew Chan · Haoming Cai · Jingxi Chen · Sakshum Kulshrestha · Chahat Deep Singh · Yiannis Aloimonos · Christopher Metzler
Point-spread-function (PSF) engineering is a well-established computational imaging technique that uses phase masks and other optical elements to embed extra information (e.g., depth) into the images captured by conventional CMOS image sensors.To date, however, PSF-engineering has not been applied to neuromorphic event cameras; a powerful new image sensing technology that responds to changes in the log-intensity of light. This paper establishes theoretical limits (Cramér Rao bounds) on 3D point localization and tracking with PSF-engineered event cameras. Using these bounds, we first demonstrate that existing Fisher phase masks are already near-optimal for localizing static flashing point sources (e.g., blinking fluorescent molecules). We then demonstrate that existing designs are sub-optimal for tracking moving point sources and proceed to use our theory to design optimal phase masks and binary amplitude masks for this task. To overcome the non-convexity of the design problem, we leverage novel implicit neural representation based parameterizations of the phase and amplitude masks. We demonstrate the efficacy of our designs through extensive simulations. We also validate our method with a simple prototype.
WaveMo: Learning Wavefront Modulations to See Through Scattering
Mingyang Xie · Haiyun Guo · Brandon Y. Feng · Lingbo Jin · Ashok Veeraraghavan · Christopher Metzler
Imaging through scattering media is a fundamental and pervasive challenge in fields ranging from medical diagnostics to astronomy. A promising strategy to overcome this challenge is wavefront modulation, which induces measurement diversity during image acquisition. Despite its importance, designing optimal wavefront modulations to image through scattering remains under-explored. This paper introduces a novel learning-based framework to address the gap. Our approach jointly optimizes wavefront modulations and a computationally lightweight feedforward ``proxy'' reconstruction network. This network is trained to recover scenes obscured by scattering, using measurements that are modified by these modulations. The learned modulations produced by our framework generalize effectively to unseen scattering scenarios and exhibit remarkable versatility. During deployment, the learned modulations can be decoupled from the proxy network to augment other more computationally expensive restoration algorithms. Through extensive experiments, we demonstrate our approach significantly advances the state of the art in imaging through scattering media. Our project webpage is at https://wavemo-2024.github.io/.
Turb-Seg-Res: A Segment-then-Restore Pipeline for Dynamic Videos with Atmospheric Turbulence
Ripon Saha · Dehao Qin · Nianyi Li · Jinwei Ye · Suren Jayasuriya
Tackling image degradation due to atmospheric turbulence, particularly in dynamic environment, remains a challenge for long-range imaging systems. Existing techniques have been primarily designed for static scenes or scenes with small motion. This paper presents the first segment-then-restore pipeline for restoring the videos of dynamic scenes in turbulent environment. We leverage mean optical flow with an unsupervised motion segmentation method to separate dynamic and static scene components prior to restoration. After camera shake compensation and segmentation, we introduce foreground/background enhancement leveraging the statistics of turbulence strength and a transformer model trained on a novel noise-based procedural turbulence generator for fast dataset augmentation. Benchmarked against existing restoration methods, our approach restores most of the geometric distortion and enhances sharpness for videos. We make our code, simulator, and data publicly available to advance the field of video restoration from turbulence.
DiffSCI: Zero-Shot Snapshot Compressive Imaging via Iterative Spectral Diffusion Model
Zhenghao Pan · Haijin Zeng · Jiezhang Cao · Kai Zhang · Yongyong Chen
This paper endeavors to advance the precision of snapshot compressive imaging (SCI) reconstruction for multispectral image (MSI). To achieve this, we integrate the advantageous attributes of established SCI techniques and an image generative model, propose a novel structured zero-shot diffusion model, dubbed DiffSCI. DiffSCI leverages the structural insights from the deep prior and optimization-based methodologies, complemented by the generative capabilities offered by the contemporary denoising diffusion model. Specifically, firstly, we employ a pre-trained diffusion model, which has been trained on a substantial corpus of RGB images, as the generative denoiser within the Plug-and-Play framework for the first time. This integration allows for the successful completion of SCI reconstruction, especially in the case that current methods struggle to address effectively. Secondly, we systematically account for spectral band correlations and introduce a robust methodology to mitigate wavelength mismatch, thus enabling seamless adaptation of the RGB diffusion model to MSIs. Thirdly, an accelerated algorithm is implemented to expedite the resolution of the data subproblem. This augmentation not only accelerates the convergence rate but also elevates the quality of the reconstruction process. We present extensive testing to show that DiffSCI exhibits discernible performance enhancements over prevailing self-supervised and zero-shot approaches, surpassing even supervised transformer counterparts across both simulated and real datasets. Our code will be available.
Resolution Limit of Single-Photon LiDAR
Stanley H. Chan · Hashan K Weerasooriya · Weijian Zhang · Pamela Abshire · Istvan Gyongy · Robert Henderson
Single-photon Light Detection and Ranging (LIDAR) systems are often equipped with an array of detectors for improved spatial resolution and sensing speed. However, given a fixed amount of flux produced by the laser transmitter across the scene, the per-pixel Signal-to-Noise Ratio (SNR) will decrease when more pixels are packed in a unit space. This presents a fundamental trade-off between the spatial resolution of the sensor array and the SNR received at each pixel. Theoretical characterization of this fundamental limit is explored. By deriving the photon arrival statistics and introducing a series of new approximation techniques, the Mean Squared Error (MSE) of the estimated time delay of a known scene is derived. The theoretical predictions have a good match with simulations and real data.
QN-Mixer: A Quasi-Newton MLP-Mixer Model for Sparse-View CT Reconstruction
Ishak Ayad · Nicolas Larue · Mai K. Nguyen
Inverse problems span across diverse fields. In medical contexts, computed tomography (CT) plays a crucial role in reconstructing a patient's internal structure, presenting challenges due to artifacts caused by inherently ill-posed inverse problems. Previous research advanced image quality via post-processing and deep unrolling algorithms but faces challenges, such as extended convergence times with ultra-sparse data. Despite enhancements, resulting images often show significant artifacts, limiting their effectiveness for real-world diagnostic applications. We aim to explore deep second-order unrolling algorithms for solving imaging inverse problems, emphasizing their faster convergence and lower time complexity compared to common first-order methods like gradient descent. In this paper, we introduce QN-Mixer, an algorithm based on the quasi-Newton approach. We use learned parameters through the BFGS algorithm and introduce Incept-Mixer, an efficient neural architecture that serves as a non-local regularization term, capturing long-range dependencies within images. To address the computational demands typically associated with quasi-Newton algorithms that require full Hessian matrix computations, we present a memory-efficient alternative. Our approach intelligently downsamples gradient information, significantly reducing computational requirements while maintaining performance. The approach is validated through experiments on the sparse-view CT problem, involving various datasets and scanning protocols, and is compared with post-processing and deep unrolling state-of-the-art approaches. Our method outperforms existing approaches and achieves state-of-the-art performance in terms of SSIM and PSNR, all while reducing the number of unrolling iterations required.
Dual-Scale Transformer for Large-Scale Single-Pixel Imaging
Gang Qu · Ping Wang · Xin Yuan
Single-pixel imaging (SPI) is a novel computational imaging technique in recent years, which utilizes a spatial light modulator (SLM) to modulate the light distribution and a single-pixel detector (SPD) to record the total reflected/transmissive light intensity for 2- or 3-dimensional object reconstruction. SPI enjoys the advantages of low cost, wide range of detection and high sensitivity compared with conventional array detector. However, SPI takes multiple projections for spatial resolution, the imaging time and quality are linearly related to the number of detection, which largely restricts its application in real time. The introduction of deep learning has achieved significant improvement for SPI in imaging quality and speed. How to further improve the interpretability and performance of deep learning, reduce the computational workload, especially for large-scale imaging, still remain unsolved issues. In this paper, we introduce a novel 2-D modulation method for large-scale SPI. Basically, we utilize the properties of Kronecker product to decompose the large-scale sampling matrix into two much more smaller ones for the initialization of deep learning, thus further improves the training speed and reduces the usage of GPU memory. Besides, a cross-stage multi-scale deep unfolding network (DUN) with Dual-Scale attention (DSA) is proposed for SPI reconstruction. The design of cross-stage multi-scale DUN guarantees the extraction of deep features and its adequate transfer among stages. Inspired by the multi-scale Transformer, the DSA is introduced into DUN to capture multi-frequencies features for further denoising. Finally, we demonstrate the feasibility and effectiveness of our proposed method with both simulation and real experimental results.
Rolling Shutter Correction with Intermediate Distortion Flow Estimation
Mingdeng Cao · Sidi Yang · Yujiu Yang · Yinqiang Zheng
This paper proposes to correct the rolling shutter (RS) distorted images by estimating the distortion flow from the global shutter (GS) to RS directly. Existing methods usually perform correction using the undistortion flow from the RS to GS. They initially predict the flow from consecutive RS frames, subsequently rescaling it as the displacement fields from the RS frame to the underlying GS image using time-dependent scaling factors. Following this, RS-aware forward warping is employed to convert the RS image into its GS counterpart. Nevertheless, this strategy is prone to two shortcomings. First, the undistortion flow estimation is rendered inaccurate by merely linear scaling the flow, due to the complex non-linear motion nature. Second, RS-aware forward warping often results in unavoidable artifacts. To address these limitations, we introduce a new framework that directly estimates the distortion flow and rectifies the RS image with the backward warping operation. More specifically, we first propose a global correlation-based flow attention mechanism to estimate the initial distortion flow and GS feature jointly, which are then refined by the following coarse-to-fine decoder layers. Additionally, a multi-distortion flow prediction strategy is integrated to mitigate the issue of inaccurate flow estimation further. Experimental results validate the effectiveness of the proposed method, which outperforms state-of-the-art approaches on various benchmarks while maintaining high efficiency. The project is available at https://github.com/ljzycmd/DFRSC.
Passive Snapshot Coded Aperture Dual-Pixel RGB-D Imaging
Bhargav Ghanekar · Salman Siddique Khan · Pranav Sharma · Shreyas Singh · Vivek Boominathan · Kaushik Mitra · Ashok Veeraraghavan
Passive, compact, single-shot 3D sensing is useful in many application areas such as microscopy, medical imaging, surgical navigation, and autonomous driving where form factor, time, and power constraints can exist. Obtaining RGB-D scene information over a short imaging distance, in an ultra-compact form factor, and in a passive, snapshot manner is challenging. Dual-pixel (DP) sensors are a potential solution to achieve the same. DP sensors collect light rays from two different halves of the lens in two interleaved pixel arrays, thus capturing two slightly different views of the scene, like a stereo camera system. However, imaging with a DP sensor implies that the defocus blur size is directly proportional to the disparity seen between the views. This creates a trade-off between disparity estimation vs. deblurring accuracy. To improve this trade-off effect, we propose CADS (Coded Aperture Dual-Pixel Sensing), in which we use a coded aperture in the imaging lens along with a DP sensor. In our approach, we jointly learn an optimal coded pattern and the reconstruction algorithm in an end-to-end optimization setting. Our resulting CADS imaging system demonstrates improvement of $>$1.5~dB PSNR in all-in-focus (AIF) estimates and 5-6\% in depth estimation quality over naive DP sensing for a wide range of aperture settings. Furthermore, we build the proposed CADS prototypes for DSLR photography settings and in an endoscope and a dermoscope form factor. Our novel coded dual-pixel sensing approach demonstrates accurate RGB-D reconstruction results in simulations and real-world experiments in a passive, snapshot, and compact manner.
Single View Refractive Index Tomography with Neural Fields
Brandon Zhao · Aviad Levis · Liam Connor · Pratul P. Srinivasan · Katherine Bouman
Refractive Index Tomography is the inverse problem of reconstructing the continuously-varying 3D refractive index in a scene using 2D projected image measurements. Although a purely refractive field is not directly visible, it bends light rays as they travel through space, thus providing a signal for reconstruction. The effects of such fields appear in many scientific computer vision settings, ranging from refraction due to transparent cells in microscopy to the lensing of distant galaxies caused by dark matter in astrophysics. Reconstructing these fields is particularly difficult due to the complex nonlinear effects of the refractive field on observed images. Furthermore, while standard 3D reconstruction and tomography settings typically have access to observations of the scene from many viewpoints, many refractive index tomography problem settings only have access to images observed from a \emph{single} viewpoint. We introduce a method that leverages prior knowledge of light sources scattered throughout the refractive medium to help disambiguate the single-view refractive index tomography problem. We differentiably trace curved rays through a neural field representation of the refractive field, and optimize its parameters to best reproduce the observed image. We demonstrate the efficacy of our approach by reconstructing simulated refractive fields, analyze the effects of light source distribution on the recovered field, and test our method on a simulated dark matter mapping problem where we successfully recover the 3D refractive field caused by a realistic dark matter distribution.
SPECAT: SPatial-spEctral Cumulative-Attention Transformer for High-Resolution Hyperspectral Image Reconstruction
Zhiyang Yao · Shuyang Liu · Xiaoyun Yuan · Lu Fang
Compressive spectral image reconstruction is a critical method for acquiring images with high spatial and spectral resolution. Current advanced methods, which involve designing deeper networks or adding more self-attention modules, are limited by the scope of attention modules and the irrelevance of attentions across different dimensions. This leads to difficulties in capturing non-local mutation features in the spatial-spectral domain and results in a significant parameter increase but only limited performance improvement. To address these issues, we propose SPECAT, a SPatial-spEctral Cumulative-Attention Transformer designed for high-resolution hyperspectral image reconstruction. SPECAT utilizes Cumulative-Attention Blocks (CABs) within an efficient hierarchical framework to extract features from non-local spatial-spectral details. Furthermore, it employs a projection-object Dual-domain Loss Function (DLF) to integrate the optical path constraint, a physical aspect often overlooked in current methodologies. Ultimately, SPECAT not only significantly enhances the reconstruction quality of spectral details but also breaks through the bottleneck of mutual restriction between the number of parameters and the accuracy of reconstruction in existing algorithms. Our experimental results demonstrate the superiority of SPECAT, achieving 40.3 dB in hyperspectral reconstruction benchmarks, outperforming the state-of-the-art (SOTA) algorithms by 1.2 dB while using only 5% of the network parameters and 10% of the computational cost.
Task-Driven Wavelets using Constrained Empirical Risk Minimization
Eric Marcus · Ray Sheombarsing · Jan-Jakob Sonke · Jonas Teuwen
Deep Neural Networks (DNNs) are widely used for their ability to effectively approximate large classes of functions. This flexibility, however, makes the strict enforcement of constraints on DNNs a difficult problem. In contexts where it is critical to limit the function space to which certain network components belong, such as wavelets employed in Multi-Resolution Analysis (MRA), naive constraints via additional terms in the loss function are inadequate. To address this, we introduce a Convolutional Neural Network (CNN) wherein the convolutional filters are strictly constrained to be wavelets. This allows the filters to update to task-optimized wavelets during the training procedure. Our primary contribution lies in the rigorous formulation of these filters via a constrained empirical risk minimization framework, thereby providing an exact mechanism to enforce these structural constraints. While our work is grounded in theory, we investigate our approach empirically through applications in medical imaging, particularly in the task of contour prediction around various organs, achieving superior performance compared to baseline methods.
Describing Differences in Image Sets with Natural Language
Lisa Dunlap · Yuhui Zhang · Xiaohan Wang · Ruiqi Zhong · Trevor Darrell · Jacob Steinhardt · Joseph Gonzalez · Serena Yeung
How do two sets of images differ? Discerning set-level differences is crucial for understanding model behaviors and analyzing datasets, yet manually sifting through thousands of images is impractical. To aid in this discovery process, we explore the task of automatically describing the differences between two **sets** of images, which we term Set Difference Captioning. This task takes in image sets $\mathcal{D}_A$ and $\mathcal{D}_B$, and outputs a description that is more often true on $\mathcal{D}_A$ than $\mathcal{D}_B$. We outline a two-stage approach that first proposes candidate difference descriptions from image sets and then re-ranks the candidates by checking how well they can differentiate the two sets. We introduce VisDiff, which first captions the images and prompts a language model to propose candidate descriptions, then re-ranks these descriptions using CLIP. To evaluate VisDiff, we collect VisDiffBench, a dataset with 187 paired image sets with ground truth difference descriptions. We apply VisDiff to various domains, such as comparing datasets (e.g., ImageNet vs. ImageNetV2), comparing classification models (e.g., zero-shot CLIP vs. supervised ResNet), characterizing differences between generative models (e.g., StableDiffusionV1 and V2), and discovering what makes images memorable. Using VisDiff, we are able to find interesting and previously unknown differences in datasets and models, demonstrating its utility in revealing nuanced insights.
Alchemist: Parametric Control of Material Properties with Diffusion Models
Prafull Sharma · Varun Jampani · Yuanzhen Li · Xuhui Jia · Dmitry Lagun · Fredo Durand · William Freeman · Mark Matthews
We propose a method to control material attributes of objects like roughness, metallic, albedo, and transparency in real images. Our method capitalizes on the generative prior of text-to-image models known for photorealism, employing a scalar value and instructions to alter low-level material properties. Addressing the lack of datasets with controlled material attributes, we generated an object-centric synthetic dataset with physically-based materials. Fine-tuning a modified pre-trained text-to-image model on this synthetic dataset enables us to edit material properties in real-world images while preserving all other attributes. We show the potential application of our model to material edited NeRFs.
Generative Image Dynamics
Zhengqi Li · Richard Tucker · Noah Snavely · Aleksander Holynski
We present an approach to modeling an image-space prior on scene motion. Our prior is learned from a collection of motion trajectories extracted from real video sequences depicting natural, oscillatory dynamics of objects such as trees, flowers, candles, and clothes swaying in the wind. We model dense, long-term motion in the Fourier domain as spectral volumes, which we find are well-suited to prediction with diffusion models. Given a single image, our trained model uses a frequency-coordinated diffusion sampling process to predict a spectral volume, which can be converted into a motion texture that spans an entire video. Along with an image-based rendering module, the predicted motion representation can be used for a number of downstream applications, such as turning still images into seamlessly looping videos, or allowing users to realistically interact with objects in a real picture by interpreting the spectral volumes as image-space modal bases, which approximate object dynamics.
Visual Anagrams: Generating Multi-View Optical Illusions with Diffusion Models
Daniel Geng · Inbum Park · Andrew Owens
We consider the problem of synthesizing multi-view optical illusions---images that change appearance upon a transformation, such as a flip. We present a conceptually simple, zero-shot method to do so based on diffusion. For every diffusion step we estimate the noise from different views of a noisy image, combine the noise estimates, and perform a step of the reverse diffusion process. A theoretical analysis shows that this method works precisely for views that can be written as orthogonal transformations, of which permutations are a subset. This leads to the idea of a visual anagram, which includes images that change appearance upon a rotation or a flip, but also upon more exotic pixel permutations such as a jigsaw rearrangement. We provide both qualitative and quantitative results demonstrating the effectiveness and flexibility of our method.
NoiseCLR: A Contrastive Learning Approach for Unsupervised Discovery of Interpretable Directions in Diffusion Models
Yusuf Dalva · Pinar Yanardag
Generative models have been very popular in the recent years for their image generation capabilities. GAN-based models are highly regarded for their disentangled latent space, which is a key feature contributing to their success in controlled image editing. On the other hand, diffusion models have emerged as powerful tools for generating high-quality images. However, the latent space of diffusion models is not as thoroughly explored or understood. Existing methods that aim to explore the latent space of diffusion models usually relies on text prompts to pinpoint specific semantics. However, this approach may be restrictive in areas such as art, fashion, or specialized fields like medicine, where suitable text prompts might not be available or easy to conceive thus limiting the scope of existing work. In this paper, we propose an unsupervised method to discover latent semantics in text-to-image diffusion models without relying on text prompts. Our method takes a small set of unlabeled images from specific domains, such as faces or cats, and a pre-trained diffusion model, and discovers diverse semantics in unsupervised fashion using a contrastive learning objective. Moreover, the learned directions can be applied simultaneously, either within the same domain (such as various types of facial edits) or across different domains (such as applying cat and face edits within the same image) without interfering with each other. Our extensive experiments show that our method achieves highly disentangled edits, outperforming existing approaches in both diffusion-based and GAN-based latent space editing methods.
Analyzing and Improving the Training Dynamics of Diffusion Models
Tero Karras · Miika Aittala · Jaakko Lehtinen · Janne Hellsten · Timo Aila · Samuli Laine
Diffusion models currently dominate the field of data-driven image synthesis with their unparalleled scaling to large datasets. In this paper, we identify and rectify several causes for uneven and ineffective training in the popular ADM diffusion model architecture, without altering its high-level structure. Observing uncontrolled magnitude changes and imbalances in both the network activations and weights over the course of training, we redesign the network layers to preserve activation, weight, and update magnitudes on expectation. We find that systematic application of this philosophy eliminates the observed drifts and imbalances, resulting in considerably better networks at equal computational complexity. Our modifications improve the previous record FID of 2.41 in ImageNet-512 synthesis to 1.81, achieved using fast deterministic sampling.As an independent contribution, we present a method for setting the exponential moving average (EMA) parameters post-hoc, i.e., after completing the training run. This allows precise tuning of EMA length without the cost of performing several training runs, and reveals its surprising interactions with network architecture, training time, and guidance.
Fourier Priors-Guided Diffusion for Zero-Shot Joint Low-Light Enhancement and Deblurring
Xiaoqian Lv · Shengping Zhang · Chenyang Wang · Yichen Zheng · Bineng Zhong · Chongyi Li · Liqiang Nie
Existing joint low-light enhancement and deblurring methods learn pixel-wise mappings from paired synthetic data, which results in limited generalization in real-world scenes. While some studies explore the rich generative prior of pre-trained diffusion models, they typically rely on the assumed degradation process and cannot handle unknown real-world degradations well. To address these problems, we propose a novel zero-shot framework, FourierDiff, which embeds Fourier priors into a pre-trained diffusion model to harmoniously handle the joint degradation of luminance and structures. FourierDiff is appealing in its relaxed requirements on paired training data and degradation assumptions. The key zero-shot insight is motivated by image characteristics in the Fourier domain: most luminance information concentrates on amplitudes while structure and content information are closely related to phases. Based on this observation, we decompose the sampled results of the reverse diffusion process in the Fourier domain and take advantage of the amplitude of the generative prior to align the enhanced brightness with the distribution of natural images. To yield a sharp and content-consistent enhanced result, we further design a spatial-frequency alternating optimization strategy to progressively refine the phase of the input. Extensive experiments demonstrate the superior effectiveness of the proposed method, especially in real-world scenes.
Color Shift Estimation-and-Correction for Image Enhancement
Yiyu Li · Ke Xu · Gerhard Hancke · Rynson W.H. Lau
Images captured under sub-optimal illumination conditions may contain both over- and under-exposures. Current approaches mainly focus on adjusting image brightness, which may exacerbate the color tone distortion in under-exposed areas and fail to restore accurate colors in over-exposed regions. We observe that under-exposed and over-exposed regions display opposite color tone distribution shifts with respect to each other, which may not be easily normalized in joint modeling as they usually do not have ``normal-exposed'' regions/pixels as reference. In this paper, we propose a novel method to enhance images with both over- and under-exposures by learning to estimate and correct such color shifts. Specifically, we first derive the color feature maps of the brightened and darkened versions of the input image via a UNet-based network, followed by a pseudo-normal feature generator to produce pseudo-normal color feature maps. We then propose a novel COlor Shift Estimation (COSE) module to estimate the color shifts between the derived brightened (or darkened) color feature maps and the pseudo-normal color feature maps. The COSE module corrects the estimated color shifts of the over- and under-exposed regions separately. We further propose a novel COlor MOdulation (COMO) module to modulate the separately corrected colors in the over- and under-exposed regions to produce the enhanced image. Comprehensive experiments show that our method outperforms existing approaches. We will release our codes.
Video Super-Resolution Transformer with Masked Inter&Intra-Frame Attention
Xingyu Zhou · Leheng Zhang · Xiaorui Zhao · Keze Wang · Leida Li · Shuhang Gu
Recently, Vision Transformer has achieved great success in recovering missing details in low-resolution sequences, i.e., the video super-resolution (VSR) task. Despite its superiority in VSR accuracy, the heavy computational burden as well as the large memory footprint hinder the deployment of Transformer-based VSR models on constrained devices. In this paper, we address the above issue by proposing a novel feature-level masked processing framework: VSR with Masked Intra and inter-frame Attention (MIA-VSR). The core of MIA-VSR is leveraging feature-level temporal continuity between adjacent frames to reduce redundant computations and make more rational use of previously enhanced SR features. Concretely, we propose an intra-frame and inter-frame attention block which takes the respective roles of past features and input features into consideration and only exploits previously enhanced features to provide supplementary information.In addition, an adaptive block-wise mask prediction module is developed to skip unimportant computations according to feature similarity between adjacent frames. We conduct detailed ablation studies to validate our contributions and compare the proposed method with recent state-of-the-art VSR approaches. The experimental results demonstrate that MIA-VSR improves the memory and computation efficiency over state-of-the-art methods, without trading off PSNR accuracy. The code is available at https://github.com/LabShuHangGU/MIA-VSR.
Distilling Semantic Priors from SAM to Efficient Image Restoration Models
Quan Zhang · Xiaoyu Liu · Wei Li · Hanting Chen · Junchao Liu · Jie Hu · Zhiwei Xiong · Chun Yuan · Yunhe Wang
In image restoration (IR), leveraging semantic priors from segmentation models has been a common approach to improve performance. The recent segment anything model (SAM) has emerged as a powerful tool for extracting advanced semantic priors to enhance IR tasks. However, the computational cost of SAM is prohibitive for IR, compared to existing smaller IR models. The incorporation of SAM for extracting semantic priors considerably hampers the model inference efficiency. To address this issue, we propose a general framework to distill SAM's semantic knowledge to boost exiting IR models without interfering with their inference process. Specifically, our proposed framework consists of the semantic prior fusion (SPF) scheme and the semantic prior distillation (SPD) scheme. SPF fuses two kinds of information between the restored image predicted by the original IR model and the semantic mask predicted by SAM for the refined restored image. SPD leverages a self-distillation manner to distill the fused semantic priors to boost the performance of the original IR model. Additionally, we design a semantic-guided relation (SGR) loss for SPD, which ensures semantic feature representation space consistency to fully distill the priors. We demonstrate the effectiveness of our general framework across multiple IR models and tasks, including deraining, deblurring, and denoising.
Beyond Average: Individualized Visual Scanpath Prediction
Xianyu Chen · Ming Jiang · Qi Zhao
Understanding how attention varies across individuals has significant scientific and societal impacts. However, existing visual scanpath models treat attention uniformly, neglecting individual differences. To bridge this gap, this paper focuses on individualized scanpath prediction (ISP), a new attention modeling task that aims to accurately predict how different individuals shift their attention in diverse visual tasks. It proposes an ISP method featuring three novel technical components: (1) an observer encoder to characterize and integrate an observer's unique attention traits, (2) an observer-centric feature integration approach that holistically combines visual features, task guidance, and observer-specific characteristics, and (3) an adaptive fixation prioritization mechanism that refines scanpath predictions by dynamically prioritizing semantic feature maps based on individual observers' attention traits. These novel components allow scanpath models to effectively address the attention variations across different observers. Our method is generally applicable to different datasets, model architectures, and visual tasks, offering a comprehensive tool for transforming general scanpath models into individualized ones. Comprehensive evaluations using value-based and ranking-based metrics verify the method's effectiveness and generalizability.
Multimodal Prompt Perceiver: Empower Adaptiveness Generalizability and Fidelity for All-in-One Image Restoration
Yuang Ai · Huaibo Huang · Xiaoqiang Zhou · Jiexiang Wang · Ran He
Despite substantial progress, all-in-one image restoration (IR) grapples with persistent challenges in handling intricate real-world degradations. This paper introduces MPerceiver: a novel multimodal prompt learning approach that harnesses Stable Diffusion (SD) priors to enhance adaptiveness, generalizability and fidelity for all-in-one image restoration. Specifically, we develop a dual-branch module to master two types of SD prompts: textual for holistic representation and visual for multiscale detail representation. Both prompts are dynamically adjusted by degradation predictions from the CLIP image encoder, enabling adaptive responses to diverse unknown degradations. Moreover, a plug-in detail refinement module improves restoration fidelity via direct encoder-to-decoder information transformation. To assess our method, MPerceiver is trained on 9 tasks for all-in-one IR and outperforms state-of-the-art task-specific methods across many tasks. Post multitask pre-training, MPerceiver attains a generalized representation in low-level vision, exhibiting remarkable zero-shot and few-shot capabilities in unseen tasks. Extensive experiments on 16 IR tasks underscore the superiority of MPerceiver in terms of adaptiveness, generalizability and fidelity.
Selective Hourglass Mapping for Universal Image Restoration Based on Diffusion Model
Dian Zheng · Xiao-Ming Wu · Shuzhou Yang · Jian Zhang · Jian-Fang Hu · Wei-Shi Zheng
Universal image restoration is a practical and potential computer vision task for real-world applications. The main challenge of this task is handling the different degradation distributions at once. Existing methods mainly utilize task-specific conditions (i.e., prompt) to guide the model to learn different distributions separately, named multi-partite mapping. However, it is not suitable for universal model learning as it ignores the shared information between different tasks. In this work, we propose an advanced $\textbf{selective hourglass mapping strategy}$ based on diffusion model, termed $\textbf{DiffUIR}$. Two novel considerations make our DiffUIR non-trivial. Firstly, we equip the model with strong condition guidance to obtain accurate generation direction of diffusion model ($\textbf{selective}$). More importantly, DiffUIR integrates a flexible shared distribution term (SDT) into the diffusion algorithm elegantly and naturally, which gradually maps different distributions into a shared one. In the reverse process, combined with SDT and strong condition guidance, DiffUIR iteratively guides the shared distribution to the task-specific distribution with high image quality ($\textbf{hourglass}$). Without bells and whistles, by only modifying the mapping strategy, we achieve state-of-the-art performance on five image restoration tasks, 22 benchmarks in the universal setting and zero-shot generalization setting. Surprisingly, by only using a lightweight model (only 0.89M), we could achieve outstanding performance. The source code and pre-trained models are available at https://github.com/iSEE-Laboratory/DiffUIR
SeeSR: Towards Semantics-Aware Real-World Image Super-Resolution
Rongyuan Wu · Tao Yang · Lingchen Sun · Zhengqiang ZHANG · Shuai Li · Lei Zhang
Owe to the powerful generative priors, the pre-trained text-to-image (T2I) diffusion models have become increasingly popular in solving the real-world image super-resolution problem. However, as a consequence of the heavy quality degradation of input low-resolution (LR) images, the destruction of local structures can lead to ambiguous image semantics. As a result, the content of reproduced high-resolution image may have semantic errors, deteriorating the super-resolution performance. To address this issue, we present a semantics-aware approach to better preserve the semantic fidelity of generative real-world image super-resolution. First, we train a degradation-aware prompt extractor, which can generate accurate soft and hard semantic prompts even under strong degradation. The hard semantic prompts refer to the image tags, aiming to enhance the local perception ability of the T2I model, while the soft semantic prompts compensate for the hard ones to provide additional representation information. These semantic prompts can encourage the T2I model to generate detailed and semantically accurate results. Furthermore, during the inference process, we integrate the LR images into the initial sampling noise to mitigate the diffusion model's tendency to generate excessive random details. The experiments show that our method can reproduce more realistic image details and hold better the semantics.
Revisiting Single Image Reflection Removal In the Wild
Yurui Zhu · Bo Li · Xueyang Fu · Peng-Tao Jiang · Hao Zhang · Qibin Sun · Zheng-Jun Zha · Jinwei Chen
This research focuses on the issue of single-image reflection removal (SIRR) in real-world conditions, examining it from two angles: the collection pipeline of real reflection pairs and the perception of real reflection locations. We devise an advanced reflection collection pipeline that is highly adaptable to a wide range of real-world reflection scenarios and incurs reduced costs in collecting large-scale aligned reflection pairs. In the process, we develop a large-scale, high-quality reflection dataset named Reflection Removal in the Wild (RRW). RRW contains over 14,950 high-resolution real-world reflection pairs, a dataset forty-five times larger than its predecessors. Regarding perception of reflection locations, we identify that numerous virtual reflection objects visible in reflection images are not present in the corresponding ground-truth images. This observation, drawn from the aligned pairs, leads us to conceive the Maximum Reflection Filter (MaxRF). The MaxRF could accurately and explicitly characterize reflection locations from pairs of images. Building upon this, we design a reflection location-aware cascaded framework, specifically tailored for SIRR. Powered by these innovative techniques, our solution achieves superior performance than current leading methods across multiple real-world benchmarks. Codes and datasets will be publicly available.
ODCR: Orthogonal Decoupling Contrastive Regularization for Unpaired Image Dehazing
Zhongze Wang · Haitao Zhao · Jingchao Peng · Lujian Yao · Kaijie Zhao
Unpaired image dehazing (UID) holds significant research importance due to the challenges in acquiring haze/clear image pairs with identical backgrounds. This paper proposes a novel method for UID named Orthogonal Decoupling Contrastive Regularization (ODCR). Our method is grounded in the assumption that an image consists of both haze-related features, which influence the degree of haze, and haze-unrelated features, such as texture and semantic information. ODCR aims to ensure that the haze-related features of the dehazing result closely resemble those of the clear image, while the haze-unrelated features align with the input hazy image. To accomplish the motivation, Orthogonal MLPs optimized geometrically on the Stiefel manifold are proposed, which can project image features into an orthogonal space, thereby reducing the relevance between different features. Furthermore, a task-driven Depth-wise Feature Classifier (DWFC) is proposed, which assigns weights to the orthogonal features based on the contribution of each channel's feature in predicting whether the feature source is hazy or clear in a self-supervised fashion. Finally, a Weighted PatchNCE (WPNCE) loss is introduced to achieve the pulling of haze-related features in the output image toward those of clear images, while bringing haze-unrelated features close to those of the hazy input. Extensive experiments demonstrate the superior performance of our ODCR method on UID.
Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models
Haoning Wu · Zicheng Zhang · Erli Zhang · Chaofeng Chen · Liang Liao · Annan Wang · Kaixin Xu · Chunyi Li · Jingwen Hou · Guangtao Zhai · Xue Geng · Wenxiu Sun · Qiong Yan · Weisi Lin
Multi-modality large language models (MLLMs), as represented by GPT-4V, have introduced a paradigm shift for visual perception and understanding tasks, that a variety of abilities can be achieved within one foundation model. While current MLLMs demonstrate primary low-level visual abilities from the identification of low-level visual attributes (e.g., clarity, brightness) to the evaluation on image quality, there's still an imperative to further improve the accuracy of MLLMs to substantially alleviate human burdens. To address this, we collect the first dataset consisting of human natural language feedback on low-level vision. Each feedback offers a comprehensive description of an image's low-level visual attributes, culminating in an overall quality assessment. The constructed Q-Pathway dataset includes 58K detailed human feedbacks on 18,973 multi-sourced images with diverse low-level appearance. To ensure MLLMs can adeptly handle diverse queries, we further propose a GPT-participated transformation to convert these feedbacks into a rich set of 200K instruction-response pairs, termed Q-Instruct. Experimental results indicate that the Q-Instruct consistently elevates various low-level visual capabilities across multiple base models. We anticipate that our datasets can pave the way for a future that foundation models can assist humans on low-level visual tasks.
Enhancing Quality of Compressed Images by Mitigating Enhancement Bias Towards Compression Domain
Qunliang Xing · Mai Xu · Shengxi Li · Xin Deng · Meisong Zheng · huaida liu · Ying Chen
Existing quality enhancement methods for compressed images focus on aligning the enhancement domain with the raw domain to yield realistic images. However, these methods exhibit a pervasive enhancement bias towards the compression domain, inadvertently regarding it as more realistic than the raw domain. This bias makes enhanced images closely resemble their compressed counterparts, thus degrading their perceptual quality. In this paper, we propose a simple yet effective method to mitigate this bias and enhance the quality of compressed images. Our method employs a conditional discriminator with the compressed image as a key condition, and then incorporates a domain-divergence regularization to actively distance the enhancement domain from the compression domain. Through this dual strategy, our method enables the discrimination against the compression domain, and brings the enhancement domain closer to the raw domain. Comprehensive quality evaluations confirm the superiority of our method over other state-of-the-art methods without incurring inference overheads.
Attentive Illumination Decomposition Model for Multi-Illuminant White Balancing
Dongyoung Kim · Jinwoo Kim · Junsang Yu · Seon Joo Kim
White balance (WB) algorithms in many commercial cameras assume single and uniform illumination, leading to undesirable results when multiple lighting sources with different chromaticities exist in the scene. Prior research on multi-illuminant WB typically predicts illumination at the pixel level without fully grasping the scene's actual lighting conditions, including the number and color of light sources. This often results in unnatural outcomes lacking in overall consistency. To handle this problem, we present a deep white balancing model that leverages the slot attention, where each slot is in charge of representing individual illuminants. This design enables the model to generate chromaticities and weight maps for individual illuminants, which are then fused to compose the final illumination map. Furthermore, we propose the centroid-matching loss, which regulates the activation of each slot based on the color range, thereby enhancing the model to separate illumination more effectively. Our method achieves the state-of-the-art performance on both single- and multi-illuminant WB benchmarks, and also offers additional information such as the number of illuminants in the scene and their chromaticity. This capability allows for illumination editing, an application not feasible with prior methods.
NightCC: Nighttime Color Constancy via Adaptive Channel Masking
Shuwei Li · Robby T. Tan
Nighttime conditions pose a significant challenge to color constancy due to the diversity of lighting conditions and the presence of substantial low-light noise. Existing color constancy methods struggle with nighttime scenes, frequently leading to imprecise light color estimations. To tackle nighttime color constancy, we propose a novel unsupervised domain adaptation approach that utilizes labeled daytime data to facilitate learning on unlabeled nighttime images. To specifically address the unique lighting conditions of nighttime and ensure the robustness of pseudo labels, we propose adaptive channel masking and reflective uncertainty. The adaptive channel masking is designed to guide the model to progressively learn features that are less influenced by variations in light colors and noise. Moreover, with our reflective uncertainty providing pixel-wise uncertainty estimation, our model can avoid learning from incorrect labels. Our model demonstrates a significant improvement in accuracy, achieving 20% lower Mean Angular Error (MAE) compared to the state-of-the-art method on our nighttime dataset.
Navigating Beyond Dropout: An Intriguing Solution towards Generalizable Image Super Resolution
Hongjun Wang · Jiyuan Chen · Yinqiang Zheng · Tieyong Zeng
Deep learning has led to a dramatic leap on Single Image Super-Resolution (SISR) performances in recent years. While most existing work assumes a simple and fixed degradation model (e.g., bicubic downsampling), the research of Blind SR seeks to improve model generalization ability with unknown degradation. Recently, Kong et al. pioneer the investigation of a more suitable training strategy for Blind SR using Dropout. Although such method indeed brings substantial generalization improvements via mitigating overfitting, we argue that Dropout simultaneously introduces undesirable side-effect that compromises model's capacity to faithfully reconstruct fine details. We show both the theoretical and experimental analyses in our paper, and furthermore, we present another easy yet effective training strategy that enhances the generalization ability of the model by simply modulating its first and second-order features statistics. Experimental results have shown that our method could serve as a model-agnostic regularization and outperforms Dropout on seven benchmark datasets including both synthetic and real-world scenarios. The code is released in our supplementary materials.
Learning Inclusion Matching for Animation Paint Bucket Colorization
Yuekun Dai · Shangchen Zhou · Blake Li · Chongyi Li · Chen Change Loy
Colorizing line art is a pivotal task in the production of hand-drawn cel animation. This typically involves digital painters using a paint bucket tool to manually color each segment enclosed by lines, based on RGB values predetermined by a color designer. This frame-by-frame process is both arduous and time-intensive. Current automated methods mainly focus on segment matching. This technique migrates colors from a reference to the target frame by aligning features within line-enclosed segments across frames. However, issues like occlusion and wrinkles in animations often disrupt these direct correspondences, leading to mismatches. In this work, we introduce a new learning-based inclusion matching pipeline, which directs the network to comprehend the inclusion relationships between segments rather than relying solely on direct visual correspondences. Our method features a two-stage pipeline that integrates a coarse color warping module with an inclusion matching module, enabling more nuanced and accurate colorization. To facilitate the training of our network, we also develope a unique dataset, referred to as PaintBucket-Character. This dataset includes rendered line arts alongside their colorized counterparts, featuring various 3D characters. Extensive experiments demonstrate the effectiveness and superiority of our method over existing techniques.
Defense Against Adversarial Attacks on No-Reference Image Quality Models with Gradient Norm Regularization
Yujia Liu · Chenxi Yang · Dingquan Li · Jianhao Ding · Tingting Jiang
The task of No-Reference Image Quality Assessment (NR-IQA) is to estimate the quality score of an input image without additional information. NR-IQA models play a crucial role in the media industry, aiding in performance evaluation and optimization guidance. However, these models are found to be vulnerable to adversarial attacks, which introduce imperceptible perturbations to input images, resulting in significant changes in predicted scores. In this paper, we propose a defense method to mitigate the variability in predicted scores caused by small perturbations, thus enhancing the adversarial robustness of NR-IQA models. To be specific, we present theoretical evidence showing that the extent of score changes is related to the $\ell_1$ norm of the gradient of the predicted score with respect to the input image when adversarial perturbations are $\ell_\infty$-bounded. Building on this theoretical foundation, we propose a norm regularization training strategy aimed at reducing the $\ell_1$ norm of the gradient, thereby boosting the adversarial robustness of NR-IQA models. Experiments conducted on four NR-IQA baseline models demonstrate the effectiveness of our strategy in reducing score changes in the presence of adversarial attacks. To the best of our knowledge, this work marks the first attempt to defend against adversarial attacks on NR-IQA models. Our study offers valuable insights into the adversarial robustness of NR-IQA models and provides a foundation for future research in this area.
Towards Backward-Compatible Continual Learning of Image Compression
Zhihao Duan · Ming Lu · Justin Yang · Jiangpeng He · Zhan Ma · Fengqing Zhu
This paper explores the possibility of extending the capability of pre-trained neural image compressors (e.g., adapting to new data or target bitrates) without breaking backward compatibility, the ability to decode bitstreams encoded by the original model.We refer to this problem as continual learning of image compression.Our initial findings show that baseline solutions, such as end-to-end fine-tuning, do not preserve the desired backward compatibility.To tackle this, we propose a knowledge replay training strategy that effectively addresses this issue.We also design a new model architecture that enables more effective continual learning than existing baselines.Experiments are conducted for two scenarios: data-incremental learning and rate-incremental learning.The main conclusion of this paper is that neural image compressors can be fine-tuned to achieve better performance (compared to their pre-trained version) on new data and rates without compromising backward compatibility.Our code will be made publicly available.
APISR: Anime Production Inspired Real-World Anime Super-Resolution
Boyang Wang · Fengyu Yang · Xihang Yu · Chao Zhang · Hanbin Zhao
While real-world anime super-resolution (SR) has gained increasing attention in the SR community, most existing methods still adopt techniques from the photo-realistic domain. In this paper, we analyze the anime production workflow and rethink how to use characteristics of it for the sake of the real-world anime SR. First, we argue that video networks and datasets are not necessary for anime SR due to the repetition use of hand-drawing frames. Instead, we propose an anime image collection pipeline by choosing the least compressed and the most informative frames from the video sources. Based on this pipeline, we introduce the Anime Production-oriented Image (API) dataset. In addition, we identify two anime-specific challenges of distorted and faint hand-drawn lines and unwanted color artifacts. We address the first issue by introducing a prediction-oriented compression module in the image degradation model and a pseudo-ground truth with enhanced hand-drawn lines. In addition, we introduce the balanced twin perceptual loss combining both anime and photo-realistic high-level features to mitigate unwanted color artifacts and increase visual clarity. We evaluate our method through extensive experiments on the public benchmark, showing our method outperforms state-of-the-art approaches by a large margin. We will release code, models, and dataset upon acceptance.
Unifying Automatic and Interactive Matting with Pretrained ViTs
Zixuan Ye · Wenze Liu · He Guo · Yujia Liang · Chaoyi Hong · Hao Lu · Zhiguo Cao
Automatic and interactive matting largely improve image matting by respectively alleviating the need for auxiliary input and enabling object selection. Due to different settings on whether prompts exist, they either suffer from weakness in instance completeness or region details. Also, when dealing with different scenarios, directly switching between the two matting models introduces inconvenience and higher workload. Therefore, we wonder whether we can alleviate the limitations of both settings while achieving unification to facilitate more convenient use. Our key idea is to offer saliency guidance for automatic mode to enable its attention to detailed regions, and also refine the instance completeness in interactive mode by replacing the binary mask guidance with a more probabilistic form. With different guidance for each mode, we can achieve unification through adaptable guidance, defined as saliency information in automatic mode and user cue for interactive one. It is instantiated as candidate feature in our method, an automatic switch for class token in pretrained ViTs and average feature of user prompts, controlled by the existence of user prompts. Then we use the candidate feature to generate a probabilistic similarity map as the guidance to alleviate the over-reliance on binary mask. Extensive experiments show that our method can adapt well to both automatic and interactive scenarios with more light-weight framework. Code available at https://github.com/coconuthust/SmartMatting.
Motion-adaptive Separable Collaborative Filters for Blind Motion Deblurring
Chengxu Liu · Xuan Wang · Xiangyu Xu · Ruhao Tian · Shuai Li · Xueming Qian · Ming-Hsuan Yang
Eliminating image blur produced by various kinds of motion has been a challenging problem. Dominant approaches rely heavily on model capacity to remove blurring by reconstructing residual from blurry observation in feature space. These practices not only prevent the capture of spatially variable motion in the real world but also ignore the tailored handling of various motions in image space. In this paper, we propose a novel real-world deblurring filtering model called the Motion-adaptive Separable Collaborative (MISC) Filter. In particular, we use a motion estimation network to capture motion information from neighborhoods, thereby adaptively estimating spatially-variant motion flow, mask, kernels, weights, and offsets to obtain the MISC Filter. The MISC Filter first aligns the motion-induced blurring patterns to the motion middle along the predicted flow direction, and then collaboratively filters the aligned image through the predicted kernels, weights, and offsets to generate the output. This design can handle more generalized and complex motion in a spatially differentiated manner. Furthermore, we analyze the relationships between the motion estimation network and the residual reconstruction network. Extensive experiments on four widely used benchmarks demonstrate that our method provides an effective solution for real-world motion blur removal and achieves state-of-the-art performance. Code is available at \url{https://github.com/ChengxuLiu/MISCFilter}.
Genuine Knowledge from Practice: Diffusion Test-Time Adaptation for Video Adverse Weather Removal
Yijun Yang · Hongtao Wu · Angelica I. Aviles-Rivero · Yulun Zhang · Jing Qin · Lei Zhu
Real-world vision tasks frequently suffer from the appearance of unexpected adverse weather conditions, including rain, haze, snow, and raindrops. In the last decade, convolutional neural networks and vision transformers have yielded outstanding results in single-weather video removal. However, due to the absence of appropriate adaptation, most of them fail to generalize to other weather conditions. Although ViWS-Net is proposed to remove adverse weather conditions in videos with a single set of pre-trained weights, it is seriously blinded by seen weather at train-time and degenerates when coming to unseen weather during test-time. In this work, we introduce test-time adaptation into adverse weather removal in videos, and propose the first framework that integrates test-time adaptation into the iterative diffusion reverse process. Specifically, we devise a diffusion-based network with a novel temporal noise model to efficiently explore frame-correlated information in degraded video clips at training stage. During inference stage, we introduce a proxy task named Diffusion Tubelet Self-Calibration to learn the primer distribution of test video stream and optimize the model by approximating the temporal noise model for online adaptation. Experimental results, on benchmark datasets, demonstrate that our Test-Time Adaptation method with Diffusion-based network(Diff-TTA) outperforms state-of-the-art methods in terms of restoring videos degraded by seen weather conditions. Its generalizable capability is validated with unseen weather conditions in synthesized and real-world videos.
HomoFormer: Homogenized Transformer for Image Shadow Removal
Jie Xiao · Xueyang Fu · Yurui Zhu · Dong Li · Jie Huang · Kai Zhu · Zheng-Jun Zha
The spatial non-uniformity and diverse patterns of shadow degradation conflict with the weight sharing manner of dominant models, which may lead to an unsatisfactory compromise. To tackle with this issue, we present a novel strategy from the view of shadow transformation in this paper: directly homogenizing the spatial distribution of shadow degradation. Our key design is the random shuffle operation and its corresponding inverse operation. Specifically, random shuffle operation stochastically rearranges the pixels across spatial space and the inverse operation recovers the original order. After randomly shuffling, the shadow diffuses in the whole image and the degradation appears in a homogenized way, which can be effectively processed by the local self-attention layer. Moreover, we further devise a new feed forward network with position modeling to exploit image structural information. Based on these elements, we construct the final local window based transformer named HomoFormer for image shadow removal. Our HomoFormer can enjoy the linear complexity of local transformers while bypassing challenges of non-uniformity and diversity of shadow. Extensive experiments are conducted to verify the superiority of our HomoFormer across public datasets. Code will be publicly available.
Bidirectional Multi-Scale Implicit Neural Representations for Image Deraining
Xiang Chen · Jinshan Pan · Jiangxin Dong
How to effectively explore multi-scale representations of rain streaks is important for image deraining. In contrast to existing Transformer-based methods that depend mostly on single-scale rain appearance, we develop an end-to-end multi-scale Transformer that leverages the potentially useful features in various scales to facilitate high-quality image reconstruction. To better explore the common degradation representations from spatially-varying rain streaks, we incorporate intra-scale implicit neural representations based on pixel coordinates with the degraded inputs in a closed-loop design, enabling the learned features to facilitate rain removal and improve the robustness of the model in complex scenarios. To ensure richer collaborative representation from different scales, we embed a simple yet effective inter-scale bidirectional feedback operation into our multi-scale Transformer by performing coarse-to-fine and fine-to-coarse information communication. Extensive experiments demonstrate that our approach, named as NeRD-Rain, performs favorably against the state-of-the-art ones on both synthetic and real-world benchmark datasets. The source code and trained models are available at https://github.com/cschenxiang/NeRD-Rain.
Event camera has significant advantages in capturing dynamic scene information while being prone to noise interference, particularly in challenging conditions like low threshold and low illumination. However, most existing research focuses on gentle situations, hindering event camer aapplications in realistic complex scenarios. To tackle this limitation and advance the field, we construct a new paired real-world event denoising dataset (LED), including 3K sequences with 18K seconds of high-resolution (1200*680) event streams and showing three notable distinctions compared to others: diverse noise levels and scenes, larger scale with high-resolution, and high-quality GT. Specifically,it contains stepped parameters and varying illumination with diverse scenarios. Moreover, based on the property of noise events inconsistency and signal events consistency, we propose a novel effective denoising framework (DED) using homogeneous dual events to generate the GT with better separating noise from the raw. Furthermore, we design a bio-inspired baseline leveraging Leaky-Integrate-and-Fire (LIF) neurons with dynamic thresholds to realize accurate denoising. The experimental results demonstrate that the remarkable performance of the proposed approach on different datasets.The dataset and codeare at https://github.com/Yee-Sing/led.
Seeing Motion at Nighttime with an Event Camera
Haoyue Liu · Shihan Peng · Lin Zhu · Yi Chang · Hanyu Zhou · Luxin Yan
We focus on a very challenging task: imaging at nighttime dynamic scenes. Most previous methods rely on the low-light enhancement of a conventional RGB camera. However, they would inevitably face a dilemma between the long exposure time of nighttime and the motion blur of dynamic scenes. Event cameras react to dynamic changes with higher temporal resolution (microsecond) and higher dynamic range (120dB), offering an alternative solution. In this work, we present a novel nighttime dynamic imaging method with an event camera. Specifically, we discover that the event at nighttime exhibits temporal trailing characteristics and spatial non-stationary distribution. Consequently, we propose a nighttime event reconstruction network (NER-Net) which mainly includes a learnable event timestamps calibration module (LETC) to align the temporal trailing events and a non-uniform illumination aware module (NIAM) to stabilize the spatiotemporal distribution of events. Moreover, we construct a paired real low-light event dataset (RLED) through a co-axial imaging system, including 64,200 spatially and temporally aligned image GTs and low-light events. Extensive experiments demonstrate that the proposed method outperforms state-of-the-art methods in terms of visual quality and generalization ability on real-world nighttime datasets. The project are available at: https://github.com/Liu-haoyue/NER-Net.
Leveraging Frame Affinity for sRGB-to-RAW Video De-rendering
Chen Zhang · Wencheng Han · Yang Zhou · Jianbing Shen · Cheng-Zhong Xu · Wentao Liu
Unprocessed RAW video has shown distinct advantages over sRGB video in video editing and computer vision tasks. However, capturing RAW video is challenging due to limitations in bandwidth and storage. Various methods have been proposed to address similar issues in single image RAW capture through de-rendering. These methods utilize both the metadata and the sRGB image to perform sRGB-to-RAW de-rendering and recover high-quality single-frame RAW data. However, metadata-based methods always require additional computation for online metadata generation, imposing severe burden on mobile camera device for high frame rate RAW video capture. To address this issue, we propose a framework that utilizes frame affinity to achieve high-quality sRGB-to-RAW video reconstruction. Our approach consists of two main steps. The first step, temporal affinity prior extraction, uses motion information between adjacent frames to obtain a reference RAW image. The second step, spatial feature fusion and mapping, learns a pixel-level mapping function using scene-specific and position-specific features provided by the previous frame. Our method can be easily applied to current mobile camera equipment without complicated adaptations or added burden. To demonstrate the effectiveness of our approach, we introduce the first RAW Video De-rendering Benchmark. In this benchmark, our method outperforms state-of-the-art RAW image reconstruction methods, even without image-level metadata.
Scaling Up to Excellence: Practicing Model Scaling for Photo-Realistic Image Restoration In the Wild
Fanghua Yu · Jinjin Gu · Zheyuan Li · Jinfan Hu · Xiangtao Kong · Xintao Wang · Jingwen He · Yu Qiao · Chao Dong
We introduce SUPIR (Scaling-UP Image Restoration), a groundbreaking image restoration method that harnesses generative prior and the power of model scaling up. Leveraging multi-modal techniques and advanced generative prior, SUPIR marks a significant advance in intelligent and realistic image restoration. As a pivotal catalyst within SUPIR, model scaling dramatically enhances its capabilities and demonstrates new potential for image restoration. We collect a dataset comprising 20 million high-resolution, high-quality images for model training, each enriched with descriptive text annotations. SUPIR provides the capability to restore images guided by textual prompts, broadening its application scope and potential. Moreover, we introduce negative-quality prompts to further improve perceptual quality. We also develop a restoration-guided sampling method to suppress the fidelity issue encountered in generative-based restoration. Experiments demonstrate SUPIR's exceptional restoration effects and its novel capacity to manipulate restoration through textual prompts.
AdaRevD: Adaptive Patch Exiting Reversible Decoder Pushes the Limit of Image Deblurring
Xintian Mao · Xiwen Gao · Yan Wang
Despite the recent progress in enhancing the efficacy of image deblurring, the limited decoding capability constrains the upper limit of State-Of-The-Art (SOTA) methods. This paper proposes a pioneering work, Adaptive Patch Exiting Reversible Decoder (AdaRevD), to explore their insufficient decoding capability. By inheriting the weights of the well-trained encoder, we refactor a reversible decoder which scales up the single-decoder training to multi-decoder training while remaining GPU memory-friendly. Meanwhile, we show that our reversible structure gradually disentangles high-level degradation degree and low-level blur pattern (residual of the blur image and its sharp counterpart) from compact degradation representation. Besides, due to the spatially-variant motion blur kernels, different blur patches have various deblurring difficulties. We further introduce a classifier to learn the degradation degree of image patches, enabling them to exit at different sub-decoders for speedup. Experiments show that our AdaRevD pushes the limit of image deblurring, e.g., achieving 34.60 dB in PSNR on GoPro dataset.
Unsupervised Blind Image Deblurring Based on Self-Enhancement
Lufei Chen · Xiangpeng Tian · Shuhua Xiong · Yinjie Lei · Chao Ren
Significant progress in image deblurring has been achieved by deep learning methods, especially the remarkable performance of supervised models on paired synthetic data. However, real-world quality degradation is more complex than synthetic datasets, and acquiring paired data in real-world scenarios poses significant challenges. To address these challenges, we propose a novel unsupervised image deblurring framework based on self-enhancement. The framework progressively generates improved pseudo-sharp and blurry image pairs without the need for real paired datasets, and the generated image pairs with higher qualities can be used to enhance the performance of the reconstructor. To ensure the generated blurry images are closer to the real blurry images, we propose a novel re-degradation principal component consistency loss, which enforces the principal components of the generated low-quality images to be similar to those of re-degraded images from the original sharp ones. Furthermore, we introduce the self-enhancement strategy that significantly improves deblurring performance without increasing the computational complexity of network during inference. Through extensive experiments on multiple real-world blurry datasets, we demonstrate the superiority of our approach over other state-of-the-art unsupervised methods.
TTA-EVF: Test-Time Adaptation for Event-based Video Frame Interpolation via Reliable Pixel and Sample Estimation
Hoonhee Cho · Taewoo Kim · Yuhwan Jeong · Kuk-Jin Yoon
Video Frame Interpolation (VFI), which aims at generating high-frame-rate videos from low-frame-rate inputs, is a highly challenging task. The emergence of bio-inspired sensors known as event cameras, which boast microsecond-level temporal resolution, has ushered in a transformative era for VFI. Nonetheless, the application of event-based VFI techniques in domains with distinct environments from the training data can be problematic. This is mainly because event camera data distribution can undergo substantial variations based on camera settings and scene conditions, presenting challenges for effective adaptation. In this paper, we propose a test-time adaptation method for event-based VFI to address the gap between the source and target domains. Our approach enables sequential learning in an online manner on the target domain, which only provides low-frame-rate videos. We present an approach that leverages confident pixels as pseudo ground-truths, enabling stable and accurate online learning from low-frame-rate videos. Furthermore, to prevent overfitting during the continuous online process where the same scene is encountered repeatedly, we propose a method of blending historical samples with current scenes. Extensive experiments validate the effectiveness of our method, both in cross-domain and continuous domain shifting setups. We will make our code and dataset publicly available.
Learning Coupled Dictionaries from Unpaired Data for Image Super-Resolution
Longguang Wang · Juncheng Li · Yingqian Wang · Qingyong Hu · Yulan Guo
The difficulty of acquiring high-resolution (HR) and low-resolution (LR) image pairs in real scenarios limits the performance of existing learning-based image super-resolution (SR) methods in the real world. To conduct training on real-world unpaired data, current methods focus on synthesizing pseudo LR images to associate unpaired images. However, the realness and diversity of pseudo LR images are vulnerable due to the large image space. In this paper, we propose an alternative to build the connection between unpaired images in a compact proxy space without relying on synthesizing pseudo LR images. Specifically, we first construct coupled HR and LR dictionaries, and then encode HR and LR images into a common latent code space using these dictionaries. In addition, we develop an autoencoder-based framework to couple these dictionaries during optimization by reconstructing input HR and LR images. The coupled dictionaries enable our method to employ a shallow network architecture with only 18 layers to achieve efficient image SR. Extensive experiments show that our method (DictSR) can effectively model the LR-to-HR mapping in coupled dictionaries and produces state-of-the-art performance on benchmark datasets.
Empowering Resampling Operation for Ultra-High-Definition Image Enhancement with Model-Aware Guidance
Yu · Jie Huang · Li · Kaiwen Zheng · Qi Zhu · Man Zhou · Feng Zhao
Image enhancement algorithms have made remarkable advancements in recent years, but directly applying them to Ultra-high-definition (UHD) images presents intractable computational overheads. Therefore, previous straightforward solutions employ resampling techniques to reduce the resolution by adopting a "Downsampling-Enhancement-Upsampling" processing paradigm. However, this paradigm disentangles the resampling operators and inner enhancement algorithms, which results in the loss of information that is favored by the model, further leading to sub-optimal outcomes. In this paper, we propose a novel method of Learning Model-Aware Resampling (LMAR), which learns to customize resampling by extracting model-aware information from the UHD input image, under the guidance of model knowledge. Specifically, our method consists of two core designs, namely compensatory kernel estimation and steganographic resampling. At the first stage, we dynamically predict compensatory kernels tailored to the specific input and resampling scales. At the second stage, the image-wise compensatory information is derived with the compensatory kernels and embedded into the rescaled input images. This promotes the representation of the newly derived downscaled inputs to be more consistent with the full-resolution UHD inputs, as perceived by the model. Our LMAR enables model-aware and model-favored resampling while maintaining compatibility with existing resampling operators. Extensive experiments on multiple UHD image enhancement datasets and different backbones have shown consistent performance gains after correlating resizer and enhancer, e.g., up to 1.2dB PSNR gain for 1.8 times resampling scale on UHD-LOL4K. The code is available at https://github.com/YPatrickW/LMAR.
Generating Content for HDR Deghosting from Frequency View
Tao Hu · Qingsen Yan · Yuankai Qi · Yanning Zhang
Recovering ghost-free High Dynamic Range (HDR) images from multiple Low Dynamic Range (LDR) images becomes challenging when the LDR images exhibit saturation and significant motion. Recent Diffusion Models (DMs) have been introduced in HDR imaging field, demonstrating promising performance, particularly in achieving visually perceptible results compared to previous DNN-based methods. However, DMs require extensive iterations with large models to estimate entire images, resulting in inefficiency that hinders their practical application. To address this challenge, we propose the Low-Frequency aware Diffusion (LF-Diff) model for ghost-free HDR imaging. The key idea of LF-Diff is implementing the DMs in a highly compacted latent space and integrating it into a regression-based model to enhance the details of reconstructed images. Specifically, as low-frequency information is closely related to human visual perception we propose to utilize DMs to create compact low-frequency priors for the reconstruction process. In addition, to take full advantage of the above low-frequency priors, the Dynamic HDR Reconstruction Network (DHRNet) is carried out in a regression-based manner to obtain final HDR images. Extensive experiments conducted on synthetic and real-world benchmark datasets demonstrate that our LF-Diff performs favorably against several state-of-the-art methods and is 10× faster than previous DM-based methods.
Dual Prior Unfolding for Snapshot Compressive Imaging
Jiancheng Zhang · Haijin Zeng · Jiezhang Cao · Yongyong Chen · Dengxiu Yu · Yinping Zhao
Recently, deep unfolding methods have achieved remarkable success in the realm of Snapshot Compressive Imaging (SCI) reconstruction. However, the existing methods all follow the iterative framework of a single image prior, which limits the efficiency of the unfolding methods and makes it a problem to use other priors simply and effectively. To break out of the box, we derive an effective Dual Prior Unfolding (DPU), which achieves the joint utilization of multiple deep priors and greatly improves iteration efficiency. Our unfolding method is implemented through two parts, i.e., Dual Prior Framework (DPF) and Focused Attention (FA). In brief, in addition to the normal image prior, DPF introduces a residual into the iteration formula and constructs a degraded prior for the residual by considering various degradations to establish the unfolding framework. To improve the effectiveness of the image prior based on self-attention, FA adopts a novel mechanism inspired by PCA denoising to scale and filter attention, which lets the attention focus more on effective features with little computation cost. Besides, an asymmetric backbone is proposed to further improve the efficiency of hierarchical self-attention. Remarkably, our 5-stage DPU achieves state-of-the-art (SOTA) performance with the least FLOPs and parameters compared to previous methods, while our 9-stage DPU significantly outperforms other unfolding methods with less computational requirement.
Binarized Low-light Raw Video Enhancement
Gengchen Zhang · Yulun Zhang · Xin Yuan · Ying Fu
Recently, deep neural networks have achieved excellent performance on low-light raw video enhancement. However, they often come with high computational complexity and large memory costs, which hinder their applications on resource-limited devices. In this paper, we explore the feasibility of applying the extremely compact binary neural network (BNN) to low-light raw video enhancement. Nevertheless, there are two main issues with binarizing video enhancement models. One is how to fuse the temporal information to improve low-light denoising without complex modules. The other is how to narrow the performance gap between binary convolutions with the full precision ones. To address the first issue, we introduce a spatial-temporal shift operation, which is easy-to-binarize and effective. The temporal shift efficiently aggregates the features of neighbor frames and the spatial shift handles the misalignment caused by the large motion in videos. For the second issue, we present a distribution-aware binary convolution, which captures the distribution characteristics of real-valued input and incorporates them into plain binary convolutions to alleviate the degradation in performance. Extensive quantitative and qualitative experiments have shown our high-efficiency binarized low-light raw video enhancement method can attain a promising performance. The code is available at https://github.com/zhanggengchen/BRVE.
Neural Spline Fields for Burst Image Fusion and Layer Separation
Ilya Chugunov · David Shustin · Ruyu Yan · Chenyang Lei · Felix Heide
Each photo in an image burst can be considered a sample of a complex 3D scene: the product of parallax, diffuse and specular materials, scene motion, and illuminant variation. While decomposing all of these effects from a stack of misaligned images is a highly ill-conditioned task, the conventional align-and-merge burst pipeline takes the other extreme: blending them into a single image. In this work, we propose a versatile intermediate representation that consists of a two-layer alpha-composited image plus flow model constructed with neural spline fields -- networks trained to map input coordinates to spline control points. Our method is able to, during test-time optimization, jointly fuse a burst image capture into one high-resolution reconstruction and decompose it into transmission and obstruction layers. Then, by discarding the obstruction layer, we can perform a range of tasks including seeing through occlusions, reflection suppression, and shadow removal. We validate the method on complex synthetic and in-the-wild captures and find that our method, with no post-processing steps or learned priors, outperforms existing single-image and multi-view obstruction removal approaches.
Learning Degradation-Independent Representations for Camera ISP Pipelines
Yanhui Guo · Fangzhou Luo · Xiaolin Wu
Image signal processing (ISP) pipeline plays a fundamental role in digital cameras, which converts raw Bayer sensor data to RGB images. However, ISP-generated images usually suffer from imperfections due to the compounded degradations that stem from sensor noises, demosaicing noises, compression artifacts, and possibly adverse effects of erroneous ISP hyperparameter settings such as ISO and gamma values. In a general sense, these ISP imperfections can be considered as degradations. The highly complex mechanisms of ISP degradations, some of which are even unknown, pose great challenges to the generalization capability of deep neural networks (DNN) for image restoration and to their adaptability to downstream tasks. To tackle the issues, we propose a novel DNN approach to learn degradation-independent representations (DiR) through the refinement of a self-supervised learned baseline representation. The proposed DiR learning technique has remarkable domain generalization capability and consequently, it outperforms state-of-the-art methods across various downstream tasks, including blind image restoration, object detection, and instance segmentation, as verified in our experiments.
SeD: Semantic-Aware Discriminator for Image Super-Resolution
Bingchen Li · Xin Li · Hanxin Zhu · YEYING JIN · Ruoyu Feng · Zhizheng Zhang · Zhibo Chen
Generative Adversarial Networks (GANs) have been widely used to recover vivid textures in image super-resolution (SR) tasks. In particular, one discriminator is utilized to enable the SR network to learn the distribution of real-world high-quality images in an adversarial training manner. However, the distribution learning is overly coarse-grained, which is susceptible to virtual textures and causes counter-intuitive generation results. To mitigate this, we propose the simple and effective Semantic-aware Discriminator (denoted as SeD), which encourages the SR network to learn the fine-grained distributions by introducing the semantics of images as a condition. Concretely, we aim to excavate the semantics of images from a well-trained semantic extractor. Under different semantics, the discriminator is able to distinguish the real-fake images individually and adaptively, which guides the SR network to learn the more fine-grained semantic-aware textures. To obtain accurate and abundant semantics, we take full advantage of recently popular pre-trained large vision models (LVMs) with a large dataset, and then incorporate its semantic features into the discriminator through a well-designed spatial cross-attention module. In this way, our proposed semantic-aware discriminator empowered the SR network to produce more photo-realistic and pleasing images. Extensive experiments on two typical tasks, i.e., SR and Real SR have demonstrated the effectiveness of our proposed methods.
SinSR: Diffusion-Based Image Super-Resolution in a Single Step
Yufei Wang · Wenhan Yang · Xinyuan Chen · Yaohui Wang · Lanqing Guo · Lap-Pui Chau · Ziwei Liu · Yu Qiao · Alex C. Kot · Bihan Wen
While super-resolution (SR) methods based on diffusion models exhibit promising results, their practical application is hindered by the substantial number of required inference steps. Recent methods utilize the degraded images in the initial state, thereby shortening the Markov chain. Nevertheless, these solutions either rely on a precise formulation of the degradation process or still necessitate a relatively lengthy generation path (e.g., 15 iterations). To enhance inference speed, we propose a simple yet effective method for achieving single-step SR generation, named SinSR. Specifically, we first derive a deterministic sampling process from the most recent state-of-the-art (SOTA) method for accelerating diffusion-based SR. This allows the mapping between the input random noise and the generated high-resolution image to be obtained in a reduced and acceptable number of inference steps during training. We show that this deterministic mapping can be distilled into a student model that performs SR within only one inference step. Additionally, we propose a novel consistency-preserving loss to simultaneously leverage the ground-truth image during the distillation process, ensuring that the performance of the student model is not solely bound by the feature manifold of the teacher model, resulting in further performance improvement. Extensive experiments conducted on synthetic and real-world datasets demonstrate that the proposed method can achieve comparable or even superior performance compared to both previous SOTA methods and the teacher model, in just one sampling step, resulting in a remarkable up to ×10 speedup for inference. Our code and model will be released.
Self-Adaptive Reality-Guided Diffusion for Artifact-Free Super-Resolution
Qingping Zheng · Ling Zheng · Yuanfan Guo · Ying Li · Songcen Xu · Jiankang Deng · Hang Xu
Artifact-free super-resolution (SR) aims to translate low-resolution images into their high-resolution counterparts with a strict integrity of the original content, eliminating any distortions or synthetic details. While traditional diffusion-based SR techniques have demonstrated remarkable abilities to enhance image detail, they are prone to artifact introduction during iterative procedures. Such artifacts, ranging from trivial noise to unauthentic textures, deviate from the true structure of the source image, thus challenging the integrity of the super-resolution process. In this work, we propose Self-Adaptive Reality-Guided Diffusion (SARGD), a training-free method that delves into the latent space to effectively identify and mitigate the propagation of artifacts. Our SARGD begins by using an artifact detector to identify implausible pixels, creating a binary mask that highlights artifacts. Following this, the Reality Guidance Refinement (RGR) process refines artifacts by integrating this mask with realistic latent representations, improving alignment with the original image. Nonetheless, initial realistic-latent representations from lower-quality images result in over-smoothing in the final output. To address this, we introduce a Self-Adaptive Guidance (SAG) mechanism. It dynamically computes a reality score, enhancing the sharpness of the realistic latent. These alternating mechanisms collectively achieve artifact-free super-resolution. Extensive experiments demonstrate the superiority of our method, delivering detailed artifact-free high-resolution images while reducing sampling steps by 2X.
Improving Spectral Snapshot Reconstruction with Spectral-Spatial Rectification
Jiancheng Zhang · Haijin Zeng · Yongyong Chen · Dengxiu Yu · Yinping Zhao
How to effectively utilize the spectral and spatial characteristics of Hyperspectral Image (HSI) is always a key problem in spectral snapshot reconstruction. Recently, the spectra-wise transformer has shown great potential in capturing inter-spectra similarities of HSI, but the classic design of the transformer, i.e., multi-head division in the spectral (channel) dimension hinders the modeling of global spectral information and results in mean effect. In addition, previous methods adopt the normal spatial priors without taking imaging processes into account and fail to address the unique spatial degradation in snapshot spectral reconstruction. In this paper, we analyze the influence of multi-head division and propose a novel Spectral-Spatial Rectification (SSR) method to enhance the utilization of spectral information and improve spatial degradation. Specifically, SSR includes two core parts: Window-based Spectra-wise Self-Attention (WSSA) and spAtial Rectification Block (ARB). WSSA is proposed to capture global spectral information and account for local differences, whereas ARB aims to mitigate the spatial degradation using a spatial alignment strategy. The experimental results on simulation and real scenes demonstrate the effectiveness of the proposed modules, and we also provide models at multiple scales to demonstrate the superiority of our approach.
Diffusion-based Blind Text Image Super-Resolution
Yuzhe Zhang · jiawei zhang · Hao Li · Zhouxia Wang · Luwei Hou · Dongqing Zou · Liheng Bian
Recovering degraded low-resolution text images is challenging, especially for Chinese text images with complex strokes and severe degradation in real-world scenarios. Ensuring both text fidelity and style realness is crucial for high-quality text image super-resolution. Recently, diffusion models have achieved great success in natural image synthesis and restoration due to their powerful data distribution modeling abilities and data generation capabilities. In this work, we propose an Image Diffusion Model (IDM) to restore text images with realistic styles. For diffusion models, they are not only suitable for modeling realistic image distribution but also appropriate for learning text distribution. Since text prior is important to guarantee the correctness of the restored text structure according to existing arts, we also propose a Text Diffusion Model (TDM) for text recognition which can guide IDM to generate text images with correct structures. We further propose a Mixture of Multi-modality module (MoM) to make these two diffusion models cooperate with each other in all the diffusion steps. Extensive experiments on synthetic and real-world datasets demonstrate that our Diffusion-based Blind Text Image Super-Resolution (DiffTSR) can restore text images with more accurate text structures as well as more realistic appearances simultaneously.
CAMixerSR: Only Details Need More "Attention"
Yan Wang · Yi Liu · Shijie Zhao · Junlin Li · Li zhang
To satisfy the rapidly increasing demands on the large image (2K-8K) super-resolution (SR), prevailing methods follow two independent tracks: 1) accelerate existing networks by content-aware routing, and 2) design better super-resolution networks via token mixer refining. Despite directness, they encounter unavoidable defects (e.g., inflexible route or non-discriminative processing) limiting further improvements of quality-complexity trade-off. To erase the drawbacks, we integrate these schemes by proposing a content-aware mixer (CAMixer), which assigns convolution for simple contexts and additional deformable window-attention for sparse textures. Specifically, the CAMixer uses a learnable predictor to generate multiple bootstraps, including offsets for windows warping, a mask for classifying windows, and convolutional attentions for endowing convolution with the dynamic property, which modulates attention to include more useful textures self-adaptively and improves the representation capability of convolution. We further introduce a global classification loss to improve the accuracy of predictors. By simply stacking CAMixers, we obtain CAMixerSR which achieves superior performance on large-image SR, lightweight SR, and omnidirectional-image SR.
ID-Blau: Image Deblurring by Implicit Diffusion-based reBLurring AUgmentation
Jia-Hao Wu · Fu-Jen Tsai · Yan-Tsung Peng · Charles Tsai · Chia-Wen Lin · Yen-Yu Lin
Image deblurring aims to remove undesired blurs from an image captured in a dynamic scene. Much research has been dedicated to improving deblurring performance through model architectural designs. However, there is little work on data augmentation for image deblurring. Since continuous motion causes blurred artifacts during image exposure, we aspire to develop a groundbreaking blur augmentation method to generate diverse blurred images by simulating motion trajectories in a continuous space. This paper proposes Implicit Diffusion-based reBLurring AUgmentation (ID-Blau), utilizing a sharp image paired with a controllable blur condition map to produce a corresponding blurred image. We parameterize the blur patterns of a blurred image with their orientations and magnitudes as a pixel-wise blur condition map to simulate motion trajectories and implicitly represent them in a continuous space. By sampling diverse blur conditions, ID-Blau can generate various blurred images unseen in the training set. Experimental results demonstrate that ID-Blau can produce realistic blurred images for training and thus significantly improve performance for state-of-the-art deblurring models.
Low-Res Leads the Way: Improving Generalization for Super-Resolution by Self-Supervised Learning
Haoyu Chen · Wenbo Li · Jinjin Gu · Jingjing Ren · Haoze Sun · Xueyi Zou · Youliang Yan · Zhensong Zhang · Lei Zhu
For image super-resolution (SR), bridging the gap between the performance on synthetic datasets and real-world degradation scenarios remains a challenge. This work introduces a novel "Low-Res Leads the Way" (LWay) training framework, merging Supervised Pre-training with Self-supervised Learning to enhance the adaptability of SR models to real-world images. Our approach utilizes a low-resolution (LR) reconstruction network to extract degradation embeddings from LR images, merging them with super-resolved outputs for LR image reconstruction. Leveraging unseen LR images for self-supervised learning guides the model to adapt its modeling space to the target domain, facilitating fine-tuning of SR models without requiring paired high-resolution (HR) images. The integration of Discrete Wavelet Transform (DWT) further refines the focus on high-frequency details. Extensive evaluations show that our method significantly improves the generalization and detail restoration capabilities of SR models on unseen real-world datasets, outperforming existing methods. Our training regime is universally compatible, requiring no network architecture modifications, making it a practical solution for real-world SR applications.
CoSeR: Bridging Image and Language for Cognitive Super-Resolution
Haoze Sun · Wenbo Li · Jianzhuang Liu · Haoyu Chen · Renjing Pei · Xueyi Zou · Youliang Yan · Yujiu Yang
Existing super-resolution (SR) models primarily focus on restoring local texture details, often neglecting the global semantic information within the scene. This oversight can lead to the omission of crucial semantic details or the introduction of inaccurate textures during the recovery process. In our work, we introduce the Cognitive Super-Resolution (CoSeR) framework, empowering SR models with the capacity to comprehend low-resolution images. We achieve this by marrying image appearance and language understanding to generate a cognitive embedding, which not only activates prior information from large text-to-image diffusion models but also facilitates the generation of high-quality reference images to optimize the SR process. To further improve image fidelity, we propose a novel condition injection scheme called ''All-in-Attention'', consolidating all conditional information into a single module. Consequently, our method successfully restores semantically correct and photorealistic details, demonstrating state-of-the-art performance across multiple benchmarks. Project page: https://coser-main.github.io/
Real-World Efficient Blind Motion Deblurring via Blur Pixel Discretization
Insoo Kim · Jae Seok Choi · Geonseok Seo · Kinam Kwon · Jinwoo Shin · Hyong-Euk Lee
As recent advances in mobile camera technology have enabled the capability to capture high-resolution images, such as 4K images, the demand for an efficient deblurring model handling large motion has increased. In this paper, we discover that the image residual errors, i.e., blur-sharp pixel differences, can be grouped into some categories according to their motion blur type and how complex their neighboring pixels are. Inspired by this, we decompose the deblurring (regression) task into blur pixel discretization (pixel-level blur classification) and discrete-to-continuous conversion (regression with blur class map) tasks. Specifically, we generate the discretized image residual errors by identifying the blur pixels and then transform them to a continuous form, which is computationally more efficient than naively solving the original regression problem with continuous values. Here, we found that the discretization result, i.e., blur segmentation map, remarkably exhibits visual similarity with the image residual errors. As a result, our efficient model shows comparable performance to state-of-the-art methods in realistic benchmarks, while our method is up to 10 times computationally more efficient.
SeNM-VAE: Semi-Supervised Noise Modeling with Hierarchical Variational Autoencoder
Dihan Zheng · Yihang Zou · Xiaowen Zhang · Chenglong Bao
The data bottleneck has emerged as a fundamental challenge in learning based image restoration methods. Researchers have attempted to generate synthesized training data using paired or unpaired samples to address this challenge. This study proposes SeNM-VAE, a semi-supervised noise modeling method that leverages both paired and unpaired datasets to generate realistic degraded data. Our approach is based on modeling the conditional distribution of degraded and clean images with a specially designed graphical model. Under the variational inference framework, we develop an objective function for handling both paired and unpaired data. We employ our method to generate paired training samples for real-world image denoising and super-resolution tasks. Our approach excels in the quality of synthetic degraded images compared to other unpaired and paired noise modeling methods. Furthermore, our approach demonstrates remarkable performance in downstream image restoration tasks, even with limited paired data. With more paired data, our method achieves the best performance on the SIDD dataset.
Text-guided Explorable Image Super-resolution
Kanchana Vaishnavi Gandikota · Paramanand Chandramouli
In this paper, we introduce the problem of zero-shot text guided exploration of the solutions to open-domain image super-resolution. Our goal is to allow users to explore diverse, semantically accurate reconstructions which preserve data consistency with the low-resolution inputs for different large downsampling factors without explicitly training for these specific degradations. We propose two approaches for zero-shot text guided super-resolution - i) modifying the generative process of text-to-image (T2I) diffusion models to promote consistency with low-resolution inputs, and ii) incorporating language guidance into zero-shot diffusion based restoration methods. We show that these approaches result in diverse solutions which match the semantic meaning provided by the text prompt, while preserving data consistency with the degraded inputs. We evaluate the proposed baselines for the task of extreme super-resolution and demonstrate advantages in terms of restoration quality, diversity and explorability of solutions.
Equivariant Multi-Modality Image Fusion
Zixiang Zhao · Haowen Bai · Jiangshe Zhang · Yulun Zhang · Kai Zhang · Shuang Xu · Dongdong Chen · Radu Timofte · Luc Van Gool
Multi-modality image fusion is a technique that combines information from different sensors or modalities, enabling the fused image to retain complementary features from each modality, such as functional highlights and texture details. However, effective training of such fusion models is challenging due to the scarcity of ground truth fusion data. To tackle this issue, we propose the Equivariant Multi-Modality imAge fusion (EMMA) paradigm for end-to-end self-supervised learning. Our approach is rooted in the prior knowledge that natural imaging responses are equivariant to certain transformations. Consequently, we introduce a novel training paradigm that encompasses a fusion module, a pseudo-sensing module, and an equivariant fusion module. These components enable the net training to follow the principles of the natural sensing-imaging process while satisfying the equivariant imaging prior. Extensive experiments confirm that EMMA yields high-quality fusion results for infrared-visible and medical images, concurrently facilitating downstream multi-modal segmentation and detection tasks. The code is available at https://github.com/Zhaozixiang1228/MMIF-EMMA.
Revisiting Spatial-Frequency Information Integration from a Hierarchical Perspective for Panchromatic and Multi-Spectral Image Fusion
Jiangtong Tan · Jie Huang · Naishan Zheng · Man Zhou · Keyu Yan · Danfeng Hong · Feng Zhao
Pan-sharpening is a super-resolution problem that essentially relies on spectra fusion of panchromatic (PAN) images and low-resolution multi-spectral (LRMS) images. The previous methods have validated the effectiveness of information fusion in the Fourier space of the whole image. However, they haven't fully explored the Fourier relationships at different hierarchies between PAN and LRMS images. To this end, we propose a Hierarchical Frequency Integration Network (HFIN) to facilitate hierarchical Fourier information integration for pan-sharpening. Specifically, our network consists of two designs: information stratification and information integration. For information stratification, we hierarchically decompose PAN and LRMS information into spatial, global Fourier and local Fourier information, and fuse them independently. For information integration, the above hierarchical fused information is processed to further enhance their relationships and undergo comprehensive integration. Our method extend a new space for exploring the relationships of PAN and LRMS images, enhancing the integration of spatial-frequency information. Extensive experiments robustly validate the effectiveness of the proposed network, showcasing its superior performance compared to other state-of-the-art methods and generalization in real-world scenes and other fusion tasks as a general image fusion framework.
MCNet: Rethinking the Core Ingredients for Accurate and Efficient Homography Estimation
Haokai Zhu · Si-Yuan Cao · Jianxin Hu · Sitong Zuo · Beinan Yu · Jiacheng Ying · Junwei Li · Hui-Liang Shen
We propose Multiscale Correlation searching homography estimation Network, namely MCNet, an iterative deep homography estimation architecture. Different from previous approaches that achieve iterative refinement by correlation searching within a single scale, MCNet combines the multiscale strategy with correlation searching incurring nearly ignored computational overhead. Moreover, MCNet adopts a Fine-Grained Optimization loss function, named FGO loss, to further boost the network training at the convergent stage, which can improve the estimation accuracy without additional computational overhead. According to our experiments, using the above two simple strategies can produce significant homography estimation accuracy with considerable efficiency. We show that MCNet achieves state-of-the-art performance on a variety of datasets, including common scene MSCOCO, cross-modal scene GoogleEarth and GoogleMap, and dynamic scene SPID. Compared to the previous SOTA method, 2-scale RHWF, our MCNet reduces inference time, FLOPs, parameter cost, and memory cost by 78.9%, 73.5%, 34.1%, and 33.2% respectively, while achieving 20.5% (MSCOCO), 43.4% (GoogleEarth), and 41.1% (GoogleMap) mean average corner error (MACE) reduction. Source code is available at https://github.com/zjuzhk/MCNet.
Contrastive Pre-Training with Multi-View Fusion for No-Reference Point Cloud Quality Assessment
Ziyu Shan · Yujie Zhang · Qi Yang · Haichen Yang · Yiling Xu · Jenq-Neng Hwang · Xiaozhong Xu · Shan Liu
No-reference point cloud quality assessment (NR-PCQA) aims to automatically evaluate the perceptual quality of distorted point clouds without available reference, which have achieved tremendous improvements due to the utilization of deep neural networks. However, learning-based NR-PCQA methods suffer from the scarcity of labeled data and usually perform suboptimally in terms of generalization. To solve the problem, we propose a novel contrastive pre-training framework tailored for PCQA (CoPA), which enables the pre-trained model to learn quality-aware representations from unlabeled data. To obtain anchors in the representation space, we project point clouds with different distortions into images and randomly mix their local patches to form mixed images with multiple distortions. Utilizing the generated anchors, we constrain the pre-training process via a quality-aware contrastive loss following the philosophy that perceptual quality is closely related to both content and distortion. Furthermore, in the model fine-tuning stage, we propose a semantic-guided multi-view fusion module to effectively integrate the features of projected images from multiple perspectives. Extensive experiments show that our method outperforms the state-of-the-art PCQA methods on popular benchmarks. Further investigations demonstrate that CoPA can also benefit existing learning-based PCQA models.
MuGE: Multiple Granularity Edge Detection
Caixia Zhou · Yaping Huang · Mengyang Pu · Qingji Guan · Ruoxi Deng · Haibin Ling
Edge segmentation is well-known to be subjective due to personalized annotation styles and preferred granularity. However, most existing deterministic edge detection methods only produce a single edge map for one input image. We argue that generating multiple edge maps is more reasonable than generating a single one considering the subjectivity and ambiguity of the edges.Thus motivated, in this paper we propose multiple granularity edge detection, called MuGE, which can produce a wide range of edge maps, from approximate object contours to fine texture edges. Specifically, we first propose to design an edge granularity network to estimate the edge granularity from an individual edge annotation. Subsequently, to guide the generation of diversified edge maps, we integrate such edge granularity into the multi-scale feature maps in the spatial domain. Meanwhile, we decompose the feature maps into low-frequency and high-frequency parts, where the encoded edge granularity is further fused into the high-frequency part to achieve more precise control over the details of the produced edge maps. Compared to previous methods, MuGE can not only generate multiple edge maps at different controllable granularities but also achieve a competitive performance on the BSDS500 and Multicue datasets.
KVQ: Kwai Video Quality Assessment for Short-form Videos
Yiting Lu · Xin Li · Yajing Pei · Kun Yuan · Qizhi Xie · Yunpeng Qu · Ming Sun · Chao Zhou · Zhibo Chen
Short-form UGC video platforms, like Kwai and TikTok, have been an emerging and irreplaceable mainstream media form, thriving on user-friendly engagement, and kaleidoscope creation, etc. However, the advancing contentgeneration modes, e.g., special effects, and sophisticated processing workflows, e.g., de-artifacts, have introduced significant challenges to recent UGC video quality assessment: (i) the ambiguous contents hinder the identification of quality-determined regions. (ii) the diverse and complicated hybrid distortions are hard to distinguish. To tackle the above challenges and assist in the development of short-form videos, we establish the first large-scale Kwai short Video database for Quality assessment, termed KVQ, which comprises 600 user-uploaded short videos and 3600processed videos through the diverse practical processing workflows, including pre-processing, transcoding, and enhancement. Among them, the absolute quality score of each video and partial ranking score among indistinguish samples are provided by a team of professional researchers. specializing in image processing. Based on this database, we propose the first short-form video quality evaluator,i.e., KSVQE, which enables the quality evaluator to identify the quality-determined semantics with the content understanding of large vision language models (i.e., CLIP) and distinguish the distortions with the distortion understanding module. Experimental results have shown the effectiveness of KSVQE on our KVQ database and popular VQA databases. The project can be found at https://lixinustc.github.io/projects/KVQ/.
Image denoising is a fundamental task in computer vision. While prevailing deep learning-based supervised and self-supervised methods have excelled in eliminating in-distribution noise, their susceptibility to out-of-distribution (OOD) noise remains a significant challenge. The recent emergence of contrastive language-image pre-training (CLIP) model has showcased exceptional capabilities in open-world image recognition and segmentation. Yet, the potential for leveraging CLIP to enhance the robustness of low-level tasks remains largely unexplored. This paper uncovers that certain dense features extracted from the frozen ResNet image encoder of CLIP exhibit distortion-invariant and content-related properties, which are highly desirable for generalizable denoising. Leveraging these properties, we devise an asymmetrical encoder-decoder denoising network, which incorporates dense features including the noisy image and its multi-scale features from the frozen ResNet encoder of CLIP into a learnable image decoder to achieve generalizable denoising. The progressive feature augmentation strategy is further proposed to mitigate feature overfitting and improve the robustness of the learnable decoder. Extensive experiments and comparisons conducted across diverse OOD noises, including synthetic noise, real-world sRGB noise, and low-dose CT image noise, demonstrate the superior generalization ability of our method.
Improved Implicit Neural Representation with Fourier Reparameterized Training
Kexuan Shi · Xingyu Zhou · Shuhang Gu
Implicit Neural Representation (INR) as a mighty representation paradigm has achieved success in various computer vision tasks recently. Due to the low-frequency bias issue of vanilla multi-layer perceptron (MLP), existing methods have investigated advanced techniques, such as positional encoding and periodic activation function, to improve the accuracy of INR. In this paper, we connect the network training bias with the reparameterization technique and theoretically prove that weight reparameterization could provide us a chance to alleviate the spectral bias of MLP. Based on our theoretical analysis, we propose a Fourier reparameterization method which learns coefficient matrix of fixed Fourier bases to compose the weights of MLP. We evaluate the proposed Fourier reparameterization method on different INR tasks with various MLP architectures, including vanilla MLP, MLP with positional encoding and MLP with advanced activation function, etc. The superiority approximation results on different MLP architectures clearly validate the advantage of our proposed method. Armed with our Fourier reparameterization method, better INR with more textures and less artifacts can be learned from the training data. The codes are available at https: //github.com/LabShuHangGU/FR-INR.
Deep Video Inverse Tone Mapping Based on Temporal Clues
Yuyao Ye · Ning Zhang · Yang Zhao · Hongbin Cao · Ronggang Wang
Inverse tone mapping (ITM) aims to reconstruct high dynamic range (HDR) radiance from low dynamic range (LDR) content. Although many deep image ITM methods can generate impressive results, the field of video ITM is still to be explored. Processing video sequences by image ITM methods may cause temporal inconsistency. Besides, they aren't able to exploit the potentially useful information in the temporal domain. In this paper, we analyze the process of video filming, and then propose a Global Sample and Local Propagate strategy to better find and utilize temporal clues. To better realize the proposed strategy, we design modules named Incremental Clue Aggregation Module and Feature and Clue Propagation Module. They can align and fuse frames effectively under the condition of brightness changes and propagate features and temporal clues to all frames efficiently. Our temporal clues based video ITM method can recover realistic and temporal consistent results with high fidelity in over-exposed regions. Qualitative and quantitative experiments on public datasets show that the proposed method has significant advantages over existing methods.
Boosting Flow-based Generative Super-Resolution Models via Learned Prior
Li-Yuan Tsao · Yi-Chen Lo · Chia-Che Chang · Hao-Wei Chen · Roy Tseng · Chien Feng · Chun-Yi Lee
Flow-based super-resolution (SR) models have demonstrated astonishing capabilities in generating high-quality images. However, these methods encounter several challenges during image generation, such as grid artifacts, exploding inverses, and suboptimal results due to a fixed sampling temperature. To overcome these issues, this work introduces a conditional learned prior to the inference phase of a flow-based SR model. This prior is a latent code predicted by our proposed latent module conditioned on the low-resolution image, which is then transformed by the flow model into an SR image. Our framework is designed to seamlessly integrate with any contemporary flow-based SR model without modifying its architecture or pre-trained weights. We evaluate the effectiveness of our proposed framework through extensive experiments and ablation analyses. The proposed framework successfully addresses all the inherent issues in flow-based SR models and enhances their performance in various SR scenarios. Our code is available at: https://github.com/liyuantsao/FlowSR-LP
Look-Up Table Compression for Efficient Image Restoration
Yinglong Li · Jiacheng Li · Zhiwei Xiong
Look-Up Table (LUT) has recently gained increasing attention for restoring High-Quality (HQ) images from Low-Quality (LQ) observations, thanks to its high computational efficiency achieved through a ``space for time'' strategy of caching learned LQ-HQ pairs. However, incorporating multiple LUTs for improved performance comes at the cost of a rapidly growing storage size, which is ultimately restricted by the allocatable on-device cache size. In this work, we propose a novel LUT compression framework to achieve a better trade-off between storage size and performance for LUT-based image restoration models. Based on the observation that most cached LQ image patches are distributed along the diagonal of a LUT, we devise a Diagonal-First Compression (DFC) framework, where diagonal LQ-HQ pairs are preserved and carefully re-indexed to maintain the representation capacity, while non-diagonal pairs are aggressively subsampled to save storage. Extensive experiments on representative image restoration tasks demonstrate that our DFC framework significantly reduces the storage size of LUT-based models (including our new design) while maintaining their performance. For instance, DFC saves up to 90\% of storage at a negligible performance drop for $\times 4$ super-resolution.
Latent Modulated Function for Computational Optimal Continuous Image Representation
Zongyao He · Zhi Jin
The recent work Local Implicit Image Function (LIIF) and subsequent Implicit Neural Representation (INR) based works have achieved remarkable success in Arbitrary-Scale Super-Resolution (ASSR) by using MLP to decode Low-Resolution (LR) features. However, these continuous image representations typically implement decoding in High-Resolution (HR) High-Dimensional (HD) space, leading to a quadratic increase in computational cost and seriously hindering the practical applications of ASSR. To tackle this problem, we propose a novel Latent Modulated Function (LMF), which decouples the HR-HD decoding process into shared latent decoding in LR-HD space and independent rendering in HR Low-Dimensional (LD) space, thereby realizing the first computational optimal paradigm of continuous image representation. Specifically, LMF utilizes an HD MLP in latent space to generate latent modulations of each LR feature vector. This enables a modulated LD MLP in render space to quickly adapt to any input feature vector and perform rendering at arbitrary resolution.Furthermore, we leverage the positive correlation between modulation intensity and input image complexity to design a Controllable Multi-Scale Rendering (CMSR) algorithm, offering the flexibility to adjust the decoding efficiency based on the rendering precision. Extensive experiments demonstrate that converting existing INR-based ASSR methods to LMF can reduce the computational cost by up to 99.9%, accelerate inference by up to 57 $\times$, and save up to 76% of parameters, while maintaining competitive performance.
Task-Aware Encoder Control for Deep Video Compression
Xingtong Ge · Jixiang Luo · XINJIE ZHANG · Tongda Xu · Guo Lu · Dailan He · Jing Geng · Yan Wang · Jun Zhang · Hongwei Qin
Prior research on deep video compression (DVC) for machine tasks typically necessitates training a unique codec for each specific task, mandating a dedicated decoder per task. In contrast, traditional video codecs employ a flexible encoder controller, enabling the adaptation of a single codec to different tasks through mechanisms like mode prediction. Drawing inspiration from this, we introduce an innovative encoder controller for deep video compression for machines. This controller features a mode prediction and a Group of Pictures (GoP) selection module. Our approach centralizes control at the encoding stage, allowing for adaptable encoder adjustments across different tasks, such as detection and tracking, while maintaining compatibility with a standard pre-trained DVC decoder. Empirical evidence demonstrates that our method is applicable across multiple tasks with various existing pre-trained DVCs. Moreover, extensive experiments demonstrate that our method outperforms previous DVC by about 25% bitrate for different tasks, with only one pre-trained decoder.
A Dynamic Kernel Prior Model for Unsupervised Blind Image Super-Resolution
Zhixiong Yang · Jingyuan Xia · Shengxi Li · Xinghua Huang · Shuanghui Zhang · Zhen Liu · Yaowen Fu · Yongxiang Liu
Deep learning-based methods have achieved significant successes on solving the blind super-resolution (BSR) problem. However, most of them request supervised pre-training on labelled datasets.This paper proposes an unsupervised kernel estimation model, named dynamic kernel prior (DKP), to realize an unsupervised and pre-training-free learning-based algorithm for solving the BSR problem. DKP can adaptively learn dynamic kernel priors to realize real-time kernel estimation, and thereby enables superior HR image restoration performances. This is achieved by a Markov chain Monte Carlo sampling process on random kernel distributions. The learned kernel prior is then assigned to optimize a blur kernel estimation network, which entails a network-based Langevin dynamic optimization strategy. These two techniques ensure the accuracy of the kernel estimation.DKP can be easily used to replace the kernel estimation models in the existing methods, such as Double-DIP and FKP-DIP, or be added to the off-the-shelf image restoration model, such as diffusion model. In this paper, we incorporate our DKP model with DIP and diffusion model, referring to DIP-DKP and Diff-DKP, for validations. Extensive simulations on Gaussian and motion kernel scenarios demonstrate that the proposed DKP model can significantly improve the kernel estimation with comparable runtime and memory usage, leading to state-of-the-art BSR results. An example code is given in the supplementary.
Zero-Reference Low-Light Enhancement via Physical Quadruple Priors
Wenjing Wang · Huan Yang · Jianlong Fu · Jiaying Liu
Understanding illumination and reducing the need for supervision pose a significant challenge in low-light enhancement. Current approaches are highly sensitive to data usage during training and illumination-specific hyper-parameters, limiting their ability to handle unseen scenarios.In this paper, we propose a new zero-reference low-light enhancement framework trainable solely with normal light images. To accomplish this, we devise an illumination-invariant prior inspired by the theory of physical light transfer. This prior serves as the bridge between normal and low-light images.Then, we develop a prior-to-image framework trained without low-light data.During testing, this framework is able to restore our illumination-invariant prior back to images, automatically achieving low-light enhancement.Within this framework, we leverage a pretrained generative diffusion model for model ability, introduce a bypass decoder to handle detail distortion, as well as offer a lightweight version for practicality.Extensive experiments demonstrate our framework's superiority in various scenarios as well as good interpretability, robustness, and efficiency. Code will be released after the review process.
ParamISP: Learned Forward and Inverse ISPs using Camera Parameters
Woohyeok Kim · Geonu Kim · Junyong Lee · Seungyong Lee · Seung-Hwan Baek · Sunghyun Cho
RAW images are rarely shared mainly due to its excessive data size compared to their sRGB counterparts obtained by camera ISPs. Learning the forward and inverse processes of camera ISPs has been recently demonstrated, enabling physically-meaningful RAW-level image processing on input sRGB images. However, existing learning-based ISP methods fail to handle the large variations in the ISP processes with respect to camera parameters such as ISO and exposure time, and have limitations when used for various applications. In this paper, we propose ParamISP, a learning-based method for forward and inverse conversion between sRGB and RAW images, that adopts a novel neural-network module to utilize camera parameters, which is dubbed as ParamNet.Given the camera parameters provided in the EXIF data, ParamNet converts them into a feature vector to control the ISP networks.Extensive experiments demonstrate that ParamISP achieve superior RAW and sRGB reconstruction results compared to previous methods and it can be effectively used for a variety of applications such as deblurring dataset synthesis, raw deblurring, HDR reconstruction, and camera-to-camera transfer.
FSC: Few-point Shape Completion
Xianzu Wu · Xianfeng Wu · Tianyu Luan · Yajing Bai · Zhongyuan Lai · Junsong Yuan
While previous studies have demonstrated successful 3D object shape completion with a sufficient number of points, they often fail in scenarios when a few points, e.g. tens of points, are observed. Surprisingly, via entropy analysis, we find that even a few points, e.g. 64 points, could retain substantial information to help recover the 3D shape of the object. To address the challenge of shape completion with very sparse point clouds, we then propose Few-point Shape Completion (FSC) model, which contains a novel dual-branch feature extractor for handling extremely sparse inputs, coupled with an extensive branch for maximal point utilization with a saliency branch for dynamic importance assignment. This model is further bolstered by a two-stage revision network that refines both the extracted features and the decoder output, enhancing the detail and authenticity of the completed point cloud. Our experiments demonstrate the feasibility of recovering 3D shapes from a few points. The proposed Few-point Shape Completion (FSC) model outperforms previous methods on both few-point inputs and many-point inputs, and shows good generalizability to different object categories.
Generative Latent Coding for Ultra-Low Bitrate Image Compression
Zhaoyang Jia · Jiahao Li · Bin Li · Houqiang Li · Yan Lu
Most existing image compression approaches perform transform coding in the pixel space to reduce its spatial redundancy. However, they encounter difficulties in achieving both high-realism and high-fidelity at low bitrate, as the pixel-space distortion may not align with human perception. To address this issue, we introduce a Generative Latent Coding (GLC) architecture, which performs transform coding in the latent space of a generative vector-quantized variational auto-encoder (VQ-VAE), instead of in the pixel space. The generative latent space is characterized by greater sparsity, richer semantic and better alignment with human perception, rendering it advantageous for achieving high-realism and high-fidelity compression. Additionally, we introduce a categorical hyper module to reduce the bit cost of hyper-information, and a code-prediction-based supervision to enhance the semantic consistency. Experiments demonstrate that our GLC maintains high visual quality with less than 0.04 bpp on natural images and less than 0.01 bpp on facial images. On the CLIC2020 test set, we achieve the same FID as MS-ILLM with 45\% fewer bits. Furthermore, the powerful generative latent space enables various applications built on our GLC pipeline, such as image restoration and style transfer.
The emerging conditional coding-based neural video codec (NVC) shows superiority over commonly-used residual coding-based codec and the latest NVC already claims to outperform the best traditional codec. However, there still exist critical problems blocking the practicality of NVC. In this paper, we propose a powerful conditional coding-based NVC that solves two critical problems via feature modulation. The first is how to support a wide quality range in a single model. Previous NVC with this capability only supports about 3.8 dB PSNR range on average. To tackle this limitation, we modulate the latent feature of the current frame via the learnable quantization scaler. During the training, we specially design the uniform quantization parameter sampling mechanism to improve the harmonization of encoding and quantization. This results in a better learning of the quantization scaler and helps our NVC support about 11.4 dB PSNR range. The second is how to make NVC still work under a long prediction chain. We expose that the previous SOTA NVC has an obvious quality degradation problem when using a large intra-period setting. To this end, we propose modulating the temporal feature with a periodically refreshing mechanism to boost the quality. Notably, under single intra-frame setting, our codec can achieve 29.7% bitrate saving over previous SOTA NVC with 16% MACs reduction. Our codec serves as a notable landmark in the journey of NVC evolution. The codes are at https://github.com/microsoft/DCVC.
Driving-Video Dehazing with Non-Aligned Regularization for Safety Assistance
Junkai Fan · Jiangwei Weng · Kun Wang · Yijun Yang · Jianjun Qian · Jun Li · Jian Yang
Real driving-video dehazing poses a significant challenge due to the inherent difficulty in acquiring precisely aligned hazy/clear video pairs for effective model training, especially in dynamic driving scenarios with unpredictable weather conditions. In this paper, we propose a pioneering approach that addresses this challenge through a non-aligned regularization strategy. Our core concept involves identifying clear frames that closely match hazy frames, serving as references to supervise a video dehazing network. Our approach comprises two key components: reference matching and video dehazing. Firstly, we introduce a non-aligned reference frame matching module, leveraging an adaptive sliding window to match high-quality reference frames from clear videos. Video dehazing incorporates flow-guided cosine attention sampler and deformable cosine attention fusion modules to enhance spatial multi-frame alignment and fuse their improved information. To validate our approach, we collect a GoProHazy dataset captured effortlessly with GoPro cameras in diverse rural and urban road environments. Extensive experiments demonstrate the superiority of the proposed method over current state-of-the-art methods in the challenging task of real driving-video dehazing. Project page.
Image Processing GNN: Breaking Rigidity in Super-Resolution
Yuchuan Tian · Hanting Chen · Chao Xu · Yunhe Wang
Super-Resolution (SR) reconstructs high-resolution images from low-resolution ones. CNNs and window-attention methods are two major categories of canonical SR models. However, these measures are rigid: in both operations, each pixel gathers the same number of neighboring pixels, hindering their effectiveness in SR tasks. Alternatively, we leverage the flexibility of graphs and propose the Image Processing GNN (IPG) model to break the rigidity that dominates previous SR methods. Firstly, SR is unbalanced in that most reconstruction efforts are concentrated to a small proportion of detail-rich image parts. Hence, we leverage degree flexibility by assigning higher node degrees to detail-rich image nodes. Then in order to construct graphs for SR-effective aggregation, we treat images as pixel node sets rather than patch nodes. Lastly, we hold that both local and global information are crucial for SR performance. In the hope of gathering pixel information from both local and global scales efficiently via flexible graphs, we search node connections within nearby regions to construct local graphs; and find connections within a strided sampling space of the whole image for global graphs. The flexibility of graphs boosts the SR performance of the IPG model. Experiment results on various datasets demonstrates that the proposed IPG outperforms State-of-the-Art baselines. Codes are available at https://github.com/huawei-noah/Efficient-Computing/tree/master/LowLevel/IPG.
CFAT: Unleashing Triangular Windows for Image Super-resolution
Abhisek Ray · Gaurav Kumar · Maheshkumar Kolekar
Transformer-based models have revolutionized the field of image super-resolution by harnessing their inherent ability to capture complex contextual features. The overlapping rectangular shifted window technique used in transformer architecture nowadays is a common practice in super-resolution models to improve the quality and robustness of image upscaling. However, it suffers from distortion at the boundaries and has limited unique shifting modes. To overcome these weaknesses, we propose an overlapping triangular window technique that synchronously works with the rectangular one to reduce boundary-level distortion and allow the model to access more unique sifting modes. In this paper, we propose a Composite Fusion Attention Transformer (CFAT) that incorporates triangular-rectangular window-based local attention with a channel-based global attention technique in image super-resolution. As a result, CFAT enables attention mechanisms to be activated on more image pixels and captures long-range, multi-scale features to improve SR performance. The extensive experimental results and ablation study demonstrate the effectiveness of CFAT in the SR domain. Our proposed model shows a significant 0.7 dB performance improvement over other state-of-the-art SR architectures.
Zero-Shot Structure-Preserving Diffusion Model for High Dynamic Range Tone Mapping
Ruoxi Zhu · Shusong Xu · Peiye Liu · Sicheng Li · Yanheng Lu · Dimin Niu · Zihao Liu · Zihao Meng · Li Zhiyong · Xinhua Chen · Yibo Fan
Tone mapping techniques, aiming to convert high dynamic range (HDR) images to high-quality low dynamic range (LDR) images for display, play a more crucial role in real-world vision systems with the increasing application of HDR images. However, obtaining paired HDR and high-quality LDR images is difficult, posing a challenge to deep learning based tone mapping methods. To overcome this challenge, we propose a novel zero-shot tone mapping framework that utilizes shared structure knowledge, allowing us to transfer a pre-trained mapping model from the LDR domain to HDR fields without paired training data. Our approach involves decomposing both the LDR and HDR images into two components: structural information and tonal information. To preserve the original image's structure, we modify the reverse sampling process of a diffusion model and explicitly incorporate the structure information into the intermediate results. Additionally, for improved image details, we introduce a dual-control network architecture that enables different types of conditional inputs to control different scales of the output. Experimental results demonstrate the effectiveness of our approach, surpassing previous state-of-the-art methods both qualitatively and quantitatively. Moreover, our model exhibits versatility and can be applied to other low-level vision tasks without retraining. The code is available at https://github.com/ZSDM-HDR/Zero-Shot-Diffusion-HDR.
Calibrating Multi-modal Representations: A Pursuit of Group Robustness without Annotations
Chenyu You · Yifei Min · Weicheng Dai · Jasjeet Sekhon · Lawrence Staib · James Duncan
Fine-tuning pre-trained vision-language models, like CLIP, has yielded success on diverse downstream tasks. However, several pain points persist for this paradigm: (i) directly tuning entire pre-trained models becomes both time-intensive and computationally costly. Additionally, these tuned models tend to become highly specialized, limiting their practicality for real-world deployment; (ii) recent studies indicate that pre-trained vision-language classifiers may overly depend on spurious features -- patterns that correlate with the target in training data, but are not related to the true labeling function; and (iii) existing studies on mitigating the reliance on spurious features, largely based on the assumption that we can identify such features, does not provide definitive assurance for real-world applications. As a piloting study, this work focuses on exploring mitigating the reliance on spurious features for CLIP without using any group annotation. To this end, we systematically study the existence of spurious correlation on CLIP and CILP+ERM. We first, following recent work on Deep Feature Reweighting (DFR), verify that last-layer retraining can greatly improve group robustness on pretrained CLIP. In view of them, we advocate a lightweight representation calibration method for fine-tuning CLIP, by first generating a calibration set using the pretrained CLIP, and then calibrating representations of samples within this set through contrastive learning, all without the need for group labels. Extensive experiments and in-depth visualizations on several benchmarks validate the effectiveness of our proposals, largely reducing reliance and significantly boosting the model generalization.
Learn from View Correlation: An Anchor Enhancement Strategy for Multi-view Clustering
Suyuan Liu · KE LIANG · Zhibin Dong · Siwei Wang · Xihong Yang · sihang zhou · En Zhu · Xinwang Liu
In recent years, anchor-based methods have achieved promising progress in multi-view clustering. The performances of these methods are significantly affected by the quality of the anchors. However, the anchors generated by previous works solely rely on single-view information, ignoring the correlation among different views. In particular, we observe that similar patterns are more likely to exist between similar views so such correlation information can be leveraged to enhance the quality of the anchors, which is also omitted. To this end, we propose a novel plug-and-play anchor enhancement strategy through view correlation for multi-view clustering. Specifically, we construct a view graph based on aligned initial anchor graphs to explore inter-view correlations. By learning from view correlation, we enhance the anchors of the current view using the relationships between anchors and samples on neighboring views, thereby narrowing the spatial distribution of anchors on similar views. Experimental results on seven datasets demonstrate the superiority of our proposed method over other existing methods. Furthermore, extensive comparative experiments validate the effectiveness of the proposed anchor enhancement module when applied to various anchor-based methods.
Circuit Design and Efficient Simulation of Quantum Inner Product and Empirical Studies of Its Effect on Near-Term Hybrid Quantum-Classic Machine Learning
Hao Xiong · Yehui Tang · Xinyu Ye · Junchi Yan
For the essential operation, namely inner product (IP) as widely adopted in classic computing e.g. matrix multiplication, its quantum counterpart: quantum inner product (QIP), has also been recently theoretically explored with a verifiable lower complexity on quantum computers. However, it remains unclear for the embodiment of the quantum circuits (QC) for QIP, let alone a (thorough) evaluation of the QIP circuits, especially in a practical context in the NISQ era by applying QIP to ML via hybrid quantum-classic pipelines. In this paper, we carefully design the QIP circuits from scratch, whose complexity is in accordance with the theoretical complexity. To make the simulation tractable on classic computers, especially when it is integrated in the gradient-based hybrid ML pipelines, we further devise a highly-efficient simulation scheme by directly simulates the output state. Experiments show that the scheme accelerates the simulation for more than 68k times compared with the previous circuit simulator. This allows our empirical evaluation on typical machine learning tasks, ranging from supervised and self-supervised learning via neural nets, to K-Means clustering. The results show that the calculation error brought by typical quantum mechanisms would incur in general little influence on the final numerical results given sufficient qubits. However, certain tasks e.g. ranking in K-Means could be more sensitive to quantum noise.
Discriminability-Driven Channel Selection for Out-of-Distribution Detection
Yue Yuan · Rundong He · Yicong Dong · Zhongyi Han · Yilong Yin
Out-of-distribution (OOD) detection is essential for deploying machine learning models in open-world environments. Activation-based methods are a key approach in OOD detection, working to mitigate overconfident predictions of OOD data. These techniques rectifying anomalous activations, enhancing the distinguishability between in-distribution (ID) data and OOD data. However, they assume by default that every channel is necessary for OOD detection, and rectify anomalous activations in each channel. Empirical evidence has shown that there is a significant difference among various channels in OOD detection, and discarding some channels can greatly enhance the performance of OOD detection. Based on this insight, we propose \underline{D}iscriminability-\underline{D}riven \underline{C}hannel \underline{S}election~(DDCS), which leverages an adaptive channel selection by estimating the discriminative score of each channel to boost OOD detection. The discriminative score takes inter-class similarity and inter-class variance of training data into account. However, the estimation of discriminative score itself is susceptible to anomalous activations. To better estimate score, we pre-rectify anomalous activations for each channel mildly. The experimental results show that DDCS achieves state-of-the-art performance on CIFAR and ImageNet-1K benchmarks. Moreover, DDCS can generalize to different backbones and OOD scores.
Efficient Hyperparameter Optimization with Adaptive Fidelity Identification
Jiantong Jiang · Zeyi Wen · Atif Mansoor · Ajmal Mian
Hyperparameter Optimization and Neural Architecture Search are powerful in attaining state-of-the-art machine learning models, with Bayesian Optimization (BO) standing out as a mainstream method. Extending BO into the multi-fidelity setting has been an emerging research topic in this field, but faces the challenge of determining an appropriate fidelity for each hyperparameter configuration to fit the surrogate model. To tackle the challenge, we propose a multi-fidelity BO method named FastBO, which excels in adaptively deciding the fidelity for each configuration and providing strong performance while ensuring efficient resource usage. These advantages are achieved through our proposed techniques based on the concepts of efficient point and saturation point for each configuration, which can be obtained from the empirical learning curve of the configuration, estimated from early observations. Extensive experiments demonstrate FastBO's superior anytime performance and efficiency in identifying high-quality configurations and architectures. We also show that our method provides a way to extend any single-fidelity method to the multi-fidelity setting, highlighting the wide applicability of our approach.
Probabilistic Sampling of Balanced K-Means using Adiabatic Quantum Computing
Jan-Nico Zaech · Martin Danelljan · Tolga Birdal · Luc Van Gool
Adiabatic quantum computing (AQC) is a promising approach for discrete and often NP-hard optimization problems. Current AQCs allow to implement problems of research interest, which has sparked the development of quantum representations for many computer vision tasks. Despite requiring multiple measurements from the noisy AQC, current approaches only utilize the best measurement, discarding information contained in the remaining ones. In this work, we explore the potential of using this information for probabilistic balanced k-means clustering. Instead of discarding non-optimal solutions, we propose to use them to compute calibrated posterior probabilities with little additional compute cost. This allows us to identify ambiguous solutions and data points, which we demonstrate on a D-Wave AQC on synthetic tasks and real visual data.
Online Task-Free Continual Generative and Discriminative Learning via Dynamic Cluster Memory
飞 叶 · Adrian Bors
Online Task-Free Continual Learning (OTFCL) aims to learn novel concepts from streaming data without accessing task information. The memory-based approaches have shown remarkable results in OTFCL, but most require accessing supervised signals to implement their sample selection mechanism, limiting their applicability in unsupervised learning. In this study, we address this issue by proposing a novel memory management approach, Dynamic Cluster Memory (DCM), which adaptively builds new memory clusters to capture distribution shifts over time without accessing supervised signals. Specifically, the DCM introduces a novel memory expansion mechanism based on a knowledge discrepancy measure criterion, which evaluates the novelty of the incoming data as the signal for the memory expansion, ensuring a compact memory capacity. Additionally, we propose a new sample selection approach that automatically stores incoming data samples with similar semantic information in the same memory cluster, facilitating knowledge diversity among memory clusters. Furthermore, a novel memory pruning approach is proposed to automatically remove information overlapping memory clusters through a graph relation evaluation, ensuring a fixed memory capacity while maintaining diversity among the samples stored in the memory. The proposed DCM is model-free, plug-and-play, and can be performed in both supervised and unsupervised learning without any modifications. Empirical results on OTFCL experiments show that the proposed DCM outperforms the state-of-the-art while memorizing fewer data samples.
S²MVTC: a Simple yet Efficient Scalable Multi-View Tensor Clustering
Zhen Long · Qiyuan Wang · Yazhou Ren · Yipeng Liu · Ce Zhu
Anchor-based large-scale multi-view clustering has attracted considerable attention for its effectiveness in handling massive datasets. However, current methods mainly seek the consensus embedding feature for clustering by exploring global correlations between anchor graphs or projection matrices.In this paper, we propose a simple yet efficient scalable multi-view tensor clustering (S$^2$MVTC) approach, where our focus is on learning higher-order correlations of embedding features across views. Specifically, r by stacking the embedding features of different views into a tensor and then rotating, we build a novel tensor low-frequency approximation (TLFA) operator to efficiently explore higher-order correlations. Furthermore, to enhance clustering accuracy, consensus constraints are applied to embedding features to ensure inter-view semantic consistency. Experimental results on six large-scale multi-view datasets demonstrate that S$^2$MVTC significantly outperforms state-of-the-art algorithms in terms of clustering performance and CPU execution time, especially when handling massive data.
Spanning Training Progress: Temporal Dual-Depth Scoring (TDDS) for Enhanced Dataset Pruning
xin zhang · Jiawei Du · Weiying Xie · Yunsong Li · Joey Tianyi Zhou
Dataset pruning aims to construct a coreset capable of achieving performance comparable to the original, full dataset. Most existing dataset pruning methods rely on snapshot-based criteria to identify representative samples, often resulting in poor generalization across various pruning and cross-architecture scenarios. Recent studies have addressed this issue by expanding the scope of training dynamics considered, including factors such as forgetting event and probability change, typically using an averaging approach. However, these works struggle to integrate a broader range of training dynamics without overlooking well-generalized samples, which may not be sufficiently highlighted in an averaging manner. In this study, we propose a novel dataset pruning method termed as Temporal Dual-Depth Scoring (TDDS), to tackle this problem. TDDS utilizes a dual-depth strategy to achieve a balance between incorporating extensive training dynamics and identifying representative samples for dataset pruning. In the first depth, we estimate the series of each sample's individual contributions spanning the training progress, ensuring comprehensive integration of training dynamics. In the second depth, we focus on the variability of the sample-wise contributions identified in the first depth to highlight well-generalized samples. Extensive experiments conducted on CIFAR and ImageNet datasets verify the superiority of TDDS over previous SOTA methods. Specifically on CIFAR-100, our method achieves 54.51% accuracy with only 10% training data, surpassing random selection by 7.83% and other comparison methods by at least 12.69%. Our codes are available at https://github.com/zhangxin-xd/Dataset-Pruning-TDDS.
An Aggregation-Free Federated Learning for Tackling Data Heterogeneity
Yuan Wang · Huazhu Fu · Renuga Kanagavelu · Qingsong Wei · Yong Liu · Rick Goh
The performance of Federated Learning (FL) hinges on the effectiveness of utilizing knowledge from distributed datasets. Traditional FL methods adopt an aggregate-then-adapt framework, where clients update local models based on a global model aggregated by the server from the previous training round. This process can cause client drift, especially with significant cross-client data heterogeneity, impacting model performance and convergence of the FL algorithm. To address these challenges, we introduce FedAF, a novel aggregation-free FL algorithm. In this framework, clients collaboratively learn condensed data by leveraging peer knowledge, the server subsequently trains the global model using the condensed data and soft labels received from the clients. FedAF inherently avoids the issue of client drift, enhances the quality of condensed data amid notable data heterogeneity, and improves the global model performance. Extensive numerical studies on several popular benchmark datasets show FedAF surpasses various state-of-the-art FL algorithms in handling label-skew and feature-skew data heterogeneity, leading to superior global model accuracy and faster convergence.
POCE: Primal Policy Optimization with Conservative Estimation for Multi-constraint Offline Reinforcement Learning
Jiayi Guan · Li Shen · Ao Zhou · Lusong Li · Han Hu · Xiaodong He · Guang Chen · Changjun Jiang
Multi-constraint offline reinforcement learning (RL) promises to learn policies that satisfy both cumulative and state-wise costs from offline datasets. This arrangement provides an effective approach for the widespread application of RL in high-risk scenarios where both cumulative and state-wise costs need to be considered simultaneously. However, previously constrained offline RL algorithms are primarily designed to handle single-constraint problems related to cumulative cost, which faces challenges when addressing multi-constraint tasks that involve both cumulative and state-wise costs. In this work, we propose a novel Primal policy Optimization with Conservative Estimation algorithm (POCE) to address the problem of multi-constraint offline RL. Concretely, we reframe the objective of multi-constraint offline RL by introducing the concept of Maximum Markov Decision Processes (MMDP). Subsequently, we present a primal policy optimization algorithm to confront the multi-constraint problems, which improves the stability and convergence speed of model training. Furthermore, we propose a conditional Bellman operator to estimate cumulative and state-wise Q-values, reducing the extrapolation error caused by out-of-distribution (OOD) actions. Finally, extensive experiments demonstrate that the POCE algorithm achieves competitive performance across multiple experimental tasks, particularly outperforming baseline algorithms in terms of safety.
SVDinsTN: A Tensor Network Paradigm for Efficient Structure Search from Regularized Modeling Perspective
Yu-Bang Zheng · Xile Zhao · Junhua Zeng · Chao Li · Qibin Zhao · Heng-Chao Li · Ting-Zhu Huang
Tensor network (TN) representation is a powerful technique for computer vision and machine learning. TN structure search (TN-SS) aims to search for a customized structure to achieve a compact representation, which is a challenging NP-hard problem. Recent "sampling-evaluation"-based methods require sampling an extensive collection of structures and evaluating them one by one, resulting in prohibitively high computational costs. To address this issue, we propose a novel TN paradigm, named SVD-inspired TN decomposition (SVDinsTN), which allows us to efficiently solve the TN-SS problem from a regularized modeling perspective, eliminating the repeated structure evaluations. To be specific, by inserting a diagonal factor for each edge of the fully-connected TN, SVDinsTN allows us to calculate TN cores and diagonal factors simultaneously, with the factor sparsity revealing a compact TN structure. In theory, we prove a convergence guarantee for the proposed method. Experimental results demonstrate that the proposed method achieves approximately 100~1000 times acceleration compared to the state-of-the-art TN-SS methods while maintaining a comparable level of representation ability.
Fine-Grained Bipartite Concept Factorization for Clustering
Chong Peng · Pengfei Zhang · Yongyong Chen · zhao kang · Chenglizhao Chen · Qiang Cheng
In this paper, we propose a novel concept factorization method that seeks factor matrices using a cross-order positive semi-definite neighbor graph, which provides comprehensive and complementary neighbor information of the data. The factor matrices are learned with bipartite graph partitioning, which exploits explicit cluster structure of the data and is more geared towards clustering application. We develop an effective and efficient optimization algorithm for our method, and provide elegant theoretical results about the convergence. Extensive experimental results confirm the effectiveness of the proposed method.
Embodied Multi-Modal Agent trained by an LLM from a Parallel TextWorld
Yijun Yang · Tianyi Zhou · kanxue Li · Dapeng Tao · Lusong Li · Li Shen · Xiaodong He · Jing Jiang · Yuhui Shi
While large language models (LLMs) excel in a simulated world of texts, they struggle to interact with the more realistic world without perceptions of other modalities such as visual or audio signals. Although vision-language models (VLMs) integrate LLM modules (1) aligned with static image features, and (2) may possess prior knowledge of world dynamics (as demonstrated in the text world), they have not been trained in an embodied visual world and thus cannot align with its dynamics. On the other hand, training an embodied agent in a noisy visual world without expert guidance is often challenging and inefficient. In this paper, we train a VLM agent living in a visual world using an LLM agent excelling in a parallel text world. Specifically, we distill LLM's reflection outcomes (improved actions by analyzing mistakes) in a text world's tasks to finetune the VLM on the same tasks of the visual world, resulting in an Embodied Multi-Modal Agent (EMMA) quickly adapting to the visual world dynamics. Such cross-modality imitation learning between the two parallel worlds is achieved by a novel DAgger-DPO algorithm, enabling EMMA to generalize to a broad scope of new tasks without any further guidance from the LLM expert. Extensive evaluations on the ALFWorld benchmark's diverse tasks highlight EMMA's superior performance to SOTA VLM-based agents, e.g., 20%-70% improvement in the success rate.
The Mirrored Influence Hypothesis: Efficient Data Influence Estimation by Harnessing Forward Passes
Myeongseob Ko · Feiyang Kang · Weiyan Shi · Ming Jin · Zhou Yu · Ruoxi Jia
Large-scale black-box models have become ubiquitous across numerous applications. Understanding the influence of individual training data sources on predictions made by these models is crucial for improving their trustworthiness. Current influence estimation techniques involve computing gradients for every training point or repeated training on different subsets. These approaches face obvious computational challenges when scaled up to large datasets and models. In this paper, we introduce and explore the Mirrored Influence Hypothesis, highlighting a reciprocal nature of influence between training and test data. Specifically, it suggests that evaluating the influence of training data on test predictions can be reformulated as an equivalent, yet inverse problem: assessing how the predictions for training samples would be altered if the model were trained on specific test samples. Through both empirical and theoretical validations, we demonstrate the wide applicability of our hypothesis. Inspired by this, we introduce a new method for estimating the influence of training data, which requires calculating gradients for specific test samples, paired with a forward pass for each training point. This approach can capitalize on the common asymmetry in scenarios where the number of test samples under concurrent examination is much smaller than the scale of the training dataset, thus gaining a significant improvement in efficiency compared to existing approaches. We demonstrate the applicability of our method across a range of scenarios, including data attribution in diffusion models, data leakage detection, analysis of memorization, mislabeled data detection, and tracing behavior in language models.
Improved Baselines with Visual Instruction Tuning
Haotian Liu · Chunyuan Li · Yuheng Li · Yong Jae Lee
Large multimodal models (LMM) have recently shown encouraging progress with visual instruction tuning. In this paper, we present the first systematic study to investigate the design choices of LMMs in a controlled setting under the LLaVA framework. We show that the fully-connected vision-language connector in LLaVA is surprisingly powerful and data-efficient. With simple modifications to LLaVA, namely, using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with response formatting prompts, we establish stronger baselines that achieve state-of-the-art across 11 benchmarks. Our final 13B checkpoint uses merely 1.2M publicly available data, and finishes full training in ~1 day on a single 8-A100 node. Furthermore, we present some early exploration of open problems in LMMs, including scaling to higher resolution inputs, compositional capabilities, and model hallucination, etc. We hope this makes state-of-the-art LMM research more accessible. Code and model will be publicly available.
Linguistic-Aware Patch Slimming Framework for Fine-grained Cross-Modal Alignment
Zheren Fu · Lei Zhang · Hou Xia · Zhendong Mao
Cross-modal alignment aims to build a bridge connecting vision and language. It is an important multi-modal task that efficiently learns the semantic similarities between images and texts. Traditional fine-grained alignment methods heavily rely on pre-trained object detectors to extract region features for subsequent region-word alignment, thereby incurring substantial computational costs for region detection and error propagation issues for two-stage training. In this paper, we focus on the mainstream vision transformer, incorporating patch features for patch-word alignment, while addressing the resultant issue of visual patch redundancy and patch ambiguity for semantic alignment. We propose a novel Linguistic-Aware Patch Slimming (LAPS) framework for fine-grained alignment, which explicitly identifies redundant visual patches with language supervision and rectifies their semantic and spatial information to facilitate more effective and consistent patch-word alignment. Extensive experiments on various evaluation benchmarks and model backbones show LAPS outperforms the state-of-the-art fine-grained alignment methods by 5%-15% rSum.
FlowVQTalker: High-Quality Emotional Talking Face Generation through Normalizing Flow and Quantization
Shuai Tan · Bin Ji · Ye Pan
Generating emotional talking faces is a practical yet challenging endeavor. To create a lifelike avatar, we draw upon two critical insights from a human perspective: 1) The connection between audio and the non-deterministic facial dynamics, encompassing expressions, blinks, poses, should exhibit synchronous and one-to-many mapping. 2) Vibrant expressions are often accompanied by emotion-aware high-definition (HD) textures and finely detailed teeth. However, both aspects are frequently overlooked by existing methods. To this end, this paper proposes using normalizing Flow and Vector-Quantization modeling to produce emotional talking faces that satisfy both insights concurrently (FlowVQTalker). Specifically, we develop a flow-based coefficient generator that encodes the dynamics of facial emotion into a multi-emotion-class latent space represented as a mixture distribution. The generation process commences with random sampling from the modeled distribution, guided by the accompanying audio, enabling both lip-synchronization and the uncertain nonverbal facial cues generation. Furthermore, our designed vector-quantization image generator treats the creation of expressive facial images as a code query task, utilizing a learned codebook to provide rich, high-quality textures that enhance the emotional perception of the results. Extensive experiments are conducted to showcase the effectiveness of our approach.
Audio-Visual Segmentation via Unlabeled Frame Exploitation
Jinxiang Liu · Yikun Liu · Ferenas · Chen Ju · Ya Zhang · Yanfeng Wang
Audio-visual segmentation (AVS) aims to segment the sounding objects in video frames. Although great progress has been witnessed, we experimentally reveal that current methods reach marginal performance gain within the use of the unlabeled frames, leading to the underutilization issue. To fully explore the potential of the unlabeled frames for AVS, we explicitly divide them into two categories based on their temporal characteristics, i.e., neighboring frame (NF) and distant frame (DF). NFs, temporally adjacent to the labeled frame, often contain rich motion information that assists in the accurate localization of sounding objects. Contrary to NFs, DFs have long temporal distances from the labeled frame, which share semantic-similar objects with appearance variations. Considering their unique characteristics, we propose a versatile framework that effectively leverages them to tackle AVS. Specifically, for NFs, we exploit the motion cues as the dynamic guidance to improve the objectness localization. Besides, we exploit the semantic cues in DFs by treating them as valid augmentations to the labeled frames, which are then used to enrich data diversity in a self-training manner. Extensive experimental results demonstrate the versatility and superiority of our method, unleashing the power of the abundant unlabeled frames.
Binding Touch to Everything: Learning Unified Multimodal Tactile Representations
Fengyu Yang · Chao Feng · Ziyang Chen · Hyoungseob Park · Daniel Wang · Yiming Dou · Ziyao Zeng · xien chen · Suchisrit Gangopadhyay · Andrew Owens · Alex Wong
The ability to associate touch with other modalities has huge implications for humans and computational systems. However, multimodal learning with touch remains challenging due to the expensive data collection process and non-standardized sensor outputs. We introduce UniTouch, a unified tactile model for vision-based touch sensors connected to multiple modalities, including vision, language and sound. We achieve this by aligning our UniTouch embeddings to pretrained image embeddings already associated with a variety of other modalities. We further propose learnable sensor-specific tokens, allowing the model to learn from a set of heterogeneous tactile sensors, all at the same time. UniTouch is capable of conducting various touch sensing tasks in the zero-shot setting, from robot grasping prediction to touch image question and answering. To the best of our knowledge, UniTouch is the first to demonstrate such capabilities.
MoDE: CLIP Data Experts via Clustering
Jiawei Ma · Po-Yao Huang · Saining Xie · Shang-Wen Li · Luke Zettlemoyer · Shih-Fu Chang · Wen-tau Yih · Hu Xu
The success of contrastive language-image pretraining (CLIP) relies on the supervision from the pairing between images and captions, which tends to be noisy in web-crawled data. We present Mixture of Data Experts (MoDE) and learn a system of CLIP experts via clustering. Each data expert is trained on one data cluster, being less sensitive to false negative noises in other clusters. At inference time, we ensemble their outputs by applying weights determined through the correlation between task metadata and cluster conditions. To estimate the correlation precisely, the samples in one cluster should be semantically similar, but the number of data experts should still be reasonable for training and inference. As such, we consider the hierarchical structure in human language and propose to use fine-grained cluster centers to represent each data expert at a coarse-grained level. Experimental studies show that four CLIP data experts on ViT-B/16 outperform the ViT-L/14 by OpenAI CLIP and OpenCLIP on zero-shot image classification but with less (<30%) training cost. Meanwhile, MoDE can train all data expert asynchronously and can flexibly include new data experts. Model and code will be available.
X-MIC: Cross-Modal Instance Conditioning for Egocentric Action Generalization
Anna Kukleva · Fadime Sener · Edoardo Remelli · Bugra Tekin · Eric Sauser · Bernt Schiele · Shugao Ma
Lately, there has been growing interest in adapting vision-language models (VLMs) to image and third-person video classification due to their success in zero-shot recognition. However, the adaptation of these models to egocentric videos has been largely unexplored. To address this gap, we propose a simple yet effective cross-modal adaptation framework, which we call X-MIC. Using a video adapter, our pipeline learns to align frozen text embeddings to each egocentric video directly in the shared embedding space. Our novel adapter architecture retains and improves generalization of the pre-trained VLMs by disentangling learnable temporal modeling and frozen visual encoder. This results in an enhanced alignment of text embeddings to each egocentric video, leading to a significant improvement in cross-dataset generalization. We evaluate our approach on the Epic-Kitchens, Ego4D, and EGTEA datasets for fine-grained cross-dataset action generalization, demonstrating the effectiveness of our method.
PixelLM: Pixel Reasoning with Large Multimodal Model
Zhongwei Ren · Zhicheng Huang · Yunchao Wei · Yao Zhao · Dongmei Fu · Jiashi Feng · Xiaojie Jin
While large multimodal models (LMMs) have achieved remarkable progress, generating pixel-level masks for image reasoning tasks involving multiple open-world targets remains a challenge. To bridge this gap, we introduce PixelLM, an effective and efficient LMM for pixel-level reasoning and understanding. Central to PixelLM are a novel, lightweight pixel decoder and a comprehensive segmentation codebook. The decoder efficiently produces masks from the hidden embeddings of the codebook tokens, which encode detailed target-relevant information. With this design, PixelLM harmonizes with the structure of popular LMMs and avoids the need for additional costly segmentation models. Furthermore, we propose a token fusion method to enhance the model's ability to differentiate between multiple targets, leading to substantially improved mask quality. To advance research in this area, we construct MUSE, a high-quality multi-target reasoning segmentation benchmark. PixelLM excels across various pixel-level image reasoning and understanding tasks, outperforming well-established methods in multiple benchmarks, including MUSE, and multi-referring segmentation. Comprehensive ablations confirm the efficacy of each proposed component. All code, models, and datasets will be publicly available.
Probing Synergistic High-Order Interaction in Infrared and Visible Image Fusion
Naishan Zheng · Man Zhou · Jie Huang · Junming Hou · Haoying Li · Yuan Xu · Feng Zhao
Infrared and visible image fusion aims to generate a fused image by integrating and distinguishing complementary information from multiple sources. While the cross-attention mechanism with global spatial interactions appears promising, it only capture second-order spatial interactions, neglecting higher-order interactions in both spatial and channel dimensions. This limitation hampers the exploitation of synergies between multi-modalities. To bridge this gap, we introduce a Synergistic High-order Interaction Paradigm (SHIP), designed to systematically investigate the spatial fine-grained and global statistics collaborations between infrared and visible images across two fundamental dimensions: 1) Spatial dimension: we construct spatial fine-grained interactions through element-wise multiplication, mathematically equivalent to global interactions, and then foster high-order formats by iteratively aggregating and evolving complementary information, enhancing both efficiency and flexibility; 2) Channel dimension: expanding on channel interactions with first-order statistics (mean), we devise high-order channel interactions to facilitate the discernment of inter-dependencies between source images based on global statistics. Harnessing high-order interactions significantly enhances our model's ability to exploit multi-modal synergies, leading to superior performance over state-of-the-art alternatives, as shown through comprehensive experiments across various benchmarks. Code is available at https://github.com/zheng980629/SHIP.
The Audio-Visual Conversational Graph: From an Egocentric-Exocentric Perspective
Wenqi Jia · Miao Liu · Hao Jiang · Ishwarya Ananthabhotla · James Rehg · Vamsi Krishna Ithapu · Ruohan Gao
In recent years, the thriving development of research related to egocentric videos has provided a unique perspective for the study of conversational interactions, where both visual and audio signals play a crucial role. While most prior work focus on learning about behaviors that directly involve the camera wearer, we introduce the Ego-Exocentric Conversational Graph Prediction problem, marking the first attempt to infer exocentric conversational interactions from egocentric videos. We propose a unified multi-modal framework---Audio-Visual Conversational Attention (AV-CONV), for the joint prediction of conversation behaviors---speaking and listening---for both the camera wearer as well as all other social partners present in the egocentric video. Specifically, we adopt the self-attention mechanism to model the representations across-time, across-subjects, and across-modalities. To validate our method, we conduct experiments on a challenging egocentric video dataset that includes multi-speaker and multi-conversation scenarios. Our results demonstrate the superior performance of our method compared to a series of baselines. We also present detailed ablation studies to assess the contribution of each component in our model.
MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World
Yining Hong · Zishuo Zheng · Peihao Chen · Yian Wang · Junyan Li · Chuang Gan
Human beings possess the capability to multiply a mélange of multisensory cues while actively exploring and interacting with the 3D world.Current multi-modal large language models, however, passively absorb sensory data as inputs, lacking the capacity to actively interact with the objects in the 3D environment and dynamically collect their multisensory information.To usher in the study of this area, we propose MultiPLY, a multisensory embodied LLM that could incorporate multisensory interactive data, including visual, audio, tactile, and thermal information into large language models, thereby establishing the correlation among words, actions, and percepts. To this end, we first collect Multisensory Universe, a large-scale multisensory interaction dataset comprising 500k data by deploying an LLM-powered embodied agent to engage with the 3D environment. To perform instruction tuning with pre-trained LLM on such generated data, we first encode the 3D scene as abstracted object-centric representations and then introduce action tokens denoting that the embodied agent takes the actions within the environment, and state tokens that represent the multisensory state observations of the agent at each time step. In the inference time, MultiPLY could generate action tokens, instructing the agent to take the action in the environment and obtain the next multisensory state observation. The observation is then appended back to the LLM via state tokens to generate subsequent text or action tokens. We demonstrate MultiPLY outperforms baselines by a large margin through a diverse set of embodied tasks involving object retrieval, tool use, multisensory captioning, and task decomposition.
GPT4Point: A Unified Framework for Point-Language Understanding and Generation
Zhangyang Qi · Ye Fang · Zeyi Sun · Xiaoyang Wu · Tong Wu · Jiaqi Wang · Dahua Lin · Hengshuang Zhao
Multimodal Large Language Models (MLLMs) have excelled in 2D image-text comprehension and image generation, but their understanding of the 3D world is notably deficient, limiting progress in 3D language understanding and generation. To solve this problem, we introduce GPT4Point, an innovative groundbreaking point-language multimodal model designed specifically for unified 3D object understanding and generation within the MLLM framework. GPT4Point as a powerful 3D MLLM seamlessly can execute a variety of point-text reference tasks such as point-cloud captioning and Q&A. Additionally, GPT4Point is equipped with advanced capabilities for controllable 3D generation, it can get high-quality results through a low-quality point-text feature maintaining the geometric shapes and colors. To support the expansive needs of 3D object-text pairs, we develop Pyramid-XL, a point-language dataset annotation engine. It constructs a large-scale database over 1M objects of varied text granularity levels from the Objaverse-XL dataset, essential for training GPT4Point. A comprehensive benchmark has been proposed to evaluate 3D point-language understanding capabilities. In extensive evaluations, GPT4Point has demonstrated superior performance in understanding and generation.
LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding Reasoning and Planning
Sijin Chen · Xin Chen · Chi Zhang · Mingsheng Li · Gang Yu · Hao Fei · Hongyuan Zhu · Jiayuan Fan · Tao Chen
Recent progress in Large Multimodal Models (LMM) has opened up great possibilities for various applications in the field of human-machine interactions. However, developing LMMs that can comprehend, reason, and plan in complex and diverse 3D environments remains a challenging topic, especially considering the demand for understanding permutation-invariant point cloud representations of the 3D scene. Existing works seek help from multi-view images by projecting 2D features to 3D space, which inevitably leads to huge computational overhead and performance degradation. In this paper, we present LL3DA, a Large Language 3D Assistant that takes point cloud as the direct input and responds to both text instructions and visual interactions. The additional visual interaction enables LMMs to better comprehend human interactions with the 3D environment and further remove the ambiguities within plain texts. Experiments show that LL3DA achieves remarkable results and surpasses various 3D vision-language models on both 3D Dense Captioning and 3D Question Answering.
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision Language Audio and Action
Jiasen Lu · Christopher Clark · Sangho Lee · Zichen Zhang · Savya Khosla · Ryan Marten · Derek Hoiem · Aniruddha Kembhavi
We present Unified-IO 2, a multimodal and multi-skill unified model capable of following novel instructions. Unified-IO 2 can use text, images, audio, and/or videos as input and can generate text, image, or audio outputs, which is accomplished in a unified way by tokenizing these different inputs and outputs into a shared semantic space that can then be processed by a single encoder-decoder transformer model. Unified-IO 2 is trained from scratch on a custom-built multimodal pre-training corpus and then learns an expansive set of skills through fine-tuning on over 120 datasets, including datasets for segmentation, object detection, image editing, audio localization, video tracking, embodied AI, and 3D detection. To facilitate instruction-following, we add prompts and other data augmentations to these tasks to allow Unified-IO 2 to generalize these skills to new tasks zero-shot.Unified-IO 2 is the first model to be trained on such a diverse and wide-reaching set of skills and unify three separate generation capabilities. Unified-IO 2 achieves state-of-the-art performance on the multi-task GRIT benchmark and achieves strong results on 30 diverse datasets, including SEED-Bench image and video understanding, TIFA image generation, VQA 2.0, ScienceQA, VIMA robotic manipulation, VGG-Sound, and Kinetics-Sounds and can perform unseen tasks and generate free-form responses. We release our model and code to facilitate future work.
SHAP-EDITOR: Instruction-Guided Latent 3D Editing in Seconds
Minghao Chen · Junyu Xie · Iro Laina · Andrea Vedaldi
We propose a novel feed-forward 3D editing framework called Shap-Editor. Prior research on editing 3D objects primarily concentrated on editing individual objects by leveraging off-the-shelf 2D image editing networks, utilizing a process called 3D distillation, which transfers knowledge from the 2D network to the 3D asset. Distillation necessitates at least tens of minutes per asset to attain satisfactory editing results, thus it is not very practical. In contrast, we ask whether 3D editing can be carried out directly by a feed-forward network, eschewing test-time optimization. In particular, we hypothesise that this process can be greatly simplified by first encoding 3D objects into a suitable latent space. We validate this hypothesis by building upon the latent space of Shap-E. We demonstrate that direct 3D editing in this space is possible and efficient by learning a feed-forward editor network that only requires approximately one second per edit. Our experiments show that Shap-Editor generalises well to both in-distribution and out-of-distribution 3D assets with different prompts and achieves superior performance compared to methods that carry out test-time optimisation for each edited instance.
Learning to Visually Localize Sound Sources from Mixtures without Prior Source Knowledge
Dongjin Kim · Sung Jin Um · Sangmin Lee · Jung Uk Kim
The goal of the multi-sound source localization task is to localize sound sources from the mixture individually. While recent multi-sound source localization methods have shown improved performance, they face challenges due to their reliance on prior information about the number of objects to be separated. In this paper, to overcome this limitation, we present a novel multi-sound source localization method that can perform localization without prior knowledge of the number of sound sources. To achieve this goal, we propose an iterative object identification (IOI) module, which can recognize sound-making objects in an iterative manner. After finding the regions of sound-making objects, we devise object similarity-aware clustering (OSC) loss to guide the IOI module to effectively combine regions of the same object but also distinguish between different objects and backgrounds. It enables our method to perform accurate localization of sound-making objects without any prior knowledge. Extensive experimental results on the MUSIC and VGGSound benchmarks show the significant performance improvements of the proposed method over the existing methods for both single and multi-source. Our code is available at: https://github.com/VisualAIKHU/NoPrior_MultiSSL
Bring Event into RGB and LiDAR: Hierarchical Visual-Motion Fusion for Scene Flow
Hanyu Zhou · Yi Chang · Zhiwei Shi
Single RGB or LiDAR is the mainstream sensor for the challenging scene flow, which relies heavily on visual features to match motion features. Compared with single modality, existing methods adopt a fusion strategy to directly fuse the cross-modal complementary knowledge in motion space. However, these direct fusion methods may suffer the modality gap due to the visual intrinsic heterogeneous nature between RGB and LiDAR, thus deteriorating motion features. We discover that event has the homogeneous nature with RGB and LiDAR in both visual and motion spaces. In this work, we bring the event as a bridge between RGB and LiDAR, and propose a novel hierarchical visual-motion fusion framework for scene flow, which explores a homogeneous space to fuse the cross-modal complementary knowledge for physical interpretation. In visual fusion, we discover that event has a complementarity (relative v.s. absolute) in luminance space with RGB for high dynamic imaging, and has a complementarity (local boundary v.s. global shape) in scene structure space with LiDAR for structure integrity. In motion fusion, we figure out that RGB, event and LiDAR are complementary (spatial-dense, temporal-dense v.s. spatiotemporal-sparse) to each other in correlation space, which motivates us to fuse their motion correlations for motion continuity. The proposed hierarchical fusion can explicitly fuse the multimodal knowledge to progressively improve scene flow from visual space to motion space. Extensive experiments have been performed to verify the superiority of the proposed method.
Dispel Darkness for Better Fusion: A Controllable Visual Enhancer based on Cross-modal Conditional Adversarial Learning
HAO ZHANG · Linfeng Tang · Xinyu Xiang · Xuhui Zuo · Jiayi Ma
We propose a controllable visual enhancer, named DDBF, which is based on cross-modal conditional adversarial learning and aims to dispel darkness and achieve better visible and infrared modalities fusion. Specifically, a guided restoration module (GRM) is firstly designed to enhance weakened information in the low-light visible modality. The GRM utilizes the light-invariant high-contrast characteristics of the infrared modality as the central target distribution, and constructs a multi-level conditional adversarial sample set to enable continuous controlled brightness enhancement of visible images. Then, we develop an information fusion module (IFM) to integrate the advantageous features of the enhanced visible image and the infrared image. Thanks to customized explicit information preservation and hue fidelity constraints, the IFM produces visually pleasing results with rich textures, significant contrast, and vivid colors. The brightened visible image and the final fused image compose the dual output of our DDBF to meet the diverse visual preferences of users. We evaluate DDBF on the public datasets, achieving state-of-the-art performances of low-light enhancement and information integration that is available for both day and night scenarios. The experiments also demonstrate that our DDBF is effective in improving decision accuracy for object detection and semantic segmentation. Moreover, we offer a user-friendly interface for the convenient application of our model. The code is publicly available at https://github.com/HaoZhang1018/DDBF.
Unraveling Instance Associations: A Closer Look for Audio-Visual Segmentation
Yuanhong Chen · Yuyuan Liu · Hu Wang · Fengbei Liu · Chong Wang · Helen Frazer · Gustavo Carneiro
Audio-visual segmentation (AVS) is a challenging task that involves accurately segmenting sounding objects based on audio-visual cues. The effectiveness of audio-visual learning critically depends on achieving accurate cross-modal alignment between sound and visual objects. Successful audio-visual learning requires two essential components: 1) a challenging dataset with high-quality pixel-level multi-class annotated images associated with audio files, and 2) a model that can establish strong links between audio information and its corresponding visual object. However, these requirements are only partially addressed by current methods, with training sets containing biased audio-visual data, and models that generalise poorly beyond this biased training set. In this work, we propose a new cost-effective strategy to build challenging and relatively unbiased high-quality audio-visual segmentation benchmarks. We also propose a new informative sample mining method for audio-visual supervised contrastive learning to leverage discriminative contrastive samples to enforce cross-modal understanding. We show empirical results that demonstrate the effectiveness of our benchmark. Furthermore, experiments conducted on existing AVS datasets and on our new benchmark show that our method achieves state-of-the-art (SOTA) segmentation accuracy.
DMR: Decomposed Multi-Modality Representations for Frames and Events Fusion in Visual Reinforcement Learning
Haoran Xu · Peixi Peng · Guang Tan · Yuan Li · Xinhai Xu · Yonghong Tian
We explore visual reinforcement learning (RL) using two complementary visual modalities: frame-based RGB camera and event-based Dynamic Vision Sensor (DVS). Existing multi-modality visual RL methods often encounter challenges in effectively extracting task-relevant information from multiple modalities while suppressing the increased noise, only using indirect reward signals instead of pixel-level supervision. To tackle this, we propose a Decomposed Multi-Modality Representation (DMR) framework for visual RL. It explicitly decomposes the inputs into three distinct components: combined task-relevant features (co-features), RGB-specific noise, and DVS-specific noise. The co-features represent the full information from both modalities that is relevant to the RL task; the two noise components, each constrained by a data reconstruction loss to avoid information leak, are contrasted with the co-features to maximize their difference. Extensive experiments demonstrate that, by explicitly separating the different types of information, our approach achieves substantially improved policy performance compared to state-of-the-art approaches.
Text-Guided Variational Image Generation for Industrial Anomaly Detection and Segmentation
Mingyu Lee · Jongwon Choi
We propose a text-guided variational image generation method to address the challenge of getting clean data for anomaly detection in industrial manufacturing. Our method utilizes text information about the target object, learned from extensive text library documents, to generate non-defective data images resembling the input image. The proposed framework ensures that the generated non-defective images align with anticipated distributions derived from textual and image-based knowledge, ensuring stability and generality. Experimental results demonstrate the effectiveness of our approach, surpassing previous methods even with limited non-defective data. Our approach is validated through generalization tests across four baseline models and three distinct datasets. We present an additional analysis to enhance the effectiveness of anomaly detection models by utilizing the generated images.
Tactile-Augmented Radiance Fields
Yiming Dou · Fengyu Yang · Yi Liu · Antonio Loquercio · Andrew Owens
Humans can quickly assess how different parts of a scene would feel if touched. However, this ability still eludes current techniques in scene reconstruction. This work presents a scene representation that brings vision and touch into a shared 3D space, which we define as a tactile-augmented radiance field. This representation capitalizes on two key insights: (i) ubiquitous touch sensors are built on perspective cameras, and (ii) visually and structurally similar regions of a scene share the same tactile features. We leverage these insights to train a conditional diffusion model that, provided with an RGB image and a depth map rendered from a neural radiance field, generates its corresponding tactile ``image''. To train this diffusion model, we collect the largest collection of spatially-aligned visual and tactile data, significantly surpassing the size of the largest prior dataset. Through qualitative and quantitative experiments, we demonstrate the accuracy of our cross-modal generative model and the utility of collected and rendered visual-tactile pairs across a range of downstream tasks.
LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge
Gongwei Chen · Leyang Shen · Rui Shao · Xiang Deng · Liqiang Nie
Multimodal Large Language Models (MLLMs) have endowed LLMs with the ability to perceive and understand multi-modal signals. However, most of the existing MLLMs mainly adopt vision encoders pretrained on coarsely aligned image-text pairs, leading to insufficient extraction and reasoning of visual knowledge. To address this issue, we devise a dual-$\textbf{L}$evel v$\textbf{I}$sual kn$\textbf{O}$wledge e$\textbf{N}$hanced Multimodal Large Language Model ($\textbf{LION}$), which empowers the MLLM by injecting visual knowledge in two levels. $\textbf{1)}$ $\textbf{Progressive}$ $\textbf{incorporation}$ $\textbf{of}$ $\textbf{fine-grained}$ $\textbf{spatial-aware}$ $\textbf{visual}$ $\textbf{knowledge}$. We design a vision aggregator cooperated with region-level vision-language (VL) tasks to incorporate fine-grained spatial-aware visual knowledge into the MLLM. To alleviate the conflict between image-level and region-level VL tasks during incorporation, we devise a dedicated stage-wise instruction-tuning strategy with mixture-of-adapters. This progressive incorporation scheme contributes to the mutual promotion between these two kinds of VL tasks. $\textbf{2)}$ $\textbf{Soft}$ $\textbf{prompting}$ $\textbf{of}$ $\textbf{high-level}$ $\textbf{semantic}$ $\textbf{visual}$ $\textbf{evidence}$. We facilitate the MLLM with high-level semantic visual evidence by leveraging diverse image tags. To mitigate the potential influence caused by imperfect predicted tags, we propose a soft prompting method by embedding a learnable token into the tailored text instruction. Comprehensive experiments on several multi-modal benchmarks demonstrate the superiority of our model ($\textit{e.g.}$, improvement of 5% accuracy on VSR and 3% CIDEr on TextCaps over InstructBLIP, 5% accuracy on RefCOCOg over Kosmos-2).
SDSTrack: Self-Distillation Symmetric Adapter Learning for Multi-Modal Visual Object Tracking
Xiaojun Hou · Jiazheng Xing · Yijie Qian · Yaowei Guo · Shuo Xin · Junhao Chen · Kai Tang · Mengmeng Wang · Zhengkai Jiang · Liang Liu · Yong Liu
Multimodal Visual Object Tracking (VOT) has recently gained significant attention due to its robustness. Early research focused on fully fine-tuning RGB-based trackers, which was inefficient and lacked generalized representation due to the scarcity of multimodal data. Therefore, recent studies have utilized prompt tuning to transfer pre-trained RGB-based trackers to multimodal data. However, the modality gap limits pre-trained knowledge recall, and the dominance of the RGB modality persists, preventing the full utilization of information from other modalities.To address these issues, we propose a novel symmetric multimodal tracking framework called SDSTrack. We introduce lightweight adaptation for efficient fine-tuning, which directly transfers the feature extraction ability from RGB to other domains with a small number of trainable parameters and integrates multimodal features in a balanced, symmetric manner.Furthermore, we design a complementary masked patch distillation strategy to enhance the robustness of trackers in complex environments, such as extreme weather, poor imaging, and sensor failure.Extensive experiments demonstrate that SDSTrack outperforms state-of-the-art methods in various multimodal tracking scenarios, including RGB+Depth, RGB+Thermal, and RGB+Event tracking, and exhibits impressive results in extreme conditions. Our source code is available at: https://github.com/hoqolo/SDSTrack.
Exploring the Transferability of Visual Prompting for Multimodal Large Language Models
Yichi Zhang · Yinpeng Dong · Siyuan Zhang · Tianzan Min · Hang Su · Jun Zhu
Although Multimodal Large Language Models (MLLMs) have demonstrated promising versatile capabilities, their performance is still inferior to specialized models on downstream tasks, which makes adaptation necessary to enhance their utility. However, fine-tuning methods require independent training for every model, leading to huge computation and memory overheads. In this paper, we propose a novel setting where we aim to improve the performance of diverse MLLMs with a group of shared parameters optimized for a downstream task. To achieve this, we propose Transferable Visual Prompting (TVP), a simple and effective approach to generate visual prompts that can transfer to different models and improve their performance on downstream tasks after trained on only one model. We introduce two strategies to address the issue of cross-model feature corruption of existing visual prompting methods and enhance the transferability of the learned prompts, including 1) Feature Consistency Alignment: which imposes constraints to the prompted feature changes to maintain task-agnostic knowledge; 2) Task Semantics Enrichment: which encourages the prompted images to contain richer task-specific semantics with language guidance. We validate the effectiveness of TVP through extensive experiments with 6 modern MLLMs on a wide variety of tasks ranging from object recognition and counting to multimodal reasoning and hallucination correction.
Mask Grounding for Referring Image Segmentation
Yong Xien Chng · Henry Zheng · Yizeng Han · Xuchong QIU · Gao Huang
Referring Image Segmentation (RIS) is a challenging task that requires an algorithm to segment objects referred by free-form language expressions. Despite significant progress in recent years, most state-of-the-art (SOTA) methods still suffer from considerable language-image modality gap at the pixel and word level. These methods generally 1) rely on sentence-level language features for language-image alignment and 2) lack explicit training supervision for fine-grained visual grounding. Consequently, they exhibit weak object-level correspondence between visual and language features. Without well-grounded features, prior methods struggle to understand complex expressions that require strong reasoning over relationships among multiple objects, especially when dealing with rarely used or ambiguous clauses. To tackle this challenge, we introduce a novel Mask Grounding auxiliary task that significantly improves visual grounding within language features, by explicitly teaching the model to learn fine-grained correspondence between masked textual tokens and their matching visual objects. Mask Grounding can be directly used on prior RIS methods and consistently bring improvements. Furthermore, to holistically address the modality gap, we also design a cross-modal alignment loss and an accompanying alignment module. These additions work synergistically with Mask Grounding. With all these techniques, our comprehensive approach culminates in MagNet (Mask-grounded Network), an architecture that significantly outperforms prior arts on three key benchmarks (RefCOCO, RefCOCO+ and G-Ref), demonstrating our method's effectiveness in addressing current limitations of RIS algorithms. Our code and pre-trained weights will be released.
OneLLM: One Framework to Align All Modalities with Language
Jiaming Han · Kaixiong Gong · Yiyuan Zhang · Jiaqi Wang · Kaipeng Zhang · Dahua Lin · Yu Qiao · Peng Gao · Xiangyu Yue
Multimodal large language models (MLLMs) have gained significant attention due to their strong multimodal understanding capability. However, existing works rely heavily on modality-specific encoders, which usually differ in architecture and are limited to common modalities. In this paper, we present OneLLM, an MLLM that aligns eight modalities to language using a unified framework. We achieve this through a unified multimodal encoder and a progressive multimodal alignment pipeline. In detail, we first train an image projection module to connect a vision encoder with LLM. Then, we build a universal projection module (UPM) by mixing multiple image projection modules and dynamic routing. Finally, we progressively align more modalities to LLM with the UPM. To fully leverage the potential of OneLLM in following instructions, we also curated a comprehensive multimodal instruction dataset, including 2M items from image, audio, video, point cloud, depth/normal map, IMU and fMRI brain activity. OneLLM is evaluated on 25 diverse benchmarks, encompassing tasks such as multimodal captioning, question answering and reasoning, where it delivers excellent performance. Code, data, model and online demo are available at https://github.com/csuhan/OneLLM
EmoVIT: Revolutionizing Emotion Insights with Visual Instruction Tuning
Hongxia Xie · Chu-Jun Peng · Yu-Wen Tseng · Hung-Jen Chen · Chan-Feng Hsu · Hong-Han Shuai · Wen-Huang Cheng
Visual Instruction Tuning represents a novel learning paradigm involving the fine-tuning of pre-trained language models using task-specific instructions. This paradigm shows promising zero-shot results in various natural language processing tasks but is still unexplored in vision emotion understanding. In this work, we focus on enhancing the model's proficiency in understanding and adhering to instructions related to emotional contexts. Initially, we identify key visual clues critical to visual emotion recognition. Subsequently, we introduce a novel GPT-assisted pipeline for generating emotion visual instruction data, effectively addressing the scarcity of annotated instruction data in this domain. Expanding on the groundwork established by InstructBLIP, our proposed EmoVIT architecture incorporates emotion-specific instruction data, leveraging the powerful capabilities of Large Language Models to enhance performance. Through extensive experiments, our model showcases its proficiency in emotion classification, adeptness in affective reasoning, and competence in comprehending humor. The comparative analysis provides a robust benchmark for Emotion Visual Instruction Tuning in the era of LLMs, providing valuable insights and opening avenues for future exploration in this domain.
ModaVerse: Efficiently Transforming Modalities with LLMs
Xinyu Wang · Bohan Zhuang · Qi Wu
Humans possess the capability to comprehend diverse modalities and seamlessly transfer information between them. In this work, we introduce ModaVerse, a Multi-modal Large Language Model (MLLM) capable of comprehending and transforming content across various modalities including images, videos, and audio. Predominant MLLM frameworks have largely relied on the alignment of latent spaces of textual and non-textual features. This alignment process, which synchronizes a language model trained on textual data with encoders and decoders trained on multi-modal data, often necessitates extensive training of several projection layers in multiple stages. Inspired by LLM-as-agent methodologies, we propose a novel Input/Output (I/O) alignment mechanism that operates directly at the level of natural language. It aligns the LLM's output with the input of generative models, avoiding the complexities associated with latent feature alignments, and simplifying the multiple training stages of existing MLLMs into a single, efficient process. This conceptual advancement leads to significant reductions in both data and computational costs. By conducting experiments on several benchmarks, we demonstrate that our approach attains comparable performance with the state of the art while achieving considerable efficiencies in data usage and training duration.
PromptKD: Unsupervised Prompt Distillation for Vision-Language Models
Zheng Li · Xiang Li · xinyi fu · Xin Zhang · Weiqiang Wang · Shuo Chen · Jian Yang
Prompt learning has emerged as a valuable technique in enhancing vision-language models (VLMs) such as CLIP for downstream tasks in specific domains. Existing work mainly focuses on designing various learning forms of prompts, neglecting the potential of prompts as effective distillers for learning from larger teacher models. In this paper, we introduce an unsupervised domain prompt distillation framework, which aims to transfer the knowledge of a larger teacher model to a lightweight target model through prompt-based imitation using unlabeled domain images. Specifically, our framework consists of two distinct stages. In the initial stage, we pre-train a large CLIP teacher model using domain few-shot labels. After pre-training, we leverage the unique decoupled-modality characteristics of CLIP by pre-computing and storing the text features as class vectors only once through the teacher text encoder. In the subsequent stage, the stored class vectors are shared across teacher and student image encoders for calculating the predicted logits. We align the logits of both the teacher and student models via KL divergence, encouraging the student image encoder to generate similar probability distributions to the teacher through the learnable prompts. The proposed prompt distillation process eliminates the reliance on labeled data, enabling the algorithm to leverage a vast amount of unlabeled images within the domain.Finally, the well-trained student image encoders and pre-stored text features (class vectors) are utilized for inference. To our best knowledge, we are the first to perform domain-specific prompt-based knowledge distillation for CLIP using unlabeled data. Extensive experiments on 11 datasets demonstrate the effectiveness of our method.
Dynamic Prompt Optimizing for Text-to-Image Generation
Wenyi Mo · Tianyu Zhang · Yalong Bai · Bing Su · Ji-Rong Wen · Qing Yang
Text-to-image generative models, specifically those based on diffusion models like Imagen and Stable Diffusion, have made substantial advancements. Recently, there has been a surge of interest in the delicate refinement of text prompts. Users assign weights or alter the injection time steps of certain words in the text prompts to improve the quality of generated images. However, the success of fine-control prompts depends on the accuracy of the text prompts and the careful selection of weights and time steps, which requires significant manual intervention. To address this, we introduce the \textbf{P}rompt \textbf{A}uto-\textbf{E}diting (PAE) method. Besides refining the original prompts for image generation, we further employ an online reinforcement learning strategy to explore the weights and injection time steps of each word, leading to the dynamic fine-control prompts. The reward function during training encourages the model to consider aesthetic score, semantic consistency, and user preferences. Experimental results demonstrate that our proposed method effectively improves the original prompts, generating visually more appealing images while maintaining semantic alignment.
Domain Prompt Learning with Quaternion Networks
Qinglong Cao · Zhengqin Xu · Yuntian Chen · Chao Ma · Xiaokang Yang
Prompt learning has emerged as an effective and data-efficient technique in large Vision-Language Models (VLMs). However, when adapting VLMs to specialized domains such as remote sensing and medical imaging, domain prompt learning remains underexplored. While large-scale domain-specific foundation models can help tackle this challenge, their concentration on a single vision level makes it challenging to prompt both vision and language modalities. To overcome this, we propose to leverage domain-specific knowledge from domain-specific foundation models to transfer the robust recognition ability of VLMs from generalized to specialized domains, using quaternion networks. Specifically, the proposed method involves using domain-specific vision features from domain-specific foundation models to guide the transformation of generalized contextual embeddings from the language branch into a specialized space within the quaternion networks. Moreover, we present a hierarchical approach that generates vision prompt features by analyzing intermodal relationships between hierarchical language prompt features and domain-specific vision features. In this way, quaternion networks can effectively mine the intermodal relationships in the specific domain, facilitating domain-specific vision-language contrastive learning. Extensive experiments on domain-specific datasets show that our proposed method achieves new state-of-the-art results in prompt learning.
ViT-Lens: Towards Omni-modal Representations
Stan Weixian Lei · Yixiao Ge · Kun Yi · Jianfeng Zhang · Difei Gao · Dylan Sun · Yuying Ge · Ying Shan · Mike Zheng Shou
Aiming to advance AI agents, large foundation models significantly improve reasoning and instruction execution, yet the current focus on vision and language neglects the potential of perceiving diverse modalities in open-world environments. However, the success of data-driven vision and language models is costly or even infeasible to be reproduced for rare modalities. In this paper, we present ViT-Lens that facilitates efficient omni-modal representation learning by perceiving novel modalities with a pretrained ViT and aligning them to a pre-defined space. Specifically, the modality-specific lens is tuned to project any-modal signals to an intermediate embedding space, which are then processed by a strong ViT with pre-trained visual knowledge. The encoded representations are optimized toward aligning with the modal-independent space, pre-defined by off-the-shelf foundation models. ViT-Lens provides a unified solution for representation learning of increasing modalities with two appealing advantages: (i) Unlocking the great potential of pretrained ViTs to novel modalities effectively with efficient data regime; (ii) Enabling emergent downstream capabilities through modality alignment and shared ViT parameters. We tailor ViT-Lens to learn representations for 3D point cloud, depth, audio, tactile and EEG, and set new state-of-the-art results across various understanding tasks, such as zero-shot classification. By seamlessly integrating ViT-Lens into Multimodal Foundation Models, we enable Any-modality to Text and Image Generation in a zero-shot manner. Code and models are available at https://github.com/TencentARC/ViT-Lens.
Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation
Sihan liu · Yiwei Ma · Xiaoqing Zhang · Haowei Wang · Jiayi Ji · Xiaoshuai Sun · Rongrong Ji
Referring Remote Sensing Image Segmentation (RRSIS) is a new challenge that combines computer vision and natural language processing, delineating specific regions in aerial images as described by textual queries. Traditional Referring Image Segmentation (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery, leading to suboptimal segmentation results. To address these challenges, we introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS. RMSIN incorporates an Intra-scale Interaction Module (IIM) to effectively address the fine-grained detail required at multiple scales and a Cross-scale Interaction Module (CIM) for integrating these details coherently across the network. Furthermore, RMSIN employs an Adaptive Rotated Convolution (ARC) to account for the diverse orientations of objects, a novel contribution that significantly enhances segmentation accuracy. To assess the efficacy of RMSIN, we have curated an expansive dataset comprising 17,402 image-caption-mask triplets, which is unparalleled in terms of scale and variety. This dataset not only presents the model with a wide range of spatial and rotational scenarios but also establishes a stringent benchmark for the RRSIS task, ensuring a rigorous evaluation of performance. Our experimental evaluations demonstrate the exceptional performance of RMSIN, surpassing existing state-of-the-art models by a significant margin. All datasets and code will be made available.
Cyclic Learning for Binaural Audio Generation and Localization
Zhaojian Li · Bin Zhao · Yuan Yuan
Binaural audio is obtained by simulating the biological structure of human ears, which plays an important role in artificial immersive spaces. A promising approach is to utilize mono audio and corresponding vision to synthesize binaural audio, thereby avoiding expensive binaural audio recording. However, most existing methods directly use the entire scene as a guide, ignoring the correspondence between sounds and sounding objects. In this paper, we advocate generating binaural audio using fine-grained raw waveform and object-level visual information as guidance. Specifically, we propose a Cyclic Locating-and-UPmixing (CLUP) framework that jointly learns visual sounding object localization and binaural audio generation. Visual sounding object localization establishes the correspondence between specific visual objects and sound modalities, which provides object-aware guidance to improve binaural generation performance. Meanwhile, the spatial information contained in the generated binaural audio can further improve the performance of sounding object localization. In this case, visual sounding object localization and binaural audio generation can achieve cyclic learning and benefit from each other. Experimental results demonstrate that on the FAIR-Play benchmark dataset, our method is significantly ahead of the existing baselines in multiple evaluation metrics (STFT$\downarrow$: 0.787 vs. 0.851, ENV$\downarrow$: 0.128 vs. 0.134, WAV$\downarrow$: 5.244 vs. 5.684, SNR$\uparrow$: 7.546 vs. 7.044).
Learning to Rematch Mismatched Pairs for Robust Cross-Modal Retrieval
Haochen Han · Qinghua Zheng · Guang Dai · Minnan Luo · Jingdong Wang
Collecting well-matched multimedia datasets is crucial for training cross-modal retrieval models. However, in real-world scenarios, massive multimodal data are harvested from the Internet, which inevitably contains Partially Mismatched Pairs (PMPs). Undoubtedly, such semantical irrelevant data will remarkably harm the cross-modal retrieval performance. Previous efforts tend to mitigate this problem by estimating a soft correspondence to down-weight the contribution of PMPs. In this paper, we aim to address this challenge from a new perspective: the potential semantic similarity among unpaired samples makes it possible to excavate useful knowledge from mismatched pairs. To achieve this, we propose L2RM, a general framework based on Optimal Transport (OT) that learns to rematch mismatched pairs. In detail, L2RM aims to generate refined alignments by seeking a minimal-cost transport plan across different modalities. To formalize the rematching idea in OT, first, we propose a self-supervised cost function that automatically learns from explicit similarity-cost mapping relation. Second, we present to model a partial OT problem while restricting the transport among false positives to further boost refined alignments. Extensive experiments on three benchmarks demonstrate our L2RM significantly improves the robustness against PMPs for existing models. The code is available at https://github.com/hhc1997/L2RM.
VILA: On Pre-training for Visual Language Models
Ji Lin · Danny Yin · Wei Ping · Pavlo Molchanov · Mohammad Shoeybi · Song Han
Visual language models (VLMs) rapidly progressed with the recent success of large language models. There have been growing efforts on visual instruction tuning to extend the LLM with visual inputs, but lacks an in-depth study of the visual language pre-training process, where the model learns to perform joint modeling on both modalities. In this work, we examine the design options for VLM pre-training by augmenting LLM towards VLM through step-by-step controllable comparisons. We introduce three main findings: (1) freezing LLMs during pre-training can achieve decent zero-shot performance, but lack in-context learning capability, which requires unfreezing the LLM; (2) interleaved pre-training data is beneficial whereas image-text pairs alone are not optimal; (3) re-blending text-only instruction data to image-text data during instruction fine-tuning not only remedies the degradation of text-only tasks, but also boosts VLM task accuracy. With an enhanced pre-training recipe we build VILA, a Visual Language model family that consistently outperforms the state-of-the-art models, e.g., LLaVA-1.5, across main benchmarks without bells and whistles. Multi-modal pre-training also helps unveil appealing properties of VILA, including multi-image reasoning, enhanced in-context learning, and better world knowledge.
A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions
Jack Urbanek · Florian Bordes · Pietro Astolfi · Mary Williamson · Vasu Sharma · Adriana Romero-Soriano
Curation methods for massive vision-language datasets trade off between dataset size and quality. However, even the highest quality of available curated captions are far too short to capture the rich visual detail in an image. To show the value of dense and highly-aligned image-text pairs, we collect the Densely Captioned Images (DCI) dataset, containing 8012 natural images human-annotated with mask-aligned descriptions averaging above 1000 words each. With precise and reliable captions associated with specific parts of an image, we can evaluate vision-language models' (VLMs) understanding of image content with a novel task that matches each caption with its corresponding subcrop. As current models are often limited to 77 text tokens, we also introduce a summarized version (sDCI) in which each caption length is limited. We show that modern techniques that make progress on standard benchmarks do not correspond with significant improvement on our sDCI based benchmark. Lastly, we finetune CLIP using sDCI and show significant improvements over the baseline despite a small training set. By releasing the first human annotated dense image captioning dataset, we hope to enable the development of new benchmarks or fine-tuning recipes for the next generation of VLMs to come.
How to Configure Good In-Context Sequence for Visual Question Answering
Li Li · Jiawei Peng · huiyi chen · Chongyang Gao · Xu Yang
Inspired by the success of Large Language Models in dealing with new tasks via In-Context Learning (ICL) in NLP, researchers have also developed Large Vision-Language Models (LVLMs) with ICL capabilities. However, when implementing ICL using these LVLMs, researchers usually resort to the simplest way like random sampling to configure the in-context sequence, thus leading to sub-optimal results. To enhance the ICL performance, in this study, we use Visual Question Answering (VQA) as case study to explore diverse in-context configurations to find the powerful ones. Additionally, through observing the changes of the LVLM outputs by altering the in-context sequence, we gain insights into the inner properties of LVLMs, improving our understanding of them. Specifically, to explore in-context configurations, we design diverse retrieval methods and employ different strategies to manipulate the retrieved in-context samples. Through exhaustive experiments on three VQA datasets: VQAv2, VizWiz, and OK-VQA, we uncover three important inner properties of the applied LVLM and demonstrate which strategies can consistently improve the ICL VQA performance. Our code is provided in: https://anonymous.4open.science/r/CVPR2024ICLVQA.
CrossMAE: Cross-Modality Masked Autoencoders for Region-Aware Audio-Visual Pre-Training
Yuxin Guo · Siyang Sun · Shuailei Ma · Kecheng Zheng · Xiaoyi Bao · Shijie Ma · Wei Zou · Yun Zheng
Learning joint and coordinated features across modalities is essential for many audio-visual tasks. Existing pre-training methods primarily focus on global information, neglecting fine-grained features and positions, leading to suboptimal performance in dense prediction tasks. To address this issue, we take a further step towards region-aware audio-visual pre-training and propose CrossMAE, which excels in cross-modality interaction and region alignment. Specifically, we devise two masked autoencoding (MAE) pretext tasks at both pixel and embedding levels, namely Cross-Conditioned Reconstruction and Cross-Embedding Reconstruction. Taking the visual modality as an example (the same goes for audio), in Cross-Conditioned Reconstruction, the visual modality reconstructs the input image pixels conditioned on audio Attentive Tokens. As for the more challenging Cross-Embedding Reconstruction, unmasked visual tokens reconstruct complete audio features under the guidance of learnable queries implying positional information, which effectively enhances the interaction between modalities and exploits fine-grained semantics. Experimental results demonstrate that CrossMAE achieves state-of-the-art performance not only in classification and retrieval, but also in dense prediction tasks. Furthermore, we dive into the mechanism of modal interaction and region alignment of CrossMAE, highlighting the effectiveness of the proposed components.
Modality-Collaborative Test-Time Adaptation for Action Recognition
Baochen Xiong · Xiaoshan Yang · Yaguang Song · Yaowei Wang · Changsheng Xu
Video-based Unsupervised Domain Adaptation (VUDA) method improves the generalization of the video model, enabling it to be applied to action recognition tasks in different environments. However, these methods require continuous access to source data during the adaptation process, which are impractical in real scenarios where the source videos are not available with concerns in transmission efficiency or privacy issues. To address this problem, in this paper, we propose to solve the Multimodal Video Test-Time Adaptation task (MVTTA). Existing image-based TTA methods cannot be directly applied to this task because video have domain shift in multimodal and temporal, which brings difficulties to adaptation. To address the above challenges, we propose a Modality-Collaborative Test-Time Adaptation (MC-TTA) Network. We maintain teacher and student memory banks respectively for generating pseudo-prototypes and target-prototypes. In the teacher model, we propose Self-assembled Source-friendly Feature Reconstruction (SSFR) module to encourage the teacher memory bank to store features that are more likely to be consistent with the source distribution. Through multimodal prototype alignment and cross-modal relative consistency, our method can effectively alleviate domain shift in videos. We evaluate the proposed model on four public video datasets. The results show that our model outperforms existing state-of-the-art methods.
T-VSL: Text-Guided Visual Sound Source Localization in Mixtures
Tanvir Mahmud · Yapeng Tian · Diana Marculescu
Visual sound source localization poses a significant challenge in identifying the semantic region of each sounding source within a video. Existing self-supervised and weakly supervised source localization methods struggle to accurately distinguish the semantic regions of each sounding object, particularly in multi-source mixtures. These methods often rely on audio-visual correspondence as guidance, which can lead to substantial performance drops in complex multi-source localization scenarios. The lack of access to individual source sounds in multi-source mixtures during training exacerbates the difficulty of learning effective audio-visual correspondence for localization. To address this limitation, in this paper, we propose incorporating the text modality as an intermediate feature guide using tri-modal joint embedding models (e.g., AudioCLIP) to disentangle the semantic audio-visual source correspondence in multi-source mixtures.Our framework, dubbed T-VSL, begins by predicting the class of sounding entities in mixtures. Subsequently, the textual representation of each sounding source is employed as guidance to disentangle fine-grained audio-visual source correspondence from multi-source mixtures, leveraging the tri-modal AudioCLIP embedding. This approach enables our framework to handle a flexible number of sources and exhibits promising zero-shot transferability to unseen classes during test time. Extensive experiments conducted on the MUSIC, VGGSound, and VGGSound-Instruments datasets demonstrate significant performance improvements over state-of-the-art methods. Our code and pre-trained models will be released.
UniBind: LLM-Augmented Unified and Balanced Representation Space to Bind Them All
Yuanhuiyi Lyu · Xu Zheng · Jiazhou Zhou · Addison, Lin Wang
We present UniBind, a flexible and efficient approach that learns a unified representation space for seven diverse modalities-- images, text, audio, point cloud, thermal, video, and event data. Existing works, eg., ImageBind, treat the image as the central modality and build an image-centered representation space; however, the space may be sub-optimal as it leads to an unbalanced representation space among all modalities. Moreover, the category names are directly used to extract text embeddings for the downstream tasks, making it hardly possible to represent the semantics of multi-modal data. The 'out-of-the-box' insight of our UniBind is to make the alignment center modality-agnostic and further learn a unified and balanced representation space, empowered by the large language models (LLMs). UniBind is superior in its flexible application to all CLIP-style models and delivers remarkable performance boosts. To make this possible, we 1) construct a knowledge base of text embeddings with the help of LLMs and multi-modal LLMs; 2) adaptively build LLM-augmented class-wise embedding center on top of the knowledge base and encoded visual embeddings; 3) align all the embeddings to the LLM-augmented embedding center via contrastive learning to achieve a unified and balanced representation space. UniBind shows strong zero-shot recognition performance gains over prior arts by an average of 6.36%. Finally, we achieve new state-of-the-art performance, eg., a 6.75% gain on ImageNet, on the multi-modal fine-tuning setting while reducing 90% of the learnable parameters.
Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models
Zhang Li · Biao Yang · Qiang Liu · Zhiyin Ma · Shuo Zhang · Jingxu Yang · Yabo Sun · Yuliang Liu · Xiang Bai
Large Multimodal Models (LMMs) have shown promise in vision-language tasks but struggle with high-resolution input and detailed scene understanding. Addressing these challenges, we introduce Monkey to enhance LMM capabilities. Firstly, Monkey processes input images by dividing them into uniform patches, each matching the size (e.g., 448$\times$448) used in the original training of the well-trained vision encoder. Equipped with individual adapter for each patch, Monkey can handle higher resolutions up to 1344$\times$896 pixels, enabling the detailed capture of complex visual information. Secondly, it employs a multi-level description generation method, enriching the context for scene-object associations. This two-part strategy ensures more effective learning from generated data: the higher resolution allows for a more detailed capture of visuals, which in turn enhances the effectiveness of comprehensive descriptions. Extensive ablative results validate the effectiveness of our designs. Additionally, experiments on 18 datasets further demonstrate that Monkey surpasses existing LMMs in many tasks like Image Captioning and various Visual Question Answering formats. Specially, in qualitative tests focused on dense text question answering, Monkey has exhibited encouraging results compared with GPT4V. Code is available at https://github.com/Yuliang-Liu/Monkey.
Rethinking Multi-view Representation Learning via Distilled Disentangling
Guanzhou Ke · Bo Wang · Xiao-Li Wang · Shengfeng He
Multi-view representation learning aims to derive robust representations that are both view-consistent and view-specific from diverse data sources. This paper presents an in-depth analysis of existing approaches in this domain, highlighting a commonly overlooked aspect: the redundancy between view-consistent and view-specific representations. To this end, we propose an innovative framework for multi-view representation learning, which incorporates a technique we term `distilled disentangling'.Our method introduces the concept of masked cross-view prediction, enabling the extraction of compact, high-quality view-consistent representations from various sources, without incurring extra computational overhead. Additionally, we develop a distilled disentangling module that efficiently filters out consistency-related information from multi-view representations, resulting in purer view-specific representations. This approach significantly reduces redundancy between view-consistent and view-specific representations, enhancing the overall efficiency of the learning process. Our empirical evaluations reveal that higher mask ratios substantially improve the quality of view-consistent representations. Moreover, we find that reducing the dimensionality of view-consistent representations relative to that of view-specific representations further refines the quality of the combined representations. Our code can be found at: https://anonymous.4open.science/r/MRDD-7FCD.
Causal Mode Multiplexer: A Novel Framework for Unbiased Multispectral Pedestrian Detection
Taeheon Kim · Sebin Shin · Youngjoon Yu · Hak Gu Kim · Yong Man Ro
RGBT multispectral pedestrian detection has emerged as a promising solution for safety-critical applications that require day/night operations. However, the modality bias problem remains unsolved as multispectral pedestrian detectors learn the statistical bias in datasets. Specifically, datasets in multispectral pedestrian detection mainly distribute between ROTO (day) and RXTO (night) data; the majority of the pedestrian labels statistically co-occur with their thermal features. As a result, multispectral pedestrian detectors show poor generalization ability on examples beyond this statistical correlation, such as ROTX data. To address this problem, we propose a novel Causal Mode Multiplexer (CMM) framework that effectively learns the causalities between multispectral inputs and predictions. Moreover, we construct a new dataset (ROTX-MP) to evaluate modality bias in multispectral pedestrian detection. ROTX-MP mainly includes ROTX examples not presented in previous datasets. Extensive experiments demonstrate that our proposed CMM framework generalizes well on existing datasets (KAIST, CVC-14, FLIR) and the new ROTX-MP. We will release our new dataset to the public for future research.
Image-Text Co-Decomposition for Text-Supervised Semantic Segmentation
Ji-Jia Wu · Andy Chia-Hao Chang · Chieh-Yu Chuang · Chun-Pei Chen · Yu-Lun Liu · Min-Hung Chen · Hou-Ning Hu · Yung-Yu Chuang · Yen-Yu Lin
This paper addresses text-supervised semantic segmentation, aiming to learn a model capable of segmenting arbitrary visual concepts within images by using only image-text pairs without dense annotations. Existing methods have demonstrated that contrastive learning on image-text pairs effectively aligns visual segments with the meanings of texts. We notice that there is a discrepancy between text alignment and semantic segmentation: A text often consists of multiple semantic concepts, whereas semantic segmentation strives to create semantically homogeneous segments. To address this issue, we propose a novel framework, Image-Text Co-Decomposition (CoDe), where the paired image and text are jointly decomposed into a set of image regions and a set of word segments, respectively, and contrastive learning is developed to enforce region-word alignment. To work with a vision-language model, we present a prompt learning mechanism that derives an extra representation to highlight an image segment or a word segment of interest, with which more effective features can be extracted from that segment. Comprehensive experimental results demonstrate that our method performs favorably against existing text-supervised semantic segmentation methods on six benchmark datasets.
Mirasol3B: A Multimodal Autoregressive Model for Time-Aligned and Contextual Modalities
AJ Piergiovanni · Isaac Noble · Dahun Kim · Michael Ryoo · Victor Gomes · Anelia Angelova
One of the main challenges of multimodal learning is the need to combine heterogeneous modalities (e.g., video, audio, text). For example, video and audio are obtained at much higher rates than text and are roughly aligned in time. They are often not synchronized with text, which comes as a global context, e.g. a title, or a description. Furthermore, video and audio inputs are of much larger volumes, and grow as the video length increases, which naturally requires more compute dedicated to these modalities and makes modeling of long-range dependencies harder. We here decouple the multimodal modeling, dividing it into separate autoregressive models, processing the inputs according to the characteristics of the modalities. We propose a multimodal model, consisting of an autoregressive component for the time-synchronized modalities (audio and video), and an autoregressive component for the context modalities which are not necessarily aligned in time but are still sequential. To address the long-sequences of the video-audio inputs, we further partition the video and audio sequences in consecutive snippets and autoregressively process their representations. To that end, we propose a Combiner mechanism, which models the audio-video information jointly, producing compact but expressive representations. This allows us to scale to 512 input video frames without increase in model parameters. Our approach achieves the state-of-the-art on multiple well established multimodal benchmarks. It effectively addresses the high computational demand of media inputs by learning compact representations, controlling the sequence length of the audio-video feature representations, and modeling their dependencies in time.
Efficient Vision-Language Pre-training by Cluster Masking
Zihao Wei · Zixuan Pan · Andrew Owens
The quest for optimal vision-language pretraining strategies has led to the exploration of masking techniques as a way to enhance data efficiency. Previous approaches include random masking and semantic masking, the latter requiring the retention or exclusion of patches in areas with similar semantics. Despite its effectiveness, semantic masking often needs an additional, complex model for identifying semantically related patches, increasing computational demands. Our method utilizes naturally emerging clusters within images unlike other approaches using text supervision. We employ random clusters of image patches for masking, utilizing the raw RGB values of patches as the feature representation. This method capitalizes on the observation that basic visual similarity measures can effectively identify coherent visual structures, such as parts of objects. Our approach, therefore, combines the computational efficiency of random patch dropping with the enhanced performance achieved through masking coherent visual structures.
MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models
Sanjoy Chowdhury · Sayan Nag · Joseph K J · Balaji Vasan Srinivasan · Dinesh Manocha
Music is a universal language that can communicate emotions and feelings. It forms an essential part of the whole spectrum of creative media, ranging from movies to social media posts. Machine learning models that can synthesize music are predominantly conditioned on textual descriptions of it. Inspired by how musicians compose music not just from a movie script, but also through visualizations, we propose MeLFusion, a model that can effectively use cues from a textual description and the corresponding image to synthesize music. MeLFusion is a text-to-music diffusion model with a novel "visual synapse", which effectively infuses the semantics from the visual modality into the generated music. To facilitate research in this area, we introduce a new dataset MeLBench, and propose a new evaluation metric IMSM. Our exhaustive experimental evaluation suggests that adding visual information to the music synthesis pipeline significantly improves the quality of generated music, measured both objectively and subjectively, with a relative gain of up to 67.98% on the FAD score. We hope that our work will gather attention to this pragmatic, yet relatively under-explored research area.
Weakly Misalignment-free Adaptive Feature Alignment for UAVs-based Multimodal Object Detection
Chen Chen · Jiahao Qi · Xingyue Liu · Kangcheng Bin · Ruigang Fu · Xikun Hu · Ping Zhong
Visible-infrared (RGB-IR) image fusion has shown great potentials in object detection based on unmanned aerial vehicles (UAVs). However, the weakly misalignment problem between multimodal image pairs limits its performance in object detection. Most existing methods often ignore the modality gap and emphasize a strict alignment, resulting in an upper bound of alignment quality and an increase of implementation costs. To address these challenges, we propose a novel method named Offset-guided Adaptive Feature Alignment (OAFA), which could adaptively adjust the relative positions between multimodal features. Considering the impact of modality gap on the cross-modality spatial matching, a Cross-modality Spatial Offset Modeling (CSOM) module is designed to establish a common subspace to estimate the precise feature-level offsets. Then, an Offset-guided Deformable Alignment and Fusion (ODAF) module is utilized to implicitly capture optimal fusion positions for detection task rather than conducting a strict alignment. Comprehensive experiments demonstrate that our method not only achieves state-of-the-art performance in the UAVs-based object detection task but also shows strong robustness to the weakly misalignment problem.
DiVAS: Video and Audio Synchronization with Dynamic Frame Rates
Clara Maria Fernandez Labrador · Mertcan Akcay · Eitan Abecassis · Joan Massich · Christopher Schroers
Synchronization issues between audio and video are one of the most disturbing quality defects in film production and live broadcasting. Even a discrepancy as short as 45 millisecond can degrade the viewer’s experience enough to warrant manual quality checks over entire movies. In this paper, we study the automatic discovery of such issues. Specifically, we focus on the alignment of lip movements with spoken words, targeting realistic production scenarios which can include background noise and music, intricate head poses, excessive makeup, or scenes with multiple individuals where the speaker is unknown. Our model’s robustness also extends to various media specifications, including different video frame rates and audio sample rates. To address these challenges, we present a model fully based on transformers that encodes face crops or full video frames and raw audio using timestamp information, identifies the speaker and provides highly accurate synchronization predictions much faster than previous methods.
Querying as Prompt: Parameter-Efficient Learning for Multimodal Language Model
Tian Liang · Jing Huang · Ming Kong · Luyuan Chen · Qiang Zhu
Recent advancements in language models pre-trained on large-scale corpora have significantly propelled developments in the NLP domain and advanced progress in multimodal tasks. In this paper, we propose a Parameter-Efficient multimodal language model learning strategy, named QaP (Querying as Prompt). Its core innovation is a novel modality-bridging method that allows a set of modality-specific queries to be input as soft prompts into a frozen pre-trained language model. Specifically, we introduce an efficient Text-Conditioned Resampler that is easy to incorporate into the language models, which enables adaptive injection of text-related multimodal information at different levels of the model through query learning. This approach effectively bridges multimodal information to the language models while fully leveraging its token fusion and representation potential. We validated our method across four datasets in three distinct multimodal tasks. The results demonstrate that our QaP multimodal language model achieves state-of-the-art performance in various tasks with training only 4.6% parameters.
SonicVisionLM: Playing Sound with Vision Language Models
Zhifeng Xie · Shengye Yu · Qile He · Mengtian Li
There has been a growing interest in the task of generating sound for silent videos, primarily because of its practicality in streamlining video post-production. However, existing methods for video-sound generation attempt to directly create sound from visual representations, which can be challenging due to the difficulty of aligning visual representations with audio representations. In this paper, we present SonicVisionLM, a novel framework aimed at generating a wide range of sound effects by leveraging vision-language models(VLMs). Instead of generating audio directly from video, we use the capabilities of powerful VLMs. When provided with a silent video, our approach first identifies events within the video using a VLM to suggest possible sounds that match the video content. This shift in approach transforms the challenging task of aligning image and audio into more well-studied sub-problems of aligning image-to-text and text-to-audio through the popular diffusion models. To improve the quality of audio recommendations with LLMs, we have collected an extensive dataset that maps text descriptions to specific sound effects and developed a time-controlled audio adapter. Our approach surpasses current state-of-the-art methods for converting video to audio, enhancing synchronization with the visuals, and improving alignment between audio and video components. Project page: https://yusiissy.github.io/SonicVisionLM.github.io/
Embracing Unimodal Aleatoric Uncertainty for Robust Multimodal Fusion
Zixian Gao · Xun Jiang · Xing Xu · Fumin Shen · Yujie Li · Heng Tao Shen
As a fundamental problem in multimodal learning, multimodal fusion aims to compensate for the inherent limitations of a single modality. One challenge of multimodal fusion is that the unimodal data in their unique embedding space mostly contains potential noise, which leads to corrupted cross-modal interactions. However, in this paper, we show that the potential noise in unimodal data could be well quantified and further employed to enhance more stable unimodal embeddings via contrastive learning. Specifically, we propose a novel generic and robust multimodal fusion strategy, termed Embracing Aleatoric Uncertainty (EAU), which is simple and can be applied to kinds of modalities. It consists of two key steps: (1) the Stable Unimodal Feature Augmentation (SUFA) that learns a stable unimodal representation by incorporating the aleatoric uncertainty into self-supervised contrastive learning. (2) Robust Multimodal Feature Integration (RMFI) leveraging an information-theoretic strategy to learn a robust compact joint representation. We evaluate our proposed EAU method on five multimodal datasets, where the video, RGB image, text, audio, and depth image are involved. Extensive experiments demonstrate the EAU method is more noise-resistant over existing multimodal fusion strategies and establishes new state-of-the-art on several benchmarks.
C3Net: Compound Conditioned ControlNet for Multimodal Content Generation
Juntao Zhang · Yuehuai LIU · Yu-Wing Tai · Chi-Keung Tang
We present Compound Conditioned ControlNet, C3Net, a novel generative neural architecture taking conditions from multiple modalities and synthesizing multimodal contents simultaneously (e.g., image, text, audio). C3Net adapts the ControlNet architecture to jointly train and make inferences on a production-ready diffusion model and its trainable copies. Specifically, C3Net first aligns the conditions from multi-modalities to the same semantic latent space using modality-specific encoders based on contrastive training. Then, it generates multimodal outputs based on the aligned latent space, whose semantic information is combined using a ControlNet-like architecture called Control C3-UNet. Correspondingly, with this system design, our model offers an improved solution for joint-modality generation through learning and explaining multimodal conditions, involving more than just linear interpolation within the latent space. Meanwhile, as we align conditions to a unified latent space, C3Net only requires one trainable Control C3-UNet to work on multimodal semantic information. Furthermore, our model employs unimodal pretraining on the condition alignment stage, outperforming the non-pretrained alignment even on relatively scarce training data and thus demonstrating high-quality compound condition generation. We contribute the first high-quality tri-modal validation set to validate quantitatively that C3Net outperforms or is on par with the first and contemporary state-of-the-art multimodal generation. Our codes and tri-modal dataset will be released here.
Composed Video Retrieval via Enriched Context and Discriminative Embeddings
Omkar Thawakar · Muzammal Naseer · Rao Anwer · Salman Khan · Michael Felsberg · Mubarak Shah · Fahad Shahbaz Khan
Composed video retrieval (CoVR) is a challenging prob- lem in computer vision which has recently highlighted the in- tegration of modification text with visual queries for more so- phisticated video search in large databases. Existing works predominantly rely on visual queries combined with modi- fication text to distinguish relevant videos. However, such a strategy struggles to fully preserve the rich query-specific context in retrieved target videos and only represents the target video using visual embedding. We introduce a novel CoVR framework that leverages detailed language descrip- tions to explicitly encode query-specific contextual informa- tion and learns discriminative embeddings of vision only, text only and vision-text for better alignment to accurately retrieve matched target videos. Our proposed framework can be flexibly employed for both composed video (CoVR) and image (CoIR) retrieval tasks. Experiments on three datasets show that our approach obtains state-of-the-art per- formance for both CovR and zero-shot CoIR tasks, achiev- ing gains as high as around 7% in terms of recall@K=1 score. Our code, detailed language descriptions for WebViD- CoVR dataset are available at https://github.com/OmkarThawakar/composed-video-retrieval.
Looking Similar Sounding Different: Leveraging Counterfactual Cross-Modal Pairs for Audiovisual Representation Learning
Nikhil Singh · Chih-Wei Wu · Iroro Orife · Kalayeh
Audiovisual representation learning typically relies on the correspondence between sight and sound. However, there are often multiple audio tracks that can correspond with a visual scene. Consider, for example, different conversations on the same crowded street. The effect of such counterfactual pairs on audiovisual representation learning has not been previously explored. To investigate this, we use dubbed versions of movies and television shows to augment cross-modal contrastive learning. Our approach learns to represent alternate audio tracks, differing only in speech, similarly to the same video. Our results, from a comprehensive set of experiments investigating different training strategies, show this general approach improves performance on a range of downstream auditory and audiovisual tasks, without majorly affecting linguistic task performance overall. These findings highlight the importance of considering speech variation when learning scene-level audiovisual correspondences and suggest that dubbed audio can be a useful augmentation technique for training audiovisual models toward more robust performance on diverse downstream tasks.
Anchor-based Robust Finetuning of Vision-Language Models
Jinwei Han · Zhiwen Lin · Zhongyisun Sun · Yingguo Gao · Ke Yan · Shouhong Ding · Yuan Gao · Gui-Song Xia
We aim at finetuning a vision-language model without hurting its out-of-distribution (OOD) generalization. We address two types of OOD generalization, i.e., i) domain shift such as natural to sketch images, and ii) zero-shot capability to recognize the category that was not contained in the finetune data. Arguably, the diminished OOD generalization after finetuning stems from the excessively simplified finetuning target, which only provides the class information, such as ``a photo of a [CLASS]''. This is distinct from the process in that CLIP was pretrained, where there is abundant text supervision with rich semantic information. Therefore, we propose to compensate for the finetune process using auxiliary supervision with rich semantic information, which acts as anchors to preserve the OOD generalization. Specifically, two types of anchors are elaborated in our methods, including i) text-compensated anchor which uses the images from the finetune set but enriches the text supervision from a pretrained captioner, ii) image-text-pair anchor which is retrieved from the dataset similar to pretraining data of CLIP according to the downstream task, associating with the original CLIP text with rich semantics. Those anchors are utilized as auxiliary semantic information to maintain the original feature space of CLIP, thereby preserving the OOD generalization capabilities. Comprehensive experiments demonstrate that our method achieves in-distribution performance akin to conventional finetuning while attaining new state-of-the-art results on domain shift and zero-shot learning benchmarks.
Event-based Visible and Infrared Fusion via Multi-task Collaboration
Mengyue Geng · Lin Zhu · Lizhi Wang · Wei Zhang · Ruiqin Xiong · Yonghong Tian
Visible and Infrared image Fusion (VIF) offers a comprehensive scene description by combining thermal infrared images with the rich textures from visible cameras. However, conventional VIF systems may capture over/under exposure or blurry images in extreme lighting and high dynamic motion scenarios, leading to degraded fusion results. To address these problems, we propose a novel Event-based Visible and Infrared Fusion (EVIF) system that employs a visible event camera as an alternative to traditional frame-based cameras for the VIF task. With extremely low latency and high dynamic range, event cameras can effectively address blurriness and are robust against diverse luminous ranges. To produce high-quality fused images, we develop a multi-task collaborative framework that simultaneously performs event-based visible texture reconstruction, event-guided infrared image deblurring, and visible-infrared fusion. Rather than independently learning these tasks, our framework capitalizes on their synergy, leveraging cross-task event enhancement for efficient deblurring and bi-level min-max mutual information optimization to achieve higher fusion quality. Experiments on both synthetic and real data show that EVIF achieves remarkable performance in dealing with extreme lighting conditions and high-dynamic scenes, ensuring high-quality fused images across a broad range of practical scenarios.
Pre-trained vision-language models have shown impressive success on various computer vision tasks with their zero-shot generalizability. Recently, prompt learning approaches have been explored to efficiently and effectively adapt the vision-language models to a variety of downstream tasks. However, most existing prompt learning methods suffer from \textit{task overfitting} since the general knowledge of the pre-trained vision language models is forgotten while the prompts are finetuned on a small data set from a specific target task. To address this issue, we propose a Prompt Meta-Regularization~(ProMetaR) to improve the generalizability of prompt learning for vision-language models. Specifically, ProMetaR meta-learns both the regularizer and the soft prompts to harness the task-specific knowledge from the downstream tasks and task-agnostic general knowledge from the vision-language models. Further, ProMetaR augments the task to generate multiple virtual tasks to alleviate the meta-overfitting. In addition, we provide the analysis to comprehend how ProMetaR improves the generalizability of prompt tuning in the perspective of the gradient alignment. Our extensive experiments demonstrate that our ProMetaR improves the generalizability of conventional prompt learning methods under base-to-base/base-to-new and domain generalization settings.
Knowledge-Enhanced Dual-stream Zero-shot Composed Image Retrieval
Yucheng Suo · Fan Ma · Linchao Zhu · Yi Yang
We study the zero-shot Composed Image Retrieval (ZS-CIR) task, which is to retrieve the target image given a reference image and a description without training on the triplet datasets. Previous works learn a pseudo-word token by projecting the reference image features to the text embedding space via image-only contrastive learning. However, they focus on the global visual representation, ignoring the representation of detailed attributes, e.g., color, object number and layout. To address this challenge, we propose a Knowledge-Enhanced Dual-stream zero-shot composed image retrieval framework (KEDs). KEDs implicitly models the attributes of the reference image by incorporating a database. The database enriches the pseudo-word tokens by providing relevant images and captions, emphasizing shared attribute information in various aspects. In this way, KEDs recognizes the reference image from diverse perspectives.Moreover, KEDs adopts an extra stream that aligns pseudo-word tokens with textual concepts, leveraging pseudo-triplets mined from image-text pairs. The pseudo-word tokens generated in this stream are explicitly aligned with fine-grained attribute semantics in the text embedding space. Extensive experiments on widely used benchmarks, i.e. ImageNet-R, COCO object, Fashion-IQ and CIRR, show that KEDs outperforms previous zero-shot composed image retrieval methods.
Contextual Augmented Global Contrast for Multimodal Intent Recognition
Kaili Sun · Zhiwen Xie · Mang Ye · Huyin Zhang
Multimodal intent recognition (MIR) aims to perceive the human intent polarity via language, visual, and acoustic modalities. The inherent ambiguity of intent makes it challenging to recognize in multimodal scenarios. Existing MIR methods tend to model the individual videos independently, ignoring the contextual information across the videos. This learning manner inevitably introduces perception biases, exacerbated by inconsistencies in multimodal information, amplifying uncertainty in intent understanding.This challenge motivates us to explore effective global context modeling. Thus, we propose a context-augmented global contrast (CAGC) method to capture rich global context features by mining both intra-and cross-video context interactions for MIR. Concretely, we design a context-augmented transformer module to extract global context dependencies across videos. To further alleviate error accumulation and interference, we develop a cross-video bank that retrieves effective video sources by considering both intentional tendency and video similarity. Furthermore, we introduce a global context-guided contrastive learning scheme, designed to mitigate inconsistencies arising from global context representations and individual modalities in different feature spaces.This scheme incorporates global cues as supervision, ensuring the effectiveness of global contextual information while also enhancing the consistency learning. Experiments demonstrate CAGC obtains superior performance than state-of-the-art MIR methods.We also generalize our approach to a closely related task, multimodal sentiment analysis, achiveing the comparable performance.
MRFS: Mutually Reinforcing Image Fusion and Segmentation
HAO ZHANG · Xuhui Zuo · Jie Jiang · Chunchao Guo · Jiayi Ma
This paper proposes a coupled learning framework to break the performance bottleneck of infrared-visible image fusion and segmentation, called MRFS. By leveraging the intrinsic consistency between vision and semantics, it emphasizes mutual reinforcement rather than treating these tasks as separate issues. First, we embed weakened information recovery and salient information integration into the image fusion task, employing the CNN-based interactive gated mixed attention (IGM-Att) module to extract high-quality visual features. This aims to satisfy human visual perception, producing fused images with rich textures, high contrast, and vivid colors. Second, a transformer-based progressive cycle attention (PC-Att) module is developed to enhance semantic segmentation. It establishes single-modal self-reinforcement and cross-modal mutual complementarity, enabling more accurate decisions in machine semantic perception. Then, the cascade of IGM-Att and PC-Att couples image fusion and semantic segmentation tasks, implicitly bringing vision-related and semantics-related features into closer alignment. Therefore, they mutually provide learning priors to each other, resulting in visually satisfying fused images and more accurate segmentation decisions. Extensive experiments on public datasets showcase the advantages of our method in terms of visual satisfaction and decision accuracy. The code is publicly available at https://github.com/HaoZhang1018/MRFS.
POPDG: Popular 3D Dance Generation with PopDanceSet
Zhenye Luo · Min Ren · Xuecai Hu · Yongzhen Huang · Li Yao
Generating dances that are both lifelike and well-aligned with music continues to be a challenging task in the cross- modal domain. This paper introduces PopDanceSet, the first dataset tailored to the preferences of young audiences, enabling the generation of aesthetically oriented dances. And it surpasses the AIST++ dataset in music genre di- versity and the intricacy and depth of dance movements. Moreover, the proposed POPDG model within the iD- DPM framework enhances dance diversity and, through the Space Augmentation Algorithm, strengthens spatial physi- cal connections between human body joints, ensuring that increased diversity does not compromise generation qual- ity. A streamlined Alignment Module is also designed to improve the temporal alignment between dance and mu- sic. Extensive experiments show that POPDG achieves SOTA results on two datasets. Furthermore, the paper also expands on current evaluation metrics. The dataset and code are available at https://github.com/Luke-Luo1/POPDG.
How to Make Cross Encoder a Good Teacher for Efficient Image-Text Retrieval?
Yuxin Chen · Zongyang Ma · Ziqi Zhang · Zhongang Qi · Chunfeng Yuan · Bing Li · Junfu Pu · Ying Shan · Xiaojuan Qi · Weiming Hu
Dominant dual-encoder models enable efficient image-text retrieval but suffer from limited accuracy, while the cross-encoder models offer higher accuracy at the expense of efficiency. Distilling cross-modality matching knowledge from cross-encoder to dual-encoder provides a natural approach to harness their strengths. Thus, we investigate the following valuable question: how to make cross-encoder a good teacher for dual-encoder? Our findings are threefold: (1) Cross-modal similarity score distribution of cross-encoder is more concentrated, while the result of dual-encoder is nearly normal, making vanilla logit distillation less effective. However, ranking distillation remains practical, as it is not affected by the score distribution. (2) Only the relative order between hard negatives conveys valid knowledge, while the order information between easy negatives has little significance. (3) Maintaining the coordination between distillation loss and dual-encoder training loss is beneficial for knowledge transfer. Based on these findings, we propose a novel Contrastive Partial Ranking Distillation (CPRD) method, which implements the objective of mimicking relative order between hard negative samples with contrastive learning. This approach coordinates with the training of the dual-encoder, transferring valid knowledge from the cross-encoder to the dual-encoder effectively. Extensive experiments on image-text retrieval and ranking tasks show that our method surpasses other distillation methods and significantly improve the accuracy of dual-encoder.
Active Prompt Learning in Vision Language Models
Jihwan Bang · Sumyeong Ahn · Jae-Gil Lee
Pre-trained Vision Language Models (VLMs) have demonstrated notable progress in various zero-shot tasks, such as classification and retrieval. Despite their performance, because improving performance on new tasks requires task-specific knowledge, their adaptation is essential. While labels are needed for the adaptation, acquiring them is typically expensive. To overcome this challenge, active learning, a method of achieving a high performance by obtaining labels for a small number of samples from experts, has been studied. Active learning primarily focuses on selecting unlabeled samples for labeling and leveraging them to train models. In this study, we pose the question, "how can the pre-trained VLMs be adapted under the active learning framework?" In response to this inquiry, we observe that (1) simply applying a conventional active learning framework to pre-trained VLMs even may degrade performance compared to random selection because of the class imbalance in labeling candidates, and (2) the knowledge of VLMs can provide hints for achieving the balance before labeling. Based on these observations, we devise a novel active learning framework for VLMs, denoted as PCB. To assess the effectiveness of our approach, we conduct experiments on seven different real-world datasets, and the results demonstrate that PCB surpasses conventional active learning and random sampling methods.
Descriptor and Word Soups: Overcoming the Parameter Efficiency Accuracy Tradeoff for Out-of-Distribution Few-shot Learning
Christopher Liao · Theodoros Tsiligkaridis · Brian Kulis
Over the past year, a large body of multimodal research has emerged around zero-shot evaluation using GPT descriptors. These studies boost the zero-shot accuracy of pretrained VL models with an ensemble of label-specific text generated by GPT. A recent study, WaffleCLIP, demonstrated that similar zero-shot accuracy can be achieved with an ensemble of random descriptors. However, both zero-shot methods are un-trainable and consequently sub-optimal when some few-shot out-of-distribution (OOD) training data is available. Inspired by these prior works, we present two more flexible methods called descriptor and word soups, which do not require an LLM at test time and can leverage training data to increase OOD target accuracy. Descriptor soup greedily selects a small set of textual descriptors using generic few-shot training data, then calculates robust class embeddings using the selected descriptors. Word soup greedily assembles a chain of words in a similar manner. Compared to existing few-shot soft prompt tuning methods, word soup requires fewer parameters by construction and less GPU memory, since it does not require backpropagation. Both soups outperform current published few-shot methods, even when combined with SoTA zero-shot methods, on cross-dataset and domain generalization benchmarks. Compared with SoTA prompt and descriptor ensembling methods, such as ProDA and WaffleCLIP, word soup achieves higher OOD accuracy with fewer ensemble members. Please checkout our code: github.com/Chris210634/word_soups
Text-IF: Leveraging Semantic Text Guidance for Degradation-Aware and Interactive Image Fusion
Xunpeng Yi · Han Xu · HAO ZHANG · Linfeng Tang · Jiayi Ma
Image fusion aims to combine information from different source images to create a comprehensively representative image. Existing fusion methods are typically helpless in dealing with degradations in low-quality source images and non-interactive to multiple subjective and objective needs. To solve them, we introduce a novel approach that leverages semantic text guidance image fusion model for degradation-aware and interactive image fusion task, termed as Text-IF. It innovatively extends the classical image fusion to the text guided image fusion along with the ability to harmoniously address the degradation and interaction issues during fusion. Through the text semantic encoder and semantic interaction fusion decoder, Text-IF is accessible to the all-in-one infrared and visible image degradation-aware processing and the interactive flexible fusion outcomes. In this way, Text-IF achieves not only multi-modal image fusion, but also multi-modal information fusion. Extensive experiments prove that our proposed text guided image fusion strategy has obvious advantages over SOTA methods in the image fusion performance and degradation treatment. The code is available at https://github.com/XunpengYi/Text-IF.
Hallucination Augmented Contrastive Learning for Multimodal Large Language Model
Chaoya Jiang · Haiyang Xu · Mengfan Dong · Jiaxing Chen · Wei Ye · Ming Yan · Qinghao Ye · Ji Zhang · Fei Huang · Shikun Zhang
Multi-modal large language models (MLLMs) have been shown to efficiently integrate natural language with visual information to handle multi-modal tasks. However, MLLMs still face a fundamental limitation of hallucinations, where they tend to generate erroneous or fabricated information. In this paper, we address hallucinations in MLLMs from a novel perspective of representation learning. We first analyzed the representation distribution of textual and visual tokens in MLLM, revealing two important findings: 1) there is a significant gap between textual and visual representations, indicating unsatisfactory cross-modal representation alignment; 2) representations of texts that contain and do not contain hallucinations are entangled, making it challenging to distinguish them. These two observations inspire us with a simple yet effective method to mitigate hallucinations. Specifically, we introduce contrastive learning into MLLMs and use text with hallucination as hard negative examples, naturally bringing representations of non-hallucinatory text and visual samples closer while pushing way representations of non-hallucinatory and hallucinatory text. We evaluate our method quantitatively and qualitatively, showing its effectiveness in reducing hallucination occurrences and improving performance across multiple benchmarks. On the MMhal-Bench benchmark, our method obtains a 34.66\% /29.5\% improvement over the baseline MiniGPT-4/LLaVA.
Beyond Text: Frozen Large Language Models in Visual Signal Comprehension
Lei Zhu · Fangyun Wei · Yanye Lu
In this work, we investigate the potential of a large language model (LLM) to directly comprehend visual signals without the necessity of fine-tuning on multi-modal datasets. The foundational concept of our method views an image as a linguistic entity, and translates it to a set of discrete words derived from the LLM's vocabulary. To achieve this, we present the Vision-to-Language Tokenizer, abbreviated as V2T Tokenizer, which transforms an image into a ``foreign language'' with the combined aid of an encoder-decoder, the LLM vocabulary, and a CLIP model. With this innovative image encoding, the LLM gains the ability not only for visual comprehension but also for image denoising and restoration in an auto-regressive fashion—crucially, without any fine-tuning. We undertake rigorous experiments to validate our method, encompassing understanding tasks like image recognition, image captioning, and visual question answering, as well as image denoising tasks like inpainting, outpainting, deblurring, and shift restoration. Code and models are available at https://github.com/zh460045050/V2L-Tokenizer.
Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos
Sagnik Majumder · Ziad Al-Halah · Kristen Grauman
We propose a self-supervised method for learning representations based on spatial audio-visual correspondences in egocentric videos. Our method uses a masked auto-encoding framework to synthesize masked binaural (multi-channel) audio through the synergy of audio and vision, thereby learning useful spatial relationships between the two modalities. We use our pretrained features to tackle two downstream video tasks requiring spatial understanding in social scenarios: active speaker detection and spatial audio denoising. Through extensive experiments, we show that our features are generic enough to improve over multiple state-of-the-art baselines on both tasks on two challenging egocentric video datasets that offer binaural audio, EgoCom and EasyCom.
ES³: Evolving Self-Supervised Learning of Robust Audio-Visual Speech Representations
Yuanhang Zhang · Shuang Yang · Shiguang Shan · Xilin Chen
We propose a novel strategy, ES$^3$, for self-supervised learning of robust audio-visual speech representations from unlabeled talking face videos. While many recent approaches for this task primarily rely on guiding the learning process using the audio modality alone to capture information shared between audio and video, we reframe the problem as the acquisition of *shared*, *unique* (modality-specific) and *synergistic* speech information to address the inherent **asymmetry** between the modalities. Based on this formulation, we propose a novel "evolving" strategy that progressively builds joint audio-visual speech representations that are strong for both uni-modal (audio & visual) and bi-modal (audio-visual) speech. First, we leverage the more easily learnable audio modality to initialize audio and visual representations by capturing audio-unique and shared speech information. Next, we incorporate video-unique speech information and bootstrap the audio-visual representations on top of the previously acquired shared knowledge. Finally, we maximize the total audio-visual speech information, including synergistic information to obtain robust and comprehensive representations. We implement ES$^3$ as a simple Siamese framework and experiments on both English benchmarks and a newly contributed large-scale Mandarin dataset show its effectiveness. In particular, on LRS2-BBC, our smallest model is on par with SoTA models with only 1/2 parameters and 1/8 unlabeled data (223h).
PortraitBooth: A Versatile Portrait Model for Fast Identity-preserved Personalization
Xu Peng · Junwei Zhu · Boyuan Jiang · Ying Tai · Donghao Luo · Jiangning Zhang · Wei Lin · Taisong Jin · Chengjie Wang · Rongrong Ji
Recent advancements in personalized image generation using diffusion models have been noteworthy. However, existing methods suffer from inefficiencies due to the requirement for subject-specific fine-tuning. This computationally intensive process hinders efficient deployment, limiting practical usability. Moreover, these methods often grapple with identity distortion and limited expression diversity. In light of these challenges, we propose PortraitBooth, an innovative approach designed for high efficiency, robust identity preservation, and expression-editable text-to-image generation, without the need for fine-tuning. PortraitBooth leverages subject embeddings from a face recognition model for personalized image generation without fine-tuning. It eliminates computational overhead and mitigates identity distortion. The introduced dynamic identity preservation strategy further ensures close resemblance to the original image identity. Moreover, PortraitBooth incorporates emotion-aware cross-attention control for diverse facial expressions in generated images, supporting text-driven expression editing. Its scalability enables efficient and high-quality image creation, including multi-subject generation. Extensive results demonstrate superior performance over other state-of-the-art methods in both single and multiple image generation scenarios.
ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding
Le Xue · Ning Yu · Shu Zhang · Artemis Panagopoulou · Junnan Li · Roberto Martín-Martín · Jiajun Wu · Caiming Xiong · Ran Xu · Juan Carlos Niebles · Silvio Savarese
Recent advancements in multimodal pre-training have shown promising efficacy in 3D representation learning by aligning multimodal features across 3D shapes, their 2D counterparts, and language descriptions. However, the methods used by existing frameworks to curate such multimodal data, in particular language descriptions for 3D shapes, are not scalable, and the collected language descriptions are not diverse. To address this, we introduce ULIP-2, a simple yet effective tri-modal pre-training framework that leverages large multimodal models to automatically generate holistic language descriptions for 3D shapes. It only needs 3D data as input, eliminating the need for any manual 3D annotations, and is therefore scalable to large datasets. ULIP-2 is also equipped with scaled-up backbones for better multimodal representation learning. We conduct experiments on two large-scale 3D datasets, Objaverse and ShapeNet, and augment them with tri-modal datasets of 3D point clouds, images, and language for training ULIP-2. Experiments show that ULIP-2 demonstrates substantial benefits in three downstream tasks: zero-shot 3D classification, standard 3D classification with fine-tuning, and 3D captioning (3D-to-language generation). It achieves a new SOTA of 50.6% (top-1) on Objaverse-LVIS and 84.7% (top-1) on ModelNet40 in zero-shot classification. In the ScanObjectNN benchmark for standard fine-tuning, ULIP-2 reaches an overall accuracy of 91.5% with a compact model of only 1.4 million parameters. ULIP-2 sheds light on a new paradigm for scalable multimodal 3D representation learning without human annotations and shows significant improvements over existing baselines. The code and datasets are released at https://github.com/salesforce/ULIP.
AVFF: Audio-Visual Feature Fusion for Video Deepfake Detection
Trevine Oorloff · Surya Koppisetti · Nicolo Bonettini · Divyaraj Solanki · Ben Colman · Yaser Yacoob · Ali Shahriyari · Gaurav Bharaj
With the rapid growth in deepfake video content, we require improved and generalizable methods to detect them. Most existing detection methods either use uni-modal cues or rely on supervised training to capture the dissonance between the audio and visual modalities. While the former disregards the audio-visual correspondences entirely, the latter predominantly focuses on discerning audio-visual cues within the training corpus, thereby potentially overlooking correspondences that can help detect unseen deepfakes. We present Audio-Visual Feature Fusion (AVFF), a two-stage cross-modal learning method that explicitly captures the correspondence between the audio and visual modalities for improved deepfake detection. The first stage pursues representation learning via self-supervision on real videos to capture the intrinsic audio-visual correspondences. To extract rich cross-modal representations, we use contrastive learning and autoencoding objectives, and introduce a novel audio-visual complementary masking and feature fusion strategy. The learned representations are tuned in the second stage, where deepfake classification is pursued via supervised learning on both real and fake videos. Extensive experiments and analysis suggest that our novel representation learning paradigm is highly discriminative in nature. We report 98.6% accuracy and 99.1% AUC on the FakeAVCeleb dataset, outperforming the current audio-visual state-of-the-art by 14.9% and 9.9%, respectively.
Language-aware Visual Semantic Distillation for Video Question Answering
Bo Zou · Chao Yang · Yu Qiao · Chengbin Quan · Youjian Zhao
Significant advancements in video question answering (VideoQA) have been made thanks to thriving large image-language pretraining frameworks. Although these image-language models can efficiently represent both video and language branches, they typically employ a goal-free vision perception process and do not interact vision with language well during the answer generation, thus omitting crucial visual cues. In this paper, we are inspired by the human recognition and learning pattern and propose VideoDistill, a framework with language-aware (i.e., goal-driven) behavior in both vision perception and answer generation process. VideoDistill generates answers only from question-related visual embeddings and follows a thinking-observing-answering approach that closely resembles human behavior, distinguishing it from previous research. Specifically, we develop a language-aware gating mechanism to replace the standard cross-attention, avoiding language's direct fusion into visual representations. We incorporate this mechanism into two key components of the entire framework. The first component is a differentiable sparse sampling module, which selects frames containing the necessary dynamics and semantics relevant to the questions. The second component is a vision refinement module that merges existing spatial-temporal attention layers to ensure the extraction of multi-grained visual semantics associated with the questions. We conduct experimental evaluations on various challenging video question-answering benchmarks, and VideoDistill achieves state-of-the-art performance in both general and long-form VideoQA datasets. In Addition, we verify that VideoDistill can effectively alleviate the utilization of language shortcut solutions in the EgoTaskQA dataset.
PerceptionGPT: Effectively Fusing Visual Perception into LLM
Renjie Pi · Lewei Yao · Jiahui Gao · Jipeng Zhang · Tong Zhang
The integration of visual inputs with large language models (LLMs) has led to remarkable advancements in multi-modal capabilities, giving rise to vision large language models (VLLMs). However, effectively harnessing LLMs for intricate visual perception tasks, such as detection and segmentation, remains a challenge. Conventional approaches achieve this by transforming perception signals (e.g., bounding boxes, segmentation masks) into sequences of discrete tokens, which struggle with the precision errors and introduces further complexities for training. In this paper, we present a novel end-to-end framework named PerceptionGPT, which represent the perception signals using LLM's dynamic token embedding. Specifically, we leverage lightweight encoders and decoders to handle the perception signals in LLM's embedding space, which takes advantage of the representation power of the high-dimensional token embeddings. Our approach significantly eases the training difficulties associated with the discrete representations in prior methods. Furthermore, owing to our compact representation, the inference speed is also greatly boosted. Consequently, PerceptionGPT enables accurate, flexible and efficient handling of complex perception signals. We validate the effectiveness of our approach through extensive experiments. The results demonstrate significant improvements over previous methods with only 4% trainable parameters and less than 25% training time.
Cooperation Does Matter: Exploring Multi-Order Bilateral Relations for Audio-Visual Segmentation
Qi Yang · Xing Nie · Tong Li · Gaopengfei · Ying Guo · Cheng Zhen · Pengfei Yan · Shiming Xiang
Recently, an audio-visual segmentation (AVS) task has been introduced, aiming to group pixels with sounding objects within a given video. This task necessitates a first-ever audio-driven pixel-level understanding of the scene, posing significant challenges. In this paper, we propose an innovative audio-visual transformer framework, termed COMBO, an acronym for COoperation of Multi-order Bilateral relatiOns. For the first time, our framework explores three types of bilateral entanglements within AVS: pixel entanglement, modality entanglement, and temporal entanglement. Regarding pixel entanglement, we employ a Siam-Encoder Module (SEM) that leverages prior knowledge to generate more precise visual features from the foundational model. For modality entanglement, we design a Bilateral-Fusion Module (BFM), enabling COMBO to align corresponding visual and auditory signals bi-directionally. As for temporal entanglement, we introduce an innovative adaptive inter-frame consistency loss according to the inherent rules of temporal. Comprehensive experiments and ablation studies on AVSBench-object (84.7 mIoU on S4, 59.2 mIou on MS3) and AVSBench-semantic (42.1 mIoU on AVSS) datasets demonstrate that COMBO surpasses previous state-of-the-art methods.
MV-Adapter: Multimodal Video Transfer Learning for Video Text Retrieval
bowen zhang · Xiaojie Jin · Weibo Gong · Kai Xu · Xueqing Deng · Peng Wang · Zhao Zhang · Xiaohui Shen · Jiashi Feng
State-of-the-art video-text retrieval (VTR) methods typically involve fully fine-tuning a pre-trained model (e.g. CLIP) on specific datasets. However, this can result in significant storage costs in practical applications as a separate model per task must be stored. To address this issue, we present our pioneering work that enables parameter-efficient VTR using a pre-trained model, with only a small number of tunable parameters during training. Towards this goal, we propose a new method dubbed Multimodal Video Adapter (MV-Adapter) for efficiently transferring the knowledge in the pre-trained CLIP from image-text to video-text. Specifically, MV-Adapter utilizes bottleneck structures in both video and text branches, along with two novel components. The first is a Temporal Adaptation Module that is incorporated in the video branch to introduce global and local temporal contexts. We also train weights calibrations to adjust to dynamic variations across frames. The second is Cross Modality Tying that generates weights for video/text branches through sharing cross modality factors, for better aligning between modalities. Thanks to above innovations, MV-Adapter can achieve comparable or better performance than standard fine-tuning with negligible parameters overhead. Notably, MV-Adapter consistently outperforms various competing methods in V2T/T2V tasks with large margins on five widely used VTR benchmarks (MSR-VTT, MSVD, LSMDC, DiDemo, and ActivityNet). Codes will be released.
Draw Step by Step: Reconstructing CAD Construction Sequences from Point Clouds via Multimodal Diffusion.
Weijian Ma · Shuaiqi Chen · Yunzhong Lou · Xueyang Li · Xiangdong Zhou
Reconstructing CAD construction sequences from raw 3D geometry serves as an interface between real-world objects and digital designs. In this paper, we propose CAD-Diffuser, a multimodal diffusion scheme aiming at integrating top-down design paradigm into generative reconstruction. In particular, we unify CAD point clouds and CAD construction sequences at the token level, guiding our proposed multimodal diffusion strategy to understand and link between the geometry and the design intent concentrated in construction sequences. Leveraging the strong decoding abilities of language models, the forward process is modeled as a random walk between the original token and the [MASK] token, while the reverse process naturally fits the masked token modeling scheme. A volume-based noise schedule is designed to encourage outline-first generation, decomposing the top-down design methodology into a machine-understandable procedure. For tokenizing CAD data of multiple modalities, we introduce a tokenizer with a self-supervised face segmentation task to compress local and global geometric information for CAD point clouds, and the CAD construction sequence is transformed into a primitive token string. Experimental results show that our CAD-Diffuser can perceive geometric details and the results are more likely to be reused by human designers.
AV-RIR: Audio-Visual Room Impulse Response Estimation
Anton Ratnarajah · Sreyan Ghosh · Sonal Kumar · Purva Chiniya · Dinesh Manocha
Accurate estimation of Room Impulse Response (RIR), which captures an environment's acoustic properties, can aid in synthesizing speech as if it were spoken in that environment. We propose AV-RIR, a novel multi-modal multi-task learning approach to accurately estimate the RIR from a given reverberant speech signal and the visual cues of its corresponding environment. AV-RIR builds on a novel neural architecture that effectively captures environment geometry and materials properties and solves speech dereverberation as an auxiliary task. We also propose Geo-Mat features that augment material information into visual cues and CRIP that improves late reverberation components in the estimated RIR via image-to-RIR retrieval by 86\%. Empirical results show that AV-RIR quantitatively outperforms previous audio-only and visual-only approaches by achieving 36\% - 63\% improvement across various acoustic metrics in RIR estimation. Additionally, it also achieves higher preference scores in human evaluation. As an auxiliary benefit, dereverbed speech from AV-RIR shows competitive performance with the state-of-the-art in a variety of spoken language processing tasks and outperforms $T_{60}$ error score in the real-world AVSpeech dataset. Code and qualitative examples of both synthesized reverberant speech and enhanced speech can be found in the supplementary.
Link-Context Learning for Multimodal LLMs
Yan Tai · Weichen Fan · Zhao Zhang · Ziwei Liu
The ability to learn from context with novel concepts, and deliver appropriate responses are essential in human conversations. Despite current Multimodal Large Language Models (MLLMs) and Large Language Models (LLMs) being trained on mega-scale datasets, recognizing unseen images or understanding novel concepts in a training-free manner remains a challenge. In-Context Learning (ICL) explores training-free few-shot learning, where models are encouraged to "learn to learn" from limited tasks and generalize to unseen tasks. In this work, we propose link-context learning (LCL), which emphasizes "reasoning from cause and effect" to augment the learning capabilities of MLLMs. LCL goes beyond traditional ICL by explicitly strengthening the causal relationship between the support set and the query set. By providing demonstrations with causal links, LCL guides the model to discern not only the analogy but also the underlying causal associations between data points, which empowers MLLMs to recognize unseen images and understand novel concepts more effectively. To facilitate the evaluation of this novel approach, we introduce the ISEKAI dataset, comprising exclusively of unseen generated image-label pairs designed for link-context learning. Extensive experiments show that our LCL-MLLM exhibits strong link-context learning capabilities to novel concepts over vanilla MLLMs.
Unveiling the Power of Audio-Visual Early Fusion Transformers with Dense Interactions through Masked Modeling
Shentong Mo · Pedro Morgado
Humans possess a remarkable ability to integrate auditory and visual information, enabling a deeper understanding of the surrounding environment. This early fusion of audio and visual cues, demonstrated through cognitive psychology and neuroscience research, offers promising potential for developing multimodal perception models. However, training early fusion architectures poses significant challenges, as the increased model expressivity requires robust learning frameworks to harness their enhanced capabilities. In this paper, we address this challenge by leveraging the masked reconstruction framework, previously successful in unimodal settings, to train audio-visual encoders with early fusion. Additionally, we propose an attention-based fusion module that captures interactions between local audio and visual representations, enhancing the model's ability to capture fine-grained interactions. While effective, this procedure can become computationally intractable, as the number of local representations increases. Thus, to address the computational complexity, we propose an alternative procedure that factorizes the local representations before representing audio-visual interactions. Extensive evaluations on a variety of datasets demonstrate the superiority of our approach in audio-event classification, visual sound localization, sound separation, and audio-visual segmentation. These contributions enable the efficient training of deeply integrated audio-visual models and significantly advance the usefulness of early fusion architectures.
Noisy-Correspondence Learning for Text-to-Image Person Re-identification
Yang Qin · Yingke Chen · Dezhong Peng · Xi Peng · Joey Tianyi Zhou · Peng Hu
Text-to-image person re-identification (TIReID) is a compelling topic in the cross-modal community, which aims to retrieve the target person based on a textual query. Although numerous TIReID methods have been proposed and achieved promising performance, they implicitly assume the training image-text pairs are correctly aligned, which is not always the case in real-world scenarios. In practice, the image-text pairs inevitably exist under-correlated or even false-correlated, $\textit{a.k.a}$ noisy correspondence (NC), due to the low quality of the images and annotation errors. To address this problem, we propose a novel Robust Dual Embedding method (RDE) that can learn robust visual-semantic associations even with NC. Specifically, RDE consists of two main components: 1) A Confident Consensus Division (CCD) module that leverages the dual-grained decisions of dual embedding modules to obtain a consensus set of clean training data, which enables the model to learn correct and reliable visual-semantic associations. 2) A Triplet-Alignment Loss (TAL) relaxes the conventional triplet-ranking loss with the hardest negative samples to a log-exponential upper bound over all negative ones, thus preventing the model collapse under NC and can also focus on hard-negative samples for promising performance. We conduct extensive experiments on three public benchmarks, namely CUHK-PEDES, ICFG-PEDES, and RSTPReID, to evaluate the performance and robustness of our RDE. Our method achieves state-of-the-art results both with and without synthetic noisy correspondences on all three datasets.
Mind Artist: Creating Artistic Snapshots with Human Thought
Jiaxuan Chen · Yu Qi · Yueming Wang · Gang Pan
We introduce Mind Artist (MindArt), a novel and efficient neural decoding architecture to snap artistic photographs from our mind in a controllable manner. Recently, progress has been made in image reconstruction with non-invasive brain recordings, but it's still difficult to generate realistic images with high semantic fidelity due to the scarcity of data annotations. Unlike previous methods, this work casts the neural decoding into optimal transport (OT) and representation decoupling problems. Specifically, under discrete OT theory, we design a graph matching-guided neural representation learning framework to seek the underlying correspondences between conceptual semantics and neural signals, which yields a natural and meaningful self-supervisory task. Moreover, the proposed MindArt, structured with multiple stand-alone modal branches, enables the seamless incorporation of semantic representation into any visual style information, thus leaving it to have multi-modal reconstruction and training-free semantic editing capabilities. By doing so, the reconstructed images of MindArt have phenomenal realism both in terms of semantics and appearance. We compare our MindArt with leading alternatives, and achieve SOTA performance in different decoding tasks. Importantly, our approach can directly generate a series of stylized “mind snapshots” w/o extra optimizations, which may open up more potential applications. Code is available at https://github.com/JxuanC/MindArt.
VTQA: Visual Text Question Answering via Entity Alignment and Cross-Media Reasoning
Kang Chen · Xiangqian Wu
Achieving the optimal form of Visual Question Answering mandates a profound grasp of understanding, grounding, and reasoning within the intersecting domains of vision and language. Traditional VQA benchmarks have predominantly focused on simplistic tasks such as counting, visual attributes, and object detection, which do not necessitate intricate cross-modal information understanding and inference. Motivated by the need for a more comprehensive evaluation, we introduce a novel dataset comprising 23,781 questions derived from 10,124 image-text pairs. Specifically, the task of this dataset requires the model to align multimedia representations of the same entity to implement multi-hop reasoning between image and text and finally use natural language to answer the question. Furthermore, we evaluate this VTQA dataset, comparing the performance of both state-of-the-art VQA models and our proposed baseline model, the Key Entity Cross-Media Reasoning Network (KECMRN). The VTQA task poses formidable challenges for traditional VQA models, underscoring its intrinsic complexity. Conversely, KECMRN exhibits a modest improvement, signifying its potential in multimedia entity alignment and multi-step reasoning. Our analysis underscores the diversity, difficulty, and scale of the VTQA task compared to previous multimodal QA datasets. In conclusion, we anticipate that this dataset will serve as a pivotal resource for advancing and evaluating models proficient in multimedia entity alignment, multi-step reasoning, and open-ended answer generation. Our dataset and code is available at https://visual-text-qa.github.io/.
THRONE: An Object-based Hallucination Benchmark for the Free-form Generations of Large Vision-Language Models
Prannay Kaul · Zhizhong Li · Hao Yang · Yonatan Dukler · Ashwin Swaminathan · CJ Taylor · Stefano Soatto
Mitigating hallucinations in large vision-language models (LVLMs) remains an open problem. Recent benchmarks do not address hallucinations in open-ended free-form responses, which we term “Type I hallucinations”. They focus on, if at all, hallucinations responding to very specific questions—yes-no or multiple-choice questions regarding a particular object or attribute—which we term “Type II hallucinations”, and they often require closed-source models which are subject to arbitrary change. Additionally, we observe a reduction in Type II hallucinations does not lead to a congruent reduction in Type I hallucations; rather, it often increases. We propose THRONE, a novel automatic framework for quantitatively evaluating Type I hallucinations in LVLM free-form outputs. We use public language models (LMs) to identify hallucinations in LVLM responses and compute informative metrics. We evaluate a large selection of recent LVLMs using public datasets. Our results show advances on existing metrics are disconnected from the reduction of Type I hallucinations, and established benchmarks for measuring Type I hallucination prevalence are incomplete. Finally, we provide a simple and effective data augmentation method to reduce Type I and Type II hallucinations as a strong baseline.
Data-Efficient Multimodal Fusion on a Single GPU
Noël Vouitsis · Zhaoyan Liu · Satya Krishna Gorti · Valentin Villecroze · Jesse C. Cresswell · Guangwei Yu · Gabriel Loaiza-Ganem · Maksims Volkovs
The goal of multimodal alignment is to learn a single latent space that is shared between multimodal inputs. The most powerful models in this space have been trained using massive datasets of paired inputs and large-scale computational resources, making them prohibitively expensive to train in many practical scenarios. We surmise that existing unimodal encoders pre-trained on large amounts of unimodal data should provide an effective bootstrap to create multimodal models from unimodal ones at much lower costs. We therefore propose FuseMix, a multimodal augmentation scheme that operates on the latent spaces of arbitrary pre-trained unimodal encoders. Using FuseMix for multimodal alignment, we achieve competitive performance -- and in certain cases outperform state-of-the art methods -- in both image-text and audio-text retrieval, with orders of magnitude less compute and data: for example, we outperform CLIP on the Flickr30K text-to-image retrieval task with $\sim 600\times$ fewer GPU days and $\sim 80\times$ fewer image-text pairs. Additionally, we show how our method can be applied to convert pre-trained text-to-image generative models into audio-to-image ones. Code is available at: https://github.com/layer6ai-labs/fusemix.
SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos
Changan Chen · Kumar Ashutosh · Rohit Girdhar · David Harwath · Kristen Grauman
We propose a novel self-supervised embedding to learn how actions sound from narrated in-the-wild egocentric videos. Whereas existing methods rely on curated data with known audio-visual correspondence, our multimodal contrastive-consensus coding (MC3) embedding reinforces the associations between audio, language, and vision when all modality pairs agree, while diminishing those associations when any one pair does not. We show our approach can successfully discover how subtle and long-tail human actions sound in egocentric video, outperforming an array of recent multimodal embedding techniques on two datasets (Ego4D and EPIC-Sounds) and multiple cross-modal tasks.
Accept the Modality Gap: An Exploration in the Hyperbolic Space
Sameera Ramasinghe · Violetta Shevchenko · Gil Avraham · Thalaiyasingam Ajanthan
Recent advancements in machine learning have spotlighted the potential of hyperbolic spaces as they effectively learn hierarchical feature representations. While there has been progress in leveraging hyperbolic spaces in single-modality contexts, its exploration in multimodal settings remains under explored. Some recent efforts have sought to transpose Euclidean multimodal learning techniques to hyperbolic spaces, by adopting geodesic distance based contrastive losses. However, we show both theoretically and empirically that such spatial proximity based contrastive loss significantly disrupts hierarchies in the latent space. To remedy this, we advocate that the cross-modal representations should accept the inherent modality gap between text and images, and introduce a novel approach to measure cross-modal similarity that does not enforce spatial proximity. Our approach show remarkable capabilities in preserving unimodal hierarchies while aligning the two modalities. Our experiments on a series of downstream tasks demonstrate that better latent structure emerges with our objective function while being superior in text-to-image and image-to-text retrieval tasks.
DiffSal: Joint Audio and Video Learning for Diffusion Saliency Prediction
Junwen Xiong · Peng Zhang · Tao You · Chuanyue Li · Wei Huang · Yufei Zha
Audio-visual saliency prediction can draw support from diverse modality complements, but further performance enhancement is still challenged by customized architectures as well as task-specific loss functions. In recent studies, denoising diffusion models have shown more promising in unifying task frameworks owing to their inherent ability of generalization. Following this motivation, a novel \textbf{Diff}usion architecture for generalized audio-visual \textbf{Sal}iency prediction (DiffSal) is proposed in this work, which formulates the prediction problem as a conditional generative task of the saliency map by utilizing input audio and video as the conditions. Based on the spatio-temporal audio-visual features, an extra network Saliency-UNet is designed to perform multi-modal attention modulation for progressive refinement of the ground-truth saliency map from the noisy map. Extensive experiments demonstrate that the proposed DiffSal can achieve excellent performance across six challenging audio-visual benchmarks, with an average relative improvement of 6.3\% over the previous state-of-the-art results by six metrics.
DiPrompT: Disentangled Prompt Tuning for Multiple Latent Domain Generalization in Federated Learning
Sikai Bai · Jie ZHANG · Song Guo · Shuaicheng Li · Jingcai Guo · Jun Hou · Tao Han · Xiaocheng Lu
Federated learning (FL) has emerged as a powerful paradigm for learning from decentralized data, and federated domain generalization further considers the test dataset (target domain) is absent from the decentralized training data (source domains). However, most existing FL methods assume that domain labels are provided during training, and their evaluation imposes explicit constraints on the number of domains, which must strictly match the number of clients. Because of the underutilization of numerous edge devices and additional cross-client domain annotations in the real world, such restrictions may be impractical and involve potential privacy leaks. In this paper, we propose an efficient and novel approach, called Disentangled Prompt Tuning (DiPrompT), a method that tackles the above restrictions by learning adaptive prompts for domain generalization in a distributed manner. Specifically, we first design two types of prompts, i.e., global prompt to capture general knowledge across all clients and domain prompts to capture domain-specific knowledge. They eliminate the restriction on the one-to-one mapping between source domains and local clients. Furthermore, a dynamic query metric is introduced to automatically search the suitable domain label for each sample, which includes two-substep text-image alignments based on prompt tuning without labor-intensive annotation. Extensive experiments on multiple datasets demonstrate that our DiPrompT achieves superior domain generalization performance over state-of-the-art FL methods when domain labels are not provided, and even outperforms many centralized learning methods using domain labels.
Probabilistic Speech-Driven 3D Facial Motion Synthesis: New Benchmarks Methods and Applications
Karren Yang · Anurag Ranjan · Jen-Hao Rick Chang · Raviteja Vemulapalli · Oncel Tuzel
We consider the task of animating 3D facial geometry from speech signal. Existing works are primarily deterministic, focusing on learning a one-to-one mapping from speech signal to 3D face meshes on small datasets with limited speakers. While these models can achieve high-quality lip articulation for speakers in the training set, they are unable to capture the full and diverse distribution of 3D facial motions that accompany speech in the real world. Importantly, the relationship between speech and facial motion is one-to-many, containing both inter-speaker and intra-speaker variations and necessitating a probabilistic approach. In this paper, we identify and address key challenges that have so far limited the development of probabilistic models: lack of datasets and metrics that are suitable for training and evaluating them, as well as the difficulty of designing a model that generates diverse results while remaining faithful to a strong conditioning signal as speech. We first propose large-scale benchmark datasets and metrics suitable for probabilistic modeling. Then, we demonstrate a probabilistic model that achieves both diversity and fidelity to speech, outperforming other methods across the proposed benchmarks. Finally, we showcase useful applications of probabilistic models trained on these large-scale datasets: we can generate diverse speech-driven 3D facial motion that matches unseen speaker styles extracted from reference clips; and our synthetic meshes can be used to improve the performance of downstream audio-visual models.
DIEM: Decomposition-Integration Enhancing Multimodal Insights
Xinyi Jiang · Guoming Wang · Junhao Guo · Juncheng Li · Wenqiao Zhang · Rongxing Lu · Siliang Tang
In image question answering, due to the abundant and sometimes redundant information, precisely matching and integrating the information from both text and images is a challenge. In this paper, we propose the Decomposition-Integration Enhancing Multimodal Insight (DIEM) which initially decomposes the given question and image into multiple subquestions and several sub-images aiming to isolate specific elements for more focused analysis. We then integrate these sub-elements by matching each subquestion with its relevant sub-images, while also retaining the original image, to construct a comprehensive answer to the original question without losing sight of the overall context. This strategy mirrors the human cognitive process of simplifying complex problems into smaller components for individual analysis, followed by an integration of these insights. We implement DIEM on the LLaVA-v1.5 model, and evaluate its performance on ScienceQA and MM-Vet. Experimental results indicate that our method boosts accuracy in most question classes of the ScienceQA (+2.03% in average), especially in the image modality (+3.40%). On MM-Vet, our method achieves an improvement in MM-Vet scores, increasing from 31.1 to 32.4. These findings highlight DIEM's effectiveness in harmonizing the complexities of multimodal data, demonstrating its ability to enhance accuracy and depth in image question answering through its decomposition-integration process.
MAFA: Managing False Negatives for Vision-Language Pre-training
Jaeseok Byun · Dohoon Kim · Taesup Moon
We consider a critical issue of false negatives in Vision- Language Pre-training (VLP), a challenge that arises from the inherent many-to-many correspondence of image-text pairs in large-scale web-crawled datasets. The presence of false negatives can impede achieving optimal performance and even lead to a significant performance drop. To address this challenge, we propose MAFA (MAnaging FAlse nega- tives), which consists of two pivotal components building upon the recently developed GRouped mIni-baTch sampling (GRIT) strategy: 1) an efficient connection mining process that identifies and converts false negatives into positives, and 2) label smoothing for the image-text contrastive (ITC) loss. Our comprehensive experiments verify the effectiveness of MAFA across multiple downstream tasks, emphasizing the crucial role of addressing false negatives in VLP, potentially even surpassing the importance of addressing false posi- tives. In addition, the compatibility of MAFA with the recent BLIP-family model is also demonstrated. Code is available at https://github.com/jaeseokbyun/MAFA.
AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation
Jeongsoo Choi · Se Jin Park · Minsu Kim · Yong Man Ro
This paper proposes a novel direct Audio-Visual Speech to Audio-Visual Speech Translation (AV2AV) framework, where the input and output of the system are multimodal (i.e., audio and visual speech). With the proposed AV2AV, two key advantages can be brought: 1) We can perform real-like conversations with individuals worldwide in a virtual meeting by utilizing our own primary languages. In contrast to Speech-to-Speech Translation (A2A), which solely translates between audio modalities, the proposed AV2AV directly translates between audio-visual speech. This capability enhances the dialogue experience by presenting synchronized lip movements along with the translated speech. 2) We can improve the robustness of the spoken language translation system. By employing the complementary information of audio-visual speech, the system can effectively translate spoken language even in the presence of acoustic noise, showcasing robust performance. To mitigate the problem of the absence of a parallel AV2AV translation dataset, we propose to train our spoken language translation system with the audio-only dataset of A2A. This is done by learning unified audio-visual speech representations through self-supervised learning in advance to train the translation system. Moreover, we propose an AV-Renderer that can generate raw audio and video in parallel. It is designed with zero-shot speaker modeling, thus the speaker in source audio-visual speech can be maintained at the target translated audio-visual speech. The effectiveness of AV2AV is evaluated with extensive experiments in a many-to-many language translation setting.
Enhancing Multimodal Cooperation via Sample-level Modality Valuation
Yake Wei · Ruoxuan Feng · Zihe Wang · Di Hu
One primary topic of multimodal learning is to jointly incorporate heterogeneous information from different modalities. However, most models often suffer from unsatisfactory multimodal cooperation, which cannot jointly utilize all modalities well. Some methods are proposed to identify and enhance the worse learnt modality, but they are often hard to provide the fine-grained observation of multimodal cooperation at sample-level with theoretical support. Hence, it is essential to reasonably observe and improve the fine-grained cooperation between modalities, especially when facing realistic scenarios where the modality discrepancy could vary across different samples. To this end, we introduce a sample-level modality valuation metric to evaluate the contribution of each modality for each sample. Via modality valuation, we observe that modality discrepancy indeed could be different at sample-level, beyond the global contribution discrepancy at dataset-level. We further analyze this issue and improve cooperation between modalities at sample-level by enhancing the discriminative ability of low-contributing modalities in a targeted manner. Overall, our methods reasonably observe the fine-grained uni-modal contribution and achieve considerable improvement. The source code and dataset are available at \url{https://github.com/GeWu-Lab/Valuate-and-Enhance-Multimodal-Cooperation}.
Diff-BGM: A Diffusion Model for Video Background Music Generation
Sizhe Li · Yiming Qin · Minghang Zheng · Xin Jin · Yang Liu
When editing a video, a piece of attractive background music is indispensable. Furthermore, the video background music generation tasks face several challenges, for example, the lack of suitable training datasets, and the difficulties in flexibly controlling the music generation process and sequentially aligning the video and music. In this work, we first propose a high-quality music-video dataset BGM909 with detailed semantics annotation and shot detection to provide multi-modal information about the video and music. We then present novel evaluation metrics that go beyond assessing music quality; we propose a metric for evaluating diversity and the alignment between music and video by incorporating retrieval precision metrics. Finally, we propose a framework named Diff-BGM to automatically generate the background music for a given video, which uses different signals to control different aspects of the music during the generation process, i.e., uses dynamic video features to control music rhythm and semantic features to control the melody and atmosphere. We propose to align the video and music sequentially by proposing a segment-aware cross-attention layer to enhance the temporal consistency between video and music. Experiments verify the effectiveness of our proposed method.
SaCo Loss: Sample-wise Affinity Consistency for Vision-Language Pre-training
WU Sitong · Haoru Tan · Zhuotao Tian · Yukang Chen · Xiaojuan Qi · Jiaya Jia
Vision-language pre-training (VLP) aims to learn joint representations of vision and language modalities. The contrastive paradigm is currently dominant in this field. However, we observe a notable misalignment phenomenon, that is, the affinity between samples has an obvious disparity across different modalities, namely ''Affinity Inconsistency Problem''. Our intuition is that, for a well-aligned model, two images that look similar to each other should have the same level of similarity as their corresponding texts that describe them. In this paper, we first investigate the reason of this inconsistency problem. We discover that the lack of consideration for sample-wise affinity consistency across modalities in existing training objectives is the central cause. To address this problem, we propose a novel loss function, named Sample-wise affinity Consistency (SaCo) loss, which is designed to enhance such consistency by minimizing the distance between image embedding similarity and text embedding similarity for any two samples. Our SaCo loss can be easily incorporated into existing vision-language models as an additional loss due to its complementarity for most training objectives. In addition, considering that pre-training from scratch is computationally expensive, we also provide a more efficient way to continuously pre-train on a converged model by integrating our loss. Experimentally, the model trained with our SaCo loss significantly outperforms the baseline on a variety of vision and language tasks.
MoPE-CLIP: Structured Pruning for Efficient Vision-Language Models with Module-wise Pruning Error Metric
Haokun Lin · Haoli Bai · Zhili Liu · Lu Hou · Muyi Sun · Linqi Song · Ying Wei · Zhenan Sun
Vision-language pre-trained models have achieved impressive performance on various downstream tasks.However, their large model sizes hinder their utilization on platforms with limited computational resources.We find that directly using smaller pre-trained models and applying magnitude-based pruning on CLIP models leads to inflexibility and inferior performance.Recent efforts for VLP compression either adopt uni-modal compression metrics resulting in limited performance or involve costly mask-search processes with learnable masks.In this paper, we first propose the Module-wise Pruning Error (MoPE) metric, accurately assessing CLIP module importance by performance decline on cross-modal tasks.Using the MoPE metric, we introduce a unified pruning framework applicable to both pre-training and task-specific fine-tuning compression stages. For pre-training, MoPE-CLIP effectively leverages knowledge from the teacher model, significantly reducing pre-training costs while maintaining strong zero-shot capabilities.For fine-tuning, consecutive pruning from width to depth yields highly competitive task-specific models.Extensive experiments in two stages demonstrate the effectiveness of the MoPE metric, and MoPE-CLIP outperforms previous state-of-the-art VLP compression methods.
Mitigating Noisy Correspondence by Geometrical Structure Consistency Learning
Zihua Zhao · Mengxi Chen · Tianjie Dai · Jiangchao Yao · Bo Han · Ya Zhang · Yanfeng Wang
Noisy correspondence that refers to mismatches in cross-modal data pairs, are prevalent on human-annotated or web-crawled datasets. Prior approaches to leverage such data mainly consider the application of uni-modal noisy label learning without amending the impact on both cross-modal and intra-modal geometrical structures in multimodal learning. Actually, we find that both structures are effective to discriminate noisy correspondence through structural differences when being well-established. Inspired by this observation, we introduce a Geometrical Structure Consistency (GSC) method to infer the true correspondence. Specifically, GSC ensures the preservation of geometrical structures within and between modalities, allowing for the accurate discrimination of noisy samples based on structural differences. Utilizing these inferred true correspondence labels, GSC refines the learning of geometrical structures by filtering out the noisy samples. Our experiments across three well-known cross-modal datasets confirm that GSC effectively identifies noisy samples under various conditions of noisy correspondence, and significantly outperforms the current leading methods.
DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection
Lewei Yao · Renjie Pi · Jianhua Han · Xiaodan Liang · Hang Xu · Wei Zhang · Zhenguo Li · Dan Xu
Existing open-vocabulary object detectors typically require a predefined set of categories from users, significantly confining their application scenarios. In this paper, we introduce DetCLIPv3, a high-performing detector that excels not only at both open-vocabulary object detection, but also generating hierarchical labels for detected objects. DetCLIPv3 is characterized by three core designs:1. Versatile model architecture: we derive a robust open-set detection framework which is further empowered with generation ability via the integration of a caption head. 2. High information density data: we develop an auto-annotation pipeline leveraging visual large language model to refine captions for large-scale image-text pairs, providing rich, multi-granular object labels to enhance the training. 3. Efficient training strategy: we employ a pre-training stage with low-resolution inputs that enables the object captioner to efficiently learn a broad spectrum of visual concepts from extensive image-text paired data. This is followed by a fine-tuning stage that leverages a small number of high-resolution samples to further enhance detection performance. With these effective designs, DetCLIPv3 demonstrates superior open-vocabulary detection performance, e.g., our Swin-T backbone model achieves a notable 47.0 zero-shot AP on the LVIS benchmark, outperforming GLIPv2, DetCLIPV2, and GroundingDINO by 6.6/18.0/19.6 AP, respectively. DetCLIPv3 also achieves a state-of-the-art 19.7 AP in dense captioning task on VG dataset, showcasing its strong generative capability.
Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification
Chao Yi · Lu Ren · De-Chuan Zhan · Han-Jia Ye
CLIP showcases exceptional cross-modal matching capabilities due to its training on image-text contrastive learning tasks. However, without specific optimization for unimodal scenarios, its performance in single-modality feature extraction might be suboptimal. Despite this, some studies have directly used CLIP’s image encoder for tasks like few-shot classification, introducing a misalignment between itspre-training objectives and feature extraction methods. This inconsistency can diminish the quality of the image's feature representation, adversely affecting CLIP’s effectiveness in target tasks. In this paper, we view text features as precise neighbors of image features in CLIP’s space and present a novel CrOss-moDal nEighbor Representation (CODER) based on the distance structure between images and their neighbor texts. This feature extraction method aligns better with CLIP’s pre-training objectives, thereby fully leveraging CLIP’s robust cross-modal capabilities. The key to construct a high-quality CODER lies in how to create a vast amount of high-quality and diverse texts to match with images. We introduce the Auto Text Generator (ATG) to automatically produce the required text in a data-free and training-free manner. We apply CODER to CLIP’s zero-shot and few-shot image classification tasks. Experiment results across various datasets and models confirm CODER’s effectiveness. Code is available at: https://github.com/YCaigogogo/CVPR24-CODER.
OmniVec2 - A Novel Transformer based Network for Large Scale Multimodal and Multitask Learning
Siddharth Srivastava · Gaurav Sharma
We present a novel multimodal multitask network and associated training algorithm.The method is capable of ingesting data from approximately 12 different modalitiesnamely image, video, audio, text, depth, point cloud, time series, tabular, graph, X-ray, infrared, IMU, and hyperspectral.The proposed approach utilizes modality specialized tokenizers, a shared transformer architecture, and cross-attention mechanisms to project the data from different modalities into a unified embedding space. It addresses multimodal and multitask scenarios by incorporating modality-specific task heads for different tasks in respective modalities. We propose a novel pretraining strategy with iterative modality switching to initialize the network, and a training algorithm which trades off fully joint training over all modalities, with training on pairs of modalities at a time. We provide comprehensive evaluation across 25 datasets from 12 modalities and show state of the art performances, demonstrating the effectiveness of the proposed architecture, pretraining strategy and adapted multitask training.
CoDi-2: In-Context Interleaved and Interactive Any-to-Any Generation
Zineng Tang · Ziyi Yang · MAHMOUD KHADEMI · Yang Liu · Chenguang Zhu · Mohit Bansal
We present CoDi-2, a Multimodal Large Language Model (MLLM) for learning in-context interleaved multi-modal representations. By aligning modalities with language for both encoding and generation, CoDi-2 empowers Large Language Models (LLMs) to understand modality- interleaved instructions and in-context examples and autoregressively generate grounded and coherent multimodal outputs in an any-to-any input-output modality paradigm. To train CoDi-2, we build a large-scale generation dataset encompassing in-context multimodal instructions across text, vision, and audio. CoDi-2 demonstrates a wide range of zero-shot and few-shot capabilities for tasks like editing, exemplar learning, composition, reasoning, etc. CoDi-2 surpasses previous domain-specific models on tasks such as subject-driven image generation, vision transformation, and audio editing and showcases a significant advancement for integrating diverse multimodal tasks with sequential generation.
Differentiable Information Bottleneck for Deterministic Multi-view Clustering
Xiaoqiang Yan · Zhixiang Jin · Fengshou Han · Yangdong Ye
In recent several years, the information bottleneck (IB) principle provides an information-theoretic framework for deep multi-view clustering (MVC) by compressing multi-view observations while preserving the relevant information of multiple views. Although existing IB-based deep MVC methods have achieved huge success, they rely on variational approximation and distribution assumption to estimate the lower bound of mutual information, which is a notoriously hard and impractical problem in high-dimensional multi-view spaces. In this work, we propose a new differentiable information bottleneck (DIB) method, which provides a deterministic and analytical MVC solution by fitting the mutual information without the necessity of variational approximation. Specifically, we first propose to directly fit the mutual information of high-dimensional spaces by leveraging normalized kernel Gram matrix, which does not require any auxiliary neural estimator to estimate the lower bound of mutual information. Then, based on the new mutual information measurement, a deterministic multi-view neural network with analytical gradients is explicitly trained to parameterize IB principle, which derives a deterministic compression of input variables from different views. Finally, a triplet consistency discovery mechanism is devised, which is capable of mining the feature consistency, cluster consistency and joint consistency based on the deterministic and compact representations. Extensive experimental results show the superiority of our DIB method on 6 benchmarks compared with 13 state-of-the-art baselines.
A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition
Yusheng Dai · HangChen · Jun Du · Ruoyu Wang · shihao chen · Haotian Wang · Chin-Hui Lee
Advanced Audio-Visual Speech Recognition (AVSR) systems have been observed to be sensitive to missing video frames, performing even worse than single-modality models. While applying the common dropout techniques to the video modality enhances robustness to missing frames, it simultaneously results in a performance loss when dealing with complete data input. In this study, we delve into this contrasting phenomenon through the lens of modality bias and uncover that an excessive modality bias towards the audio modality induced by dropout constitutes the fundamental cause. Next, we present the Modality Bias Hypothesis (MBH) to systematically describe the relationship between the modality bias and the robustness against missing modality in multimodal systems. Building on these findings, we propose a novel Multimodal Distribution Approximation with Knowledge Distillation (MDA-KD) framework to reduce over-reliance on the audio modality, maintaining performance and robustness simultaneously. Finally, to address an entirely missing modality, we adopt adapters to dynamically switch decision strategies. The effectiveness of our proposed approach is evaluated through comprehensive experiments on the MISP2021 and MISP2022 datasets. Our code is available at https://github.com/dalision/ModalBiasAVSR.
Multimodal Representation Learning by Alternating Unimodal Adaptation
Xiaohui Zhang · Jaehong Yoon · Mohit Bansal · Huaxiu Yao
Multimodal learning, which integrates data from diverse sensory modes, plays a pivotal role in artificial intelligence. However, existing multimodal learning methods often struggle with challenges where some modalities appear more dominant than others during multimodal learning, resulting in suboptimal performance. To address this challenge, we propose MLA (Multimodal Learning with Alternating Unimodal Adaptation). MLA reframes the conventional joint multimodal learning process by transforming it into an alternating unimodal learning process, thereby minimizing interference between modalities. Simultaneously, it captures cross-modal interactions through a shared head, which undergoes continuous optimization across different modalities. This optimization process is controlled by a gradient modification mechanism to prevent the shared head from losing previously acquired information. During the inference phase, MLA utilizes a test-time uncertainty-based model fusion mechanism to integrate multimodal information. Extensive experiments are conducted on five diverse datasets, encompassing scenarios with complete modalities and scenarios with missing modalities. These experiments demonstrate the superiority of MLA over competing prior approaches.
View-Category Interactive Sharing Transformer for Incomplete Multi-View Multi-Label Learning
Shilong Ou · Zhe Xue · Yawen Li · Meiyu Liang · Yuanqiang Cai · junjiang wu
As a problem often encountered in real-world scenarios, multi-view multi-label learning has attracted considerable research attention. However, due to oversights in data collection and uncertainties in manual annotation, real-world data often suffer from incompleteness. Regrettably, most existing multi-view multi-label learning methods sidestep missing views and labels. Furthermore, they often neglect the potential of harnessing complementary information between views and labels, thus constraining their classification capabilities. To address these challenges, we propose a view-category interactive sharing transformer tailored for incomplete multi-view multi-label learning. Within this network, we incorporate a two-layer transformer module to characterize the interplay between views and labels. Additionally, to address view incompleteness, a KNN-style missing view generation module is employed. Finally, we introduce a view-category consistency guided embedding enhancement module to align different views and improve the discriminating power of the embeddings. Collectively, these modules synergistically integrate to classify the incomplete multi-view multi-label data effectively. Extensive experiments substantiate that our approach outperforms the existing state-of-the-art methods.
Scalable 3D Registration via Truncated Entry-wise Absolute Residuals
Tianyu Huang · Liangzu Peng · Rene Vidal · Yun-Hui Liu
Given an input set of $3$D point pairs, the goal of outlier-robust $3$D registration is to compute some rotation and translation that align as many point pairs as possible. This is an important problem in computer vision, for which many highly accurate approaches have been recently proposed. Despite their impressive performance, these approaches lack scalability, often overflowing the $16$GB of memory of a standard laptop to handle roughly $30,000$ point pairs. In this paper, we propose a $3$D registration approach that can process more than ten million ($10^7$) point pairs with over $99$\% random outliers. Moreover, our method is efficient, entails low memory costs, and maintains high accuracy at the same time. We call our method TEAR, as it involves minimizing an outlier-robust loss that computes Truncated Entry-wise Absolute Residuals. To minimize this loss, we decompose the original $6$-dimensional problem into two subproblems of dimensions $3$ and $2$, respectively, solved in succession to global optimality via a customized branch-and-bound method. While branch-and-bound is often slow and unscalable, this does not apply to TEAR as we propose novel bounding functions that are tight and computationally efficient. Experiments on various datasets are conducted to validate the scalability and efficiency of our method.
Partial-to-Partial Shape Matching with Geometric Consistency
Viktoria Ehm · Maolin Gao · Paul Roetzer · Marvin Eisenberger · Daniel Cremers · Florian Bernard
Finding correspondences between 3D shapes is an important and long-standing problem in computer vision, graphics and beyond. A prominent challenge are partial-to-partial shape matching settings, which occur when the shapes to match are only observed incompletely (e.g. from 3D scanning). Although partial-to-partial matching is a highly relevant setting in practice, it is rarely explored. Our work bridges the gap between existing (rather artificial) 3D full shape matching and partial-to-partial real-world settings by exploiting geometric consistency as a strong constraint. We demonstrate that it is indeed possible to solve this challenging problem in a variety of settings. For the first time, we achieve geometric consistency for partial-to-partial matching, which is realized by a novel integer non-linear program formalism building on triangle product spaces, along with a new pruning algorithm based on linear integer programming. Further, we generate a new inter-class dataset for partial-to-partial shape-matching. We show that our method outperforms current SOTA methods on both an established intra-class dataset and our novel inter-class dataset.
Towards Robust Learning to Optimize with Theoretical Guarantees
Qingyu Song · Wei Lin · Juncheng Wang · Hong Xu
Learning to optimize (L2O) is an emerging technique to solve mathematical optimization problems with learning-based methods. Although with great success in many real-world scenarios such as wireless communications, computer networks, and electronic design, existing L2O works lack theoretical demonstration of their performance and robustness in out-of-distribution (OOD) scenarios. We address this gap by providing comprehensive proofs. First, we prove a sufficient condition for a robust L2O model with homogeneous convergence rates over all In-Distribution (InD) instances. We assume an L2O model achieves robustness for an InD scenario. Based on our proposed methodology of aligning OOD problems to InD problems, we also demonstrate that the L2O model's convergence rate in OOD scenarios will deteriorate by an equation of the L2O model's input features. Moreover, we propose an L2O model with a concise gradient-only feature construction and a novel gradient-based history modeling method. Numerical simulation demonstrates that our proposed model outperforms the state-of-the-art baseline in both InD and OOD scenarios and achieves up to 10 $\times$ convergence speedup. The code of our method can be found from https://github.com/NetX-lab/GoMathL2O-Official.
From Variance to Veracity: Unbundling and Mitigating Gradient Variance in Differentiable Bundle Adjustment Layers
Swaminathan Gurumurthy · Karnik Ram · Bingqing Chen · Zachary Manchester · Zico Kolter
Various pose estimation and tracking problems in robotics can be decomposed into a correspondence estimation problem (often computed using a deep network) followed by a weighted least squares optimization problem to solve for the poses. Recent work has shown that coupling the two problems by iteratively refining one conditioned on the other's output yields SOTA results across domains. However, training these models has proved challenging, requiring a litany of tricks to stabilize and speed up training. In this work, we take the visual odometry problem as an example and identify three plausible causes: (1) flow loss interference, (2) linearization errors in the bundle adjustment (BA) layer, and (3) dependence of weight gradients on the BA residual. We show how these issues result in noisy and higher variance gradients, potentially leading to a slow down in training and instabilities. We then propose a simple solution to reduce the gradient variance by using the weights predicted by the network in the inner optimization loop to also weight the correspondence objective in the training problem. This helps the training objective 'focus' on the more important points, thereby reducing the variance and mitigating the influence of outliers. We show that the resulting method leads to faster training and can be more flexibly trained in varying training setups without sacrificing performance. In particular we show 2-2.5x training speedups over a baseline visual odometry model we modify.
DIMAT: Decentralized Iterative Merging-And-Training for Deep Learning Models
Nastaran Saadati · Minh Pham · Nasla Saleem · Joshua R. Waite · Aditya Balu · Zhanhong Jiang · Chinmay Hegde · Soumik Sarkar
Recent advances in decentralized deep learning algorithms have demonstrated cutting-edge performance on various tasks with large pre-trained models. However, a pivotal prerequisite for achieving this level of competitiveness is the significant communication and computation overheads when updating these models, which prohibits the applications of them to real-world scenarios.To address this issue, drawing inspiration from advanced model merging techniques without requiring additional training, we introduce the Decentralized Iterative Merging-And-Training (DIMAT) paradigm—a novel decentralized deep learning framework. Within DIMAT, each agent is trained on their local data and periodically merged with their neighboring agents using advanced model merging techniques like activation matching until convergence is achieved. DIMAT provably converges with the best available rate for nonconvex functions with various first-order methods, while yielding tighter error bounds compared to the popular existing approaches. We conduct a comprehensive empirical analysis to validate DIMAT's superiority over baselines across diverse computer vision tasks sourced from multiple datasets. Empirical results validate our theoretical claims by showing that DIMAT attains faster and higher initial gain in accuracy with independent and identically distributed (IID) and non-IID data, incurring lower communication overhead. This DIMAT paradigm presents a new opportunity for the future decentralized learning, enhancing its adaptability to real-world with sparse and light-weight communication and computation.
Ink Dot-Oriented Differentiable Optimization for Neural Image Halftoning
Hao Jiang · Bingfeng Zhou · Yadong Mu
Halftoning is a time-honored printing technique that simulates continuous tones using ink dots (halftone dots). The resurgence of deep learning has catalyzed the emergence of innovative technologies in the printing industry, fostering the advancement of data-driven halftoning methods. Nevertheless, current deep learning-based approaches produce halftones through image-to-image black box transformations, lacking direct control over the movement of individual halftone dots. In this paper, we propose an innovative halftoning method termed ``neural dot-controllable halftoning". This method allows dot-level image dithering by providing direct control over the motion of each ink dot. We conceptualize halftoning as the process of sprinkling dots on a canvas. Initially, a specific quantity of dots are randomly dispersed on the canvas and subsequently adjusted based on the surrounding grayscale and gradient. To establish differentiable transformations between discrete ink dot positions and halftone matrices, we devise a lightweight dot encoding network to spread dense gradients to sparse dots. Dot control offers several advantages to our approach, including the capability to regulate the quantity of halftone dots and enhance specific areas with artifacts in the generated halftones by adjusting the placement of the dots. Our proposed method exhibits superior performance than previous approaches in extensive quantitative and qualitative experiments.
Are Conventional SNNs Really Efficient? A Perspective from Network Quantization
Guobin Shen · Dongcheng Zhao · Tenglong Li · Jindong Li · Yi Zeng
Spiking Neural Networks (SNNs) have been widely praised for their high energy efficiency and immense potential. However, comprehensive research that critically contrasts and correlates SNNs with quantized Artificial Neural Networks (ANNs) remains scant, often leading to skewed comparisons lacking fairness towards ANNs. This paper introduces a unified perspective, illustrating that the simulation steps in SNNs and quantized bit-widths of activation values present analogous representations. Building on this, we present a more pragmatic and rational approach to estimating the energy consumption of SNNs. Diverging from the conventional Synaptic Operations (SynOps), we champion the "Bit Budget" concept. This notion permits an intricate discourse on strategically allocating computational and storage resources between weights, activation values, and temporal steps under stringent hardware constraints. Guided by the Bit Budget paradigm, we discern that pivoting efforts towards spike patterns and weight quantization, rather than temporal attributes, elicits profound implications for model performance. Utilizing the Bit Budget for holistic design consideration of SNNs elevates model performance across diverse data types, encompassing static imagery and neuromorphic datasets. Our revelations bridge the theoretical chasm between SNNs and quantized ANNs and illuminate a pragmatic trajectory for future endeavors in energy-efficient neural computations.
FedMef: Towards Memory-efficient Federated Dynamic Pruning
Hong Huang · Weiming Zhuang · Chen Chen · Lingjuan Lyu
Federated learning (FL) promotes decentralized training while prioritizing data confidentiality. However, its application on resource-constrained devices is challenging due to the high demand for computation and memory resources to train deep learning models. Neural network pruning techniques, such as dynamic pruning, could enhance model efficiency, but directly adopting them in FL still poses substantial challenges, including post-pruning performance degradation, high activation memory, etc. To address these challenges, we propose FedMef, a novel and memory-efficient federated dynamic pruning framework. FedMef comprises two key components. First, we introduce the budget-aware extrusion that maintains pruning efficiency while preserving post-pruning performance by salvaging crucial information from parameters marked for pruning within a given budget. Second, we propose scaled activation pruning to effectively reduce activation memory, which is particularly beneficial for deploying FL to memory-limited devices. Extensive experimentsdemonstrate the effectiveness of our proposed FedMef. In particular, it achieves a significant reduction of 28.5\% in memory footprint compared to state-of-the-art methods while obtaining superior accuracy.
SD4Match: Learning to Prompt Stable Diffusion Model for Semantic Matching
Xinghui Li · Jingyi Lu · Kai Han · Victor Adrian Prisacariu
In this paper, we address the challenge of matching semantically similar keypoints across image pairs. Existing research indicates that the intermediate output of the UNet within the Stable Diffusion (SD) can serve as robust image feature maps for such a matching task. We demonstrate that by employing a basic prompt tuning technique, the inherent potential of Stable Diffusion can be harnessed, resulting in a significant enhancement in accuracy over previous approaches. We further introduce a novel conditional prompting module that conditions the prompt on the local details of the input image pairs, leading to a further improvement in performance. We designate our approach as SD4Match, short for Stable Diffusion for Semantic Matching. Comprehensive evaluations of SD4Match on the PF-Pascal, PF-Willow, and SPair-71k datasets show that it sets new benchmarks in accuracy across all these datasets. Particularly, SD4Match outperforms the previous state-of-the-art by a margin of 12 percentage points on the challenging SPair-71k dataset.
Purified and Unified Steganographic Network
GuoBiao Li · Sheng Li · Zicong Luo · Zhenxing Qian · Xinpeng Zhang
Steganography is the art of hiding secret data into the cover media for covert communication. In recent years, more and more deep neural network (DNN)-based steganographic schemes are proposed to train steganographic networks for secret embedding and recovery, which are shown to be promising. Compared with the handcrafted steganographic tools, steganographic networks tend to be large in size. It raises concerns on how to imperceptibly and effectively transmit these networks to the sender and receiver to facilitate the covert communication. To address this issue, we propose in this paper a Purified and Unified Steganographic Network (PUSNet). It performs an ordinary machine learning task in a purified network, which could be triggered into steganographic networks for secret embedding or recovery using different keys. We formulate the construction of the PUSNet into a sparse weight filling problem to flexibly switch between the purified and steganographic networks. We further instantiate our PUSNet as an image denoising network with two steganographic networks concealed for secret image embedding and recovery. Comprehensive experiments demonstrate that our PUSNet achieves good performance on secret image embedding, secret image recovery, and image denoising in a single architecture. It is also shown to be capable of imperceptibly carrying the steganographic networks in a purified network.
Learned Lossless Image Compression based on Bit Plane Slicing
Zhe Zhang · Huairui Wang · Zhenzhong Chen · Shan Liu
Autoregressive Initial Bits (ArIB), a framework that combines subimage autoregression and latent variable models, has shown its advantages in lossless image compression. However, in current methods, the image splitting makes the information of latent variables being uniformly distributed in each subimage, and causes inadequate use of latent variables in addition to posterior collapse. To tackle these issues, we introduce Bit Plane Slicing (BPS), splitting images in the bit plane dimension with the considerations on different importance for latent variables. Thus, BPS provides a more effective representation by arranging subimages with decreasing importance for latent variables. To solve the problem of the increased number of dimensions caused by BPS, we further propose a dimension-tailored autoregressive model that tailors autoregression methods for each dimension based on their characteristics, efficiently capturing the dependencies in plane, space, and color dimensions. As shown in the extensive experimental results, our method demonstrates the superior compression performance with comparable inference speed, when compared to the state-of-the-art normalizing-flow-based methods. The code is at https://github.com/ZZ022/ArIB-BPS.
The problem of calibrating deep neural networks (DNNs) for multi-label learning is considered. It is well-known that DNNs trained by cross-entropy for single-label, or one-hot, classification are poorly calibrated. Many calibration techniques have been proposed to address the problem. However, little attention has been paid to the calibration of multi-label DNNs. In this literature, the focus has been on improving labeling accuracy in the face of severe dataset unbalance. This is addressed by the introduction of asymmetric losses, which have became very popular. However, these losses do not induce well calibrated classifiers. In this work, we first provide a theoretical explanation for this poor calibration performance, by showing that these loses losses lack the strictly proper property, a necessary condition for accurate probability estimation. To overcome this problem, we propose a new Strictly Proper Asymmetric (SPA) loss. This is complemented by a Label Pair Regularizer (LPR) that increases the number of calibration constraints introduced per training example. The effectiveness of both contributions is validated by extensive experiments on various multi-label datasets. The resulting training method is shown to significantly decrease the calibration error while maintaining state-of-the-art accuracy.
Improving Generalization via Meta-Learning on Hard Samples
Nishant Jain · Arun Suggala · Pradeep Shenoy
Learned reweighting (LRW) approaches to supervised learning use an optimization criterion to assign weights for training instances, in order to maximize performance on a representative validation dataset. We pose and formalize the problem of optimized selection of the validation set used in LRW training, to improve classifier generalization. In particular, we show that using hard-to-classify instances in the validation set has both a theoretical connection to, and strong empirical evidence of generalization. We provide an efficient algorithm for training this meta-optimized model, as well as a simple train-twice heuristic for careful comparative study. We demonstrate that LRW with easy validation data performs consistently worse than LRW with hard validation data, establishing the validity of our meta-optimization problem. Our proposed algorithm outperforms a wide range of baselines on a range of datasets and domain shift challenges (Imagenet-1K, CIFAR-100, Clothing-1M, CAMELYON, WILDS, etc.), with ~1\% gains using VIT-B on Imagenet. We also show that using naturally hard examples for validation (Imagenet-R / Imagenet-A) in LRW training for Imagenet improves performance on both clean and naturally hard test instances by 1-2\%. Secondary analyses show that using hard validation data in an LRW framework improves margins on test data, hinting at the mechanism underlying our empirical gains. We believe this work opens up new research directions for the meta-optimization of meta-learning in a supervised learning context.
Learning with Structural Labels for Learning with Noisy Labels
Noo-ri Kim · Jin-Seop Lee · Jee-Hyong Lee
Deep Neural Networks (DNNs) have demonstrated remarkable performance across diverse domains and tasks with large-scale datasets. To reduce labeling costs, semi-automated and crowdsourcing labeling methods are developed, but their labels are inevitably noisy. Learning with Noisy Labels (LNL) approaches aim to train DNNs despite the presence of noisy labels. These approaches leverage the memorization effect to acquire more accurate labels through a process of relabeling and selection, subsequently using these refined labels for next training. However, these methods encounter a significant decrease in the model's generalization performance due to the inevitably existing noise labels. To overcome this limitation, we propose a new approach to enhance learning with noisy labels by incorporating additional distribution information—structural labels. In order to leverage additional distribution information for generalization, we utilize a reverse k-NN, which helps the model achieve a simpler feature manifold and avoid overfitting to noisy labels. The proposed method shows outperformed performance in multiple benchmark datasets with synthetic and real-world datasets.
DiffuseMix: Label-Preserving Data Augmentation with Diffusion Models
Khawar Islam · Muhammad Zaigham Zaheer · Arif Mahmood · Karthik Nandakumar
Recently, a number of image-mixing-based augmentation techniques have been introduced to improve the generalization of deep neural networks. In these techniques, two or more randomly selected natural images are mixed together to generate an augmented image. Such methods may not only omit important portions of the input images but also introduce label ambiguities by mixing images across labels resulting in misleading supervisory signals. To address these limitations, we propose DiffuseMix, a novel data augmentation technique that leverages a diffusion model to reshape training images, supervised by our bespoke conditional prompts. First, concatenation of a partial natural image and its generated counterpart is obtained which helps in avoiding the generation of unrealistic images or label ambiguities. Then, to avoid over-fitting on generated images, a randomly selected pattern from a set of fractal images is blended into the concatenated image to form the final augmented image for training. Our empirical results on seven different datasets reveal that DiffuseMix achieves superior performance compared to existing state-of-the-art methods on tasks including general classification, fine-grained classification, fine-tuning, data scarcity, and adversarial robustness.
Improving Out-of-Distribution Generalization in Graphs via Hierarchical Semantic Environments
Yinhua Piao · Sangseon Lee · Yijingxiu Lu · Sun Kim
Out-of-distribution (OOD) generalization in the graph domain is challenging due to complex distribution shifts and a lack of environmental contexts. Recent methods attempt to enhance graph OOD generalization by generating flat environments. However, such flat environments come with inherent limitations to capture more complex data distributions. Considering the DrugOOD dataset, which contains diverse training environments (e.g., scaffold, size, etc.), flat contexts cannot sufficiently address its high heterogeneity. Thus, a new challenge is posed to generate more semantically enriched environments to enhance graph invariant learning for handling distribution shifts. In this paper, we propose a novel approach to generate hierarchical semantic environments for each graph. Firstly, given an input graph, we explicitly extract variant subgraphs from the input graph to generate proxy predictions on local environments. Then, stochastic attention mechanisms are employed to re-extract the subgraphs for regenerating global environments in a hierarchical manner. In addition, we introduce a new learning objective that guides our model to learn the diversity of environments within the same hierarchy while maintaining consistency across different hierarchies. This approach enables our model to consider the relationships between environments and facilitates robust graph invariant learning. Extensive experiments on real-world graph data have demonstrated the effectiveness of our framework. Particularly, in the challenging dataset DrugOOD, our method achieves up to 1.29% and 2.83% improvement over the best baselines on IC50 and EC50 prediction tasks, respectively.
Patch2Self2: Self-supervised Denoising on Coresets via Matrix Sketching
Shreyas Fadnavis · Agniva Chowdhury · Joshua Batson · Petros Drineas · Eleftherios Garyfallidis
Diffusion MRI (dMRI) non-invasively maps brain white matter, yet necessitates denoising due to low signal-to-noise ratios. Patch2Self (P2S), employing self-supervised techniques and regression on a Casorati matrix, effectively denoises dMRI images and has become the new de-facto standard in this field. P2S however is resource intensive, both in terms of running time and memory usage, as it uses all voxels ($n$) from all-but-one held-in volumes ($d-1$) to learn a linear mapping $\Phi : \mathbb{R}^{n \times (d-1)} \mapsto \mathbb{R}^{n}$ for denoising the held-out volume. The increasing size and dimensionality of higher resolution dMRI acquisitions can make P2S infeasible for large-scale analyses. This work exploits the redundancy imposed by P2S to alleviate its performance issues and inspect regions that influence the noise disproportionately. Specifically, this study makes a three-fold contribution: (1) We present Patch2Self2 (P2S2), a method that uses matrix sketching to perform self-supervised denoising. By solving a sub-problem on a smaller sub-space, so called, coreset, we show how P2S2 can yield a significant speedup in training time while using less memory. (2) We present a theoretical analysis of P2S2, focusing on determining the optimal sketch size through rank estimation, a key step in achieving a balance between denoising accuracy and computational efficiency. (3) We show how the so-called statistical leverage scores can be used to interpret the denoising of dMRI data, a process that was traditionally treated as a black-box. Experimental results on both simulated and real data affirm that P2S2 maintains denoising quality while significantly enhancing speed and memory efficiency, achieved by training on a reduced data subset.
G-FARS: Gradient-Field-based Auto-Regressive Sampling for 3D Part Grouping
Junfeng Cheng · Tania Stathaki
This paper proposes a novel task named "3D part grouping". Suppose there is a mixed set containing scattered parts from various shapes. This task requires algorithms to find out every possible combination among all the parts. To address this challenge, we propose the so called Gradient Field-based Auto-Regressive Sampling framework (G-FARS) tailored specifically for the 3D part grouping task. In our framework, we design a gradient-field-based selection graph neural network (GNN) to learn the gradients of a log conditional probability density in terms of part selection, where the condition is the given mixed part set. This innovative approach, implemented through the gradient-field-based selection GNN, effectively captures complex relationships among all the parts in the input. Upon completion of the training process, our framework becomes capable of autonomously grouping 3D parts by iteratively selecting them from the mixed part set, leveraging the knowledge acquired by the trained gradient-field-based selection GNN. Our code is available at: https://github.com/J-F-Cheng/G-FARS-3DPartGrouping.
Decompose-and-Compose: A Compositional Approach to Mitigating Spurious Correlation
Fahimeh Hosseini Noohdani · Parsa Hosseini · Aryan Yazdan Parast · Hamidreza Araghi · Mahdieh Baghshah
While standard Empirical Risk Minimization (ERM) training is proven effective for image classification on in-distribution data, it fails to perform well on out-of-distribution samples. One of the main sources of distribution shift for image classification is the compositional nature of images. Specifically, in addition to the main object or component(s) determining the label, some other image components usually exist, which may lead to the shift of input distribution between train and test environments. More importantly, these components may have spurious correlations with the label. To address this issue, we propose Decompose-and-Compose (DaC), which improves robustness to correlation shift by a compositional approach based on combining elements of images. Based on our observations, models trained with ERM usually highly attend to either the causal components or the components having a high spurious correlation with the label (especially in datapoints on which models have a high confidence). In fact, according to the amount of spurious correlation and the easiness of classification based on the causal or non-causal components, the model usually attends to one of these more (on samples with high confidence). Following this, we first try to identify the causal components of images using class activation maps of models trained with ERM. Afterward, we intervene on images by combining them and retraining the model on the augmented data, including the counterfactual ones. This work proposes a group-balancing method by intervening on images without requiring group labels or information regarding the spurious features during training. The method has an overall better worst group accuracy compared to previous methods with the same amount of supervision on the group labels in correlation shift. Our code is available at https://github.com/fhn98/DaC.
SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery
Xin Guo · Jiangwei Lao · Bo Dang · Yingying Zhang · Lei Yu · Lixiang Ru · Liheng Zhong · Ziyuan Huang · Kang Wu · Dingxiang Hu · HUIMEI HE · Jian Wang · Jingdong Chen · Ming Yang · Yongjun Zhang · Yansheng Li
Prior studies on Remote Sensing Foundation Model (RSFM) reveal immense potential towards a generic model for Earth Observation. Nevertheless, these works primarily focus on a single modality without temporal and geo-context modeling, hampering their capabilities for diverse tasks. In this study, we present SkySense, a generic billion-scale model, pre-trained on a curated multi-modal Remote Sensing Imagery (RSI) dataset with 21.5 million temporal sequences. SkySense incorporates a factorized multi-modal spatiotemporal encoder taking temporal sequences of optical and Synthetic Aperture Radar (SAR) data as input. This encoder is pre-trained by our proposed Multi-Granularity Contrastive Learning to learn representations across different modal and spatial granularities. To further enhance the RSI representations by the geo-context clue, we introduce Geo-Context Prototype Learning to learn region-aware prototypes upon RSI's multi-modal spatiotemporal features. To our best knowledge, SkySense is the largest Multi-Modal RSFM to date, whose modules can be flexibly combined or used individually to accommodate various tasks. It demonstrates remarkable generalization capabilities on a thorough evaluation encompassing 16 datasets over 7 tasks, from single- to multi-modal, static to temporal, and classification to localization. SkySense surpasses 18 recent RSFMs in all test scenarios. Specifically, it outperforms the latest models such as GFM, SatLas and Scale-MAE by a large margin, i.e., 2.76%, 3.67% and 3.61% on average respectively. We will release the pre-trained weights to facilitate future research and Earth Observation applications.
Building Bridges across Spatial and Temporal Resolutions: Reference-Based Super-Resolution via Change Priors and Conditional Diffusion Model
Runmin Dong · Shuai Yuan · Bin Luo · Mengxuan Chen · Jinxiao Zhang · Lixian Zhang · Weijia Li · Juepeng Zheng · Haohuan Fu
Reference-based super-resolution (RefSR) has the potential to build bridges across spatial and temporal resolutions of remote sensing images. However, existing RefSR methods are limited by the faithfulness of content reconstruction and the effectiveness of texture transfer in large scaling factors. Conditional diffusion models have opened up new opportunities for generating realistic high-resolution images, but effectively utilizing reference images within these models remains an area for further exploration. Furthermore, content fidelity is difficult to guarantee in areas without relevant reference information. To solve these issues, we propose a change-aware diffusion model named Ref-Diff for RefSR, using the land cover change priors to guide the denoising process explicitly. Specifically, we inject the priors into the denoising model to improve the utilization of reference information in unchanged areas and regulate the reconstruction of semantically relevant content in changed areas. With this powerful guidance, we decouple the semantics-guided denoising and reference texture-guided denoising processes to improve the model performance. Extensive experiments demonstrate the superior effectiveness and robustness of the proposed method compared with state-of-the-art RefSR methods in both quantitative and qualitative evaluations.
SatSynth: Augmenting Image-Mask Pairs through Diffusion Models for Aerial Semantic Segmentation
Aysim Toker · Marvin Eisenberger · Daniel Cremers · Laura Leal-Taixe
In recent years, semantic segmentation has become a pivotal tool in processing and interpreting satellite imagery. Yet, a prevalent limitation of supervised learning techniques remains the need for extensive manual annotations by experts. In this work, we explore the potential of generative image diffusion to address the scarcity of annotated data in earth observation tasks. The main idea is to learn the joint data manifold of images and labels, leveraging recent advancements in denoising diffusion probabilistic models. To the best of our knowledge, we are the first to generate both images and corresponding masks for satellite segmentation. We find that the obtained pairs not only display high quality in fine-scale features but also ensure a wide sampling diversity. Both aspects are crucial for earth observation data, where semantic classes can vary severely in scale and occurrence frequency. We employ the novel data instances for downstream segmentation, as a form of data augmentation. In our experiments, we provide comparisons to prior works based on discriminative diffusion models or GANs. We demonstrate that integrating generated samples yields significant quantitative improvements for satellite semantic segmentation -- both compared to baselines and when training only on the original data.
S2MAE: A Spatial-Spectral Pretraining Foundation Model for Spectral Remote Sensing Data
Xuyang Li · Danfeng Hong · Jocelyn Chanussot
In the expansive domain of computer vision, a myriad of pre-trained models are at our disposal. However, most of these models are designed for natural RGB images and prove inadequate for spectral remote sensing (RS) images. Spectral RS images have two main traits: (1) multiple bands capturing diverse feature information, (2) spatial alignment and consistent spectral sequencing within the spatial-spectral dimension. In this paper, we introduce Spatial-SpectralMAE (S2MAE), a specialized pre-trained architecture for spectral RS imagery. S2MAE employs a 3D transformer for masked autoencoder modeling, integrating learnable spectral-spatial embeddings with a 90% masking ratio. The model efficiently captures local spectral consistency and spatial invariance using compact cube tokens, demonstrating versatility to diverse input characteristics. This adaptability facilitates progressive pretraining on extensive spectral datasets. The effectiveness of S2MAE is validated through continuous pretraining on two sizable datasets, totaling over a million training images. The pre-trained model is subsequently applied to three distinct downstream tasks, with in-depth ablation studies conducted to emphasize its efficacy.
Poly Kernel Inception Network for Remote Sensing Detection
Xinhao Cai · Qiuxia Lai · Yuwei Wang · Wenguan Wang · Zeren Sun · Yazhou Yao
Object detection in remote sensing images (RSIs) often suffers from several increasing challenges, including the large variation in object scales and the diverse-ranging context. Prior methods tried to address these challenges by expanding the spatial receptive field of the backbone, either through large-kernel convolution or dilated convolution. However, the former typically introduces considerable background noise, while the latter risks generating overly sparse feature representations. In this paper, we introduce the Poly Kernel Inception Network (PKINet) to handle the above challenges. PKINet employs multi-scale convolution kernels without dilation to extract object features of varying scales and capture local context. In addition, a Context Anchor Attention (CAA) module is introduced in parallel to capture long-range contextual information. These two components work jointly to advance the performance of PKINet on four challenging remote sensing object detection benchmarks, namely DOTA-v1.0, DOTA-v1.5, HRSC2016, and DIOR-R.
Learning without Exact Guidance: Updating Large-scale High-resolution Land Cover Maps from Low-resolution Historical Labels
Zhuohong Li · Wei He · Jiepan Li · Fangxiao Lu · Hongyan Zhang
Large-scale high-resolution (HR) land-cover mapping is a vital task to survey the Earth's surface and resolve many challenges facing humanity. However, it is still a non-trivial task hindered by complex ground details, various landforms, and the scarcity of accurate training labels over a wide-span geographic area. In this paper, we propose an efficient, weakly supervised framework (Paraformer) to guide large-scale HR land-cover mapping with easy-access historical land-cover data of low resolution (LR). Specifically, existing land-cover mapping approaches reveal the dominance of CNNs in preserving local ground details but still suffer from insufficient global modeling in various landforms. Therefore, we design a parallel CNN-Transformer feature extractor in Paraformer, consisting of a downsampling-free CNN branch and a Transformer branch, to jointly capture local and global contextual information. Besides, facing the spatial mismatch of training data, a pseudo-label-assisted training (PLAT) module is adopted to reasonably refine LR labels for weakly supervised semantic segmentation of HR images.Experiments on two large-scale datasets demonstrate the superiority of Paraformer over other state-of-the-art methods for automatically updating HR land-cover maps from LR historical labels.
3D Building Reconstruction from Monocular Remote Sensing Images with Multi-level Supervisions
Weijia Li · Haote Yang · Zhenghao Hu · Juepeng Zheng · Gui-Song Xia · Conghui He
3D building reconstruction from monocular remote sensing images is an important and challenging research problem that has received increasing attention in recent years, owing to its low cost of data acquisition and availability for large-scale applications. However, existing methods rely on expensive 3D-annotated samples for fully-supervised training, restricting their application to large-scale cross-city scenarios. In this work, we propose MLS-BRN, a multi-level supervised building reconstruction network that can flexibly utilize training samples with different annotation levels to achieve better reconstruction results in an end-to-end manner. To alleviate the demand on full 3D supervision, we design two new modules, Pseudo Building Bbox Calculator and Roof-Offset guided Footprint Extractor, as well as new tasks and training strategies for different types of samples. Experimental results on several public and new datasets demonstrate that our proposed MLS-BRN achieves competitive performance using much fewer 3D-annotated samples, and significantly improves the footprint extraction and 3D reconstruction performance compared with current state-of-the-art. The code and datasets of this work will be made publicly available.
Content-Adaptive Non-Local Convolution for Remote Sensing Pansharpening
Yule Duan · Xiao Wu · Haoyu Deng · Liang-Jian Deng
Currently, machine learning-based methods for remote sensing pansharpening have progressed rapidly. However, existing pansharpening methods often do not fully exploit differentiating regional information in non-local spaces, thereby limiting the effectiveness of the methods and resulting in redundant learning parameters. In this paper, we introduce a so-called content-adaptive non-local convolution (CANConv), a novel method tailored for remote sensing image pansharpening. Specifically, CANConv employs adaptive convolution, ensuring spatial adaptability, and incorporates non-local self-similarity through the similarity relationship partition (SRP) and the partition-wise adaptive convolution (PWAC) sub-modules. Furthermore, we also propose a corresponding network architecture, called CANNet, which mainly utilizes the multi-scale self-similarity. Extensive experiments demonstrate the superior performance of CANConv, compared with recent promising fusion methods. Besides, we substantiate the method's effectiveness through visualization, ablation experiments, and comparison with existing methods on multiple test sets. The source code is publicly available at https://github.com/duanyll/CANConv.
SG-BEV: Satellite-Guided BEV Fusion for Cross-View Semantic Segmentation
Junyan Ye · Qiyan Luo · Jinhua Yu · Huaping Zhong · Zhimeng Zheng · Conghui He · Weijia Li
This paper aims at achieving fine-grained building attribute segmentation in a cross-view scenario, i.e., using street-view and satellite image pairs. The main challenge lies in overcoming the significant perspective differences between street views and satellite views. In this work, we introduce SG-BEV, a novel approach for satellite-guided BEV fusion for cross-view semantic segmentation. To overcome the limitations of existing cross-view projection methods in capturing the complete building facade features, we innovatively incorporate Bird's Eye View (BEV) method to establish a spatially explicit mapping of street-view features. Moreover, we fully leverage the advantages of multiple perspectives by introducing a novel satellite-guided reprojection module, optimizing the uneven feature distribution issues associated with traditional BEV methods. Our method demonstrates significant improvements on four cross-view datasets collected from multiple cities, including New York, San Francisco, and Boston. On average across these datasets, our method achieves an increase in mIOU by 10.13% and 5.21% compared with the state-of-the-art satellite-based and cross-view methods. The code, models, and data of this work will be released to the public.
DiffCast: A Unified Framework via Residual Diffusion for Precipitation Nowcasting
Demin Yu · Xutao Li · Yunming Ye · Baoquan Zhang · Luo Chuyao · Kuai Dai · wangrui · Chenxunlai
Precipitation nowcasting is an important spatio-temporal prediction task to predict the radar echoes sequences based on current observations, which can serve both meteorological science and smart city applications. Due to the chaotic evolution nature of the precipitation systems, it is a very challenging problem. Previous studies address the problem either from the perspectives of deterministic modeling or probabilistic modeling. However, their predictions suffer from the blurry, high-value echoes fading away and position inaccurate issues. The root reason of these issues is that the chaotic evolutionary precipitation systems are not appropriately modeled. Inspired by the nature of the systems, we propose to decompose and model them from the perspective of global deterministic motion and local stochastic variations with residual mechanism. A unified and flexible framework that can equip any type of spatio-temporal models is proposed based on residual diffusion, which effectively tackles the shortcomings of previous methods. Extensive experimental results on four publicly available radar datasets demonstrate the effectiveness and superiority of the proposed framework, compared to state-of-the-art techniques. Our code will be made publicly available upon acceptance.
MoCha-Stereo: Motif Channel Attention Network for Stereo Matching
Ziyang Chen · Wei Long · He Yao · Yongjun Zhang · Bingshu Wang · Yongbin Qin · Jia Wu
Learning-based stereo matching techniques have made significant progress. However, existing methods inevitably lose geometrical structure information during the feature channel generation process, resulting in edge detail mismatches. In this paper, the Motif Channel Attention Stereo Matching Network (MoCha-Stereo) is designed to address this problem. We provide the Motif Channel Correlation Volume (MCCV) to determine more accurate edge matching costs by projecting motif channels, which capture common geometric structures in feature channels, onto feature maps and cost volumes. In addition, edge variations in the potential feature channels of the reconstruction error map also affect edge texture matching. To further refine the full-resolution disparity details, we propose the Reconstruction Error Motif Penalty (REMP) module, which integrates the frequency information of typical channel features from the reconstruction error. MoCha-Stereo ranks 1st on the KITTI 2015 and KITTI 2012 Reflective leaderboards. The structure of MoCha-Stereo also shows excellent performance in Multi-View Stereo.
PBWR: Parametric-Building-Wireframe Reconstruction from Aerial LiDAR Point Clouds
Shangfeng Huang · Ruisheng Wang · Bo Guo · Hongxin Yang
In this paper, we present an end-to-end 3D building wireframe reconstruction method to regress edges directly from aerial LiDAR point clouds. Our method, named Parametric Building Wireframe Reconstruction (PBWR), takes aerial LiDAR point clouds and initial edge entities as input, and fully uses self-attention mechanism of transformers to regress edge parameters without any intermediate steps such as corner prediction. We propose an edge non-maximum suppression (E-NMS) module based on edge similarityto remove redundant edges. Additionally, a dedicated edge loss function is utilized to guide the PBWR in regressing edges parameters, where simple use of edge distance loss isn't suitable. In our experiments, we demonstrate state-of-the-art results on the Building3D dataset, achieving an improvement of approximately 36\% in entry-level dataset edge accuracy and around 42\% improvement in the Tallinn dataset.
Multi-modal Learning for Geospatial Vegetation Forecasting
Vitus Benson · Claire Robin · Christian Requena-Mesa · LAZARO ALONSO SILVA · Mélanie Weynants · Nora Linscheid · Jose Cortes · Zhihan Gao · Nuno Carvalhais · Markus Reichstein
Precise geospatial vegetation forecasting holds potential across diverse sectors, including agriculture, forestry, humanitarian aid, and carbon accounting. To leverage the vast availability of satellite imagery for this task, various works have applied deep neural networks for predicting multispectral images in photorealistic quality. However, the important area of vegetation dynamics has not been thoroughly explored. Our study introduces GreenEarthNet, the first dataset specifically designed for high-resolution vegetation forecasting, and Contextformer, a novel deep learning approach for predicting vegetation greenness from Sentinel 2 satellite images with fine resolution across Europe. Our multi-modal transformer model Contextformer leverages spatial context through a vision backbone and predicts the temporal dynamics on local context patches incorporating meteorological time series in a parameter-efficient manner. The GreenEarthNet dataset features a learned cloud mask and an appropriate evaluation scheme for vegetation modeling. It also maintains compatibility with the existing satellite imagery forecasting dataset EarthNet2021, enabling cross-dataset model comparisons. Our extensive qualitative and quantitative analyses reveal that our methods outperform a broad range of baseline techniques. This includes surpassing previous state-of-the-art models on EarthNet2021, as well as adapted models from time series forecasting and video prediction. To the best of our knowledge, this work presents the first models for continental-scale vegetation modeling at fine resolution able to capture anomalies beyond the seasonal cycle, thereby paving the way for predicting vegetation health and behaviour in response to climate variability and extremes. We provide open source code and pre-trained weights to reproduce our experimental results under https://github.com/vitusbenson/greenearthnet.
Relational Matching for Weakly Semi-Supervised Oriented Object Detection
Wenhao Wu · Hau San Wong · Si Wu · Tianyou Zhang
Oriented object detection has witnessed significant progress in recent years. However, the impressive performance of oriented object detectors is at the huge cost of labor-intensive annotations, and deteriorates once the annotated data becomes limited. Semi-supervised learning, in which sufficient unannotated data are utilized to enhance the base detector, is a promising method to address the annotation deficiency problem. Motivated by weakly supervised learning, we introduce annotation-efficient point annotations for unannotated images and propose a weakly semi-supervised method for oriented object detection to balance the detection performance and annotation cost. Specifically, we propose a Rotation-Modulated Relational Graph Matching method to match relations of proposals centered on annotated points between different models to alleviate the ambiguity of point annotations in depicting the oriented object. In addition, we further propose a Relational Rank Distribution Matching method to align the rank distribution on classification and regression between different models. Finally, to handle the difficult annotated points that both models are confused about, we introduce weakly supervised learning to impose positive signals for difficult point-induced clusters to the base model, and focus the base model on the occupancy between the predictions and annotated points. We perform extensive experiments on challenging datasets to demonstrate the effectiveness of our proposed weakly semi-supervised method in effectively leveraging unannotated data for significant performance improvement.
Rethinking Transformers Pre-training for Multi-Spectral Satellite Imagery
Mubashir Noman · Muzammal Naseer · Hisham Cholakkal · Rao Anwer · Salman Khan · Fahad Shahbaz Khan
Recent advances in unsupervised learning have demonstrated the ability of large vision models to achieve promising results on downstream tasks by pre-training on large amount of unlabelled data. Such pre-training techniques have also been explored recently in the remote sensing domain due to the availability of large amount of unlabelled data. Different from standard natural image datasets, remote sensing data is acquired from various sensor technologies and exhibit diverse range of scale variations as well as modalities. Existing satellite image pre-training methods either ignore the scale information present in the remote sensing imagery or restrict themselves to use only a single type of data modality. In this paper, we re-visit transformers pre-training and leverage multi-scale information that is effectively utilized with multiple modalities. Our proposed approach, named SatMAE++, performs multi-scale pre-training and utilizes convolution based upsampling blocks to reconstruct the image at higher scales making it extensible to include more scales. Compared to existing works, the proposed SatMAE++ with multi-scale pre-training is equally effective for both optical as well as multi-spectral imagery. Extensive experiments on six datasets reveal the merits of proposed contributions, leading to state-of-the-art performance on all datasets. SatMAE++ achieves mean average precision (mAP) gain of 2.5% for multi-label classification task on BigEarthNet dataset.
Unmixing Diffusion for Self-Supervised Hyperspectral Image Denoising
Haijin Zeng · Jiezhang Cao · Yongyong Chen · Kai Zhang · Hiep Luong · Wilfried Philips
Hyperspectral images (HSIs) have extensive applications in various fields such as medicine, agriculture, and industry. Nevertheless, acquiring high signal-to-noise ratio HSI poses a challenge due to narrow-band spectral filtering. Consequently, the importance of HSI denoising is substantial, especially for snapshot hyperspectral imaging technology. While most previous HSI denoising methods are supervised, creating supervised training datasets for the diverse scenes, hyperspectral cameras, and scan parameters is impractical. In this work, we present Diff-Unmix, a self-supervised denoising method for HSI using diffusion denoising generative models. Specifically, Diff-Unmix addresses the challenge of recovering noise-degraded HSI through a fusion of Spectral Unmixing and conditional abundance generation. Firstly, it employs a learnable block-based spectral unmixing strategy, complemented by a pure transformer-based backbone. Then, we introduce a self-supervised generative diffusion network to enhance abundance maps from the spectral unmixing block. This network reconstructs noise-free Unmixing probability distributions, effectively mitigating noise-induced degradations within these components. Finally, the reconstructed HSI is reconstructed through unmixing reconstruction by blending the diffusion-adjusted abundance map with the spectral endmembers. Experimental results on both simulated and real-world noisy datasets show that Diff-Unmix achieves state-of-the-art performance.
GeoChat: Grounded Large Vision-Language Model for Remote Sensing
Kartik Kuckreja · Muhammad Sohail Danish · Muzammal Naseer · Abhijit Das · Salman Khan · Fahad Shahbaz Khan
Recent advancements in Large Vision-Language Models (VLMs) have shown great promise in natural image domains, allowing users to hold a dialogue about given visual content. However, such general-domain VLMs perform poorly for Remote Sensing (RS) scenarios, leading to inaccurate or fabricated information when presented with RS domain-specific queries. Such a behavior emerges due to the unique challenges introduced by RS imagery. For example, to handle high-resolution RS imagery with diverse scale changes across categories and many small objects, region-level reasoning is necessary alongside holistic scene interpretation. Furthermore, the lack of domain-specific multimodal instruction following data as well as strong backbone models for RS make it hard for the models to align their behavior with user queries. To address these limitations, we propose GeoChat - the first versatile remote sensing VLM that offers multitask conversational capabilities with high-resolution RS images. Specifically, GeoChat can not only answer image-level queries, but also accepts region inputs to hold region-specific dialogue. Furthermore, it can visually ground objects in its responses by referring to their spatial coordinates. To address the lack of domain-specific datasets, we generate a novel RS multimodal instruction-following dataset by extending image-text pairs from existing diverse RS datasets. Leveraging this rich dataset, we fine-tune our remote sensing VLM based on the LLaVA-1.5 architecture. We establish a comprehensive benchmark for RS multitask conversations and compare with a number of baseline methods. GeoChat demonstrates robust zero-shot performance on various remote sensing tasks, e.g., image and region captioning, visual question answering, scene classification, visually grounded conversations and referring object detection. Our codes will be open-sourced.
Parameter Efficient Self-Supervised Geospatial Domain Adaptation
Linus Scheibenreif · Michael Mommert · Damian Borth
As large-scale foundation models become publicly available for different domains, efficiently adapting them to individual downstream applications and additional data modalities has turned into a central challenge.For example, foundation models for geospatial and satellite remote sensing applications are commonly trained on large optical RGB or multi-spectral datasets, although data from a wide variety of heterogeneous sensors are available in the remote sensing domain. This leads to significant discrepancies between pre-training and downstream target data distributions for many important applications. Fine-tuning large foundation models to bridge that gap incurs high computational cost and can be infeasible when target datasets are small.In this paper, we address the question of how large, pre-trained foundational transformer models can be efficiently adapted to downstream remote sensing tasks involving different data modalities or limited dataset size.We present a self-supervised adaptation method that boosts downstream linear evaluation accuracy of different foundation models by $4\text -6$% (absolute) across 8 remote sensing datasets while outperforming full fine-tuning when training only $1{\text -}2$% of the model parameters. Our method significantly improves label efficiency and increases few-shot accuracy by $6{\text -}10$% on different datasets\footnote{Code available at \texttt{anonymized}}.
Bridging Remote Sensors with Multisensor Geospatial Foundation Models
Boran Han · Shuai Zhang · Xingjian Shi · Markus Reichstein
In the realm of geospatial analysis, the diversity of remote sensors, encompassing both optical and microwave technologies, offers a wealth of distinct observational capabilities. Recognizing this, we present msGFM, a multisensor geospatial foundation model that effectively unifies data from four key sensor modalities. This integration spans an expansive dataset of two million multisensor images. msGFM is uniquely adept at handling both paired and unpaired sensor data. For data originating from identical geolocations, our model employs an innovative cross-sensor pretraining approach in masked image modeling, enabling the synthesis of joint representations from diverse sensors. msGFM, incorporating four remote sensors, upholds strong performance, forming a comprehensive model adaptable to various sensor types. msGFM has demonstrated enhanced proficiency in a range of both single-sensor and multisensor downstream tasks. These include scene classification, segmentation, cloud removal, and pan-sharpening. A key discovery of our research is that representations derived from natural images are not always compatible with the distinct characteristics of geospatial remote sensors, underscoring the limitations of existing representations in this field. Our work can serve as a guide for developing multisensor geospatial pretraining models, paving the way for more advanced geospatial capabilities. Code can be found at https://github.com/boranhan/GeospatialFoundationModels
CLIP-Driven Open-Vocabulary 3D Scene Graph Generation via Cross-Modality Contrastive Learning
Lianggangxu Chen · Xuejiao Wang · Jiale Lu · Shaohui Lin · Changbo Wang · Gaoqi He
3D Scene Graph Generation (3DSGG) aims to classify objects and their predicates within 3D point cloud scenes. However, current 3DSGG methods struggle with two main challenges. 1) The dependency on labor-intensive ground-truth annotations. 2) Closed-set classes training hampers the recognition of novel objects and predicates. Addressing these issues, our idea is to extract cross-modality features by CLIP from text and image data naturally related to 3D point clouds. Cross-modality features are used to train a robust 3D scene graph (3DSG) feature extractor. Specifically, we propose a novel Cross-Modality Contrastive Learning 3DSGG (CCL-3DSGG) method. Firstly, to align the text with 3DSG, the text is parsed into word level that are consistent with the 3DSG annotation. To enhance robustness during the alignment, adjectives are exchanged for different objects as negative samples. Then, to align the image with 3DSG, the camera view is treated as a positive sample and other views as negatives. Lastly, the recognition of novel object and predicate classes is achieved by calculating the cosine similarity between prompts and 3DSG features. Our rigorous experiments confirm the superior open-vocabulary capability and applicability of CCL-3DSGG in real-world contexts, both indoors and outdoors.
Learnable Earth Parser: Discovering 3D Prototypes in Aerial Scans
Romain Loiseau · Elliot Vincent · Mathieu Aubry · Loic Landrieu
We propose an unsupervised method for parsing large 3D scans of real-world scenes with easily-interpretable shapes. This work aims to provide a practical tool for analyzing 3D scenes in the context of aerial surveying and mapping, without the need for user annotations. Our approach is based on a probabilistic reconstruction model that decomposes an input 3D point cloud into a small set of learned prototypical 3D shapes. The resulting reconstruction is visually interpretable and can be used to perform unsupervised instance and low-shot semantic segmentation of complex scenes. We demonstrate the usefulness of our model on a novel dataset of seven large aerial LiDAR scans from diverse real-world scenarios. Our approach outperforms state-of-the-art unsupervised methods in terms of decomposition accuracy while remaining visually interpretable. Our code and dataset are available at https://romainloiseau.fr/learnable-earth-parser/.
Semantics Distortion and Style Matter: Towards Source-free UDA for Panoramic Segmentation
Xu Zheng · Pengyuan Zhou · ATHANASIOS · Addison, Lin Wang
This paper addresses an interesting yet challenging problem-- source-free unsupervised domain adaptation (SFUDA) for pinhole-to-panoramic semantic segmentation--given only a pinhole image-trained model (i.e., source) and unlabeled panoramic images (i.e., target). Tackling this problem is nontrivial due to the semantic mismatches, style discrepancies, and inevitable distortion of panoramic images. To this end, we propose a novel method that utilizes Tangent Projection (TP) as it has less distortion and meanwhile slits the equirectangular projection (ERP) with a fixed FoV to mimic the pinhole images. Both projections are shown effective in extracting knowledge from the source model. However, the distinct projection discrepancies between source and target domains impede the direct knowledge transfer; thus, we propose a panoramic prototype adaptation module (PPAM) to integrate panoramic prototypes from the extracted knowledge for adaptation. We then impose the loss constraints on both predictions and prototypes and propose a cross-dual attention module (CDAM) at the feature level to better align the spatial and channel characteristics across the domains and projections. Both knowledge extraction and transfer processes are synchronously updated to reach the best performance. Extensive experiments on the synthetic and real-world benchmarks, including outdoor and indoor scenarios, demonstrate that our method achieves significantly better performance than prior SFUDA methods for pinhole-to-panoramic adaptation.
Geometrically-driven Aggregation for Zero-shot 3D Point Cloud Understanding
Guofeng Mei · Luigi Riz · Yiming Wang · Fabio Poiesi
Zero-shot 3D point cloud understanding can be achieved via 2D Vision-Language Models (VLMs). Existing strategies directly map VLM representations from 2D pixels of rendered or captured views to 3D points, overlooking the inherent and expressible point cloud geometric structure. Geometrically similar or close regions can be exploited for bolstering point cloud understanding as they are likely to share semantic information. To this end, we introduce the first training-free aggregation technique that leverages the point cloud's 3D geometric structure to improve the quality of the transferred VLM representations. Our approach operates iteratively, performing local-to-global aggregation based on geometric and semantic point-level reasoning. We benchmark our approach on three downstream tasks, including classification, part segmentation, and semantic segmentation, with a variety of datasets representing both synthetic/real-world, and indoor/outdoor scenarios. Our approach achieves new state-of-the-art results in all benchmarks.We will release the source code publicly.
SAM-6D: Segment Anything Model Meets Zero-Shot 6D Object Pose Estimation
Jiehong Lin · lihua liu · Dekun Lu · Kui Jia
Zero-shot 6D object pose estimation involves the detection of novel objects with their 6D poses in cluttered scenes, presenting significant challenges for model generalizability. Fortunately, the recent Segment Anything Model (SAM) has showcased remarkable zero-shot transfer performance, which provides a promising solution to tackle this task. Motivated by this, we introduce SAM-6D, a novel framework designed to realize the task through two steps, including instance segmentation and pose estimation. Given the target objects, SAM-6D employs two dedicated sub-networks, namely Instance Segmentation Model (ISM) and Pose Estimation Model (PEM), to perform these steps on cluttered RGB-D images. ISM takes SAM as an advanced starting point to generate all possible object proposals and selectively preserves valid ones through meticulously crafted object matching scores in terms of semantics, appearance and geometry. By treating pose estimation as a partial-to-partial point matching problem, PEM performs a two-stage point matching process featuring a novel design of background tokens to construct dense 3D-3D correspondence, ultimately yielding the pose estimates. Without bells and whistles, SAM-6D outperforms the existing methods on the seven core datasets of the BOP Benchmark for both instance segmentation and pose estimation of novel objects. We will make our codes publicly available.
Construct to Associate: Cooperative Context Learning for Domain Adaptive Point Cloud Segmentation
Guangrui Li
This paper tackles the domain adaptation problem in point cloud semantic segmentation, which performs adaptation from a fully labeled domain (source domain) to an unlabeled target domain. Due to the unordered property of point clouds, LiDAR scans typically show varying geometric structures across different regions, in terms of density, noises, etc, hence leading to increased dynamics on context. However, such characteristics are not consistent across domains due to the difference in sensors, environments, etc, thus hampering the effective scene comprehension across domains. To solve this, we propose Cooperative Context Learning that performs context modeling and modulation from different aspects but in a cooperative manner. Specifically, we first devise context embeddings to discover and model contextual relationships with close neighbors in a learnable manner. Then with the context embeddings from two domains, we introduce a set of learnable prototypes to attend and associate them under the attention paradigm. As a result, these prototypes naturally establish long-range dependency across regions and domains, thereby encouraging the transfer of context knowledge and easing the adaptation. Moreover, the attention in turn attunes and guides the local context modeling and urges them to focus on the domain-invariant context knowledge, thus promoting the adaptation in a cooperative manner. Experiments on representative benchmarks verify that our method attains the new state-of-the-art.
Multi-Task Dense Prediction via Mixture of Low-Rank Experts
Yuqi Yang · Peng-Tao Jiang · Qibin Hou · Hao Zhang · Jinwei Chen · Bo Li
Previous multi-task dense prediction methods based on the Mixture of Experts (MoE) have received great performance but they neglect the importance of explicitly modeling the global relations among all tasks. In this paper, we present a novel decoder-focused method for multi-task dense prediction, called Mixture-of-Low-Rank-Experts (MLoRE). To model the global task relationships, MLoRE adds a generic convolution path to the original MoE structure, where each task feature can go through this path for explicit parameter sharing. Furthermore, to control the parameters and computational cost brought by the increase in the number of experts, we take inspiration from LoRA and propose to leverage the low-rank format of a vanilla convolution in the expert network. Since the low-rank experts have fewer parameters and can be dynamically parameterized into the generic convolution, the parameters and computational cost do not change much with the increase of experts. Benefiting from this design, we increase the number of experts and its reception field to enlarge the representation capacity, facilitating multiple dense tasks learning in a unified network. Extensive experiments on the PASCAL-Context and NYUD-v2 benchmarks show that our MLoRE achieves superior performance compared to previous state-of-the-art methods on all metrics. Our code is available at https://github.com/YuqiYang213/MLoRE.
OED: Towards One-stage End-to-End Dynamic Scene Graph Generation
Guan Wang · Zhimin Li · Qingchao Chen · Yang Liu
Dynamic Scene Graph Generation (DSGG) focuses on identifying visual relationships within the spatial-temporal domain of videos. Conventional approaches often employ multi-stage pipelines, which typically consist of object detection, temporal association, and multi-relation classification. However, these methods exhibit inherent limitations due to the separation of multiple stages, and independent optimization of these sub-problems may yield sub-optimal solutions. To remedy these limitations, we propose a one-stage end-to-end framework, termed OED, which streamlines the DSGG pipeline. This framework reformulates the task as a set prediction problem and leverages pair-wise features to represent each subject-object pair within the scene graph. Moreover, another challenge of DSGG is capturing temporal dependencies, we introduce a Progressively Refined Module (PRM) for aggregating temporal context without the constraints of additional trackers or handcrafted trajectories, enabling end-to-end optimization of the network. Extensive experiments conducted on the Action Genome benchmark demonstrate the effectiveness of our design. The code and models are available at https://github.com/guanw-pku/OED.
OMG-Seg: Is One Model Good Enough For All Segmentation?
Xiangtai Li · Haobo Yuan · Wei Li · Henghui Ding · Size Wu · Wenwei Zhang · Yining Li · Kai Chen · Chen Change Loy
In this work, we address various segmentation tasks, each traditionally tackled by distinct or partially unified models. We propose OMG-Seg, One Model that is Good enough to efficiently and effectively handle all the segmentation tasks, including image semantic, instance, and panoptic segmentation, as well as their video counterparts, open vocabulary settings, prompt-driven, interactive segmentation like SAM, and video object segmentation. To our knowledge, this is the first model to fill all these tasks in one model and achieve good enough performance. We show that OMG-Seg, a transformer-based encoder-decoder architecture with task-specific queries and outputs, can support over ten distinct segmentation tasks and yet significantly reduce computational and parameter overhead across various tasks and datasets. We rigorously evaluate the inter-task influences and correlations during co-training. Code and models are available at \url{https://github.com/lxtGH/OMG-Seg}.
DiffusionMTL: Learning Multi-Task Denoising Diffusion Model from Partially Annotated Data
Hanrong Ye · Dan Xu
Recently, there has been an increased interest in the practical problem of learning multiple dense scene understanding tasks from partially annotated data, where each training sample is only labeled for a subset of the tasks. The missing of task labels in training leads to clearly low-quality and noisy predictions, as can be observed from state-of-the-art methods. To tackle this issue, we reformulate the partially-labeled multi-task dense prediction as a pixel-level denoising problem, and propose a novel multi-task denoising diffusion framework coined as DiffusionMTL. It designs a joint diffusion and denoising paradigm to model a potential noisy distribution in the task prediction or feature maps and generate rectified outputs for different tasks. To exploit multi-task consistency in denoising, we further introduce a Multi-Task Conditioning strategy, which can implicitly utilize the complementary nature of the tasks to help learn the unlabeled tasks, leading to an improvement in the denoising performance of the different tasks. Extensive quantitative and qualitative experiments demonstrate that the proposed multi-task denoising diffusion model can significantly improve multi-task prediction maps, and outperform the state-of-the-art methods on three challenging multi-task benchmarks, under two different partial-labeling evaluation settings. The project will be open-sourced.
Bilateral Adaptation for Human-Object Interaction Detection with Occlusion-Robustness
Guangzhi Wang · Yangyang Guo · Ziwei Xu · Mohan Kankanhalli
Human-Object Interaction (HOI) Detection constitutes an important aspect of human-centric scene understanding, which requires precise object detection and interaction recognition. Despite increasing advancement in detection, recognizing subtle and intricate interactions remains challenging. Recent methods have endeavored to leverage the rich semantic representation from pre-trained CLIP, yet fail to efficiently capture finer-grained spatial features that are highly informative for interaction discrimination. In this work, instead of solely using representations from CLIP, we fill the gap by proposing a spatial adapter that efficiently utilizes the multi-scale spatial information in the pre-trained detector. This leads to a bilateral adaptation that produces complementary features. Moreover, we design a Conditional Contextual Mining module that further mines informative contextual clues from the spatial features via a tailored cross-attention mechanism. To further improve interaction recognition under occlusion, which is common in crowded scenarios, we propose an Occluded Part Extrapolation module that guides the model to recover the spatial details from manually occluded feature maps. Extensive experiments on V-COCO and HICO-DET benchmarks demonstrate that our method significantly outperforms prior art on both traditional and zero-shot settings, resulting in new state-of-the-art performance. Additional ablation studies further validate the effectiveness of each component in our method.
CurveCloudNet: Processing Point Clouds with 1D Structure
Colton Stearns · Alex Fu · Jiateng Liu · Jeong Joon Park · Davis Rempe · Despoina Paschalidou · Leonidas Guibas
Modern depth sensors such as LiDAR operate by sweeping laser-beams across the scene, resulting in a point cloud with notable 1D curve-like structures. In this work, we introduce a new point cloud processing scheme and backbone, called CurveCloudNet, which takes advantage of the curve-like structure inherent to these sensors. While existing backbones discard the rich 1D traversal patterns and rely on generic 3D operations, CurveCloudNet parameterizes the point cloud as a collection of polylines (dubbed a "curve cloud"), establishing a local surface-aware ordering on the points. By reasoning along curves, CurveCloudNet captures lightweight curve-aware priors to efficiently and accurately reason in several diverse 3D environments. We evaluate CurveCloudNet on multiple synthetic and real datasets that exhibit distinct 3D size and structure. We demonstrate that CurveCloudNet outperforms both point-based and sparse-voxel backbones in various segmentation settings, notably scaling to large scenes better than point-based alternatives while exhibiting improved single-object performance over sparse-voxel alternatives. In all, CurveCloudNet is an efficient and accurate backbone that can handle a larger variety of 3D environments than past works.
VCoder: Versatile Vision Encoders for Multimodal Large Language Models
Jitesh Jain · Jianwei Yang · Humphrey Shi
Humans possess the remarkable skill of Visual Perception, the ability to see and understand the seen, helping them make sense of the visual world and, in turn, reason. Multimodal Large Language Models (MLLM) have recently achieved impressive performance on vision-language tasks ranging from visual question-answering and image captioning to visual reasoning and image generation. However, when prompted to identify or count (perceive) the entities in a given image, existing MLLM systems fail. Working towards developing an accurate MLLM system for perception and reasoning, we propose using Versatile vision enCoders (VCoder) as perception eyes for Multimodal LLMs. We feed the VCoder with perception modalities such as segmentation or depth maps, improving the MLLM's perception abilities. Secondly, we leverage the images from COCO and outputs from off-the-shelf vision perception models to create our COCO Segmentation Text (COST) dataset for training and evaluating MLLMs on the object perception task. Thirdly, we introduce metrics to assess the object perception abilities in MLLMs on our COST dataset. Lastly, we provide extensive experimental evidence proving the VCoder's improved object-level perception skills over existing Multimodal LLMs, including GPT-4V. We open-source our dataset, code, and models to promote research.
Amodal Ground Truth and Completion in the Wild
Guanqi Zhan · Chuanxia Zheng · Weidi Xie · Andrew Zisserman
The problem we study in this paper is amodal image segmentation: predicting entire object segmentation masks including both visible and invisible (occluded) parts. In previous work, the amodal segmentation ground truth on real images is usually predicted by manual annotaton and thus is subjective. In contrast, we use 3D data to establish an automatic pipeline to determine authentic ground truth amodal masks for partially occluded objects in real images. This pipeline is used to construct an amodal completion evaluation benchmark, MP3D-Amodal, consisting of a variety of object categories and labels. To better handle the amodal completion task in the wild, we explore two architecture variants: a two-stage model that first infers the occluder, followed by amodal mask completion; and a one-stage model that exploits the representation power of Stable Diffusion for amodal segmentation across many categories. Without bells and whistles, our method achieves a new state-of-the-art performance on Amodal segmentation datasets that cover a large variety of objects, including COCOA and our new MP3D-Amodal dataset. The dataset, model, and code will be publicly released.
Living Scenes: Multi-object Relocalization and Reconstruction in Changing 3D Environments
Liyuan Zhu · Shengyu Huang · Konrad Schindler · Iro Armeni
Research into dynamic 3D scene understanding has primarily focused on short-term change tracking from dense observations, while little attention has been paid to long-term changes with sparse observations. We address this gap with $MoRE^2$, a novel approach designed for multi-object relocalization and reconstruction in evolving environments. We view these environments as ``living scenes" and consider the problem of transforming scans taken at different points in time into a 3D reconstruction of the object instances, whose accuracy and completeness increase over time.At the core of our method lies a SE(3)-equivariant representation in a single encoder-decoder network, trained on synthetic data. This representation enables us to seamlessly tackle instance matching, registration, and reconstruction. We also introduce a joint optimization algorithm that facilitates the accumulation of point clouds originating from the same instance across multiple scans taken at different points in time. We validate our method on synthetic and real-world data and demonstrate state-of-the-art performance in both end-to-end performance and individual subtasks.
Image-based crowd counting widely employs density map regression, which often suffers from severe performance degradation when tested on data from unseen scenarios. To address this so-called "domain shift" problem, we study single domain generalization (SDG) for crowd counting. Though SDG has been extensively explored, the existing approaches are mainly for classification and segmentation. They can hardly be extended to crowd counting due to its nature of density regression and label ambiguity (i.e., ambiguous pixel-level ground truths). We propose MPCount, a novel SDG approach effective even for narrow source distribution. Reconstructing diverse features for density map regression with a single memory bank, MPCount retains only domain-invariant representations using a content error mask and attention consistency loss. It further introduces the patch-wise classification as an auxiliary task to boost the robustness of density prediction with relatively accurate labels. Through extensive experiments on different datasets, MPCount is shown to significantly improve counting accuracy compared to the state-of-the-art approaches under diverse scenarios unobserved in the training data and narrow source distribution.
LTA-PCS: Learnable Task-Agnostic Point Cloud Sampling
Jiaheng Liu · Jianhao Li · Kaisiyuan Wang · Hongcheng Guo · Jian Yang · Junran Peng · Ke Xu · Xianglong Liu · Jinyang Guo
Recently, many approaches directly operate on point clouds for different tasks. These approaches become more compu- tation and storage demanding when point cloud size is large. To reduce the required computation and storage, one possible solution is to sample the point cloud. In this paper, we pro- pose the first Learnable Task-Agnostic Point Cloud Sampling (LTA-PCS) framework. Existing task-agnostic point cloud sampling strategy (e.g., FPS) does not consider semantic in- formation of point clouds, causing degraded performance on downstream tasks. While learning-based point cloud sam- pling methods consider semantic information, they are task- specific and require task-oriented ground-truth annotations. So they cannot generalize well on different downstream tasks. Our LTA-PCS achieves task-agnostic point cloud sampling without requiring task-oriented labels, in which both the ge- ometric and semantic information of points is considered in sampling. Extensive experiments on multiple downstream tasks demonstrate the effectiveness of our LTA-PCS.
Prompt3D: Random Prompt Assisted Weakly-Supervised 3D Object Detection
Xiaohong Zhang · Huisheng Ye · Jingwen Li · Qinyu Tang · Yuanqi Li · Yanwen Guo · Jie Guo
The prohibitive cost of annotations for fully supervised 3D indoor object detection limits its practicality. In this work, we propose Random Prompt Assisted Weakly-supervised 3D Object Detection, termed as Prompt3D, a weakly-supervised approach that leverages position-level labels to overcome this challenge. Explicitly, our method focuses on enhancing labeling using synthetic scenes crafted from 3D shapes generated via random prompts. First, a Synthetic Scene Generation (SSG) module is introduced to assemble synthetic scenes with a curated collection of 3D shapes, created via random prompts for each category. These scenes are enriched with automatically generated point-level annotations, providing a robust supervisory framework for training the detection algorithm. To enhance the transfer of knowledge from virtual to real datasets, we then introduce a Prototypical Proposal Feature Alignment (PPFA) module. This module effectively alleviates the domain gap by directly minimizing the distance between feature prototypes of the same class proposals across two domains. Compared with sota BR, our method improves by 5.4% and 8.7% on mAP with VoteNet and GroupFree3D serving as detectors respectively, demonstrating the effectiveness of our proposed method. Code is available at: https://github.com/huishengye/prompt3d.
No More Ambiguity in 360° Room Layout via Bi-Layout Estimation
Yu-Ju Tsai · Jin-Cheng Jhang · JINGJING ZHENG · Wei Wang · Albert Chen · Min Sun · Cheng-Hao Kuo · Ming-Hsuan Yang
Inherent ambiguity in layout annotations poses significant challenges to developing accurate 360$^\circ$ room layout estimation models. To address this issue, we propose a novel Bi-Layout model capable of predicting two distinct layout types. One stops at ambiguous regions, while the other extends to encompass all visible areas. Our model employs two global context embeddings, where each embedding is designed to capture specific contextual information for each layout type. With our novel feature guidance module, the image feature retrieves relevant context from these embeddings, generating layout-aware features for precise bi-layout predictions. A unique property of our Bi-Layout model is its ability to inherently detect ambiguous regions by comparing the two predictions. To circumvent the need for manual correction of ambiguous annotations during testing, we also introduce a new metric for disambiguating ground truth layouts. Our method demonstrates superior performance on benchmark datasets, notably outperforming leading approaches. Specifically, on the MatterportLayout dataset, it improves 3DIoU from $81.70$% to $82.57$% across the full test set and notably from $54.80$% to $59.97$% in subsets with significant ambiguity.
A novel algorithm, called semantic line combination detector (SLCD), to find an optimal combination of semantic lines is proposed in this paper. It processes all lines in each line combination at once to assess the overall harmony of the lines. First, we generate various line combinations from reliable lines. Second, we estimate the score of each line combination and determine the best one. Experimental results demonstrate that the proposed SLCD outperforms existing semantic line detectors on various datasets. Moreover, it is shown that SLCD can be applied effectively to three vision tasks of vanishing point detection, symmetry axis detection, and composition-based image retrieval. Our codes are available at https://github.com/Jinwon-Ko/SLCD.
From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models
Rongjie Li · Songyang Zhang · Dahua Lin · Kai Chen · Xuming He
Scene graph generation (SGG) aims to parse a visual scene into an intermediate graph representation for downstream reasoning tasks.Despite recent advancements, existing methods struggle to generate scene graphs with novel visual relation concepts.To address this challenge, we introduce a new open-vocabulary SGG framework based on sequence generation.Our framework leverages vision-language pre-trained models (VLM) by incorporating an image-to-graph generation paradigm.Specifically, we generate scene graph sequences via image-to-text generation with VLM and then construct scene graphs from these sequences.By doing so, we harness the strong capabilities of VLM for open-vocabulary SGG and seamlessly integrate explicit relational modeling for enhancing the VL tasks.Experimental results demonstrate that our design not only achieves superior performance with an open vocabulary but also enhances downstream vision-language task performance through explicit relation modeling knowledge.
PanoContext-Former: Panoramic Total Scene Understanding with a Transformer
Yuan Dong · Chuan Fang · Liefeng Bo · Zilong Dong · Ping Tan
Panoramic image enables deeper understanding and more holistic perception of 360-degree surrounding environment, which can naturally encode enriched scene context information compared to standard perspective image. Previous work has made lots of effort to solve the scene understanding task in a hybrid solution based on 2D-3D geometric reasoning, thus each sub-task is processed separately and few correlations are explored in this procedure. In this paper, we propose a fully 3D method for holistic indoor scene understanding which recovers the objects' shapes, oriented bounding boxes and the 3D room layout simultaneously from a single panorama. To maximize the exploration of the rich context information, we design a transformer-based context module to predict the representation and relationship among each component of the scene. In addition, we introduce a new dataset for scene understanding, including photo-realistic panoramas, high-fidelity depth images, accurately annotated room layouts, oriented object bounding boxes and shapes. Experiments on the synthetic and new datasets demonstrate that our method outperforms previous panoramic scene understanding methods in terms of both layout estimation and 3D object detection.
DiffAssemble: A Unified Graph-Diffusion Model for 2D and 3D Reassembly
Gianluca Scarpellini · Stefano Fiorini · Francesco Giuliari · Pietro Morerio · Alessio Del Bue
Reassembly tasks play a fundamental role in many fields and multiple approaches exist to solve specific reassembly problems. In this context, we posit that a general unified model can effectively address them all, irrespective of the input data type (image, 3D, etc.). We introduce DiffAssemble, a Graph Neural Network (GNN)-based architecture that learns to solve reassembly tasks using a diffusion model formulation.Our method treats the elements of a set, whether pieces of 2D patch or 3D object fragments, as nodes of a spatial graph. Training is performed by introducing noise into the position and rotation of the elements and iteratively denoising them to reconstruct the coherent initial pose.DiffAssemble achieves state-of-the-art (SOTA) results in most 2D and 3D reassembly tasks and is the first learning-based approach that solves 2D puzzles for both rotation and translation. Furthermore, we highlight its remarkable reduction in run-time, performing 11 times faster than the quickest optimization-based method for puzzle solving.
ProMotion: Prototypes As Motion Learners
Yawen Lu · Dongfang Liu · Qifan Wang · Cheng Han · Yiming Cui · Zhiwen Cao · Xueling Zhang · Yingjie Victor Chen · Heng Fan
In this work, we introduce ProMotion, a unified prototypical transformer-based framework engineered to jointly model fundamental motion tasks. ProMotion offers a range of compelling attributes that set it apart from current task-specific paradigms. We adopt a prototypical perspective, establishing a unified paradigm that harmonizes disparate motion learning approaches. This novel paradigm streamlines the architectural design, enabling the simultaneous assimilation of diverse motion information. We capitalize on a dual mechanism involving the feature denoiser and the prototypical learner to decipher the intricacies of motion. This approach effectively circumvents the pitfalls of ambiguity in pixel-wise feature matching, significantly bolstering the robustness of motion representation. We demonstrate a profound degree of transferability across distinct motion patterns. This inherent versatility reverberates robustly across a comprehensive spectrum of both 2D and 3D downstream tasks. Empirical results demonstrate that ProMotion outperforms various well-known specialized architectures, achieving 0.54 and 0.054 Abs Rel error on the Sintel and KITTI depth benchmarks, 1.04 and 2.01 average endpoint error on the clean and final pass of Sintel flow benchmark, and 4.30 F1-all error on the KITTI flow benchmark. For its efficacy, we hope our work can catalyze a paradigm shift in universal models in computer vision.
HUNTER: Unsupervised Human-centric 3D Detection via Transferring Knowledge from Synthetic Instances to Real Scenes
Yichen Yao · Zimo Jiang · YUJING SUN · Zhencai Zhu · Xinge Zhu · Runnan Chen · Yuexin Ma
Human-centric 3D scene understanding has recently drawn increasing attention, driven by its critical impact on robotics. However, human-centric real-life scenarios are extremely diverse and complicated, and humans have intricate motions and interactions. With limited labeled data, supervised methods are difficult to generalize to general scenarios, hindering real-life applications. Mimicking human intelligence, we propose an unsupervised 3D detection method for human-centric scenarios by transferring the knowledge from synthetic human instances to real scenes. To bridge the gap between the distinct data representations and feature distributions of synthetic models and real point clouds, we introduce novel modules for effective instance-to-scene representation transfer and synthetic-to-real feature alignment. Remarkably, our method exhibits superior performance compared to current state-of-the-art techniques, achieving 87.8% improvement in mAP and closely approaching the performance of fully supervised methods (62.15 mAP vs. 69.02 mAP) on HuCenLife Dataset.
Rethinking the Up-Sampling Operations in CNN-based Generative Network for Generalizable Deepfake Detection
Chuangchuang Tan · Huan Liu · Yao Zhao · Shikui Wei · Guanghua Gu · Ping Liu · Yunchao Wei
Recently, the proliferation of highly realistic synthetic images, facilitated through a variety of GANs and Diffusions, has significantly heightened the susceptibility to misuse. While the primary focus of deepfake detection has traditionally centered on the design of detection algorithms, an investigative inquiry into the generator architectures has remained conspicuously absent in recent years. This paper contributes to this lacuna by rethinking the architectures of CNN-based generator, thereby establishing a generalized representation of synthetic artifacts. Our findings illuminate that the up-sampling operator can, beyond frequency-based artifacts, produce generalized forgery artifacts. In particular, the local interdependence among image pixels caused by upsampling operators is significantly demonstrated in synthetic images generated by GAN or diffusion. Building upon this observation, we introduce the concept of Neighboring Pixel Relationships(NPR) as a means to capture and characterize the generalized structural artifacts stemming from up-sampling operations. A comprehensive analysis is conducted on an open-world dataset, comprising samples generated by 28 distinct generative models. This analysis culminates in the establishment of a novel state-of-the-art performance, showcasing a remarkable 12.8\% improvement over existing methods. Code will be released.
Shadows Don't Lie and Lines Can't Bend! Generative Models don't know Projective Geometry...for now
Ayush Sarkar · Hanlin Mai · Amitabh Mahapatra · David Forsyth · Svetlana Lazebnik · Anand Bhattad
Generative models can produce impressively realistic images. This paper demonstrates that generated images have geometric features different from those of real images. We build a set of collections of generated images, prequalified to fool simple, signal-based classifiers into believing they are real. We then show that prequalified generated images can be identified reliably by classifiers that only look at geometric properties. We use three such classifiers. All three classifiers are denied access to image pixels, and look only at derived geometric features. The first classifier looks at the perspective field of the image, the second looks at lines detected in the image, and the third looks at relations between detected objects and shadows. Our procedure detects generated images more reliably than SOTA local signal based detectors, for images from a number of distinct generators. Saliency maps suggest that the classifiers can identify geometric problems reliably. We conclude that current generators cannot reliably reproduce geometric properties of real images.
Text Grouping Adapter: Adapting Pre-trained Text Detector for Layout Analysis
Tianci Bi · Xiaoyi Zhang · Zhizheng Zhang · Wenxuan Xie · Cuiling Lan · Yan Lu · Nanning Zheng
Significant progress has been made in scene text detection models since the rise of deep learning, but scene text layout analysis, which aims to group detected text instances as paragraphs, has not kept pace. Previous works either treated text detection and grouping using separate models, or train a model from scratch while using a unified one. All of them have not yet made full use of the already well-trained text detectors and easily obtainable detection datasets. In this paper, we present Text Grouping Adapter (TGA), a module that can enable the utilization of various pre-trained text detectors to learn layout analysis, allowing us to adopt a well-trained text detector right off the shelf or just fine-tune it efficiently. Designed to be compatible with various text detector architectures, TGA takes detected text regions and image features as universal inputs to assemble text instance features. To capture broader contextual information for layout analysis, we propose to predict text group masks from text instance features by one-to-many assignment. Our comprehensive experiments demonstrate that, even with frozen pre-trained models, incorporating our TGA into various pre-trained text detectors and text spotters can achieve superior layout analysis performance, simultaneously inheriting generalized text detection ability from pre-training. In the case of full parameter fine-tuning, we can further improve layout analysis performance.
Groupwise Query Specialization and Quality-Aware Multi-Assignment for Transformer-based Visual Relationship Detection
Jongha Kim · Jihwan Park · Jinyoung Park · Jinyoung Kim · Sehyung Kim · Hyunwoo J. Kim
Visual Relationship Detection (VRD) has seen significant advancements with Transformer-based architecturesrecently. However, we identify two key limitations in a conventional label assignment for training Transformer-based VRD models, which is a process of mapping a ground-truth (GT) to a prediction. Under the conventional assignment, an ‘unspecialized’ query is trained since a query is expected to detect every relation, which makes it difficult for a query to specialize in specific relations. Furthermore, aquery is also insufficiently trained since a GT is assigned only to a single prediction, therefore near-correct or even correct predictions are suppressed by being assigned ‘no relation (∅)’ as a GT. To address these issues, we propose Groupwise Query Specialization and Quality-Aware Multi-Assignment (SpeaQ). Groupwise Query Specialization trains a ‘specialized’ query by dividing queries and relations into disjoint groups and directing a query in a specific query group solely toward relations in the corresponding relation group. Quality-Aware Multi-Assignment further facilitates the training by assigning a GT to multiple predictions that are significantly close to a GT in terms of a subject, an object, and the relation in between. Experimental results and analyses show that SpeaQ effectively trains ‘specialized’ queries, which better utilize the capacity of a model, resulting in consistent performance gains with ‘zero’ additional inference cost across multiple VRD models and benchmarks.
CoralSCOP: Segment any COral Image on this Planet
Zheng Ziqiang · Liang Haixin · Binh-Son Hua · Tim, Yue Him Wong · Put ANG · Apple CHUI · Sai-Kit Yeung
Underwater visual understanding has recently gained increasing attention within the computer vision community for studying and monitoring underwater ecosystems. Among these, coral reefs play an important and intricate role, often referred to as the rainforests of the sea, due to their rich biodiversity and crucial environmental impact. Existing coral analysis, due to its technical complexity, requires significant manual work from coral biologists, therefore hindering scalable and comprehensive studies. In this paper, we introduce CoralSCOP, the first foundation model designed for the automatic dense segmentation of coral reefs. CoralSCOP is developed to accurately assign labels to different coral entities, addressing the challenges in the semantic analysis of coral imagery. Its main objective is to identify and delineate the irregular boundaries between various coral individuals across different granularities, such as coral/non-coral, growth form, and genus. This task is challenging due to the semantic agnostic nature or fixed limited semantic categories of previous generic segmentation methods, which fail to adequately capture the complex characteristics of coral structures. By introducing a novel parallel semantic branch, CoralSCOP can produce high-quality coral masks with semantics that enable a wide range of downstream coral reef analysis tasks. We demonstrate that CoralSCOP exhibits a strong zero-shot ability to segment unseen coral images. To effectively train our foundation model, we propose CoralMask, a new dataset with 41,297 densely labeled coral images and 330,144 coral masks. We have conducted comprehensive and extensive experiments to demonstrate the advantages of CoralSCOP over existing generalist segmentation algorithms and coral reef analytical approaches.
Going Beyond Multi-Task Dense Prediction with Synergy Embedding Models
Huimin Huang · Yawen Huang · Lanfen Lin · Ruofeng Tong · Yen-Wei Chen · Hao Zheng · Yuexiang Li · Yefeng Zheng
Multi-task visual scene understanding aims to leverage the relationships among a set of correlated tasks, which are solved simultaneously by embedding them within a unified network. However, most existing methods give rise to two primary concerns from a task-level perspective: (1) the lack of task-independent correspondences for distinct tasks, and (2) the neglect of explicit task-consensual dependencies among various tasks. To address these issues, we propose a novel synergy embedding models (SEM), which goes beyond multi-task dense prediction by leveraging two innovative designs: the intra-task hierarchy-adaptive module and the inter-task EM-interactive module. Specifically, the constructed intra-task module incorporates hierarchy-adaptive keys from multiple stages, enabling the efficient learning of specialized visual patterns with an optimal trade-off. In addition, the developed inter-task module learns interactions from a compact set of mutual bases among various tasks, benefiting from the expectation maximization (EM) algorithm. Extensive empirical evidence from two public benchmarks, NYUD-v2 and PASCAL-Context, demonstrates that SEM consistently outperforms state-of-the-art approaches across a range of metrics.
Disentangled Pre-training for Human-Object Interaction Detection
Zhuolong Li · Xingao Li · Changxing Ding · Xiangmin Xu
Detecting human-object interaction (HOI) has long been limited by the amount of supervised data available. Recent approaches address this issue by pre-training according to pseudo-labels, which align object regions with HOI triplets parsed from image captions. However, pseudo-labeling is tricky and noisy, making HOI pre-training a complex process. Therefore, we propose an efficient disentangled pre-training method for HOI detection (DP-HOI) to address this problem. First, DP-HOI utilizes object detection and action recognition datasets to pre-train the detection and interaction decoder layers, respectively. Then, we arrange these decoder layers so that the pre-training architecture is consistent with the downstream HOI detection task. This facilitates efficient knowledge transfer. Specifically, the detection decoder identifies reliable human instances in each action recognition dataset image, generates one corresponding query, and feeds it into the interaction decoder for verb classification. Next, we combine the human instance verb predictions in the same image and impose image-level supervision. The DP-HOI structure can be easily adapted to the HOI detection task, enabling effective model parameter initialization. Therefore, it significantly enhances the performance of existing HOI detection models on a broad range of rare categories. The code and pre-trained weight are available at https://github.com/xingaoli/DP-HOI.
Osprey: Pixel Understanding with Visual Instruction Tuning
Yuqian Yuan · Wentong Li · Jian liu · Dongqi Tang · Xinjie Luo · Chi Qin · Lei Zhang · Jianke Zhu
Multimodal large language models (MLLMs) have recently achieved impressive general-purpose vision-language capabilities through visual instruction tuning. However, current MLLMs primarily focus on image-level or box-level understanding, falling short in achieving fine-grained vision-language alignment at pixel level. Besides, the lack of mask-based instruction data limits their advancements. In this paper, we propose Osprey, a mask-text instruction tuning approach, to extend MLLMs by incorporating fine-grained mask regions into language instruction, aiming at achieving pixel-wise visual understanding. To achieve this goal, we first meticulously curate a mask-based region-text dataset with 724K samples, and then design a vision-language model by injecting pixel-level representation into LLM. Specifically, Osprey adopts a convolutional CLIP backbone as the vision encoder and employs a mask-aware visual extractor to extract precise visual mask features from high resolution input. Experimental results demonstrate Osprey's superiority in various region understanding tasks, showcasing its new capability for pixel-level instruction tuning. In particular, Osprey can be integrated with Segment Anything Model (SAM) seamlessly to obtain multi-granularity semantics. The source code, dataset and demo can be found at https://github.com/CircleRadon/Osprey.
Discovering Syntactic Interaction Clues for Human-Object Interaction Detection
Jinguo Luo · Weihong Ren · Weibo Jiang · Xi'ai Chen · Qiang Wang · Zhi Han · Honghai LIU
Recently, Vision-Language Model (VLM) has greatly advanced the Human-Object Interaction (HOI) detection. The existing VLM-based HOI detectors typically adopt a hand-crafted template (e.g., a photo of a person [action] a$/$an [object]) to acquire text knowledge through the VLM text encoder. However, such approaches, only encoding the action-specific text prompts in vocabulary level, may suffer from learning ambiguity without exploring the fine-grained clues from the perspective of interaction context. In this paper, we propose a novel method to discover Syntactic Interaction Clues for HOI detection (SICHOI) by using VLM. Specifically, we first investigate what are the essential elements for an interaction context, and then establish a syntactic interaction bank from three levels: spatial relationship, action-oriented posture and situational condition. Further, to align visual features with the syntactic interaction bank, we adopt a multi-view extractor to jointly aggregate visual features from instance, interaction, and image levels accordingly. In addition, we also introduce a dual cross-attention decoder to perform context propagation between text knowledge and visual features, thereby enhancing the HOI detection. Experimental results demonstrate that our proposed method achieves state-of-the-art performance on HICO-DET and V-COCO.
Flattening the Parent Bias: Hierarchical Semantic Segmentation in the Poincaré Ball
Simon Weber · Barış Zöngür · Nikita Araslanov · Daniel Cremers
Hierarchy is a natural representation of semantic taxonomies, including the ones routinely used in image segmentation. Indeed, recent work on semantic segmentation reports improved accuracy from supervised training leveraging hierarchical label structures. Encouraged by these results, we revisit the fundamental assumptions behind that work. We postulate and then empirically verify that the reasons for the observed improvement in segmentation accuracy may be entirely unrelated to the use of the semantic hierarchy. To demonstrate this, we design a range of cross-domain experiments with a representative hierarchical approach. We find that on the new testing domains, a flat (non-hierarchical) segmentation network, in which the parents are inferred from the children, has superior segmentation accuracy to the hierarchical approach across the board. Complementing these findings and inspired by the intrinsic properties of hyperbolic spaces, we study a more principled approach to hierarchical segmentation using the Poincaré ball model. The hyperbolic representation largely outperforms the previous (Euclidean) hierarchical approach as well and is on par with our flat Euclidean baseline in terms of segmentation accuracy. However, it additionally exhibits surprisingly strong calibration quality of the parent nodes in the semantic hierarchy, especially on the more challenging domains. Our combined analysis suggests that the established practice of hierarchical segmentation may be limited to in-domain settings, whereas flat classifiers generalize substantially better, especially if they are modeled in the hyperbolic space.
HiKER-SGG: Hierarchical Knowledge Enhanced Robust Scene Graph Generation
Ce Zhang · Simon Stepputtis · Joseph Campbell · Katia Sycara · Yaqi Xie
Being able to understand visual scenes is a precursor for many downstream tasks, including autonomous driving, robotics, and other vision-based approaches. A common approach enabling the ability to reason over visual data is Scene Graph Generation (SGG); however, many existing approaches assume undisturbed vision, i.e., the absence of real-world corruptions such as fog, snow, smoke, as well as non-uniform perturbations like sun glare or water drops. In this work, we propose a novel SGG benchmark containing procedurally generated weather corruptions and other transformations over the Visual Genome dataset. Further, we introduce a corresponding approach, Hierarchical Knowledge Enhanced Robust Scene Graph Generation (HiKER-SGG), providing a strong baseline for scene graph generation under such challenging setting. At its core, HiKER-SGG utilizes a hierarchical knowledge graph in order to refine its predictions from coarse initial estimates to detailed predictions. In our extensive experiments, we show that HiKER-SGG does not only demonstrate superior performance on corrupted images in a zero-shot manner, but also outperforms current state-of-the-art methods on uncorrupted SGG tasks. Code is available at https://github.com/zhangce01/HiKER-SGG.
Hierarchical Intra-modal Correlation Learning for Label-free 3D Semantic Segmentation
Xin Kang · Lei Chu · Jiahao Li · Xuejin Chen · Yan Lu
Recent methods for label-free 3D semantic segmentation aim to assist 3D model training by leveraging the open-world recognition ability of pre-trained vision language models. However, these methods usually suffer from inconsistent and noisy pseudo-labels provided by the vision language models. To address this issue, we present a hierarchical intra-modal correlation learning framework that captures visual and geometric correlations in 3D scenes at three levels: intra-set, intra-scene, and inter-scene, to help learn more compact 3D representations. We refine pseudo-labels using intra-set correlations within each geometric consistency set and align features of visually and geometrically similar points using intra-scene and inter-scene correlation learning. We also introduce a feedback mechanism to distill the correlation learning capability into the 3D model. Experiments on both indoor and outdoor datasets show the superiority of our method. We achieve a state-of-the-art 36.6% mIoU on the ScanNet dataset, and a 23.0% mIoU on the nuScenes dataset, with improvements of 7.8% mIoU and 2.2% mIoU compared with previous SOTA. We also provide theoretical analysis and qualitative visualization results to discuss the mechanism and conduct thorough ablation studies to support the effectiveness of our framework.
FreePoint: Unsupervised Point Cloud Instance Segmentation
Zhikai Zhang · Jian Ding · Li Jiang · Dengxin Dai · Gui-Song Xia
Instance segmentation of point clouds is a crucial task in 3D field with numerous applications that involve localizing and segmenting objects in a scene. However, achieving satisfactory results requires a large number of manual annotations, which is time-consuming and expensive. To alleviate dependency on annotations, we propose a novel framework, FreePoint, for underexplored unsupervised class-agnostic instance segmentation on point clouds. In detail, we represent the point features by combining coordinates, colors, and self-supervised deep features. Based on the point features, we perform a bottom-up multicut algorithm to segment point clouds into coarse instance masks as pseudo labels, which are used to train a point cloud instance segmentation model. We propose an id-as-feature strategy at this stage to alleviate the randomness of the multicut algorithm and improve the pseudo labels’ quality. During training, we propose a weakly-supervised two-step training strategy and corresponding losses to overcome the inaccuracy of coarse masks. FreePoint has achieved breakthroughs in unsupervised class-agnostic instance segmentation on point clouds and outperformed previous traditional methods by over 18.2% and a competitive concurrent work UnScene3D by 5.5% in AP. Additionally, when used as a pretext task and fine-tuned on S3DIS, FreePoint performs significantly better than existing self-supervised pre-training methods with limited annotations and surpasses CSC by 6.0% in AP with 10% annotation masks. Code will be released at https://github.com/zzk273/FreePoint.
GoodSAM: Bridging Domain and Capacity Gaps via Segment Anything Model for Distortion-aware Panoramic Semantic Segmentation
WEIMING ZHANG · Yexin Liu · Xu Zheng · Addison, Lin Wang
This paper tackles a novel yet challenging problem: how to transfer knowledge from the emerging Segment Anything Model (SAM) $-$ which reveals impressive zero-shot instance segmentation capacity $-$ to learn a compact panoramic semantic segmentation model, i.e., student, without requiring any labeled data. This poses considerable challenges due to SAM's inability to provide semantic labels and the large capacity gap between SAM and the student.To this end, we propose a novel framework, called $\textbf{GoodSAM}$, that introduces a teacher assistant (TA) to provide semantic information, integrated with SAM to generate ensemble logits to achieve knowledge transfer.Specifically, we propose a Distortion-Aware Rectification (DAR) module that first addresses the distortion problem of panoramic images by imposing prediction-level consistency and boundary enhancement.This subtly enhances TA's prediction capacity on panoramic images. DAR then incorporates a cross-task complementary fusion block to adaptively merge the predictions of SAM and TA to obtain more reliable ensemble logits.Moreover, we introduce a Multi-level Knowledge Adaptation (MKA) module to efficiently transfer the multi-level feature knowledge from TA and ensemble logits to learn a compact student model.Extensive experiments on two benchmarks show that our GoodSAM achieves a remarkable $\textbf{+3.75}$% mIoU improvement over the state-of-the-art (SOTA) domain adaptation methods, e.g., [41] . Also, Our most lightweight model achieves comparable performance to the SoTA methods with only $\textbf{3.7}$M parameters.
MaskClustering: View Consensus based Mask Graph Clustering for Open-Vocabulary 3D Instance Segmentation
Mi Yan · Jiazhao Zhang · Yan Zhu · He Wang
Open-vocabulary 3D instance segmentation is cutting-edge for its ability to segment 3D instances without predefined categories. However, progress in 3D lags behind its 2D counterpart due to limited annotated 3D data. To address this, recent works first generate 2D open-vocabulary masks through 2D models and then merge them into 3D instances based on metrics calculated between two neighboring frames. In contrast to these local metrics, we propose a novel metric, view consensus rate, to enhance the utilization of multi-view observations. The key insight is that two 2D masks should be deemed part of the same 3D instance if a significant number of other 2D masks from different views contain both these two masks. Using this metric as edge weight, we construct a global mask graph where each mask is a node. Through iterative clustering of masks showing high view consensus, we generate a series of clusters, each representing a distinct 3D instance. Notably, our model is training-free. Through extensive experiments on publicly available datasets, including ScanNet++, ScanNet200 and MatterPort3D, we demonstrate that our method achieves state-of-the-art performance in open-vocabulary 3D instance segmentation. Our project page is at \href{https://pku-epic.github.io/MaskClustering/}{https://pku-epic.github.io/MaskClustering}.
ECoDepth: Effective Conditioning of Diffusion Models for Monocular Depth Estimation
Suraj Patni · Aradhye Agarwal · Chetan Arora
In the absence of parallax cues, a learning-based single image depth estimation (SIDE) model relies heavily on shading and contextual cues in the image. While this simplicity is attractive, it is necessary to train such models on large and varied datasets, which are difficult to capture. It has been shown that using embeddings from pre-trained foundational models, such as CLIP, improves zero shot transfer in several applications. Taking inspiration from this, in our paper we explore the use of global image priors generated from a pre-trained ViT model to provide more detailed contextual information. We argue that the embedding vector from a ViT model, pre-trained on a large dataset, captures greater relevant information for SIDE than the usual route of generating pseudo image captions, followed by CLIP based text embeddings. Based on this idea, we propose a new SIDE model using a diffusion backbone which is conditioned on ViT embeddings. Our proposed design establishes a new state-of-the-art (SOTA) for SIDE on NYUv2 dataset, achieving Abs Rel error of 0.059(14% improvement) compared to 0.069 by the current SOTA (VPD). And on KITTI dataset, achieving Sq Rel error of 0.139 (2% improvement) compared to 0.142 by the current SOTA (GEDepth). For zero-shot transfer with a model trained on NYUv2, we report mean relative improvement of (20%, 23%, 81%, 25%) over NeWCRFs on (Sun-RGBD, iBims1, DIODE, HyperSim) datasets, compared to (16%, 18%, 45%, 9%) by ZoeDepth. The project page is available at https://ecodepth-iitd.github.io.
Physical Property Understanding from Language-Embedded Feature Fields
Albert J. Zhai · Yuan Shen · Emily Y. Chen · Gloria Wang · Xinlei Wang · Sheng Wang · Kaiyu Guan · Shenlong Wang
Can computers perceive the physical properties of objects solely through vision? Research in cognitive science and vision science has shown that humans excel at identifying materials and estimating their physical properties based purely on visual appearance. In this paper, we present a novel approach for dense prediction of the physical properties of objects using a collection of images. Inspired by how humans reason about physics through vision, we leverage large language models to propose candidate materials for each object. We then construct a language-embedded point cloud and estimate the physical properties of each 3D point using a zero-shot kernel regression approach. Our method is accurate, annotation-free, and applicable to any object in the open world. Experiments demonstrate the effectiveness of the proposed approach in various physical property reasoning tasks, such as estimating the mass of common objects, as well as other properties like friction and hardness.
LLM4SGG: Large Language Models for Weakly Supervised Scene Graph Generation
Kibum Kim · Kanghoon Yoon · Jaehyeong Jeon · Yeonjun In · Jinyoung Moon · Donghyun Kim · Chanyoung Park
Weakly-Supervised Scene Graph Generation (WSSGG) research has recently emerged as an alternative to the fully-supervised approach that heavily relies on costly annotations. In this regard, studies on WSSGG have utilized image captions to obtain unlocalized triplets while primarily focusing on grounding the unlocalized triplets over image regions. However, they have overlooked the two issues involved in the triplet formation process from the captions: 1) Semantic over-simplification issue arises when extracting triplets from captions, where fine-grained predicates in captions are undesirably converted into coarse-grained predicates, resulting in a long-tailed predicate distribution, and 2) Low-density scene graph issue arises when aligning the triplets in the caption with entity/predicate classes of interest, where many triplets are discarded and not used in training, leading to insufficient supervision. To tackle the two issues, we propose a new approach, i.e., Large Language Model for weakly-supervised SGG (LLM4SGG), where we mitigate the two issues by leveraging the LLM's in-depth understanding of language and reasoning ability during the extraction of triplets from captions and alignment of entity/predicate classes with target data. To further engage the LLM in these processes, we adopt the idea of Chain-of-Thought and the in-context few-shot learning strategy. To validate the effectiveness of LLM4SGG, we conduct extensive experiments on Visual Genome and GQA datasets, showing significant improvements in both Recall@K and mean Recall@K compared to the state-of-the-art WSSGG methods. A further appeal is that LLM4SGG is data-efficient, enabling effective model training with a small amount of training images.
DSGG: Dense Relation Transformer for an End-to-end Scene Graph Generation
Zeeshan Hayder · Xuming He
Scene graph generation aims to capture detailed spatial and semantic relationships between objects in an image, which is challenging due to incomplete labeling, long-tailed relationship categories, and relational semantic overlap. Existing Transformer-based methods either employ distinct queries for objects and predicates or utilize holistic queries for relation triplets and hence often suffer from limited capacity in learning low-frequency relationships. In this paper, we present a new Transformer-based method, called DSGG, that views scene graph detection as a direct graph prediction problem based on a unique set of graph-aware queries. In particular, each graph-aware query encodes a compact representation of both the node and all of its relations in the graph, acquired through the utilization of a relaxed sub-graph matching during the training process. Moreover, to address the problem of relational semantic overlap, we utilize a strategy for relation distillation, aiming to efficiently learn multiple instances of semantic relationships. Extensive experiments on the VG and the PSG datasets show that our model achieves state-of-the-art results, showing a significant improvement of 3.5\% and 6.7\% in mR@50 and mR@100 for the scene-graph generation task and achieves an even more substantial improvement of 8.5\% and 10.3\% in mR@50 and mR@100 for the panoptic scene graph generation task. Code is available at https://github.com/zeeshanhayder/DSGG
OTE: Exploring Accurate Scene Text Recognition Using One Token
Jianjun Xu · Yuxin Wang · Hongtao Xie · Yongdong Zhang
In this paper, we propose a novel framework to fully exploit the potential of a single vector for scene text recognition (STR). Different from previous sequence-to-sequence methods that rely on a sequence of visual tokens to represent scene text images, we prove that just one token is enough to characterize the entire text image and achieve accurate text recognition. Based on this insight, we introduce a new paradigm for STR, called One Token Ecognizer (OTE). Specifically, we implement an image-to-vector encoder to extract the fine-grained global semantics, eliminating the need for sequential features. Furthermore, an elegant yet potent vector-to-sequence decoder is designed to adaptively diffuse global semantics to corresponding character locations, enabling both autoregressive and non-autoregressive decoding schemes. By executing decoding within a high-level representational space, our vector-to-sequence (V2S) approach avoids the alignment issues between visual tokens and character embeddings prevalent in traditional sequence-to-sequence methods. Remarkably, due to introducing character-wise fine-grained information, such global tokens also boost the performance of scene text retrieval tasks. Extensive experiments on synthetic and real datasets demonstrate the effectiveness of our method by achieving new state-of-the-art results on various public STR benchmarks. Code will be available.
SemCity: Semantic Scene Generation with Triplane Diffusion
Jumin Lee · Sebin Lee · Changho Jo · Woobin Im · Ju-hyeong Seon · Sung-Eui Yoon
We present "SemCity," a 3D diffusion model for semantic scene generation in real-world outdoor environments. Most 3D diffusion models focus on generating a single object, synthetic indoor scenes, or synthetic outdoor scenes, while the generation of real-world outdoor scenes is rarely addressed. In this paper, we concentrate on generating a real-outdoor scene through learning a diffusion model on a real-world outdoor dataset. In contrast to synthetic data, real-outdoor datasets often contain more empty spaces due to sensor limitations, causing challenges in learning real-outdoor distributions. To address this issue, we exploit a triplane representation as a proxy form of scene distributions to be learned by our diffusion model. Furthermore, we propose a triplane manipulation that integrates seamlessly with our triplane diffusion model. The manipulation improves our diffusion model's applicability in a variety of downstream tasks related to outdoor scene generation such as scene inpainting, scene outpainting, and semantic scene completion refinements. In experimental results, we demonstrate that our triplane diffusion model shows meaningful generation results compared with existing work in a real-outdoor dataset, SemanticKITTI. We also show our triplane manipulation facilitates seamlessly adding, removing, or modifying objects within a scene. Further, it also enables the expansion of scenes toward a city-level scale. Finally, we evaluate our method on semantic scene completion refinements where our diffusion model enhances predictions of semantic scene completion networks by learning scene distribution. Our code is available at https://github.com/zoomin-lee/SemCity.
Advancing Saliency Ranking with Human Fixations: Dataset Models and Benchmarks
Bowen Deng · Siyang Song · Andrew French · Denis Schluppeck · Michael Pound
Saliency ranking detection (SRD) has emerged as a challenging task in computer vision, aiming not only to identify salient objects within images but also to rank them based on their degree of saliency. Existing SRD datasets have been created primarily using mouse-trajectory data, which inadequately captures the intricacies of human visual perception. Addressing this gap, this paper introduces the first large-scale SRD dataset, SIFR, constructed using genuine human fixation data, thereby aligning more closely with real visual perceptual processes. To establish a baseline for this dataset, we propose QAGNet, a novel model that leverages salient instance query features from a transformer detector within a tri-tiered nested graph. Through extensive experiments, we demonstrate that our approach outperforms existing state-of-the-art methods across two widely used SRD datasets and our newly proposed dataset. Code and dataset are available at https://github.com/EricDengbowen/QAGNet.
Choose What You Need: Disentangled Representation Learning for Scene Text Recognition Removal and Editing
Boqiang Zhang · Hongtao Xie · Zuan Gao · Yuxin Wang
Scene text images contain not only style information (font, background) but also content information (character, texture). Different scene text tasks need different information, but previous representation learning methods use tightly coupled features for all tasks, resulting in sub-optimal performance. We propose a Disentangled Representation Learning framework (DARLING) aimed at disentangling these two types of features for improved adaptability in better addressing various downstream tasks (choose what you really need). Specifically, we synthesize a dataset of image pairs with identical style but different content. Based on the dataset, we decouple the two types of features by the supervision design. Clearly, we directly split the visual representation into style and content features, the content features are supervised by a text recognition loss, while an alignment loss aligns the style features in the image pairs. Then, style features are employed in reconstructing the counterpart image via an image decoder with a prompt that indicates the counterpart's content. Such an operation effectively decouples the features based on their distinctive properties. To the best of our knowledge, this is the first time in the field of scene text that disentangles the inherent properties of the text images. Our method achieves state-of-the-art performance in Scene Text Recognition, Removal, and Editing. Code and the dataset will be available.
Leveraging Predicate and Triplet Learning for Scene Graph Generation
Jiankai Li · Yunhong Wang · Xiefan Guo · Ruijie Yang · Weixin Li
Scene Graph Generation (SGG) aims to identify entities and predict the relationship triplets
Regressor-Segmenter Mutual Prompt Learning for Crowd Counting
Mingyue Guo · Li Yuan · Zhaoyi Yan · Binghui Chen · Yaowei Wang · Qixiang Ye
Crowd counting has achieved significant progress by training regressors to predict head positions. In heavily crowded scenarios, however, regressors are challenged by uncontrollable annotation variance, which causes density map bias and context information inaccuracy. In this study, we propose mutual prompt learning (mPrompt), which leverages a regressor and a segmenter as guidance for each other, alleviating the bias and inaccuracy caused by annotation variance while distinguishing foreground from background. In specific, mPrompt leverages point annotations to tune the segmenter and predict pseudo head masks in a way of point prompt learning. It then uses the predicted segmentation masks, which serve as spatial constraint, to rectify biased point annotations as context prompt learning. From a perspective of mutual information maximization, mPrompt mitigates the impact of annotation variance while improving the model accuracy. Experiments show that mPrompt respectively reduces the Mean Average Error (MAE) significantly on four popular datasets, demonstrating the superiority of mutual prompt learning.Code is enclosed in the supplementary material.
Learning from Observer Gaze: Zero-Shot Attention Prediction Oriented by Human-Object Interaction Recognition
Yuchen Zhou · Linkai Liu · Chao Gou
Most existing attention prediction research focuses on salient instances like humans and objects. However, the more complex interaction-oriented attention, arising from the comprehension of interactions between instances by human observers, remains largely unexplored. This is equally crucial for advancing human-machine interaction and human-centered artificial intelligence. To bridge this gap, we first collect a novel gaze fixation dataset named IG, comprising 530,000 fixation points across 740 diverse interaction categories, capturing visual attention during human observers’ cognitive processes of interactions. Subsequently, we introduce the zero-shot interaction-oriented attention prediction task (ZeroIA), which challenges models to predict visual cues for interactions not encountered during training. Thirdly, we present the Interactive Attention model (IA), designed to emulate human observers’ cognitive processes to tackle the ZeroIA problem. Extensive experiments demonstrate that the proposed IA outperforms other state-of-the-art approaches in both ZeroIA and fully supervised settings. Lastly, we endeavor to apply interaction-oriented attention to the interaction recognition task itself. Further experimental results demonstrate the promising potential to enhance the performance and interpretability of existing state-of-the-art HOI models by incorporating real human attention data from IG and attention labels generated by IA.
EGTR: Extracting Graph from Transformer for Scene Graph Generation
Jinbae Im · JeongYeon Nam · Nokyung Park · Hyungmin Lee · Seunghyun Park
Scene Graph Generation (SGG) is a challenging task of detecting objects and predicting relationships between objects. After DETR was developed, one-stage SGG models based on a one-stage object detector have been actively studied. However, complex modeling is used to predict the relationship between objects, and the inherent relationship between object queries learned in the multi-head self-attention of the object detector has been neglected. We propose a lightweight one-stage SGG model that extracts the relation graph from the various relationships learned in the multi-head self-attention layers of the DETR decoder. By fully utilizing the self-attention by-products, the relation graph can be extracted effectively with a shallow relation extraction head. Considering the dependency of the relation extraction task on the object detection task, we propose a novel relation smoothing technique that adjusts the relation label adaptively according to the quality of the detected objects. By the relation smoothing, the model is trained according to the continuous curriculum that focuses on object detection task at the beginning of training and performs multi-task learning as the object detection performance gradually improves. Furthermore, we propose a connectivity prediction task that predicts whether a relation exists between object pairs as an auxiliary task of the relation extraction. We demonstrate the effectiveness and efficiency of our method for the Visual Genome and Open Image V6 datasets. Our code is publicly available at https://github.com/naver-ai/egtr.
SG-PGM: Partial Graph Matching Network with Semantic Geometric Fusion for 3D Scene Graph Alignment and Its Downstream Tasks
Yaxu Xie · Alain Pagani · Didier Stricker
Scene graphs have been recently introduced into 3D spatial understanding as a comprehensive representation of the scene.The alignment between 3D scene graphs is the first step of many downstream tasks such as scene graph aided point cloud registration, mosaicking, overlap checking, and robot navigation. In this work, we treat 3D scene graph alignment as a partial graph-matching problem and propose to solve it with a graph neural network. We reuse the geometric features learned by a point cloud registration method and associate the clustered point-level geometric features with the node-level semantic feature via our designed feature fusion module. Partial matching is enabled by using a learnable method to select the top-k similar node pairs. Subsequent downstream tasks such as point cloud registration are achieved by running a pre-trained registration network within the matched regions.We further propose a point-matching rescoring method, that uses the node-wise alignment of the 3D scene graph to reweight the matching candidates from a pre-trained point cloud registration method. It reduces the false point correspondences estimated especially in low-overlapping cases.Experiments show that our method improves the alignment accuracy by 10$\sim$20\% in low-overlap and random transformation scenarios and outperforms the existing work in multiple downstream tasks. Our code and models are available here (https://github.com/dfki-av/sg-pgm.git).
Open-Vocabulary Semantic Segmentation with Image Embedding Balancing
Xiangheng Shan · Dongyue Wu · Guilin Zhu · Yuanjie Shao · Nong Sang · Changxin Gao
Open-vocabulary semantic segmentation is a challenging task, which requires the model to output semantic masks of an image beyond a close-set vocabulary. Although many efforts have been made to utilize powerful CLIP models to accomplish this task, they are still easily overfitting to training classes due to the natural gaps in semantic information between training and new classes. To overcome this challenge, we propose a novel framework for open-vocabulary semantic segmentation called EBSeg, incorporating an Adaptively Balanced Decoder (AdaB Decoder) and a Semantic Structure Consistency loss (SSC Loss). The AdaB Decoder is designed to generate different image embeddings for both training and new classes. Subsequently, these two types of embeddings are adaptively balanced to fully exploit their ability to recognize training classes and generalization ability for new classes. To learn a consistent semantic structure from CLIP, the SSC Loss aligns the inter-classes affinity in the image feature space with that in the text feature space of CLIP, thereby improving the generalization ability of our model. Furthermore, we employ a frozen SAM image encoder to complement the spatial information that CLIP features lack due to the low training image resolution and image-level supervision inherent in CLIP. Extensive experiments conducted across various benchmarks demonstrate that the proposed EBSeg outperforms the state-of-the-art methods. Our code and trained models will be here: https://github.com/slonetime/EBSeg.
Bridging the Synthetic-to-Authentic Gap: Distortion-Guided Unsupervised Domain Adaptation for Blind Image Quality Assessment
Aobo Li · Jinjian Wu · Yongxu Liu · Leida Li
The annotation of blind image quality assessment (BIQA) is labor-intensive and time-consuming, especially for authentic images. Training on synthetic data is expected to be beneficial, but synthetically trained models often suffer from poor generalization in real domains due to domain gaps. In this work, we make a key observation that introducing more distortion types in the synthetic dataset may not improve or even be harmful to generalizing authentic image quality assessment. To solve this challenge, we propose distortion-guided unsupervised domain adaptation for BIQA (DGQA), a novel framework that leverages adaptive multi-domain selection via prior knowledge from distortion to match the data distribution between the source domains and the target domain, thereby reducing negative transfer from the outlier source domains. Extensive experiments on two cross-domain settings (synthetic distortion to authentic distortion and synthetic distortion to algorithmic distortion) have demonstrated the effectiveness of our proposed DGQA. Besides, DGQA is orthogonal to existing model-based BIQA methods, and can be used in combination with such models to improve performance with less training data.
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
Zhe Chen · Jiannan Wu · Wenhai Wang · Weijie Su · Guo Chen · Sen Xing · Zhong Muyan · Qing-Long Zhang · Xizhou Zhu · Lewei Lu · Bin Li · Ping Luo · Tong Lu · Yu Qiao · Jifeng Dai
The exponential growth of large language models (LLMs) has opened up numerous possibilities for multi-modal AGI systems. However, the progress in vision and vision-language foundation models, which are also critical elements of multi-modal AGI, has not kept pace with LLMs. In this work, we design a large-scale vision-language foundation model (InternVL), which scales up the vision foundation model to 6 billion parameters and progressively aligns it with the LLM, using web-scale image-text data from various sources. This model can be broadly applied to and achieve state-of-the-art performance on 32 generic visual-linguistic benchmarks including visual perception tasks such as image-level or pixel-level recognition, vision-language tasks such as zero-shot image/video classification, zero-shot image/video-text retrieval, and link with LLMs to create multi-modal dialogue systems. It has powerful visual capabilities and can be a good alternative to the ViT-22B. We hope that our research could contribute to the development of multi-modal large models.
Robust Distillation via Untargeted and Targeted Intermediate Adversarial Samples
Junhao Dong · Piotr Koniusz · Junxi Chen · Z. Wang · Yew-Soon Ong
Adversarially robust knowledge distillation aims to compress large-scale models into lightweight models while preserving adversarial robustness and natural performance on a given dataset. Existing methods typically align probability distributions of natural and adversarial samples between teacher and student models, but they overlook intermediate adversarial samples along the ``adversarial path'' formed by the multi-step gradient ascent of a sample towards the decision boundary. Such paths capture rich information about the decision boundary. In this paper, we propose a novel adversarially robust knowledge distillation approach by incorporating such adversarial paths into the alignment process. Recognizing the diverse impacts of intermediate adversarial samples (ranging from benign to noisy), we propose an adaptive weighting strategy to selectively emphasize informative adversarial samples, thus ensuring efficient utilization of lightweight model capacity. Moreover, we propose a dual-branch mechanism exploiting two following insights: (i) complementary dynamics of adversarial paths obtained by targeted and untargeted adversarial learning, and (ii) inherent differences between the gradient ascent path from class $c_i$ towards the nearest class boundary and the gradient descent path from a specific class $c_j$ towards the decision region of $c_i$ ($i \neq j$). Comprehensive experiments demonstrate the effectiveness of our method on lightweight models under various settings.
Class Incremental Learning with Multi-Teacher Distillation
Haitao Wen · Lili Pan · Yu Dai · Heqian Qiu · Lanxiao Wang · Qingbo Wu · Hongliang Li
Distillation strategies are currently the primary approaches for mitigating forgetting in class incremental learning (CIL). Existing methods generally inherit previous knowledge from a single teacher. However, teachers with different mechanisms are talented at different tasks, and inheriting diverse knowledge from them can enhance compatibility with new knowledge. In this paper, we propose the MTD method to find multiple diverse teachers for CIL. Specifically, we adopt weight permutation, feature perturbation, and diversity regularization techniques to ensure diverse mechanisms in teachers. To reduce time and memory consumption, each teacher is represented as a small branch in the model. We adapt existing CIL distillation strategies with MTD and extensive experiments on CIFAR-100, ImageNet-100, and ImageNet-1000 show significant performance improvement.
Large Language Models are Good Prompt Learners for Low-Shot Image Classification
Zhaoheng Zheng · Jingmin Wei · Xuefeng Hu · Haidong Zhu · Ram Nevatia
Low-shot image classification, where training images are limited or inaccessible, has benefited from recent progress on pre-trained vision-language (VL) models with strong generalizability, e.g. CLIP. Prompt learning methods built with VL models generate text features from the class names that only have confined class-specific information. Large Language Models (LLMs), with their vast encyclopedic knowledge, emerge as the complement. Thus, in this paper, we discuss the integration of LLMs to enhance pre-trained VL models, specifically on low-shot classification. However, the domain gap between language and vision blocks the direct application of LLMs. Thus, we propose LLaMP, Large Language Models as Prompt learners, that produces adaptive prompts for the CLIP text encoder, establishing it as the connecting bridge. Experiments show that, compared with other state-of-the-art prompt learning methods, LLaMP yields better performance on both zero-shot generalization and few-shot image classification, over a spectrum of 11 datasets.
Consistent Prompting for Rehearsal-Free Continual Learning
Zhanxin Gao · Jun Cen · Xiaobin Chang
Continual learning empowers models to adapt autonomously to the ever-changing environment or data streams without forgetting old knowledge. Prompt-based approaches are built on frozen pre-trained models to learn the task-specific prompts and classifiers efficiently. Existing prompt-based methods are inconsistent between training and testing, limiting their effectiveness. Two types of inconsistency are revealed. Test predictions are made from all classifiers while training only focuses on the current task classifier without holistic alignment, leading to Classifier inconsistency. Prompt inconsistency indicates that the prompt selected during testing may not correspond to the one associated with this task during training. In this paper, we propose a novel prompt-based method, Consistent Prompting (CPrompt), for more aligned training and testing. Specifically, all existing classifiers are exposed to prompt training, resulting in classifier consistency learning. In addition, prompt consistency learning is proposed to enhance prediction robustness and boost prompt selection accuracy. Our Consistent Prompting surpasses its prompt-based counterparts and achieves state-of-the-art performance on multiple continual learning benchmarks. Detailed analysis shows that improvements come from more consistent training and testing.
Tuning Stable Rank Shrinkage: Aiming at the Overlooked Structural Risk in Fine-tuning
Sicong Shen · Yang Zhou · Bingzheng Wei · Eric Chang · Yan Xu
Existing fine-tuning methods for computer vision tasks primarily focus on re-weighting the knowledge learned from the source domain during pre-training. They aim to retain beneficial knowledge for the target domain while suppressing unfavorable knowledge. During the pre-training and fine-tuning stages, there is a notable disparity in the data scale. Consequently, it is theoretically necessary to employ a model with reduced complexity to mitigate the potential structural risk. However, our empirical investigation in this paper reveals that models fine-tuned using existing methods still manifest a high level of model complexity inherited from the pre-training stage, leading to a suboptimal stability and generalization ability. This phenomenon indicates an issue that has been overlooked in fine-tuning: Structural Risk Minimization. To address this issue caused by data scale disparity during the fine-tuning stage, we propose a simple yet effective approach called Tuning Stable Rank Shrinkage (TSRS). TSRS mitigates the structural risk during the fine-tuning stage by constraining the noise sensitivity of the target model based on stable rank theories. Through extensive experiments, we demonstrate that incorporating TSRS into fine-tuning methods leads to improved generalization ability on various tasks, regardless of whether the neural networks are based on convolution or transformer architectures. Additionally, empirical analysis reveals that TSRS enhances the robustness, convexity, and smoothness of the loss landscapes in fine-tuned models. Code is available at https://github.com/WitGotFlg/TSRS.
Coherent Temporal Synthesis for Incremental Action Segmentation
Guodong Ding · Hans Golong · Angela Yao
Data replay is a successful incremental learning technique for images. It prevents catastrophic forgetting by keeping a reservoir of previous data, original or synthesized, to ensure the model retains past knowledge while adapting to novel concepts. However, its application in the video domain is rudimentary, as it simply stores frame exemplars for action recognition. This paper presents the first exploration of video data replay techniques for incremental action segmentation, focusing on action temporal modeling. We propose a Temporally Coherent Action (TCA) model, which represents actions using a generative model instead of storing individual frames. The integration of a conditioning variable that captures temporal coherence allows our model to understand the evolution of action features over time. Therefore, action segments generated by TCA for replay are diverse and temporally coherent. In a 10-task incremental setup on the Breakfast dataset, our approach achieves significant increases in accuracy for up to 22% compared to the baselines.
FCS: Feature Calibration and Separation for Non-Exemplar Class Incremental Learning
Qiwei Li · Yuxin Peng · Jiahuan Zhou
Non-Exemplar Class Incremental Learning (NECIL) involves learning a classification model on a sequence of data without access to exemplars from previously encountered old classes. Such a stringent constraint always leads to catastrophic forgetting of the learned knowledge. Currently, existing methods either employ knowledge distillation techniques or preserved class prototypes to sustain prior knowledge. However, two critical issues still persist. On the one hand, as the model is continually updated, the preserved prototypes of old classes will inevitably derive from the suitable location in the feature space of the new model. On the other hand, due to the lack of exemplars, the features of new classes will take the place of similar old classes which breaks the classification boundary. To address these challenges, we propose a Feature Calibration and Separation (FCS) method for NECIL. Our approach comprises a Feature Calibration Network (FCN) that adapts prototypes of old classes to the new model via optimal transport learning, approximating the drift of prototypes caused by model evolution. Additionally, we also propose a Prototype-Involved Contrastive Loss (PIC) that enhances feature separation among different classes. Specifically, to mitigate the boundary distortion arising from the interplay of classes from different learning stages, prototypes are involved in pushing the feature of new classes away from the old classes. Extensive experiments on three datasets with different settings have demonstrated the superiority of our FCS method against the state-of-the-art class incremental learning approaches. Code is available at https://github.com/zhoujiahuan1991/CVPR2024-FCS.
DeIL: Direct-and-Inverse CLIP for Open-World Few-Shot Learning
Shuai Shao · Yu Bai · Yan WANG · Bao-di Liu · Yicong Zhou
Open-World Few-Shot Learning (OFSL) is a critical field of research, concentrating on the precise identification of target samples in environments with scarce data and unreliable labels, thus possessing substantial practical significance. Recently, the evolution of foundation models like CLIP has revealed their strong capacity for representation, even in settings with restricted resources and data. This development has led to a significant shift in focus, transitioning from the traditional method of “building models from scratch” to a strategy centered on “efficiently utilizing the capabilities of foundation models to extract relevant prior knowledge tailored for OFSL and apply it judiciously”. Amidst this backdrop, we unveil the Direct-and-Inverse CLIP (DeIL), an innovative method leveraging our proposed “Direct-and-Inverse” concept to activate CLIP-based methods for addressing OFSL. This concept transforms conventional single-step classification into a nuanced two-stage process: initially filtering out less probable categories, followed by accurately determining the specific category of samples. DeIL comprises two key components: a pre-trainer (frozen) for data denoising, and an adapter (tunable) for achieving precise final classification. In experiments, DeIL achieves SOTA performance on 11 datasets. https://github.com/The-Shuai/DeIL.
Understanding and Improving Source-free Domain Adaptation from a Theoretical Perspective
Yu Mitsuzumi · Akisato Kimura · Hisashi Kashima
Source-free Domain Adaptation (SFDA) is an emerging and challenging research area that addresses the problem of unsupervised domain adaptation (UDA) without source data. Though numerous successful methods have been proposed for SFDA, a theoretical understanding of why these methods work well is still absent. In this paper, we shed light on the theoretical perspective of existing SFDA methods. Specifically, we find that SFDA loss functions comprising discriminability and diversity losses work in the same way as the training objective in the theory of self-training based on the expansion assumption, which shows the existence of the target error bound. This finding brings two novel insights that enable us to build an improved SFDA method comprising 1) Model Training with Auto-Adjusting Diversity Constraint and 2) Augmentation Training with Teacher-Student Framework, yielding a better recognition performance. Extensive experiments on three benchmark datasets demonstrate the validity of the theoretical analysis and our method.
Resurrecting Old Classes with New Data for Exemplar-Free Continual Learning
Dipam Goswami · Albin Soutif · Yuyang Liu · Sandesh Kamath · Bartłomiej Twardowski · Joost van de Weijer
Continual learning methods are known to suffer from catastrophic forgetting, a phenomenon that is particularly hard to counter for methods that do not store exemplars of previous tasks. Therefore, to reduce potential drift in the feature extractor, existing exemplar-free methods are typically evaluated in settings where the first task is significantly larger than subsequent tasks. Their performance drops drastically in more challenging settings starting with a smaller first task. To address this problem of feature drift estimation for exemplar-free methods, we propose to adversarially perturb the current samples such that their embeddings are close to the old class prototypes in the old model embedding space. We then estimate the drift in the embedding space from the old to the new model using the perturbed images and compensate the prototypes accordingly. We exploit the fact that adversarial samples are transferable from the old to the new feature space in a continual learning setting. The generation of these images is simple and computationally cheap. We demonstrate in our experiments that the proposed approach better tracks the movement of prototypes in embedding space and outperforms existing methods on several standard continual learning benchmarks as well as on fine-grained datasets. Code is available at https://github.com/dipamgoswami/ADC.
Adversarially Robust Few-shot Learning via Parameter Co-distillation of Similarity and Class Concept Learners
Junhao Dong · Piotr Koniusz · Junxi Chen · Xiaohua Xie · Yew-Soon Ong
Few-shot learning (FSL) facilitates a variety of computer vision tasks yet remains vulnerable to adversarial attacks. Existing adversarially robust FSL methods rely on either visual similarity learning or class concept learning. Our analysis reveals that these two learning paradigms are complementary, exhibiting distinct robustness due to their unique decision boundary types (concepts clustering by the visual similarity label vs. classification by the class labels). To bridge this gap, we propose a novel framework unifying adversarially robust similarity learning and class concept learning. Specifically, we distill parameters from both network branches into a "unified embedding model" during robust optimization and redistribute them to individual network branches periodically. To capture generalizable robustness across diverse branches, we initialize adversaries in each episode with cross-branch class-wise "global adversarial perturbations" instead of less informative random initialization. We also propose a branch robustness harmonization to modulate the optimization of similarity and class concept learners via their relative adversarial robustness. Extensive experiments demonstrate the state-of-the-art performance of our method in diverse few-shot scenarios.
Learning CNN on ViT: A Hybrid Model to Explicitly Class-specific Boundaries for Domain Adaptation
Ba Hung Ngo · Nhat-Tuong Do-Tran · Tuan-Ngoc Nguyen · Hae-Gon Jeon · Tae Jong Choi
Most domain adaptation (DA) methods are based on either a convolutional neural networks (CNNs) or a vision transformers (ViTs). They align the distribution differences between domains as encoders without considering their unique characteristics. For instance, ViT excels in accuracy due to its superior ability to capture global representations, while CNN has an advantage in capturing local representations. This fact has led us to design a hybrid method to fully take advantage of both ViT and CNN, called $\textbf{E}$xplicitly $\textbf{C}$lass-specific $\textbf{B}$oundaries ($\textbf{ECB}$). ECB learns CNN on ViT to combine their distinct strengths. In particular, we leverage ViT's properties to explicitly find class-specific decision boundaries by maximizing the discrepancy between the outputs of the two classifiers to detect target samples far from the source support. In contrast, the CNN encoder clusters target features based on the previously defined class-specific boundaries by minimizing the discrepancy between the probabilities of the two classifiers. Finally, ViT and CNN mutually exchange knowledge to improve the quality of pseudo labels and reduce the knowledge discrepancies of these models. Compared to conventional DA methods, our ECB achieves superior performance, which verifies its effectiveness in this hybrid model. The project website can be found https://dotrannhattuong.github.io/ECB/website/.
Efficient Stitchable Task Adaptation
Haoyu He · Zizheng Pan · Jing Liu · Jianfei Cai · Bohan Zhuang
The paradigm of pre-training and fine-tuning has laid the foundation for deploying deep learning models. However, most fine-tuning methods are designed to meet a specific resource budget. Recently, considering diverse deployment scenarios with various resource budgets, SN-Net is introduced to quickly obtain numerous new networks (stitches) from the pre-trained models (anchors) in a model family via model stitching. Although promising, SN-Net confronts new challenges when adapting it to new target domains, including huge memory and storage requirements and a long and sub-optimal multistage adaptation process. In this work, we present a novel framework, Efficient Stitchable Task Adaptation (ESTA), to efficiently produce a palette of fine-tuned models that adhere to diverse resource constraints. Specifically, we first tailor parameter-efficient fine-tuning to share low-rank updates among the stitches while maintaining independent bias terms. In this way, we largely reduce fine-tuning memory burdens and mitigate the interference among stitches that arises in task adaptation. Furthermore, we streamline a simple yet effective one-stage deployment pipeline, which estimates the important stitches to deploy with training-time gradient statistics. By assigning higher sampling probabilities to important stitches, we also get a boosted Pareto frontier. Extensive experiments on 25 downstream visual recognition tasks demonstrate that our ESTA is capable of generating stitches with smooth accuracy-efficiency trade-offs and surpasses the direct SN-Net adaptation by remarkable margins with significantly lower training time and fewer trainable parameters. Furthermore, we demonstrate the flexibility and scalability of our ESTA framework by stitching large language models (LLMs) and obtaining chatbot stitches of various sizes.
Gradient-based Parameter Selection for Efficient Fine-Tuning
Zhi Zhang · Qizhe Zhang · Zijun Gao · Renrui Zhang · Ekaterina Shutova · Shiji Zhou · Shanghang Zhang
With the growing size of pre-trained models, full fine-tuning and storing all the parameters for various downstream tasks is costly and infeasible. In this paper, we propose a new parameter-efficient fine-tuning method, Gradient-based Parameter Selection (GPS), demonstrating that only tuning a few selected parameters from the pre-trained model while keeping the remainder of the model frozen can generate similar or better performance compared with the full model fine-tuning method. Different from the existing popular and state-of-the-art parameter-efficient fine-tuning approaches, our method does not introduce any additional parameters and computational costs during both the training and inference stages. Another advantage is the model-agnostic and non-destructive property, which eliminates the need for any other design specific to a particular model. Compared with the full fine-tuning, GPS achieves 3.33% (91.78% vs. 88.45%, FGVC) and 9.61% (73.1% vs. 65.57%, VTAB) improvement of the accuracy with tuning only 0.36% parameters of the pre-trained model on average over 24 image classification tasks; it also demonstrates a significant improvement of 17% and 16.8% in mDice and mIoU, respectively, on medical image segmentation task. Moreover, GPS achieves state-of-the-art performance compared with existing PEFT methods. The code will be available in https://github.com/FightingFighting/GPS.git.
ArGue: Attribute-Guided Prompt Tuning for Vision-Language Models
Xinyu Tian · Shu Zou · Zhaoyuan Yang · Jing Zhang
Although soft prompt tuning is effective in efficiently adapting Vision-Language (V\&L) models for downstream tasks, it shows limitations in dealing with distribution shifts. We address this issue with Attribute-Guided Prompt Tuning (ArGue), making three key contributions. 1) In contrast to the conventional approach of directly appending soft prompts preceding class names, we align the model with primitive visual attributes generated by Large Language Models (LLMs). We posit that a model's ability to express high confidence in these attributes signifies its capacity to discern the correct class rationales. 2) We introduce attribute sampling to eliminate disadvantageous attributes, thus only semantically meaningful attributes are preserved. 3) We propose negative prompting, explicitly enumerating class-agnostic attributes to activate spurious correlations and encourage the model to generate highly orthogonal probability distributions in relation to these negative features. In experiments, our method significantly outperforms current state-of-the-art prompt tuning methods on both novel class prediction and out-of-distribution generalization tasks.
Simple Semantic-Aided Few-Shot Learning
Hai Zhang · Junzhe Xu · Shanlin Jiang · Zhenan He
Learning from a limited amount of data, namely Few-Shot Learning, stands out as a challenging computer vision task. Several works exploit semantics and design complicated semantic fusion mechanisms to compensate for rare representative features within restricted data. However, relying on naive semantics such as class names introduces biases due to their brevity, while acquiring extensive semantics from external knowledge takes a huge time and effort. This limitation severely constrains the potential of semantics in Few-Shot Learning. In this paper, we design an automatic way called Semantic Evolution to generate high-quality semantics. The incorporation of high-quality semantics alleviates the need for complex network structures and learning algorithms used in previous works. Hence, we employ a simple two-layer network termed Semantic Alignment Network to transform semantics and visual features into robust class prototypes with rich discriminative features for few-shot classification. The experimental results show our framework outperforms all previous methods on six benchmarks, demonstrating a simple network with high-quality semantics can beat intricate multi-modal modules on few-shot classification tasks. Code is available at https://github.com/zhangdoudou123/SemFew.
Long-Tail Class Incremental Learning via Independent Sub-prototype Construction
Xi Wang · Xu Yang · Jie Yin · Kun Wei · Cheng Deng
Long-tail class incremental learning (LT-CIL) is designed to perpetually acquire novel knowledge from an imbalanced and perpetually evolving data stream while ensuring the retention of previously acquired knowledge. The existing method only re-balances data distribution and ignores exploring the potential relationship between different samples, causing non-robust representations and even severe forgetting in classes with few samples. In this paper, we constructed two parallel spaces simultaneously: 1) Sub-prototype space and 2) Reminiscence space to learn robust representations while alleviating forgetfulness. Concretely, we advance the concept of the sub-prototype space, which amalgamates insights from diverse classes. This integration facilitates the mutual complementarity of varied knowledge, thereby augmenting the attainment of more robust representations.Furthermore, we introduce the reminiscence space, which encapsulates each class distribution, aiming to constraint model optimization and mitigate the phenomenon of forgetting. The tandem utilization of the two parallel spaces effectively alleviates the adverse consequences associated with imbalanced data distribution, preventing forgetting without needing replay examples. Extensive experiments demonstrate that our method achieves state-of-the-art performance on various benchmarks.
Few-shot object detection (FSOD) aims to detect objects with only a few training examples. Visual feature extraction and query-support few-shot learning are the two critical components. Existing works are usually developed based on ImageNet pre-training vision backbones and design sophisticated metric learning networks, which have inferior accuracy. In this work, we study few-shot object detection using modern foundation models. First, vision-only contrastive pre-trained DINOv2 model is used for the vision backbone, which shows strong transferable performance without tuning the parameters. Second, Large Language Model (LLM) is employed for contextualized few-shot learning with all classes and proposals within the query image. Language instructions are carefully designed to prompt the LLM to classify each proposal in context. The contextual information include proposal-proposal relations, proposal-class relations, and class-class relations, which can largely promote few-shot learning. We comprehensively evaluate the proposed model (FM-FSOD) in multiple FSOD benchmarks, achieving state-of-the-arts performance.
Stronger Fewer & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation
ZHIXIANG WEI · Lin Chen · Xiaoxiao Ma · Huaian Chen · Tianle Liu · Pengyang Ling · Jinjin Zheng · Ben Wang · Yi Jin
In this paper, we first assess and harness various Vision Foundation Models (VFMs) in the context of Domain Generalized Semantic Segmentation (DGSS). Driven by the motivation that $\textbf{Leveraging Stronger pre-trained models and Fewer trainable parameters for Superior generalizability}$, we introduce a robust fine-tuning approach, namely "$\textbf{Rein}$", to parameter-efficiently harness VFMs for DGSS. Built upon a set of trainable tokens, each linked to distinct instances, Rein precisely refines and forwards the feature maps from each layer to the next layer within the backbone. This process produces diverse refinements for different categories within a single image. With fewer trainable parameters, Rein efficiently fine-tunes VFMs for DGSS tasks, surprisingly surpassing full parameter fine-tuning.Extensive experiments across various settings demonstrate that Rein significantly outperforms state-of-the-art methods. Remarkably, with just an extra $\textbf{1}$% of trainable parameters within the frozen backbone, Rein achieves a mIoU of $\textbf{68.1}$% on the Cityscapes, without accessing any real urban-scene datasets. Such an improvement boosts the state-of-the-art by a notable $\textbf{21.7}$% in mIoU with efficient training.
Continual Forgetting for Pre-trained Vision Models
Hongbo Zhao · Bolin Ni · Junsong Fan · Yuxi Wang · Yuntao Chen · Gaofeng Meng · Zhaoxiang Zhang
For privacy and security concerns, the need to erase unwanted information from pre-trained vision models is becoming evident nowadays. In real-world scenarios, erasure requests originate at any time from both users and model owners. These requests usually form a sequence. Therefore, under such a setting, selective information is expected to be continuously removed from a pre-trained model while maintaining the rest. We define this problem as continual forgetting and identify two key challenges. (i) For unwanted knowledge, efficient and effective deleting is crucial. (ii) For remaining knowledge, the impact brought by the forgetting procedure should be minimal. To address them, we propose Group Sparse LoRA (GS-LoRA). Specifically, towards (i), we use LoRA modules to fine-tune the FFN layers in Transformer blocks for each forgetting task independently, and towards (ii), a simple group sparse regularization is adopted, enabling automatic selection of specific LoRA groups and zeroing out the others. GS-LoRA is effective, parameter-efficient, data-efficient, and easy to implement. We conduct extensive experiments on both face recognition and object detection and demonstrate that GS-LoRA manages to forget specific classes with minimal impact on other classes. Codes will be available upon acceptance.
AETTA: Label-Free Accuracy Estimation for Test-Time Adaptation
Taeckyung Lee · Sorn Chottananurak · Taesik Gong · Sung-Ju Lee
Test-time adaptation (TTA) has emerged as a viable solution to adapt pre-trained models to domain shifts using unlabeled test data. However, TTA faces challenges of adaptation failures due to its reliance on blind adaptation to unknown test samples in dynamic scenarios. Traditional methods for out-of-distribution performance estimation are limited by unrealistic assumptions in the TTA context, such as requiring labeled data or re-training models. To address this issue, we propose AETTA, a label-free accuracy estimation algorithm for TTA. We propose the prediction disagreement as the accuracy estimate, calculated by comparing the target model prediction with dropout inferences.We then improve the prediction disagreement to extend the applicability of AETTA under adaptation failures. Our extensive evaluation with four baselines and six TTA methods demonstrates that AETTA shows an average of 19.8\%p more accurate estimation compared with the baselines.We further demonstrate the effectiveness of accuracy estimation with a model recovery case study, showcasing the practicality of our model recovery based on accuracy estimation. The source code is available at https://github.com/taeckyung/AETTA.
Continual-MAE: Adaptive Distribution Masked Autoencoders for Continual Test-Time Adaptation
Jiaming Liu · Ran Xu · Senqiao Yang · Renrui Zhang · Qizhe Zhang · Zehui Chen · Yandong Guo · Shanghang Zhang
Continual Test-Time Adaptation (CTTA) is proposed to migrate a source pre-trained model to continually changing target distributions, addressing real-world dynamism. Existing CTTA methods mainly rely on entropy minimization or teacher-student pseudo-labeling schemes for knowledge extraction in unlabeled target domains. However, dynamic data distributions cause miscalibrated predictions and noisy pseudo-labels in existing self-supervised learning methods, hindering the effective mitigation of error accumulation and catastrophic forgetting problems during the continual adaptation process. To tackle these issues, we propose a continual self-supervised method, Adaptive Distribution Masked Autoencoders (ADMA), which enhances the extraction of target domain knowledge while mitigating the accumulation of distribution shifts. Specifically, we propose a Distribution-aware Masking (DaM) mechanism to adaptively sample masked positions, followed by establishing consistency constraints between the masked target samples and the original target samples. Additionally, for masked tokens, we utilize an efficient decoder to reconstruct a hand-crafted feature descriptor (e.g., Histograms of Oriented Gradients), leveraging its invariant properties to boost task-relevant representations. Through conducting extensive experiments on four widely recognized benchmarks, our proposed method attains state-of-the-art performance in both classification and segmentation CTTA tasks.
LEAD: Exploring Logit Space Evolution for Model Selection
Zixuan Hu · Xiaotong Li · SHIXIANG TANG · Jun Liu · Yichun Hu · Ling-Yu Duan
The remarkable success of ''pretrain-then-finetune'' paradigm has led to a proliferation of available pre-trained models for vision tasks. This surge presents a significant challenge in efficiently choosing the most suitable pre-trained models for downstream tasks. The critical aspect of this challenge lies in effectively predicting the model transferability by considering the underlying fine-tuning dynamics. Existing methods often model fine-tuning dynamics in feature space with linear transformations, which do not precisely align with the fine-tuning objective and fail to grasp the essential nonlinearity from optimization. To this end, we present LEAD, a finetuning-aligned approach based on the network output of logits. LEAD proposes a theoretical framework to model the optimization process and derives an ordinary differential equation (ODE) to depict the nonlinear evolution toward the final logit state. Additionally, we design a class-aware decomposition method to consider the varying evolution dynamics across classes and further ensure practical applicability. Integrating the closely aligned optimization objective and nonlinear modeling capabilities derived from the differential equation, our method offers a concise solution to effectively bridge the optimization gap in a single step, bypassing the lengthy fine-tuning process. The comprehensive experiments on 24 supervised and self-supervised pre-trained models across 10 downstream datasets demonstrate impressive performances and showcase its broad adaptability even in low-data scenarios.
In order to mimic the human few-shot learning (FSL) ability better and to make FSL closer to real-world applications, this paper proposes a practical FSL (pFSL) setting. pFSL is based on unsupervised pre-trained models (analogous to human prior knowledge) and recognizes many novel classes simultaneously. Compared to traditional FSL, pFSL is simpler in its formulation, easier to evaluate, more challenging and more practical. To cope with the rarity of training examples, this paper proposes IbM2, an instance-based max-margin method not only for the new pFSL setting, but also works well in traditional FSL scenarios. Based on the Gaussian Annulus Theorem, IbM2 converts random noise applied to the instances into a mechanism to achieve maximum margin in the many-way pFSL (or traditional FSL) recognition task. Experiments with various self-supervised pre-training methods and diverse many- or few-way FSL tasks show that IbM2 almost always leads to improvements compared to its respective baseline methods, and in most cases the improvements are significant. With both the new pFSL setting and novel IbM2 method, this paper shows that practical few-shot learning is both viable and promising.
Domain Gap Embeddings for Generative Dataset Augmentation
Yinong Oliver Wang · Younjoon Chung · Chen Henry Wu · Fernando De la Torre
The performance of deep learning models is intrinsically tied to the quality, volume, and relevance of their training data. Gathering ample data for production scenarios often demands significant time and resources. Among various strategies, data augmentation circumvents exhaustive data collection by generating new data points from existing ones. However, traditional augmentation techniques can be less effective amidst a shift in training and testing distributions.This paper explores the potential of synthetic data by leveraging large pre-trained models for data augmentation, especially when confronted with distribution shifts. Although recent advancements in generative models have enabled several prior works in cross-distribution data generation, they require model fine-tuning and a complex setup. To bypass these shortcomings, we introduce Domain Gap Embeddings (DoGE), a plug-and-play semantic data augmentation framework in a cross-distribution few-shot setting. Our method extracts disparities between source and desired data distributions in a latent form, and subsequently steers a generative process to supplement the training set with endless diverse synthetic samples. Our evaluations, conducted on a subpopulation shift and three domain adaptation scenarios under a few-shot paradigm, reveal that our versatile method improves performance across tasks without needing hands-on intervention or intricate fine-tuning. DoGE paves the way to effortlessly generate realistic, controllable synthetic datasets following the test distributions, bolstering real-world efficacy for downstream task models.
JoAPR: Cleaning the Lens of Prompt Learning for Vision-Language Models
YUNCHENG GUO · Xiaodong Gu
Leveraging few-shot datasets in prompt learning for Vision-Language Models eliminates the need for manual prompt engineering while highlighting the necessity of accurate annotations for the labels. However, high-level or complex label noise challenges prompt learning for Vision-Language Models. Aiming at this issue, we propose a new framework for improving its robustness. Specifically, we introduce the Joint Adaptive Partitioning for Label Refurbishment (JoAPR), a structured framework encompassing two key steps. 1) Data Partitioning, where we differentiate between clean and noisy data using joint adaptive thresholds. 2) Label Refurbishment, where we correct the labels based on the partition outcomes before retraining the network. Our comprehensive experiments confirm that JoAPR substantially enhances the robustness of prompt learning for Vision-Language Models against label noise, offering a promising direction for future research.
Generative Multi-modal Models are Good Class Incremental Learners
Xusheng Cao · Haori Lu · Linlan Huang · Xialei Liu · Ming-Ming Cheng
In class incremental learning (CIL) scenarios, the phenomenon of catastrophic forgetting caused by the classifier's bias towards the current task has long posed a significant challenge. It is mainly caused by the characteristic of discriminative models. With the growing popularity of the generative multi-modal models, we would explore replacing discriminative models with generative ones for CIL. However, transitioning from discriminative to generative models requires addressing two key challenges. The primary challenge lies in transferring the generated textual information into the classification of distinct categories. Additionally, it requires formulating the task of CIL within a generative framework. To this end, we propose a novel generative multi-modal model (GMM) framework for class incremental learning. Our approach directly generates labels for images using an adapted generative model. After obtaining the detailed text, we use a text encoder to extract text features and employ feature matching to determine the most similar label as the classification prediction. In the conventional CIL settings, we achieve significantly better results in long-sequence task scenarios. Under the Few-shot CIL setting, we have improved by at least 14% over the current state-of-the-art methods with significantly less forgetting.
Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models
Yabin Zhang · Wenjie Zhu · Hui Tang · Zhiyuan Ma · Kaiyang Zhou · Lei Zhang
With the emergence of pre-trained vision-language models like CLIP, how to adapt them to various downstream classification tasks has garnered significant attention in recent research. The adaptation strategies can be typically categorized into three paradigms: zero-shot adaptation, few-shot adaptation, and the recently-proposed training-free few-shot adaptation. Most existing approaches are tailored for a specific setting and can only cater to one or two of these paradigms. In this paper, we introduce a versatile adaptation approach that can effectively work under all three settings. Specifically, we propose the dual memory networks that comprise dynamic and static memory components. The static memory caches training data knowledge, enabling training-free few-shot adaptation, while the dynamic memory preserves historical test features online during the testing process, allowing for the exploration of additional data insights beyond the training set. This novel capability enhances model performance in the few-shot setting and enables model usability in the absence of training data.The two memory networks employ the same flexible memory interactive strategy, which can operate in a training-free mode and can be further enhanced by incorporating learnable projection layers. Our approach is tested across 11 datasets under the three task settings. Remarkably, in the zero-shot scenario, it outperforms existing methods by over 3\% and even shows superior results against methods utilizing external training data. Additionally, our method exhibits robust performance against natural distribution shifts. Codes are available at \url{https://github.com/YBZh/DMN}.
UniPT: Universal Parallel Tuning for Transfer Learning with Efficient Parameter and Memory
Haiwen Diao · Bo Wan · Ying Zhang · Xu Jia · Huchuan Lu · Long Chen
Parameter-efficient transfer learning (PETL), i.e., fine-tuning a small portion of parameters, is an effective strategy for adapting pre-trained models to downstream domains. To further reduce the memory demand, recent PETL works focus on the more valuable memory-efficient characteristic. In this paper, we argue that the scalability, adaptability, and generalizability of state-of-the-art methods are hindered by structural dependency and pertinency on specific pre-trained backbones. To this end, we propose a new memory-efficient PETL strategy, Universal Parallel Tuning (UniPT), to mitigate these weaknesses. Specifically, we facilitate the transfer process via a lightweight learnable parallel network, which consists of: 1) A parallel interaction module that decouples the sequential connections and processes the intermediate activations detachedly from the pre-trained network. 2) A confidence aggregation module that learns optimal strategies adaptively for integrating cross-layer features. We evaluate UniPT with different backbones (e.g., T5, VSE$\infty$, CLIP4Clip, Clip-ViL, and MDETR) on various vision-and-language tasks (image-text retrieval, video-text retrieval, visual question answering, compositional question answering, and visual grounding), and even pure NLP tasks (e.g., GLUE). Extensive ablations on 18 datasets have validated that UniPT can not only dramatically reduce memory consumption and outperform the best competitor, but also achieve competitive performance over other plain PETL methods with lower training memory overhead.
Federated Generalized Category Discovery
Nan Pu · Wenjing Li · Xinyuan Ji · Yalan Qin · Nicu Sebe · Zhun Zhong
Generalized category discovery (GCD) aims at grouping unlabeled samples from known and unknown classes, given labeled data of known classes. To meet the recent decentralization trend in the community, we introduce a practical yet challenging task, Federated GCD (Fed-GCD), where the training data are distributed in local clients and cannot be shared among clients. Fed-GCD aims to train a generic GCD model by client collaboration under the privacy-protected constraint. The Fed-GCD leads to two challenges: 1) representation degradation caused by training each client model with fewer data than centralized GCD learning, and 2) highly heterogeneous label spaces across different clients. To this end, we propose a novel Associated Gaussian Contrastive Learning (AGCL) framework based on learnable GMMs, which consists of a Client Semantics Association (CSA) and a global-local GMM Contrastive Learning (GCL). On the server, CSA aggregates the heterogeneous categories of local-client GMMs to generate a global GMM containing more comprehensive category knowledge. On each client, GCL builds class-level contrastive learning with both local and global GMMs. The local GCL learns robust representation with limited local data. The global GCL encourages the model to produce more discriminative representation with the comprehensive category relationships that may not exist in local data. We build a benchmark based on six visual datasets to facilitate the study of Fed-GCD. Extensive experiments show that our AGCL outperforms multiple baselines on all datasets.
Learning from One Continuous Video Stream
Joao Carreira · Michael King · Viorica Patraucean · Dilara Gokay · Catalin Ionescu · Yi Yang · Daniel Zoran · Joseph Heyward · Carl Doersch · Yusuf Aytar · Dima Damen · Andrew Zisserman
We introduce a framework for online learning from a single continuous video stream -- the way people and animals learn, without mini-batches, data augmentation or shuffling. This poses great challenges given the high correlation between consecutive video frames and there is very little prior work on it. Our framework allows us to do a first deep dive into the topic and includes a collection of streams and tasks composed from two existing video datasets, plus methodology for performance evaluation that considers both adaptation and generalization. We employ pixel-to-pixel modelling as a practical and flexible way to switch between pre-training and single-stream evaluation as well as between arbitrary tasks, without ever requiring changes to models and always using the same pixel loss. Equipped with this framework we obtained large single-stream learning gains from pre-training with a novel family of future prediction tasks, found that momentum hurts, and that the pace of weight updates matters. The combination of these insights leads to matching the performance of IID learning with batch size 1, when using the same architecture and without costly replay buffers. An overview of the paper is available online at https://sites.google.com/view/one-stream-video.
OrCo: Towards Better Generalization via Orthogonality and Contrast for Few-Shot Class-Incremental Learning
Noor Ahmed · Anna Kukleva · Bernt Schiele
Few-Shot Class Incremental Learning (FSCIL) introduces a paradigm in which the problem space expands with limited data. FSCIL methods inherently face the challenge of catastrophic forgetting as data arrives incrementally, making models susceptible to overwriting previously acquired knowledge. Moreover, given the scarcity of labeled samples available at any given time, models may be prone to overfitting and find it challenging to strike a balance between extensive pretraining and the limited incremental data. To address these challenges, we propose the OrCo framework built on two core principles: features' orthogonality in the representation space, and contrastive learning. In particular, we improve the generalization of the embedding space by employing a combination of supervised and self-supervised contrastive losses during the pretraining phase. Additionally, we introduce OrCo loss to address challenges arising from data limitations during incremental sessions. Through feature space perturbations and orthogonality between classes, the OrCo loss maximizes margins and reserves space for the following incremental data. This, in turn, ensures the accommodation of incoming classes in the feature space without compromising previously acquired knowledge. Our experimental results showcase state-of-the-art performance across three benchmark datasets, including mini-ImageNet, CIFAR100, and CUB datasets. The code will be made publicly available.
SDDGR: Stable Diffusion-based Deep Generative Replay for Class Incremental Object Detection
JUNSU KIM · Hoseong Cho · Jihyeon Kim · Yihalem Tiruneh · Seungryul Baek
In the field of class incremental learning (CIL), generative replay has become increasingly prominent as a method to mitigate the catastrophic forgetting, alongside the continuous improvements in generative models. However, its application in class incremental object detection (CIOD) has been significantly limited, primarily due to the complexities of scenes involving multiple labels. In this paper, we propose a novel approach called stable diffusion deep generative replay (SDDGR) for CIOD. Our method utilizes a diffusion-based generative model with pre-trained text-to-image diffusion networks to generate realistic and diverse synthetic images. SDDGR incorporates an iterative refinement strategy to produce high-quality images encompassing old classes. Additionally, we adopt an L2 knowledge distillation technique to improve the retention of prior knowledge in synthetic images. Furthermore, our approach includes pseudo-labeling for old objects within new task images, preventing misclassification as background elements. Extensive experiments on the COCO 2017 dataset demonstrate that SDDGR significantly outperforms existing algorithms, achieving a new state-of-the-art in various CIOD scenarios.
Active Domain Adaptation with False Negative Prediction for Object Detection
Yuzuru Nakamura · Yasunori Ishii · Takayoshi Yamashita
Domain adaptation adapts models to various scenes with different appearances. In this field, active domain adaptation is crucial in effectively sampling a limited number of data in the target domain. We propose an active domain adaptation method for object detection, focusing on quantifying the undetectability of objects. Existing methods for active sampling encounter challenges in considering undetected objects while estimating the uncertainty of model predictions. Our proposed active sampling strategy addresses this issue using an active learning approach that simultaneously accounts for uncertainty and undetectability. Our newly proposed False Negative Prediction Module evaluates the undetectability of images containing undetected objects, enabling more informed active sampling. This approach considers previously overlooked undetected objects, thereby reducing false negative errors. Moreover, using unlabeled data, our proposed method utilizes uncertainty-guided pseudo-labeling to enhance domain adaptation further. Extensive experiments demonstrate that the performance of our proposed method closely rivals that of fully supervised learning while requiring only a fraction of the labeling efforts needed for the latter.
Stationary Representations: Optimally Approximating Compatibility and Implications for Improved Model Replacements
Niccolò Biondi · Federico Pernici · Simone Ricci · Alberto Del Bimbo
Learning compatible representations enables the interchangeable use of semantic features as models are updated over time. This is particularly relevant in search and retrieval systems where it is crucial to avoid reprocessing of the gallery images with the updated model. While recent research has shown promising empirical evidence, there is still a lack of comprehensive theoretical understanding about learning compatible representations. In this paper, we demonstrate that the stationary representations learned by the $d$-Simplex fixed classifier optimally approximate compatibility representation according to the two inequality constraints of its formal definition. This not only establishes a solid foundation for future works in this line of research but also presents implications that can be exploited in practical learning scenarios. An exemplary application is the now-standard practice of downloading and fine-tuning new pre-trained models. Specifically, we show the strengths and critical issues of stationary representations in the case in which a model undergoing sequential fine-tuning is asynchronously replaced by downloading a better-performing model pre-trained elsewhere. Such a representation enables seamless delivery of retrieval service (i.e., no reprocessing of gallery images) and offers improved performance without operational disruptions during model replacement.
Your Transferability Barrier is Fragile: Free-Lunch for Transferring the Non-Transferable Learning
Ziming Hong · Li Shen · Tongliang Liu
Recently, non-transferable learning (NTL) was proposed to restrict models' generalization toward the target domain(s), which serves as state-of-the-art solutions for intellectual property (IP) protection. However, the robustness of the established "transferability barrier" for degrading the target domain performance has not been well studied. In this paper, we first show that the generalization performance of NTL models is widely impaired on third-party domains (i.e., the unseen domain in the NTL training stage). We explore the impairment patterns and find that: due to the dominant generalization of non-transferable task, NTL models tend to make target-domain-consistent predictions on third-party domains, even though only a slight distribution shift from the third-party domain to the source domain. Motivated by these findings, we uncover the potential risks of NTL by proposing a simple but effective method (dubbed as TransNTL) to recover the target domain performance with few source domain data. Specifically, by performing a group of different perturbations on the few source domain data, we obtain diverse third-party domains that evoke the same impairment patterns as the unavailable target domain. Then, we fine-tune the NTL model under an impairment-repair self-distillation framework, where the source-domain predictions are used to teach the model itself how to predict on third-party domains, thus repairing the impaired generalization. Empirically, experiments on standard NTL benchmarks show that the proposed TransNTL reaches up to $\sim$72\% target-domain improvements by using only 10\% source domain data. Finally, we also explore a feasible defense method and empirically demonstrate its effectiveness.
Transductive Zero-Shot and Few-Shot CLIP
Ségolène Martin · Yunshi HUANG · Fereshteh Shakeri · Jean-Christophe Pesquet · Ismail Ben Ayed
Transductive inference has been widely investigated in few-shot image classification, but completely overlooked in the recent, fast growing literature on adapting vision-langage models like CLIP. This paper addresses the transductive zero-shot and few-shot CLIP classification challenge, in which inference is performed jointly across a mini-batch of unlabeled query samples, rather than treating each instance independently. We initially construct informative vision-text probability features, leading to a classification problem on the unit simplex set. Inspired by Expectation-Maximization (EM), our optimization-based classification objective models the data probability distribution for each class using a Dirichlet law. The minimization problem is then tackled with a novel block Majorization-Minimization algorithm, which simultaneously estimates the distribution parameters and class assignments. Extensive numerical experiments on 11 datasets underscore the benefits and efficacy of our batch inference approach.On zero-shot tasks with test batches of 75 samples, our approach yields near 20$\%$ improvement in ImageNet accuracy over CLIP's zero-shot performance. Additionally, we outperform state-of-the-art methods in the few-shot setting. The code is available at: \url{https://github.com/SegoleneMartin/transductive-CLIP}.
Task2Box: Box Embeddings for Modeling Asymmetric Task Relationships
Rangel Daroya · Aaron Sun · Subhransu Maji
Modeling and visualizing relationships between tasks or datasets is an important step towards solving various meta-tasks such as dataset discovery, multi-tasking, and transfer learning. However, many relationships, such as containment and transferability, are naturally asymmetric and current approaches for representation and visualization (e.g., t-SNE) do not readily support this. We propose Task2Box, an approach to represent tasks using box embeddings---axis-aligned hyperrectangles in low dimensional spaces---that can capture asymmetric relationships between them through volumetric overlaps. We show that Task2Box accurately predicts unseen hierarchical relationships between nodes in ImageNet and iNaturalist datasets, as well as transferability between tasks in the Taskonomy benchmark. We also show that box embeddings estimated from task representations (e.g., CLIP, Task2Vec, or attribute based) can be used to predict relationships between unseen tasks more accurately than classifiers trained on the same representations, as well as handcrafted asymmetric distances (e.g., KL divergence). This suggests that low-dimensional box embeddings can effectively capture these task relationships and have the added advantage of being interpretable. We use the approach to visualize relationships among publicly available image classification datasets on popular dataset hosting platform called Hugging Face.
Unbiased Faster R-CNN for Single-source Domain Generalized Object Detection
Yajing Liu · Shijun Zhou · Xiyao Liu · chunhui Hao · Baojie Fan · Jiandong Tian
Single-source domain generalization (SDG) for object detection is a challenging yet essential task as the distribution bias of the unseen domain degrades the algorithm performance significantly. However, existing methods attempt to extract domain-invariant features, neglecting that the biased data leads the network to learn biased features that are non-causal and poorly generalizable. To this end, we propose an Unbiased Faster R-CNN (UFR) for generalizable feature learning. Specifically, we formulate SDG in object detection from a causal perspective and construct a Structural Causal Model (SCM) to analyze the data bias and feature bias in the task, which are caused by scene confounders and object attribute confounders. Based on the SCM, we design a Global-Local Transformation module for data augmentation, which effectively simulates domain diversity and mitigates the data bias. Additionally, we introduce a Causal Attention Learning module that incorporates a designed attention invariance loss to learn image-level features that are robust to scene confounders. Moreover, we develop a Causal Prototype Learning module with an explicit instance constraint and an implicit prototype constraint, which further alleviates the negative impact of object attribute confounders. Experimental results on five scenes demonstrate the prominent generalization ability of our method, with an improvement of 3.9\% mAP on the Night-Clear scene.
MetaCloak: Preventing Unauthorized Subject-driven Text-to-image Diffusion-based Synthesis via Meta-learning
Yixin Liu · Chenrui Fan · Yutong Dai · Xun Chen · Pan Zhou · Lichao Sun
Text-to-image diffusion models allow seamless generation of personalized images from scant reference photos. Yet, these tools, in the wrong hands, can fabricate misleading or harmful content, endangering individuals. To address this problem, existing poisoning-based approaches perturb user images in an imperceptible way to render them "unlearnable" from malicious uses. We identify two limitations of these defending approaches: i) sub-optimal due to the hand-crafted heuristics for solving the intractable bilevel optimization and ii) lack of robustness against simple data transformations like Gaussian filtering. To solve these challenges, we propose MetaCloak, which solves the bi-level poisoning problem with a meta-learning framework with an additional transformation sampling process to craft transferable and robust perturbation. Specifically, we employ a pool of surrogate diffusion models to craft transferable and model-agnostic perturbation. Furthermore, by incorporating an additional transformation process, we design a simple denoising-error maximization loss that is sufficient for causing transformation-robust semantic distortion and degradation in a personalized generation. Extensive experiments on the VGGFace2 and CelebA-HQ datasets show that MetaCloak outperforms existing approaches. Notably, MetaCloak can successfully fool online training services like Replicate, in a black-box manner, demonstrating the effectiveness of MetaCloak in real-world scenarios.