Oral Session
Orals 2B Deep learning architectures and techniques
Summit Flex Hall AB
Overflow in Signature Room on the 5th Floor in Summit
Neural Redshift: Random Networks are not Random Functions
Damien Teney · Armand Nicolicioiu · Valentin Hartmann · Ehsan Abbasnejad
Context. Our understanding of the generalization capabilities of neural networks (NNs) is incomplete. The prevailing explanation is based on implicit biases of gradient descent (GD), but it cannot account for recent findings on the capabilities of models found by gradient-free methods, nor for the "simplicity bias" observed even in untrained networks. This study seeks the source of inherent properties of NNs. Findings. To characterize the inductive biases provided by architectures independently of GD, we examine networks with random weights and show that they do not correspond to random functions. We characterize the functions implemented by various architectures using decompositions in Fourier and polynomial bases and compressed representations. Even simple MLPs have strong inductive biases: uniform sampling in parameter space yields a strongly biased sampling of functions in frequency, order, and compressibility. Popular components including ReLUs, residual connections, and normalizations induce a bias toward the lower end of these measures, accounting for the "simplicity bias" frequently attributed to (S)GD. We also show that transformer-based sequence models inherit similar properties from their building blocks. Implications. We provide a fresh explanation for the success of deep learning that is compatible with recent observations, complementing explanations based on gradient-based optimization. This also points to future avenues for controlling the solutions implemented in trained models.
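As a concrete illustration of the sampling experiment described above, the following minimal Python sketch (not the authors' code; the layer widths, initialization scale, and grid size are assumptions) samples random ReLU MLPs on a 1D grid and averages their Fourier spectra, which should concentrate most of the mass at low frequencies:

# Minimal sketch: estimate the frequency content of randomly initialized
# ReLU MLPs on a 1D input grid, illustrating that uniform sampling in
# parameter space favors low-frequency functions.
import numpy as np

rng = np.random.default_rng(0)

def random_relu_mlp(widths):
    """Sample weights/biases for an MLP; He-style scaling is an assumption."""
    params = []
    for d_in, d_out in zip(widths[:-1], widths[1:]):
        W = rng.normal(0, np.sqrt(2.0 / d_in), size=(d_out, d_in))
        b = rng.normal(0, 0.1, size=d_out)
        params.append((W, b))
    return params

def forward(params, x):
    h = x
    for W, b in params[:-1]:
        h = np.maximum(0.0, h @ W.T + b)   # ReLU hidden layers
    W, b = params[-1]
    return h @ W.T + b                     # linear output layer

x = np.linspace(-1, 1, 512).reshape(-1, 1)
spectra = []
for _ in range(100):                       # average over random networks
    y = forward(random_relu_mlp([1, 64, 64, 1]), x).ravel()
    spectra.append(np.abs(np.fft.rfft(y - y.mean())))
mean_spectrum = np.mean(spectra, axis=0)
print(mean_spectrum[:8] / mean_spectrum.sum())  # mass sits at low frequencies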
Neural Lineage
Runpeng Yu · Xinchao Wang
Given a well-behaved neural network, is it possible to identify the parent network on which it was fine-tuned? In this paper, we introduce a novel task known as neural lineage detection, aiming at discovering lineage relationships between parent and child models. Specifically, from a set of parent models, neural lineage detection predicts which parent model a child model has been fine-tuned from. We propose two approaches to address this task. (1) For practical convenience, we introduce a learning-free approach, which integrates an approximation of the fine-tuning process into neural network representation similarity metrics, leading to a similarity-based lineage detection scheme. (2) In pursuit of accuracy, we introduce a learning-based lineage detector comprising encoders and a transformer detector. Through experimentation, we validate that both the learning-free and learning-based methods outperform the baseline across various learning settings and are adaptable to a variety of visual models. Moreover, they also exhibit the ability to trace cross-generational lineage, effectively identifying not only parent models but also their ancestors.
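To make the task concrete, here is a deliberately naive Python sketch of similarity-based lineage detection (an assumption-laden stand-in, not the paper's metric): it scores each candidate parent by the L2 distance between flattened weights, whereas the paper's learning-free variant refines such a similarity with an approximation of the fine-tuning process:

# Naive baseline sketch, not the paper's method: pick the parent whose
# flattened weights are closest to the child's in L2 distance. Assumes all
# models share one architecture.
import torch

def flatten_params(model):
    return torch.cat([p.detach().reshape(-1) for p in model.parameters()])

def detect_parent(child, parents):
    """Return the index of the most likely parent among `parents`."""
    c = flatten_params(child)
    dists = [torch.norm(c - flatten_params(p)).item() for p in parents]
    return min(range(len(parents)), key=dists.__getitem__)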
Learning Structure-from-Motion with Graph Attention Networks
Lucas Brynte · José Pedro Iglesias · Carl Olsson · Fredrik Kahl
In this paper we tackle the problem of learning Structure-from-Motion (SfM) through the use of graph attention networks. SfM is a classic computer vision problem that is solved through iterative minimization of reprojection errors, referred to as Bundle Adjustment (BA), starting from a good initialization. To obtain a good enough initialization for BA, conventional methods rely on a sequence of sub-problems (such as pairwise pose estimation, pose averaging, or triangulation) which provides an initial solution that can then be refined using BA. In this work we replace these sub-problems by learning a model that takes as input the 2D keypoints detected across multiple views and outputs the corresponding camera poses and 3D keypoint coordinates. Our model takes advantage of graph neural networks to learn SfM-specific primitives, and we show that it can be used for fast inference of the reconstruction for new and unseen sequences. The experimental results show that the proposed model outperforms competing learning-based methods and challenges COLMAP while having a lower runtime.
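For reference, the bundle-adjustment objective that the learned model initializes is the standard sum of squared reprojection errors over visible camera-point pairs; the Python sketch below states it in code (variable names are illustrative, not from the paper):

# Sketch of the bundle-adjustment objective: summed squared reprojection
# errors over cameras and points. Names (P, X, x_obs, vis) are illustrative.
import numpy as np

def reprojection_error(P, X, x_obs, vis):
    """P: (n_cams, 3, 4) projection matrices; X: (n_pts, 4) homogeneous 3D
    points; x_obs: (n_cams, n_pts, 2) detected 2D keypoints; vis: boolean
    visibility mask of shape (n_cams, n_pts)."""
    proj = np.einsum('cij,pj->cpi', P, X)       # (n_cams, n_pts, 3)
    proj = proj[..., :2] / proj[..., 2:3]       # perspective division
    residual = (proj - x_obs)[vis]              # keep visible observations
    return np.sum(residual ** 2)                # BA minimizes this sum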
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
Bin Xiao · Haiping Wu · Weijian Xu · Xiyang Dai · Houdong Hu · Yumao Lu · Michael Zeng · Ce Liu · Lu Yuan
We introduce Florence-2, a novel vision foundation model with a unified, prompt-based representation for a variety of computer vision and vision-language tasks. While existing large vision models excel in transfer learning, they struggle to perform a diversity of tasks with simple instructions, a capability that implies handling the complexity of varying spatial hierarchy and semantic granularity. Florence-2 was designed to take text prompts as task instructions and generate desirable results in text form, whether captioning, object detection, grounding, or segmentation. This multi-task learning setup demands large-scale, high-quality annotated data. To this end, we co-developed FLD-5B, a dataset consisting of 5.4 billion comprehensive visual annotations on 126 million images, built with an iterative strategy of automated image annotation and model refinement. We adopted a sequence-to-sequence structure to train Florence-2 to perform versatile and comprehensive vision tasks. Extensive evaluations on numerous tasks demonstrate Florence-2 to be a strong vision foundation model contender with unprecedented zero-shot and fine-tuning capabilities.
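A short usage sketch of the prompt-based interface follows; the checkpoint name, task token, and post-processing helper are taken from the public Hugging Face model card rather than from the abstract, so treat them as assumptions:

# Hedged usage sketch: querying a prompt-based unified model. The "<OD>"
# token requests object detection; other tasks use other task tokens.
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

ckpt = "microsoft/Florence-2-base"
model = AutoModelForCausalLM.from_pretrained(ckpt, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(ckpt, trust_remote_code=True)

image = Image.open("example.jpg")               # any RGB image
inputs = processor(text="<OD>", images=image, return_tensors="pt")
ids = model.generate(input_ids=inputs["input_ids"],
                     pixel_values=inputs["pixel_values"],
                     max_new_tokens=512)
raw = processor.batch_decode(ids, skip_special_tokens=False)[0]
print(processor.post_process_generation(raw, task="<OD>",
                                        image_size=image.size))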
In Search of a Data Transformation That Accelerates Neural Field Training
Junwon Seo · Sangyoon Lee · Kwang In Kim · Jaeho Lee
The neural field is an emerging paradigm in data representation that trains a neural network to approximate a given signal. A key obstacle preventing its widespread adoption is the encoding speed: generating a neural field requires overfitting a neural network, which can take a significant number of SGD steps to reach the desired fidelity level. In this paper, we delve into the impact of data transformations on the speed of neural field training, specifically focusing on how permuting pixel locations affects the convergence speed of SGD. Counterintuitively, we find that randomly permuting the pixel locations can considerably accelerate training. To explain this phenomenon, we examine neural field training through the lens of PSNR curves, loss landscapes, and error patterns. Our analyses suggest that random pixel permutations remove the easy-to-fit patterns, which facilitate easy optimization in the early stage but hinder the capture of fine details of the signal.
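The transformation under study is simple to state in code; the Python sketch below (illustrative, not the authors' implementation) shuffles the pairing between pixel coordinates and values before a coordinate MLP would be overfit to them:

# Sketch of the studied transformation: randomly permute which pixel value
# sits at which coordinate before fitting a coordinate MLP to the image.
import numpy as np

rng = np.random.default_rng(0)
H, W = 64, 64
image = rng.random((H, W))                      # stand-in for a real image

ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
coords = np.stack([ys.ravel(), xs.ravel()], 1) / [H - 1, W - 1]  # in [0,1]^2
values = image.ravel()

perm = rng.permutation(H * W)                   # the random pixel permutation
shuffled_values = values[perm]                  # coords now target permuted values
# Per the abstract's finding, fitting an MLP on (coords, shuffled_values)
# instead of (coords, values) converges faster under SGD.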