Abstract:
The premise for the great advancement of molecular machine learning is dependent on a considerable amount of labeled data. In many real-world scenarios, the labeled molecules are limited in quantity or laborious to derive. Recent pseudo-labeling methods are usually designed based on a single domain knowledge, thereby failing to understand the comprehensive molecular configurations and limiting their adaptability to generalize across diverse biochemical context. To this end, we introduce an innovative paradigm for dealing with the molecule pseudo-labeling, named as Molecular Data Programming (MDP). In particular, we adopt systematic supervision sources via crafting multiple graph labeling functions, which covers various molecular structural knowledge of graph kernels, molecular fingerprints, and topological features. Each of them creates an uncertain and biased labels for the unlabeled molecules. To address the decision conflicts among the diverse pseudo-labels, we design a label synchronizer to differentiably model confidences and correlations between the labeling functions, which yields probabilistic molecular labels to adapt for specific applications. These probabilistic molecular labels are used to train a molecular classifier for improving its generalization capability. On eight benchmark datasets, we empirically demonstrate the effectiveness of MDP on the weakly supervised molecule classification tasks, achieving an average improvement of $9.5\%$.
Chat is not available.