The Source Function as Attention Kernel
K(x, x′) = ∇w f(w, x)ᵀ · ∇w f(w, x′)
The neural tangent kernel (NTK) [1] is a similarity measure between the inputs x and x′: it describes how updating the model parameters w on one molecule x affects the prediction for another molecule x′, and thus how a deep neural network evolves during training by gradient descent. I argue that the source function [2], a sensitive measure of the transferability of fragments between molecules, represents such a kernel, albeit one of “attention” rather than similarity.
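As a numerical aside (a minimal sketch, not part of the formalism above), an NTK entry can be evaluated directly as the dot product of parameter gradients; the two-parameter model and the descriptor values for the two molecules below are purely illustrative assumptions.

import numpy as np

def f(w, x):
    # Toy scalar model: a two-parameter fit to a single molecular descriptor x.
    return w[0] * np.tanh(w[1] * x)

def grad_f(w, x, eps=1.0e-6):
    # Central finite-difference gradient of f with respect to the parameters w.
    g = np.zeros_like(w)
    for i in range(len(w)):
        dw = np.zeros_like(w)
        dw[i] = eps
        g[i] = (f(w + dw, x) - f(w - dw, x)) / (2.0 * eps)
    return g

def ntk(w, x, x_prime):
    # K(x, x') = grad_w f(w, x) . grad_w f(w, x')
    return grad_f(w, x) @ grad_f(w, x_prime)

w = np.array([0.8, 1.5])        # hypothetical trained parameters
x, x_prime = 0.3, 0.7           # hypothetical descriptors for two molecules
print(ntk(w, x, x_prime))       # how a gradient step on x moves the prediction for x_prime
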
The local source [2,3]
LS(r, r′) = −(1/4π) ∇²ρ(r′)/|r − r′|
represents the effectiveness of the concentration (or depletion) of electron density ρ(r′) at r′ in functioning as a source (or sink) for the electron density at r. The Laplacian ∇²ρ(r′) serves, by virtue of Poisson’s equation, as the generator of the electron density distribution, and the (integrated) source function S(r, Ω) = ∫Ω LS(r, r′) dr′ provides a measure of the relative contribution of an atom or group Ω to the density at any point r.
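For concreteness, the following minimal sketch (an assumption for illustration: a hydrogen-like 1s density stands in for a molecular density, so that ∇²ρ is analytic) evaluates LS(r, r′) on a grid and accumulates S(r, Ω) over a spherical region Ω by a crude Riemann sum; on such a coarse grid the reconstruction of ρ(r) is only approximate.

import numpy as np

def rho(r):
    # Hydrogen-like 1s density in atomic units: rho(r) = exp(-2r)/pi.
    return np.exp(-2.0 * r) / np.pi

def lap_rho(r):
    # Analytic Laplacian of the 1s density for r > 0: (4/pi)(1 - 1/r) exp(-2r).
    return (4.0 / np.pi) * (1.0 - 1.0 / r) * np.exp(-2.0 * r)

def source(r_vec, R_omega=4.0, n=81):
    # S(r, Omega): Riemann sum of LS(r, r') = -(1/4 pi) lap_rho(r') / |r - r'|
    # over a cubic grid clipped to a ball Omega of radius R_omega about the nucleus.
    xs = np.linspace(-R_omega, R_omega, n)
    dv = (xs[1] - xs[0]) ** 3
    X, Y, Z = np.meshgrid(xs, xs, xs, indexing="ij")
    rp = np.sqrt(X**2 + Y**2 + Z**2)                    # |r'| measured from the nucleus
    d = np.sqrt((X - r_vec[0])**2 + (Y - r_vec[1])**2 + (Z - r_vec[2])**2)
    mask = (rp > 1e-6) & (rp <= R_omega) & (d > 1e-6)   # drop the singular grid points
    ls = -lap_rho(rp[mask]) / (4.0 * np.pi * d[mask])   # local source LS(r, r')
    return np.sum(ls) * dv

r_ref = np.array([1.0, 0.0, 0.0])     # reference point r, 1 bohr from the nucleus
print(source(r_ref), rho(1.0))        # S(r, Omega) roughly reconstructs rho(r)
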
The self-attention mechanism [4] at the heart of the transformer architecture now underlies all large language models (LLMs). In an analogous manner, it captures the relationships between different words in a sentence, or characters in a string, regardless of how far apart they are, and scores them in order of importance, thereby enabling the model to “attend to” those parts of the input relevant for predicting the next word or character. Each attention layer employs three sets of vectors: queries (Q) and keys (K) of dimension dₖ, and values (V):
Attention(Q, K, V) = softmax(QKᵀ/√dₖ) V.
This "scaled dot product attention" computes the dot products of the query with all keys, to give the weights on the values V at each position in the string. Typically, multiple attention layers are applied in parallel, to constitute a multi-head attention block, enabling the transformer to pay attention to information from different positions in the string, and several such blocks are employed to process the input stream in parallel on GPUs. LLMs can thus have billions of trainable parameters. This thus opens up the possibility of using transformers to learn the electron density using the source function.
References:
[1] A. Jacot, et al., Adv. Neural Inf. Proc. Sys. (2018) 8580–8589.
[2] R.F.W. Bader, C. Gatti, Chem. Phys. Lett. 287 (1998) 233–238.
[3] C. Gatti, L. Bertini, Acta Crystallogr. A 60 (2004) 438–449.
[4] A. Vaswani, et al., Proc. 31st Int. Conf. Neural Inf. Proc. Sys. (NIPS 2017) 6000–6010. arXiv:1706.03762v5.
N. Sukumar, The Source Function as Attention Kernel, Sagamore XX Conference on Quantum Crystallography, Shiv Nadar Institution of Eminence, India, November 10-15, 2024.