The Source Function as Attention Kernel
K(x, x′) = ∇w f(w, x)ᵀ · ∇w f(w, x′)
The neural tangent kernel (NTK) [1] is a similarity measure between the inputs x and x′: it describes how updating the model parameters w on one molecule x affects the prediction for another molecule x′, and thus how a deep neural network evolves during training by gradient descent. I argue that the source function [2], a sensitive measure of the transferability of fragments between molecules, represents such a kernel, albeit one of “attention” rather than similarity.
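As a numerical aside (a minimal sketch, not part of the formalism above), an NTK entry can be evaluated directly as the dot product of parameter gradients; the two-parameter model and the descriptor values for the two molecules below are purely illustrative assumptions.

import numpy as np

def f(w, x):
    # Toy scalar model: a two-parameter fit to a single molecular descriptor x.
    return w[0] * np.tanh(w[1] * x)

def grad_f(w, x, eps=1.0e-6):
    # Central finite-difference gradient of f with respect to the parameters w.
    g = np.zeros_like(w)
    for i in range(len(w)):
        dw = np.zeros_like(w)
        dw[i] = eps
        g[i] = (f(w + dw, x) - f(w - dw, x)) / (2.0 * eps)
    return g

def ntk(w, x, x_prime):
    # K(x, x') = grad_w f(w, x) . grad_w f(w, x')
    return grad_f(w, x) @ grad_f(w, x_prime)

w = np.array([0.8, 1.5])        # hypothetical trained parameters
x, x_prime = 0.3, 0.7           # hypothetical descriptors for two molecules
print(ntk(w, x, x_prime))       # how a gradient step on x moves the prediction for x_prime
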
The local source [2,3]
LS(r, r′) = −(1/4π) ∇²ρ(r′)/|r − r′|
represents the effectiveness of the concentration (or depletion) of electron density ρ(r′) at r′ in functioning as a source (or sink) for the electron density at r. The Laplacian ∇²ρ(r′) serves, by virtue of Poisson’s equation, as the generator of the electron density distribution, and the (integrated) source function S(r, Ω) = ∫Ω LS(r, r′) dr′ provides a measure of the relative contribution of an atom or group Ω to the density at any point r.
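For concreteness, the following minimal sketch (an assumption for illustration: a hydrogen-like 1s density stands in for a molecular density, so that ∇²ρ is analytic) evaluates LS(r, r′) on a grid and accumulates S(r, Ω) over a spherical region Ω by a crude Riemann sum; on such a coarse grid the reconstruction of ρ(r) is only approximate.

import numpy as np

def rho(r):
    # Hydrogen-like 1s density in atomic units: rho(r) = exp(-2r)/pi.
    return np.exp(-2.0 * r) / np.pi

def lap_rho(r):
    # Analytic Laplacian of the 1s density for r > 0: (4/pi)(1 - 1/r) exp(-2r).
    return (4.0 / np.pi) * (1.0 - 1.0 / r) * np.exp(-2.0 * r)

def source(r_vec, R_omega=4.0, n=81):
    # S(r, Omega): Riemann sum of LS(r, r') = -(1/4 pi) lap_rho(r') / |r - r'|
    # over a cubic grid clipped to a ball Omega of radius R_omega about the nucleus.
    xs = np.linspace(-R_omega, R_omega, n)
    dv = (xs[1] - xs[0]) ** 3
    X, Y, Z = np.meshgrid(xs, xs, xs, indexing="ij")
    rp = np.sqrt(X**2 + Y**2 + Z**2)                    # |r'| measured from the nucleus
    d = np.sqrt((X - r_vec[0])**2 + (Y - r_vec[1])**2 + (Z - r_vec[2])**2)
    mask = (rp > 1e-6) & (rp <= R_omega) & (d > 1e-6)   # drop the singular grid points
    ls = -lap_rho(rp[mask]) / (4.0 * np.pi * d[mask])   # local source LS(r, r')
    return np.sum(ls) * dv

r_ref = np.array([1.0, 0.0, 0.0])     # reference point r, 1 bohr from the nucleus
print(source(r_ref), rho(1.0))        # S(r, Omega) roughly reconstructs rho(r)
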
The self-attention mechanism [4] at the heart of the transformer architecture now underlies all large language models (LLMs). In an analogous manner, it captures the relationships between different words in a sentence, or characters in a string, regardless of how far apart they are, and scores them in order of importance, thereby enabling the model to “attend to” those parts of the input relevant for predicting the next word or character. Each attention layer employs three sets of vectors: queries (Q) and keys (K) of dimension dₖ, and values (V):
Attention(Q, K, V) = softmax(QKᵀ/√dₖ) V.
This "scaled dot product attention" computes the dot products of the query with all keys, to give the weights on the values V at each position in the string. Typically, multiple attention layers are applied in parallel, to constitute a multi-head attention block, enabling the transformer to pay attention to information from different positions in the string, and several such blocks are employed to process the input stream in parallel on GPUs. LLMs can thus have billions of trainable parameters. This thus opens up the possibility of using transformers to learn the electron density using the source function.
References:
[1] A. Jacot, et al., Adv. Neural Inf. Proc. Sys. (2018) 8580–8589.
[2] R.F.W. Bader, C. Gatti, Chem. Phys. Lett. 287 (1998) 233–238.
[3] C. Gatti, L. Bertini, Acta Crystallogr. A 60 (2004) 438–449.
[4] A. Vaswani, et al., Proc. 31st Int. Conf. Neural Inf. Proc. Sys. (NIPS 2017) 6000–6010. arXiv:1706.03762v5.
N. Sukumar, The Source Function as Attention Kernel, Sagamore XX Conference on Quantum Crystallography, Shiv Nadar Institution of Eminence, India, November 10-15, 2024.