The Cheminformatics Network Blog

Cheminformatics, Bioinformatics, Systems Biology, Network Theory, Drug Design, Computational Chemistry and Computational Biology

Monday, October 07, 2024

Navigating Molecular Networks

I am delighted to announce my forthcoming book Navigating Molecular Networks: Exploring the Chemical Space Concept in Novel Materials Design as part of the SpringerBriefs in Materials book series. The book
  1. Caters to a diverse audience encompassing students and researchers in physics, chemistry, and materials science;
  2. Incorporates a multipronged approach spanning from vector space analysis to random matrix theory;
  3. Explores graph and deep learning applications in molecular and materials design.
This book delves into the foundational principles governing the treatment of molecular networks and "chemical space" (the comprehensive domain encompassing all physically achievable molecules) from the perspectives of vector space, graph theory, and data science. It explores similarity kernels, network measures, spectral graph theory, and random matrix theory, weaving intriguing connections between these diverse subjects. Notably, it emphasizes the visualization of molecular networks. The exploration continues with contemporary generative deep learning models, increasingly pivotal in the pursuit of new materials possessing specific properties, and showcases some of the most compelling advancements in this field. The book concludes with a discussion of the meanings of discovery, creativity, and the role of artificial intelligence (AI) therein.

Its primary audience comprises senior undergraduate and graduate students specializing in physics, chemistry, and materials science. Additionally, it caters to those interested in the potential transformation of material discovery through computational, network, AI, and machine learning (ML) methodologies.

Softcover ISBN: 978-3-031-76289-5
eBook ISBN: 978-3-031-76290-1

Table of contents
Chapter 1: Molecular networks

  1. Why Molecular Networks? Graphs and Simplices
  2. Matrix representations of Weighted and Unweighted Networks
  3. Matrix representations of Directed and Undirected Graphs
  4. Unipartite and Bipartite Networks
  5. Coordinate and Graph representations of Chemical Space
  6. Feature Networks
Chapter 2: Transformations of Chemical space
  1. Vector spaces and Metric Tensors
  2. Dimensionality reduction
  3. Similarity Kernels and Kernel methods
Chapter 3: Spectral Graph Theory
  1. Network measures
  2. Eigenvalues of the Adjacency Matrix
  3. Eigenvalues of the Laplacian Matrix
  4. Graph Centrality measures
  5. Graph Curvature
  6. Eigenvectors of the modularity matrix
Chapter 4: Universality and Random Matrix theory
  1. Eigenvalue correlations
  2. RMT for Chemical Reaction Networks
  3. RMT for Feature Networks
Chapter 5: Mapping and Navigating Chemical Space Networks
  1. k-NN and k-Means
  2. Visualizing Chemical Space Networks
  3. Model Applicability Domain and Scaffold Hopping
  4. Violation of the Similarity principle - Activity Cliffs
Chapter 6: Generative AI – Growing the Network
  1. Genetic Algorithms
  2. Back propagation and Variational Auto Encoders
  3. Graph Convolutional Networks
  4. Generative Adversarial Networks and Reinforcement Learning
  5. Transformers and Generative Language Models
  6. Why does Over-parametrization work?
  7. Infinitely wide networks and Neural tangent kernels
  8. Extensions and Future Directions
Chapter 7: Discovery and Creativity
Glossary of terms

Thursday, August 01, 2024

Causal versus correlative models

I was recently watching a debate between a professor and a journalist on the upcoming US presidential elections. The professor made some predictions based on a model he had developed. The journalist asked whether it was a causal model. The professor replied that it was a correlative and predictive model that had never failed in the past, but that it was not a causal model. The journalist asked whether his predictions would still hold under such-and-such a scenario, to which the professor replied that he wouldn't respond to silly hypotheticals. But he stuck to his model predictions.

This nicely illustrates the problems of training set bias and the model applicability domain: namely, when a model can be trusted to make prospective predictions. As long as the new data are similar to the training data, they fall within the applicability domain of the model, and one can have confidence in the model predictions: statistically, their accuracy can be expected to be similar to that of previous predictions or the reported test set predictions. But as soon as one wanders substantially outside the applicability domain of the original model, all bets are off.

I think this is something that many who develop and use generative models seem to have lost sight of. Generative models are designed to produce novel data, and they are often used to generate data very different from the training data. This puts them well outside the applicability domain of the original model. Generating new data is relatively easy; generating data that are useful for specific applications is a different matter altogether. The solution is, of course, to keep testing the model on the newly generated data against the "ground truth", and to develop new models that expand the applicability domain as you go along. But this requires more work than just lazily generating new data with a generative model and hoping for the best.

The situation is different if we have a causal model. By a causal model, we mean that we understand the underlying processes, and therefore understand when conditions have changed such that the original model may no longer be applicable, and what must be done to extend the model to new domains.

Originally posted on LinkedIn
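To make the idea of an applicability domain concrete, here is a minimal sketch of one common distance-based check. It is only an illustration under stated assumptions: the function names (`fit_domain`, `in_domain`), the use of mean k-nearest-neighbour Euclidean distance, and the "mean + z standard deviations" threshold are choices made for this example, not a prescription from the post. The toy points stand in for training-set descriptors.

```python
import math
import statistics

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_distance(point, others, k):
    """Mean distance from `point` to its k nearest neighbours in `others`."""
    dists = sorted(euclidean(point, o) for o in others)
    return sum(dists[:k]) / k

def fit_domain(train, k=3, z=2.0):
    """Threshold = mean + z * stdev of leave-one-out k-NN distances within the training set."""
    loo = [knn_distance(p, train[:i] + train[i + 1:], k) for i, p in enumerate(train)]
    return statistics.mean(loo) + z * statistics.pstdev(loo)

def in_domain(query, train, threshold, k=3):
    """A query is 'inside' the domain if it is as close to the training data as they are to each other."""
    return knn_distance(query, train, k) <= threshold

# Hypothetical two-descriptor training set (illustrative values only).
train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
threshold = fit_domain(train, k=2)

near = in_domain((0.4, 0.6), train, threshold, k=2)  # resembles the training data
far = in_domain((5.0, 5.0), train, threshold, k=2)   # far from anything seen in training
print(near, far)
```

A generated sample flagged as outside the domain is not necessarily wrong; it is simply one for which the model's past track record says little, which is where testing against the ground truth becomes essential.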

Sunday, July 07, 2024

Our chapter on Polymer and Nanocomposite Informatics.

by Neelesh Ashok, K. P. Soman, Madhav Samanta, M. S. Sruthi, Prabaharan Poornachandran, Suja Devi V. G & N. Sukumar
Artificial intelligence (AI) and machine learning (ML) have a variety of practical technological applications and are now impacting how we live and work in myriad ways. Polymer informatics, which employs AI and ML to aid in the development, design, and discovery of polymers, is a fast-expanding field. Models trained on polymer data available in databases make it possible to rapidly predict a range of polymer properties, and to screen prospective polymer candidates for desirable characteristics. AI and ML are also used to predict the ease of synthesis of a target polymer and plan its (retro)synthetic steps. Data-driven techniques are employed to develop machine-understandable polymer representations, and to handle the enormous chemical and physical variability of polymers at multiple scales. Cutting-edge generative AI methods are now being employed for inverse design of polymers with specific properties that make them attractive for various applications. These advances in polymer informatics hold out promise for improved efficiency, accelerated development, and greater ease of manufacture of a new generation of polymers useful to society. This chapter provides an overview of recent developments in AI-aided, data-driven polymer chemistry and polymer nanocomposites, and presents a few recent case studies highlighting the scope and diversity of recent applications of AI techniques in the polymer design process, as well as the state of the art and challenges in polymer informatics.
Polymer and Nanocomposite Informatics: Recent Applications of Artificial Intelligence and Data Repositories

Computational Drug Discovery: A Primer

By N. Sukumar, Harishchander Anandaram and Pratiti Bhadra
(Ion Cure Press, 2023)

This book presents a concise yet thorough introduction to the process of computational drug discovery, including how machine learning techniques are increasingly used in the design of new drugs. It provides a balanced coverage of chemical space, biological space and computational modeling aspects.

Foreword
Over the last two decades, the field of drug discovery has undergone remarkable transformations due to the increasing availability of chemical and biological data, as well as the powerful computational tools to analyze and model this data. In "Computational Drug Discovery: A Primer", Professor Sukumar and his co-authors provide a comprehensive account of key techniques and methodologies necessary to drive innovation in the area. This book covers the topics of cheminformatics and machine learning, as applied to the design and discovery of drugs.
This book presents a concise yet thorough introduction to the process of computational drug discovery, including how machine learning techniques are increasingly used in the design of new drugs. The book begins by delving into the basic principles of molecular modeling techniques and highlighting the importance of extracting different types of domain information from molecular structures. One of the book's major strengths is its excellent focus on chemical space networks and biological networks, including metabolic networks, gene regulatory networks, and signal transduction networks.
After providing enough background information, the book then shifts its focus to predictive modeling and how to map structural information to activities and properties. It provides exhaustive coverage of different cheminformatics approaches. The book also provides an overview of various conventional data mining and statistical techniques, including various linear and nonlinear learning algorithms and techniques, along with optimization techniques such as Genetic algorithms. Most importantly the book provides an overview of artificial neural networks and deep learning, with lucid details about different DNN architectures useful in drug discovery. The authors further provide invaluable tips for making robust and reliable predictions in drug discovery. Among these tips is the importance of choosing a model that strikes a balance between predictive ability and interpretability. While deep neural network architectures are useful for shortlisting drug-like molecules, the authors emphasize the need to develop physics-informed 3D-based models in order to meet the urgent need for more accurate predictions.
To sum up, this book provides a balanced coverage of chemical space, biological space and computational modeling aspects. With his extensive experience in teaching and research in this area, Professor Sukumar, along with his co-authors, has commendably presented complex concepts in a readily understandable manner. Overall, this book is a must-have resource for students, academicians, teachers, researchers and practitioners interested in the cutting-edge field of computational drug discovery.
Jayaraman K. Valadi
Distinguished Professor
FLAME University, Pune, India

Preface
This book is the outcome of undergraduate and postgraduate courses I (NS) developed and taught at Shiv Nadar University, Dadri. It is aimed at an audience with some knowledge of chemistry and mathematics, but unfamiliar with cheminformatics, machine learning or its applications in drug discovery. I am thankful to the Institute of Mathematical Sciences, Chennai, for hosting me in September-November 2022, during which time most of the actual writing of the book was completed. I am deeply indebted to all my former colleagues at the Rensselaer Polytechnic Institute in Troy, NY, especially Professors Curt Breneman, Kristin Bennett, and Mark Embrechts, to Drs. Michael P. Krein and Saurav Das, and to Professor Valadi Jayaraman, for many valuable insights that have gone into this book. Thanks are also due to Professor Areejit Samal and Drs. Vinith Rejathalal, Sanjanashree Palanivel and Navaneeth Haridasan, for helpful discussions. I also owe thanks to the many students, especially Drs. Ganesh Prabhu, Pinaki Saha, Vivek Ananth and Sagar Bhayye, and to Manuja Kothiyal, Rudra Agarwal, Ritwik Bhattacharya, Ananya Biswas, Raman Dutt, Gunjan Gupta, Sanjana Krishnamani, Vivek Krishnan, Sanjana Maheshwari, Aniket Mishra, Garvisha Mittal, and Madhav Samanta, who challenged me with numerous interesting questions over the years. All the authors thank Prof. K. P. Soman, Amrita Vishwa Vidyapeetham, Coimbatore, for his encouragement.
This book does not assume any prior knowledge of medicinal chemistry, computational chemistry, or drug design. It should be equally accessible to students of chemistry, physics, biology and bioinformatics, as well as computer science and data science. The subject matter encompasses the fields of cheminformatics as well as machine learning, applied to the design and discovery of small molecule drugs. It is hoped that this book will provide a brief introduction to readers who want to get an idea about the process of computational drug discovery, and how machine learning techniques are being increasingly used in the design of new drugs. References to the original literature and in-depth reviews are provided at the end for readers interested in delving deeper into the subject matter.

Contents
1. Drug Discovery in the Information-rich age
1.1. Why Computational Drug Discovery? The Drug Discovery pipeline
1.2. ADMET Screening
1.3. Lipinski’s Rules of 5
1.4. Chemical Space
1.5. Drug Delivery across the Blood Brain Barrier
1.6. Structure-Based and Ligand-Based Drug Design
1.7. Pattern recognition and Machine Learning
2. Representation of Chemical Structure and Similarity
2.1. Topological Indices
2.2. Substructural Descriptors and 2D fingerprints
2.3. 3D descriptors
2.4. Local Molecular Surface Property Descriptors
2.5. Shape descriptors
2.6. Chiral descriptors
2.7. Molecular Similarity Measures
3. Chemical and Biological Networks
3.1. Chemical Space Networks
3.2. Biological Networks in Biomarker Discovery
3.3. Metabolic Networks
3.4. Gene Regulatory Networks
3.5. Protein-Protein Interaction Networks
3.6. Signal Transduction Networks
3.7. Analysis of Biological Network-based Biomarker Discovery
3.8. Artificial Intelligence and Biological Networks-based Biomarker Discovery
3.9. Summary, Challenges and Future Prospects
4. Mapping Structure to Activity: Predictive Modeling
4.1. Linear Free Energy Relationships
4.2. Pharmacophores and Molecular Interaction Fields
4.3. Model Domain of Applicability
4.4. Activity Cliffs
4.5. Performance Measures in Classification and Regression
4.6. Model Validation
4.7. Structure Based Methods - Docking and Scoring
4.8. Molecular Dynamics Simulation in Computational Drug Discovery
5. Data Mining and Statistical Methods
5.1. Linear and Non-Linear Models
5.2. Data preprocessing and unbalanced datasets
5.3. Principal Component Analysis and Partial Least-Squares Regression
5.4. Feature selection
5.5. Evolutionary computing and Genetic Algorithms
5.6. K-Means and k-NN
5.7. Classification trees and Random forests
5.8. Support Vector Machines classification and regression
6. Artificial Neural Networks and Deep Learning
6.1. Self-Organizing Maps
6.2. Multi-Layer Perceptrons
6.3. Deep Neural Networks and Auto-Encoders
6.4. Convolutional Neural Networks
6.5. Generative Adversarial Networks
6.6. Reinforcement Learning
6.7. Transfer Learning
6.8. Recurrent Neural Networks and Transformers
7. Best Practices in Predictive Cheminformatics

Sunday, August 08, 2021

Networking Protein-Ligand Binding Sites

Sagar Bhayye, N. Sukumar
International Conference on Drug Discovery (ICDD) 2020
BITS Hyderabad, February 2020

Proteins are complex macromolecules that play a critical role in various body functions. Many functional proteins require small molecules (also known as ligands) to initiate biochemical processes. The affinity of a ligand towards a specific protein is attributed to physicochemical properties of the protein-ligand binding site such as size, shape, surface charge distribution, etc. Because of this, a single ligand can bind to two or more structurally and functionally different proteins, leading to a decrease in the potency of drug molecules due to increased distribution, and to adverse drug reactions due to non-specific ligand binding to various proteins. In this study, a set of 4105 protein crystal structures belonging to different classes and species was used to construct a similarity network based on protein-ligand binding site properties. The Property-Encoded Shape Distributions (PESD) method was utilized to calculate protein-ligand binding site signatures and hence pair-wise similarities between ligand binding sites of the selected set of proteins. Metrics such as Euclidean, Chi-Square and Manhattan distances were used to calculate similarities between ligand binding sites of proteins, leading to a quantitative understanding of their similarity relationships. Adjacency matrices calculated using these three different metrics were then used for construction of similarity networks. The networks were analyzed for properties such as vertex degree, average path length, degree distribution, average clustering coefficient, degree assortativity, modularity and different centrality measures. Properties of the three similarity networks were compared with those of the Erdős-Rényi random network. The Euclidean network shows a higher average clustering coefficient and a lower average path length than the Erdős-Rényi random network, indicating small-world behavior.
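The workflow described in the abstract (signatures, pairwise distances, thresholded adjacency, network measures, comparison with a random graph) can be sketched in a few lines. This is a toy illustration, not the code used in the study: the signatures below are random histograms standing in for PESD signatures, the distance cutoff is arbitrary, the chi-square distance is written in one common form, and the random reference is a G(n, m) graph with the same number of nodes and edges.

```python
import itertools
import math
import random
from collections import deque

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def chi_square(p, q):
    # Bins that are empty in both signatures contribute nothing.
    return sum((a - b) ** 2 / (a + b) for a, b in zip(p, q) if a + b > 0)

def threshold_network(signatures, metric, cutoff):
    """Adjacency sets: connect two sites when their distance is below `cutoff`."""
    adj = {i: set() for i in range(len(signatures))}
    for i, j in itertools.combinations(range(len(signatures)), 2):
        if metric(signatures[i], signatures[j]) < cutoff:
            adj[i].add(j)
            adj[j].add(i)
    return adj

def average_clustering(adj):
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue  # coefficient taken as 0 for nodes of degree < 2
        links = sum(1 for u, v in itertools.combinations(nbrs, 2) if v in adj[u])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)

def average_path_length(adj):
    """Mean shortest-path length over reachable ordered pairs (BFS from every node)."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else float("inf")

def random_reference(adj, rng):
    """Random G(n, m) graph with the same number of nodes and edges as `adj`."""
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2
    rand = {i: set() for i in range(n)}
    for i, j in rng.sample(list(itertools.combinations(range(n), 2)), m):
        rand[i].add(j)
        rand[j].add(i)
    return rand

rng = random.Random(0)
# Toy stand-ins for binding-site signatures: two families of similar histograms.
centres = [[rng.random() for _ in range(8)] for _ in range(2)]
signatures = [[max(0.0, c + rng.gauss(0, 0.05)) for c in centres[i % 2]] for i in range(40)]

net = threshold_network(signatures, euclidean, cutoff=0.25)
rand = random_reference(net, rng)
for name, g in (("similarity", net), ("random", rand)):
    print(name, round(average_clustering(g), 3), round(average_path_length(g), 3))
```

Swapping `euclidean` for `manhattan` or `chi_square` (with a suitably rescaled cutoff) gives the other two networks. Note that the path length here is averaged only over pairs that are actually connected, so it should be read with care when the thresholded network splits into several components.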

 


Wednesday, July 11, 2018

What does Electron Density Analysis tell us about Bonding in Transition Metal-doped Boron and Carbon Clusters?

Sagamore XIX Conference on Quantum Crystallography
Halifax, Canada, July 11, 2018

Video link: https://youtu.be/qxu5uALd6Xs


N. Sukumar¹, Pinaki Saha², Amol B. Rahane³, Rudra Agarwal⁴, Vijay Kumar⁵

¹ Department of Chemistry and Center for Informatics, Shiv Nadar University, Dadri, India – n.sukumar@snu.edu.in
² Department of Chemistry, Shiv Nadar University, Dadri, India – ps630@snu.edu.in
³ Dr. Vijay Kumar Foundation, Gurgaon, India – amol_rahane2000@yahoo.com
⁴ Department of Chemistry, Shiv Nadar University, Dadri, India – ra298@snu.edu.in
⁵ Dr. Vijay Kumar Foundation, Gurgaon & Center for Informatics, Shiv Nadar University, Dadri, India – vijay.kumar@snu.edu.in

Keywords: electron density analysis, boron clusters, carbon clusters, electron delocalization, structural stability

ABSTRACT


Although the nature of the chemical bond is at the heart of chemistry, chemists often work with several distinct conceptions of the chemical bond, which are not necessarily compatible with each other. The Lewis concept of the electron pair bond [1] is now over a century old, predating the quantum mechanical theory of bonding in molecules. We now recognize electron pairing to be a consequence of the Pauli exclusion principle and the associated Fermi hole. The traditional Lewis electron pair bond concept has been extended to admit the possibility of 3-center, 2-electron bonds in “electron deficient” boranes, and subsequently further extended, using AdNDP analysis [2] (an extension of natural bond orbital (NBO) analysis), to include n-center (but always 2-electron) objects (with n arbitrarily large). An alternative to such orbital treatments is provided by examination of topological features of the electron density, such as bond paths (gradient paths of the electron density) connecting pairs of nuclei. Such bond paths are not associated with a fixed electron count. However, as has been pointed out by several authors [3,4], the mere existence of a bond path between a pair of nuclei does not signify the existence of a chemical bond between them or indicate the strength of the interaction. Double integration of the Fermi hole density over spatial regions provides a valid measure of electron localization and delocalization [5]. One can also conceive of the chemical bond as a force that holds a pair of atoms together, quantified by the dissociation energy required to break the bond. While this works well for simple diatomics, the correlation between dissociation energy and electron count or the electron density between a pair of nuclei is not straightforward for open shell systems or polyatomic molecules.
The divergence between these different conceptions of the chemical bond is particularly dramatic for “electron deficient” boron compounds and for metallic nanoclusters, where extensive electron delocalization and multi-center bonding are prevalent. Nevertheless, combining information from topological features of the electron density with orbital-based models allows meaningful chemical conclusions about bonding to be drawn, even for unusual molecular systems.
Here we have analyzed trends in bonding and stability for several clusters including ring-shaped clusters for boron and carbon as well as drum-shaped and fullerene-like clusters of boron, from computed ab initio electron density distributions, and investigated the effects of transition metal (TM) doping on their structural and physical properties. Analysis of the electron density at bond and ring critical points, the Laplacian of the electron density, the electron localization function [6], the source function [7], and localization-delocalization indices, all indicate the coexistence of covalent bonds and delocalized charge distribution in boron clusters [8]. Rings of carbon atoms too seem to be stabilized by metal coordination for selected sizes and electron counts. For drum-shaped M@B14 (M = a 3d TM atom) and M@B16 (M = 3d, 4d, and 5d TM atom) clusters, our results suggest two- and three-center σ bonding within and between two B7/B8 rings, respectively, and hybridization between the TM d orbitals and the π bonded molecular orbitals of the drum. Assembly of Co@B14 clusters has been shown to stabilize a metallic Co atomic nanowire within a boron nanotube [9].
We have also studied metal atom encapsulated fullerene-like boron cage structures and shown that Cr@B20 is the smallest cage for Cr encapsulation, while B22 is the smallest symmetric cage for Mo and W encapsulation. Electron density and molecular orbital analysis suggests that Cr@B18, Cr@B20, M@B22 (M = Cr, Mo, and W) and M@B24 (M = Mo and W) cages are stabilized by 18 π-bonded valence electrons, whereas the drum-shaped M@B18 (M = Mo and W) clusters are stabilized by 20 π-bonded valence electrons [10]. We have also studied larger boron clusters in the size range 68-74 and shown that the global minimum structure for B70 is a tubular structure, which is nearly degenerate with a quasi-planar structure having three hexagonal vacancies [10]. Analysis of a large number of atomic clusters, of various shapes and sizes, indicates a broad parallelism between different measures of bonding and localization in these clusters.

Fig. 1 Electrostatic potential of (a) Cr@B22 and (b) tubular B70 cluster mapped onto a ρ(r) = 0.1 e/Bohr³ electron density isosurface. Blue regions indicate negative electrostatic potentials associated with the boron atoms. (c) Contour plot of the Laplacian of Cr@B22 cluster in a plane passing through atoms B7, B8, B15, B16, B21, B22, and Cr. Solid (dashed blue) contours indicate positive (negative) values of L = -∇²ρ(r).

Acknowledgements


The authors gratefully acknowledge use of the high-performance computing facility Magus of Shiv Nadar University. ABR and VK thankfully acknowledge financial support from International Technology Center - Pacific. We thank Prof. Cherif Matta for providing access to AIMLDM software.

References


[1] G. N. Lewis, J. Am. Chem. Soc. 38, 762 (1916).
[2] D. Y. Zubarev, A. I. Boldyrev, Phys. Chem. Chem. Phys. 10, 5207 (2008); A. P. Sergeeva, D. Y. Zubarev, H.-J. Zhai, A. I. Boldyrev, L. S. Wang, J. Am. Chem. Soc. 130, 7244 (2008); W. Huang, A. P. Sergeeva, H.-J. Zhai, B. B. Averkiev, L. S. Wang, A. I. Boldyrev, Nat. Chem. 2, 202 (2010).
[3] R. F. W. Bader, Atoms in Molecules: A Quantum Theory, Oxford Press, Oxford (1990).
[4] S. Shahbazian, Chem. Eur. J. (2018) doi:10.1002/chem.201705163
[5] C. F. Matta, J. Comput. Chem. 35, 1165 (2014); M. J. Timm, C. F. Matta, L. Massa, L. Huang, J. Phys. Chem. A, 118, 11304 (2014); C. F. Matta, I. Sumar, R. Cook, P. W. Ayers, In: Applications of Topological Methods in Molecular Chemistry (Challenges and Advances in Computational Chemistry and Physics Series); Chauvin, R.; Silvi, B.; Alikhani, E.; Lepetit, C. (Eds.), Springer (2015); I. Sumar, R. Cook, P. W. Ayers, C. F. Matta, Comput. Theor. Chem. 1070, 55-67 (2015); I. Sumar, R. Cook, P. W. Ayers, C. F. Matta, Phys. Script. 91, 013001 (2016).
[6] A. D. Becke, K. E. Edgecombe, J. Chem. Phys. 92, 5397 (1990).
[7] R. F. W. Bader, C. Gatti, Chem. Phys. Lett. 287, 233 (1998); C. Gatti, L. Bertini, Acta Cryst. A, 60, 438 (2004); C. Gatti, F. Cargnoni, L. Bertini, J. Comput. Chem. 24, 422 (2003).
[8] P. Saha, A. B. Rahane, V. Kumar, N. Sukumar, Phys. Script. 91, 053005 (2016).
[9] P. Saha, A. B. Rahane, V. Kumar, N. Sukumar, J. Phys. Chem. C, 121, 10728 (2017).
[10] P. Saha, Ph.D. thesis, Shiv Nadar University, India (2018); A.B. Rahane, P. Saha, N. Sukumar, and V. Kumar, to be published.



 

Friday, February 02, 2018

Application of electron density-based analysis in the study of nanoclusters and biomolecular interactions

Related publications
  1. Pinaki Saha, Amol B. Rahane, Vijay Kumar, N. Sukumar, Analysis of the electron density features of small boron ring clusters and the effects of doping, Physica Scripta 91, 053005 (2016). DOI: 10.1088/0031-8949/91/5/053005 IF: 1.126
  2. Pinaki Saha, Amol B. Rahane, Vijay Kumar, and N. Sukumar, Electronic Origin of the Stability of Transition Metal Doped B14 Drum Shaped Boron Clusters and Their Assembly into a Nanotube. J. Phys. Chem. C, 121(20), 10728–10742 (2017). DOI: 10.1021/acs.jpcc.6b10838 IF: 4.536
  3. Suman Kumar Mandal, Pinaki Saha, Parthapratim Munshi, N. Sukumar, Exploring Potent Ligand for Proteins: Insights from Knowledge-based Scoring Functions and Molecular Interaction Energies, Struct. Chem. 28(5), 1537-1552 (2017). DOI: 10.1007/s11224-017-1007-y
  4. Amol B. Rahane, Pinaki Saha, N. Sukumar and Vijay Kumar, Smallest Fullerene-like Structures of Boron with Cr, Mo, and W Encapsulation (manuscript under review).

Thursday, June 29, 2017

PhD Defense of Mr. Ganesh Prabhu, Department of Chemistry, Shiv Nadar University




DATE: Monday, July 3, 2017

TIME: 1:30 PM – 2:30 PM

LOCATION: D-217 (Seminar Hall -2)

TITLE: Diversity-Oriented Synthesis of A Molecular Library, Analysis of Molecular Diversity Through Network/Graph Measures and Correlation with Biological Activity Profile(s)

ADVISORS: Dr. N Sukumar and Dr. Subhabrata Sen
ABSTRACT: In this thesis, we examined the concepts of molecular similarity / dissimilarity and the growth of Diversity Oriented Synthesis (DOS) as an alternative to combinatorial chemistry. A brief description of the procedures for the synthesis of hybrid compounds using DOS via “Platform technology” is followed by detailed experimental analysis for synthesizing natural-product-inspired hybrids using the pyrroloisoquinoline scaffold as a platform. The hybrids were screened against several phenotypes for cytotoxicity, antiplasmodial activity and anti-malarial activity. A cheminformatics study was undertaken to compare the physico-chemical properties of the synthesized hybrids against similar commercial drugs. We also studied the dissimilarity / similarity features of the chemical libraries through an analysis of the topological properties of threshold Chemical Space Networks (CSNs). During the process, we developed QuaLDI (Quantitative Library Diversity Index), a simple measure for quantifying diversity in DOS, Focussed and PubChem Libraries. The effectiveness of QuaLDI was evaluated by comparing its results with those of other diversity measures. We used a correlation-matrix-guided approach, combined with the QuaLDI measure, to select minimally correlated descriptors for QSAR modelling.
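As a rough illustration of what correlation-guided descriptor selection can look like, here is a minimal sketch. It is not the procedure from the thesis: it uses a simple greedy filter on the Pearson correlation matrix, it does not include QuaLDI (whose definition is not given in the abstract), and the descriptor names, values and the 0.8 cutoff are made up for the example.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either column is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def select_descriptors(columns, max_abs_r=0.8):
    """Greedy filter: keep a descriptor only if |r| with every kept one is below the cutoff."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < max_abs_r for k in kept):
            kept.append(name)
    return kept

# Hypothetical descriptor table for five molecules (illustrative values only).
descriptors = {
    "mol_weight": [180.2, 206.3, 151.2, 294.3, 318.4],
    "heavy_atoms": [13, 15, 11, 21, 23],
    "logp": [1.2, 3.5, 0.5, 2.1, 0.9],
}
selected = select_descriptors(descriptors)
print(selected)
```

The result of a greedy filter depends on the order in which descriptors are visited, so in practice one would usually rank them first, for example by relevance to the activity being modelled.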