The Cheminformatics Network Blog

Cheminformatics, Bioinformatics, Systems Biology, Network Theory, Drug Design, Computational Chemistry and Computational Biology

Sunday, January 15, 2006

Comparative Evaluation of Prediction Algorithms

CoEPrA (Comparative Evaluation of Prediction Algorithms) is a modeling competition organized to provide objective testing of various classification and regression algorithms via the process of blind prediction. The problems proposed in the CoEPrA experiment are selected from cheminformatics, drug design, QSAR, bioinformatics, computational biology, medicine, toxicology, microarray gene expression data, and proteomics. For details, see http://www.coepra.org/

Thursday, January 12, 2006

CoLIBRI

The CoLIBRI paper (co-authored with Alex Tropsha's group) appeared yesterday in the web edition of J. Chem. Inf. Model.:

Chemometric Analysis of Ligand Receptor Complementarity: Identifying Complementary Ligands Based on Receptor Information (CoLiBRI)

Scott Oloff, Shuxing Zhang, Nagamani Sukumar, Curt Breneman and Alexander Tropsha, J. Chem. Inf. Model., ASAP Article Web Release Date: January 11, 2006 Copyright © 2006 American Chemical Society

Monday, January 09, 2006

Notes from the Gaussian Users' Meeting

The day before the start of Pacifichem (December 14), an all-day Gaussian Users' Meeting was held in Honolulu. Mike Frisch provided a brief introduction. Gary Trucks gave a comprehensive analysis of the performance of several DFT functionals. In his experience, of the pure functionals, HCTH is the best for geometries, while O3LYP is the best among the hybrid functionals (X3LYP is really no improvement over B3LYP, although it was designed to be). MP2 has problems; for instance, it predicts that the ground state of benzene is non-planar! For frequencies, HCTH and tauHCTH are the best pure functionals, while O3LYP and tauHCTH are the best hybrid functionals -- errors vary systematically. Again X3LYP is no improvement over B3LYP. Thus the best choice for geometry and frequency is O3LYP, with HCTH second. Overall BMK is bad; it sacrifices geometries and frequencies. B3P86, PBE/PBE and B3LYP are good general purpose functionals. For electronic excitations, B3P86 is the best general purpose functional; PBE/PBE is sometimes ok. Overall B3P86 is comparable to anything in the last 15 years. In summary, O3LYP is good for geometries and frequencies; HCTH is the best pure functional. B3P86 is better if electronic excitations are important.

The next talk was on PCModel: Saunders, Houk, Wu, Still, Lipton, Cheng & Guida, JACS 112, 1419-1427 (1990). Grid methods suffer from combinatorial explosion, but guarantee a solution. Stochastic methods, on the other hand, give up the guarantee of a solution in favour of speed.
  • Cartesian search works well if there are lots of rings and constraints; searches local space well;
  • Dihedral search works better on long, saturated chains; searches global space.
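
The grid versus stochastic trade-off can be illustrated with a minimal sketch. It is not PCModel's algorithm: the threefold cosine `toy_energy` and both search functions are hypothetical stand-ins, chosen only to show how the grid cost grows as (360/step)^n while random sampling has a fixed cost but no guarantee.

```python
import itertools
import math
import random

def toy_energy(dihedrals):
    """Hypothetical torsional energy: one threefold cosine term per dihedral (degrees)."""
    return sum(1.0 + math.cos(math.radians(3.0 * phi)) for phi in dihedrals)

def grid_search(n_dihedrals, step_deg):
    """Enumerate every grid point; the number of points is (360/step_deg)**n_dihedrals."""
    angles = range(0, 360, step_deg)
    best = min(itertools.product(angles, repeat=n_dihedrals), key=toy_energy)
    n_points = len(angles) ** n_dihedrals
    return best, toy_energy(best), n_points

def stochastic_search(n_dihedrals, n_trials, seed=0):
    """Sample random dihedral sets; fixed cost, but the minimum may be missed."""
    rng = random.Random(seed)
    best, best_e = None, float("inf")
    for _ in range(n_trials):
        trial = [rng.uniform(0.0, 360.0) for _ in range(n_dihedrals)]
        e = toy_energy(trial)
        if e < best_e:
            best, best_e = trial, e
    return best, best_e

# Three rotatable bonds on a 30-degree grid already need 12**3 evaluations.
grid_best, grid_e, grid_points = grid_search(3, 30)
stoch_best, stoch_e = stochastic_search(3, 200)
```

Adding one more dihedral multiplies the grid cost by 12, whereas the stochastic search still evaluates only `n_trials` geometries.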

Rina Dukor (BioTools) then spoke on Chiro-optic properties. She pointed out that 9 of the top-10 drugs have chiral active ingredients, e.g.:
  • Warfarin -- both enantiomers are anticoagulants;
  • Propranolol -- S enantiomer is a beta-blocker, R is not;
  • Sotalol -- enantiomers have different activities;
  • Levodopa (L-DOPA) -- one enantiomer is a drug, the other is toxic!
IR spectra of enantiomers are identical, but their VCD spectra are opposite in sign; CD is the difference in absorbance of left and right circularly polarized light. VCD is an alternative to XRD for determination of absolute configuration. g03 can now compute VCD, CD (electronic), ROA, ORD. A free online database and a searchable (by CAS#) literature collection on VCD spectra can be found at btools.com

Bernie Schlegel spoke on AIMD. BOMD converges the wavefunction at each step and propagates the nuclei. Car-Parrinello propagates the wavefunction along with the geometry using an extended Lagrangian. The Ehrenfest method propagates the wavefunction using TD HF/DFT and propagates the nuclei using classical trajectories. In CP:
  • orbitals are expanded in plane waves,
  • plane wave coefficients being propagated with an extended Lagrangian;
  • FFT is used for propagation;
  • gradients require only Hellmann-Feynman terms.
In ADMP (Atom-centered Density Matrix Propagation):
  • orbitals are expanded in atom-centered basis,
  • far fewer basis functions being needed than plane waves;
  • Density Matrix propagation using an extended Lagrangian;
  • FFT is used for propagation;
  • gradients require Hellmann-Feynman and Pulay terms.
Euler-Lagrange equations of motion are used for the nuclei and for the Density Matrix (with a fictitious mass for components of the Density Matrix); they are integrated using the velocity Verlet algorithm.
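
As a reminder of what the velocity Verlet integrator does, here is a minimal sketch for a single classical degree of freedom. It is not ADMP itself: the harmonic force, unit mass and time step are assumptions chosen for illustration, and in ADMP the same scheme would be applied to the nuclei and, with a fictitious mass, to the density matrix elements.

```python
import math

def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Propagate position x and velocity v with the velocity Verlet scheme."""
    a = force(x) / mass
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * a      # first half-step velocity update
        x = x + dt * v_half            # full-step position update
        a = force(x) / mass            # acceleration at the new position
        v = v_half + 0.5 * dt * a      # second half-step velocity update
    return x, v

def harmonic_force(x, k=1.0):
    """Hypothetical test force: a harmonic spring with force constant k."""
    return -k * x

# Start at x = 1, v = 0 and integrate to t = 10 (exact solution: x = cos t).
x_final, v_final = velocity_verlet(1.0, 0.0, harmonic_force, mass=1.0, dt=0.01, n_steps=1000)
energy = 0.5 * v_final ** 2 + 0.5 * x_final ** 2
exact_x = math.cos(10.0)
```

The scheme is time-reversible and keeps the total energy bounded close to its initial value of 0.5, which is why it is a common choice for long trajectories.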

The last speaker was Doug Fox, who gave us the fruits of his years of experience on Geometry Optimization. Cartesians or Z-matrix, either can be used to set up the initial state:
  • Opt=Z-matrix or Opt=ModRedundant;
  • Opt=(TS,CalcFC) for TS Search works if you can start from a good geometry;
  • Opt=(QST2/3) helps bracket the search region;
  • Opt=(QST2,Path=n) is useful if you also want to map the path;
  • Opt=ModRedundant needs input for each structure (or a blank line).
Order of structures: Reactants, Products, TS for QST3
Beware order mismatch for atoms in TS Search! Use GaussView -> Atom Editor -> Connection Editor -> AutoFix
Symmetry is normally handled by inputting coordinates in the right symmetry. This is a good use of Z-matrix, e.g. for getting angles right.
Wrong # negative eigenvalues =>
  • TS Search with 2 or 0 negative eigenvalues;
  • or Minimum with 1 or more negative eigenvalues.
Wrong chemistry => TS which connects different chemistries.
GaussView can be helpful to follow a failed optimization. Check the progress of optimization using Eigenvector Following. The meeting concluded with a brief panel discussion.
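
The "wrong number of negative eigenvalues" check from these notes can be written down as a small sketch. This is not Gaussian's internal test: the function names and the tolerance are hypothetical, and the eigenvalues are assumed to be those of the Hessian with translations and rotations already projected out.

```python
def classify_stationary_point(hessian_eigenvalues, tol=1e-6):
    """Classify a stationary point by counting negative Hessian eigenvalues."""
    n_negative = sum(1 for ev in hessian_eigenvalues if ev < -tol)
    if n_negative == 0:
        return "minimum"
    if n_negative == 1:
        return "transition state"
    return f"higher-order saddle point ({n_negative} negative eigenvalues)"

def check_optimization(target, hessian_eigenvalues):
    """Return (ok, found): whether the stationary point found matches the target type."""
    found = classify_stationary_point(hessian_eigenvalues)
    return found == target, found

# A TS search that ends with two negative eigenvalues has not found a true TS.
ok, found = check_optimization("transition state", [-0.3, -0.1, 0.4])
```

A minimum search that returns "transition state" here corresponds to the second bullet above: a minimum with one or more negative eigenvalues.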

Sunday, January 08, 2006

Notes from Pacifichem

Pacifichem 2005 was held in Honolulu from Dec. 15-20, 2005; it was the second Pacifichem meeting I attended. I gave a presentation on Intelligent data mining for modeling and prediction of protein-ligand, protein-surface and protein-DNA interactions (N. Sukumar; C. M. Breneman; S. A. Cramer; K. P. Bennett; M. Sundling; Q. Luo; D. Zhuang) in the symposium on Complexity and Related Computational Methods in Bioactive Discovery.

There were several talks in this session on/using cellular automata simulations (a topic of personal interest to me), notably by D. Winkler; F. R. Burden; M. Polley (Modelling emergent properties of complex biological and chemical systems), Lamont B. Kier (Theory of ligand passage through hydrodynamic chreodes modeled with cellular automata) and P. Seybold (Cellular automata models of complex dynamic systems).

M. Polley; D. Winkler; F. Burden (Virtual library design of drug-like molecules using evolutionary methods and compound fitness functions) from CSIRO (Australia) mentioned a SMILES mutation & crossover algorithm with on-the-fly QSAR - published in J.Med.Chem. a while ago. I was particularly interested in this work because of our own SMUT algorithm.

B. Testa; G. Vistoli; A. Pedretti; A. Bojarski; M. Nowak (Computational explorations of the property space of biomolecules) spoke on exploring molecular properties encoding recognition forces (e.g., lipophilicity) and their dependence on conformation, using acetylcholine as an example: the property space of molecules is strongly influenced by their molecular environment. Hence the triangular relationship structure<->property<->environment<->structure.

A. Rayan; D. Marcus; O. Givaty; D. Barasch; Amiram Goldblum (Searching for molecular bioactivity in large databases by double focusing) spoke of double focusing with MBI (molecular bioactivity index towards a specific receptor) & DLI (drug likeness index).
As a measure, he used the Matthews correlation coefficient, defined as:
MCC = (TP*TN - FP*FN) / sqrt[(TP+FP)(TN+FN)(TP+FN)(FP+TN)]
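
For reference, a minimal sketch of the Matthews correlation coefficient in its standard form, MCC = (TP*TN - FP*FN) / sqrt[(TP+FP)(TN+FN)(TP+FN)(FP+TN)]. The function name `mcc` and the convention of returning 0 when the denominator vanishes are choices made here, not something taken from the talk.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from the four confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tn + fn) * (tp + fn) * (fp + tn))
    if denominator == 0:
        # Assumed convention: an empty row or column gives MCC = 0.
        return 0.0
    return numerator / denominator

perfect = mcc(tp=50, tn=40, fp=0, fn=0)      # every prediction correct
inverted = mcc(tp=0, tn=0, fp=50, fn=40)     # every prediction wrong
chance = mcc(tp=25, tn=25, fp=25, fn=25)     # no better than random
```

The value runs from -1 (total disagreement) through 0 (random) to +1 (perfect prediction), and unlike plain accuracy it stays informative when actives are rare, as they are in large screening databases.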

V. Gillet; S. J. Cottrell; R. Taylor spoke on Application of multiobjective evolutionary optimisation techniques to pharmacophore identification. A multi-objective genetic algorithm (MOGA) exploits the population nature of evolutionary algorithms to search for multiple objectives in parallel. It uses Pareto ranking (Gasteiger et al) to map the Pareto front for library design, optimizing on diversity & size. Penalizing molecules that fall in crowded regions of the Pareto front (niching) ensures that solutions are well spread out on the Pareto surface. This is a necessary but not sufficient condition for chemical diversity, which has to be explicitly promoted.
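
Pareto ranking and niching can be sketched in a few lines. This is an illustration under stated assumptions, not the speakers' implementation: both objectives are maximized, the rank of a solution is the number of solutions that dominate it (0 means it lies on the Pareto front), and the niche radius is an arbitrary value in objective space.

```python
import math

def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better in at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_ranks(points):
    """Rank each point by how many other points dominate it (0 = non-dominated)."""
    return [sum(dominates(q, p) for q in points) for p in points]

def niche_counts(points, radius):
    """Count neighbours within `radius` in objective space; high counts mark crowded regions."""
    return [
        sum(1 for j, q in enumerate(points) if j != i and math.dist(p, q) < radius)
        for i, p in enumerate(points)
    ]

# Hypothetical two-objective scores for five candidate solutions.
scores = [(1.0, 5.0), (2.0, 4.0), (2.1, 3.9), (1.0, 1.0), (3.0, 1.0)]
ranks = pareto_ranks(scores)
crowding = niche_counts(scores, radius=0.5)
```

A selection step could then favour low rank first and, among equally ranked solutions, penalize those with a high niche count so that the front is sampled evenly.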

B. Levitan spoke on Simulated approaches for screening combinatorial libraries for multiple affinity-based properties using multi-objective iterative selection/amplification assays. This was again of great interest to us due to our ongoing work on SVM-based multi-objective feature selection. They presented simulations on methods inspired by multi-objective optimization to design iterative selection-amplification protocols that can tunably guide the screening results toward prespecified combinations of affinities or toward a population characterizing the tradeoffs between the affinities.

Tudor I. Oprea's talk in this session was on Rapid evaluation of synthetic and molecular complexity for lead discovery. He also had a presentation on Design of fragment based libraries in the symposium on Chemical Biology: Small Chemical Compounds as Magic Bullets to Elucidate Biological Mechanisms. The molecular complexity of Barone & Chanon correlates well with MW.
JCICS 44, 378 (2004); 41, 269-272 (2001); Allu & Oprea, JCIM 45, 1237-1243 (2005)
For higher MW, probability of finding actives increases with complexity. He presented a simple metric that evaluates both synthetic and molecular complexity (SMCM) starting from chemical structures.

M. Charton spoke on Topological parameters v. composition as a function of vertices and edges. His thesis is that topological parameters are not directly related to physical, chemical or biological properties, but are methods of counting structural features.
Polarizability is well accounted for by # of C atoms. Other relevant parameters are steric effects & degree of branching. Topological parameters are not fundamental but composite parameters. Counts of atoms (vertices) & bonds (edges) can be used directly as parameters.
M. Charton, J. Computer Aided Mol. Design 17, 197-209 (2003).
M. Charton and B. I. Charton, J. Phys. Org. Chem. 16, 715-720 (2003).
M. Charton, EuroQSAR 2002 Designing Drugs and Crop Protectants: processes, problems, and solutions, M. Ford, D. Livingstone, J. Dearden and H. Van de Waterbeemd, Eds., Blackwell Publishing Ltd., 2003, p. 122-124.
M. Charton, Proc. 15th Eur. QSAR Conf. 2005, in press.

D. K. Agrafiotis (Self-organizing principle for modeling proximity data) presented a Stochastic parametrization: stochastic proximity embedding (SPE), a self-organizing algorithm for producing meaningful underlying dimensions from proximity data; it generates low-dimensional Euclidean embeddings preserving the similarities between a set of related observations.
Agrafiotis DK, Xu H. PNAS 99, 15869-15872 (2002); JCICS 43, 1186-1191 (2003)
Bioactive conformations are less compact than random ones.
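
The core of stochastic proximity embedding is a simple pairwise update, sketched here under stated assumptions. The function name, the linear learning-rate schedule and the parameter defaults are choices made for this illustration rather than the published settings: pick a random pair, compare its current distance with the target proximity, and move both points along their connecting line to reduce the mismatch.

```python
import math
import random

def spe_embed(target, dim=2, n_steps=5000, lam_start=1.0, lam_end=0.01, seed=0, eps=1e-10):
    """Embed n objects in `dim` dimensions so pairwise distances approach target[i][j]."""
    rng = random.Random(seed)
    n = len(target)
    coords = [[rng.random() for _ in range(dim)] for _ in range(n)]
    for step in range(n_steps):
        # Learning rate decreases linearly from lam_start to lam_end.
        lam = lam_start + (lam_end - lam_start) * step / (n_steps - 1)
        i, j = rng.sample(range(n), 2)
        d = math.dist(coords[i], coords[j])
        # Positive when the pair is too close (push apart), negative when too far.
        scale = lam * 0.5 * (target[i][j] - d) / (d + eps)
        for k in range(dim):
            delta = scale * (coords[i][k] - coords[j][k])
            coords[i][k] += delta
            coords[j][k] -= delta
    return coords

# Hypothetical proximity data: the side lengths of a 3-4-5 triangle.
triangle = [[0.0, 3.0, 4.0],
            [3.0, 0.0, 5.0],
            [4.0, 5.0, 0.0]]
embedded = spe_embed(triangle)
max_error = max(
    abs(math.dist(embedded[i], embedded[j]) - triangle[i][j])
    for i in range(3) for j in range(i + 1, 3)
)
```

Each step touches only one pair, so the full distance matrix never needs to be held in memory or differentiated, which is what makes the approach attractive for large data sets.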

Donald G. Truhlar; A. W. Jasper; S. Nangia; O. Tishchenko; C. Zhu had a talk on New theoretical and computational methods for quantum photochemistry in the symposium on Nonadiabatic Phenomena and Related Dynamics: Theory and Experiment, which I attended in its entirety. Don spoke about Coherent switching with decay of mixing - decoherence, demixing. Decoherence is essential for quantum non-BO dynamics.
Truhlar & Nakamura, JCP 115, 10353 (2001); 117, 5576 (2002)

I also attended the symposia on Theoretical Methods for Prediction of Molecular Properties, and on Computational Quantum Chemistry: Methodology and Application in honour of Leo Radom. D. P. O'Neill; P. Gill (Benchmark correlation energies for small molecules) pointed out two definitions of Correlation Energy:
Lowdin = E(RHF) - E(exact)
Pople = lowest possible energy obtainable from a single determinant - E(exact) ~ E(UHF) - E(exact)

S. Hirata (Best of three worlds: Combined linear, cluster and perturbation expansions for electron correlation) noted that CI handles differential correlation well, while Coupled Cluster handles dynamical correlation well.
EOM-CC = CI + CC
Use perturbation for the remainder of the dynamical correlation.
CCSD(T) does not give a well-defined wave function.
CI + CC + PT ground state energies are size extensive; excitation energies are size intensive.

Martin Head-Gordon reflected on Quantum computers, using a CI algorithm for STO-3G water: M.Head-Gordon, et al, Science 309, 1704 (2005); PRL 83, 5162 (1999)
He noted that # of qubits scales linearly with problem size; while # of gates scales polynomially.

S. Yamanaka; K. Nakata; K. Kusakabe; R. Takeda; H. Nakamura; T. Takada; K. Yamaguchi (Ab initio iterative CASCI-DFT approach for large molecular systems) spoke on MR-DFT: CAS wavefunction theory accounts for non-dynamical correlation completely but incurs enormous computational cost to handle dynamical correlation. DFT handles dynamical correlation well. In Coulomb-driven MR-DFT, the universal functional is modified, whereas in wavefunction-driven MR-DFT, variation of the universal functional is restricted to the space of MR wavefunctions. For excited states, use the Theophilou formulation for subspace densities.

K. Hirao presented his New hybrid DFT functional. He pointed out that DFT predicts first-order properties well, but does badly for response properties due to wrong long-range behaviour of the XC potential.
LC-GGA: %Becke/HF exchange falls off as 1/r12; this ratio is fixed for B3LYP.
LDA gives vdW interaction but overestimates binding & underestimates vdW bond length; GGA doesn't give vdW binding; MP2 also overestimates binding & is very basis set dependent.
Sato, Mol. Phys. 103, 1151 (2005); JCP 123, 10437 (2005)

D. G. Fedorov; T. Ishida; K. Kitaura spoke on Electron correlation and the multilayer approach in the fragment molecular orbital method (FMO), where a molecule is divided into fragments and ab initio fragment monomer and dimer calculations are then performed in the Coulomb field due to the other fragments. Nearly analytic gradients and properties can be computed, and a calculation on a 4000-atom system was reported. The scaling is nearly linear. J. Comp. Chem. 26, 1 (2005).