On the information content of 2D and 3D descriptors for QSAR

Journal of the Brazilian Chemical Society, Jan 2002

To better understand the information content of two-dimensional (2D) vs. three-dimensional (3D) descriptor systems, we analyzed principal component analysis (PCA) scores derived from 87 2D descriptors and 798 3D (ALMOND) variables for a set of 5998 compounds of medicinal chemistry interest. The information overlap between ALMOND and 2D-based descriptors, as modeled by the fraction of explained variance (r²) and by seven-group cross-validation (q²) in a two-component PLS model, was 40%. Individual component analysis indicates that the first and second principal components from the 2D descriptors are related to the first and third dimensions of the ALMOND PCA model. The first ALMOND component is explained (61%) by size-related descriptors, whereas the third component is only marginally explained (25%) by hydrophobicity-related descriptors. Surprisingly, 2D-based hydrogen-bonding descriptors did not contribute significantly to this analysis. These results do not a priori justify the choice of one methodology over the other when performing QSAR studies.
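As a rough illustration of the two statistics quoted above, the fraction of explained variance is conventionally computed as 1 - SS_res/SS_tot on the fitted data, and the cross-validated q² as 1 - PRESS/SS_tot, where PRESS sums the squared errors of predictions made for compounds left out of the fit. The sketch below is not the paper's model: it swaps in a one-variable least-squares line for the two-component PLS model, and it assumes the seven groups are formed by taking compound index modulo seven. The function names `fit_line`, `r2` and `q2` are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares line y = a + b*x (a simple stand-in for the PLS model)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def r2(xs, ys):
    """Fraction of explained variance when fitting on all compounds."""
    a, b = fit_line(xs, ys)
    my = mean(ys)
    ss_tot = sum((y - my) ** 2 for y in ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return 1.0 - ss_res / ss_tot


def q2(xs, ys, n_groups=7):
    """Cross-validated q2: each group is predicted by a model fitted without it."""
    my = mean(ys)
    ss_tot = sum((y - my) ** 2 for y in ys)
    pairs = list(zip(xs, ys))
    press = 0.0
    for g in range(n_groups):
        # Assumed grouping scheme: compound i belongs to group i % n_groups.
        train = [p for i, p in enumerate(pairs) if i % n_groups != g]
        held_out = [p for i, p in enumerate(pairs) if i % n_groups == g]
        a, b = fit_line(*zip(*train))
        press += sum((y - (a + b * x)) ** 2 for x, y in held_out)
    return 1.0 - press / ss_tot
```

Because every prediction entering PRESS comes from a model that never saw the compound in question, q² for a least-squares fit cannot exceed r² on the same data, which is why the two values are usually reported together.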


J. Braz. Chem. Soc., Vol. 13, No. 6, 811-815, 2002. Printed in Brazil - ©2002 Sociedade Brasileira de Química 0103 - 5053 $6.00+0.00

Tudor I. Oprea

Office of Biocomputing, BSMB61, University of New Mexico School of Medicine, Albuquerque NM 87131-5196

Keywords: ALMOND, cheminformatics, chemometrics, QSAR

Introduction

There are currently over 3000 molecular descriptors [1] that can be used in QSAR (Quantitative Structure-Activity Relationship) studies [2]. Their application to QSAR has recently been surveyed [3]. Significant information about a QSAR dataset can be extracted using 2D (two-dimensional) descriptors, i.e., descriptors that do not use information related to the three-dimensional characteristics of model compounds. Most of these descriptors can be classified as:

i) Size-related: molecular weight (MW); calculated [4] molecular refractivity (CMR); molecular volume and molecular surface area, pre-computed from tabulated values (e.g., using van der Waals radii), etc.;

ii) Hydrophobicity-related: the logarithm of the octanol-water partition coefficient, LogP [5] (besides CLOGP [6], several other LogP estimating programs are available [7]); the π fragmental constant [8]; the logarithm of the (molar) aqueous solubility [9,10] (LogSw);

iii) Descriptors related to electronic effects: CMR; the (tabulated) estimated polarizability [11]; Hückel-level estimates of the highest-occupied and lowest-unoccupied molecular orbitals; partial atomic charges based on electronegativity equilibration schemes [12,13]; counts of positive or negative ionic centers; etc.;

iv) Hydrogen-bonding descriptors that estimate the basicity or acidity factors, e.g., the HYBOT [14,15] or Abraham descriptors [16], or electro-topological (E-state) descriptors [17], or counts [18] of hydrogen bond acceptors or donors;

v) Topological descriptors [19] derived from connectivity [20] matrices [21,22].
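To make concrete what "no three-dimensional information" means for the size-related class, the following minimal sketch derives molecular weight, heavy-atom count and carbon count from nothing more than a molecular formula. It is an illustration only, not the procedure of any program cited here: the helper names `parse_formula` and `size_descriptors` are hypothetical, the atomic-weight table is deliberately abridged, and formulas with parentheses or charges are assumed not to occur.

```python
import re

# Abridged table of standard atomic weights; extend for other elements.
ATOMIC_MASS = {
    "H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999,
    "F": 18.998, "S": 32.06, "Cl": 35.45, "Br": 79.904,
}


def parse_formula(formula):
    """Return {element: count} for a flat formula such as 'C2H6O'."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts


def size_descriptors(formula):
    """Size-related descriptors that need no coordinates at all."""
    counts = parse_formula(formula)
    return {
        "MW": sum(ATOMIC_MASS[el] * n for el, n in counts.items()),
        "heavy_atoms": sum(n for el, n in counts.items() if el != "H"),
        "carbons": counts.get("C", 0),
    }


ethanol = size_descriptors("C2H6O")
print(round(ethanol["MW"], 3), ethanol["heavy_atoms"], ethanol["carbons"])
# prints: 46.069 3 2
```

Descriptors of this kind are identical for every conformer of a molecule, which is the sense in which they carry no 3D information.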
The above types of descriptors have been successfully used to derive QSAR models for the past four decades. However, over the past 15 years, our ability to investigate the third dimension in a meaningful way, e.g., by analyzing conformers, has led to the development of 3D (three-dimensional) QSAR methods. Best represented by CoMFA [23] (Comparative Molecular Field Analysis) or by the combination of GRID [24] and PLS [25] (Partial Least Squares), 3D-QSAR methods [26-28] try to explain the variance in biological activity by monitoring variations in the 3D structures of chemical compounds. CoMFA, for example, attempts to relate the molecular interaction fields (MIFs) of a series of molecules to biological activity via PLS [25], thus matching differences or similarities in the MIFs (steric and electrostatic by default) to differences or similarities in biological activity. Quite early, the use of graphical analysis [29] to evaluate CoMFA-PLS results was recognized as the main strength of 3D-QSAR methods.

However, the value of 3D descriptors has been called into question in the context of cheminformatics. As Brown and Martin have shown, simple (2D-based) substructure keys are more successful in grouping active compounds than more elaborate 3D-based keys [30]. Brown and Martin went further to show that 2D-based descriptors are more useful than 3D descriptors in predicting LogP and pKa [31]. Yvonne Martin further discusses the balance between 2D and 3D-QSAR models [32]. However, LogP and pKa are physico-chemical properties where the third dimension (conformational flexibility) bears little, if any, relevance. This is not the case for the vast majority of biological activities.

To better understand the information content of 2D vs. 3D descriptors, we analyzed principal component analysis (PCA) scores derived from SaSA [33] and ALMOND [34] on a set of 5998 compounds of medicinal chemistry interest [35]. This paper discusses the relevance of 2D vs. 3D descriptors, in part discussed elsewhere [36], in the absence of any property correlations (Y vectors).

Materials and Methods

SaSA descriptors

SaSA [33] computes 72 descriptors starting from the 2D structures. Size-related descriptors include MW, the number of heavy atoms, the number of carbons, and CMR [4]. Polarizability is estimated by CMR and by an atom-based scheme [11]. Flexibility and rigidity are estimated [18] by counting the total number of bonds, the number of rings, the number of rotatable bonds and the number of rigid bonds, and by several topological indices that estimate other properties [22] as well. The Wiener, Balaban, Randic and Motoc indices, as well as the Kier and Hall suite of connectivity descriptors [20], are also computed in SaSA. Hydrogen-bonding capacity is estimated using HYBOT [14] descriptors. Furthermore, SaSA uses simple counts for oxygen, nitrogen, H-bond donors and H-bond acceptors, positive and negative ion (...truncated)
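Of the topological indices named above, the Wiener index is the simplest to state: it is the sum of the shortest-path distances, counted in bonds, between all pairs of heavy atoms in the hydrogen-suppressed molecular graph. The sketch below shows how such an index follows from connectivity alone. It is not SaSA's implementation; the adjacency-dictionary input format and the function name `wiener_index` are assumptions made for illustration.

```python
from collections import deque


def wiener_index(adjacency):
    """Sum of shortest-path bond counts over all unordered atom pairs.

    `adjacency` maps each heavy atom to a list of its bonded neighbours
    (hydrogen-suppressed graph, assumed connected).
    """
    total = 0
    for start in adjacency:
        # Breadth-first search gives topological distances from `start`.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            atom = queue.popleft()
            for neighbour in adjacency[atom]:
                if neighbour not in dist:
                    dist[neighbour] = dist[atom] + 1
                    queue.append(neighbour)
        total += sum(dist.values())
    # Every pair was counted twice, once from each end.
    return total // 2


n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
isobutane = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}
print(wiener_index(n_butane), wiener_index(isobutane))
# prints: 10 9
```

The two C4 isomers have the same molecular weight but different Wiener indices, showing how a topological index encodes branching that size descriptors miss, still without any 3D coordinates.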


This is a preview of a remote PDF: http://www.scielo.br/pdf/jbchs/v13n6/13848.pdf
Article home page: http://www.scielo.br/scielo.php?script=sci_abstract&pid=S0103-50532002000600013&lng=pt&nrm=iso&tlng=en

Tudor I. Oprea. On the information content of 2D and 3D descriptors for QSAR, Journal of the Brazilian Chemical Society, 2002, pp. 811-815, Volume 13, Issue 6, DOI: 10.1590/S0103-50532002000600013