Jörg Kurt Wegner and Holger Fröhlich and Andreas Zell

Feature selection for Descriptor based Classification Models. 2. Human Intestinal Absorption (HIA)

J. Chem. Inf. Comput. Sci. 2004, 44, pp. 931-939


Abstract

We show that the topological polar surface area (TPSA) descriptor and the radial distribution function (RDF) applied to electronic and steric atom properties, like the conjugated electrotopological state (CETS), are the most relevant features/descriptors for predicting the human intestinal absorption (HIA) out of a large set of 2934 features/descriptors. We used an HIA data set of 196 molecules with measured HIA values, for which 2934 features/descriptors were calculated using JOELib and MOE. We used an adaptive boosting algorithm (AdaBoost.M1) to solve the binary classification problem and Genetic Algorithms based on Shannon Entropy Cliques (GA-SEC) variants as hybrid feature selection algorithms. Relevant features were selected with respect to the generalization ability of the classification model, so as to avoid a high variance for unseen molecules (overfitting).


BibTeX

@Article{wfz04b,
  author   =     "J. K. Wegner and H. Fr{\"{o}}hlich and A. Zell",
  title    =     "{F}eature selection for {D}escriptor based {C}lassification {M}odels. 2. {H}uman {I}ntestinal {A}bsorption ({HIA})",
  abstract =     "We show that the topological polar surface area (TPSA) descriptor and the radial
                  distribution function (RDF) applied to electronic and steric atom properties,
                  like the conjugated electrotopological state (CETS), are the most relevant
                  features/descriptors for predicting the human intestinal absorption (HIA) out
                  of a large set of 2934 features/descriptors. We used an HIA data set of 196
                  molecules with measured HIA values, for which 2934 features/descriptors were
                  calculated using JOELib and MOE. We used an adaptive boosting algorithm
                  (AdaBoost.M1) to solve the binary classification problem and Genetic Algorithms
                  based on Shannon Entropy Cliques (GA-SEC) variants as hybrid feature selection
                  algorithms. Relevant features were selected with respect to the generalization
                  ability of the classification model, so as to avoid a high variance for unseen
                  molecules (overfitting).",
  journal  =     "J. Chem. Inf. Comput. Sci.",
  volume   =     "44",
  year     =     "2004",
  pages    =     "931--939",
  url      =     "http://dx.doi.org/10.1021/ci0342324",
  doi      =     "10.1021/ci0342324",
  note     =     "",
  contents =     "human intestinal absorption, bioavailability, model quality, feature selection, genetic algorithm, boosting, support vector machines, radial distribution function",
  topics   =     "human intestinal absorption, bioavailability, model quality, feature selection, genetic algorithm, boosting, support vector machines, radial distribution function",
}