Getting past diversity in assessing virtual library designs
[Abstract] The incorporation of high-throughput screening (HTS) into the drug discovery and development process has prompted many pharmaceutical companies to shift from synthesis of individual compounds to combinatorial synthesis programs. This has led in turn to a broadening of the range of chemistry amenable to combinatorial approaches. One can now easily generate a virtual library composed mostly (if not entirely) of reasonably drug-like, synthetically accessible compounds, the full realization of which would bankrupt any existing or conceivable pharmaceutical company many times over. This situation creates a pressing need for design tools to help chemists decide which particular products from such a virtual library should actually be made and tested. Many programs have been created for generating such sublibrary designs, with the most recent work focusing on choosing reagents so as to maximize some property of the specified products [1]. In many cases, the property being optimized is molecular diversity (substructural or pharmacophoric) among the products, though other objective functions have been used as well [2]. In all cases, however, the intrinsic redundancy of combinatorial libraries guarantees that there will be many different solutions that are essentially equivalent with regard to the specific criterion being evaluated. Representativeness, in particular, is an important secondary consideration. Chemists need to be able to compare such alternative sublibraries in some general, detailed way. In addition, computational chemists need a meaningful way to evaluate the effectiveness of different design programs [3] and of reagent-based versus product-based design tools [4]. Here we use a series of sublibraries designed to be both representative and diverse to illustrate general analytical approaches based on Tanimoto similarities between substructural fingerprints.
We use nearest neighbor similarity profiles and a recently developed variation on non-linear mapping for visualization to compare the various sublibraries [5].

Methods

Diversity analysis

Molecular diversity was determined by comparing UNITY® substructural fingerprints [6] for the compounds in question. These are bit vectors in which elements are set to 1 if particular substructures are present [7]. The similarity between fingerprints was evaluated in terms of the Tanimoto coefficient [8] T as applied to bit set vectors x and y:

$$T(x, y) = \frac{|x \cap y|}{|x \cup y|}$$

where the bracketing vertical bars indicate cardinality. For diversity selection, the similarity between two sets is taken as the maximum similarity found between any member of one set and a member of the other, i.e., the largest nearest neighbor similarity. This criterion underlies the maximal diversity selection algorithm [9] used in the dbdiss program distributed as part of the Selector module in SYBYL®.

A broader sense of the similarity of two (sub)sets can be obtained by examining the distribution of nearest neighbor similarities of one (sub)set with respect to the other. The nearest neighbor similarity profiles discussed here were obtained using the dbcmpr program, which is also part of the Selector module of SYBYL. It is often enlightening to use dbcmpr to compare a set to itself, i.e., to take the same set as both target and reference. As discussed elsewhere [10], this is particularly informative when the set in question is a maximally diverse subset obtained using the algorithm embodied in dbdiss. The self-similarity profiles for such maximally diverse subsets can provide valuable insight into the structural scope of the data sets (here, sublibraries) from which they are drawn. The procedure is analogous to characterizing the area of a room by spreading a set number of dimes around it so as to maximize the distance (dissimilarity) between them.
Having done so, the proximity (similarity) of the dimes characterizes the area (here, hypervolume) and shape of the room.

OptiSim selection

Optimizable k-dissimilarity (OptiSim [11]) selection entails selection of the "best" candidate from each of a series of subsamples of size k drawn at random from the data set of interest [12]. Redundancy is prevented by checking each potential candidate against those items (here, compounds) selected in previous iterations; if it is too similar to any item already chosen, it is disqualified from further consideration. For most applications, a modified form of uniform random sampling without replacement is used, so that all potential candidates are considered before any candidate is reconsidered. The criterion used here to determine which candidate is "best" is structural diversity with respect to the compounds selected during previous iterations, with the first selection drawn at random or specified externally. To preclude structural redundancy [13], candidates were excluded from subsamples if their fingerprints exhibited a Tanimoto similarity to those already selected greater than 0.90, corresponding to a "Tanimoto distance" ($d_T = 1 - T$) less than 0.10.

When applied to a large combinatorial library, the stochastic component of this simple strategy leads to selection of a set of compounds representative of the library as a whole. Choosing from each subsample the candidate that is least similar to those already selected, on the other hand, enhances the diversity of the selection set relative to simple random selection. The balance between representativeness and diversity is set by the choice of k, with smaller subsample sizes favoring representativeness and larger sizes favoring diversity.
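The selection loop described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the dbdiss or SPL implementation: fingerprints are represented as Python sets of on-bit indices, "least similar to those already selected" is read as the smallest nearest neighbor similarity, and the function names (`tanimoto`, `optisim_select`), the `recycle` list used to defer reconsideration of unsuccessful candidates, and the `seed` parameter are all hypothetical.

```python
import random


def tanimoto(x, y):
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    union = len(x | y)
    # Two empty fingerprints are treated as identical (an assumption of this sketch).
    return len(x & y) / union if union else 1.0


def optisim_select(fingerprints, n_select, k=3, max_sim=0.90, seed=0):
    """Return up to n_select indices chosen by k-dissimilarity subsampling."""
    rng = random.Random(seed)
    pool = list(range(len(fingerprints)))
    rng.shuffle(pool)
    selected = [pool.pop()]  # first selection drawn at random
    recycle = []             # losers, reconsidered only once the pool is exhausted

    while len(selected) < n_select:
        subsample = []
        while len(subsample) < k and pool:
            cand = pool.pop()
            nearest = max(tanimoto(fingerprints[cand], fingerprints[s])
                          for s in selected)
            if nearest <= max_sim:
                subsample.append((nearest, cand))
            # Otherwise the candidate is redundant and is dropped for good.

        if not subsample:
            if not recycle:
                break  # nothing eligible remains
            pool, recycle = recycle, []
            rng.shuffle(pool)
            continue

        subsample.sort()  # smallest nearest neighbor similarity first
        selected.append(subsample[0][1])
        recycle.extend(cand for _, cand in subsample[1:])

    return selected


if __name__ == "__main__":
    # Toy fingerprints; real UNITY fingerprints are much longer bit vectors.
    toy = [{1, 2, 3}, {1, 2, 3}, {4, 5, 6}, {7, 8, 9}, {1, 2, 4}]
    print(optisim_select(toy, n_select=3, k=3))
```

With k = 1 the sketch reduces to random selection subject to the redundancy filter, while a k as large as the data set makes each step consider every remaining candidate, which mirrors the trade-off between representativeness and diversity described in the text.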
Studies to date indicate that values of k in the range of 3 to 5 increase diversity without sacrificing much representativeness [12].

Sublibrary block design

Combinatorial sublibraries were created by applying an extension of OptiSim selection [5] in which successive reagent selections alternate between reagent classes. Consider, for example, two reagent sets A and B such that A + B + X → AXB, where X is a common core or scaffold. Seed reagents A0 and B0 are selected at random. A subsample composed of k candidates (a11, a12, …, a1k) chosen at random from A is then created, taking care that none of the products of reaction with B0 (e.g., a11XB0) are too similar to A0XB0. The reagent leading to the product with the lowest Tanimoto similarity to A0XB0 is taken as the "best" candidate reagent; it becomes A1. The design then pivots to consider reagents from B, with b11, b12, …, b1k chosen at random, subject to the constraint that no product AiXb1j is too similar to either A0XB0 or A1XB0. The best candidate b1j is then determined by identifying the one for which the similarity to the two products already selected is smallest. This candidate becomes B1, the products A0XB1 and A1XB1 are added to the selection set, and the program proceeds to consider a new subsample of k candidate reagents from A.

In many cases, an unbalanced design is desired. If a larger reagent subset is specified for B than for A, pivoting stops once the quota for As has been fulfilled and subsamples of reagents are drawn from B until the block is completed. A new block is then initiated by drawing k products at random from the parent library and comparing them against all products included in the first block, after which the new block is grown. The pattern of product selection produced by application of this method is illustrated in Figure 1.

The designs described here were produced using a prototype implementation of the method written in the SYBYL Programming Language (SPL). It bears noting that filters other than simple redundancy (e.g., acceptability of expected physical properties) can readily be put in place for determining the eligibility of candidate reagents for the subsample. Similarly, the "best" candidate in each subsample need not be determined by structural diversity, as it is here; similarity to a lead compound or incremental goodness of fit of the selected population to some target profile can be substituted. It should also be noted, however, that applying the diversity criterion to a series of subsamples rather than to the library as a whole serves to shift the properties of the sublibraries obtained away from simple diversity.

Non-linear mapping with horizon (NLM-H)

UNITY substructural fingerprints are made up of 988 binary elements.
It is impossible for a human being to directly perceive relationships in such a high-dimensional space, and the Cartesian space to which we are accustomed is not the most appropriate one for making such comparisons anyway. As noted above, the Tanimoto similarity coefficient is better suited for this purpose, but it can only be directly applied to pairwise comparisons. Hence a tool is needed that can project most of the relevant information contained in the 988-dimensional "fingerprint space" down into two or three dimensions without unduly distorting important underlying Tanimoto relationships. One way to accomplish this is to use principal components analysis (PCA) to obtain initial coordinates, and then to use non-linear mapping (NLM) [14] to relieve distortions created by that projection. Behavior in such projections is dominated by long-range relationships, however, whereas it is local similarities that contain the most important information in fingerprint space; differences between low similarities tend to be meaningless [15]. Worse, long-range relationships in this space are intrinsically very high dimensional, leading to large residual distortions in the projections obtained. Local relationships, on the other hand, tend to be of relatively low dimensionality, because the space is typically quite sparse. The best strategy in such a
[Subject classification] Chemistry (general)
[Keywords] combinatorial library design; molecular diversity; representativeness; OptiSim; dissimilarity selection