A Rough Set-Based Model of HIV-1 Reverse Transcriptase Resistome

Reverse transcriptase (RT) is a viral enzyme crucial for HIV-1 replication. Currently, 12 drugs target the RT. The low fidelity of RT-mediated transcription leads to the quick accumulation of drug-resistance mutations. The sequence-resistance relationship remains only partially understood. Using publicly available data collected from over 15 years of HIV proteome research, we have created a general and predictive rule-based model of HIV-1 resistance to eight RT inhibitors. Our rough set-based model considers changes in the physicochemical properties of a mutated sequence as compared to the wild-type strain. Thanks to the application of the Monte Carlo feature selection method, the model takes into account only the properties that significantly contribute to the resistance phenomenon. The obtained results show that drug resistance is determined in a more complex way than previously believed. We confirmed the importance of many resistance-associated sites, found some sites to be less relevant than formerly postulated and, more importantly, identified several previously neglected sites as potentially relevant. By mapping some of the newly discovered sites on the 3D structure of the RT, we were able to suggest possible molecular mechanisms of drug resistance. Importantly, our model can generalize predictions to previously unseen cases. The study is an example of how computational biology methods can increase our understanding of the HIV-1 resistome.


Introduction
More than two decades have passed since the discovery of HIV, the causative agent of AIDS. Numerous groups have focused their research on understanding the details of the HIV life cycle and on developing efficient antiviral therapies. Unfortunately, the high rate of replication combined with the high mutability of the virus leads to the rapid emergence of drug-resistant strains, efficiently undermining the efforts to stop the AIDS pandemic. Currently, some 7,000 new HIV infections are reported worldwide every day. In total, more than 30 million people in both the developed and the developing countries are HIV-positive. 1 About 10^9 virions are produced in an infected individual every day and it has been estimated that each possible single-point mutation arises 10^4-10^5 times in this population. 2 While some mutations result in the production of functionally impaired viruses, others lead to the emergence of drug-resistant forms.
Reverse transcriptase (RT) is one of the viral enzymes required for successful replication. The RT catalyzes reverse transcription, the process of converting single-stranded viral RNA into double-stranded viral DNA. The viral DNA is later incorporated into the host genome and re-programs the host cell to produce new viral particles that undergo maturation, bud off and infect new cells, thus completing the viral life cycle. In peripheral blood lymphocytes the maturation occurs after viral release, while in macrophages it takes place prior to the release, within the cell, in the multivesicular bodies. Like the other enzymes in the family of reverse transcriptases, the HIV-1 RT lacks proof-reading activity which, combined with the high replication rate of the virus and the RT-mediated recombination, leads to the rapid emergence of HIV mutants. Many of these mutants are drug-resistant. The first antiviral therapies were targeted against the RT, and this enzyme remains one of the most common targets for anti-HIV drugs. The initial hope that followed the introduction of AZT (Zidovudine), the first antiviral agent targeting HIV, was quickly shattered by the rapid emergence of drug-resistant viruses. Among the 25 drugs currently used in HIV therapy, 12 target the RT enzyme.
There exist two groups of RT inhibitors, namely the nucleoside/nucleotide RT inhibitors (NRTI) and the non-nucleoside RT inhibitors (NNRTI). The former mimic dNTPs, the ordinary RT substrates, but because they lack the 3'-OH group in the ribose ring they inhibit DNA chain elongation immediately after being incorporated. The mode of action of the NNRTI drugs is somewhat different: they bind in the so-called NNRTI-binding pocket of the RT and induce conformational changes that terminate the synthesis of the viral DNA.
Various attempts have been undertaken to associate particular mutations in the RT sequence with the drug resistance level. Often, however, it is not a single mutation, but rather a non-linear combination of different mutations that leads to drug resistance. This increases the complexity of the problem, and various machine learning techniques have been used to predict resistance from the RT sequence. Drăghici and Potter 3 have used neural networks to build a predictive model of HIV drug resistance to RT inhibitors. The commonly used Geno2Pheno tool 4 relates sequence to resistance by using regression models. An international panel of experts semiannually releases a set of rules for predicting resistance. 5 A similar approach has been used by Johnson et al. 6 Garriga and Menéndez-Arias 7 released a tool that uses the available sets of expert-derived rules to predict resistance. In their interesting studies, Rhee et al 8 use five different statistical learning methods (decision trees, neural networks, support vector regression, least-squares regression and least-angle regression) to model the sequence-resistance relationship in HIV-1. A fresh and stimulating approach to the problem is presented in Kjaer et al 9 where the authors propose to represent protein sequences in terms of physicochemical properties of amino acids. Recently, Prosperi et al 10 published an interesting comparison of linear and non-linear machine learning techniques used in HIV resistome research. They conclude that fully data-driven models derived from large-scale data are promising as antiretroviral treatment decision support tools and postulate complementing sequence data sets with patient-derived data such as treatment history.
Although the existing models were able to predict HIV-1 resistance to RT inhibitors, none of them provided any deeper insight into the underlying mechanisms in a physicochemical sense. A method able to predict resistance caused by a previously unseen mutation was also lacking.
In this paper we attempt to fill this gap by developing a computational model of HIV-1 resistance to several RT inhibitors. Rather than looking at mutating amino acids, we based our model on local physicochemical properties of a protein sequence. This approach, combined with the Monte Carlo feature selection and the rough set theory, resulted in an interpretable, high-quality model of the RT resistome. The model consists of a number of general IF-THEN rules associating changes in the physicochemical properties of the RT sequence with drug resistance level, e.g.:

THEN resistant to Nevirapine
This makes the model easy to interpret and generative, and leads us to believe that the presented approach will contribute to the development of new, more potent antiretroviral drugs.

Materials and Methods

Data
We used publicly available data obtained from the Stanford HIV Drug Resistance Database. 8 For each of the examined drugs we extracted a number of amino acid sequences of the HIV-1 RT p66 subunit. Each sequence in the database has been annotated with the resistance value relative to the HXB2 wild-type strain. Since Zhang et al 11 have demonstrated that the Monogram PhenoSense assay is more reliable than other drug-resistance-testing assays and that it produces highly reproducible results, we used only the sequences with the resistance value determined using this method. In total, there were 781 sequences of the p66 subunit (91% of them complete within the first 240 aa sites, 31% of them complete within all the 560 aa sites) that we could use for constructing data sets. Following the established clinical practice, we labeled each sequence as "susceptible", "moderately resistant" or "resistant". We used cut-off values for the discretization as described in Rhee et al. 8 The detailed distributions of the resistance classes per drug are presented in Table 1.

Description of sequences
Kjaer et al 9 have used 544 different physicochemical properties of amino acids obtained from the aaIndex database 12 to describe HIV-1 protein sequences. Although we used descriptors from the same database, our approach is different. Rather than constructing a large number of data sets, each based on a single physicochemical property, we constructed one data set per antiviral drug and described each amino acid in a sequence by a vector of biologically relevant and interpretable properties. Following the procedure described by Rudnicki and Komorowski, 13 we extracted a number of biologically meaningful descriptors from the aaIndex database.
First, we selected descriptors that are representative of three broad biophysical categories:
1. Transfer free energy from octanol to water 14 for hydrophobicity;
2. Normalized van der Waals volume 15 for size;
3. Isoelectric point 16 for charge.
These properties were fixed during the simulated annealing run. Then we randomly added four different properties and computed the sum of the r-squared values for all pairs in this set, which was used as a pseudo-energy measure. A single move in the simulation consisted of replacing one of the four random properties. Moves leading to a decrease of pseudo-energy were always accepted, and moves leading to an increase of pseudo-energy were accepted with the probability $P = \exp(-\Delta E / (kT))$, where $\Delta E$ is the increase of pseudo-energy, $T$ is a pseudo-temperature and $k$ is a scaling constant. The pseudo-temperature was slowly decreased during the simulation, from 1000 to 1, and the scaling constant was selected by trial and error. Ultimately, we selected seven relatively low-correlated (cf. Fig. 1) physicochemical descriptors that are presented in Table 2.
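The annealing procedure described above can be sketched as follows. This is a minimal illustration and not the code used in the study: the descriptor table is synthetic, and the geometric cooling schedule, the number of steps, the value of `k` and all function names are assumptions. Keeping track of the best set seen so far is added only for convenience.

```python
import math
import random


def r_squared(x, y):
    """Squared Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy * sxy / (sxx * syy)


def pseudo_energy(table, chosen):
    """Sum of r-squared over all pairs of the chosen descriptors."""
    chosen = list(chosen)
    return sum(r_squared(table[i], table[j])
               for pos, i in enumerate(chosen) for j in chosen[pos + 1:])


def anneal(table, fixed, n_free=4, t_start=1000.0, t_end=1.0,
           n_steps=2000, k=0.001, seed=0):
    """Pick n_free descriptors weakly correlated with each other and with
    the fixed ones. Cooling schedule, n_steps and k are assumptions."""
    rng = random.Random(seed)
    pool = [i for i in range(len(table)) if i not in fixed]
    free = rng.sample(pool, n_free)
    energy = pseudo_energy(table, list(fixed) + free)
    best, best_energy = list(free), energy
    cooling = (t_end / t_start) ** (1.0 / (n_steps - 1))  # geometric cooling
    t = t_start
    for _ in range(n_steps):
        # One move: replace one free descriptor with a currently unused one.
        candidate = list(free)
        unused = [i for i in pool if i not in free]
        candidate[rng.randrange(n_free)] = rng.choice(unused)
        new_energy = pseudo_energy(table, list(fixed) + candidate)
        delta = new_energy - energy
        # Metropolis rule: always accept a decrease, otherwise accept
        # with probability exp(-delta / (k * T)).
        if delta <= 0 or rng.random() < math.exp(-delta / (k * t)):
            free, energy = candidate, new_energy
            if energy < best_energy:
                best, best_energy = list(free), energy
        t *= cooling
    return sorted(best), best_energy


# Synthetic stand-in for aaIndex: 12 descriptors x 20 amino acids.
data_rng = random.Random(42)
table = [[data_rng.random() for _ in range(20)] for _ in range(12)]
fixed = (0, 1, 2)  # stands in for hydrophobicity, size and charge
selected, energy = anneal(table, fixed)
print(selected, round(energy, 3))
```

With `k = 0.001` the product `k * T` runs from 1 down to 0.001, so uphill moves are common early in the run and become rare towards the end; a real application would tune this constant, as the text notes.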
The selected properties let us represent each naturally occurring amino acid as a unique point in the coordinate frame spanned by them. After the description, each amino acid sequence in the data set was represented by 3,920 properties (560 aa × 7 properties). We described each site in an aa sequence as the difference between the vector representing the wild-type amino acid and the vector representing the observed amino acid. Therefore, if no mutation was observed at a site, the site was described by a vector of seven zeroes. The final data sets were the ensembles of the described sequences annotated with the drug resistance values.
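The difference encoding can be illustrated with a short sketch. The descriptor values below are hypothetical and only two properties are used instead of seven; the function name `encode_sequence` is likewise an assumption, and insertions or incomplete sequences are not handled.

```python
def encode_sequence(wild_type, observed, properties):
    """Flatten per-site (wild type - observed) property differences."""
    if len(wild_type) != len(observed):
        raise ValueError("sequences must be aligned to the same length")
    features = []
    for wt_aa, obs_aa in zip(wild_type, observed):
        wt_vec, obs_vec = properties[wt_aa], properties[obs_aa]
        features.extend(w - o for w, o in zip(wt_vec, obs_vec))
    return features


# Hypothetical two-property descriptors; the study uses seven per amino acid.
toy_properties = {"K": (1.0, 2.0), "N": (0.5, 2.5), "A": (0.0, 1.0)}
encoded = encode_sequence("KA", "NA", toy_properties)
print(encoded)  # [0.5, -0.5, 0.0, 0.0]
```

The second site is unchanged, so it contributes only zeroes, which mirrors the all-zero vector described above for unmutated sites.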

Monte Carlo feature selection
In order to select only the attributes (here the properties of 560 amino acids) that significantly contributed to drug resistance, we applied the Monte Carlo Feature Selection (MCFS) method as described in Dramiński et al. 21 In short, MCFS relies on the construction of a large number of decision trees. In this way, all non-informative features were removed from the initial data set. The results of the feature selection are presented in Tables 3-10.
For the sake of comparison, we note that the process of attribute ranking differs between Breiman's random forests (RF) 22 and MCFS. In RF, the ranking is obtained by reshuffling the values of an attribute and observing the change in the quality of classification. In MCFS, the randomization test is done in a standard way, by reshuffling decision labels. The importance of an attribute is determined by comparing its weighted accuracy with the background derived from the randomization test. Another important difference is that in MCFS individual trees are built on training samples drawn without replacement from the original set of samples (and are evaluated on the remaining samples), while RF uses bootstrap techniques, which rely on sampling with replacement.
We performed feature selection on the entire data sets prior to splitting them into the training set and the test set. In our previous work, 21 we argue in detail and show by examples that the MCFS provides a possibly objective ranking of features, independent of the classifier to be used later and pertaining only to the classification problem per se. In particular, using the MCFS does not lead to overfitting when proper classification is performed. At the same time, to benefit the most from the application of the MCFS, it should be performed on the largest available set of examples.

Rough sets
Rough set theory, described in Pawlak, 23 was introduced in the early eighties. It constitutes a mathematical framework particularly suitable for dealing with imprecise and incomplete data. In rough set-based machine learning, a set of minimal IF-THEN decision rules is inferred from a number of labelled examples. These rules constitute a model that can be used for assigning class labels to previously unseen objects. The IF part of a rule is a conjunction of feature values and the THEN part is a disjunction of class labels. We used the ROSETTA 24 implementation of the rough set theory to learn a number of IF-THEN rules that associate the MCFS-selected physicochemical properties of the amino acids of the HIV-1 RT with the resistance level.
Legend to Table 3. Symbols represent the status of a site: * sites known to contribute to resistance to Abacavir; + sites where mutations are associated with resistance to some NRTI drugs but not to Abacavir; ++ sites where mutations contribute to resistance to NNRTI drugs; +++ sites that are not included in the literature. 5,6,30
Legend to Table 4. Symbols represent the status of a site: * sites known to contribute to resistance to Delavirdine; + sites where mutations are associated with resistance to some NNRTI drugs but not to Delavirdine; ++ sites where mutations contribute to resistance to NRTI drugs; +++ sites that are not included in the literature. 5,6,30
Bioinformatics and Biology Insights 2009:3
As the rough sets approach requires all the features to take discrete values, we first applied the entropy scaler and the equal frequency binning discretization algorithm. The process of inferring minimal sets of features (reducts) is computationally expensive. We therefore used a genetic algorithm, a heuristic approach to finding approximate reducts. The obtained reducts let us infer a number of IF-THEN rules that link minimal combinations of amino acid properties with a resistance level. In order to make the model even more general, we applied a rule-generalization algorithm as described by Mąkosa. 25 In short, a general rule is obtained by merging similar or partially redundant rules and by relaxing the constraints imposed by them. For instance the following three rules (abbreviations explained in Table 2):
Typically all the rules that constitute the model vote for the final decision. A threshold defining the minimal number of votes necessary to label an object with a decision may result in multiple decisions for the same object. We would like to emphasize that the rules used by the model are inherently descriptive and can easily be analyzed by a domain expert. The description of the data is presented in Table 1. Table 12 provides the detailed description of the models.
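The voting step can be sketched as below. The sketch assumes that each firing rule casts a vote, weighted by its support, for every class in its THEN part, and that a class is reported when its share of the votes reaches a threshold. The exact scheme used by ROSETTA may differ, and the feature names, intervals and supports are hypothetical.

```python
import math
from collections import defaultdict


def rule_fires(conditions, obj):
    """True if every feature of the object lies in its half-open interval."""
    return all(low <= obj[name] < high
               for name, (low, high) in conditions.items())


def vote(rules, obj, threshold):
    """Return every class whose share of support-weighted votes reaches
    the threshold; more than one class may be returned."""
    votes = defaultdict(float)
    for conditions, decisions, support in rules:
        if rule_fires(conditions, obj):
            for decision in decisions:
                votes[decision] += support
    total = sum(votes.values())
    if total == 0:
        return set()  # no rule fired, so no decision is made
    return {d for d, v in votes.items() if v / total >= threshold}


# Hypothetical rules: (IF conditions, THEN class labels, support).
rules = [
    ({"P101_polarity": (-math.inf, 2.1)}, {"resistant"}, 30),
    ({"P190_freq_turn": (0.045, math.inf)}, {"resistant", "susceptible"}, 20),
    ({"P103_volume": (-math.inf, 0.0)}, {"susceptible"}, 10),
]
mutant = {"P101_polarity": -0.3, "P190_freq_turn": 0.1, "P103_volume": 1.0}

print(vote(rules, mutant, threshold=0.4))   # only "resistant" passes
print(vote(rules, mutant, threshold=0.25))  # both classes pass
```

Lowering the threshold from 0.4 to 0.25 turns a single decision into two, which is the situation of multiple decisions for the same object mentioned above.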

Validation
The validity of each model was determined in 10-fold cross-validation and in the so-called randomization test. In addition, the predictive quality of each general model was verified using an external test set.
Legend to Table 5. Symbols represent the status of a site: * sites known to contribute to resistance to Lamivudine; + sites where mutations are associated with resistance to some NRTI drugs but not to Lamivudine; ++ sites where mutations contribute to resistance to NNRTI drugs; +++ sites that are not included in the literature. 5,6,30
In order to assess the probability that the obtained results could have been generated by random data, we constructed an additional 1000 data sets per model by randomly permuting the decision in the original data set. Thus, we broke the correspondence between the sequence and the resistance value. Each of the 1000 randomized data sets was evaluated using 10-fold cross-validation. Ultimately, we used all the sequences from the original data set to train a rough set-based classifier and validated the predictions on the external test set. The performance of the models was validated using prediction accuracy and the area under the ROC (Receiver Operating Characteristic) curve (AUC). The accuracy, equal to the fraction of correctly classified sequences, was measured by its mean value for the cross-validated experiments and, finally, by its value on the external test set. The AUC was measured by its mean for the cross-validated experiments.
Legend to Table 7. Symbols represent the status of a site: * sites known to contribute to resistance to Tenofovir; + sites where mutations are associated with resistance to some NRTI drugs but not to Tenofovir; ++ sites where mutations contribute to resistance to NNRTI drugs; +++ sites that are not included in the literature. 5,6,30
For a two-class classification task, the ROC curve accounts for an uneven distribution of the decision classes in the original data set and visualizes the behavior of the classifier at different sensitivity-to-specificity ratios. Sensitivity is defined as the ratio between true positive predictions and the total number of positive examples. Specificity is the ratio between true negative predictions and the total number of negative examples. The ROC curve is constructed by plotting sensitivity vs. 1 - specificity. The AUC value is the integral of the ROC curve. For a perfect binary classifier AUC = 1.0, whereas for a random classifier AUC = 0.5. Since in our case the decision takes three distinct resistance values, "susceptible", "moderately resistant" and "resistant", we provide a separate AUC value for each class by treating the two remaining classes as one. For instance, to calculate an AUC value for the class "susceptible", we consider both the "moderately resistant" and the "resistant" classes as a new "non-susceptible" class.
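A one-vs-rest AUC of this kind can be computed directly from classifier scores, using the fact that the AUC equals the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one, with ties counting one half. The scores and function names below are made up for illustration.

```python
def auc(scores, is_positive):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, flag in zip(scores, is_positive) if flag]
    neg = [s for s, flag in zip(scores, is_positive) if not flag]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(labels, class_scores, target):
    """Treat `target` as positive and merge all other classes into one."""
    return auc([s[target] for s in class_scores],
               [label == target for label in labels])


S, M, R = "susceptible", "moderately resistant", "resistant"
labels = [S, R, M, S]
# Hypothetical per-class scores produced by some classifier.
class_scores = [
    {S: 0.9, M: 0.05, R: 0.05},
    {S: 0.2, M: 0.3, R: 0.5},
    {S: 0.4, M: 0.2, R: 0.4},
    {S: 0.6, M: 0.3, R: 0.1},
]
for target in (S, M, R):
    print(target, one_vs_rest_auc(labels, class_scores, target))
```

In this toy data the "susceptible" and "resistant" classes are separated perfectly, while the "moderately resistant" class is ranked poorly.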
Legend to Table 8. Symbols represent the status of a site: * sites known to contribute to resistance to Zidovudine; + sites where mutations are associated with resistance to some NRTI drugs but not to Zidovudine; ++ sites where mutations contribute to resistance to NNRTI drugs; +++ sites that are not included in the literature. 5,6,30
Finally, we used the results of the randomization tests to compute a kind of p-value, i.e. the probability that the relationships found in the original data arose by pure chance. Our computations were based on the assumption that the AUCs obtained in the randomization test are normally distributed. The normality was assessed by examining the so-called Q-Q plots and applying the Shapiro-Wilk test for normality. Subsequently we used Student's t-test to obtain the p-values.
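A minimal sketch of turning randomization results into a p-value is given below. Note the substitution: instead of Student's t-test it uses the upper tail of a normal distribution fitted to the randomized AUCs, plus a distribution-free empirical estimate for comparison. The AUC values are simulated and the function names are assumptions.

```python
import random
from statistics import NormalDist, mean, stdev


def normal_p_value(observed_auc, null_aucs):
    """Upper-tail probability of the observed AUC under a normal
    distribution fitted to the AUCs from the randomized data sets."""
    null = NormalDist(mean(null_aucs), stdev(null_aucs))
    return 1.0 - null.cdf(observed_auc)


def empirical_p_value(observed_auc, null_aucs):
    """Distribution-free alternative: share of randomized runs reaching
    the observed AUC, with the usual +1 correction."""
    hits = sum(a >= observed_auc for a in null_aucs)
    return (hits + 1) / (len(null_aucs) + 1)


# Simulated stand-ins for 1000 randomization-test AUCs (not study data).
rng = random.Random(1)
null_aucs = [rng.gauss(0.5, 0.03) for _ in range(1000)]

print(normal_p_value(0.85, null_aucs))
print(empirical_p_value(0.85, null_aucs))
```

The empirical estimate cannot drop below 1/(n + 1) for n randomized data sets, which is one reason a parametric assumption such as normality is attractive when the observed AUC lies far outside the randomized range.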
In addition, we compared the performance of our models with the performance of their standard decision tree-based counterparts with mutations represented by one-letter aa codes. We used the J48 algorithm as provided in the WEKA 26 suite to derive the decision tree models.

Results and Discussion
Legend to Table 9. Symbols represent the status of a site: * sites known to contribute to resistance to Didanosine; + sites where mutations are associated with resistance to some NRTI drugs but not to Didanosine; ++ sites where mutations contribute to resistance to NNRTI drugs; +++ sites that are not included in the literature. 5,6,30
Application of the Monte Carlo feature selection method combined with a rough set-based approach resulted in statistically sound, interpretable and generative rule-based models of the RT sequence-resistance relationship. The models can be used to predict HIV-1 resistance to six different NRTI drugs and two NNRTIs. By representing mutating amino acids in terms of physicochemical changes, the models gained generality and can be used to predict resistance for previously unseen mutants. Let us assume that only the following amino acids have been observed at site 101: A, E, H, K, P, Q, R, S, insertion, and that this observation led to the following rule:
IF (polarity at site 101 = (-∞, 2.100)) THEN resistant to Nevirapine
Now, if the model is asked to predict whether a newly observed mutation to asparagine at site 101 will result in drug resistance, the polarity value for asparagine (polarity N = 11.60) will be substituted into the rule and the prediction will be "Resistant to NVP".
Legend to Table 10. Symbols represent the status of a site: * sites known to contribute to resistance to Nevirapine; + sites where mutations are associated with resistance to some NNRTI drugs but not to Nevirapine; ++ sites where mutations contribute to resistance to NRTI drugs; +++ sites that are not included in the literature. 5,6,30
At the first step, each RT sequence was represented by 3,920 properties. Application of the MCFS led to a significant reduction of this number (see Tables 3-10). Already at this point we discovered that mutations at several previously unnoticed sites contribute to drug resistance. There are 5 such sites for Abacavir, 5 for Didanosine, 4 for Lamivudine, 8 for Stavudine, 6 for Tenofovir, 6 for Zidovudine, 10 for Delavirdine and [...] for Nevirapine. Apart from these, there are several sites where mutations were previously associated with resistance to some drugs, but our results suggest that resistance to other drugs may also be induced by them. We speculate that mutations at the newly discovered sites may be either directly responsible for drug resistance or may play a compensatory role by accompanying other drug-resistance mutations and diminishing their negative effects, e.g. the decreased replication rate.
Table 11 presents sites that are included in various sets of rules for predicting drug resistance 5,6 but were not selected as significant by the MCFS method. The missed sites are either underrepresented in the data sets or their influence on drug resistance is much weaker than previously assumed. This issue has to be investigated further.
Following the feature-selection step, we applied the rough set approach to build rule-based models of HIV-1 resistance to drugs. We used two different sets of parameters leading either to very specific or to more general rules that underlie a model. Prior to model-building, we excluded 20% of the available examples from each data set in order to use them for independent validation. We used the remaining data for model construction. We validated our models in 10-fold cross-validation and used the area under the ROC curve to measure their performance. All the models showed good results, with accuracy varying from 69% for Delavirdine to 89% for Lamivudine when using specific sets of rules and from 69% for Delavirdine to 88% for Lamivudine when using generalized rules. Similarly, the corresponding AUC values were high in the majority of the models (cf. Table 12). In some cases, e.g. the resistance-to-Nevirapine model that was based on general rules, we observed low AUC values for the "moderately resistant" class. This may be due to the fact that the artificially set threshold values and the arbitrary split into three resistance classes are not completely reflected in real mutation patterns. Generalization of the rules did not lead to any significant deterioration of the classification quality. 25 At the same time it reduced the number of rules by an order of magnitude. Models built on general rules are smaller, less sensitive to overtraining and easier to analyze. Finally, we validated each model on an external test set (20% of the available examples). In addition, we compared the performance of our models to the standard decision tree-based models. The decision trees performed similarly to their rough set-based counterparts but at the same time they were less stable. The decision tree-based models derived with no feature selection step lose generality and an important interpretational layer. The results are summarized in Table 12.
Table 11. Sites mentioned in 5,6 but not selected as significant by the MCFS method are marked with "X". Abbreviations: ABC, abacavir; ddI, didanosine; 3TC, lamivudine; d4T, stavudine; AZT, zidovudine; NVP, nevirapine. Delavirdine is not included in the articles.
We also compared our model based on generalized rules with the model described by the domain expert rules 5 (cf. Table 13). For both sets of rules, we computed coverage and accuracy. In the case of the domain expert rules, we could use the entire data sets for the computation, while in the case of the rough set model we used only the test sets to avoid the possible bias caused by the fact that the rules were derived from the training data. Therefore, for our model we provide only pessimistic estimates of accuracy and coverage. While accurate, expert rules are applicable only to a very limited fraction of examples. The generalized rules that underlie our model have significantly higher coverage.
Importantly, our generalized rules are conjunctions of the values (intervals of values) of physicochemical properties of amino acids. This allows seeing which amino acids fulfill the criteria imposed by a given rule, even when such amino acids were not represented in the training set. Given the following rule:
IF P101 polarity ((-∞, 2.100)) AND P190 freq.turn ([0.045, ∞)) THEN resistant to Nevirapine
we can easily find which amino acids satisfy the conditions and substitute them into the rule:
IF P101(any of: D,E,H,K,N,Q,R) AND P190(any but: A,G,N,P,Y) THEN resistant to Nevirapine
Even though asparagine (N) was not observed at site 101 in the available data, our general model is able to foresee that an occurrence of such a mutation may result in the acquisition of resistance. Such an approach already proved to be successful in revealing mechanisms underlying resistance to protease inhibitors. 27 Figure 2 and Figure 3 present an instance of analysis of the strongest rules determining resistance to Abacavir and Nevirapine, respectively. For more details see Supplementary Material, Figures S1-S7.
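The translation of an interval condition into a list of admissible amino acids can be sketched for the P101 polarity condition. Several things are assumed here rather than taken from the text: the polarity values are taken to follow the Grantham scale from aaIndex and should be verified before reuse, lysine (K) is taken as the wild-type residue at site 101, and the feature is computed as wild type minus observed, in line with the difference encoding described in Materials and Methods. The function name is hypothetical.

```python
import math

# Assumed polarity values (Grantham scale); verify against aaIndex.
POLARITY = {
    "A": 8.1, "R": 10.5, "N": 11.6, "D": 13.0, "C": 5.5,
    "Q": 10.5, "E": 12.3, "G": 9.0, "H": 10.4, "I": 5.2,
    "L": 4.9, "K": 11.3, "M": 5.7, "F": 5.2, "P": 8.0,
    "S": 9.2, "T": 8.6, "W": 5.4, "Y": 6.2, "V": 5.9,
}


def amino_acids_satisfying(wild_type, low, high, scale):
    """Amino acids whose (wild type - observed) difference is in [low, high)."""
    reference = scale[wild_type]
    return sorted(
        aa for aa, value in scale.items()
        # Rounding guards against floating-point noise at the interval edge.
        if low <= round(reference - value, 3) < high
    )


# Condition "P101 polarity in (-inf, 2.100)" with K assumed as wild type.
allowed = amino_acids_satisfying("K", -math.inf, 2.100, POLARITY)
print(allowed)  # ['D', 'E', 'H', 'K', 'N', 'Q', 'R']
```

Under these assumptions the sketch yields the same set as the P101 condition in the substituted rule above, including asparagine; the P190 condition is omitted because the corresponding descriptor values are not reproduced here.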
All the remaining sets of rules were included in the online supplementary material. Detailed analysis indicates that although amino acids at these newly discovered positions interact directly neither with the nucleic acid nor with the ABC triphosphate (ABCTP), the detected mutations may disturb the complex network of hydrophobic and polar interactions responsible for the stability of the tertiary structure. This may lead to subtle structural changes in the relative orientation of the domains and active site architecture, preventing ABCTP binding in a catalytically competent configuration. However, it seems that these small structural changes do not impair the ability of a drug-resistant enzyme to incorporate normal nucleotides in the catalyzed reaction.
There are 10 sites (98, 100, 101, 103, 106, 108, 181, 188, 190 and 230) that experts have associated with resistance to Nevirapine. The model finds all of these important (except sites 108 and 230) and pinpoints six other sites as significant (102, 211, 357, 379, 401 and 468). None of these was previously associated with resistance. Additionally, sites 74 and 184, associated so far only with resistance to NRTI drugs, and site 179, previously connected to resistance to the other NNRTI drugs, turned out to play a significant role in acquiring resistance to Nevirapine.
Since the training data do not contain any information on the history of treatment, some of the newly discovered sites might have emerged as a result of past therapies. For instance, sites 74 and 184, known to contribute to resistance to NRTI drugs, were selected as important to the resistance to Nevirapine, which is an NNRTI drug. Therefore their role in the resistance to Nevirapine should be further investigated.
Similarly, sites that are often mutated in other HIV subtypes 28-30 (e.g. 35, 43, 122, 123, 135, 200, 211) should be treated with caution. While Kearney et al 28 consider sites 35, 83, 122, 123, 135, 200 and 211 as "non-resistance polymorphic", Kantor and Katzenstein 29 suggest that mutations at these sites (in particular 43 and 211) may play a significant role in drug resistance evolution and increase viral fitness. Site 118, which our method selected as important to resistance to some NRTI drugs, was previously considered important but in 2005 was removed from the list of resistance-inducing mutations. 31 The remaining sites discovered by our method yet not included in the expert rules 5,6,30 deserve further attention. Indeed, mutations at sites 208, 218 and 228 have previously been suspected 32 to contribute to resistance.
The presented predictive models are derived from a large, although limited, number of training examples. Even a very large number of examples would not guarantee that they cover all possible sorts of mutations. A particular advantage of rough sets is the ability to deal with contradictions. A rule that classifies an object to e.g. the "susceptible OR resistant" class is actually very useful since it indicates that, with the present knowledge, the object can belong to any of these classes. If such a rule has significant coverage, it suggests directions for further research. This ability is especially important in the context of medical applications, where it is more desirable to perform an additional examination than to misclassify the case.
While statistically sound, our findings should be subjected to further experimental validation and we see them as a navigational aid for clinicians and molecular biologists.

Conclusion
The presented approach led us to the in silico discovery of several previously unknown mutations that contribute to resistance to RT inhibitors. Moreover, we discovered the exact values of the biochemical properties that will lead to resistance. This extends the applicability of our model to previously unseen cases. Last, but not least, this approach can be applied to a wide class of similar problems, such as analysis of influenza neuraminidase mutants resistant to drugs, protein engineering or efficient drug design.

Table 13. The coverage and the accuracy of the rules. For expert rules we compute accuracy and coverage using all the available examples. The "moderately resistant" cases are treated as "resistant". In the case of the rule-based model we compute accuracy and coverage using only the test set examples. This gives a pessimistic assessment of both measures but avoids the possible bias coming from the fact that the rules were derived from the training set. The underlined value indicates that the classifier was negated.

Colors correspond to the number of amino acids satisfying the constraint: all amino acids allowed; the most specific rule (up to 10 mutations lead to resistance).

Figure 1. Correlation matrix of physicochemical descriptors. The lower triangle contains bivariate scatter plots with a fitted line. The actual absolute values of the correlation are provided. The significance levels of the correlation are encoded in the following way: p = 0.001 (***); p = 0.01 (**); p = 0.05 (*); p = 0.1 (.).
69 multi-resistance complex and of the 151 multi-resistance complex. Included in 6 only.
First, we randomly divided each data set into a training set and an external test set. Each training set contained 80% of the sequences from the original data set and the remaining 20% of the sequences constituted the external test set. Both the training and the test set had the same distribution of the decision class (resistance) as the original data. Subsequently, we performed 10-fold cross-validation on the training set. The training data were randomly divided into ten subsets of equal size, D_i, i = 1, 2, …, 10. We then generated ten new training sets of sequences (N_i) by sequentially removing one of the D_i subsets from the original training set. Thus, the N_1 data set contained all the data but the D_1 subset, the N_2 data set contained all the data but the D_2 subset, and so forth. Thereafter we used each of the N_i training sets to build a rough set-based classifier. The classifier was then used to classify the objects from the remaining D_i subset. Therefore each sequence from the original data set was present once in a test set and nine times in a training set.
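The splitting scheme can be sketched as follows, assuming a simple per-class shuffle for the stratified 80/20 split and interleaved folds for the cross-validation. Class sizes and function names are illustrative only.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split indices into train/test, preserving the class distribution."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


def k_fold(indices, k=10, seed=0):
    """Yield (N_i, D_i): the training part and the held-out subset."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    subsets = [shuffled[i::k] for i in range(k)]
    for i, held_out in enumerate(subsets):
        training = [x for j, s in enumerate(subsets) if j != i for x in s]
        yield training, held_out


# Hypothetical class sizes, not the counts from Table 1.
labels = (["susceptible"] * 50 + ["moderately resistant"] * 30
          + ["resistant"] * 20)
train, test = stratified_split(labels)
folds = list(k_fold(train))
print(len(train), len(test), [len(d) for _, d in folds])
```

Each training index appears in exactly one held-out subset D_i and in the nine remaining training sets N_i, matching the description above.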

Figure 2. The strongest rules determining resistance to Abacavir. Amino acids are encoded using standard one-letter abbreviations. # indicates insertion of any type; "AA" is an amino acid observed in the data in the given resistance class; "[AA]" represents an amino acid observed in the data, but in the other resistance class and "aa" denotes an amino acid not observed in the data. "LHS support" is the number of examples satisfying the rule.

Figure 3. The strongest rules determining resistance to Nevirapine. Amino acids are encoded using standard one-letter abbreviations. # indicates insertion of any type; "AA" is an amino acid observed in the data in the given resistance class; "[AA]" represents an amino acid observed in the data, but in the other resistance class and "aa" denotes an amino acid not observed in the data. "LHS support" is the number of examples satisfying the rule.

Table 1. Number of resistance-annotated sequence examples per class.

Table 2. Physicochemical descriptors of amino acids used in this study.

Table 3. Sites selected by the MCFS as significant for resistance to Abacavir (NRTI). Only the top-scoring property is presented per site. Prevalence of mutations in the data and MCFS score are reported.

Table 4. Sites selected by the MCFS as significant for resistance to Delavirdine (NNRTI). Only the top-scoring property is presented per site. Prevalence of mutations in the data and MCFS score are reported.

Table 5. Sites selected by the MCFS as significant for resistance to Lamivudine (NRTI). Only the top-scoring property is presented per site. Prevalence of mutations in the data and MCFS score are reported.

Table 6. Sites selected by the MCFS as significant for resistance to Stavudine (NRTI). Only the top-scoring property is presented per site. Prevalence of mutations in the data and MCFS score are reported. Symbols represent the status of a site: * sites known to contribute to resistance to Stavudine; + sites where mutations are associated with resistance to some NRTI drugs but not to Stavudine; ++ sites where mutations contribute to resistance to NNRTI drugs; +++ sites that are not included in the literature. 5,6,30

Table 7. Sites selected by the MCFS as significant for resistance to Tenofovir (NRTI). Only the top-scoring property is presented per site. Prevalence of mutations in the data and MCFS score are reported.

Table 8. Sites selected by the MCFS as significant for resistance to Zidovudine (NRTI). Only the top-scoring property is presented per site. Prevalence of mutations in the data and MCFS score are reported.

Table 9. Sites selected by the MCFS as significant for resistance to Didanosine (NRTI). Only the top-scoring property is presented per site. Prevalence of mutations in the data and MCFS score are reported.

Table 10. Sites selected by the MCFS as significant for resistance to Nevirapine (NNRTI). Only the top-scoring property is presented per site. Prevalence of mutations in the data and MCFS score are reported.

Table 12. Results of the 10-fold cross-validation and the external test obtained by using the set of standard and the set of generalized rules. The underlined value indicates the use of a negated classifier. SD stands for standard deviation and RMSE for root mean squared error (WEKA provides RMSE instead of SD). The highest accuracy and AUC values are in bold.
