Grammatical Immune System Evolution for reverse engineering nonlinear dynamic Bayesian models.

An artificial immune system algorithm is introduced in which nonlinear dynamic models are evolved to fit time series of interacting biomolecules. This grammar-based machine learning method learns the structure and parameters of the underlying dynamic model. In silico immunogenetic mechanisms for the generation of model-structure diversity are implemented with the aid of a grammar, which also enforces semantic constraints of the evolved models. The grammar acts as a DNA repair polymerase that can identify recombination and hypermutation signals in the antibody (model) genome. These signals contain information interpretable by the grammar to maintain model context. Grammatical Immune System Evolution (GISE) is applied to a nonlinear system identification problem in which a generalized (nonlinear) dynamic Bayesian model is evolved to fit biologically motivated artificial time-series data. From experimental data, we use GISE to infer an improved kinetic model for the oxidative metabolism of 17beta-estradiol (E(2)), the parent hormone of the estrogen metabolism pathway.


Introduction
The goal of systems biology is to understand the network of interacting genes, proteins, and biochemical reactions that regulate systemic properties of an organism. A realistic biological network, rather than a static graph, should contain nodes that produce time-varying input/output and edges that represent fl ux through the system [1]. For many biological pathways there is a lack of accurate mathematical models capable of capturing causal dependencies and mechanistic information contained in kinetic data. Thus, one of the goals of computational biology is to develop data-driven algorithms to automate the identifi cation of mathematical model structure from time series. Dynamic Bayesian networks (DBNs) have been used to identify linear relationships between variables in gene networks from time series [2,3,4]. However, in addition to time-dependence, a realistic biological network should capture nonlinear relationships. We use a fl exible differential equation formalism that allows us to treat time-dependent nonlinearities in the form of a nonlinear dynamic Bayesian model (NDBM).
Specifi cally, we use a Kalman Filter (KF) to optimize model parameters and determine the accuracy of models that track time series. The KF is a tracking and estimating tool widely used in engineering and has been applied to the inference of NDBMs from biological time series [7]. The KF is a Bayesian method in the sense that it provides a mechanism to incorporate prior information from a previous time point to update the state of the system at the current time point. For linear models, the KF reduces to a DBN; however, the KF becomes the more general NDBM when a nonlinear model is specifi ed. Although the KF provides an effi cient, recursive method for identifying the parameters of a NDBM, the enormity of the search space of possible nonlinear model structures calls for heuristic search methods, such as evolutionary algorithms [5,6,7], to identify the underlying structure. An important challenge in timeseries bioinformatics addressed in this work is the automated, data-driven identifi cation of mathematical model structures.
Previously we implemented a hybrid grammatical evolution (GE) approach to infer nonlinear model structures and parameters from time series [7]. The advantage of GE lies in the simplicity of translating variable-length binary string genomes into programs using a context-free grammar [8]. However, the advantage of the context-free grammar can lead to problems during crossover and mutation because model segments downstream of a mutation or crossover-point that possessed an evolutionary advantage in the parent chromosome may have a completely different meaning in the context of the offspring. Thus, there are theoretical reasons that crossover and mutation can be destructive operators in GE [9]. Novel crossover methods, such as homologous crossover [10] and tree-based crossover [11], which attempt to preserve model building blocks, typically show no improvement over standard GE crossover techniques. In the application domain of human genetics, it has been shown that the performance of GE was not statistically different from a random search [12].
In Sections 1-4 we introduce a new grammarbased, artifi cial immune system algorithm called Grammatical Immune System Evolution (GISE). To overcome the destructive nature of evolutionary operators in other grammar-based evolutionary algorithms, we introduce a hypermutation operator that preserves evolutionarily fi t model features. As we discuss in more detail in Sec. 2, we focus on hypermutation because of its important biological role in secreting high-affi nity antibodies with increased antiviral function. GISE takes advantage of a grammar's ability to restrict the search space based on domain knowledge, while at the same time limiting the damage caused by evolutionary operators. As discussed in Secs. 3 and 4, when generating programs, the grammar inserts nonterminal information into an untranscribed pre-program, and this information is excised after transcription and subsequent expression of the antibody (i.e. the dynamic model). Rather than acting on GE binary strings or directly on GP programs, GISE evolutionary operators act on these intermediate pre-programs. Mutation is initiated by a break in the untranscribed pre-program. The grammar then acts as an error-prone DNA repair mechanism using the nonterminal information previously inserted by the grammar as a repair signal. These nonterminal repair signals enforce semantic constraints imposed by the grammar, thereby preserving evolutionarily fit model segments.
We use GISE to automatically reverse engineer NDBMs to fi t time-series data simulated to include nonlinearity and interactions (Secs. 5.1 and 5.2). We demonstrate that GISE, unlike GE, greatly outperforms a Monte Carlo search for realistic nonlinear simulated models. In Sec. 5.3 we apply GISE to experimental time-series data for the oxidative metabolism of 17β-estradiol (E 2 ). We infer an improved model of the kinetics of this pathway that has been implicated in breast cancer. In this application of the immune system heuristic, the time series represents the antigen, the candidate models represent antibodies, and the goodness of fi t as determined by the KF is analogous to the binding affi nity.

Artifi cial Immune Systems
Despite having an estimated genome of fewer than 25,000 protein-encoding genes, the human adaptive immune system is able to recognize tens of millions of antigens. This seemingly limitless ability to generate diveristy has inspired the development of evolutionary algorithms that mimick features of the adaptive immune sytem. These algorithms are known as artifi cial immune systems [13], and they typically utilize high-level features of the immune system such as clonal selection, negative selection, and immunological memory. With the aid of a grammar, we attempt to simulate the molecularlevel immunogenetic mechanisms that generate diversity in antigen receptors, such as immunoglobulin in B cells and the T cell receptor (TCR) in T cells.
The ability to create such a diverse array of antigen-specifi c antibodies from a fi nite set of genes from the germline is accomplished primarily through a genetic reshuffl ing process known as V(D)J recombination and a diversifi cation process known as immunoglobulin hypermutation (IHM). However, there is experimental evidence to suggest that somatic mutation is the key to creating highaffi nity antibodies. It is well known that virusinduced antibodies in infants exhibit poor functional activity compared to that of adults. For example in rotavirus it was shown that, although infant antibody gene sequences use the same immunodominant gene segments as adults to respond to the virus, there was a marked lack of somatic mutations in the infant antibody sequences [14]. Recently we found that human adult antibodies specifi c to rotavirus bind in a region enhanced by somatic mutations and these mutations account for the enhanced affi nity of the adult antibodies [15,16]. Thus, the grammar-based immune system algorithm described in this work focuses on creating a computational model of the immunogenetic mechanism of hypermutation for generating dynamic model diversity because of the important biological role of IHM in the secretion of high-affi nity antibodies with increased antiviral function.

Grammatical Immunoglobulin Hypermutation
The mechanism of IHM is not completely understood. IHM has been linked to transcription and requires the presence of immunoglobulin enhancers [17]. In addition to random nuculeotide substitutions, certain hotspot motifs have an increased susceptibility to mutation [18]. Mutations are introduced into the antibody genes by an error-prone polymerase during the repair of double stranded breaks (DSBs). The resulting antibody generally preserves the antibody architecture established by V(D)J recombination. It has been shown that hypermutation can occur in any antibody sequence provided there is an active promoter and immunoglobulin enhancer. Promoters and enhancers are types of regulatory regions that help control the transcription of genes. Enhancer DNA sequences bind transcription factors called enhancer-binding proteins that increase the rate of transcription. In the grammar-based hypermutation operator used by GISE, the grammar plays the role of the DSB repair mechanism and nonterminal tags are used as promoters and enhancers.
The steps of Grammatical Hypermutation (GHM) employed by GISE are illustrated in Figure 1, in which an initial program x 2 + xy leads to the fi nal mutated program x 2 + y 2 . A grammar is a set of production rules that can produce sentences in any language. Sentences created by our grammar are systems of coupled nonlinear differential equations. We use a formal notation for describing the syntax of a context-free grammar as a set of production rules that consist of terminals (model elements) and nonterminals (the production rules themselves) [7,8,19]. For simplicity, only (var) nonterminals are used in the Figure 1 illustration. In order to preserve the model architecture, the grammar creates an "untranscribed" model with explicit nonterminal elements (Step a). In the untranscribed program, terminals are enclosed by nonterminal tags (e.g. 〈var〉 ⋅ 〈/var〉). These nonterminal tags are similar to enhancers that will be spliced out after transcription. In Step b, a DSB occurs at a random location (hatched region), which specifi es the nonterminal of the model that will be mutated. Error-prone repair is initiated in Step c. It is determined from the grammar that there are three possible terminals for the (var) enhancer, which are {x, y, z}, and from these terminals y is chosen by chance to replace x. The replacement is carried out in Step d. The grammar tags of the mutated model are spliced out in Step e, and the expressed antibody (mathematical model) is introduced to the antigen (i.e. the program fi tness is evaluated on the data).
The nonterminal tags in the untranscribed program allow the grammar to interpret information from the enhancer and maintain context within the model. This helps GHM prevent the destruction of good model elements downstream of the DSB. This is especially important during model refi nement. For example, let us assume the target model in Figure 1 is, indeed, x 2 + y 2 . Whereas GHM in GISE will not modify the y terminal from the original model downstream of the targeted mutation x (Fig. 1b), there is a fi nite probability that GE would mutate the final terminal in addition to the terminal targeted for mutation. The context-free nature of the grammar causes downstream GE program elements to be sensitive to small changes in the mutated element; small changes that may lead GE down a very different grammar path. In addition, the degree of context-free sensitivity will depend on the complexity of the grammar. Through the enhancer tags (e.g. 〈var〉 ⋅ 〈/var〉), GISE essentially adds context to the context-free grammar by forcing  affi nity for the data die by apoptosis and are removed from the population. Then in line 7, hypermutated models are generated from the highest affi nity models to repopulate the set of models. Thus, in line 8, the population is fi lled with the best models (the highest affi nity models and their hypermutated models), which become the basis for the next cycle of selection and hypermutation beginning at line 3. After N cycles have been completed, the best models are inspected.
The GISE steps in Figure 2 are primarily concerned with learning the structure of the NDBM. Crucial to the success of the algorithm, however, is the estimation of model parameters and the calculation of goodness-of-fit (affinity, Aff(m,D)) of the model m for the data D, which we carry out using the Unscented Kalman Filter (UKF). We use a discrete nonlinear deterministic state space model for the state at observed time point t k + 1 in terms of its predecessor at time t k : (1) and (2) models to respect the constraints of the model prior to the action of the GHM operator.

Generalized Dynamic Bayesian Network
Hypermutation in Figure 1 is the major operator of the full GISE algorithm, illustrated as pseudocode in Figure 2. GISE evolves a population of NDBM models of population size |M| for N cycles of hypermutation and selection. The population of models at cycle i is given by M(i), where each model in the GISE analogy represents an antibody encoded by a plasma B cell. For cycle 0 (line 2) an initial population of models M(0) is generated at random from the user-specified grammar Γ. These models represent the germline set of antibodies that will be somatically hypermutated for cycles 1… N (line 3). The user specifi es the fraction α of models that are hypermutated at each cycle i with the help of the grammar Γ. During a cycle, the binding affinity (Aff(m,D), line 5) is calculated for all models m with respect to the data D. The time-series D acts as the antigen in Aff(m,D), and the ability of the model (antibody) to fi t (bind to) the data (the antigen) is computed using the KF discussed below. In line 6, the models with the lowest where f satisfi es the coupled set of differential equations and y represents the derivative of a molecular quantity with respect to time. The measurement noise is represented by η k and enters into our calculations implicitly through the covariance which becomes part of the covariance matrix P yy calculation below. The operator E[⋅] is the expectation value. The elements of the vector λ of dimension D λ are the parameters of the model f. Since the model parameters are unknown, we treat the elements of λ as state variables to be tracked with process noise ⑀: The process noise in the stochastically varying parameter vector is represented by ⑀, but this noise vector enters into our calculations only indirectly through the covariance matrix Equations (1,2,3,4) can be viewed as a representation of a nonlinear dynamic Bayesian model (NDBM). These equations reduce to a DBN when the vector of nonlinear functions f is replaced by a linear transformation of y with a matrix A (i.e. Ay + ⑀). Next we describe an optimal, recursive method-the UKF-to estimate the NDBM parameters λ and the model fi tness Aff(m,D).
The diagram in Figure 3 illustrates the Bayesian estimate of the states of a given model. It is convenient to create an augmented state x of the system in which the dependent variables y are augmented by the parameter vector λ: integrated out to t + Δt. We use a fourth-order runge-kutta solver to integrate the model Equation 3.
In the a posteriori estimate of the augmented state x t t + Δ (Eq. 5), we use the Kalman gain matrix K to blend the difference between the experimental data z t and the predicted prior estimate y t t + Δ . In order to make the dimensions agree between the augmented and unaugmented vectors (Fig. 3 and Eq. 6), the Kalman matrix includes a constant contribution Q in the covariance matrix P x x y y corresponding to the parameters λ in the augmented state. This constant covariance contribution mimics the uncertainty of the variables being tracked. The larger the assumed value of Q, the more relaxed the search for the optimal parameters. The recursive engine of the Kalman fi lter involves correcting the predicted moments at time point k + 1 with the observed data using equations and K P P k x y y y + − = 1 1 .
For convenience of notation, we use k as the time index, so that if the system is at time t, then the current state is x k and the state at time t + Δt is x k + 1 . The a posteriori estimate of the augmented state at time step k + 1, given by Equation (6), consists of the a priori prediction x k k +1| at the previous time step and a correction term proportional to the difference between the observed data z k and the estimate of the unaugmented state y k k +1| at the previous step. In Equation (7), P xy is the covariance matrix for the deviation of the x and y states from their a priori estimates. The Kalman gain or blend matrix K, updated by Equation (7), is chosen to minimize the trace of the a posteriori error covariance matrix Pˆx x because the trace of this covariance matrix equals the sum of the squared errors of the components of the posterior estimate of x ( ) . i.e. x k k + + 1 1 ⏐ We use the unscented transformation (UT) [20] to effi ciently and accurately estimate the fi rst two moments of the state distribution undergoing transformations during the prediction of the future state of the system and during the correction of the state with the observation model. The UT estimate of the posterior mean and covariance is accurate to third order for any nonlinearity, and the UKF is much faster than Monte Carlo Markov chain particle filters.
Further details on the unscented transformation for the deterministic calculation of the statistics of a random variable undergoing a nonlinear transformation and the application to Kalman fi ltering for state space modeling can be found in Refs. [7,20,21].

Theoretical and Experimental Validation of GISE
Testing novel system identification algorithms on real biological data is challenging because the true causal connections underlying the system often are unknown. Thus, before applying GISE to real experimental data, we apply it to simulated data from two biological pathway targets for method validation. We simulate 50 time points for each varying quantity y, and we assume Q = 0.015 process noise for estimation of parameters λ. When the UKF algorithm is called by GISE, all model parameters are initialized to 0.001. This corresponds to an uncoupled system of differential equations, which is the least biased initialization of the model parameters. An iterative process of parameter optimization is implemented to achieve precise model parameters [7]. The vector of parameter estimates at the final time point are compared with the parameters from the previous iteration, which are used as the input for the next iteration. The UKF process is continued until convergence of the parameters to within a tolerance of 0.0001 is achieved or a maximum loop count of 200 is reached.
To avoid over-fi tting, we use the Akaike Information Criterion (AIC) [22] for distinguishing between models f(y(t), λ, ⑀(t)). Thus, the model fi tness (affi nity) is given by: where ln[L] is the maximum log-likelihood and D λ is the complexity of the model, or the number of parameters λ in f. AIC avoids over-fi tting by balancing the bias of the model predictions with the complexity of the model structure. Once a full population of models has been evaluated, GISE copies the top 20% (the quantity α in Fig. 2) of the most fi t models to the next cycle . For each of these models, four hypermutated copies are created to fill the remainder of the cycle's population by GHM, carried out as described in Figure 1.

Application to operon simulations
The first multiline, nonlinear dynamic target pathway is based on the following operon model: where k = (0.9, 1.0, 0.6), γ = (1.0, 0.6, 0:8), and θ 1 = 0.9 with initial conditions y(0) = (1.0, 0.6, 0.0). First introduced in Ref. [23] and then extended in Ref. [24], the operon model continues to be a useful framework for modeling biological systems [25,26]. Equation (9) represents a model of an hypothetical single-gene regulatory network involving a negative feedback loop with measured gene products y. A single gene with mRNA concentration y 1 produces an enzyme with concentration y 2 . Enzyme y 2 catalyzes a reaction step leading to metabolite y 3 , which inhibits the gene that codes for the enzyme. Parameters k and γ are production and degradation constants, respectively, and θ modulates the inhibitory Hill function.
The following is the type of grammar used by GISE to evolve a gene regulatory network based on an operon model. construction, 〈random-int〉 assigns a random index to the dependent variable from the possible number of dependent variables. The number of parameters is tracked during model construction so that when a 〈param〉 nonterminal is encountered, 〈incremented-int〉 assigns the next index to the parameter array p. Figure 4 shows the cumulative percent of runs (out of 100) that identify the simulated target model of Equation (9) for GISE compared with Monte Carlo search (MCS). Each bar gives the percentage of runs that have identifi ed the correct model at the given cycle. This is an important evaluation of GISE because GE was previously shown to perform no better than a random search for supervised learning in the application domain of human genetics [12]. To make the comparison fair, our MCS constructs models from the same grammar used by GISE. Both algorithms perform the same number of function evaluations (population size = 50), only MCS does not use GHM. Each GISE run takes 4-5 hours to run on a single 3.2 GHz Intel Xeon processor. For all cycles, GISE has a much higher percentage of hits, and by cycle 10, 73% of the GISE runs have hit the target model versus 23% for MCS. This demonstrates the GHM operator in GISE promotes learning by preserving useful model components.
Models are constructed from the grammar beginning with the 〈model-expr〉 nonterminal and subsequent nonterminals are replaced by the grammar rules recursively until the model contains only terminals. Multiple choices for a rule are delimited by a vertical bar. Subtraction operators are not needed because the UKF is able to determine the sign of the model parameters. To manually add parsimony pressure, we add an extra 〈linear〉 to the 〈function〉 choices. When the grammar encounters a 〈variable〉 nonterminal during model We stopped GISE and MCS at a number of cycles suffi cient to demonstrate that GISE performs better than MCS for the given model search space. If a large enough number of random models are generated, even MCS will identify the target model, but this type of brute force approach is not practical for higher-dimensional data where more appreciable learning is needed. In practice, the GISE population size and number of cycles should be tuned to account for the size of the search space, or the number of molecular 〈model-expr〉 ::= 〈function〉 + 〈function〉 〈function〉 ::= 〈linear〉 (0) ⏐ 〈linear〉 (1) ⏐ 〈regulatory〉 (2) 〈linear〉 ::= (〈param〉 * 〈variable〉) 〈regulatory〉 ::= (〈param〉/(〈param〉 + 〈variable〉)) 〈variable〉 : quantities being modeled and for the computational resources available.

Application to S-system simulations
The second, more complex, target model is based on the S-system (synergistic) formalism: where the initial conditions are y(0) = (0.7, 0.12, 0.14, 0.16, 0.18). This S-system gene network of fi ve differential equations has been used to test various system identifi cation algorithms [5,27,28]. In this two-gene network, y 1 is the mRNA produced from gene 1, y 2 is the enzyme encoded by gene 1, and y 3 is an inducer protein catalyzed by y 2 . The quantity y 4 is the mRNA produced from gene 2 and y 5 is a regulator protein encoded by gene 2. Positive feedback from y 3 and negative feedback from y 5 are assumed in the production of mRNAs from the two genes. The S-system gene network model of Equation 10 is a more realistic test of GISE because of the larger number of variables and because we are using a grammar with opeon-model components instead of biasing the grammar with S-system components used to simulate the gene network based on Equation 10. We use the same operon grammar used in Sec. 5.1 with the exception that we add another rule to the (model-expr) nonterminal in order to more reliably fi t this larger system: 〈model-expr〉 ::= 〈function〉 + 〈function〉 (0) From Figures 5a-e, we see that GISE is able to fi t the S-system simulated data using an operon grammar with reasonable accuracy. A GISE run on this larger data set to fi nd the optimum model structure and parameters takes 7-8 hours to run on a single 3.2 GHz Intel Xeon processor. Each subfigure shows a different y profile for the top 3 GISE model predictions. Even though model 2 has a better goodness of fi t, one can see from Figures 5a-e that model 1 fi ts the data better, possibly over fi tting. Model 2 has a higher fi tness because it has one fewer parameter than the other two models and hence is more parsimonious. Model 3, with the worst goodness-of-fi t of the three top models, may actually be the most desirable   Table 1 and Eq. 10). Model 1 is the next best fi t, but it has one false positive connection for dy 1  by simultaneously stimulating cell proliferation and gene expression via the estrogen receptor and by causing DNA damage via their oxidative products, the 2-OH and 4-OH catechol estrogens [29,30]. To better understand estrogen metabolism in the breast, the authors in Ref. [31] employed gas and liquid chromatography with mass spectrometry to measure E 2 , the catechol estrogens 2-hydroxyestradiol (2-OHE 2 ) and 4hydroxyestradiol (4-OHE 2 ) as well as methoxyestrogens and estrogenglutathione conjugates. Using this data, a mathematical model was constructed in Ref. [32] using quasi steady-state assumptions and experimental rate constants. To test the computational method described in the current paper, we employed GISE to learn a nonlinear model structure and model parameters directly from the phase-I estrogen time series. The resulting GISE model (Eq. 11) showed an improved fit to the data (Fig. 7) over the previously constructed model in Ref. [32]. Regarding causal connections, there was no signifi cant change in the connectivity among the top models inferred by GISE, suggesting that these connections likely are true positives.  Table 1 with Eq. 10). In model 1 by contrast, dy 2 /dt has a false positive connection, dy 3 /dt has a false negative connection, and dy 5 /dt has a false positive connection. We defi ne a false positive connection as a model term that predicts a spurious connection between variables, and a false negative connection as a true connection between variables that was not predicted by the model. In Figure 6, we corrupt data simulated by the model given by Equation 10 with 10% Gaussian observation noise and show that the GISE and the UKF can handle the type of noise that often arises in biological data.

Estrogen metabolism pathway
We now apply GISE to experimental estrogen metabolism time series. Estrogens have been implicated in the development of breast cancer Table 1. Analytical form of the top 3 GISE models predicted for the target coupled system simulated by Equation 10. Model predictions are given in Figure 5. GISE models are listed in order of goodness of fi t. The models were constrained by the grammar to use an operon formalism. True positive and false positive connections in these models can be determined by comparison with Equation 10.

Discussion
In this study we introduced Grammatical Immune System Evolution (GISE) and applied it to the evolution of nonlinear dynamic Bayesian models (NDBMs) to automatically reverse engineer models from biologically motivated time-series simulations and from real experimental data from the estrogen metabolism pathway. Grammars allow one to incorporate domain-specifi c knowledge and thereby reduce the search space. Grammatical Evolution (GE) has these advantages, but in a particular real-world application GE's performance showed no statistically signifi cant difference between the performance of a simple Monte Carlo search, which is likely due to the destructive nature of mutation and crossover in GE. Motivated by this, we used the GISE formalism to create a non-destructive somatic mutation operator. This somatic hypermutation operator is based on the diversity-generating mechanism used in the human adaptive immune system. The purpose of this operator was to traverse the model search space efficiently while preserving evolutionarily fit model features.
We showed that GISE can routinely infer the correct NDBM from time series for three interacting variables by cycle 10 with a modest population size of only 50, and that GISE performs signifi cantly better than a Monte Carlo search with the same population size and number of cycles (Sec. 5.1), thus demonstrating that GISE promotes learning by preserving useful model Information Sciences high performance computing resource made possible in part by the NSF award NS-0420614. The authors would like to thank Fritz F. Parl and Philip S. Crooke for supplying data on the estrogen metabolism time series used in this work.

Disclosure
The authors report no confl icts of interest.