In Silico Biology 5, 0010 (2004); ©2004, Bioinformation Systems e.V.
Dagstuhl Seminar "Integrative Bioinformatics"
^{1} The Wallace H. Coulter Department of Biomedical Engineering at Georgia Institute of Technology and Emory University
313 Ferst Drive, Atlanta, GA 30332, USA
Email: Eberhard.Voit@bme.gatech.edu
^{2} Department of Biostatistics, Bioinformatics and Epidemiology, Medical University of South Carolina
303K Cannon Place, 135 Cannon Street, Charleston, SC 29425, USA
^{3} Department of Microbiology and Immunology, 6730 Medical Science Bldg. II
The University of Michigan, Ann Arbor, MI 48109-0620, USA
* Corresponding author
Edited by E. Wingender; received September 29, 2004; revised and accepted November 17, 2004; published December 23, 2004
Modern methods of high-throughput molecular biology render it possible to generate time series of metabolite concentrations and the expression of genes and proteins in vivo. These time profiles contain valuable information about the structure and dynamics of the underlying biological system. This information is implicit and its extraction is a challenging but ultimately very rewarding task for the mathematical modeler. Using a well-suited modeling framework, such as Biochemical Systems Theory (BST), it is possible to formulate the extraction of information as an inverse problem that in principle may be solved with a genetic algorithm or nonlinear regression. However, two types of issues associated with this inverse problem make the extraction task difficult. One type pertains to the algorithmic difficulties encountered in nonlinear regressions with moderate and large systems. The other type is of an entirely different nature. It is a consequence of assumptions that are often taken for granted in the design and analysis of mathematical models of biological systems and that need to be revisited in the context of inverse analyses. The article describes the extraction process and some of its challenges and proposes partial solutions.
Keywords: Biochemical Systems Theory, metabolic profile, network identification, pathway analysis, proteomics, S-system, time series
The rapid development of systems biology over the past few years has been driven by the advancement of experimental methods that generate high-throughput data characterizing the expression of genomes, activity patterns of proteins, and the simultaneous profiles of select groups of metabolite concentrations in vivo. These data usually provide snapshots of a biological system, which may be compared with corresponding snapshots of an alternative system, for instance, with the goal of assessing the differences between stressed and unstressed cells or of associating a disease pattern with specific physiological markers not present in healthy subjects. Some of the methods are furthermore suitable for obtaining in vivo measurements in short time sequence, which allows the generation of dynamic profiles. The repertoire of experimental high-throughput tools presently includes nuclear magnetic resonance [e. g., Szyperski, 1998; Neves et al., 2002], mass spectrometry [e. g., Goodenowe, 2003], 2-d gels [Gerner et al., 2002], different types of DNA arrays [e. g., Gupta and Oliver, 2003; Snijders et al., 2003], tissue array analysis [Alizadeh et al., 2001; Simon et al., 2004], transfection arrays [Tominaga, 2004], and protein kinase phosphorylation [Mckenzie and Strauss, 2003]. It is reasonable to expect that many other innovative ideas will provide additional high-throughput tools in the near future.
It is also merely a matter of time before some of these methods will be extended from small numbers of simultaneous measurements and relatively coarse time series to dense time sequences of many concentration or expression values. The resulting time series data will provide unprecedented insights into the functioning of cells in vivo. However, gaining these insights will require that methods be available for appropriately analyzing time series data with dynamical models and thus for extracting information on the structure and regulation of biological systems. This task of information extraction will be highly rewarding, but it will require two components beyond the generation of good data. One is the identification of a suitable modeling framework; the other is the establishment and fine-tuning of theory and algorithms supporting the extraction and interpretation of information from time series.
Several articles have discussed in detail why Biochemical Systems Theory (BST) [Savageau, 1969a; 1969b; 1976; Voit, 2000; Torres and Voit, 2002] is a very promising candidate for pathway identification tasks based on time series data [e. g., Voit and Savageau, 1982; Voit, 2000; Maki et al., 2002; Kikuchi et al., 2003; Voit and Almeida, 2004], and we will not repeat these arguments here. Instead, we will only briefly review features of this theory that are particularly pertinent in the present context and place our primary focus on some of the challenges that biological system identification from time profiles harbors.
BST comes in two variants that offer alternative options for representing biological systems, the Generalized Mass Action and the S-system forms [e. g., Shiraishi and Savageau, 1992]. For the identification of structure from time series data, the S-system variant seems particularly useful, especially if not much additional information about the biological system is available. In this representation, the rate of change in each pool (variable) is represented as the difference between influx into the pool and efflux out of the pool. Each flux term is summarily approximated by a product of power-law functions, so that the generic form of any S-system model is
dX_{i}/dt = α_{i} ∏_{j=1}^{n} X_{j}^{g_{ij}} − β_{i} ∏_{j=1}^{n} X_{j}^{h_{ij}},   i = 1, 2, …, n   (1)
where n is the number of state variables. The exponents g_{ij} and h_{ij} are called kinetic orders and quantify the effect of X_{j} on the production or degradation of X_{i}, respectively. A kinetic order of zero implies that the corresponding variable X_{j} does not have an effect on X_{i}. If the kinetic order is positive, the effect is activating or augmenting, and if it is negative, the effect is inhibiting. The multipliers α_{i} and β_{i} are rate constants that quantify the turnover rate of the production or degradation, respectively. It is noted that the S-system formulation allows the inclusion of m independent variables, so that the products in Eq. (1) extend to n + m. Because these independent variables remain constant during each analysis, they can be absorbed into the rate constants, and it is sufficient for our purposes here to consider the form in Eq. (1).
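As a concrete illustration, the generic form in Eq. (1) can be simulated directly. The following sketch (with parameter values that are purely illustrative and not taken from any system discussed here) integrates a two-variable S-system in which X_{2} is produced from X_{1} and, in turn, inhibits the production of X_{1} through a negative kinetic order:

```python
import numpy as np
from scipy.integrate import solve_ivp

# A two-variable S-system in the form of Eq. (1); all parameter
# values are illustrative. X2 is produced from X1, and X2 in turn
# inhibits the production of X1 (kinetic order g_12 = -0.5 < 0).
alpha = np.array([2.0, 1.0])   # production rate constants
beta  = np.array([1.0, 1.0])   # degradation rate constants
g = np.array([[0.0, -0.5],     # kinetic orders of the production terms
              [0.5,  0.0]])
h = np.array([[0.5, 0.0],      # kinetic orders of the degradation terms
              [0.0, 0.5]])

def s_system(t, X):
    # dX_i/dt = alpha_i * prod_j X_j^g_ij - beta_i * prod_j X_j^h_ij
    return alpha * np.prod(X ** g, axis=1) - beta * np.prod(X ** h, axis=1)

sol = solve_ivp(s_system, (0.0, 10.0), [1.0, 0.5])
print(sol.y[:, -1])  # both variables approach the steady state (2, 2)
```

Setting g_{12} to zero removes the regulatory interaction without altering the structure of the pathway, which is precisely the property that makes kinetic orders convenient handles for structure identification.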
As an example, consider the two simple pathways in Fig. 1, which differ in the feedback inhibition exerted by the end product X_{4}. BST associates this inhibition exclusively with the parameter g_{14}. Because this parameter represents an inhibitory effect, its value is negative, and the magnitude of this value indicates the strength of inhibition. Strong inhibition may thus be modeled with a g_{14} value of large magnitude, and weak inhibition with a value close to zero.
Several articles have appeared in the recent literature describing computational strategies, using S-systems, for this "inverse problem" of extracting information from time series data [e. g., Voit and Savageau, 1982; Maki et al., 2002; Kikuchi et al., 2003; Voit and Almeida, 2004; Veflingstad et al., 2004; Wang, 2004]. These articles primarily emphasized algorithmic issues, which are challenging and increase exponentially in difficulty with the size of the problem. However, algorithmic advances alone will not be sufficient to identify large systems from dynamic data. Recognizing this, Almeida and Voit, 2003, suggested making maximal use of other a priori biological information that might be available in addition to the time series data. For instance, the lack of any direct interaction between two variables allows the immediate elimination of all corresponding kinetic orders before the estimation begins.
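One simple way to organize such a priori eliminations is a connectivity mask, so that the optimizer only ever sees the kinetic orders that remain free. The sketch below is a hypothetical illustration of this bookkeeping (the connectivity matrix and all numerical values are invented for demonstration):

```python
import numpy as np

# Encode a priori knowledge that certain interactions are absent by
# fixing the corresponding kinetic orders to zero before estimation.
# True  -> interaction possible, kinetic order must be estimated
# False -> known absent, kinetic order fixed at 0, removed from search
may_interact = np.array([[False, True,  False],
                         [True,  False, True ],
                         [False, True,  False]])

def pack(g):
    """Extract only the free kinetic orders for the optimizer."""
    return g[may_interact]

def unpack(free_params):
    """Rebuild the full kinetic-order matrix, zeros where absent."""
    g = np.zeros(may_interact.shape)
    g[may_interact] = free_params
    return g

# Four free parameters instead of nine in the full 3 x 3 matrix:
g_full = unpack(np.array([0.8, 0.5, -0.3, 0.6]))
print(g_full)
```

Each known-absent interaction removes one dimension from the search space, which is why prior biological information pays off so directly in the inverse problem.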
While the use of information beyond the time series data themselves seems to be a very beneficial, if not crucial, component of dealing with the complexities of the inverse problems at hand, caution and a re-evaluation of some common modeling ideas are needed. We describe some of these issues here. Specifically, we revisit some tenets of model design that are generally taken for granted in biological systems modeling but may need to be relaxed when in vivo time series data are to be analyzed.
Direct versus indirect effects
The generic strategy for designing systems models calls for setting up one equation for each dependent (state) variable and including in this equation those and only those variables that directly affect it. For instance, if X_{1} is converted into X_{2} and then X_{3}, the equation describing the dynamics of X_{3} contains X_{2} but not X_{1}. This strategy is justified by the argument that the effect of X_{1} on X_{3} is "automatically" mediated through the dynamics of X_{2}. Nevertheless, indirect effects can create problems for inverse tasks of parameter and structure identification. After all, X_{1} does affect X_{3}, even if only indirectly, as is demonstrated in Fig. 2, where a bolus increase in X_{1} leads to subsequent, staggered rises and falls in X_{2} and X_{3} (Fig. 2a). The phase-plane plot of X_{3} versus X_{2} shows the direct substrate-product relationship between the two variables (Fig. 2b), but the analogous plot of X_{3} versus X_{1} also shows dependence (Fig. 2c). As a consequence, if the pathway structure is to be determined from time series data, one may incorrectly identify g_{31} > 0 and thus deduce a direct effect of X_{1} on X_{3}.
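This phenomenon is easy to reproduce in simulation. The sketch below (with illustrative first-order rate constants, not those of any real pathway) integrates the chain X_{1} → X_{2} → X_{3} after a bolus in X_{1}; although X_{1} never appears in the equation for X_{3}, the staggered peaks show that X_{3} nevertheless co-varies with X_{1}:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Linear chain X1 -> X2 -> X3 after a bolus of X1 (illustrative rates).
# Note that X1 does not occur in the equation for X3.
def chain(t, X):
    x1, x2, x3 = X
    return [-1.0 * x1,             # X1 is only consumed
            1.0 * x1 - 0.8 * x2,   # X2 is produced from X1
            0.8 * x2 - 0.5 * x3]   # X3 is produced from X2, no X1 term

sol = solve_ivp(chain, (0, 15), [2.0, 0.1, 0.1],
                t_eval=np.linspace(0, 15, 200))
x1, x2, x3 = sol.y

# The responses are staggered: X2 peaks first, X3 later, and a
# phase-plane plot of X3 versus X1 would still show dependence.
t_peak_x2 = sol.t[np.argmax(x2)]
t_peak_x3 = sol.t[np.argmax(x3)]
print(t_peak_x2, t_peak_x3)
```

An estimator confronted only with these three time courses has no trivial way of knowing that the dependence of X_{3} on X_{1} is entirely mediated by X_{2}.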
In theory, the fitting of the system of differential equations should identify g_{31} as not significantly different from zero, especially if non-zero kinetic orders are penalized in the estimation process [Kikuchi et al., 2003; Voit and Almeida, 2003], but practical applications to noisy data show that this is not always the case. For simple linear pathways, Vance et al., 2002, and Torralba et al., 2003, devised methods for inferring direct and indirect causality, but these methods are not likely to be applicable to more complex systems and need to be complemented with additional methods. Kikuchi et al., 2003, and others somewhat alleviated the problem by requiring many data sets that described responses to different systematic perturbations, but one has to ask how often such complete data are available. While the consideration of structural parsimony, implemented as a penalty for non-zero kinetic orders [Kikuchi et al., 2003; Voit and Almeida, 2003], or a model construction scheme with minimal numbers of parameters [Marino and Voit, 2004], alleviates the issues to some degree, it seems that the problem is in general unsolved at this point in time.
Precursor-product relationships
If within an unbranched pathway X_{i} is converted directly into X_{j}, the loss from X_{i} is equal to the production of X_{j}, a feature that is called precursor-product relationship. The relationship mandates that the structure and parameter values of the loss function in the equation of X_{i} equal those of the production function in the equation of X_{j}. For the inverse procedure of deducing parameter values from measured metabolic profiles one is therefore inclined to equate the two functions beforehand and to estimate only one common set of parameter values. Under ideal conditions, this assumption is certainly justified. However, real pathways may appear to be linear in idealized representations but are not always truly unbranched in vivo. Instead, secondary pathways may divert material from the primary pathway or add material that is produced in a converging path. For instance, a typical glycolysis model assumes that glucose is converted into glucose 6-phosphate (G6P) and that this step is the only source of G6P. In reality there are several branches leading to G6P, such as the reversible phosphoglucomutase reaction between G6P and G1P. Even if such branches have relatively small fluxes in comparison to the path of interest, they compromise the precursor-product relationships that were taken for granted based on the idealized model. Keeping the parameters of the degradation function of X_{i} and the production function of X_{j} separate solves the problem in principle: If the parameter values turn out to be the same (within the accuracy of the estimation procedure and the noise in the data), one may conclude that there are no significant branches (or, less likely, that additional incoming and outgoing branches cancel each other). For reasons of practicality and computational efficiency, however, any unnecessary increase in the number of parameters to be estimated leads to undesirable redundancies and should be avoided where possible.
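In code, an exact precursor-product relationship amounts to using one and the same flux term in both equations, as in the following sketch (all parameter values are illustrative); keeping the parameters separate, as advocated above for real data, would simply mean giving the two occurrences of the term their own constants:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Exact precursor-product relationship in the chain X1 -> X2: the
# efflux term of X1 is reused verbatim as the influx term of X2, so
# both equations share one rate constant and one kinetic order.
# All parameter values are illustrative.
beta1, h11 = 0.9, 1.0   # shared term: loss of X1 == production of X2
beta2, h22 = 0.4, 0.8   # degradation of X2

def coupled(t, X):
    x1, x2 = X
    flux_12 = beta1 * x1 ** h11   # one flux term used in both equations
    return [-flux_12, flux_12 - beta2 * x2 ** h22]

sol = solve_ivp(coupled, (0, 5), [1.0, 0.2], t_eval=np.linspace(0, 5, 50))

# Every unit of material lost from X1 reappears in X2, so the total
# X1 + X2 can only decrease through X2's own degradation term.
print(sol.y[:, -1])
```

If a hidden branch diverted material between X_{1} and X_{2}, this mass-balance property would be violated, and forcing the shared-term constraint onto the estimation would bias the fitted parameters.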
For the same reasons, one has to ask to what degree reversibility of reactions needs to be considered. In principle, very few reactions are truly irreversible, but allowing for the potential of reversibility adds to the system several parameters that need to be estimated [cf. Sorribas and Savageau, 1989].
Intermediate steps and time delays
In some cases one might not be interested in intermediate metabolites of a linear unbranched pathway. For instance, Curto et al., 1998, reduced the first ten reaction steps of purine metabolism [cf. Stanbury et al., 1983: chapter 50] to a single step representing the overall conversion of phosphoribosylpyrophosphate into inosine monophosphate. For steady-state analyses, such a reduction is inconsequential because the influx, efflux and all intermediate fluxes are equivalent and all intermediate metabolites are in balance. Even dynamic analyses are often only mildly affected as long as there are no branches diverging from or converging to the linear pathway. However, the intermediate steps do lead to some time delay and to a distortion of any signals flowing through the pathway, and these effects may be significant enough to cause difficulties in the estimation of parameter values from in vivo time series data. The same is true for other processes that affect a metabolic system, such as transport through membranes, transcription, and translation.
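The delay introduced by intermediates is easily demonstrated. The sketch below (with illustrative, identical first-order rates) compares a three-step chain with a lumped one-step model and shows that the chain's product initially lags far behind:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Compare a three-step chain S -> I1 -> I2 -> P with a lumped
# one-step model S -> P; k is an illustrative first-order rate.
k = 1.0

def chain3(t, X):          # full chain with two intermediates
    s, i1, i2, p = X
    return [-k*s, k*s - k*i1, k*i1 - k*i2, k*i2]

def lumped(t, X):          # intermediates removed
    s, p = X
    return [-k*s, k*s]

t_eval = np.linspace(0, 3, 100)
sol3 = solve_ivp(chain3, (0, 3), [1, 0, 0, 0], t_eval=t_eval)
sol1 = solve_ivp(lumped, (0, 3), [1, 0], t_eval=t_eval)

# At early times the lumped model overestimates product formation:
# the chain's product lags because signals must pass the intermediates.
print(sol3.y[3, 30], sol1.y[1, 30])
```

When parameters are estimated from in vivo time profiles, such a lag in the data is easily misattributed to wrong rate constants rather than to omitted intermediate steps.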
One should note that the exclusion of intermediate steps, while reducing the number of parameters, is not always desirable. For instance, one might be interested in maximizing the production of certain intermediate metabolites in a linear pathway due to their industrial importance. As an example, diacetyl is a valuable by-product of pyruvate metabolism in Lactococcus lactis, and there is great industrial interest in developing metabolic strategies for enhanced production [Neves, 2001]. The process involves multiple reactions in series, and it is necessary for a useful model to include these steps so that it is possible to maximize the selectivity of diacetyl with respect to other, less desirable metabolites.
Homogeneity of populations
Most biochemical in vivo experiments are executed with populations of cells, and it is commonly assumed that a set of average parameter values is sufficient for modeling their responses. This may or may not be a valid assumption. If cells differ significantly in their numerical characteristics, observed and modeled time courses may differ, thereby complicating the inverse problem of identifying structure from time series data. As an example, consider the uptake of a bolus of substrate from the surrounding medium. Because the concentration of substrate is highest at the beginning and subsequently decreases due to uptake by the cells, the time course of substrate concentration should be expected to show the steepest decrease at the beginning. However, one often observes an s-shaped decrease [e. g., Ramos et al., 2004], which is incompatible with such a model. The discrepancy must therefore be due to other mechanisms that are not considered in the model. For instance, an s-shaped uptake characteristic is obtained if one supposes that the cells begin taking up substrate with different response times. This is illustrated in Fig. 3, where substrate uptake according to a default linear transport model is compared to a model where the uptake speed is normally distributed among cells.
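A minimal sketch of such population heterogeneity, implemented here with normally distributed uptake onset times (one possible variant of the distributed response times discussed above; all numerical values are illustrative), reproduces the s-shaped decrease:

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.stats import norm

# Default first-order uptake versus a population whose cells start
# taking up substrate at normally distributed onset times.
k = 1.0                  # per-cell uptake rate constant (illustrative)
mu, sigma = 2.0, 0.7     # mean and spread of the uptake onset times

def uniform_uptake(t, S):
    return [-k * S[0]]                           # all cells active at t = 0

def staggered_uptake(t, S):
    active = norm.cdf(t, loc=mu, scale=sigma)    # fraction of active cells
    return [-k * active * S[0]]

t_eval = np.linspace(0, 6, 100)
s_unif = solve_ivp(uniform_uptake, (0, 6), [1.0], t_eval=t_eval).y[0]
s_stag = solve_ivp(staggered_uptake, (0, 6), [1.0], t_eval=t_eval).y[0]

# The heterogeneous population yields an s-shaped time course: nearly
# flat at first instead of falling steepest at the beginning.
print(s_unif[10], s_stag[10])
```

A model fitted under the homogeneity assumption cannot reproduce the flat initial phase, no matter how its parameters are adjusted, which is exactly the kind of structural discrepancy discussed here.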
Dynamic changes in independent variables
One typically assumes that enzymes and other independent variables remain constant during a model analysis. However, if the observed data consist of time series beginning with some perturbation and extending over a time period of half an hour or more, it is possible that the perturbation triggers changes in gene expression that alter some of the enzyme activities relevant to the pathway of interest. If so, the dynamics of the activities is inconsistent with the model assumptions, thereby potentially creating difficulties during the fitting of dynamic metabolite data.
The vignettes above indicate that idealized models may experience difficulties capturing in vivo phenomena exactly. This conclusion is of course nothing new per se, because the definition of a model entails that not all discrepancies between a model and reality can ever be resolved. However, the discrepancies caused by simplification and abstraction appear in a new disguise and, ironically, become more blatant the better the data are. In this light it is useful to develop strategies that at least allow the exploration of these discrepancies. For instance, it would be beneficial to answer questions like 'If we could model substrate uptake so that it captured the observations well, would that eliminate other discrepancies between model responses and observed time series or not?' We show in the following that the decoupling of equations and the use of smoothed time traces provide such a tool.
Several authors have shown that it is possible to decouple systems of differential equations for purposes of parameter estimation. At first this may seem to violate the integrity of a model system and its mathematical representation, but the procedure has been proven valid if applied correctly [Voit and Almeida, 2004]. In some cases the decoupling was accomplished by estimating slopes of all (n) time courses at many (N) time points, which reduced the system of n coupled differential equations to n × N algebraic equations [Voit and Savageau, 1982; Voit, 2000: chapter 5; Voit and Almeida, 2004]. In a different implementation, Maki et al., 2002, estimated the parameters of one differential equation at a time, using for the time courses of the other variables the raw data or a smoothed analogue.
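A minimal sketch of the slope-based variant illustrates the idea on synthetic data. Here the kinetic orders are assumed known (for instance from prior information), so that after the slopes are estimated only the rate constants remain, and each decoupled equation becomes a linear least-squares problem; all numbers are invented for demonstration:

```python
import numpy as np

# Slope-based decoupling: once slopes are estimated from the time
# courses, each differential equation becomes a set of algebraic
# equations "slope_k = alpha*X1_k^0.6 - beta*X2_k^0.9" that can be
# fitted independently. All values below are synthetic.
t = np.linspace(0.0, 5.0, 200)
X1 = 2.0 * np.exp(-0.5 * t)                    # "measured" course of X1

# Synthesize X2 from dX2/dt = 2.0*X1^0.6 - 1.5*X2^0.9 (Euler steps).
X2 = np.empty_like(t)
X2[0] = 0.5
dt = t[1] - t[0]
for i in range(len(t) - 1):
    X2[i+1] = X2[i] + dt * (2.0 * X1[i]**0.6 - 1.5 * X2[i]**0.9)

# Decoupling step: estimate slopes, then solve one linear system for
# the rate constants of X2's equation alone (kinetic orders fixed).
slopes = np.gradient(X2, t)
A = np.column_stack([X1**0.6, -X2**0.9])
(alpha, beta), *_ = np.linalg.lstsq(A, slopes, rcond=None)
print(round(alpha, 2), round(beta, 2))  # close to the true 2.0 and 1.5
```

The decisive advantage is that no differential equations need to be integrated inside the estimation loop, which is where most of the computational cost of the coupled approach resides.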
Another application of decoupling systems of differential equations, also inspired by the need for computational simplification, targeted the numerical calculation of non-steady-state ion transfer caused by diffusion, migration and convection in uni-dimensional electrochemical systems [Volgin et al., 2003]. In this case the authors showed that a set of coupled continuity equations could be solved in the form of independent difference transfer equations for each electrolyte species. Similarly, to simulate the activation dynamics in anisotropic cardiac tissues more efficiently, Clements et al., 2004, decoupled a system of elliptic-parabolic partial differential equations, which resulted in a computationally more convenient reaction-diffusion equation.
Whereas decoupling in these examples has been used to ameliorate computational difficulties, the same type of decoupling may also be used for testing hypotheses regarding the validity of implicit or explicit model assumptions. The generic situation is that some observed time course, say for X_{k}, is not represented well with the model and that the discrepancies cannot be resolved through valid parameter adjustments. The idea is then to replace this model equation with the actual data, which in the terminology of differential equations corresponds to considering the data set as a forcing function. This replacement of equations may be implemented with data in either raw or some smoothed form and allows for various options. At one extreme, one may replace all equations with the observed data except for the equation for X_{k}. Solving this system shows whether the equation for X_{k} in itself is appropriate and the observed discrepancies are due to other components of the model. At the other extreme, one may be unable to model some metabolite appropriately and replace its system equation with the measured time profile as input. One example of this situation may be a transport or substrate uptake process that is more complex than the model allows, as described above. Another example is the modeling of ubiquitous metabolites like ATP, P_{i}, and NADH. These metabolites are involved in dozens if not hundreds of equations in vivo, and it is presently not feasible to model them all. If their time profile has been measured along with other components of the system, the data may be taken as inputs at face value or in a smoothed fashion, thereby allowing explorations of the rest of the system. Thus, by replacing one, several, or all equations except for one with actual data or smoothed analogues, it is possible to explore different types of scenarios that are otherwise very difficult to disentangle.
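The replacement of an equation by data can be sketched as follows: a measured time course for X_{1} (synthetic here) is smoothed by interpolation and used as a forcing function in the remaining equation for X_{2}, whose own differential equation is the only one left to solve (all parameter values are illustrative):

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.interpolate import interp1d

# Sketch of the forcing-function strategy: the equation for X1 is
# dropped and its (here synthetic) measured time course enters the
# X2 equation directly; all parameter values are illustrative.
t_data = np.linspace(0, 8, 30)
x1_data = 1.0 + 0.5 * np.sin(0.8 * t_data)     # "measured" X1 profile
x1_forcing = interp1d(t_data, x1_data, kind="cubic")  # smoothed input

def x2_only(t, X):
    # dX2/dt = alpha*X1(t)^g - beta*X2^h, with X1 taken from the data
    x1 = float(x1_forcing(t))
    return [1.2 * x1**0.5 - 0.9 * X[0]**0.8]

sol = solve_ivp(x2_only, (0, 8), [0.3], t_eval=np.linspace(0, 8, 50))
print(sol.y[0, -1])
```

If the one remaining equation now fits the X_{2} data well, the earlier discrepancies evidently originated elsewhere in the system; if it still fails, the equation for X_{2} itself is suspect.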
Ultimately the most important question is of course how well the model is able to predict situations that had not been used for model identification or data fitting, and one would consider a model as more valuable the further it can be extrapolated toward untested conditions, such as changes in substrate concentrations, pH, or oxygen conditions in the surrounding medium. In general, the predictive power of a model can only be assessed by comparison with experimental data. However, as Martin, 1997, demonstrated for a somewhat similar situation in forestry research, if data are available on several related conditions, it is possible to design a model that interpolates validly between these scenarios, thereby facilitating a more comprehensive set of simulation analyses.
Time series data contain a wealth of information about the system from which they were observed. For the modeler, such data pose a very rewarding, yet difficult challenge. The difficulties of extracting information fall into two categories. First, the computational issues of estimating parameters in large nonlinear systems are daunting, and while several groups have addressed different aspects of the estimation process, these issues are still far from being solved in general. The second category of challenges is due to the well-recognized fact that models are more or less crude abstractions of reality. In the given context of structure identification, an important consequence is that it is not a priori clear how closely a given model is able to capture observed data even under the best conditions, that is, with optimal parameter values. This question cannot be answered in generic terms, but it was shown here where discrepancies between model and data may emerge and how to explore them. The strategy of decoupling systems of equations provides a tool for such explorations. Originally used to ease some of the computational issues of parameter estimation, decoupling was shown here to provide a modeling tool that allows the investigator to pinpoint where discrepancies might originate. The decoupling and replacement of equations with raw data permits the focused analysis of the effects of a single variable within a system that is taken as observed and no longer modeled. Alternatively, the decoupling and replacement strategy allows partial modeling of a system in the presence of poor data fits that are caused by one or a few system components that are simply not adequately captured by the model. Decoupling and replacement of equations does not resolve the root cause of a poor fit, but it does aid the researcher in pinpointing and diagnosing the topological location of a problematic component within the integrated system.
This work was supported in part by a Quantitative Systems Biotechnology grant (BES-0120288; E.O. Voit, PI) from the National Science Foundation and a National Heart, Lung and Blood Institute Proteomics Initiative (Contract N01-HV-28181; D. Knapp, PI). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the sponsoring institutions.