
Syst Biol. 2018 Jul; 67(4): 616–632.

Information Criteria for Comparing Partition Schemes

Tae-Kun Seo

¹ Department of Biological Sciences, Korea Polar Research Institute, 26 Songdomirae-ro, Yeonsu-gu, Incheon 406-840, Republic of Korea

Jeffrey L Thorne

² Bioinformatics Research Center, Box 7566, North Carolina State University, Raleigh NC 27695-7566, USA

Stephen Smith, Associate Editor

Received 2017 Jun 16; Revised 2017 Dec 7; Accepted 2017 Dec 17.

Abstract

When inferring phylogenies, one important decision is whether and how nucleotide substitution parameters should be shared across different subsets or partitions of the data. One sort of partitioning error occurs when heterogeneous subsets are mistakenly lumped together and treated as if they share parameter values. The opposite kind of error is mistakenly treating homogeneous subsets as if they result from distinct sets of parameters. Lumping and splitting errors are not equally bad. Lumping errors can yield parameter estimates that do not accurately reflect any of the subsets that were combined whereas splitting errors yield estimates that did not benefit from sharing information across partitions. Phylogenetic partitioning decisions are often made by applying information criteria such as the Akaike information criterion (AIC). As with other information criteria, the AIC evaluates a model or partition scheme by combining the maximum log-likelihood value with a penalty that depends on the number of parameters being estimated. For the purpose of selecting an optimal partitioning scheme, we derive an adjustment to the AIC that we refer to as the AIC equation M1 and that is motivated by the idea that splitting errors are less serious than lumping errors. We also introduce a similar adjustment to the Bayesian information criterion (BIC) that we refer to as the BIC equation M2. Via simulation and empirical data analysis, we contrast AIC and BIC behavior to our suggested adjustments. We discuss these results and also emphasize why we expect the probability of lumping errors with the AIC equation M3 and the BIC equation M4 to be relatively robust to model parameterization.

Keywords: AIC, BIC, information criteria, model comparison, multilocus analysis, partition scheme comparison, phylogenomics

Probabilistic models of DNA and protein sequence change have central roles in phylogenetics and molecular evolution. Many models have been proposed, but it is not always obvious which are most appropriate. With multilocus data, model choice can become especially difficult because the best model for one subset of the data may be suboptimal for another. The challenge of how to organize data into subsets that should be analyzed by shared parameter values has become increasingly central as improvements in DNA sequencing technology have led to bigger data sets.

A simple and intuitive approach is to concatenate multilocus sequence data and regard the merged result as a single locus. However, drawbacks of this "concatenation" procedure exist and "separate analysis" has been suggested as an alternative (e.g. Adachi et al. 2000; Cao et al. 2000a,b; Nikaido et al. 2003; Nishihara et al. 2007). In separate analysis (also referred to as partitioned analysis), some model parameters may be shared among loci but other parameters may be distinct to individual loci. The concatenation approach has been criticized for ignoring heterogeneity among loci because different genes may support different tree topologies for reasons ranging from incomplete lineage sorting to horizontal gene transfer to introgression (Leigh et al. 2011; Anderson et al. 2012). With concatenation, locus-specific information is lost. In addition, evolutionary heterogeneity among loci is not limited to topology. Even when multiple loci support the same topology, natural selection and/or mutation can cause the nucleotide substitution process to vary among loci. For instance, when multiple loci are generated by an identical tree topology but according to different branch lengths, ignoring the resulting heterotachy (Lopez et al. 2002) can cause the inferred topology to differ from the one that is shared among the individual loci (e.g. Chang 1996; Kolaczkowski and Thornton 2004; Seo 2008). For various reasons, it would be preferable to handle variation among loci when variation exists.

Although separate analysis can have advantages, it also has disadvantages. Because evolutionary parameters are separately estimated for each locus, the total number of estimated parameters can be much greater for separate analysis than for concatenation. There is a trade-off between model fit and estimation uncertainty (Hastie et al. 2009). As the number of parameters increases, model fit tends to improve but the variances of estimated parameters become bigger.

While they did not focus on partitioned analysis, Lemmon and Moriarty (2004) did a careful simulation study regarding underparameterization and overparameterization in phylogenetics. They found that both have negative consequences, but those from underparameterization were more severe. Thus, it is crucial to balance the number of parameters and estimation uncertainty. Another drawback of separate analyses is that they require more computation than concatenation. Therefore, for a given set of multilocus data, careful consideration is needed for whether all loci should be handled separately or whether some should be concatenated. One goal is to determine an optimal "partition scheme," in which data partitions with similar evolutionary properties are merged and in which partitions with dissimilar properties are not merged.

Previous studies (Li et al. 2008; Lanfear et al. 2012, 2014) have noted that partition schemes can be regarded as statistical models and compared via conventional model selection criteria such as the likelihood ratio test (LRT), the Akaike information criterion (AIC) (Akaike 1974), or the Bayesian information criterion (BIC) (Schwarz 1978). These criteria can then be employed as part of an exhaustive search that examines all possible partition schemes to find the best one. Alternatively, a heuristic search can be used along with the selection criteria. This is often desirable because the number of possible partition schemes grows rapidly as the number of loci increases. A Bayesian approach to simultaneously estimate the number of partitions and their model parameters (Wu et al. 2012) is another possible alternative.

Two sorts of errors can be made in phylogenetic partitioning. These errors are lumping partitions together when they should be separately treated (i.e. underparameterization) and splitting loci into distinct subsets when they should be grouped together (i.e. overparameterization). Lumping errors can yield bias such as systematic errors in topology estimation (e.g. Chang 1996; Kolaczkowski and Thornton 2004; Seo 2008) whereas splitting errors can cause additional parameter uncertainty. Here, we introduce and justify two model selection criteria that we term the AIC equation M5 and BIC equation M6 because they are modifications designed for partitioning decisions of the widely-used AIC and BIC. The AIC equation M7 and the BIC equation M8 are predicated on the assumption that lumping errors have more serious consequences than splitting errors.

Theory

We motivate the new information criteria by showing the connection between the AIC equation M9 and a likelihood ratio test of the null hypothesis of partition homogeneity versus the alternative of partition heterogeneity. We then review the connection between the AIC and the Kullback–Leibler divergence. We follow this with an explanation of how the AIC equation M10 relates to the Kullback–Leibler (KL) divergence between the truth and a particular partition scheme of interest when there is homogeneity between partitions. We complete this section by introducing the BIC equation M11.

Introduction and the Likelihood Ratio Test as Motivation

Suppose that a K-partition scheme is to be compared with the "concatenated" 1-partition scheme. The parameter vector of model equation M12 for partition equation M13 of the K-partition scheme will be denoted equation M14. The log-likelihood of the equation M15 th partition with model equation M16 is defined as

equation M17

where equation M18 is the equation M19 th aligned sequence column of the equation M20 th partition and equation M21 is the sequence length of the equation M22 th partition. When the entire collection of possible partitions is being considered, we use the summation sign equation M23 to represent this 1-partition concatenation scheme. The model for the 1-partition scheme is termed equation M24 and its log-likelihood is

equation M25

We treat and discuss generalizations later, but we begin by assuming that the concatenated model equation M26 and the models equation M27 for each partition equation M28 share the same parameterization (i.e. topology and substitution model) but may differ in parameter values. For example, all models could share the same topology and all could have the HKY (Hasegawa et al. 1985) parameterization ( equation M29 ).

The maximum likelihood estimators (MLEs) of the equation M30 th partition and of the 1-partition scheme are:

equation M31

where equation M32 implies "the values of the parameters equation M33 of model equation M34 that maximize equation M35 ". The equation M36 estimators stand for MLEs obtained from the equation M37 th partition with model equation M38. Note that equation M39 and equation M40 are expected to be unequal because the former values are obtained from only the equation M41 th partition whereas the latter values are obtained from the entire data set.

A limitation of our notation is that it does not reflect the frequently arising situation where some parameters are forced to have values that are shared among all partitions and other parameters are allowed to have values that are unique to each partition. While this limitation could be corrected by expanding the notation, we choose to adopt our simpler notation and instead focus on the cases where either all parameter values are shared among partitions or all have distinct values for each partition. Still, as will be emphasized later, the concepts that we discuss and the theory that justifies the AIC equation M42 and BIC equation M43 also apply to the case where some parameters have values that are shared among partitions and others do not.

Consider the likelihood ratio test when the null hypothesis is a single partition (i.e. concatenation) and when the more general hypothesis is that each of the equation M44 partitions shares the parameterization of the 1-partition model but that each has its own parameter values. Because the null hypothesis is a special case that is nested within the more general one, the likelihood ratio test statistic can be approximated as having a equation M45 distribution when the null hypothesis is true. Specifically,

equation M46

(1)

where equation M47 parameters are estimated according to the null hypothesis and where equation M48 parameters are separately estimated for each partition in the K-partition scheme. When the null hypothesis of partition homogeneity is correct, Equation (1) implies

equation M49

(2)

because equation M50 is the expected value of a equation M51 random variable. Here, the " equation M52 " sign indicates that the expectations are equal. Therefore,

equation M53

(3)

equation M54

(4)

Equation (2) shows that the expected log-likelihood of a concatenated model can be used to approximate the expected log-likelihood of a K-partition model when partitions are homogeneous (or vice versa). The equation M55 term on the left side of Equation (3) happens to be the AIC equation M56 penalty for a concatenated model. It is the same penalty as would be assigned by the AIC. The equation M57 term on the right side of Equation (4) is the AIC equation M58 penalty for a K-partition model and the equation M59 term on the right side of Equation (3) reflects the extra cost of the act of partitioning. In contrast, the AIC penalty for a K-partition model would be equation M60 and this can be much heavier than the AIC equation M61 penalty when equation M62 is large. As a result, the AIC is more prone than the AIC equation M63 to favoring concatenated models.
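The step from the likelihood ratio test to the two penalties can be written out explicitly. The following is only a sketch in assumed notation, because the displayed equations are not reproduced here: \ell_k(\hat\theta_k) stands for the maximized log-likelihood of partition k, \ell_\Sigma(\hat\theta_\Sigma) for that of the concatenated data, and p for the number of parameters estimated per partition (so the test has Kp - p degrees of freedom). It uses only the standard fact that a chi-square random variable has mean equal to its degrees of freedom.

```latex
\begin{align*}
2\Big[\sum_{k=1}^{K}\ell_k(\hat\theta_k)-\ell_\Sigma(\hat\theta_\Sigma)\Big]
  &\sim \chi^2_{(K-1)p}
  && \text{(null hypothesis: homogeneous partitions)}\\
\mathrm{E}\Big[2\sum_{k=1}^{K}\ell_k(\hat\theta_k)\Big]
  -\mathrm{E}\big[2\,\ell_\Sigma(\hat\theta_\Sigma)\big]
  &= (K-1)p\\
\mathrm{E}\big[-2\,\ell_\Sigma(\hat\theta_\Sigma)\big]+2p
  &= \mathrm{E}\Big[-2\sum_{k=1}^{K}\ell_k(\hat\theta_k)\Big]+2p+(K-1)p\\
  &= \mathrm{E}\Big[-2\sum_{k=1}^{K}\ell_k(\hat\theta_k)\Big]+(K+1)p
\end{align*}
```

Under this reading, the concatenated model carries a penalty of 2p, the K-partition model carries 2p plus an extra (K-1)p for the act of partitioning, and the ordinary AIC penalty for K separately parameterized partitions would be 2Kp.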

While the LRT approach to model comparison has some attractive features, one limitation is that the LRT approach assumes that the more restricted "nested" statistical model is true. The AIC merely aims to select the candidate that is closest to the truth and it relies on an approximation that is expected to be "excellent" when the assumed model is "good" (Burnham and Anderson 2002). Also, the AIC equation M64 differs from the LRT that inspires Equation (4) because it is not predicated on the substitution model and tree topology being correct (see Part 1 of the Appendix). However, Equation (3) illustrates that the LRT and the AIC equation M65 penalty are closely connected when the substitution model and topology do happen to be correct.

Kullback–Leibler Divergence (KLD) and the AIC

Suppose that equation M66 homogeneous samples, ( equation M67 ), are obtained from the true but unknown distribution equation M68 , and we want to fit them with model equation M69 . The KLD between equation M70 and equation M71 is

equation M72

(5)
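The KLD can be illustrated numerically for discrete distributions. The sketch below is not taken from the paper: the "true" frequencies and the two candidate models are hypothetical, and the function names are ours. It shows that the candidate with the smaller divergence is also the one with the larger expected log-probability under the truth, because the term involving only the true distribution is the same for every candidate.

```python
import math

def kl_divergence(g, f):
    """Discrete Kullback-Leibler divergence: sum of g(x) * log(g(x) / f(x))."""
    return sum(gx * math.log(gx / fx) for gx, fx in zip(g, f) if gx > 0)

def expected_log_prob(g, f):
    """Expected log-probability of model f when the data come from g."""
    return sum(gx * math.log(fx) for gx, fx in zip(g, f) if gx > 0)

# Hypothetical "true" state frequencies and two candidate models.
truth = [0.3, 0.2, 0.3, 0.2]
model_a = [0.25, 0.25, 0.25, 0.25]
model_b = [0.28, 0.22, 0.29, 0.21]

for name, model in (("model_a", model_a), ("model_b", model_b)):
    print(name, kl_divergence(truth, model), expected_log_prob(truth, model))
```

Ranking candidates by divergence or by expected log-probability gives the same order, which is why an estimate of the expected log-likelihood is enough for model comparison.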

The model whose KLD is minimized is regarded as the best among model candidates. Because equation M73 does not vary among models, Equation (5) shows that the best model must be the one that maximizes equation M74. Akaike (1974) showed that equation M75 can be approximated using the MLE equation M76 and an adjustment for the number of parameters that are estimated. Written in our notation, Akaike (1974) showed

equation M77

(6)

equation M78

(7)

where equation M79 is the dimension of equation M80 and where the approximation requires equation M81 to be large. In Equation (7), the bracketed term is the unbiased estimator of equation M82. The equation M83 in this estimator corrects the bias and is necessary because the data set is used to get equation M84 and is then used again to approximate equation M85 (Konishi and Kitagawa 2004). The AIC is defined as

equation M86

(8)

and is derived from this unbiased estimator. When there are competing models, the one that maximizes its unbiased estimator of equation M87 (or equivalently that minimizes its AIC) is regarded as the best model. Applying the definition of Equation (8), the AIC scores of concatenated data are written in our notation as

equation M88

(9)

where equation M89 is the dimension of equation M90 (see also Lanfear et al. 2012, 2014; Li et al. 2008). For phylogenetic applications, the number of free parameters can be separated as

equation M91

(10)

where equation M92 is the number of tree branches and equation M93 is the number of the remaining parameters (e.g. parameters affecting nucleotide frequencies, transition/transversion ratios, rate heterogeneity among sites, etc.).
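The parameter count and the AIC score of a concatenated analysis can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the 6-taxon tree, and the log-likelihood value are hypothetical, the tree is assumed to be unrooted and fully bifurcating (2n - 3 branches for n taxa), and HKY+G is assumed to contribute three free base frequencies, one transition/transversion parameter, and one gamma shape parameter.

```python
def n_branches_unrooted(n_taxa):
    """Branches in a fully bifurcating unrooted tree with n_taxa tips."""
    return 2 * n_taxa - 3

def n_free_parameters(n_taxa, n_other):
    """Branch lengths plus the remaining substitution-model parameters."""
    return n_branches_unrooted(n_taxa) + n_other

def aic(max_log_lik, n_params):
    """Akaike information criterion: -2 * log-likelihood + 2 * parameters."""
    return -2.0 * max_log_lik + 2.0 * n_params

# Assumed HKY+G count: 3 base frequencies + 1 ts/tv parameter + 1 gamma shape.
p = n_free_parameters(n_taxa=6, n_other=5)       # hypothetical 6-taxon tree
print(p, aic(max_log_lik=-1234.5, n_params=p))   # log-likelihood is made up
```

Competing models fitted to the same data are then ranked by this score, with the smallest value preferred.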

Writing equation M94 to indicate that each partition equation M95 of the equation M96 total partitions has its own model equation M97, the AIC scores of partitioned data (Lanfear et al. 2012, 2014) are

equation M98

(11)

where equation M99 is the number of freely estimated parameters and equation M100. In the exhaustive comparison of partition schemes, the scores of Equation (11) are calculated for all candidate schemes and the minimum score determines the best one. Widely-used heuristic algorithms for choosing partition schemes (Li et al. 2008; Lanfear et al. 2012, 2014) involve repeatedly comparing a 2-partition scheme with a 1-partition scheme. Different equation M101 values in Equations (9) and (11) could be needed if the 1-partition and K-partition schemes adopted different tree topologies (e.g. a fully bifurcating topology in one scheme and a "star topology" in the other).

When analyzing multilocus data, a model in which corresponding branch lengths of different partitions differ only by a proportionality constraint (Yang 1996; Pupko et al. 2002) is often adopted. We will refer to this as the "linked branch length" (LBL) model (see also Lanfear et al. 2012).

With the LBL model, a set of branch lengths that will influence all partitions is estimated, as is a proportionality factor for each partition. The sum of proportionality factors is typically restricted to ensure uniqueness of MLEs and the degrees of freedom for the proportionality factors is equation M102 when there are equation M103 partitions. A model that has branch lengths being independently and separately estimated for all partitions can be used as an alternative to a proportional model. This model will be referred to as the "unlinked branch length" (UBL) model (see also Lanfear et al. 2012).

With multiple partitions, the LBL model can substantially reduce the number of free parameters when equation M104 because the number of free parameters for branch lengths is equation M105 with the LBL model whereas it is equation M106 for the UBL model. However, the more limited flexibility of the LBL model means it is less satisfactory than the UBL model for some data sets (e.g. Pupko et al. 2002). For the UBL and LBL treatments,

equation M107

(12)

Here, we focus on comparing a K-partition scheme with a 1-partition scheme. The aforementioned heuristic algorithms would have equation M108 but larger values of equation M109 are also possible. Using AIC, the 1-partition scheme is selected when equation M110 and the K-partition scheme is selected when the inequality is in the other direction.
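The bookkeeping behind the UBL and LBL parameter counts and the AIC comparison can be sketched as below. The counts are our reading of the verbal description, not a transcription of Equation (12): LBL is taken to use one shared set of branch lengths plus K - 1 free proportionality factors, UBL is taken to use a full set of branch lengths per partition, and the remaining substitution parameters are assumed to be estimated separately for every partition in both cases. All names and numbers are hypothetical.

```python
def branch_length_params(n_branches, n_partitions, linked):
    """Free branch-length parameters for a K-partition scheme."""
    if linked:
        # LBL: one shared set of branch lengths plus K - 1 free
        # proportionality factors (their sum is constrained).
        return n_branches + n_partitions - 1
    # UBL: every partition gets its own full set of branch lengths.
    return n_partitions * n_branches

def total_params(n_branches, n_other, n_partitions, linked):
    """Branch-length parameters plus per-partition substitution parameters."""
    return (branch_length_params(n_branches, n_partitions, linked)
            + n_partitions * n_other)

def aic_prefers_concatenation(loglik_concat, p_concat, logliks_parts, p_parts):
    """True when the 1-partition AIC is lower than the K-partition AIC."""
    aic_1 = -2.0 * loglik_concat + 2.0 * p_concat
    aic_k = -2.0 * sum(logliks_parts) + 2.0 * p_parts
    return aic_1 < aic_k

# Hypothetical case: 9 branches, 5 other parameters, 4 partitions.
print(total_params(9, 5, 4, linked=False), total_params(9, 5, 4, linked=True))
```

With a single partition the two treatments coincide, so the difference in parameter counts only appears once the data are split.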

Kullback–Leibler Divergence and the AIC

Our AIC equation M123 modification of the AIC has penalty terms that are intended to reflect how much improvement in model fit would be expected if partitions are homogeneous but sequence data are analyzed by separately estimating parameters from each partition. These penalties can therefore be interpreted as stemming from the act of partitioning. Our suggestion is that the best candidate partition scheme should be selected after accounting for the act of partitioning. The AIC equation M124 penalty is not as heavy as the AIC penalty of equation M125 in Equation (11). Heavy penalties favor simple models (e.g. concatenation) and can lead to selection of a 1-partition model even when partitions are highly heterogeneous. The idea underlying the AIC equation M126 is to have a model selection criterion that places all candidate schemes on an equal footing by using the appropriate bias correction term according to partition homogeneity. After this bias adjustment, we reason that the best partition scheme is the one with the optimal score.

Consider the situation in which equation M127 samples are separated into equation M128 partitions. The size of each partition will be equation M129 ( equation M130 ) and the MLE of the 1-partition scheme will be denoted equation M131. The KLD for the equation M132 th partition is

equation M133

(13)

A weighted average of Equation (13) with weights equation M134 yields the KLD when the model of interest has heterogeneous partitions,

equation M135

(14)

Minimizing the AIC of the UBL parameterization in Equation (12) can therefore be seen to be equivalent to minimizing equation M136.

Now, consider the case where the UBL model is used but the equation M137 partitions are actually homogeneous. Part 2 of the Appendix outlines one reason why the drawbacks of incorrectly partitioning homogeneous data do not seem especially severe. From Equation (7) and the argument outlined in Part 3 of the Appendix, we can show that the proper bias correction for "partitioning homogeneous data" is equation M138:

equation M139

(15)

This means that either a 1-partition scheme or a K-partition scheme can be employed when there is homogeneity among partitions, but the proper bias correction depends on the number of partitions.

The bias correction for partitioning homogeneous data can be contrasted to a bias correction when partitions are heterogeneous. Whereas the bias correction for partitioning homogeneous data is equation M140, the more extreme correction for partitioning heterogeneous data is equation M141. The bias correction for partitioning homogeneous data closely corresponds to the likelihood ratio test (see Equation (3)).

Based on this idea and the argument outlined in Part 3 of the Appendix, we can define the AIC equation M142 scores of a K-partition scheme and a concatenated scheme. As above, we assume that each of the equation M143 partitions has the same model parameterization as the concatenated sequences (i.e. equation M144 ). With this assumption, the AIC equation M145 of the K-partition scheme is

equation M146

(16)

where equation M147 and represents the difference between the total number of parameters in the partition scheme ( equation M148 ) and the total number in the concatenation scheme ( equation M149 ). When a UBL model is adopted, equation M150. When an LBL model is adopted, equation M151. In contrast, the AIC equation M152 for the concatenated model is

equation M153

(17)

In Equation (16), the summed log-likelihoods represent the model fit of the data and the remaining portion ( equation M154 ) works as a penalty. Larger equation M155 values arise from complex partition schemes and therefore complex schemes are accompanied by higher penalties. The equation M156 therefore accounts for the trade-off between partition fit and partition complexity. The 1-partition scheme is selected when equation M157 and the K-partition scheme is selected when the inequality is in the other direction.

With the UBL model, the AIC equation M158 score for the K-partition scheme is

equation M159

(18)

We see from Equation (3) that the AIC equation M160 scores of the concatenated model (see Equation (17)) and the K-partition model (see Equation (18)) are expected to be about the same when the partitions are homogeneous. If the equation M161 partitions are heterogeneous, the AIC equation M162 score of the K-partition model is therefore expected to be lower. For this reason, we suggest selecting the K-partition model when its score is less than that of the concatenated model.

We note that the AIC equation M163 score of Equation (16) can be converted to the AIC score by replacing the equation M164 term with the heavier penalty of equation M165. For the UBL parameterization, one way to interpret the AIC and AIC equation M166 penalties is that the AIC has a penalty of equation M167 each time a parameter is estimated whereas the AIC equation M168 has a penalty of equation M169 the first time a parameter is estimated but a penalty of just equation M170 for additional partitions from which the parameter is estimated. The fact that the penalty of equation M171 in Equation (16) is lighter than that of equation M172 in Equation (12) implies that the AIC equation M173 can detect heterogeneity better than the AIC. This is because a moderate improvement of model fit represented by equation M174 can dominate the penalty term of equation M175 more easily than that of equation M176.
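A small numerical comparison may make the difference between the two penalties concrete. The sketch below is not the authors' implementation. It assumes that the adjusted penalty equals twice the number of parameters in the concatenated model plus one for every additional parameter introduced by partitioning, which is what the chi-square expectation argument around Equation (3) suggests. Function names, the parameter count, and all log-likelihood values are hypothetical.

```python
def aic_k_partition(logliks, p_total):
    """Ordinary AIC of a K-partition scheme with p_total free parameters."""
    return -2.0 * sum(logliks) + 2.0 * p_total

def adjusted_k_partition(logliks, p_concat, p_total):
    """Assumed adjusted score: 2 per parameter of the concatenated model
    plus 1 per additional parameter introduced by partitioning."""
    extra = p_total - p_concat
    return -2.0 * sum(logliks) + 2.0 * p_concat + extra

def concat_score(loglik, p_concat):
    """Score of the 1-partition scheme; the same under both criteria."""
    return -2.0 * loglik + 2.0 * p_concat

# Made-up numbers: 4 partitions, UBL, 14 parameters per partition.
p, K = 14, 4
parts = [-240.0, -251.0, -239.5, -248.0]   # per-partition log-likelihoods
whole = -1000.0                            # concatenated log-likelihood

base = concat_score(whole, p)
print("AIC picks K-partition:     ", aic_k_partition(parts, K * p) < base)
print("adjusted picks K-partition:", adjusted_k_partition(parts, p, K * p) < base)
```

In this made-up case the gain in fit is large enough to offset the lighter adjusted penalty but not the ordinary AIC penalty, so the two criteria disagree.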

Results and Discussion

Simulation Studies

We performed simulations to compare information criteria. All simulations employed the topology and branch lengths illustrated in Figure 1 to randomly generate data sets consisting of four partitions. To simulate partition homogeneity, all branches in all partitions had length equation M205. To simulate heterogeneous partitions, all branches in partition equation M206 ( equation M207 ) had branch length equation M208. The number of simulated sites in partition equation M209 was set to equation M210. This means that the concatenated (1-partition) scheme had equation M211 sites. The values of equation M212 that were explored are equation M213, equation M214, equation M215, equation M216, equation M217, and equation M218. The Seq-Gen software (Rambaut and Grassly 1997) was employed to randomly generate equation M219 data sets via the HKY substitution model (Hasegawa et al. 1985) and a five-category discrete-gamma treatment of rate heterogeneity among sites (Yang 1994b). The frequencies of A, C, G, and T were respectively 0.3, 0.2, 0.3, and 0.2 while the values of parameters for rate heterogeneity among sites ( equation M220 ) and transition/transversion ratio ( equation M221 ) were both set to 5.0.

Figure 1. [Image not reproduced; object name syx097f1.jpg]

All simulated data sets were analyzed both with a 4-partition scheme and a concatenated (1-partition) scheme. For each of the two schemes, each simulated data set was analyzed with each of eight nucleotide substitution models and each of these cases was explored both with rate homogeneity among sites and with discrete-gamma rate heterogeneity (five rate categories). The eight substitution models that were used for inference are denoted: JC (Jukes and Cantor 1969), K2 (Kimura 1980), F81 (Felsenstein 1981), F84 (Felsenstein 1989), HKY (Hasegawa et al. 1985), T92 (Tamura 1992), TN93 (Tamura and Nei 1993), and GTR (Yang 1994a). For the analyses of the simulated data, we did not attempt to find the best substitution model and tree topology. Instead, we compared the 1-partition scheme and 4-partition schemes when both assumed the true tree topology and when both assumed the same substitution model. Also, the UBL parameterization was used for investigating the 4-partition schemes so that branch lengths of each partition were independently estimated. Maximum likelihood inference was conducted via Version 4.8 of the baseml program of the PAML software (Yang 2007).

Table 1 shows how often the 4-partition scheme was selected with the AIC and the AIC equation M225 when the truth was that partitions are homogeneous (i.e. equation M226 ). Most of the simulation results for the AIC show a low proportion (<0.1%) of incorrectly selecting the 4-partition scheme. In contrast, the equation M227 incorrectly selects the 4-partition scheme about 50% of the time. This higher proportion is expected because the equation M228 criterion is designed with the idea that "splitting" errors (incorrectly separating partitions) are less serious than "lumping" errors (incorrectly concatenating partitions). The observation of an approximately 50% error rate is reasonable in light of the fact that the equation M229 criterion is designed so that its expected value is the same for a 4-partition and 1-partition scheme when the truth is partition homogeneity.
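The roughly 50% splitting rate can be reproduced with a short Monte Carlo sketch. This is an idealized illustration, not a re-analysis of the simulations: it assumes the likelihood ratio statistic is exactly chi-square distributed under homogeneity, that the adjusted criterion picks the partitioned scheme when the statistic exceeds the number of extra parameters, and that the AIC does so when the statistic exceeds twice that number. The value of 14 parameters per partition is hypothetical.

```python
import random

def selection_rates(df, n_draws=20000, seed=0):
    """Monte Carlo rates of choosing the K-partition scheme under homogeneity.

    Assumes the likelihood ratio statistic is exactly chi-square with df
    degrees of freedom, where df is the number of extra parameters.
    """
    rng = random.Random(seed)
    # A chi-square(df) variable is a gamma variable with shape df/2, scale 2.
    stats = [rng.gammavariate(df / 2.0, 2.0) for _ in range(n_draws)]
    adjusted = sum(s > df for s in stats) / n_draws        # penalty gap = df
    plain_aic = sum(s > 2 * df for s in stats) / n_draws   # penalty gap = 2*df
    return adjusted, plain_aic

# Hypothetical case: 4 partitions with 14 parameters each, so df = 3 * 14.
adjusted, plain_aic = selection_rates(df=42)
print(f"adjusted criterion: {adjusted:.3f}, AIC: {plain_aic:.4f}")
```

Because the mean of a chi-square variable sits slightly above its median, the simulated rate for the adjusted criterion falls a little below one half, while the AIC threshold lies far in the upper tail.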

Table 1.

Proportion of simulations where the 4-partition scheme was selected rather than the 1-partition (concatenation) scheme by AIC and by AIC equation M230 when partitions were actually homogeneous

Table 2 shows simulation results for the AIC and the AIC equation M233 when partitions are heterogeneous. In this situation, the 4-partition scheme should be selected over the 1-partition scheme and failure to do so is a "lumping error." Both the AIC and AIC equation M234 criteria perform well when partitions consist of relatively large numbers of sites. However, Table 2 reveals a marked contrast when partitions have fewer sites. In the situation of equation M235 sites, the AIC makes a lumping error for nearly all simulated data sets whereas the AIC equation M236 is quite unlikely to make these errors. The results show that the equation M237 has higher "sensitivity" than the AIC. That is, the equation M238 detects heterogeneity better than the AIC when partitions are heterogeneous. With regard to making lumping errors, we also note that the equation M239 appears to be relatively robust to model misspecification in comparison with the AIC.

Table 2.

Proportion of simulations where the 4-partition scheme was selected rather than the 1-partition (concatenation) scheme by AIC and by AIC equation M240 when partitions were actually heterogeneous

When the number of sites per partition is in a biologically plausible range, both the BIC and BIC equation M243 are more prone than the AIC to favor models that are less parameterized. In practice, the increased penalties on more parameterized models mean that both the BIC and the BIC equation M244 are less likely than the AIC to make splitting errors. The results of Table 1 show that the AIC is very unlikely in our simulation settings to make a splitting error by choosing the 4-partition scheme over the 1-partition scheme when the truth is partition homogeneity. Therefore, it is not surprising that we do not observe splitting errors with the heavier BIC and BIC equation M245 penalties for our simulation conditions. Because splitting errors are not observed with the BIC and the BIC equation M246, we do not include tables of them for these criteria.

Table 3 shows the performance of the BIC and equation M247 when the truth is that the heterogeneous 4-partition scheme should be selected. As expected, the equation M248 detects partition heterogeneity better than BIC. That is, the equation M249 makes fewer lumping errors than the BIC. However, from the comparison of Tables 2 and 3, we observe that the equation M250 is less sensitive than the equation M251 and even less than the equation M252. This is because the equation M253 has heavier penalties than the equation M254 and the equation M255. Also, the equation M256 displays a marked sensitivity to the assumed substitution model. For example, the observed proportion of selecting the 4-partition scheme with the JC model is 0.998 when equation M257 and it is only 0.093 with the GTR+G model. In contrast, the equation M258 appears to be more robust than the equation M259 to model misspecification.

Table 3.

Proportion of simulations where the 4-partition scheme was selected rather than the 1-partition (concatenation) scheme by BIC and by BIC equation M260 when partitions were actually heterogeneous

The choice of model selection criterion can also affect branch length inferences. Table 4 considers the sums of branch lengths (i.e. the tree lengths) that were inferred from the simulated data with the different information criteria. It shows the specific case where the number of sites equation M263 was 100 and where both the true and assumed substitution models were HKY+G. Table 4 shows that lumping errors can have more serious impacts than splitting errors on branch length estimates. Also, it indicates the potential value of the AIC equation M264 when downstream inferences that rely upon branch length estimates (e.g. divergence time and evolutionary rate estimates) are important (see Example 2 of the following section).

Table 4.

Tree length inferences from different information criteria when Partition 1 ( equation M265 ) had 100 sites and when the true and assumed substitution models were both HKY+G. If the information criterion selected the 1-partition scheme for a simulated data set, then the inferred tree length for the concatenated model was recorded for all 4 partitions. If the 4-partition scheme was selected, then the tree lengths separately inferred for each partition were recorded. Standard deviations of tree length inferences are shown in parentheses below the sample means that were inferred from the 1000 simulated data sets.
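The recording rule described in this caption can be sketched as a small helper. The function names are ours, and the use of the sample standard deviation with an n - 1 denominator is an assumption, since the caption does not say which form was used.

```python
def recorded_tree_lengths(selected_k, concat_length, partition_lengths):
    """Tree lengths recorded for each partition of one simulated data set.

    selected_k is the number of partitions in the scheme chosen by the
    information criterion (1 means concatenation).
    """
    if selected_k == 1:
        # The concatenated estimate is recorded for every partition.
        return [concat_length] * len(partition_lengths)
    return list(partition_lengths)

def mean_and_sd(values):
    """Sample mean and sample standard deviation (n - 1 denominator)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, var ** 0.5
```

Applying `recorded_tree_lengths` to every simulated data set and then `mean_and_sd` to the values collected for each partition gives one mean and one standard deviation per partition and criterion.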

Empirical Data Analysis

We analyzed two data sets to examine how partition scheme selection affects evolutionary inferences. Because some details of our ML estimation and partition selection procedures differ from those of the previous studies, our results differ slightly from the previous ones. However, our emphasis here is not on these differences but is instead on how partition scheme choice is affected by the choice of information criterion and how downstream evolutionary analyses are affected by partition selection.

Case 1: Li et al. (2008) analyzed 10 protein-coding genes of 56 ray-finned fish taxa. By separating each protein-coding gene into three codon positions, they started with a possible 30-partition scheme. They then performed hierarchical clustering to generate candidate schemes with 29, 28, equation M270 , 2, 1 partitions. Although the number of possible ways to partition 30 items is the Bell Number (Bell 1934) equation M271 and this is far more than 30, we considered only the 30 schemes from hierarchical clustering that Li et al. (2008) reported.
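
To give a sense of how quickly the number of possible partition schemes grows, the following minimal sketch computes Bell numbers with the Bell triangle recurrence. The function name `bell_number` is ours and is not part of any software cited in this article.

```python
def bell_number(n: int) -> int:
    """Number of ways to partition a set of n items (Bell number B_n)."""
    row = [1]  # first row of the Bell triangle, so B_0 = 1
    for _ in range(n):
        # Each new row starts with the last entry of the previous row;
        # every further entry adds the neighbour from the previous row.
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]


if __name__ == "__main__":
    for n in (3, 5, 10, 30):
        print(n, bell_number(n))
```

Running the script shows that the count for 30 items is many orders of magnitude larger than the 30 candidate schemes that hierarchical clustering produces, which is why an exhaustive comparison of schemes is not attempted.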

For each of these 30 schemes with the AIC and with the BIC, our analyses allowed different substitution models for different partitions by invoking the "models = all" option of PartitionFinder (Lanfear et al. 2012). This differs from the Li et al. (2008) analysis that assumed the GTR+G model for all partitions in all schemes. We also considered the AIC equation M272 and the BIC equation M273 for the 30 candidate partition schemes. All analyses with these two information criteria assumed the Neighbor-joining tree topology reported by the MEGA software (Tamura et al. 2013) when the 30 possible partitions of the Li et al. (2008) data were concatenated and then analyzed with the TN93 substitution model.

Our AIC equation M274 and BIC equation M275 analyses assumed that all partitions in each candidate scheme evolved according to a GTR substitution model with four discrete-gamma categories of rate heterogeneity among sites. Whereas Li et al. (2008) considered only the LBL parameterization, we considered both LBL and UBL parameterizations for the four information criteria with all candidate partition schemes.

Detailed results from applying the four information criteria to the Li et al. (2008) candidate partitions are given in Tables 1 through 4 of Supplementary Material, Online Appendix available on Dryad at http://dx.doi.org/10.5061/dryad.qq586. Most of the results are not surprising. The AIC equation M276 is more prone to splitting than the AIC. Considering only the UBL parameterization, the AIC equation M277 selects the 30-partition scheme while the AIC chooses the 21-partition scheme. Also, the BIC equation M278 is more prone to splitting than the BIC. When considering only the UBL parameterization, the BIC equation M279 selects the 3-partition scheme while the BIC prefers the 2-partition scheme. These UBL results also confirm the expectation that the BIC and the BIC equation M280 prefer fewer partitions than the AIC and the AIC equation M281. This is because both the BIC and the BIC equation M282 penalties involve the amount of data as well as the number of estimated parameters.

The number of estimated parameters for the LBL parameterization is substantially smaller than for the UBL parameterization. This causes the LBL parameterization to tend to favor splitting more than the UBL parameterization. Among the LBL results, the BIC selects the 21-partition scheme whereas the other three information criteria all choose the 30-partition scheme.

When both UBL and LBL parameterizations are considered for each of the 30 candidate schemes, the BIC selects the LBL with 21 partitions and 288 free parameters while the BIC equation M283 selects the LBL with 30 partitions and 408 free parameters. For the same set of possible models, the AIC chooses the UBL with 21 partitions and 2486 free parameters while the AIC equation M284 favors the UBL with 30 partitions and 3540 free parameters. All of these model choices are consistent with the AIC equation M285 favoring splitting more than the AIC, the BIC equation M286 favoring splitting more than the BIC, and the BIC and the BIC equation M287 both preferring fewer parameters than the AIC and the AIC equation M288.
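
The preference of BIC-type criteria for fewer parameters can be made concrete with the textbook penalties, 2k for the AIC and k ln(n) for the BIC, where k is the number of free parameters and n is the number of sites. The sketch below uses the four parameter counts quoted above. The alignment length `N_SITES` is a hypothetical value chosen only for illustration, and the adjusted criteria proposed in this article are not reproduced here.

```python
import math


def aic_penalty(k: int) -> float:
    """Standard AIC penalty: 2 per free parameter."""
    return 2.0 * k


def bic_penalty(k: int, n_sites: int) -> float:
    """Standard BIC penalty: ln(n) per free parameter."""
    return k * math.log(n_sites)


N_SITES = 10_000  # hypothetical alignment length, not taken from the article

# Free-parameter counts quoted in the text for Case 1.
SCHEMES = {
    "LBL, 21 partitions": 288,
    "LBL, 30 partitions": 408,
    "UBL, 21 partitions": 2486,
    "UBL, 30 partitions": 3540,
}

if __name__ == "__main__":
    for name, k in SCHEMES.items():
        print(f"{name}: k={k}, "
              f"AIC penalty={aic_penalty(k):.0f}, "
              f"BIC penalty={bic_penalty(k, N_SITES):.0f}")
```

Because ln(n) exceeds 2 as soon as n is larger than about 7 sites, the BIC charges more per parameter than the AIC for any realistic alignment, so the parameter-rich UBL schemes are penalized much more heavily under the BIC.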

Because the 30 candidate schemes were generated by hierarchical clustering, all 30 have a nesting/nested relationship for the UBL parameterization. That is, the equation M289 -partition scheme is nested within the equation M290 -partition scheme ( equation M291 ). The same applies for the LBL parameterization. These nesting relationships mean that the traditional (asymptotic) LRT can be applied. For both LBL and UBL parameterizations, the 30-partition scheme is significantly better than all others according to the likelihood ratio test. In contrast, the 21-partition UBL scheme is selected by the AIC over the 30-partition UBL scheme and over all other candidate schemes. This emphasizes that the AIC score can produce results that are contradictory to the LRT. The AIC equation M292 selects the 30-partition UBL scheme as the best among all candidates and it selects the 30-partition LBL as being the best LBL candidate. The consistency between the AIC equation M293 and the LRT is not surprising given their close relationship.
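
One way to see how the AIC and the LRT can disagree is to compare their thresholds on twice the log-likelihood gain. The sketch below assumes the standard AIC penalty and a 5% asymptotic LRT, and it approximates the chi-square critical value with the Wilson-Hilferty formula rather than an exact quantile. The log-likelihood gain `DELTA_LNL` is a hypothetical number, not a result from this study.

```python
from statistics import NormalDist


def chi2_quantile_wh(prob: float, df: int) -> float:
    """Wilson-Hilferty approximation to the chi-square quantile."""
    z = NormalDist().inv_cdf(prob)
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * c ** 0.5) ** 3


def lrt_rejects(delta_lnl: float, delta_k: int, alpha: float = 0.05) -> bool:
    """True if the nested (smaller) scheme is rejected by the LRT."""
    return 2.0 * delta_lnl > chi2_quantile_wh(1.0 - alpha, delta_k)


def aic_prefers_larger(delta_lnl: float, delta_k: int) -> bool:
    """True if the larger scheme has the lower standard AIC."""
    return 2.0 * delta_lnl > 2.0 * delta_k


# Extra free parameters of the 30- versus the 21-partition UBL scheme (from the text).
DELTA_K = 3540 - 2486
DELTA_LNL = 800.0  # hypothetical log-likelihood gain from splitting

if __name__ == "__main__":
    print("LRT threshold on 2*dlnL:", round(chi2_quantile_wh(0.95, DELTA_K), 1))
    print("AIC threshold on 2*dlnL:", 2 * DELTA_K)
    print("LRT rejects the 21-partition scheme:", lrt_rejects(DELTA_LNL, DELTA_K))
    print("AIC prefers the 30-partition scheme:", aic_prefers_larger(DELTA_LNL, DELTA_K))
```

With many extra parameters the LRT threshold sits only slightly above the number of extra parameters, whereas the AIC requires twice that number. Any gain that falls between the two thresholds is significant by the LRT and yet rejected by the AIC.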

Ripplinger and Sullivan (2008) noted that model selection may affect phylogeny estimation, especially for regions of an evolutionary tree that have low bootstrap support. Partitioning decisions are a type of model choice and our experience with partition selection coincides with Ripplinger and Sullivan's observation. For example, we considered the 30-partition UBL scheme that was selected as optimal according to the AIC equation M294 and the 21-partition UBL scheme that was preferred by the AIC. While computational considerations motivated our decision to avoid intensively searching topology space when computing information criteria scores for the 30 candidate partitioning schemes, we used the RAxML software (Stamatakis 2014) to more carefully search among topologies for both the 21-partition UBL scheme and the 30-partition UBL scheme. While the RAxML trees derived from these two schemes are topologically very similar, Figure 2 shows that the differences tend to occur in regions of the topologies with low bootstrap support.


Figure 2.

Different tree topologies for different partition schemes. ML tree topologies of 56 ray-finned fish taxa were reconstructed with the 30-partition (left) and 21-partition (right) schemes. These two partition schemes are the best with the AIC equation M295 and the AIC, respectively. The bootstrap support levels shown near each node are based upon 500 bootstrap replicates. The trees were inferred with the GTR substitution model and four categories of discrete-gamma rate heterogeneity among sites.

Case 2: We analyzed four protein-coding mitochondrial genes and seven (six protein-coding and one noncoding) nuclear genes of 49 notothenioid fish taxa. Colombo et al. (2015) focused on the (possibly adaptive) radiation of the Antarctic clade, but our focus here is on partition selection. As with Case 1, we used the "models = all" option and the tree topology provided by PartitionFinder for the AIC and the BIC calculations. For the AIC equation M296 and the BIC equation M297 calculations, we used the Neighbor-joining tree topology estimated with the TN93 substitution model (Tamura and Nei 1993) and the MEGA software (Tamura et al. 2013). Fixing this topology, we obtained the AIC equation M298 and the BIC equation M299 scores for a GTR model with 4 discrete-gamma categories of rate heterogeneity.

We first ran PartitionFinder (Lanfear et al. 2012) with the BIC and the UBL parameterization. The partition space was searched with the greedy option. Starting with 31 possible partitions (one for the noncoding gene and one per codon position for each of the 10 protein-coding genes), partitions were hierarchically merged until the best partition scheme (a 4-partition scheme) was found according to the BIC. Because the BIC tends to favor concatenation relative to the other information criteria, we decided to focus on the 28 candidate partition schemes that led PartitionFinder from the starting 31-partition scheme to the 4-partition scheme. The number of partitions in these 28 candidate schemes therefore ranges from 4 to 31.
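
The idea of a hierarchical merging path can be sketched in a few lines. The code below is not PartitionFinder's implementation: it is a generic greedy agglomeration in which the pair of subsets whose merger gives the lowest score is combined at every step, and every scheme visited is kept as a candidate. The scoring function `toy_score`, the per-block values in `RATES`, and the constant `PENALTY` are hypothetical stand-ins for an information criterion evaluated on real data.

```python
from itertools import combinations


def merge(scheme, i, j):
    """Return a new scheme in which subsets i and j are merged."""
    merged = scheme[i] | scheme[j]
    rest = [s for idx, s in enumerate(scheme) if idx not in (i, j)]
    return rest + [merged]


def greedy_path(blocks, score):
    """Greedily merge pairs of subsets; return every scheme visited."""
    scheme = [frozenset([b]) for b in blocks]
    path = [scheme]
    while len(scheme) > 1:
        candidates = [merge(scheme, i, j)
                      for i, j in combinations(range(len(scheme)), 2)]
        scheme = min(candidates, key=score)  # lower score is better
        path.append(scheme)
    return path


# Toy stand-in for an information criterion (hypothetical values).
RATES = {"a": 1.0, "b": 1.1, "c": 5.0, "d": 5.2}
PENALTY = 0.5  # hypothetical cost per subset


def toy_score(scheme):
    """Within-subset sum of squares plus a cost for each subset."""
    total = PENALTY * len(scheme)
    for subset in scheme:
        values = [RATES[b] for b in subset]
        mean = sum(values) / len(values)
        total += sum((v - mean) ** 2 for v in values)
    return total


path = greedy_path(sorted(RATES), toy_score)
best = min(path, key=toy_score)

if __name__ == "__main__":
    print([len(s) for s in path])
    print(sorted(sorted(s) for s in best))
```

Each merge reduces the number of subsets by one, so a path from a 31-subset scheme down to a 4-subset scheme contains 31 - 4 + 1 = 28 schemes, which matches the number of candidates considered here. In the toy example the path visits schemes with 4, 3, 2, and 1 subsets and the best-scoring scheme groups the two similar pairs of blocks.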

Assuming the UBL parameterization and evaluating these 28 candidate schemes, the 9-, 4-, 20-, and 6-partition schemes are selected as the best according to the AIC, the BIC, the AIC equation M300 , and the BIC equation M301 , respectively. Assuming the LBL parameterization, the 31-, 8-, 31-, and 18-partition schemes are selected as the best for AIC, BIC, AIC equation M302 , and BIC equation M303 , respectively. When considering either the UBL or LBL parameterizations, the AIC selects the 933-parameter 9-partition UBL scheme and the AIC equation M304 prefers the 2080-parameter 20-partition UBL scheme while the BIC prefers the 146-parameter 8-partition LBL scheme and the BIC equation M305 prefers the 274-parameter 18-partition LBL scheme. In summary, the analyses again show that the BIC equation M306 is more apt to split partitions than the BIC and the AIC equation M307 is more apt to split than the AIC. The results also again show that the BIC and BIC equation M308 prefer models with fewer free parameters than the AIC and AIC equation M309.

With this data set, we again performed maximum likelihood topology estimation according to a variety of partition schemes and model parameterizations that were favored according to one of the four information criteria. The results were again consistent with the observation of Ripplinger and Sullivan (2008). While we found minor variations in inferred maximum likelihood topology, the variations were associated with regions of the tree that have low or moderate bootstrap support (data not shown).

Using the 20-partition UBL scheme preferred by the AIC equation M310 and also the 9-partition UBL scheme preferred by the AIC, we estimated evolutionary rates and divergence times with the MCMCtree software (Yang 2007) by using both the sequence data and calibration points of Colombo et al. (2015). We concentrate on the third codon positions of the CO1 and the ND4 genes that are heterogeneous in the 20-partition UBL scheme but that are homogeneous (i.e. in the same partition) in the 9-partition UBL scheme. Figure 3 shows divergence time estimates and the inferred trajectory of evolutionary rates of CO1 and ND4 third positions at nodes along the path connecting the most recent common ancestor of the Antarctic clade to the tip in this clade that represents Chionodraco rastrospinosus. While the divergence time estimates are very similar for the 20-partition and 9-partition schemes, the rate trajectories for the third positions of CO1 and ND4 are quite different. The rate trajectory of the merged data shows a kind of average of CO1 and ND4 and is located between the two trajectories of CO1 and ND4, but it loses information about the evolutionary properties of the individual genes. While this is only one example rather than evidence for a general tendency, we expect that the choice of partition scheme is less likely to affect the estimation of tree topology than divergence times, and we expect that partition scheme choice is more likely to affect the estimation of evolutionary rates of individual genes than it is to affect divergence times that are assumed to be shared among the genes.


Figure 3.

Different evolutionary rates for different partition schemes. Evolutionary rate trajectories from the most recent common ancestor of the Antarctic clade to Chionodraco rastrospinosus. Evolutionary rates of third sites of CO1 (circle), ND4 (rectangle) and merged CO1 and ND4 (cross) are plotted on the y-axis with estimated divergence times (millions of years ago) on the x-axis. The third sites of CO1 and ND4 are heterogeneous in the 20-partition scheme that is the best according to the AIC equation M311. However, they are homogeneous in the 9-partition scheme that is the best according to the AIC.

Concluding Remarks

The asymmetric consequences of splitting and lumping errors are our motivation for suggesting the equation M312 and the equation M313 to compare partition schemes. We view the possible bias resulting from lumping errors as more serious than the increased variance generated by splitting errors. Other partitioning techniques to account for the asymmetric consequences are also possible. For example, partitioning guided by Bayesian decision theory (e.g. see Berger 1985) could be attractive, but it would be difficult to convert the qualitative asymmetry of consequences from lumping and splitting errors into a quantitative loss function that would adequately summarize the relative severities of these two types of errors.

Of the four information criteria that we consider, the equation M314 is the least likely to make lumping errors and we conclude that the equation M315 is usually a better choice than the other information criteria. Nonetheless, one of the notable features of the equation M316 is that it is consistent. Consistency in model selection implies that the probability of selecting the true model approaches 1 as sample size increases (Dziak et al. 2012). Likewise, consistency in partition selection guarantees selection of the true partition scheme when sample size (i.e. the number of sequence sites) is large. When equation M317 goes to infinity, the equation M318 penalty divided by equation M319 goes to zero whereas the penalty itself goes to infinity. This is often the situation when information criteria are consistent with respect to model selection (Bozdogan 1987; Dziak et al. 2012). The equation M320 is also consistent but, because the equation M321 is less prone to lumping errors than the BIC, the equation M322 might be the best alternative among the four information criteria when statistical consistency is especially valued.
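
The two limits described above can be checked numerically for the textbook BIC penalty k ln(n), which we use here only as a stand-in because the exact penalties of the criteria discussed in this paragraph are not reproduced. The parameter count `K` is a hypothetical value.

```python
import math

K = 10  # hypothetical number of free parameters


def bic_style_penalty(n_sites: int, k: int = K) -> float:
    """Textbook BIC penalty k * ln(n)."""
    return k * math.log(n_sites)


if __name__ == "__main__":
    for n in (10**2, 10**4, 10**6, 10**8):
        pen = bic_style_penalty(n)
        print(f"n={n:>9}: penalty={pen:7.2f}, penalty/n={pen / n:.2e}")
```

The printed penalties keep growing with n while the penalty per site shrinks toward zero. A penalty of this kind eventually outweighs the bounded likelihood gain from unnecessary parameters but never outweighs the gain from needed ones, which grows in proportion to n.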

A conventional approach is to first make a quick and crude approximation of the phylogenetic tree topology and to then search for the optimal partition scheme by fixing the topology at this approximation (Lanfear et al. 2012). After settling upon and fixing the partition scheme, a thorough search of topologies can be carried out. This conventional approach is attractive because a joint search of all combinations of partition scheme and topology can be computationally prohibitive.

This conventional approach simplifies computation and seems to us to also be sensible when employing the AIC equation M323 and the BIC equation M324 , especially for doing phylogenetics with genome-scale data. If two partitions are heterogeneous according to one combination of tree topology and nucleotide substitution model, they are likely to be heterogeneous according to another combination. To be sure, the choice of a combination could affect the power to detect heterogeneity but this effect is often small. This is because even if the assumed tree topology and substitution model are incorrect for some or all partitions, the resulting bias would also be homogeneous (or heterogeneous) if evolutionary properties are really homogeneous (or heterogeneous) among partitions. However, we have not fully characterized this conventional approach here and doing so might be a good direction for future research.

The application of information criteria to partitioning sequence data has mainly received attention with regard to impact on phylogeny inference, but various other kinds of evolutionary inferences (e.g. divergence time estimation, examination of how rates of molecular evolution have changed over time, and detection of diversifying positive selection) are also potentially influenced by partitioning. The potentially large effects of partitioning on inferred trajectories of evolutionary rates are illustrated by our findings with CO1 and ND4 third positions from the Colombo et al. (2015) notothenioid data.

The ability to detect and quantify shifts of evolutionary rates over time is especially pertinent to the study of adaptation. By studying a phylogeny of diverse terrestrial and marine mammals and then identifying genes having evolutionary rates on the tree that correlate with marine/terrestrial status, Chikina et al. (2016) identified a biologically plausible group of candidate genes that might be associated with adaptation to marine environments. This promising strategy for illuminating the genetic underpinnings of evolutionary adaptation can potentially be applied to various other sorts of adaptation, including adaptation to the extreme Antarctic environment that may have been associated with the notothenioid fish radiation studied by Colombo et al. (2015). However, the ability to use rate change to identify genes associated with adaptation to extreme environments or other sorts of adaptation will be influenced by partitioning decisions. As genomic data of non-model organisms become increasingly available, the ability to characterize trajectories of evolutionary rates should improve and so should the ability to identify interesting changes in evolutionary rates. Success of these studies is likely to be influenced by the availability of sound methods for partitioning.

Funding

This work was supported by the Korea Polar Research Institute (PE17090) and by NIH grant GM118508.

Acknowledgements

We thank Mark Holder, Stephen Smith, Xiang Ji, and two anonymous reviewers.

Appendix

1. Basic assumptions

Let us define operators equation M325 , equation M326 , and equation M327 as follows

equation M328

equation M329 , equation M330 and equation M331 are defined in a similar manner.

The adopted model may not be correct. When the MLE with an incorrect model converges to a certain value that we will denote equation M332 , its asymptotic distribution is normal under suitable regularity conditions (White 1982). That is,

equation M333

(A.1)

equation M334

(A.2)

and the asymptotic distribution of equation M335 is expressed in a similar manner. When the assumed model is correct, equation M336 in Equation (A.1) and the covariance matrix of Equation (A.1) is reduced to the inverse of equation M337.

In our derivations, we assume that the adopted model may not be correct but it is close to the truth so that equation M338 and the covariance matrix of Equation (A.1) is approximately the inverse of equation M339. Also, we assume partition homogeneity when deriving the AIC equation M340 's penalty.

2. Partitioning homogeneous sequence data is not harmful when the number of sequence sites per partition is large

We define

equation M341

If we assume partition homogeneity, then equation M342 and

equation M343

where equation M344 and equation M345 is the identity matrix. We consider the following natural estimators

equation M346

and note that equation M347 is a consistent estimator of equation M348.

Now, consider the following first derivative and its asymptotic expansion.

equation (image: syx097um1.jpg)

where equation M349 implies that equation M350 is bounded in probability (Bishop et al. 2007). This leads to

equation M351

(A.3)

where we use the fact equation M352 for large equation M353 's in the approximation of the last line. We denote equation M354 as equation M355. Then, Equation (A.3) implies equation M356 for large equation M357. On the other hand,

equation M358

Therefore, equation M359 even for large equation M360. However,

equation M361

Therefore, equation M362 for large equation M363 .

Now consider the variances of equation M364 and equation M365 . From Equation (A.3),

equation M366

Taking the expectation of both sides, we get

equation M367

which implies that the variances of equation M368 and equation M369 are similar to each other when there are large amounts of data. The similarity of these variances suggests that partitioning homogeneous data is not harmful when the number of sequence sites per partition is large.
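
The flavor of this result can be illustrated with a small simulation. The sketch below is only an analogy under stated assumptions: it replaces the substitution model with a one-parameter exponential model, and it compares the MLE from pooled (concatenated) homogeneous data with the average of the MLEs obtained separately from equal-sized partitions. The function names, sample sizes, rate, and seed are all hypothetical choices, and the two estimators are our reading of the kind of comparison made above rather than the exact quantities in the derivation.

```python
import random
from statistics import variance


def rate_mle(xs):
    """MLE of an exponential rate: 1 / sample mean."""
    return len(xs) / sum(xs)


def simulate(n_parts=4, sites_per_part=200, reps=500, rate=2.0, seed=1):
    """Sampling variances of the pooled MLE and of the averaged split MLEs."""
    rng = random.Random(seed)
    pooled, split = [], []
    for _ in range(reps):
        parts = [[rng.expovariate(rate) for _ in range(sites_per_part)]
                 for _ in range(n_parts)]
        everything = [x for part in parts for x in part]
        pooled.append(rate_mle(everything))
        split.append(sum(rate_mle(part) for part in parts) / n_parts)
    return variance(pooled), variance(split)


var_pooled, var_split = simulate()

if __name__ == "__main__":
    print("variance, pooled data:    ", var_pooled)
    print("variance, split estimates:", var_split)
    print("ratio:", var_split / var_pooled)
```

With a few hundred observations per partition the two variances are nearly equal, so little precision is lost by splitting homogeneous data. Reducing `sites_per_part` to a handful of observations makes the split estimates noticeably worse, which is the small-sample cost of a splitting error.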

3. Proof of Equation (15)

For simplicity and convenience of mathematical notation, we omit the ' equation M370 ' subscript below so that equation M371 and equation M372 respectively stand for equation M373 and equation M374.

To further simplify Equation (A.2), we define

equation M375

If we rewrite Equation (A.2), the MLE from the equation M376 th partition asymptotically follows a normal distribution,

equation M377

Now, let us define the vector of all partitions' MLEs as follows,

equation M378

where equation M379 and equation M380 are equation M381 dimensional column vectors. The vector equation M382 asymptotically follows a multivariate normal distribution,

equation M383

(A.4)

where equation M384 is a block diagonal matrix due to the independence of partitions,

equation M385

Now, we investigate the relationship between equation M386 and equation M387 for the equation M388 th partition. We define the ( equation M389 )-dimensional matrix equation M390 as

equation M391

where equation M392 is the equation M393 -dimensional identity matrix. Then,

equation M394

where the approximation is from Equation (A.3). Applying Equation (A.4), we find

equation M395

The equation M396 can be obtained with covariance matrices of individual partitions,

equation M397

where the first approximation results from partition homogeneity and the second approximation results from the assumption of equation M398 .

The quadratic summation of the elements of equation M399 follows a equation M400 distribution,

equation M401

(A.5)

Then, the log-likelihood function at the equation M402 th partition has the following relationship,

equation (image: syx097m1.jpg)

(A.6)

where the last approximation holds in the sense of expectation. Therefore,

equation M403

which proves Equation (15). While Equation (15) is a direct consequence of the asymptotic behavior of likelihood ratio tests when the null hypothesis is true, we note that Equations (A.5) and (A.6) are the critical steps in this proof and they are valid so long as the adopted model does not severely deviate from the truth.
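
The likelihood ratio behavior that this proof relies on can be checked by simulation in a simple setting. The sketch below is an illustration under stated assumptions, not a reproduction of Equation (15): it uses unit-variance normal data with a single mean parameter in place of a substitution model, for which twice the difference between the summed partition log-likelihoods and the pooled log-likelihood, both at their MLEs, equals the weighted sum of squared deviations of the partition means from the pooled mean. The partition sizes in `SIZES`, the seed, and the number of replicates are hypothetical.

```python
import random


def split_vs_pooled_statistic(rng, sizes, mu=0.0):
    """2 * (sum of partition max log-likelihoods - pooled max log-likelihood)
    for homogeneous unit-variance normal data."""
    # The mean of n unit-variance normal draws is itself N(mu, 1/n),
    # so the partition means can be drawn directly.
    means = [rng.gauss(mu, 1.0 / n ** 0.5) for n in sizes]
    total = sum(sizes)
    pooled_mean = sum(n * m for n, m in zip(sizes, means)) / total
    return sum(n * (m - pooled_mean) ** 2 for n, m in zip(sizes, means))


SIZES = [100, 200, 300, 400]  # hypothetical numbers of sites per partition
rng = random.Random(0)
stats = [split_vs_pooled_statistic(rng, SIZES) for _ in range(4000)]
mean_stat = sum(stats) / len(stats)

if __name__ == "__main__":
    print("mean statistic:", mean_stat)
    print("extra free parameters from splitting:", len(SIZES) - 1)
```

Under homogeneity the statistic follows a chi-square distribution whose degrees of freedom equal the number of extra free parameters introduced by splitting, so its average is close to that number. In other words, splitting homogeneous data is expected to raise the maximized log-likelihood by roughly one half per extra parameter.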

4. Proof of Equation (20): derivation of BIC

To begin, we overview an approximation of the posterior probability density to show the origin of equation M405 in Equation (19) (e.g. see Robert 2007). As in Part 3 of the Appendix, we will omit the ' equation M406 ' subscript below when doing so does not affect clarity.

Define

equation M407

Then, the probability of data equation M408 for a given prior equation M409 is

equation M410

(A.7)

We use the following result,

equation M411

(A.8)

which can be derived from the probability density function of a multivariate normal distribution.

Thus,

equation M412

(A.9)

Multiplying the right side of Equation (A.9) by equation M413 , we obtain Equation (19),

equation M414

In the conventional definition of BIC, equation M415 is ignored. But, in our study, we take it into consideration for more accurate comparison (see Theory).

By using Equations (A.5) and (A.6), we can derive the following relationships.

equation M416

(A.10)

where we used an approximation similar to Equation (A.8),

equation M417

Taking the logarithm of both sides of Equation (A.10) followed by ignoring minor terms results in

equation M418

Therefore, by recovering model index equation M419 , we obtain the following approximation

equation M420

(A.11)

From the definition of Equation (19) and the approximation of Equation (A.11), we can consider the following approximation and definition,

equation M421

where ' equation M422 ' implies that the right side of the equation is defined as the left side of the equation.

References

Adachi J., Waddell P.J., Martin W., Hasegawa M. 2000. Plastid genome phylogeny and a model of amino acid substitution for proteins encoded by chloroplast. J. Mol. Evol. 50, 348–358.
Akaike H. 1974. A new look at the statistical model identification. IEEE Trans. Autom. Contr. 19, 716–723.
Anderson C.N., Liu L., Pearl D., Edwards S.V. 2012. Tangled trees: the challenge of inferring species trees from coalescent and noncoalescent genes. Methods Mol Biol. 856, 3–28.
Bell E.T. 1934. Exponential numbers. Amer. Math. Monthly 41, 411–419.
Berger J.O. 1985. Statistical decision theory and Bayesian analysis. New York: Springer-Verlag.
Bishop Y.M., Fienberg S.E., Holland P.W. 2007. Discrete multivariate analysis. New York: Springer-Verlag; p. 475–484.
Bozdogan H. 1987. Model selection and Akaike's Information Criterion (AIC): the general theory and its analytical extensions. Psychometrika 52, 345–370.
Burnham K.P., Anderson D.R. 2002. Model selection and multimodel inference. 2nd ed. New York: Springer-Verlag; p. 64–66, 284–285.
Cao Y., Sorenson M.D., Kumazawa Y., Mindell D.P., Hasegawa M. 2000a. Phylogenetic position of turtles among amniotes: evidence from mitochondrial and nuclear genes. Gene 259, 139–148.
Cao Y., Fujiwara M., Nikaido N., Okada N., Hasegawa M. 2000b. Interordinal relationships and timescale of eutherian evolution as inferred from mitochondrial genome data. Gene 259, 149–158.
Chang J.T. 1996. Inconsistency of evolutionary tree topology reconstruction methods when substitution rates vary across characters. Math. Biosci. 134, 189–215.
Chikina M., Robinson J.D., Clark N.L. 2016. Hundreds of genes experienced convergent shifts in selective pressure in marine mammals. Mol. Biol. Evol. 33 (9), 2182–2192.
Colombo M., Damerau M., Hanel R., Salzburger W., Matschiner M. 2015. Diversity and disparity through time in the adaptive radiation of Antarctic notothenioid fishes. J. Evol. Biol. 28 (2), 376–394.
Draper D. 1995. Assessment and propagation of model uncertainty. J. R. Statist. Soc. B 57, 45–97.
Dziak J.J., Coffman D.L., Lanza S.T., Li R. 2012. Sensitivity and specificity of information criteria. Technical Report Series #12-119. The Pennsylvania State University. State College, PA.
Felsenstein J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17, 368–376.
Felsenstein J. 1989. PHYLIP - phylogeny inference package (version 3.2). Cladistics 5, 164–166.
Hasegawa M., Kishino H., Yano T. 1985. Dating the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22, 160–174.
Hastie T., Tibshirani R., Friedman J. 2009. The elements of statistical learning. Chapter 7. New York: Springer-Verlag.
Jukes T.H., Cantor C.R. 1969. Evolution of protein molecules. In: Munro H.N., editors. Mammalian protein metabolism. New York: Academic Press, p. 21–132.
Kimura M. 1980. A simple method for estimating evolutionary rate of base substitution through comparative studies of nucleotide sequences. J. Mol. Evol. 16, 111–120.
Kolaczkowski B., Thornton J.W. 2004. Performance of maximum parsimony and likelihood phylogenetics when evolution is heterogeneous. Nature 431, 980–984.
Konishi S., Kitagawa G. 2004. Information criteria (in Japanese). Tokyo: Asakura Publishing Co, p. 47–48.
Lanfear R., Calcott B., Ho S.Y.W., Guindon S. 2012. PartitionFinder: combined selection of partitioning schemes and substitution models for phylogenetic analyses. Mol. Biol. Evol. 29, 1695–1701.
Lanfear R., Calcott B., Kainer D., Mayer C., Stamatakis A. 2014. Selecting optimal partitioning schemes for phylogenomic datasets. BMC Evol. Biol. 14, 82–95.
Leigh J.W., Lapointe F.J., Lopez P., Bapteste E. 2011. Evaluating phylogenetic congruence in the post-genomic era. Genome Biol Evol. 3, 571–587.
Lemmon A.R., Moriarty E.C. 2004. The importance of proper model assumption in Bayesian phylogenetics. Syst. Biol. 53, 265–277.
Li C., Lu G., Orti G. 2008. Optimal data partitioning and a test for Ray–Finned fishes (Actinopterygii) based on ten nuclear loci. Syst. Biol. 57, 519–539.
Lopez P., Casane D., Philippe H. 2002. Heterotachy, an important process of protein evolution. Mol. Biol. Evol. 19, 1–7.
Nikaido M., Cao Y., Harada M., Okada N., Hasegawa M. 2003. Mitochondrial phylogeny of hedgehogs and monophyly of Eulipotyphla. Mol. Phylogenet. Evol. 28, 276–284.
Nishihara H., Okada N., Hasegawa M. 2007. Rooting the eutherian tree: the power and pitfalls of phylogenomics. Genome Biol. 8, R199.1–R199.10.
Pupko T., Huchon D., Cao Y., Okada N., Hasegawa M. 2002. Combining multiple data sets in a likelihood analysis: which models are the best? Mol. Biol. Evol. 19, 2294–2307.
Rambaut A., Grassly N.C. 1997. Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput. Appl. Biosci. 13, 235–238.
Ripplinger J., Sullivan J. 2008. Does choice in model selection affect maximum likelihood analysis? Syst. Biol. 57 (1), 76–85.
Robert C.P. 2007. The Bayesian choice. 2nd ed. New York: Springer-Verlag; p. 352.
Schwarz G. 1978. Estimating the dimension of a model. Ann. Stat. 6, 461–464.
Seo T.-K. 2008. Calculating bootstrap probabilities of phylogeny using multilocus sequence data. Mol. Biol. Evol. 25, 960–971.
Stamatakis A. 2014. RAxML Version 8: A tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30 (9), 1312–1313.
Tamura K. 1992. Estimation of the number of nucleotide substitutions when there are strong transition-transversion and G+C content biases. Mol. Biol. Evol. 9, 678–687.
Tamura K., Nei M. 1993. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol. Biol. Evol. 10, 512–526.
Tamura K., Stecher G., Peterson D., Filipski A., Kumar S. 2013. MEGA6: molecular evolutionary genetics analysis version 6.0. Mol. Biol. Evol. 30 (12), 2725–2729.
Wu C.H., Suchard M.A., Drummond A.J. 2012. Bayesian selection of nucleotide substitution models and their site assignments. Mol. Biol. Evol. 30 (3), 669–699.
White H. 1982. Maximum likelihood estimation of misspecified models. Econometrica 50, 1–25.
Yang Z. 1994a. Estimating the pattern of nucleotide substitution. J. Mol. Evol. 39, 105–111.
Yang Z. 1994b. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol. 39, 306–314.
Yang Z. 1996. Maximum-likelihood models for combined analyses of multiple sequence data. J. Mol. Evol. 42, 587–596.
Yang Z. 2007. PAML 4: Phylogenetic analysis by maximum likelihood. Mol. Biol. Evol. 24, 1586–1591.

Articles from Systematic Biology are provided here courtesy of Oxford University Press

Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6005138/