## Evolutionary Forces

This article describes, in some technical detail, the way we model evolutionary forces. You don't need this information to get the program running, but it can be very helpful in understanding your results.

The forces supported at this time are:

For each force, we describe the parameter being estimated, and list some assumptions made in the analysis. If the assumptions are violated, the results will not necessarily be wrong, but should be interpreted cautiously.

### Coalescence

We estimate the parameter Theta for each population. While most papers describe Theta as 4Nemu (where Ne is the effective number of individuals and mu is the mutation rate in mutations per generation), this is specific to diploids. If you put haploid mtDNA into LAMARC, the Theta estimates will be estimates of 2Nfmu instead, where Nf is the effective number of females in the population. It is best to think of Theta as "number of heritable copies in population * 2 * mutation rate," since this definition works no matter what the ploidy is. (The "2" comes from the fact that two sequences that have diverged for time t are different by 2 * mu * t mutations, since both diverging lineages accumulate mutations.)

The units of mu are mutations per site per generation. Please note that many studies compute a (lower-case) theta whose units involve mutations per locus per generation. To convert, multiply the per- site Theta by the number of sites.

If the "multiple rate categories" option of the data model is used, the mutation rate is the weighted mean across all categories.

While it would be helpful to have separate estimates of population size and mutation rate, with genetic data from a single time point there is no way to separate them. If you have outside information about either population size or mutation rate, for example from outgroup analysis, you can then estimate the other parameter directly.

A Frequently Asked Question is how to interpret the parameter Theta when analyzing mitochondrial DNA or Y-chromosome DNA. In these cases Theta is proportional to the mtDNA or Y-chromosome effective population size. For example, when given mtDNA, the program estimates Theta=2Nfmu, and when given Y-chromosome DNA, the program estimates Theta=2Nmmu (where "f" and "m" refer to the female and male effective population sizes).

To combine estimates of Theta from regions with different Thetas, such as mtDNA and nuclear DNA, you can set the relative population sizes in the 'Effective Population Size' menu, or set these values directly in the XML input. Reported multi-region Theta estimates will be scaled accordingly.

#### Assumptions:

• (unless growth rates are being estimated) The population(s) have been the same size for a very long time (though they can be different from one another).

• All markers being considered are neutral, or at least the variation seen is neutral (a gene which undergoes purifying selection against harmful variants will work, but a gene under balancing selection will give distorted results).

• The markers are also not tightly linked to any segments undergoing directional or balancing selection.

• The population is significantly larger than the sample drawn from it. If not, the basic coalescent equations break down. Results will also be questionable if the population size is truly tiny--less than a few dozen. A rule of thumb is that the sample size should be no greater than the square root of the population size, so a population of 100 should have a sample no greater than 10.

• Each population is free of internal subdivisions which would impede gene flow.

• All regions and segments in the analysis either reflect the same underlying Theta (they do not vary in mutation rate or population size), have been correctly annotated with their relative mutation rates and population sizes, or have mutation rates drawn from a gamma distribution (if explicitly estimated).

### Migration

We estimate the immigration rate into each population from each other population (moving forward in time), so that a three-population case will estimate six rates. The immigration parameter M is expressed as m/mu where m is the immigration rate per generation and mu is the neutral mutation rate per site per generation. (As usual, if multiple mutation categories are used, mu is the weighted average.)

If you would prefer to consider migration in terms of 4Nm, multiply the given value by the recipient population's value of Theta. Please be careful of the distinction between immigration and emigration. LAMARC always estimates and reports immigration, so 'M12' indicates the rate at which individuals from population 1 have moved to population 2.

In the current version we cannot meaningfully combine genetic regions with different migration rates in the same analysis. We hope to provide this capability later.

Peter Beerli has argued that it may be useful to include a population in your analysis even if you have not sampled any individuals from it. This can guard against biases produced by unacknowledged populations. For example, if you believe that your real-life situation involves three populations exchanging migrants, but you have been able to sample only two of them, you might add the third population even with no individuals. The parameter estimates involving this population will be weak, but the estimates involving the other two populations may be more accurate than if the unsampled population were omitted entirely.

However, LAMARC is not very stable with unsampled populations, because the likelihood surface for the parameters of the unsampled population may be flat or multi-peaked. For this reason we have not allowed unsampled populations in v2. If you feel this capability would help you, consider using MIGRATE, and also please let us know. We will work on stabilizing estimates for unsampled populations if this is of general interest.

#### Assumptions:

• The current migration structure has been stable for a long time, and the populations have existed for a long time. (If one population is a recent offshoot of another, you will see spurious evidence of migration between them.)

• Since the migration situation is assumed to be stable, there must be some way for individuals to have reached all of the populations. Thus, all populations must be connected at least through one path, and at most one population can have no immigrants. If your data do contain several completely isolated populations, you will estimate small, but non-zero, migration rates into them. This is a result of assuming a common ancestor for all populations.

We do not assume that migration is symmetrical, though you can impose this assumption if you wish using constraints.

### Recombination

We estimate the recombination rate r = C/mu, where C is the recombination rate per inter-site link per generation and mu is the neutral mutation rate per site per generation. As usual, if multiple rate categories are in effect, mu is the weighted average of the categories.

In the current version we cannot meaningfully combine regions or populations with different recombination rates. We hope to provide this capability later.

We do not assume that all recombinations are visible. An advantage of LAMARC and its predecessor RECOMBINE over pairwise methods is that their recombination estimates are not dragged downwards if the data are nearly invariant (though the error bars, of course, increase greatly).

#### Assumptions:

• The recombination rate is constant across the region and has not changed for a long time.

• Recombination frequency is not affected by sequence divergence; highly similar and highly dissimilar sequences are equally likely to recombine.

• All recombination is homologous.

• There is no gene conversion. While double recombinants can produce events that resemble conversions, they have the dynamics of two independent events, not a single event like a true conversion.

• There is no interference among adjacent recombinations (this follows from the coalescent assumption that no two events happen at the same instant).

• Recombination events are selectively neutral.

### Growth

We estimate the exponential growth rate g as defined in the following equation, where t is a time before the present:

Thetat = Thetapresent time exp(-gt)

This means that positive values of g indicate a population which is growing, and negative values a population which is shrinking. They are not symmetrical in magnitude, however; g = 10 indicates rather slow growth, but g = -10 indicates significant shrinkage for most values of Theta.

When migration is also in effect, growth rates are computed for each population independently.

In the presence of growth, the reported value of Theta is the present-day Theta (at the time when the data were sampled). The equation above can be used to determine values of Theta for past times. Just remember that the time parameter t is measured in units of mutations (i.e. 1 t is the average number of generations it takes one site to accumulate one mutation), and g is measured in the inverse units of t. If mu is known, divide generations by mu to get units of t (conversely, t*mu is a number of generations).

#### Assumptions:

• The population has been growing or shrinking at the same exponential rate for a long time. (This is particularly questionable when the population is shrinking; except for microorganisms, most biological populations do not survive long periods of exponential shrinkage.)

• If migration is in effect, immigration rate does not depend on population size.

• The growth force cannot be used in combination with the gamma "force." Please see the comment there for an explanation.

Simulation studies show that estimation of growth tends to be biased upwards; this bias is reduced by having multiple unlinked regions and/or, in cases with recombination, long sequences. Be cautious in interpreting results from one or a few regions. The profile likelihoods are more reliable than the maxima but can also show the bias.

### Gamma-distributed relative mutation rate over regions

This isn't exactly a "force," but because LAMARC optionally allows you to estimate the parameter (α) of the gamma distribution which best fits data sets composed of multiple unlinked genomic regions, it's in the menu for evolutionary forces. More information about this is available here.

Because there need to be multiple genomic regions in order for there to be relative mutation rates to distribute among them, this "force" will not appear in the evolutionary forces menu if the data is annotated as coming from a single genomic region.

Due to a mathematical detail of how this feature is implemented in LAMARC, it can only be applied to maximum-likelihood analyses. If you wish to perform a Bayesian analysis, your best bet is to estimate the relative single-region mutation rates by some other means, then supply these to LAMARC as constants.

#### Assumptions:

• Each population being analyzed is constant in size. This is because a certain integrand containing the gamma distribution arises when the populations are assumed to be growing or shrinking exponentially, and this integrand cannot be integrated analytically. Hence LAMARC does not allow the gamma "force" to be used in combination with growth.

### Divergence

We estimate the time at which pairs of populations split from their common ancestor. Thus, a case with three populations will estimate two divergence times. For example, a human/chimp/gorilla case might estimate the human versus chimp divergence time and the human+chimp versus gorilla divergence time.

LAMARC does not infer the population tree. The information that we should be computing human+chimp and not chimp+gorilla as the first split must be provided by the user. [For a method that attempts to infer population trees, see the "*BEAST" program from Drummond's group.] Divergence requires not only the topology of the population tree, but its order of events. If we have a tree with an A+B split and a C+D split we must specify which split happened most recently.

Divergence times are scaled by the mutation rate. A time of 1.0 for the divergence of two populations means that it occured long enough ago that each character has an expectation of 1 mutation since then.

If migration is additionally turned on, migration is allowed between populations that exist at the same time. This leads to inference of more migration rates than a non-divergence case. For example, in the human/chimp/gorilla example the program could maximally infer bidirectional migration rates between human and chimp, human and gorilla, chimp and gorilla, and between gorilla and the human/chimp ancestor. Users should beware of inferring too many migration rates for the available amount of data. Unwanted rates can be fixed at zero. For example, in this case fixing the rates between human and gorilla to zero would seem reasonable.

An important caveat about models of divergence is that in some cases the data will say nothing about one or more parameters (the jargon term is "not identifiable"). For example, if a population split very recently there will be no power to infer the sizes of the two descendants or the migration rate between them. If they split very long ago there will be no power to infer the size of the ancestor or its migration rate. And if the migration rate between two populations is very high there will be no power to infer their divergence time. It is important to look at confidence intervals and, in Bayesian runs, the shape of the posterior distribution to detect cases in which some parameters are not identifiable.

Quick calculators for starting values (FST, Watterson) are not available in cases with divergence. We don't know how to use them for unsampled ancestral populations, so cravenly don't use them at all.

#### Assumptions:

• The population tree given by the user is correct. Grossly distorted estimates should be expected if the population tree is wrong, and there is no way for the program to detect this problem.

• The populations have been the same size (unless growth rates are being estimated) throughout their existance. No relationship is assumed between pre-split and post-split sizes (the size of the human/chimp ancestor has no fixed relationship with the sizes of the human and chimp populations).

• The migration rates are stable throughout the lifespan of a population.