As of LAMARC version 2.1, we now have the capacity to perform fine-scale mapping of traits and diseases that have been localized to a single stretch of DNA (in LAMARC terms, to one genomic region). The way it works is that as LAMARC searches among trees, it calculates the liklihood of the observed trait data being created at each site in the sequence. Assuming that each site has an equal chance of being the true site, it then converts these liklihoods into probabilities, an averages the resulting values over all the trees LAMARC collects.
We have one submitted paper and another in the works (as of early 2007) that take this approach. Until those appear, more detailed theory can be provided upon request.
To use LAMARC as a mapping tool, you'll have to collect the data, convert it into a LAMARC input file, set up the parameters for your search, and run the analysis. We'll go through each step in succession. During our discussion, we'll make use of example data file and converter command for a simple trait mapping case. Included are sample trait data (in migrate format), and a sample converter command file. (actual xml is here.)
One of the assumptions of LAMARC is that the samples are collected at random from the population. This affects the types of trees one would expect to see that relate your samples to one another. However, researchers that study a particular disease often have one set of samples from affected individuals, and another set of samples from unaffected individuals. This introduces an ascertainment bias that makes the true tree less likely to be produced than it should. We have not yet investigated how strongly this effects the accuracy of mapping results, but it is important to keep in mind.
The ideal trait for a LAMARC analysis would be one that is ubiquitously diverse in the studied population (like eye color in humans). A random sample from the population as a whole could then be classified into different phenotypic groups. More information on the ascertainment question will be provided as our own experiments proceed.
Once you have collected data (of any type--marker SNPs, full sequence information, or microsatellites), you will need to get the following information into lamarc:
At this writing, most of this information can only be entered into the lamarc file by writing a converter command file. You should refer to that document for the complete syntax of that file. Below is a brief overview of the mapping-specific components of that file.
<lamarc-converter-cmd> ... <traits> <trait-info> <name>funny-nose</name> <allele>normal</allele> <allele>affected</allele> </trait-info> <phenotype> ... </phenotype> </traits> ... </lamarc-converter-cmd>
This would define the 'funny-nose' trait, with its two alleles, normal and affected.
Phenotypic groups are collections of individuals who all display the same phenotype. This requires that one know or have a reasonable model for the mechanism and penetrance of the trait in question. In a simple dominant/recessive case, individuals displaying the dominant phenotype could be any of AA, Aa, or aA, while individuals displaying the recessive phenotype all must be aa.
If your data set includes many individuals with the same phenotype, the easiest way to specify them is to define <pheontype> tags in the converter command file. Below are the definitions for our simple example.
<lamarc-converter-cmd> ... <traits> ... <trait-info> ... </trait-info> <phenotype> <name>straight</name> <genotype-resolutions> <trait>funny-nose</trait> <haplotypes> <penetrance> 1.0 </penetrance> <alleles> normal normal </alleles> </haplotypes> </genotype-resolutions> </phenotype> <phenotype> <name>bent</name> <genotype-resolutions> <trait>funny-nose</trait> <haplotypes> <penetrance> 1.0 </penetrance> <alleles> affected normal </alleles> </haplotypes> </genotype-resolutions> <haplotypes> <penetrance> 1.0 </penetrance> <alleles> normal affected </alleles> </haplotypes> </phenotype> <phenotype> <name>broken</name> <genotype-resolutions> <trait>funny-nose</trait> <haplotypes> <penetrance> 1.0 </penetrance> <alleles> affected affected </alleles> </haplotypes> </genotype-resolutions> </phenotype> </traits> ... </lamarc-converter-cmd>
Note that the 'penetrance' in all instances is the chance that an individual with those alleles displaying the observed phenotype. If heterozygotes were 80% likely to display the same phenotype as all normal/normal individuals, and 20% likely to display the same phenotype as all broken/broken individuals, there would be only 2 phenotypes to define and they would be as follows:
<phenotype> <name>straight</name> <genotype-resolutions> <trait>funny-nose</trait> <haplotypes> <penetrance> 1.0 </penetrance> <alleles> normal normal </alleles> </haplotypes> <haplotypes> <penetrance> 0.8 </penetrance> <alleles> normal affected </alleles> </haplotypes> <haplotypes> <penetrance> 0.8 </penetrance> <alleles> affected normal </alleles> </haplotypes> </genotype-resolutions> </phenotype> <phenotype> <name>broken</name> <genotype-resolutions> <trait>funny-nose</trait> <haplotypes> <penetrance> 1.0 </penetrance> <alleles> affected affected </alleles> </haplotypes> <haplotypes> <penetrance> 0.2 </penetrance> <alleles> normal affected </alleles> </haplotypes> <haplotypes> <penetrance> 0.2 </penetrance> <alleles> affected normal </alleles> </haplotypes> </genotype-resolutions> </phenotype>
If additional information is available about an individual that provides a clue to their genotype, that information can also be included as a part of the phenotype. For example, an individual with a dominant phenotype who has a recessive parent or child must be a heterozygote, and should be classified separately from other individuals displaying a dominant phenotype.
<lamarc-converter-cmd> ... <individuals> <individual> <name>ind_a</name> <sample><name>s0</name></sample> <sample><name>s1</name></sample> <has-phenotype>broken</has-phenotype> </individual> ... </individuals> ... </lamarc-converter-cmd>
This indicates that the samples "s0" and "s1" belong to the same individual and that that individual displays the "broken" phenotype. (If your individuals are named in your data input file but do not have sample names, as in migrate-format microsattelite data, you do not need to provide extra sample names here.)
It is also possible to explicitly specify genotype resolutions for an individual instead of using the <has-phenotype> tag. See the section Specifying Relationships Between Individuals and Data Samples in the Converter Command File documentation.
Finally, the trait mapper needs to know in which genomic region to search for your trait. This is an additional tag inside the "region" tag.
<lamarc-converter-cmd> ... <regions> <region> <name>region1</name> ... <trait-location> <trait-name>funny-nose</trait-name> </trait-location> ... </region> ... </regions> </lamarc-converter-cmd>
Data from other regions may be included in your analysis, but will not affect the results from your mapping analysis one way or another. If you have two unlinked regions, either of which might contain the trait allele, LAMARC will be able to tell you the most likely place within either region where the trait is likely to map, but will not be able to provide any information about whether one region fits the data better than the other.
At this point, you are done creating a converter input file, and you can run the converter and output a LAMARC input file (lamarc-trait-input.xml [actual xml is here]). Run the converter in batch mode using this command file with:
lam_conv -b -c traitCmd.xml
You should also be able to further modify your run within the converter by leaving off the '-b' to run in interactive mode. Once you're done, select 'Write Lamarc File' from the 'File' menu.
Then, start up lamarc, using your newly-minted file as the input.
The settings for mapping are under the Analysis menu ('A', from the main menu), then 'A' again for the 'Trait Data analysis' menu. The next menu shows a list of all the traits you are mapping; the settings for each can be changed independently. Select the trait for which you want to change the settings.
There are two things you can do from this menu. The first is to restrict the range of possible sites for your trait. For some analyses, you might have sequence data in the same general region as the area where you've mapped your trait, but you know that the trait alleles are not there. In this case, you can Add ('A') or Remove ('R') sites from the genomic region from consideration as being a possible site for the trait you're mapping. The numbering scheme used here is the region-wide numbering, as defined by the 'map-position' numbers both in the converter and in the lamarc input file. Also note that by default, there is no 'site 0' in this scheme: the position to the left of site 1 is defined to be site -1. (This can be changed in the lamarc input file XML tag convert-output-to-eliminate-zero.
The second option here is to select the type of mapping analysis you wish to perform, a 'floating' ('F') or a 'jumping' ('J') analysis.
The 'floating' analysis (the default) will not affect the search through tree-space at all. Instead, as trees are collected, each is analyzed, and the likelihood that the observed trait data could have been produced at each site is calculated and stored. This can be a somewhat time-consuming process, but the results are more robust. The final result is a complete analysis, for every tree collected in the final chain, of the relative likelihoods of each site in the genomic region under analysis.
In the 'jumping' analysis, the trait alleles are actually placed at a particular site. A new arranger is turned on that moves the trait alleles from site to site based on the relative likelihoods of the data being created at each site. This arranger is a 'Gibbs' arranger, in that it calculates the likelihood at each site, and then chooses one according to their relative probabilities. In this way, it always accepts its new choice, without regard to where it used to be. The results of this analysis are an average of where the trait alleles were placed during the run.
Based on our preliminary analyses, we recommend the 'floating' analysis for most studies. This analysis is more thorough, and while the analysis takes somewhat longer while it is running, it only needs to run during the final chain, so on balance, the time difference is relatively small. In addition, by never incorporating the trait data into tree acceptance or rejection, the trait data itself does not influence the tree and 'lock' it into place by shoving the trait into a fairly reasonable position and then modifying the tree to fit the data. With known data, this is exactly what you want, but in this case, we want to know where on the sequence the allele resides, without unduly biasing the results.
Still, the 'jumping' analysis has its uses. Since its arranger is on at all times, it provides updates on the current state of the mapper at the end of each chain. And while normally we believe that not incorporating trait data into tree acceptance and rejection is good for mapping, there may be some cases (perhaps with sparse markers) where using the data from the trait may lead to finding trees with overall better likelihood at the correct position, with a corresponding lower likelihood at incorrect positions. In this case, the range of sites where the trait might be located would narrow (a good thing, when mapping).
When performing a Jumping analysis, two new rearrangers are created, and the relative amount of time spent on each can be set from the Rearrangers menu ('S', then 'R' from the main menu.) The basic arranger here is the Trait Location rearranger ('L'), which accomplishes the 'jumping' part of the analysis by moving the trait alleles to various positions in the sequence. The other arranger you might find here is the Trait haplotypes rearranger ('M', for no good reason), which will rearrange the unknown haplotypes (if any) of your trait data. This rearranger will take the various genotype resolutions you defined for your phenotypes and individuals, and swap among them in proportion to their relative likelihoods. If there are a lot of individuals in your data with unknown haplotypes, this can be another reason to use the 'jumping' analysis, since this analysis searches among possible haplotype resolutions, while the 'floating' analysis sums over all possible combinations of haplotype resolutions for each analyzed tree, which can be very slow.
Running LAMARC in interactive mode (i.e. non-batch mode) will result in output to the screen that looks something like this:
15:44:03 Most likely site(s) for funny-nose: 919:923. Relative data likelihood = 0.0025891 The top 5% of all sites in this region: 919:936, 946:947 The top 50% of all sites in this region: 678:709, 776:803, 811:953 The top 95% of all sites in this region: 1:212, 214:215, 217:220, 641:953 You have a total of 531 sites in your 95% range.
These results were for a mock data set for the 'funny-nose' trait, in a region 1000 base pairs long. The top scorers were sites 837:841 although each site in this range had only a 0.22% chance of actually being the correct one. The last line informs us that 566 sites are included at 95% confidence, so about half the sequence was excluded. The true site for this trait happened to be site 905, which was included in the top 50% of all sites. This was the result of a default run, which is on the short side; it may be the case that we'd get better results from spending more time on a longer run of LAMARC.
Note that these results are disjointed--the most likely sites are not guaranteed to be next to one another. This can be due to both random chance and shared heritage. Random chance can produce two different historical pathways which are both about equally good explanations of the trait. Shared genetic heritage can sometimes mean that distant segments of DNA might share a crucial bit of genetic history, while the intervening segment experienced a widely divergent history.
More detailed results are available in the mapping output file for each trait. For our example, the default name for this file is mapfile_funny-nose.txt. There, you will find a list like this:
Site Data likelihood 1 0.00090735 2 0.00090871 3 0.00090877 4 0.00091575 5 0.00091603 6 0.00091603 7 0.00091603 8 0.00091756 [...]
This column of numbers should add up to 1.0, and indicates the percent chance that your trait maps to each site (so, for the above, a 0.069191% chance of being at site 1, a 0.069390% chance for site 2, and so on.) The list can be imported into a graphing utility or spreadsheet program to better visualize your results.
(Previous | Contents | Next)