APPENDICES

 

A Bayesian Heterogeneous Analysis of Variance Approach to Inferring Recent Selective Sweeps

 


 

Introduction

Summary of Data Sets 1 and 2

Prior specifications for Data Sets 1 and 2

Data analysis in WinBugs

Posterior distributions of precision parameters for Data Sets 1 and 2

Coalescent simulations

Prior specifications for coalescent simulation data

Summary of analyses of coalescent simulation data

References

 


 

Introduction

 

Here additional data is presented that is supplemental to the article “A Bayesian heterogeneous analysis of variance approach to inferring recent selective sweeps” by Marshall and Weiss (2006). Depending on the reader’s interest, it is not necessary to read every section. The summary of Data Sets 1 and 2 may be of interest to vector biology researchers who are interested in the population genetics of Anopholes gambiae. The section on data analysis in WinBugs and prior specification sections are of interest to anyone wishing to perform a similar analysis on a different data set. The coalescent simulation code is provided for people wishing to perform analogous simulations, and to understand exactly how the simulations were performed in this article (Marshall and Weiss 2006). The summary of analyses of coalescent simulation data provides a more detailed look at how the Bayesian method performs under a variety of scenarios.

 


 

Summary of Data Sets 1 and 2

 

Two sets of microsatellite allele size data for An. gambiae populations in West Africa were analyzed. The first data set (Data Set 1) has been previously analyzed by Lanzaro et al. (1998) and consists of allele size data for 213 An. gambiae mosquitoes collected at 21 microsatellite loci dispersed throughout the mosquito genome. The second data set (Data Set 2) was recorded between 2002 and 2003 and consists of microsatellite allele size data for mosquitoes collected at 12 microsatellite loci dispersed throughout the third chromosome of the An. gambiae genome. Collection dates, locations, chromosomal forms and sample sizes are shown below:

 

Data Set

Location

Chromosomal form

Collection date

Number of individuals

1

Banambani

Bamako

1-15 July, 1993

16

Banambani

Mopti

1-15 July, 1993

47

Banambani

Savannah

1-15 July, 1993

6

Selenkenyi

Bamako

4-7 August, 1993

46

Selenkenyi

Mopti

4-7 August, 1993

50

2

Oure

Mopti

18 August, 2002

8

Oure

Savannah

18 August, 2002

46

Gono

Savannah

10 September, 2002

24

Kokouna

Savannah

8 August, 2002

72

Pimperena

Savannah

10 August, 2002

76

Soulouba

Savannah

9 September, 2003

68

Madina Daisra

Savannah

7 August, 2002

46

Dire

Mopti

16 November, 2002

32

Kondi

Mopti

17 October, 2002

18

Nampala

Mopti

21 July, 2002

42

Torkya

Mopti

21 November, 2002

26

Banikane

Mopti

15 November, 2002 & 19 October, 2003

46

List of populations studied and sample sizes

 

The raw data for these data sets is given in the text files below. Here, the data is formatted to be compatible with subsequent analysis in the software package WinBugs. In the file yDataSet1.txt the size of each measured microsatellite allele is recorded numerically. The files locusDataSet1.txt and locusDataSet2.txt index the locus at which each microsatellite allele has been measured, and the files populationDataSet1.txt and populationDataSet2.txt index the population that each microsatellite allele belongs to. The microsatellite allele size data for Data Set 2 (yDataSet2.txt) was not publicly available at the time of writing.

 

Data files for the Bayesian analysis

 

In these data files input to WinBugs, the loci are numbered from one up to the number of loci being analyzed and the populations are numbered from one up to the number of populations being analyzed. The correspondence between the locus index and number and population index and name are given in the tables below for Data Sets 1 and 2:

 

Index

Locus

1

007

2

008

3

025

4

026

5

038

6

079

7

083

8

088

9

095

10

119

11

125

12

128

13

135

14

143

15

156

16

158

17

175

18

293

19

577

20

603

21

637

Correspondence between locus index and number for Data Set 1

 

Index

Population

1

BnB

2

BnM

3

BnS

4

SeB

5

SeM

Correspondence between population index and name for Data Set 1

 

Index

Locus

1

119

2

127

3

242

4

249

5

312

6

555

7

577

8

059

9

746

10

812

11

817

12

093

Correspondence between locus index and number for Data Set 2

 

Index

Population

1

GoS

2

KoS

3

PiS

4

SoS

5

MaS

6

OuS

7

DiM

8

KoM

9

NaM

10

ToM

11

BaM

12

OuM

Correspondence between population index and name for Data Set 2

 

The locations of each of these loci in the An. gambiae genome are available on the Anobase database at http://www.anobase.org. The locations of the collection sites for both the 1993 and 2002-2003 collections are available on the map of Mali at  http://taylor0.biology.ucla.edu/~madiuk/images/Carnahan.jpg .

 


 

Prior specifications for Data Sets 1 and 2

 

The prior specifications used for the Bayesian computation for Data Sets 1 and 2 are only briefly discussed in the paper (Marshall and Weiss 2006) with most of the details being given in the Materials and Methods section. Specification of prior distributions is a delicate and subjective process, and so for the interested reader a more detailed description of how these specifications are calculated is given below:

 

 

Prior specifications for Data Set 1

The prior distributions for the mean and precision component parameters for Data Set 1 were estimated following the procedure described in the Materials and Methods section (Marshall and Weiss 2006) fairly closely. The parameter  approximately corresponds to the precision of the natural logarithm of effective population sizes for the populations being analyzed. We were not able to obtain estimates for all of the An. gambiae populations in this analysis, however Taylor and Manoukis (2003) have calculated approximate effective population sizes at the focal research site of Banambani for the Bamako chromosomal form (), the Savannah chromosomal form () and the Mopti chromosomal form (). The precision of the natural logarithm of these estimates is equal to  and can be used as an estimate of the mean for the prior distribution of  values. We now need one more parameter to complete the specification for the prior distribution of  values. Taking the mode of the distribution to be  times the mean, we find , taking the mode to be  times the mean, we find , and taking the prior information to be worth five observations, we find . We choose  since it is the middle of these values, from which it follows that .

 

The parameter  corresponds to the precision of the natural logarithm of microsatellite mutation rates for the microsatellite loci being analyzed. In the absence of data on microsatellite mutation rates for An. gambiae specifically, Harr et al. (1998) report a 10-fold or greater range of mutation rates in the genomes of the Drosophila sister species D. melanogaster and D. simulans. Assuming a similar range for An. gambiae corresponds to a range in  values of  and an estimate of the mean for the prior distribution of  values of . Again assuming the prior information to be worth five observations, we find that  and .

 

The parameter  is of great interest as it corresponds to the precision of the natural logarithm of fractional reductions in microsatellite allele size variance due to selection, which is the main parameter of interest that we are trying to estimate in this model. Postulating that 95% of the  values lie between 0.6 and one leads to an estimate for the mean of the distribution of  values of . Assuming again that  gives .

 

Considering the mean components part of the model, the parameters ,  and  represent the contribution of population effects, locus effects and population-locus interaction effects respectively to the precision of  values. Since the population and population-locus specific terms are expected to contribute very little to the  values, then in the absence of a good source of prior information, we can approximate the means of their prior distributions to be relatively large (). Once again, assuming prior information to be worth five observations, we have  and hence .

 

Consequently, locus effects will contribute almost entirely to  values and so we can estimate the precision component  by the overall precision in microsatellite allele sizes in the populations and loci being considered (). Assuming prior information to be worth five observations, we have  and hence .

 

For the portion of the model describing the mean of the microsatellite allele sizes, we require  to be approximately equal to the mean of the  values for our sample (). Similarly, for the portion of the model describing the variance in microsatellite allele sizes we require  to be approximately equal to the mean of the  values for our sample ().

 

 

Prior specifications for Data Set 2

Data Set 2 consists of microsatellite allele size data at 12 loci dispersed throughout chromosome three of An. gambiae and collected from 12 locations throughout Mali. While the loci and populations are for the most part different from those considered in Data Set 1, the same broader geographical region and An. gambiae chromosomal forms are being considered, and consequently the population and locus effects should be of similar magnitude across data sets.

 

Having already calculated posterior distributions for the precision parameters in Data Set 1, these were used as a source of prior information for Data Set 2. For each of the precision parameters , , , ,  and , the posterior means and variances were calculated for Data Set 1 and the prior parameters for Data Set 2 were calculated from the relations  and  for the gamma distribution from which we may deduce  and  where  and  parameterize the prior distributions of the precision parameters for Data Set 2. The results of these calculations are shown in Table 3 of Marshall and Weiss (2006). Again carrying over from Data Set 1 we take  and .

 


 

Data analysis in WinBugs

 

The data summarized and described above was analyzed using the Bayesian analysis of variance model in the software package WinBugs. WinBugs is the Windows version of the BUGS (Bayesian inference Using Gibbs Sampling) software package and is freely available on the Internet at http://www.mrc-bsu.cam.ac.uk/bugs. Following are the WinBigs model files for calculating posterior distributions of the  parameters for the microsatellite data given earlier for Data Sets 1 and 2:

                                                        

WinBugs model files for Data Sets 1 and 2

bayesSelectDataSet1.txt

bayesSelectDataSet2.txt

WinBugs files for calculation posterior distributions of  parameters

 

WinBugs is an excellent program for computing posterior distributions under a defined Bayesian model, however it is not entirely straightforward to use and has a few quirks in its execution. If you want to sidetrack the Users’ Manual then the following instructions are given on how to alter and execute the WinBugs code from Data Sets 1 and 2, and how to modify the initial values and prior specifications to perform a new analysis.

 

 

Model specification

The model section of the code essentially specifies the two-way analysis of variance model as defined in equations (3) through (8) in the Marshall and Weiss (2006). The only alterations that need to be made for a new data set are correcting the number of alleles being analyzed (numAlleles, line 3), the number of populations being analyzed (numPops, line 4) and the number of loci in the genome at which microsatellite allele size data has been determined (numLoci, line 5).

 

model

{

        numAlleles = 6554

        numPops = 5

        numLoci = 21

        for(k in 1:numAlleles) {

                y[k] ~ dnorm(mu[population[k], locus[k]], Vinv[population[k], locus[k]])

        }

        for(i in 1:numPops) {

                for(j in 1:numLoci) {

                        mu[i, j] <- mu0 + gamma[i] + delta[j] + rho[i, j]

                        Vinv[i, j] <- 1/(exp(lnV0 + psi[i] + phi[j] + alpha[i, j]))

                        alpha[i, j] ~ dnorm(0,tau.alpha)

                        rho[i, j] ~ dnorm(0,tau.rho)

                }

                gamma[i] ~ dnorm(0, tau.gamma)

                psi[i] ~ dnorm(0, tau.psi)

        }

        for(j in 1:numLoci) {

                delta[j] ~ dnorm(0, tau.delta)

                phi[j] ~ dnorm(0, tau.phi)

        }

        tau.gamma ~ dgamma(tau.gamma.a,tau.gamma.b)

        tau.delta ~ dgamma(tau.delta.a,tau.delta.b)

        tau.psi ~ dgamma(tau.psi.a,tau.psi.b)

        tau.phi ~ dgamma(tau.phi.a,tau.phi.b)

        tau.alpha ~ dgamma(tau.alpha.a,tau.alpha.b)

        tau.rho ~ dgamma(tau.rho.a,tau.rho.b)

}

 

 

Prior specification

The Prior section of the code specifies prior parameters for the model as suggested in the “Sources of prior information” section of Materials and Methods and the “Prior distributions of precision parameters” section of Results in Marshall and Weiss (2006). Most of these priors are applicable across data sets. Exceptions are:

  • mu0 which should be set to the mean of the microsatellite allele sizes for the data set,
  • lnV0 which should be set to the mean of the natural logarithm of variance in microsatellite allele size for each locus and population in the data set,
  • tau.delta.b which should be set to 2.5 times the overall variance in microsatellite allele sizes in the data set, and
  • tau.psi.b which should be set to 2.5 times the variance of the natural logarithm of effective population sizes for the populations being analyzed.

 

Prior:

 

list(mu0=115, lnV0=2.58, tau.gamma.a=2.5, tau.gamma.b=2.5, tau.delta.a=2.5, tau.delta.b=3025, tau.psi.a=2.5, tau.psi.b=0.365, tau.phi.a=2.5, tau.phi.b=0.828, tau.alpha.a=2.5, tau.alpha.b=0.0408, tau.rho.a=2.5, tau.rho.b=2.5)

 

 

Initial values

The Initial values section of the code specifies starting values for the Markov chain used to generate posterior distributions for the model parameters. The only requirement is that the initial values given are within the distribution space of these parameters, and usually zeros and ones are sufficient. The values given below are transferable to other data sets and the only alteration that needs to be made is to change the dimensions of the vectors and matrices so that they reflect the number of loci and populations in the data:

  • delta and phi should be vectors of zeros with length equal to the number of loci being analyzed,
  • gamma and psi should be vectors of zeros with length equal to the number of populations, and
  • alpha and rho should be matrices of zeros with dimensions equal to numPops by numLoci.

 

Initial values:

 

list(

delta=c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0),

phi=c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0),

gamma=c(0,0,0,0,0),

psi=c(0,0,0,0,0),

alpha=structure(.Data=c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0), .Dim=c(5,21)),

rho=structure(.Data=c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0), .Dim=c(5,21)),

tau.gamma=1,tau.delta=1,tau.psi=1,tau.phi=1,tau.alpha=1,tau.rho=1)

 

 

Data structure

The data should be formatted analogously to that given for Data Set 1. There should be one file recording all of the microsatellite allele sizes that have been measured (in this case, yDataSet1.txt). Each measured microsatellite allele should be recorded numerically and given its own line. The file should begin with the text y[] and end with the text END. Then there should be one file (in this case, locusDataSet1.txt) indexing the locus at which each microsatellite allele has been measured. This file should begin with the text locus[] and end with the text END. Finally, there should be another file (in this case, populationDataSet1.txt) indexing the population that each microsatellite allele belongs to. This file should begin with the text population[] and end with the text END. Loci should be numbered from one to numLoci and populations should be numbered from one to numPops.

 

 

Running WinBugs

Once the model, prior parameters and initial values have been specified and the data has been formatted correctly then we are ready to estimate the posterior distributions of the population-locus interaction terms using WinBugs and to estimate which loci may have been the targets of selection. To do this, install and run WinBugs. Then open the model file (bayesSelectDataSet1.txt), allele size file (yDataSet1.txt), locus file (locusDataSet1.txt) and population file (populationDataSet1.txt) using the File>Open tab. Next, bring up the Specification Tool window by using the Model>Specification tab.

 

To specify the model, open the model file window (bayesSelectDataSet1.txt) and highlight the word "model," then return to the Specification Tool window and click on "check model." To specify the prior parameters, return to the model file window and highlight the word "list" under "Prior:," then return to the Specification Tool window and click on "load data." To load the microsatellite data, open the microsatellite allele size window (yDataSet1.txt) and highlight the letter "y," then return to the Specification Tool window and click on "load data." Repeat this procedure for the loci and populations being analyzed - open the locus window (locusDataSet1.txt) and highlight the word "locus," then return to the Specification Tool window and click on "load data," then open the population window (populationDataSet1.txt) and highlight the word "population," return to the Specification Tool window and click on "load data." The model is now ready to be compiled by WinBugs and so you can click on "compile." Finally, to specify the initial values for the Markov chain, return to the model file window and highlight the word "list" under "Initial values:," then return to the Specification Tool window and click on "load inits." The Markov chain is now initialized and ready to be run.

 

Before running the iterations of the Markov chain, first it is necessary to specify the parameters that we want to monitor. Bring up the Sample Monitor Tool window using the Inference>Samples tab. To monitor the  parameters, enter the word "alpha" in the node field, click on "set," enter "*" in the node field and enter "4001" in the beg field. Now the Markov chain iterations are ready to be run. Bring up the Update Tool window using the Model>Update tab, enter "200000" in the updates field and click on "update" to begin the iterations. For a data set of this scale, 200,000 iterations will take between one and two hours so this would be a good time to leave the computer for a while. The updating process can be paused at any time by clicking on the "update" button and then be resumed by clicking on the "update" button once again.

 

 

Making inferences in WinBugs

Although WinBugs is primarily designed to estimate posterior distributions of model parameters, it also has some features that allow you to make inferences. To check whether the model has been run through enough iterations of the Markov chain, you can look at the posterior densities of the parameters being monitored by bringing up the Sample Monitor Tool and clicking on “density.” The posterior densities should be smooth and normally distributed, and if not then the number of iterations should be increased and the model should be updated again. To generate summary statistics of the posterior distributions for each of the monitored parameters, bring up the Sample Monitor Tool and click on “stats.” This will generate a table with the mean, standard deviation, median and 2.5% and 97.5% quartiles for each of the posterior distributions being monitored. To generate caterpillar plots of the posterior distributions for a set of monitored parameters use the Inference>Compare tab to bring up the Comparison Tool window, enter "alpha" in the node field, "4001" in the beg field and click on “caterpillar.” This will make it easy at a glance to see which loci in which populations are candidates for selection. Also, to narrow the analysis down to a single locus (say, locus one), you can enter "alpha[,1]" in the node field and click on “caterpillar.” This will generate the same caterpillar plot as before but only show posterior distributions for the first locus, as appropriate for making inferences on selection acting at this locus.

 

The analysis and inference options in WinBugs are restrictive, and to obtain other information from the posterior densities, or to display the posterior densities differently, it is necessary to analyze the data in another software package (such as R). To output the data from WinBugs, bring up the Sample Monitor Tool window and click on “coda.” This will generate two new windows – a CODA index window and a CODA for chain 1 window. These can be saved simply by using the File>Save As tab. To use the CODA files with the software package R, then it is a good idea to save the CODA index file as “codaIndex.txt” and the CODA for chain 1 file as “coda1.txt.” These can then be loaded in R using the commands described in the next section.

 


 

Posterior distributions of precision parameters for Data Sets 1 and 2

 

The posterior distributions of the  parameters are of most interest to us as they are used directly in the inference of selection, however posterior distributions of the precision parameters for Data Sets 1 and 2 are also relevant to model convergence, accuracy of  posteriors and sufficiency of prior specifications. These posterior distributions are only discussed briefly in terms of their smoothness and consistency with a gamma distribution in the Results section (Marshall and Weiss 2006) and so here we go into more detail.

 

 

Posterior distributions of precision parameters for Data Set 1

The posterior distributions of the precision parameters for Data Set 1 are shown below, along with the corresponding prior distributions. The prior and posterior distributions for the mean component parameters ,  and  are very similar, suggesting there is little information in the data set for estimating these parameters and so it is important to have good priors in these cases. The prior and posterior distributions for  also correspond very well, suggesting there is little information in the data set for this parameter as well.

 

The posterior distribution of the precision parameter  differs moderately from the prior distribution in that  has a posterior mean of approximately 0.883 whereas it has a prior mean of 3.02. This suggests either that microsatellite mutation rates for An. gambiae vary by slightly more than a factor of 10, or that the locus-specific variance component in microsatellite allele size is not completely accounted for by variance in microsatellite mutation rates. The former interpretation seems likely as Harr et al. (1998) report that microsatellite mutation rates vary by at least an order of magnitude in the genomes of D. melanogaster and D. simulans and hence 10-fold variation is a conservative estimate for An. gambiae. Nevertheless, the data set seems to be very informative for this parameter and so such a small discrepancy is acceptable.

 

Finally, the prior and posterior distributions for the precision parameter  are significantly different in terms of their mean, however the posterior distribution is much more precise, suggesting that there is much more information in the data set for this parameter than in the prior. The  precision parameter is of great interest to us since the posterior distributions of the  parameters are distributed as  and are ultimately used in determining which loci have likely been the targets of recent selective events.

 

Prior and posterior distributions of  parameters for Data Set 1

 

 

Posterior distributions of precision parameters for Data Set 2

The posterior distributions for the precision parameters are shown below, along with the corresponding prior distributions. The prior and posterior distributions for all of the precision parameters are very similar this time, thus confirming the comparability of loci and populations across data sets. Of special note, the posterior distribution of the precision parameter  is moderately smaller than the prior distribution taken from the Data Set 1 posterior, suggesting that there is a slightly larger range of  values in Data Set 2 than in Data Set 1. The  precision parameter is again of great interest to us since the posterior distributions of the  parameters are ultimately used to determine which loci have likely been the targets of recent selective events.

 

Prior and posterior distributions of  parameters for Data Set 2

 

 

Making inferences from posterior distributions in R

As previously mentioned, the analysis and inference options in WinBugs are restrictive, and so in order to obtain a wider variety of inferences and plots, it is necessary to analyze and process the output data in another software package. The statistical computing software package R is an excellent program for statistical analysis and plotting data and is freely available online at  http://www.r-project.org/. Once the posterior distributions from a Bayesian analysis in WinBugs are saved as a CODA index file and chain 1 file then R script files can be used to compare posterior distributions by plotting graphs or performing calculations. The following table lists scripts that were used in R to:

  • plot posterior distributions of  parameters,
  • calculate posterior probabilities that candidate  parameters are the smallest for a particular locus j, and
  • plot prior and posterior distributions of precision parameters for Data Set 1.

 

R script files for analyzing Data Set 1

catPlotDataSet1.r

prMinAlphaDataSet1.r

priorPostPlotDataSet1.r

R files for plotting posterior distributions and calculating posterior probabilities

 

These files are only given as an example of how you could analyze your own output data in R and there are many other commands and libraries that you could draw upon to do so. The R script catPlotDataSet1.r is a basic script that loads CODA data from WinBugs using the rbugs library file getBugsOutput and then plots a posterior density distribution at one of the candidate loci using the plot and lines commands in the car library. The script then plots three caterpillar plots comparing  parameters at three candidate loci using the segments command. The candidate loci are determined by first looking at the WinBugs output.

 

The R script prMinAlphaDataSet1.r is a basic script that loads CODA data from WinBugs using the rbugs library file getBugsOutput and then calculates posterior probabilities that certain populations have the smallest  parameters at their respective candidate loci. The posterior probabilities are calculated by summing all of the entries for which the minimum  condition is satisfied, and then dividing by the number of entries in the CODA file. Once again, the candidate loci and populations are determined by looking at caterpillar plots in WinBugs beforehand.

 

Finally, the R script priorPostPlotDataSet1.r loads CODA data from WinBugs using the getBugsOutput file and then calculates prior and posterior densities for each of the precision parameters using the plot and lines commands in the car library. Here, the parameters of the prior distributions have been input manually. Once again, there are many other ways that the same analyses could be done in R, but these files are just given as an example that you could alter to sketch your own prior and posterior distributions.

 


 

Coalescent simulations

 

Coalescent simulations were performed following the framework proposed by Hudson (1990). Hudson (1990) used Monte Carlo methods to simulate lineages of sampled alleles. In these simulations, first the genealogy is generated and then a Poisson number of mutations are added along the branch proportional to the length of the branch above the node. The simulations are computationally efficient because only nodes representing sampled microsatellite alleles are involved in the simulations.

 

Modifications to these simulations were made so that the output would be consistent with the requirements of the Bayesian selection article (Marshall and Weiss 2006). For example, the number of populations is also a variable, selection can be specified in addition to population bottlenecks, and the results of the simulations are output as raw allele size data with corresponding locus and population indices. The bottleneck scheme of Hudson (1990) is used, and selection is modeled as a bottleneck at a single locus in a single population. Some of the modifications made to Hudson’s (1990) code follow closely those made by Schlötterer (2001). For example, the microsatellite mutation scheme and the variation of  to model locus-specificity of mutation rate both closely follow adaptations to Hudson’s (1990) code used by Schlötterer (2001) to test the power of the lnRV statistic.

 

Inputs to the simulation program include the number of populations, number of loci, sample size per population, and average -values of every population being simulated. The sample size of each population is assumed to be identical. Also, the factor by which diversity is reduced during selection is a parameter, along with the time since the selective sweep (in units of 2Ne generations, where Ne is the effective population size of the selected population). The number of selected populations is also a parameter (selection occurs at locus one in consecutively numbered populations beginning with population one). With regard to demographic events, the factor by which diversity is reduced due to a population bottleneck is a parameter, along with the time since the population bottleneck (also in units of 2Ne generations), and the number of populations affected by the bottleneck. In this case, the bottleneck affects consecutively numbered populations beginning with population two, and all populations are assumed to be affected by a bottleneck of the same strength and time.

 

Other parameters are not given as options, but are instead set constant within the code. For example, -values are varied by a factor of 10 across loci to reflect the locus specificity or microsatellite mutation rates and following studies by Harr et al. (1998), Weber and Wong (1993) and Dib et al. (1996) that mutation rates differ by an order of magnitude even within a single species. 80% of mutations were simulated using the unbiased stepwise microsatellite mutation model (Ohta and Kimura 1973; Goldstein et al. 1995) and the remaining 20% of mutations were simulated using the two-phase model of microsatellite mutation (Di Rienzo et al. 1994). In the two-phase model, the number of repeats gained or lost was uniformly distributed between one and three to reflect observations that microsatellite mutations are not confined to single repeat unit changes (Wierdl et al. 1997; Brinkmann et al. 1998; Harr and Schlötterer 2000).

 

The coalescent simulations described in this section are relatively simplistic yet they are sufficient to test the functionality of the Bayesian method of detecting recent selective sweeps. There are a number of restrictions however, for example the program can only simulate microsatellite loci under this mutation model, and there is no allowance for recombination between loci and the inclusion of complex demographic events such as migration and population admixture.

 

C source code and executable file for generating coalescent simulation data

coalSimBayesSelect.c

coalSimBayesSelect.exe

Files for generating coalescent simulation data

 

The default simulation parameters for the Bayesian selection simulations (Marshall and Weiss 2006) consisted of five independent populations with five loci and 40 individuals sampled from each population. The average -values for each population were varied to reflect differing effective population sizes (). Then a single selective sweep was simulated at locus one in population one occurring  generations prior to the time of sampling and reducing the diversity at that locus to 0.01 times its original diversity. Sample data generated under these parameters is given below. Inspecting the first 40 lines of the allele size file mslength_file.txt will show the allele sizes for population one at locus one which will have relatively little diversity compared to the rest of the microsatellite allele size data due to the recent selection event.

 

Sample data set generated by coalescent simulation program

loci_file.txt

pop_file.txt

mslength_file.txt

Simulated allele size data under default model parameters

 


 

Prior specifications for coalescent simulation data

 

Again, the prior specifications used for the Bayesian computation for the coalescent simulation data are only briefly discussed, with general details being given in the Materials and Methods section (Marshall and Weiss 2006). For the interested reader, a more detailed description of how these specifications were calculated is given below:

 

The parameter  approximately corresponds to the precision of the natural logarithm of effective population sizes for the populations being analyzed. Noting that effective population size, , is proportional to the  parameter in the coalescent simulations and that when there were five populations in the coalescent simulations  followed the proportions  then the precision parameter for these simulations is equal to  and can be used as an estimate of the mean for the prior distribution of  values. We now need one more parameter to complete the specification for the prior distribution of  values. Taking the mode of this distribution to be  times the mean, we find , taking the mode to be  times the mean, we find , and taking the prior information to be worth five observations, we find . We choose  since it is the middle of these values, from which it follows that .

 

The parameter  corresponds to the precision of the natural logarithm of microsatellite mutation rates for the microsatellite loci being analyzed. In the coalescent simulations performed, microsatellite mutation rates were allowed to vary tenfold (Harr et al. 1998; Weber and Wong 1993; Dib et al. 1996). This corresponds to a range in  values of  and an estimate of the mean for the prior distribution of  values of . Assuming the prior information to be worth five observations, we find that  and .

 

The parameter  is of great interest as it corresponds to the precision of the natural logarithm of fractional reductions in microsatellite allele size variance due to selection, which is the main parameter of interest that we are trying to estimate in this model. Postulating that 95% of the  values lie between 0.6 and one leads to an estimate for the mean of the distribution of  values of . Assuming again that  gives .

 

Considering the mean components part of the model, the parameters ,  and  represent the contribution of population effects, locus effects and population-locus interaction effects respectively to the precision of  values. Since the population and population-locus specific terms are expected to contribute very little to the  values, then in the absence of a good source of prior information we can approximate the means of their prior distributions to be relatively large (). Once again, assuming prior information to be worth five observations, we have  and hence .

 

Consequently, locus effects will contribute almost entirely to the  values and so we can estimate the precision component  by the overall precision in microsatellite allele sizes in the populations and loci being considered (). Also, since we are dealing with simulation data, then the data used to calculate the estimate for this component can be from a different simulation than that being analyzed. Assuming prior information to be worth five observations, we have  and hence .

 

For the portion of the model describing the mean of the microsatellite allele sizes, we require  to be approximately equal to the mean of the  values for our simulation data (). Similarly, for the portion of the model describing the variance in microsatellite allele sizes we require  to be approximately equal to the mean of the  values for our simulation data (). Summarizing these prior specifications in the form of a table, we have:

 

Parameter

Prior information source

Point estimate for prior distribution

Gamma distribution shape parameter

Gamma distribution rate parameter

Simulation data

100

N/A

N/A

Slatkin 1995

1

2.5

2.5

Simulation data

0.0143

2.5

174

Slatkin 1995

1

2.5

2.5

Simulation data

3.16

N/A

N/A

Simulation parameters

5.87

2.5

0.426

Simulation parameters

3.02

2.5

0.828

-

61.3

2.5

0.0408

Prior parameters for precision parameters for coalescent simulations

 


 

Summary of analyses of coalescent simulation data

 

To test the ability of the Bayesian method to detect recent selective sweeps it would be ideal to perform a power analysis in which the fraction of analyses correctly detecting the selected population is recorded. Unfortunately, due to the large number of parameters that characterize the coalescent simulations and the time taken for a Bayesian analysis to converge (~20 minutes on a 1.73GHz Intel Pentium processor for the default parameter set), this was not possible. Therefore, instead of performing a comprehensive power analysis, a single Bayesian analysis was performed for each set of simulation parameters considered. From these results, general trends of the ability of the method to detect selection can be seen, however these results are only crude since a single simulation could be anomalous and this would be identified in a more comprehensive power analysis.

 

Simulation data was analyzed for sample sizes between ten and 40 with between two and 20 loci and between two and 20 populations. The default simulation consisted of five loci, five populations and 40 individuals per population. Selection occurred in population one 0.02Ne generations in the past reducing diversity at locus one by multiplying it by a factor of 0.01. In other simulations, the time of selection was varied between 0.02Ne and 0.2Ne generations in the past, reducing diversity at locus one by multiplying it by a factor between 0.01 and 0.1 and affecting between one and three populations. The number of bottlenecks was varied between zero and four, and the time and strength of the bottleneck was also varied. The following sections summarize the results of these simulations:

 

 

Dependence on the number of sampled individuals

To investigate the effect of sample size on the probability of detection, we varied the number of individuals sampled from each population between ten and 40. Reducing the number of sampled individuals per population tended to increase the posterior standard deviation  for the selected population and locus and resulted in the posterior  parameters for the unselected populations and loci being distributed more widely as signified by a larger value of  for smaller sample sizes. This means that inference of selection is more difficult as sample size is decreased. Despite this, the posterior  distribution at the selected population and locus was sufficiently negative in all cases that, even with as few as ten sampled individuals per population, selection was still evident (e.g. Figure 1A in Marshall and Weiss (2006)).

 

Number of sampled individuals

 

10

-3.79

0.906

-0.159

1.43

1

20

-4.24

0.764

-0.0265

1.04

1

40

-3.34

0.669

0.000676

0.764

0.9997

Summary of coalescent simulations showing dependence on number of sampled individuals

 

 

Dependence on the number of loci and populations

To investigate the effect of the number of loci and populations on the probability of detection we varied the number of loci and populations between two and 20. From the simulated data analyzed, the Bayesian model performed better with five or ten loci and two, five or ten populations. Within this parameter range, selection was consistently identified and there was an absence of false positive results (e.g. Figure 1B in Marshall and Weiss (2006) shows a convincing case for selection with ten loci and five populations). Outside this parameter range, the selected loci and populations still tended to be detected, however unselected populations also began to show hints of selection. When two loci were used in the simulations, the posterior unselected deviation for unselected populations and loci  was high, leading to an increase in the false positive rate. For example, with two populations and two loci selection was correctly detected at locus one in population one but also suggested at locus two in population two. Also, with five populations and two loci the selected locus was correctly detected but selection was also suggested at locus two in population two. The Bayesian method seems to be able to handle data with only two populations, because with two populations and five or ten loci selection was correctly identified and there was an absence of false positive results. This is reassuring with respect to the lnRV statistic, which is pair-wise and so only utilizes data from two populations. As more and loci are added (up to 20) and more populations (up to 20) then the posterior unselected deviation of the  parameters tends to increase again and more unselected populations begin to show hints of selection, suggesting an optimum number of loci greater than two and less than 20 and an optimum number of populations less than 20.

 

 

Number of populations

2 loci

5 loci

10 loci

20 loci

2

-1.81

-2.94

-1.62

 

5

-5.09

-3.34

-4.74

-2.89

10

-5.10

-4.77

-3.01

 

20

 

-3.23

 

 

2

0.656

1.09

0.756

 

5

0.817

0.669

0.537

0.434

10

0.726

0.538

0.407

 

20

 

0.450

 

 

2

0.344

0.0649

0.0954

 

5

0.0994

0.000676

0.0321

-0.0129

10

-0.0452

0.0952

0.0230

 

20

 

0.0266

 

 

2

1.02

0.851

0.711

 

5

1.31

0.764

0.934

0.892

10

1.02

0.793

0.760

 

20

 

0.833

 

 

2

0.9998

0.998

0.9999

 

5

0.9997

0.9997

1

1

10

1

1

0.99995

 

20

 

1

 

 

Summary of coalescent simulations showing dependence on number of loci and populations

 

 

Dependence on the time and strength of selection

Over time, the signature of a selective sweep is obscured by mutation (Wiehe 1998) and so it is of interest to know how recent a selective sweep must be in order to be detected by the Bayesian method and how this time varies with the strength of the selective sweep. To investigate this, we ran coalescent simulations varying the fraction of locus diversity retained following selection (i.e. the strength of selection) between 0.01 and 0.1 and the time of selection between 0.02Ne and 0.2Ne generations ago. As expected, when the strength of selection was 0.01 and the selective event occurred 0.02Ne generations ago, the case for selection was very clear (Figure 1E in Marshall and Weiss (2006)). For our simulations, this was also the case for a selective event 0.1Ne generations in the past (Figure 1D in Marshall and Weiss (2006)), however, as the selective sweep became weaker and older the case for selection became weaker and a number of false positive cases for selection emerged. For a selective sweep of strength 0.05 that occurred 0.02Ne generations ago the signature of selection at locus one was still very strong albeit a little weaker (Figure 1F in Marshall and Weiss (2006)) and there were a number of false candidates for selection at other loci such as locus three in population two. As the strength of selection was reduced to 0.1 for a selective event 0.02Ne generations ago, selection at locus one was still detected, but the posterior distribution of  was smaller while not even being selected. From the table below, it can be seen that the posterior mean of the selected population and locus  becomes less divergent as the time since selection increases and the strength of selection decreases, and hence selection is more difficult to detect. With selection of strength 0.05 occurring 0.1Ne generations ago there is no case for selection, and with selection occurring a long time ago (0.2Ne generations in the past) of strength 0.01, the case for selection is again weak ( is smallest for locus one, but  is even smaller and not even selected and there are another two false candidates for selection). This trend continues as selection occurs further in the past and the strength of selection is reduced.

 

 

Strength of selection

Selection 0.02Ne generations ago

Selection 0.1Ne generations ago

Selection 0.2Ne generations ago

0.01

-3.34

-2.43

-1.60

0.05

-2.70

-0.105

-1.73

0.1

-1.13

-1.37

-2.39

0.01

0.669

0.553

0.556

0.05

0.570

0.345

0.519

0.1

0.455

0.478

0.543

0.01

0.000676

0.0122

-0.0117

0.05

0.0132

0.0239

-0.081

0.1

-0.128

-0.0129

-0.0539

0.01

0.764

0.875

0.905

0.05

0.844

0.549

0.839

0.1

0.768

0.838

0.943

0.01

0.9997

0.9993

0.991

0.05

1

0.157

0.977

0.1

0.939

0.994

0.998

Summary of coalescent simulations showing dependence on time and strength of selection

 

 

Dependence on the number of selected populations

To investigate the ability of the Bayesian method to detect more than one population that a locus has been selected in we ran simulations varying the number of selected populations between one and three from a total of five populations. When the locus was selected in one or two populations, then this was detected perfectly by the Bayesian method. With two selected populations, the posterior selected  distributions were slightly less negative and the posterior unselected  distributions had a larger spread about their mean  but selection was still clear and there were no false positives. This is shown nicely in Figure 1G of Marshall and Weiss (2006). By contrast, when locus one was selected in three populations, selection was only suggested by the Bayesian method in two of these (populations two and three). The suggestion of these analyses, although it cannot be made authoritatively in the general case, is that selection can only be reliably inferred in half or less than half of the populations being analyzed.

 

Populations (I) where locus 1 is selected

 

1

-3.34

0.669

0.000676

0.764

0.9997

1,2

-2.37, -2.93

0.648, 0.591

-0.00336

1.19

0.99996

1,2,3

0.349, -1.85, -1.39

0.519, 0.560, 0.519

0.0307

0.748

0.714

Summary of coalescent simulations showing dependence on number of selected populations

 

 

Dependence on the number of bottlenecked populations

To investigate the ability of the Bayesian method to detect selection within a background of demographic events, we performed simulations in which population one underwent a selective sweep and between zero and four of the remaining populations were subjected to a recent and strong population bottleneck. The most encouraging result of these analyses was that selection was identified in all cases. When two populations underwent a bottleneck, there was a single false positive case for selection, but when one, three and four populations were bottlenecked, selection was clear and there were no false positives. The case of three recent and severe population bottlenecks is shown in Figure 1H (Marshall and Weiss 2006). The posterior unselected deviation is a little larger for the case of three population bottlenecks, but despite this, selection is still clearly evident.

 

Number of bottlenecked populations

 

0

-3.34

0.669

0.000676

0.764

0.9997

1

-2.92

0.604

0.151

1.04

0.995

2

-1.81

0.482

0.0203

0.666

0.996

3

-3.17

0.630

-0.00802

0.999

1

4

-3.07

0.711

-0.0753

1.14

1

Summary of coalescent simulations showing dependence on number of bottlenecked populations

 

 

Dependence on the time and strength of a population bottleneck

Another issue with population bottlenecks is whether selection is still detectable when a population bottleneck occurs in the same population as selection does. A series of simulations were run where selection occurred in population one and selection and a bottleneck occurred in population two. Selection was constant and occurred 0.02Ne generations in the past with strength 0.01 while the fraction of allelic diversity retained at all loci following a bottleneck (i.e. the strength of the bottleneck) was varied between 0.01 and 0.1 in proportion to the time of the bottleneck which was varied between 0.02Ne and 0.2Ne generations in the past. Consistent with expectation, selection in population two was obscured as the bottleneck became stronger and more recent in this population, however selection in population one was relatively unaffected by the bottleneck in population two. When the strength of the bottleneck was 0.01 there was no evidence for selection in population two, when the strength was 0.05 there was moderate evidence for selection in population two (Figure 1I in Marshall and Weiss (2006)), and when the strength of the bottleneck was 0.1 there was moderate evidence for selection in population two. There was a clear case for selection in population one throughout the simulations.

 

Time and strength of bottleneck

 

0.01 strength bottleneck 0.02Ne generations ago

-2.42, 0.259

0.636, 0.718

-0.0380

1.18

0.261

0.05 strength bottleneck 0.1Ne generations ago

-3.09, -1.34

0.675, 0.710

0.00217

0.905

0.988

0.1 strength bottleneck 0.2Ne generations ago

-2.70, -1.43

0.733, 0.711

0.123

0.926

0.982

Summary of coalescent simulations showing dependence on time and strength of bottleneck

 


 

References

 

  • Brinkmann, B., M. Klintschar, F. Neuhuber, J. Huhne and B. Rolf, 1998 Mutation rate in human microsatellites: influence of the structure and length of the tandem repeat. Am. J. Hum. Genet. 62: 1408-1415.
  • Dib, C., S. Faure, C. Fizames, D. Samson, N. Drouot et al., 1996 A comprehensive genetic map of the human genome based on 5,264 microsatellites. Nature 380: 152-154.
  • Di Rienzo, A., A. C. Peterson, J. C. Garza, A. M. Valdes, M. Slatkin et al., 1994 Mutational processes of simple-sequence repeat loci in human populations. Proc. Natl. Acad. Sci. USA 91: 3166-3170.
  • Goldstein, D. B., A. Ruiz Lineares, L. L. Cavalli-Sforza, and M. W. Feldman, 1995 An evaluation of genetic distances for use with microsatellite loci. Genetics 139: 463-471.
  • Harr, B., and C. Schlötterer, 2000 Long microsatellite alleles in Drosophila melanogaster have a downward mutation bias and short persistence times, which cause their genome-wide underrepresentation. Genetics 155: 1213-1220.
  • Harr, B., B. Zangerl, G. Brem, and C. Schlötterer, 1998 Conservation of locus specific microsatellite variability across species: a comparison of two Drosophila sibling species D. melanogaster and D. simulans. Mol. Biol. Evol. 15: 176-184.
  • Hudson, R. R., 1990 Gene genealogies and the coalescent process, pp. 1-44 in Oxford surveys in evolutionary biology, edited by D. Futuyama and J. Antonovics. Oxford University Press, Oxford.
  • Lanzaro, G., Y. Toure, J. Carnahan, L. Zheng, G. Dolo et al., 1998 Complexities in the genetic structure of Anopheles gambiae populations in West Africa as revealed by microsatellite DNA analysis. Proc. Natl. Acad. Sci. USA 95: 14260-14265.
  • Marshall, J., and R. Weiss, 2006 A Bayesian Heterogeneous Analysis of Variance Approach to Inferring Recent Selective Sweeps. In press.
  • Ohta, T., and M. Kimura, 1973 A model of mutation appropriate to estimate the number of electrophoretically detectable alleles in a finite population. Genet. Res. 22: 201-204.
  • Schlötterer, C., 2001 A microsatellite-based multilocus screen for the identification of local selective sweeps. Genetics 160: 753-763.
  • Taylor, C., and N. Manoukis, 2003. Effective population size in relation to genetic modification of Anopholes gambiae sensu stricto, pp. 133-146 in Ecological aspects for application of genetically modified mosquitoes, edited by W. Takken and T. W. Scott. Wageningen, The Netherlands.
  • Weber, J. L., and C. Wong, 1993 Mutation of human short tandem repeats. Hum. Mol. Genet. 2: 1123-1128.
  • Wiehe, T., 1998 The effect of selective sweeps on the variance of allele distribution of a linked multi-allele locus-hitchhiking of microsatellites. Theor. Popul. Biol. 53: 272-283.
  • Wierdl, M., M. Dominska and T. D. Petes, 1997 Microsatellite instability in yeast: dependence on the length of the microsatellite. Genetics 146: 769-779.

 


 

John Marshall, Department of Biomathematics, University of California, Los Angeles

 

Last edited on 12.05.06