Estimating Pi (nucleotide diversity) for a large number of populations

If you want to estimate diversity statistics such as pi (nucleotide diversity) for a large number of populations that are all stored in a single data set, this tutorial may help you.

What we have:
– .vcf format data file (input.vcf in the examples)
– .pop file (unique names of pops, one per line; plants.pop in the examples)
– .map file (individual to population mapping file with 2 columns; plants.map in the examples)
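
Before going further, it can be worth checking that the two text files agree with each other. The sketch below is not part of the original workflow: check_pops is a hypothetical helper name, and it assumes the .map file lists the individual in column 1 and the population in column 2, separated by whitespace. It prints every population named in the .pop file that has no individuals in the .map file.

```shell
# Hypothetical helper: report populations from the .pop file ($1)
# that never appear in column 2 of the .map file ($2).
# Assumption: the map is whitespace-separated, individual then population.
check_pops() {
    awk 'NR == FNR { seen[$2] = 1; next }    # first file read: the map
         NF && !($1 in seen) { print $1 }    # second file read: the pop list
        ' "$2" "$1"
}

# Example usage with the file names from this tutorial:
# check_pops plants.pop plants.map
```

If this prints nothing, every population in the list has at least one individual in the map.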

If you want to use vcftools, your first thought might be to subset the data into multiple vcfs, one per population. But this is entirely unnecessary: vcftools can restrict an analysis to a listed set of individuals with --keep, so we can combine that flag with --site-pi to achieve the same result. All this takes is some simple bash scripting and vcftools.

Subset the map file into one file per population


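# Write one file per population; assumes column 2 of plants.map is the population name
while IFS= read -r line
do
 [ -n "$line" ] || continue   # skip blank lines
 awk -v pop="$line" '$2 == pop' plants.map > "$line.pop"   # exact match, so pop1 does not also catch pop10
done < plants.pop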

If this worked, you should now have one .pop file per population (in addition to plants.pop itself), each containing the map lines for the individuals in that population.
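
One way to confirm the split is to count how many individuals the map assigns to each population and compare those numbers with the line counts of the new files (for example from wc -l). This is only a sketch: pop_sizes is a hypothetical helper name, and it again assumes the population name is in column 2 of the map.

```shell
# Hypothetical helper: print "population<TAB>number of individuals"
# for every population found in column 2 of the map file ($1).
pop_sizes() {
    awk 'NF >= 2 { n[$2]++ }
         END { for (p in n) printf "%s\t%d\n", p, n[p] }' "$1" | sort
}

# Example usage with the file name from this tutorial:
# pop_sizes plants.map
```

Each count should match the number of lines in the corresponding per-population file.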

Estimate pi for each population


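for p in *.pop
do
 [ "$p" = plants.pop ] && continue   # skip the list of population names
 # --keep restricts the analysis to the individuals listed in $p
 vcftools --vcf input.vcf --keep "$p" --site-pi --out "${p%.pop}"
done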

This will subset the input vcf to the individuals of each population on the fly and estimate per-site pi, writing the results for each population to a file ending in .sites.pi.
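
The per-site output can then be condensed into one number per population. The sketch below is an addition, not part of the original workflow: mean_pi is a hypothetical helper name, and it assumes the .sites.pi file has a header line followed by CHROM, POS and PI columns. Note that it averages only over the sites present in the vcf, so it is not a genome-wide pi unless invariant sites are included in the input.

```shell
# Hypothetical helper: mean of the PI column (column 3) in one .sites.pi file.
# The header line is skipped, and non-numeric values such as "nan" are ignored.
mean_pi() {
    awk 'NR > 1 && $3 ~ /^[0-9.eE+-]+$/ { sum += $3; n++ }
         END { if (n) printf "%.6f\n", sum / n; else print "NA" }' "$1"
}

# Example usage: one line per population.
# for f in *.sites.pi; do printf '%s\t%s\n' "${f%.sites.pi}" "$(mean_pi "$f")"; done
```

A file with no usable sites yields NA rather than a division error.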
