Estimating Pi (nucleotide diversity) for a large number of populations

If you want to estimate diversity statistics such as ‘pi’ for a large number of populations stored together in a single data set, this tutorial may help you.

What we have:
– .vcf format data file
– .pop file (unique names of pops, one per line)
– .map file (individual to population mapping file — 2 columns)
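To make the expected inputs concrete, here is a minimal sketch that builds toy versions of the two text files and checks that they agree. The sample names, population names, tab-separated layout and the file name plants.map are all assumptions made for illustration; only plants.pop appears in the commands of this post.

```shell
# Toy inputs; all sample and population names here are hypothetical.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# .pop file: one unique population name per line
printf 'north\nsouth\n' > plants.pop

# .map file: individual ID, then population name (assumed column order)
printf 'ind1\tnorth\nind2\tnorth\nind3\tsouth\n' > plants.map

# Sanity check: the populations used in the map file match the .pop file
cut -f2 plants.map | sort -u > map_pops.txt
sort -u plants.pop > listed_pops.txt
if cmp -s map_pops.txt listed_pops.txt; then
    echo "pop and map files agree"
else
    echo "pop and map files differ"
fi
```

If the two lists differ, a population is probably misspelled in one of the files, and it is worth fixing that before going further.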

If you want to use vcftools, your first thought might be to subset the data into multiple vcfs, one per population. But this is entirely unnecessary. Let’s see how we can combine different flags to achieve the same result. This makes use of simple bash scripting and vcftools.

Subset the map file into one file per population

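# Assumes the map file is named plants.map; -w avoids pop1 also matching pop10
while read -r line; do
 grep -w "$line" plants.map > "$line.pop"
done < plants.pop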

If this worked, you now should have one .pop file per population containing mappings for individuals in that population.
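One way to confirm the split worked is to compare the number of lines in each per-population file with the number of individuals assigned to that population in the map file. The sketch below does this with a hypothetical helper, check_split, which is not part of vcftools. It assumes the population name is the second column of the map file.

```shell
# Hypothetical helper: compare each per-population file against the map file.
# $1 = map file (individual, population; assumed column order)
# $2 = file listing one population name per line
check_split() {
    map_file=$1
    pop_list=$2
    while read -r pop; do
        if [ ! -f "$pop.pop" ]; then
            echo "$pop: missing file $pop.pop"
            continue
        fi
        # Individuals assigned to this population in the map file
        expected=$(awk -v p="$pop" '$2 == p {n++} END {print n+0}' "$map_file")
        # Lines actually written to the per-population file
        found=$(awk 'END {print NR}' "$pop.pop")
        if [ "$expected" -eq "$found" ]; then
            echo "$pop: OK ($found individuals)"
        else
            echo "$pop: MISMATCH (expected $expected, found $found)"
        fi
    done < "$pop_list"
}
```

With the file names assumed here, it would be called as: check_split plants.map plants.pop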

Estimate ‘pi’ diversity stat

for p in *.pop; do
 [ "$p" = plants.pop ] && continue  # skip the master list of population names
 vcftools --vcf input.vcf --keep "$p" --site-pi --out "$p"
done

This subsets the input vcf for each population on the fly and estimates per-site pi, writing one .sites.pi output file per population.
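The per-site output can be condensed into one number per population. The sketch below uses a hypothetical helper, mean_pi, which averages the PI column of every .sites.pi file in the current directory. It assumes the usual vcftools layout of a header line followed by CHROM, POS and PI columns. Note that this is only the mean over sites present in the vcf; invariant sites missing from the vcf are not counted, so it is not a genome-wide estimate.

```shell
# Hypothetical helper: mean per-site pi for each *.sites.pi file.
# Prints: output prefix, number of sites used, mean pi (tab-separated).
mean_pi() {
    for f in *.sites.pi; do
        [ -e "$f" ] || continue   # nothing matched the glob
        awk -v name="${f%.sites.pi}" '
            # Skip the header line and any undefined values
            NR > 1 && $3 != "nan" && $3 != "-nan" { sum += $3; n++ }
            END {
                if (n > 0) printf "%s\t%d\t%.6f\n", name, n, sum / n
                else       printf "%s\t0\tNA\n", name
            }
        ' "$f"
    done
}
```

Running mean_pi in the directory holding the vcftools output prints one line per population, which can be redirected to a file for plotting or further comparison.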

