If you want to estimate diversity statistics such as ‘pi’ for a large number of populations in a contiguous data set, this tutorial may help you.
What we have:
– .vcf format data file
– .pop file (unique names of pops, one per line)
– .map file (individual to population mapping file — 2 columns)
If you want to use vcftools, your first thought might be to subset the data into multiple vcfs, one per population. But this is entirely unnecessary. Let’s see how we can combine different flags to achieve the same result. This makes use of simple bash scripting and vcftools.
Subset the map file by population
# One mapping file per population; -w makes grep match whole words only
while read -r line; do
    grep -w "$line" plants.map > "$line.pop"
done < plants.pop
If this worked, you should now have one .pop file per population, each containing the mapping lines for the individuals in that population.
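One thing to watch: grep treats the population name as a pattern and searches the whole line, so depending on your names it can pick up lines that belong to a similarly named population. A stricter variant is sketched below. It assumes the map file has the individual ID in column 1 and the population name in column 2 (check yours), compares column 2 exactly, and writes only the individual IDs. The toy file contents and the .keep extension are made up for illustration.

```shell
# Toy inputs (hypothetical names) so the sketch runs on its own.
workdir=$(mktemp -d)
printf 'popA\npopB\n' > "$workdir/plants.pop"
printf 'ind1\tpopA\nind2\tpopA\nind3\tpopB\nind4\tpopAB\n' > "$workdir/plants.map"

# Assumption: column 1 = individual ID, column 2 = population name.
# Exact comparison on column 2, so popA does not also pick up popAB.
while read -r pop; do
    awk -v p="$pop" '$2 == p { print $1 }' "$workdir/plants.map" > "$workdir/$pop.keep"
done < "$workdir/plants.pop"

# Show what ended up in each file.
head "$workdir"/*.keep
```

If you go this route, point --keep at the .keep files in the next step. A side benefit of the different extension is that a later glob over the per-population files cannot accidentally include plants.pop itself.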
Estimate ‘pi’ diversity stat
for p in *.pop; do
    [ "$p" = plants.pop ] && continue   # skip the master list of population names
    vcftools --vcf input.vcf --keep "$p" --site-pi --out "$p"
done
This subsets the input VCF for each population on the fly and estimates per-site pi. vcftools appends .sites.pi to the --out prefix, so you end up with one output file per population.
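With many populations you will probably want one number per population rather than a pile of per-site files. The sketch below averages the PI column of each .sites.pi file, assuming the usual three-column layout (CHROM, POS, PI) with a header line. The toy file and its values are made up so the snippet runs without vcftools. Keep in mind that this is the mean over the sites present in the file, not a per-base-pair estimate across the genome.

```shell
# Toy stand-in for a vcftools --site-pi output file (values are made up).
workdir=$(mktemp -d)
printf 'CHROM\tPOS\tPI\nchr1\t100\t0.2\nchr1\t250\t0.4\nchr2\t80\t0.3\n' \
    > "$workdir/popA.pop.sites.pi"

# Mean of the PI column per file: skip the header and any non-numeric values.
for f in "$workdir"/*.sites.pi; do
    awk -v name="$(basename "$f" .sites.pi)" \
        'NR > 1 && $3 ~ /^[0-9.eE+-]+$/ { sum += $3; n++ }
         END { if (n) printf "%s\t%.4f\n", name, sum / n }' "$f"
done > "$workdir/mean_pi.tsv"

cat "$workdir/mean_pi.tsv"
```

To use it on real output, drop the toy file and point the loop at the directory holding your .sites.pi files.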