Nextgen data: How to keep monomorphic loci out

Working on population genomic questions with nextgen (GBS, to be specific) data, I find my Unix server full of different iterations and sizes of the data set (in VCF format), each the product of a different filtering strategy. I have always religiously applied an arbitrary minor allele frequency (MAF) filter of 0.001 as the last step before converting a VCF file to the format needed for popgen analysis (STRUCTURE, BayeScan etc.).

When my data sets were large, this approach worked fine, so I didn’t think much of it. But some of the hypotheses I have been testing recently required excluding a large number of individuals. Suddenly, even after applying the MAF filter, monomorphic loci started showing up in my downstream analysis files.

If only I had given the MAF filter a second thought. Even if you do not want to get rid of any rare alleles, the MAF threshold needs to be calculated afresh every time you remove or add individuals.

MAF = 1/(n*2), where n is the number of individuals (two alleles each)

Say, if you have 156 individuals: 1/(156*2) = 0.003205128
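
The same arithmetic can be wrapped in a small shell helper so the threshold is recomputed whenever the number of individuals changes. This is only a sketch: the function name maf_threshold is made up for illustration, and it assumes diploid individuals. It truncates to six decimals rather than rounding, on the understanding that vcftools keeps sites with MAF greater than or equal to the --maf value, so a threshold rounded up could drop loci where the minor allele occurs only once.

```shell
# Lowest MAF a locus can have if the minor allele occurs once
# among n diploid individuals: 1 / (2 * n).
# maf_threshold is a hypothetical helper name, not part of vcftools.
maf_threshold() {
    # Truncate (not round) to 6 decimals so the printed value never
    # exceeds the exact 1/(2n).
    awk -v n="$1" 'BEGIN { printf "%.6f\n", int(1000000 / (2 * n)) / 1000000 }'
}

maf_threshold 156   # prints 0.003205
```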

This number is very different from an arbitrary 0.001. As soon as I applied MAF=0.0032, vcftools dropped 2678 loci (more than 50% of the original data set).

Lesson learned.

vcftools --vcf input.vcf --min-alleles 2 --max-alleles 2 --maf 0.0032 --recode
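
To avoid reusing a stale threshold, the number of individuals can be read from the VCF itself. The sketch below counts the sample columns on the #CHROM header line and prints (without running) a vcftools command with a matching --maf value. The function names count_samples and maf_filter_cmd are hypothetical, and it assumes a plain-text, tab-delimited VCF of diploid samples from which the unwanted individuals have already been removed.

```shell
# Number of samples = columns after the 9 fixed VCF fields
# (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT) on the #CHROM line.
count_samples() {
    awk -F'\t' '/^#CHROM/ { print NF - 9; exit }' "$1"
}

# Print a vcftools command whose --maf is 1/(2n) for this file,
# truncated to 6 decimals. Both function names are illustrative only.
maf_filter_cmd() {
    n=$(count_samples "$1")
    maf=$(awk -v n="$n" 'BEGIN { printf "%.6f", int(1000000 / (2 * n)) / 1000000 }')
    echo "vcftools --vcf $1 --min-alleles 2 --max-alleles 2 --maf $maf --recode"
}
```

Calling maf_filter_cmd on a file prints the command for inspection before anything is run. If individuals are dropped with --keep or --remove in the same vcftools call, the header still lists every sample, so the count should come from the list of retained individuals instead.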

