Working on population genomic questions with next-generation sequencing data (GBS, to be specific), I find my unix server full of different iterations and sizes of the data set (in vcf format), produced by various filtering strategies. I have always religiously applied an arbitrary MAF filter (0.001) as the last step before converting a vcf file to the format needed for popgen analysis (structure, bayescan etc.).
When my data sets were large, this approach worked fine. So I didn’t think much of it. But some of the hypotheses I am recently testing required exclusion of a large number of individuals. Suddenly, even after applying the MAF filter, monomorphic loci started showing up in my downstream analysis files.
If only I had given the MAF filter a second thought. Even if you do not want to get rid of any rare alleles, the MAF threshold needs to be recalculated every time you remove or add individuals, because the lowest possible non-zero allele frequency depends on the sample size.
MAF = 1/(2n), where n is the number of diploid individuals
Say you have 156 individuals:
1/(156*2) = 0.003205128
This number is very different from an arbitrary 0.001. A singleton allele in 156 individuals has a frequency of about 0.0032, so a 0.001 cutoff lets every singleton (and every monomorphic call) through. As soon as I applied MAF = 0.0032, vcftools dropped 2678 loci (more than 50% of the original data set).
vcftools --vcf input.vcf --min-alleles 2 --max-alleles 2 --maf 0.0032 --recode
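The recalculation above is trivial to script. Here is a minimal sketch (the helper function name and ploidy default are my own, not from any library) that computes the singleton-frequency threshold for a given sample size, so you can recompute it each time the set of individuals changes:

```python
def singleton_maf(n_individuals, ploidy=2):
    """Frequency of a single allele copy among n individuals.

    For diploids this is 1/(2n): the smallest non-zero minor
    allele frequency the sample can contain. Any MAF cutoff
    below this value cannot remove singletons or monomorphic
    loci from that sample.
    """
    return 1.0 / (n_individuals * ploidy)

# Worked example from the post: 156 individuals.
threshold = singleton_maf(156)
print(round(threshold, 9))  # 0.003205128
```

Passing the resulting value to vcftools' --maf option (rounded up slightly, e.g. 0.0032) then excludes singletons; to keep singletons but still drop monomorphic loci, use any cutoff strictly between 0 and 1/(2n).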