Preparing BGC data files for Parental Pops: Allele Counts

For the BGC website, see: Bayesian Genomic Clines

BGC takes genotype data for parental and admixed populations as counts of alleles per locus, with a separate file for each population. First convert your data to STRUCTURE format and read it into R as an adegenet genind object. Then convert the genind object to a genpop object to obtain allele counts.

library(adegenet)

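# n.ind and n.loc are example values; set them to your own numbers of individuals and loci.
mygenind <- import2genind('input.str', onerowperind=FALSE, n.ind=10, n.loc=20, row.marknames=1, col.lab=1, col.pop=2, ask=FALSE)

# Pool individuals by population to get allele counts.
mygenpop <- genind2genpop(mygenind)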

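mygenpop@tab contains a table of allele counts with populations in rows and alleles in columns. Transpose it so that each row is one allele, save it, then read it back (the round trip is probably not necessary, but I hate dealing with atomic vector errors).

ct_mat <- t(mygenpop@tab)
# Store the allele names as a proper column so that ct$Allele exists after reading back.
write.table(data.frame(Allele=rownames(ct_mat), ct_mat), 'AllCountTable.txt', row.names=F, quote=F, sep='\t')
ct <- read.table('AllCountTable.txt', header=T)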
head(ct)

   Allele pop1 pop2 
1 L0001.1    7    3 
2 L0001.2    3    1 
3 L0002.1   10    4 
4 L0002.2    0    0 
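
If you would rather skip the file round trip, the same table can probably be built in memory. This is only a sketch: the small matrix `tab` is a made-up stand-in for mygenpop@tab (populations in rows, locus.allele names in columns), and `ct_direct` is a name I picked for illustration.

```r
# Toy stand-in for mygenpop@tab: populations in rows, alleles in columns.
tab <- matrix(c(7, 3, 10, 0,
                3, 1,  4, 0),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("pop1", "pop2"),
                              c("L0001.1", "L0001.2", "L0002.1", "L0002.2")))

# Transpose so that each row is one allele, and keep the allele names
# as an ordinary character column instead of row names.
ct_direct <- data.frame(Allele = colnames(tab), t(tab),
                        stringsAsFactors = FALSE)
rownames(ct_direct) <- NULL  # drop the row names inherited from the matrix

print(ct_direct)
```

With real data you would put mygenpop@tab in place of `tab`; the rest of the post works the same on `ct_direct` as on `ct`.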

We would like to retain the locus names from the first column (ct$Allele), which R probably does not recognize as a character vector yet. Let’s get that out of the way first.
ctall <- as.character(ct$Allele)

Split the locus names from the composite loc+allele names.
ctall_split <- lapply(strsplit(ctall, '.', fixed=TRUE), '[[', 1)

head(ctall_split)
[[1]]
[1] "L0001"

[[2]]
[1] "L0001"

[[3]]
[1] "L0002"

[[4]]
[1] "L0002"
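
Splitting on the first '.' works when locus names contain no dots of their own. If yours do, a possible alternative is to strip only the last ".allele" suffix with a regular expression. The names below are invented for illustration, and the sketch assumes the allele label itself contains no dot.

```r
# Hypothetical allele names; the second locus has dots inside its name.
allele_names <- c("L0001.1", "L0001.2", "chr1.pos5.1", "chr1.pos5.2")

# Remove the final dot and everything after it, leaving the locus name.
loc_of_allele <- sub("\\.[^.]*$", "", allele_names)

# unique() keeps one entry per locus, in order of first appearance.
locnames_alt <- unique(loc_of_allele)

print(loc_of_allele)
print(locnames_alt)
```

Both results are plain character vectors rather than lists, which makes them easier to drop into a data frame later.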

We need to get rid of duplicate entries, so we keep only every other entry. unlist() turns the result into a plain character vector.

locnames <- unlist(ctall_split[c(TRUE, FALSE)])
head(locnames)
[1] "L0001" "L0002"

Now select only the odd rows (the allele 1 count at each locus in each pop), then the even rows. This assumes that every locus has exactly two alleles in the table, i.e. that you do not have any monomorphic sites in your data. If you do, this and the previous step will pair counts with the wrong loci. Make sure you have exactly twice as many rows as you have loci. An odd number of rows is a clear indication that at least one locus has only one allele in your data. You will need to recreate the STRUCTURE file without such loci before proceeding again.

ct_odd <- ct[seq(1, length(ct$Allele), 2), ]
ct_evn <- ct[seq(2, length(ct$Allele), 2), ]
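
An even row count does not guarantee that every locus has two alleles, so it may be worth counting alleles per locus directly. Here is a minimal sketch of such a check on a made-up table; `ct_toy`, `bad_loci` and `ct_clean` are illustrative names, and dropping the offending rows in R is my shortcut, not a replacement for fixing the input file.

```r
# Toy allele table in which locus L0002 has a single allele.
ct_toy <- data.frame(
  Allele = c("L0001.1", "L0001.2", "L0002.1", "L0003.1", "L0003.2"),
  pop1   = c(7, 3, 10, 4, 6),
  stringsAsFactors = FALSE
)

# Locus name for each row: strip the trailing ".allele" suffix.
loc <- sub("\\.[^.]*$", "", ct_toy$Allele)

# Count rows (alleles) per locus and list loci that are not biallelic.
n_alleles <- table(loc)
bad_loci  <- names(n_alleles)[n_alleles != 2]
print(bad_loci)

# Optionally keep only rows belonging to biallelic loci.
ct_clean <- ct_toy[!(loc %in% bad_loci), ]
```

If `bad_loci` comes back empty, the odd/even split above should line up with the locus names.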

At this point, we have all the data we need, i.e. locus names and allele counts in each population. I will demonstrate putting all this information together for one population.

1. Create a vector of length equal to the number of loci and fill it with any symbol.
filler <- rep('#', length(locnames))

2. Create a new data frame by cbinding all components together.
df <- data.frame(locnames, filler, ct_odd$pop1, ct_evn$pop1)

3. Inspect the result.
head(df)
locnames filler ct_odd.pop1 ct_evn.pop1
1 loc01 # 7 3
2 loc02 # 8 7
3 loc03 # 3 5
4 loc04 # 4 2
5 loc05 # 2 6

4. Save this table
write.table(df, 'pop1_allct_bgc.txt', row.names=F, col.names=F, quote=F, sep='\t')

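5. Finally, open the saved table in vim and run these two substitutions. The first replaces each tab-#-tab filler with a line break, which puts the locus name on its own line; the second turns the remaining tabs into spaces.
vim pop1_allct_bgc.txt
:%s/\t#\t/^M/g
:%s/\t/ /g

To type ^M, press Ctrl-V and then ENTER. Typing \r in place of ^M also works in Vim.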

That’s it. Check the file to make sure everything looks ok.
loc01
7 3
loc02
8 7
loc03
3 5
loc04
4 2
loc05
2 6
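
If you prefer to stay in R, the same two-lines-per-locus layout can probably be written without the filler column or the editor step. This is a sketch on toy values taken from the example above; `locnames_toy`, `a1`, `a2` and the temporary output file are my own illustrative choices.

```r
# Toy inputs: locus names and the two allele counts for one population.
locnames_toy <- c("loc01", "loc02", "loc03")
a1 <- c(7, 8, 3)   # allele 1 counts (the odd rows)
a2 <- c(3, 7, 5)   # allele 2 counts (the even rows)

# One "count1 count2" string per locus.
count_lines <- paste(a1, a2)

# rbind() stacks names over counts; as.vector() reads the matrix column by
# column, which interleaves them as name, counts, name, counts, ...
bgc_lines <- as.vector(rbind(locnames_toy, count_lines))

# Written to a temporary file here; use your own file name in practice.
out_file <- tempfile(fileext = ".txt")
writeLines(bgc_lines, out_file)
```

With the real objects, `locnames`, `ct_odd$pop1` and `ct_evn$pop1` would take the place of the toy vectors.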

At this point, you should be done coding the parental population files. Remember, only two parental populations are allowed. Thus, if you have data from two species with multiple pops each, you will need to collapse individual pops into a composite population for each species.
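
One way to do the collapsing is to sum the allele counts across the pops that belong to each species. The sketch below uses an invented four-pop table and an invented pop-to-species lookup (`species_of_pop`); adapt both to your own population names.

```r
# Toy allele table with two pops per species (all values made up).
ct_multi <- data.frame(
  Allele   = c("L0001.1", "L0001.2", "L0002.1", "L0002.2"),
  spA_pop1 = c(7, 3, 10, 0),
  spA_pop2 = c(2, 4, 5, 1),
  spB_pop1 = c(3, 1, 4, 0),
  spB_pop2 = c(1, 1, 2, 2),
  stringsAsFactors = FALSE
)

# Hypothetical lookup: which species each pop column belongs to.
species_of_pop <- c(spA_pop1 = "spA", spA_pop2 = "spA",
                    spB_pop1 = "spB", spB_pop2 = "spB")

counts <- as.matrix(ct_multi[, -1])

# For each species, add up the counts of its pops, allele by allele.
parent <- sapply(split(names(species_of_pop), species_of_pop),
                 function(p) rowSums(counts[, p, drop = FALSE]))

ct_parent <- data.frame(Allele = ct_multi$Allele, parent,
                        stringsAsFactors = FALSE)
print(ct_parent)
```

The resulting `spA` and `spB` columns can then be treated like pop1 and pop2 in the steps above. Relabelling individuals by species on the genind object with pop() before calling genind2genpop should give the same totals, though I have not shown that here.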

In the next blog post, I will summarize creating allele count data files for admixed populations, the format for which is somewhat different. Stay tuned.

7 thoughts on “Preparing BGC data files for Parental Pops: Allele Counts”

  1. This is an excellent post as programs like PGDSpider can’t convert to bgc format. Is there still going to be a post for preparing a bgc infile for admixed populations?

    1. It’s been a while since I did this and was under the impression that admixed population data prep was covered. Let me check my files and get back to you.

      1. Thanks, you mentioned in this post you’d write one for admixed populations but I don’t see an actual post for it.

      1. No the bump wasn’t from me, however we got a custom R script working that works well for the admixed populations.
