This document describes how SNP genotypes from the Spanish Barley Core Collection (SBCC) were checked and processed for further use with BayPass v2.1 to obtain XtX statistics about subpopulation differentiation.

In the next steps we convert the SNPs to fit the appropriate formats for downstream analyses. SNPs conserve sample order of climate data.

We shall write a SNPSFILE containing allele counts across populations/barleys, where each SNP is represented by two lines in the file, with the counts of allele 1 on the first line and the counts for allele 2 on the second, and son on. The counts of allele 1 and allele 2 are assumed to sum to the sample size typed at this SNP in this population (i.e. the total sample size excluding missing data).

This means that PAV markers must be converted to SNP-like markers if they are to be used.

First, we shall write a genotyping file containing allele counts across populations/barleys, where each biallelic SNP is represented by two columns per row, with the counts of allele 1 on the first line and the counts for allele 2 on the second. The counts of allele 1 and allele 2 are assumed to sum to the sample size typed at this SNP in this population (i.e. the total sample size excluding missing data).

Second, SNPs will be used to compute a covariance matrix which captures the background similarity between landraces.

Formatting SNPs

We wrote a Perl script (SNP2BAYPASS.pl) to carry out these formatting tasks. The initial set of 9,920 Infinium and GBS markers in file 9920_SNPs_SBCC_50K.tsv, was converted as follows, accepting 10% missing data per position and accepting only biallelic loci with \(MAF \geq 0.05\) (n=8457):

./SNP2BAYPASS.pl raw/9920_SNPs_SBCC_50K.tsv SBCC_order.txt \
  SBCC_9K_BayPass.tsv 2> SBCC_9K_BayPass.log 

head SBCC_9K_BayPass.log 
echo ...
tail SBCC_9K_BayPass.log
# MAXMISSINGRATIO=0.1 MISSINGVALUE=0
# total samples in SBCC_order.txt: 135
# creating SBCC_9K_BayPass.annot.tsv
# skip sample SBCC038
# skip sample SBCC138
# total valid samples: 135
#   snpname  allele1 allele2 missing MAF
1   BOPA1_2511-533  A   G   0   0.052
2   BOPA1_3107-422  A   G   0   0.096
3   BOPA1_8670-388  C   G   0   0.104
...
8449    SCRI_RS_81903   A   G   0   0.119
8450    SCRI_RS_8250    A   C   0   0.052
8451    SCRI_RS_8401    C   T   0   0.415
8452    SCRI_RS_88466   G   T   1   0.157
8453    SCRI_RS_88710   A   G   0   0.096
8454    SCRI_RS_9327    C   T   0   0.281
8455    SCRI_RS_9584    G   T   0   0.385
8456    SCRI_RS_9618    C   T   3   0.091
8457    SCRI_RS_994 C   T   0   0.007
# total valid markers=8457

The resulting SBCC_9K_BayPass.tsv file is the genotyping file required by BayPass. Note that file SBCC_9K_BayPass.annot.tsv with matching fullnames of SNPs (columns) is produced alongside.

As BayPass does not handle missing covariate (environmental) data, produce also a subset of accessions for which complete env data is available, excluding SBCC036:

./SNP2BAYPASS.pl raw/9920_SNPs_SBCC_50K.tsv SBCC_order_complete_env.txt \
  SBCC_9K_BayPass_complete_env.tsv 2> SBCC_9K_BayPass.complete_env.log 

head SBCC_9K_BayPass.complete_env.log 
echo ...
tail SBCC_9K_BayPass.complete_env.log

# skip column 30 in environfile, which corresponded to SBCC036
perl -lane 'print join("\t",@F[0 .. 28, 30 .. $#F])' SBCC_environfile.tsv \
  > SBCC_environfile_complete_env.tsv
# MAXMISSINGRATIO=0.1 MISSINGVALUE=0
# total samples in SBCC_order_complete_env.txt: 134
# creating SBCC_9K_BayPass_complete_env.annot.tsv
# skip sample SBCC036
# skip sample SBCC038
# skip sample SBCC138
# total valid samples: 134
#   snpname  allele1 allele2 missing MAF
1   BOPA1_2511-533  A   G   0   0.052
2   BOPA1_3107-422  A   G   0   0.097
...
8454    SCRI_RS_81903   A   G   0   0.119
8455    SCRI_RS_8250    A   C   0   0.052
8456    SCRI_RS_8401    C   T   0   0.418
8457    SCRI_RS_88466   G   T   1   0.158
8458    SCRI_RS_88710   A   G   0   0.097
8459    SCRI_RS_9327    C   T   0   0.284
8460    SCRI_RS_9584    G   T   0   0.388
8461    SCRI_RS_9618    C   T   3   0.092
8462    SCRI_RS_994 C   T   0   0.007
# total valid markers=8462

Obtaining a covariance matrix (omega)

As mentioned earlier, we need to estimate a covariance matrix by invoking
a system-available binary of BayPass, called ‘baypass’. The following command computes a matrix based on all 9K Infinium/GBS markers:

baypass -npop 135 -gfile SBCC_9K_BayPass.tsv -outprefix SBCC_9K_BayPass \
  -nthreads 30 -seed 12345 > SBCCmatrix.BayPass.log

mv SBCC_9K_BayPass_* SBCCmatrix.BayPass.log BayPass

# after leaving out SBCC036 to get a complete environmental data set
baypass -npop 134 -gfile SBCC_9K_BayPass_complete_env.tsv -outprefix SBCC_9K_BayPass_complete_env \
  -nthreads 30 -seed 12345 > SBCCmatrix.BayPass.complete_env.log

mv SBCC_9K_BayPass_complete_env_* SBCCmatrix.BayPass.complete_env.log BayPass

Population differentiation XtX analysis

A similar Perl script (SNP2BAYPASSsubpops.pl) was written to format the same SNP set for the analysis of the 4 subpopulations of SBCC barleys defined by STRUCTURE (see protocol HOWTOstructure). The set of 9,920 Infinium and GBS markers in file 9920_SNPs_SBCC_50K.tsv, was converted as follows, accepting 10% missing data per position and accepting only biallelic loci with \(MAF \geq 0.05\) (n=8457):

./SNP2BAYPASSsubpops.pl raw/9920_SNPs_SBCC_50K.tsv SBCC_order.txt \
  raw/SBCC_Kinship.full.tsv SBCC_9K_BayPass.subpop.tsv 2> SBCC_9K_BayPass.subpop.log 

head SBCC_9K_BayPass.subpop.log 
echo ...
tail SBCC_9K_BayPass.subpop.log
# MAXMISSINGRATIO=0.1 MISSINGVALUE=0
# total samples in SBCC_order.txt: 135
# subpop 1 contains 15 accessions
# subpop 2 contains 10 accessions
# subpop 3 contains 49 accessions
# subpop 4 contains 61 accessions

# skip sample SBCC038
# skip sample SBCC138
# total valid samples: 135
...
8449    SCRI_RS_81903   A   G   0   0.119
8450    SCRI_RS_8250    A   C   0   0.052
8451    SCRI_RS_8401    C   T   0   0.415
8452    SCRI_RS_88466   G   T   1   0.157
8453    SCRI_RS_88710   A   G   0   0.096
8454    SCRI_RS_9327    C   T   0   0.281
8455    SCRI_RS_9584    G   T   0   0.385
8456    SCRI_RS_9618    C   T   3   0.091
8457    SCRI_RS_994 C   T   0   0.007
# total valid markers=8457

The resulting SBCC_9K_BayPass.subpop.tsv file can then be further processed with BayPass. In the example 30 parallel threads are created:

baypass -npop 4 -gfile SBCC_9K_BayPass.subpop.tsv -outprefix SBCC_9K_BayPass_subpop \
  -nthreads 30 -seed 12345 > SBCCmatrix.BayPass.subpop.log

mv SBCC_9K_BayPass_subpop* SBCCmatrix.BayPass.subpop.log BayPass

This command takes only a few minutes to run with defaults params:

tail BayPass/SBCCmatrix.BayPass.subpop.log
   iteration=         960
   iteration=         970
   iteration=         980
   iteration=         990
   iteration=        1000
      Mean Final Acceptance Rate Pij  =  0.31143   mean delta_pij=  0.12608
      Mean Final Acceptance Rate Pi   =  0.37650   mean delta_pi=  0.48305
      Mean Final Acceptance Rate b_mu =  0.29460        delta  =  0.03200
      Mean Final Acceptance Rate b_phi=  0.27976        delta  =  0.05498
Analysis took:  0 days  0 h 11 min 22 sec

We can now read the produced omega covariance matrix and plot the Xtx subpopulation differentiation estimates as follows:

omega_subpop = as.matrix(x=
  read.table(file="BayPass/SBCC_9K_BayPass_subpop_mat_omega.out", header=F) )
subpop.names=c("1","2","3","4")
dimnames(omega_subpop)=list(subpop.names,subpop.names)

XtX_BP = read.table(file="BayPass/SBCC_9K_BayPass_subpop_summary_pi_xtx.out",
                    header = T)
plot(XtX_BP$M_XtX)

Now, we can use the same omega covariance matrix to generate theoretically neutral SNPs to help in the interpretation of results and call outliers:

source('./BayPass/baypass_utils.R')

# get estimates of the Pi Beta distribution inferred from real data
pi.beta.coef=read.table("BayPass/SBCC_9K_BayPass_subpop_summary_beta_params.out",
                        h=T)$Mean

#upload the real SNP dataset to obtain total allele count
real.data<-geno2YN("SBCC_9K_BayPass.subpop.tsv")

sim = simulate.baypass(omega.mat=omega_subpop,nsnp=1000,sample.size=real.data$NN,
                  beta.pi=pi.beta.coef,pi.maf=0,suffix="sim")

We shall now run BayPass in “core mode” with the resulting neutral genotypes:

mv *.sim BayPass
baypass -npop 4 -gfile BayPass/G.sim -outprefix sim \
  -nthreads 30 -seed 12345 > sim.subpop.log

mv sim* BayPass

Finally we can take XtX values of neutral, simulated SNPs to infer a significance threshold and then plot the real XtX estimates:

maptype = 'cM'; 
#maptype = 'bp';
outTSV = paste("SBCC_9K_BayPass_subpop.xtx.",maptype,".tsv", sep="")

#get the pod XtX
sim.xtx=read.table("BayPass/sim_summary_pi_xtx.out",h=T)$M_XtX

#compute the 99th percentile
sim99=quantile(sim.xtx,probs=0.99)
sim99 
     99% 
9.560921 
# read real XtX estimates
XtX_BP = read.table(file="BayPass/SBCC_9K_BayPass_subpop_summary_pi_xtx.out",
                    header = T)

# read in SNP names and metadata
metadata = read.table(file="SBCC_9K_BayPass.subpop.log",header=F)
colnames(metadata) = c("MRK","SNPidentifier","allele1","allele2","missing","MAF")

# read genetic OR bp positions of 9K markers
if(maptype == "bp"){
  genmap = read.table(file="raw/9920_SNPs_SBCC_bp_map2017.curated.tsv",header=T)
} else{
  genmap = read.table(file="raw/9920_SNPs_SBCC_cM_map2017.curated.tsv",header=T)
}

# merge and export data
XtX_BP = merge( XtX_BP, metadata, by="MRK")
XtX_BP = merge( XtX_BP, genmap, by="SNPidentifier")
sortedXtXs = XtX_BP[with(XtX_BP, order(chr, cM, decreasing = F)),]
#sortedXtXs = XtX_BP[with(XtX_BP, order(chr, bp, decreasing = F)),]
write.table(sortedXtXs[,c("SNPidentifier","M_XtX","chr",maptype)],file=outTSV,sep="\t",row.names=F,col.names=T,quote=F)