Chapter 15 Genome-wide association (GWAS) data

New tools for genetic/genomic data analysis are being created at a mind-boggling rate. It seems like a new paper in this area comes out each week – which makes it a challenge to summarize the tools available for working in this area. With this in mind, consider the set of tools and tips provided here as select suggestions from my own experience rather than an exhaustive list.

A tutorial for GWAS in R is available here.

15.1 Principal component analysis (PCA)

Principal components come up often in GWAS, where they are typically used to summarize and adjust for population structure. This CrossValidated post is the best explanation of PCA I have read. Another site that may be helpful is this one, which has examples of visualizing PCA in R.
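To make the idea concrete, here is a minimal sketch of PCA on a toy genotype matrix (rows are samples, columns are SNPs coded as 0/1/2 allele counts). It is written in plain Python so that it is self-contained. The data, the function names, and the use of power iteration are all assumptions chosen for illustration, not how any package listed in this chapter implements PCA. In practice you would use a dedicated tool on real data.

```python
import math

def center_columns(G):
    """Subtract each SNP's mean allele count from its column."""
    n, p = len(G), len(G[0])
    means = [sum(row[j] for row in G) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in G]

def gram_matrix(X):
    """Sample-by-sample matrix X X^T / p (a simple relatedness-like matrix)."""
    n, p = len(X), len(X[0])
    return [[sum(X[i][k] * X[j][k] for k in range(p)) / p for j in range(n)]
            for i in range(n)]

def top_eigenvector(K, iters=200):
    """Leading eigenvector of a symmetric matrix via power iteration."""
    n = len(K)
    v = [1.0 + 0.1 * i for i in range(n)]  # arbitrary deterministic start
    for _ in range(iters):
        w = [sum(K[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Made-up genotypes: the first three samples and the last three samples
# carry different alleles, mimicking two ancestral groups.
genotypes = [
    [0, 0, 2, 2],
    [0, 1, 2, 2],
    [0, 0, 2, 1],
    [2, 2, 0, 0],
    [2, 1, 0, 0],
    [2, 2, 0, 1],
]

# Each sample's entry in the leading eigenvector is its PC1 score (up to scale).
pc1 = top_eigenvector(gram_matrix(center_columns(genotypes)))
for sample, score in enumerate(pc1):
    print(sample, round(score, 3))
```

In this toy example the two groups land on opposite sides of zero on the first principal component, which is the kind of separation that makes PCs useful as covariates for population structure.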

15.2 R packages

15.2.1 for analyzing one genetic marker at a time

  • bigsnpr analyzes GWAS-scale data; it includes tools for PCA, fitting lasso models, and some quality control

  • qqman creates Manhattan plots and QQ plots
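For readers who have not seen these plots before, the sketch below computes the coordinates behind them. A QQ plot compares the sorted observed p-values against what would be expected under the null (uniform) distribution, both on the -log10 scale. A Manhattan plot places each SNP at a cumulative genomic position on the x-axis. This is plain Python with hypothetical function names, and the plotting positions (i - 0.5)/n are an assumption. It is not a description of what qqman does internally.

```python
import math

def qq_points(pvalues):
    """Return (expected, observed) pairs of -log10 p-values for a QQ plot."""
    n = len(pvalues)
    observed = sorted(pvalues)  # smallest p-value first
    points = []
    for i, p in enumerate(observed, start=1):
        expected_p = (i - 0.5) / n  # assumed plotting position under the null
        points.append((-math.log10(expected_p), -math.log10(p)))
    return points

def manhattan_x(snps):
    """Map (chromosome, base-pair position) pairs to cumulative x coordinates.

    Each chromosome is offset by the largest positions observed on the
    chromosomes that come before it.
    """
    chrom_len = {}
    for chrom, bp in snps:
        chrom_len[chrom] = max(chrom_len.get(chrom, 0), bp)
    offsets, running = {}, 0
    for chrom in sorted(chrom_len):
        offsets[chrom] = running
        running += chrom_len[chrom]
    return [offsets[chrom] + bp for chrom, bp in snps]
```

Points that rise above the diagonal across the whole QQ plot (rather than only in the upper tail) are a common sign of inflation, for example from unaccounted population structure.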

15.2.2 for analyzing markers together in an additive model

  • ncvreg fits regression models with nonconvex penalties (MCP and SCAD) in R

  • plmmr fits penalized linear mixed models in R
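To give a feel for what a nonconvex penalty does differently from the lasso, here is a small sketch of the single-coefficient update rules for the lasso and for MCP (the minimax concave penalty). It assumes a standardized predictor and takes z to be the unpenalized (least squares) estimate for that coefficient. The function names and the default gamma are hypothetical, and this is an illustration of the idea rather than code from either package.

```python
def soft_threshold(z, lam):
    """Lasso update: shrink z toward zero by lam, setting small values to 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def mcp_threshold(z, lam, gamma=3.0):
    """MCP update for a standardized predictor (requires gamma > 1).

    Small coefficients are shrunk like the lasso (then rescaled), while
    coefficients larger than gamma * lam are left unshrunk.
    """
    if gamma <= 1:
        raise ValueError("gamma must be greater than 1")
    if abs(z) <= gamma * lam:
        return soft_threshold(z, lam) / (1.0 - 1.0 / gamma)
    return z

# Compare the two rules on a few illustrative values of z with lam = 1.
for z in (0.5, 2.0, 5.0):
    print(z, soft_threshold(z, 1.0), mcp_threshold(z, 1.0))
```

The practical difference is that the lasso shrinks every nonzero coefficient by the same amount, whereas MCP relaxes the shrinkage for large effects.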

15.2.3 for summary-level GWAS data

  • PLACO: this isn’t a ‘package’ per se, but this method for assessing pleiotropy between traits is implemented in R.

15.3 Bioconductor packages

15.4 Command-line tools

  • PLINK is arguably the most established tool for managing and analyzing genetic/genomic data. If you are new to the field, this is probably the best tool to learn first.

  • LDSC is a tool for estimating heritability and genetic correlation from GWAS summary statistics.

  • HEELS estimates heritability with high efficiency using LD and summary statistics.
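As a rough illustration of how heritability can be estimated from summary statistics alone, the sketch below follows the basic LD score regression model, in which the expected chi-square statistic of SNP j is N * h2 * l_j / M + N * a + 1 (N is the sample size, M the number of SNPs, l_j the LD score of SNP j, and a captures confounding). Regressing chi-square statistics on LD scores therefore gives a slope of N * h2 / M. This sketch uses ordinary least squares as a simplification; the LDSC software itself uses a weighted regression with block jackknife standard errors. The function names and all numbers are made up.

```python
def ols(x, y):
    """Slope and intercept of a simple least squares line."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def h2_from_ld_scores(ld_scores, chi2, n_samples, n_snps):
    """Convert the regression slope N * h2 / M into a heritability estimate."""
    slope, intercept = ols(ld_scores, chi2)
    return slope * n_snps / n_samples, intercept

# Made-up, noiseless toy values generated with h2 = 0.5 and intercept 1.
ld_scores = [10.0, 50.0, 100.0, 200.0]
chi2 = [1.5, 3.5, 6.0, 11.0]
h2, intercept = h2_from_ld_scores(ld_scores, chi2, n_samples=10_000, n_snps=100_000)
print(h2, intercept)
```

An intercept well above 1 would suggest that some of the inflation in the test statistics comes from confounding rather than polygenic signal.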