GWAS in 2022

The largest GWAS of all time (of all time!) dropped a few weeks ago to little fanfare, at least in these spaces. In a nutshell: 5.4 million participants with measured height and 1.4 million SNPs per participant, so roughly 7.5 trillion data points if I’m not mistaken. If you submitted 23andMe samples, congratulations! You contributed to the (current) record holder for largest GWAS in history. In total, the study accounts for 40-45% of the phenotypic variance of height. Furthermore, the authors claim this is saturating: adding more samples won’t increase the fraction of heritability that they can account for.

What you can do with this data:

  1. Generate some robust polygenic scores (PGS)

  2. ‘Risk prediction’ if you have a burning desire to know how tall someone will be (with large error bars)

  3. ???

What you can’t do with this data:

  1. Understand the phenomenon of ‘height’ in any meaningful way

  2. Genetic engineering a la Oryx and Crake, which is how most people see using CRISPR to make designer babies.

  3. Develop any kind of treatment or therapeutic that would improve the human condition.

So, to put it in some context: the criticism of GWAS has always been that these studies are large and expensive, rarely teach us anything about the underlying biology, and explain little of the actual heritability (the ‘missing heritability’ problem). The ‘mechanistic’ biologists interested in curing disease or engineering biology generally dislike GWAS. It’s interesting in the way that astrobiology is interesting: good to know that planet XYZ792 150 light years away may have liquid water on its surface, but not really of practical use. What they (and I, being very much of this pedigree) missed is that PGS are of use if you’re in the business of embryo selection. I was corrected on that point a few years ago (conversation here if you want to see me being wrong). So if your goal is having really tall (or short!) children, this paper is good news for you, but you’ll probably still be dissatisfied with the current low throughput of embryo selection.

That being said, these criticisms are still salient and, to some extent, I think have been validated: saturating the SNP space with an absurd number of samples (for context: there are only 1.5 million Americans with type 1 diabetes! Good luck saturating that GWAS in our lifetime) only explains 45% of the variance, and this number will undoubtedly vary from trait to trait. Presumably the rest is coming from rare variants (this study cuts off variants with a minor allele frequency (MAF) below 1%, which is quite a high cutoff), structural variants, or some genetic dark matter implying that our heritability estimates are too high or not being driven by DNA (?).

I think this also has something to say about the omnigenic model. Even with a very high-powered study most of the SNPs are still clustering around genes with known functions related to growth, bone structure, etc. About a third aren’t near anything at all and we have no idea what they might be doing. But again, the low heritability explained would argue that rare variants may play a much larger role than previously appreciated, which may hew closer to Jim Lupski’s Clan Genomics model. And, this is much more speculative, but perhaps this is hinting at the biological underpinnings of ‘interindividual variation is larger than population level differences,’ i.e., rare variants (and the rarer end of SNPs) unique to your ‘clan’ have a similar or larger effect size than the very common SNPs shared by populations. Eager to see what people think or if they have any corrections.

By the way, how does one use superscripts around these parts? Would have been useful to clean up some of these asides with footnotes. Also, how to use tilde without getting strikethrough?


In doing GWAS, it's normal to use extremely low p thresholds to correct for the problem of multiple comparisons, right? I understand why it's important to do this, but doesn't this lead to exclusion of many SNPs with real but small impact? Or by "saturated" do they mean that they have a sample large enough to render this concern negligible?

The problem is not with frequent variants with small effects, but, as OP states, with rare variants.

Rare variants are often rare because they are harmful (novel mutations not tested by selection).

A variant with 50% frequency probably doesn't affect the trait studied either way. A rare variant probably would be bad.

But GWAS will assume that a rare variant has zero effect. And since GWAS is currently done with tag SNPs, it will miss any novel mutational load.
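To put rough numbers on the common-versus-rare point: under a purely additive model with Hardy-Weinberg proportions, a SNP with minor allele frequency p and per-allele effect β (in phenotypic standard deviations) contributes about 2p(1-p)β² of the phenotypic variance. The sketch below only evaluates that formula. The function name and the two example variants are made up for illustration and are not taken from the paper.

```python
def variance_explained(maf: float, beta_sd: float) -> float:
    """Fraction of phenotypic variance from one biallelic SNP (additive model).

    maf is the minor allele frequency; beta_sd is the per-allele effect in
    phenotypic standard deviations. Assumes Hardy-Weinberg proportions.
    """
    if not 0.0 <= maf <= 0.5:
        raise ValueError("maf must be between 0 and 0.5")
    return 2.0 * maf * (1.0 - maf) * beta_sd ** 2


# Hypothetical variants, chosen only to illustrate the formula.
common = variance_explained(maf=0.5, beta_sd=0.02)   # common, tiny effect
rare = variance_explained(maf=0.001, beta_sd=0.3)    # rare, 15x larger effect
print(f"common: {common:.6f}  rare: {rare:.6f}")
```

With these invented numbers the rare variant has a fifteen-fold larger per-allele effect yet explains slightly less variance than the common one, which is one way to see why a large individual effect at low frequency is hard to pick up from variance explained alone.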

We'd have to dig pretty deep into their data to get the true answer to this, but their study is sufficiently powered to detect extremely small effect sizes. For example, if they're detecting 12,000 significant associations that together explain 45% of the variance, the average association explains about 45% / 12,000 ≈ 0.004%, so the weakest ones they detect must explain less than that.
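For anyone who wants to poke at the power question, here is a back-of-the-envelope sketch rather than anything the authors did. It assumes the conventional genome-wide significance threshold of 5e-8 (not stated in this thread), treats the per-SNP test as a two-sided z-test whose statistic has mean sqrt(N·q/(1-q)) for a SNP explaining a fraction q of the variance, and plugs in the full 5.4 million sample size as if it were one homogeneous cohort. The function name `gwas_power` is hypothetical.

```python
from statistics import NormalDist


def gwas_power(n: int, var_explained: float, alpha: float = 5e-8) -> float:
    """Approximate power of a two-sided z-test for a single SNP.

    Assumes the test statistic is normal with unit variance and mean
    sqrt(n * q / (1 - q)), where q is the fraction of phenotypic variance
    explained by the SNP. alpha defaults to the conventional 5e-8 threshold.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    mean = (n * var_explained / (1.0 - var_explained)) ** 0.5
    # Probability the statistic lands in either rejection tail.
    return (1.0 - std_normal.cdf(z_crit - mean)) + std_normal.cdf(-z_crit - mean)


N = 5_400_000  # sample size quoted in the post
for pct in (0.004, 0.001, 0.0005):
    q = pct / 100.0  # convert percent of variance to a fraction
    print(f"{pct}% of variance: power ~ {gwas_power(N, q):.2f}")
```

Under these assumptions, power is essentially 1 at the 0.004% average and only starts to fall away somewhere below 0.001% of the variance, which fits the intuition that common variants with real but tiny effects are mostly not being thrown out by the strict threshold at this sample size.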

Someone else with a better handle on the math could give you a more robust answer though.

As for saturation, they split their data into significant and non-significant SNPs and find that the former explain 'around 100%' of the SNP based heritability.

We estimated the variance explained by GWS SNPs using the genetic relationship-based restricted maximum likelihood (GREML) approach implemented in GCTA (refs. 1, 7). This approach involves two main steps: (i) calculation of genetic relationships matrices (GRM); and (ii) estimation of variance components corresponding to each of these matrices using a REML algorithm. We partitioned the genome in two sets containing GWS loci on the one hand and all other HM3 SNPs on the other hand. GWS loci were defined as non-overlapping genomic segments containing at least one GWS SNP and such that GWS SNPs in adjacent loci are more than 2 × 35 kb away from each other (that is, a 35-kb window on each side). We then calculated a GRM based on each set of SNPs and estimated jointly a variance explained by GWS alone and that explained by the rest of the genome. We performed these analyses in multiple samples independent of our discovery GWAS, which include participants of diverse ancestry. Details about the samples used for these analyses are provided below. We extended our analyses to also quantify the variance explained by GWS loci using alternative definitions based on a window size of 0 kb and 10 kb around GWS SNPs (Supplementary Figs. 18 and 19).
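The locus definition in that excerpt is easier to see in code. This is my reading of the description, not the authors' actual pipeline: sort the genome-wide significant (GWS) SNP positions on one chromosome, merge any SNPs that are within 2 × 35 kb of their neighbour, and pad each resulting locus by 35 kb on either side. The function name `define_loci` and the example positions are hypothetical.

```python
from typing import List, Tuple


def define_loci(positions: List[int], window: int = 35_000) -> List[Tuple[int, int]]:
    """Group GWS SNP positions on one chromosome into non-overlapping loci.

    Neighbouring SNPs that are at most 2 * window bp apart end up in the
    same locus; each locus extends `window` bp beyond its outermost SNPs.
    """
    loci: List[Tuple[int, int]] = []
    for pos in sorted(positions):
        if loci and pos - window <= loci[-1][1]:
            # This SNP's window touches the current locus: extend it.
            loci[-1] = (loci[-1][0], pos + window)
        else:
            # Gap to the previous SNP exceeds 2 * window: start a new locus.
            loci.append((max(0, pos - window), pos + window))
    return loci


# Made-up base-pair positions for illustration.
example = define_loci([100_000, 150_000, 400_000])
print(example)
```

Passing `window=10_000` or `window=0` would correspond to the alternative 10 kb and 0 kb definitions mentioned at the end of the excerpt. The SNPs falling inside the returned intervals would form one set for the GRM and everything else the other.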

Again, someone else with better stats skills could better answer the question. It's something I should work on but it's not terribly relevant to my day job...