
GWAS in 2022

The largest GWAS of all time (of all time!) dropped a few weeks ago to little fanfare, at least in these spaces. In a nutshell: height measured in 5.4 million participants, with 1.4 million SNPs per participant, so roughly 7.5 trillion data points if I’m not mistaken. If you submitted 23andMe samples, congratulations! You contributed to the (current) record holder for largest GWAS in history. In total, the study accounts for 40-45% of the phenotypic variance of height, and furthermore, the authors claim this is saturating: adding more samples won’t increase the fraction of heritability that they can account for.

What you can do with this data:

  1. Generate some robust polygenic scores (PGS)

  2. ‘Risk prediction’ if you have a burning desire to know how tall someone will be (with large error bars)

  3. ???
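For anyone unfamiliar with what a polygenic score is mechanically: it is a weighted sum of allele counts, with the weights taken from the GWAS effect sizes. A minimal sketch follows; the SNP names, effect sizes, and genotype are all made up for illustration and are not from the study.

```python
# Hypothetical per-allele effect sizes (in cm) for three made-up SNPs.
effect_sizes_cm = {"rs_A": 0.30, "rs_B": -0.10, "rs_C": 0.05}


def polygenic_score(genotype, effects):
    """Sum of effect size times allele count (0, 1 or 2) over the scored SNPs."""
    return sum(effects[snp] * genotype.get(snp, 0) for snp in effects)


# Hypothetical individual: copies of the effect allele carried at each SNP.
person = {"rs_A": 2, "rs_B": 1, "rs_C": 0}

score = polygenic_score(person, effect_sizes_cm)
print(f"PGS = {score:+.2f} cm relative to baseline")
```

A real score would sum over thousands of SNPs, but the arithmetic is the same.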

What you can’t do with this data:

  1. Understand the phenomenon of ‘height’ in any meaningful way

  2. Genetic engineering a la Oryx and Crake, which is how most people see using CRISPR to make designer babies.

  3. Develop any kind of treatment or therapeutic that would improve the human condition.

So, to put it in some context: the criticism of GWAS has always been that these studies are large, expensive, rarely teach us anything about the underlying biology and explain little of the actual heritability (‘missing heritability’ problem). The ‘mechanistic’ biologists interested in curing disease or engineering biology generally dislike GWAS. It’s interesting in the way that astrobiology is interesting; good to know that planet XYZ792 150 light years away may have liquid water on its surface, but not really of practical use. What they (and I, being very much of this pedigree) missed is that PGS are of use if you’re in the business of embryo selection, and I was corrected on that point a few years ago (conversation here if you want to see me being wrong). So if your goal is having really tall (or short!) children, this paper is good news for you, but you’ll probably still be dissatisfied with the current low throughput of embryo selection.

That being said, these criticisms are still salient and, to some extent, I think have been validated: saturating the SNP space with an absurd number of samples (for context: there are only 1.5 million Americans with type 1 diabetes! Good luck saturating that GWAS in our lifetime) only explains 45% of the variance, and this number will undoubtedly vary from trait to trait. Presumably the rest is coming from rare variants (the cutoff in this study is a minor allele frequency (MAF) of < 1% which is quite high), structural variants, or some genetic dark matter implying that our heritability estimates are too high or not being driven by DNA (?).

I think this also has something to say about the omnigenic model. Even with a very high-powered study most of the SNPs are still clustering around genes with known functions related to growth, bone structure, etc. About a third aren’t near anything at all and we have no idea what they might be doing. But again, the low heritability explained would argue that rare variants may play a much larger role than previously appreciated, which may hew closer to Jim Lupski’s Clan Genomics model. And, this is much more speculative, but perhaps this is hinting at the biological underpinnings of ‘interindividual variation is larger than population level differences,’ i.e., rare variants (and the rarer end of SNPs) unique to your ‘clan’ have a similar or larger effect size than the very common SNPs shared by populations. Eager to see what people think or if they have any corrections.

By the way, how does one use superscripts around these parts? Would have been useful to clean up some of these asides with footnotes. Also, how to use tilde without getting strikethrough?


sub <sub>script</sub>

super <sup>script</sup>

\\\\

sub script

super script

~~~~

sub <subscript</sub

super <supscript</sup

\\

sub script

super script



Sorry but what does this comment mean?

I asked for formatting tips since the new site is different from reddit. ~ kept giving me strikethrough

edit: yeah the tilde is bugged

That looks like a bug. The use of backticks didn't prevent the tildes from creating a strikethrough effect.

only explains 45% of the variance

I can't be arsed to fully follow all of the details, but I'm told that you have to take the square root of these "percent of variance explained" numbers.
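One way to read the square-root remark, assuming this is what was meant: "variance explained" is an R², and its square root is the correlation between predicted and actual height. A quick check using the 45% figure from the post:

```python
import math

variance_explained = 0.45  # R^2, the figure quoted in the post
correlation = math.sqrt(variance_explained)  # r between predicted and observed height

print(f"R^2 = {variance_explained:.2f} -> r = {correlation:.2f}")
```

So 45% of the variance corresponds to a correlation of about 0.67.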

Thanks! I'll take a look. I do need to work on the stats side of things...

If they used 23andme data, then it wasn't a GWAS. A GWAS looks at the entire genome. 23andme only looks at 600,000 SNPs.

They mention that they analyzed the HapMap3 SNPs. Not clear to me whether they themselves reran the panels, if 23andme did, or if they just feed all the data into a model regardless of whether it has all the HM3 SNPs or not.

Presumably the rest is coming from rare variants (the cutoff in this study is a minor allele frequency (MAF) of < 1% which is quite high), structural variants, or some genetic dark matter implying that our heritability estimates are too high or not being driven by DNA (?).

Or environmental factors (e.g. prevalent nutritional deficiencies)?

Incidentally, how do heritability estimates discriminate between genes that "causally" influence height (e.g. a gene that, when expressed, somehow biostructurally increases bone growth), and genes that dictate "unrelated" behavioral patterns which, in turn, affect the desired trait (e.g. craving/distaste for junk food)? Am I right in thinking that this is another major weakness of GWAS - even if you identify candidate genes, those genes might completely fail to transfer to, say, another population in which junk food doesn't exist?

So if you run a GWAS identifying N promising genes for affecting height on US citizens, you couldn't use that to reliably increase the height of European babies?

Or environmental factors (e.g. prevalent nutritional deficiencies)?

It's possible. People like to use height because, in the west at least, a very large fraction of the variation will be genetic. But who knows?

Incidentally, how do heritability estimates discriminate between genes that "causally" influence height (e.g. a gene that, when expressed, somehow biostructurally increases bone growth), and genes that dictate "unrelated" behavioral patterns which, in turn, affect the desired trait (e.g. craving/distaste for junk food)? Am I right in thinking that this is another major weakness of GWAS - even if you identify candidate genes, those genes might completely fail to transfer to, say, another population in which junk food doesn't exist?

You wouldn't, and that goes beyond GWAS. It's a fundamental problem with all the correlational genetic studies. Inferring mechanism is extremely difficult, and it's easy to be fooled by how you think about the trait rather than how biology thinks about the trait.

So if you run a GWAS identifying N promising genes for affecting height on US citizens, you couldn't use that to reliably increase the height of European babies?

I think it would probably work due to shared ancestry, particularly with their racial breakdown scheme. May not work as well in other races, although they do mention that the majority of their loci are shared.

Incidentally, how do heritability estimates discriminate between genes that "causally" influence height

They don't. Heritability refers to a given population in a given environment. A gene's effect depends on the environment (you don't need a gene to make vitamin C if your environment has an excess of it) and even on its frequency in the population. In a food-scarce environment, a gene that increases bone growth might well have a smaller effect than genes that help you get more food. So of course simple linear regression, which is what GWAS is, can't tell you many of these things.
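To make the "simple linear regression" point concrete, here is a rough sketch of the per-SNP test on simulated data: height is regressed on the allele count at a single SNP. Every number (allele frequency, effect size, noise, sample size) is an assumption chosen for illustration, and real pipelines also adjust for covariates such as ancestry, which is omitted here.

```python
import random

random.seed(0)

# Simulated data (all numbers hypothetical): one SNP, 2,000 people.
allele_freq = 0.3
true_effect_cm = 0.5  # assumed per-allele effect on height
n_people = 2000

# Allele count per person: two independent draws at the assumed frequency.
genotypes = [sum(random.random() < allele_freq for _ in range(2))
             for _ in range(n_people)]
# Height = baseline + genetic effect + everything else, lumped into noise.
heights = [170 + true_effect_cm * g + random.gauss(0, 6) for g in genotypes]


def ols_slope(x, y):
    """Least-squares slope of y on x: cov(x, y) / var(x)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    return cov_xy / var_x


estimated_effect = ols_slope(genotypes, heights)
print(f"estimated per-allele effect: {estimated_effect:.2f} cm")
```

The slope only says that allele count and height move together in this sample. Nothing in the regression separates a direct biological effect from one routed through behavior or environment.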

So if you run a GWAS identifying N promising genes for affecting height on US citizens, you couldn't use that to reliably increase the height of European babies?

Because junk food mainly affects width and not height, this is unlikely to be a problem. And it's not as if Europe is free of junk food either.

So, to put it in some context: the criticism of GWAS has always been that these studies are large, expensive, rarely teach us anything about the underlying biology and explain little of the actual heritability (‘missing heritability’ problem). The ‘mechanistic’ biologists interested in curing disease or engineering biology generally dislike GWAS.

The development of big GWAS and tools like AlphaFold suggest to me that we’re nearing the point where useful empirical information overwhelms the capacities of human comprehension. The etiology of Alzheimer’s might just be Ala->Gly x100, and the true story is an overwhelming mass of minutiae, compared to the comprehensible ‘protein x is broken’. A lot of the work of medicine has been outsourced to evolution, and we’ve cribbed from her notes on every antibiotic and biologic we’ve produced. But we’re getting close to the point where we can build magic bullets from first principles.

The development of big GWAS and tools like AlphaFold suggest to me that we’re nearing the point where useful empirical information overwhelms the capacities of human comprehension.

Exactly! I enjoyed this essay quite a bit. Maybe our fate was never to truly understand biology, but build an oracle that can.

A lot of the work of medicine has been outsourced to evolution, and we’ve cribbed from her notes on every antibiotic and biologic we’ve produced. But we’re getting close to the point where we can build magic bullets from first principles.

It's an interesting question. Perhaps the antibiotic discovery space has been completely saturated by Nature already, at least in terms of targets. In the late 2000s, we developed fully synthetic antibiotics never before seen in nature and bacteria developed resistance just the same. I wonder if the future will be more medicinal chemistry tricks or a pivot to something like bacteriophages...

In terms of biologics, are you referring to monoclonal antibodies? If I'm interpreting you correctly, one day having to raise the right antibody to your target will be trivial because you'll just feed the sequence into alphafold and you're done. I agree, the first person with a model capable of that is going to mint money for a while. There are still a host of other very difficult problems to be solved even at that point though; these kinds of models are only going to get us so far.

Biologics are a big category, of which the -mabs are the early success story. The next evolution would be exactly what you described, where we can construct a protein to block targets by way of a fancy ml chemistry algorithm instead of trial and error. Beyond that, we get into de novo synthetic proteins that have more in common with sci-fi nanomachines than penicillin. Then, ???.

In doing GWAS, it's normal to use extremely low p thresholds to correct for the problem of multiple comparisons, right? I understand why it's important to do this, but doesn't this lead to exclusion of many SNPs with real but small impact? Or by "saturated" do they mean that they have a sample large enough to render this concern negligible?
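For reference, the "extremely low" threshold in question is conventionally 5e-8, usually motivated as a Bonferroni correction of 0.05 for roughly a million independent tests. The sketch below shows that arithmetic, plus the same correction applied to the 1.4 million SNPs mentioned in the post. Whether the study used exactly this threshold is not something the sketch claims.

```python
alpha = 0.05

# Conventional rough count of independent common-variant tests.
independent_tests = 1_000_000
threshold = alpha / independent_tests

# Same correction applied to the SNP count quoted in the post.
snps_in_study = 1_400_000
threshold_for_study_snps = alpha / snps_in_study

print(f"conventional threshold: {threshold:.1e}")
print(f"0.05 / 1.4 million:     {threshold_for_study_snps:.1e}")
```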

We'd have to dig pretty deep into their data to get the true answer to this, but their study is sufficiently powered to detect extremely low effect sizes. For example, if they're detecting 12,000 significant associations and they've explained 45% of the heritability, they're sufficiently powered to detect variants that explain much less than 0.004% of the heritability.
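The back-of-envelope arithmetic behind that 0.004% figure, using only the two numbers quoted above. It is an average, so some associations explain more and many explain less.

```python
variance_explained_pct = 45.0      # figure used in the comment
significant_associations = 12_000  # figure used in the comment

average_per_association_pct = variance_explained_pct / significant_associations
print(f"average share per association: {average_per_association_pct:.5f}%")
```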

Someone else with a better handle on the math could give you a more robust answer though.

As for saturation, they split their data into significant and non-significant SNPs and find that the former explain 'around 100%' of the SNP based heritability.

We estimated the variance explained by GWS SNPs using the genetic relationship-based restricted maximum likelihood (GREML) approach implemented in GCTA (refs. 1 and 7). This approach involves two main steps: (i) calculation of genetic relationships matrices (GRM); and (ii) estimation of variance components corresponding to each of these matrices using a REML algorithm. We partitioned the genome in two sets containing GWS loci on the one hand and all other HM3 SNPs on the other hand. GWS loci were defined as non-overlapping genomic segments containing at least one GWS SNP and such that GWS SNPs in adjacent loci are more than 2 × 35 kb away from each other (that is, a 35-kb window on each side). We then calculated a GRM based on each set of SNPs and estimated jointly a variance explained by GWS alone and that explained by the rest of the genome. We performed these analyses in multiple samples independent of our discovery GWAS, which include participants of diverse ancestry. Details about the samples used for these analyses are provided below. We extended our analyses to also quantify the variance explained by GWS loci using alternative definitions based on a window size of 0 kb and 10 kb around GWS SNPs (Supplementary Figs. 18 and 19).
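Step (i) in the quoted passage, building a genetic relationship matrix, can be sketched roughly as below. This uses the common standardized-genotype definition (center each SNP by twice its allele frequency, scale by its expected standard deviation, then average the products over SNPs). It is a toy illustration on made-up genotypes, not GCTA's actual code, and step (ii), the REML fit, is not shown.

```python
def grm(genotypes):
    """Genetic relationship matrix from allele counts.

    genotypes[i][j] is the allele count (0, 1 or 2) for person i at SNP j.
    Assumes every SNP varies in the sample (no allele frequency of 0 or 1).
    """
    n_people = len(genotypes)
    n_snps = len(genotypes[0])
    # Allele frequency of each SNP, estimated from the sample itself.
    freqs = [sum(row[j] for row in genotypes) / (2 * n_people)
             for j in range(n_snps)]
    # Standardize each genotype: (x - 2p) / sqrt(2p(1 - p)).
    z = [[(row[j] - 2 * freqs[j]) / (2 * freqs[j] * (1 - freqs[j])) ** 0.5
          for j in range(n_snps)]
         for row in genotypes]
    # Relatedness of persons i and k: mean product of standardized genotypes.
    return [[sum(a * b for a, b in zip(z[i], z[k])) / n_snps
             for k in range(n_people)]
            for i in range(n_people)]


# Hypothetical toy data: 4 people, 3 SNPs.
toy_genotypes = [
    [0, 1, 2],
    [1, 1, 0],
    [2, 0, 1],
    [1, 2, 1],
]
A = grm(toy_genotypes)
for row in A:
    print(" ".join(f"{value:+.2f}" for value in row))
```

In the analysis described above, one such matrix would be built from SNPs in the GWS loci and another from the remaining HM3 SNPs, and the variance attributed to each would be estimated jointly.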

Again, someone else with better stats skills could better answer the question. It's something I should work on but it's not terribly relevant to my day job...

The problem is not with frequent variants with small effects, but, as OP states, with rare variants.

Rare variants are often rare because they are harmful (novel mutations not tested by selection).

A variant at 50% frequency probably doesn't affect the trait under study either way. A rare variant probably would be harmful.

But GWAS will assume that a rare variant has zero effect. And since GWAS is currently done with tag SNPs, it will miss any novel mutational load.