The largest GWAS of all time (of all time!) dropped a few weeks ago to little fanfare, at least in these spaces. In a nutshell: 5.4 million participants with measured height and 1.4 million SNPs per participant, so about 7.5 trillion data points if I’m not mistaken. If you submitted 23andMe samples, congratulations! You contributed to the (current) record holder for largest GWAS in history. In total, the study accounts for 40-45% of the phenotypic variance of height, and furthermore, the authors claim this is saturating: adding more samples won’t increase the fraction of heritability that they can account for.
What you can do with this data:
Generate some robust polygenic scores (PGS)
‘Risk prediction’ if you have a burning desire to know how tall someone will be (with large error bars)
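For anyone who hasn’t seen one computed, a polygenic score is just a weighted sum: each SNP’s effect size from the GWAS times the number of copies of the effect allele a person carries. Below is a minimal sketch of that idea, not the paper’s pipeline. The function name and the toy numbers are made up for illustration; real scores also involve steps like accounting for linkage between SNPs, which this skips.

```python
def polygenic_score(dosages, effect_sizes):
    """Weighted sum of allele dosages (0, 1 or 2 copies) and per-SNP effect sizes.

    Both arguments are hypothetical inputs for illustration: `dosages` is one
    person's effect-allele count per SNP, `effect_sizes` the matching GWAS betas.
    """
    if len(dosages) != len(effect_sizes):
        raise ValueError("dosages and effect_sizes must have the same length")
    return sum(d * b for d, b in zip(dosages, effect_sizes))


# Toy example with three made-up SNPs (placeholder values, not real effect sizes).
example_score = polygenic_score([2, 1, 0], [0.5, -0.25, 1.0])
```

The ‘large error bars’ caveat still applies: the score only captures the share of variance the GWAS explains.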
What you can’t do with this data:
Understand the phenomenon of ‘height’ in any meaningful way
Genetic engineering a la Oryx and Crake, which is how most people see using CRISPR to make designer babies.
Develop any kind of treatment or therapeutic that would improve the human condition.
So, to put it in some context: the criticism of GWAS has always been that these studies are large, expensive, rarely teach us anything about the underlying biology and explain little of the actual heritability (the ‘missing heritability’ problem). The ‘mechanistic’ biologists interested in curing disease or engineering biology generally dislike GWAS. It’s interesting in the way that astrobiology is interesting; good to know that planet XYZ792 150 light years away may have liquid water on its surface, but not really of practical use. What they (and I, being very much of this pedigree) missed is that PGS are of use if you’re in the business of embryo selection, and I was corrected on that point a few years ago (conversation here if you want to see me being wrong). So if your goal is having really tall (or short!) children, this paper is good news for you, but you’ll probably still be dissatisfied with the current low throughput of embryo selection.
That being said, these criticisms are still salient and, to some extent, I think have been validated: saturating the SNP space with an absurd number of samples (for context: there are only 1.5 million Americans with type 1 diabetes! Good luck saturating that GWAS in our lifetime) only explains 45% of the variance, and this number will undoubtedly vary from trait to trait. Presumably the rest is coming from rare variants (this study only covers variants with a minor allele frequency (MAF) above 1%, which is quite a high cutoff), structural variants, or some genetic dark matter implying that our heritability estimates are too high or not being driven by DNA (?).
I think this also has something to say about the omnigenic model. Even with a very high-powered study most of the SNPs are still clustering around genes with known functions related to growth, bone structure, etc. About a third aren’t near anything at all and we have no idea what they might be doing. But again, the low heritability explained would argue that rare variants may play a much larger role than previously appreciated, which may hew closer to Jim Lupski’s Clan Genomics model. And, this is much more speculative, but perhaps this is hinting at the biological underpinnings of ‘interindividual variation is larger than population level differences,’ i.e., rare variants (and the rarer end of SNPs) unique to your ‘clan’ have a similar or larger effect size than the very common SNPs shared by populations. Eager to see what people think or if they have any corrections.
By the way, how does one use superscripts around these parts? Would have been useful to clean up some of these asides with footnotes. Also, how to use tilde without getting strikethrough?
As the academic system is slowly imploding, my career followed suit and I recently found myself licking my wounds in a cushy industry job (read: adult daycare) and dreaming of startups. This was one of my brainstorms, but for the life of me, I can’t figure out a way that it could ever be profitable, so I’m releasing it into the wild.
You’ve probably heard of the hygiene hypothesis; in a nutshell, our immune systems ‘evolved’ to deal with lives that were, immunologically speaking, nasty, brutish and short. Consequently, the dial on the thermostat got turned up a bit too high for our fully [immunologically] automated gay space communism with pesky luxuries like vaccines, soap and plumbing. The incidences of immune conditions like asthma, allergies, MS, Crohn’s and T1D have gone up three- to four-fold in the last 70-odd years in developed nations, which is too rapid for dysgenics as an explanation. Some interesting pieces of evidence hinting at a deeper truth:
Adult immigrants from developing nations to the first world are by and large unaffected, but their children do have increases. This suggests an environmental rather than genetic etiology, and furthermore, that the environmental influences have to happen while the immune system is developing (though the evidence for this latter point is not particularly strong in my opinion). 1 2 3 4 5
Abiotic mice (no bacteria or fungi in their gut, skin, esophagus, etc) have very defective immune systems. Whole compartments of the immune system fail to develop properly, suggesting that interplay between pathogens, benign commensals and the immune system is required.
A number of studies have shown that even within developed nations, individuals raised on farms or exposed to animals at very young ages have lower incidences of atopy and autoimmunity.
Your immune system develops in ‘waves’ and is ‘educated’ throughout your development (and almost certainly beyond!). Furthermore, there is substantial variation in our immune systems due to infectious history/environment. (Note that some competing papers took similar approaches with significantly different conclusions). These all point towards significant environmental influences* on these complex immunological diseases.
You’ve probably also heard of Alex Jones claiming that the US government is turning the frogs gay. With this audience, you probably also know that, uh, ‘turning the frogs gay’ isn’t a very honest description, but it is a real problem. Indeed, the process for dumping a new chemical into the environment is labyrinthine, but it probably isn’t particularly effective at screening substances that might influence the immune system. They seem largely focused on chemicals that mimic hormones (see: declining sperm counts and the aforementioned gay frogs).
The crux of this post: Why isn’t more effort expended towards identifying environmental factors, preferably ones added in the last 70-odd years in the developed world, that modify the immune system?
The hypothesis: Increased exposure to certain chemicals in our environment (food, makeup, air pollution, water contamination), when intersecting with susceptible genotypes, has led to an increase in allergy and autoimmune disease in the developed world.
So, to test it, you’d want to screen large numbers of chemicals in some kind of high-throughput immune assays. Good news: The dataset exists, and you can download it yourself! Bad news: It’s crap! Half-good-half-bad news: Nobody (as far as I know) talks about it or uses it for anything.
About 10 years ago the EPA decided to modernize environmental toxicology and generate The Dataset to end all datasets. They spent (wasted?) tens (hundreds?) of millions of dollars building the data architecture and contracting an army of adult daycare inmates like myself to carry out the assays, all to generate this monstrous dataset and a half-dozen low-impact publications nobody has ever read (don’t trust their publications page, it’s padded with anyone who uses the data for any purpose). Here’s a 728-page PDF some poor soul generated to describe the in vitro assays.
I fiddled around with the data about a year ago at this point, and generated this list of compounds if anyone is interested. I mostly focused on assays relevant to T cells (due to personal biases - B cells are Boring, T cells are Terrific) that came up with a Ka < 10 µM, although keep in mind that the majority of these things will be false positives**. TL;DR: pesticides are really, really bad and you shouldn’t eat them; they light up every assay like a Roman candle. Triclosan was an interesting hit, as it’s been (weakly) shown to influence autoimmunity in some mouse models and has been associated with allergy development. Here it came up as a potentiator of Lck activity, which is one of the major stimulatory proteins in T cells.
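If you want to try the same kind of filtering yourself, here is a rough sketch of the logic: keep rows from T cell-relevant assays under a 10 micromolar cutoff, then count how many assays each compound lights up. Everything here is an assumption for illustration. The field names (`chemical`, `target`, `potency_um`) and the toy rows are placeholders, not the EPA’s actual schema or real results, so you would need to map them onto whatever columns the download actually has.

```python
# Placeholder rows standing in for assay results; names and values are made up.
records = [
    {"chemical": "compound_A", "target": "T cell", "potency_um": 2.5},
    {"chemical": "compound_B", "target": "T cell", "potency_um": 40.0},
    {"chemical": "compound_C", "target": "B cell", "potency_um": 1.0},
    {"chemical": "compound_A", "target": "T cell", "potency_um": 8.0},
]


def t_cell_hits(rows, cutoff_um=10.0, target="T cell"):
    """Count, per chemical, the assays on `target` with potency below the cutoff."""
    counts = {}
    for row in rows:
        if row["target"] == target and row["potency_um"] < cutoff_um:
            counts[row["chemical"]] = counts.get(row["chemical"], 0) + 1
    # Most frequently hitting chemicals first.
    return dict(sorted(counts.items(), key=lambda kv: kv[1], reverse=True))


hits = t_cell_hits(records)
```

Counting hits per compound is one crude way to spot the things that light up everything, though promiscuous hitters are also exactly where you’d expect false positives to pile up.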
So…who cares? I suppose one might imagine mining some of these molecules as precursors to new drugs after the medicinal chemists have their way with them, although that kind of ‘pharma 1.0’ thinking never really appealed to me. Then again, everyone tells me to just try to make something work, and then your second company can be your vanity project/moonshot. Alternatively, I’ve got to assume that such a large database is amenable to machine learning, maybe along the lines of this paper? I think the largest problem is that the majority of the data here is without a doubt crap. Less relevant to the startup perspective is what the EPA actually wants to do, which is regulate some of these compounds. This would probably be prosocial, but then, if you wanted me to do prosocial stuff you should have given me my academic lab, ja?
*Note that, as complex traits, there are obviously genetic influences on the development of atopy and autoimmunity. The intersection of susceptible genetics and environment leads to disease.
**Cons:
Tons of false positives, as many of these compounds won’t be bioavailable or aren’t present in quantities large enough to be relevant
Dataset sucks and others have claimed it to be unreliable
Unclear that people suddenly started being exposed to these things in the last 70 years
Assays poorly optimized and either cell-free (very prone to false positives) or done in artificial overexpression systems