What would be the time and cost of creating a similar dataset?

I don't have access to a financial breakdown of the kinds of studies that feed into this database, so I admit this is somewhat me talking out of my ass, but the sequencing costs involved are pretty low these days. I'd guess the cost of computation might even be comparable to running the SNP panels (maybe $25-50 per sample in bulk, or even $70-100 per sample in bulk if you just want full genome data).

The real cost is the army of nurses, doctors and scientists doing more or less unpaid labor for career advancement and altruistic reasons. Doing this as a private company would be staggeringly expensive unless your scale is much smaller, and either way, any kind of payout from this data would be dubious. Getting the demographic and phenotypic data to associate with the genetic data is an enormous pain in the ass between IRBs, patients who are unreliable and uninterested in giving you data, making sure you're following all the regulations around PII, etc. Not to mention the fact that half the population hates the medical-industrial complex right now and is unlikely to cooperate on any kind of large-scale project.

Ideally, we'd all be genome sequenced at birth and our medical records would be entered into a centralized system where researchers could access de-identified data. The ML folks and data scientists would be able to tease out a remarkable number of associations that we just don't have the power for right now. Although maybe we've just circled back to square one, where that centralized system would decide what you can do with its data...

I have helped prep data for NHLBI. Basically it is just taking existing study data then standardizing and anonymizing it for inclusion into NHLBI. The idea is to retain the data that's already been gathered for future use.

I'm very much opposed to the restrictions FWIW.