
Small-Scale Question Sunday for June 11, 2023

Do you have a dumb question that you're kind of embarrassed to ask in the main thread? Is there something you're just not sure about?

This is your opportunity to ask questions. No question too simple or too silly.

Culture war topics are accepted, and proposals for a better intro post are appreciated.


Apologies for the naive question, but I'm largely ignorant of the nuts and bolts of AI/ML.

Many data formats in biology are just giant arrays, with each row representing a biological cell and each column a gene (RNA-Seq) or a measured parameter (flow cytometry). Sometimes the rows are genetic variants and the columns are various characteristics of each variant (minor allele frequency, predicted impact on protein function, etc.).

Is there a way to feed this kind of data to LLMs? It seems trivial for ChatGPT to parse 'This is an experiment looking at activation of CD8+ T cells, generate me a series of scatterplots and gates showcasing the data', but far less trivial for it to parse the giant 500,000x15 (flow) or 10,000x20,000 (scRNA-Seq) arrays. Or is there a way for LLMs to interact with existing software?

I'm no expert, but I have some familiarity. LLMs have a limited context window (GPT-4 is 8,000 tokens), so they can't hold all of that data at once. Probably the easiest way to get one to chew through that much is to ask it for code to do the things you want (directing it to write some Python or R plotting code, say). It could plausibly do it inline if you asked it to summarize chunks of data, then fed the previous summary in with the next chunk. The code route is much more auditable, and probably more accurate, though.
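To make the chunked-summarization idea concrete, here's a minimal Python sketch with pandas. The file layout, column names, and chunk size are all made-up assumptions for illustration (a synthetic stand-in for a flow export is generated in-memory); the point is just that each chunk gets condensed to a few summary statistics small enough to fit in a context window.

```python
import io

import numpy as np
import pandas as pd

# Synthetic stand-in for a flow-cytometry export: 10,000 cells x 5 parameters.
# In practice you'd point pd.read_csv at the real exported file instead.
rng = np.random.default_rng(0)
raw = pd.DataFrame(rng.normal(size=(10_000, 5)),
                   columns=["FSC", "SSC", "CD3", "CD8", "CD45"])
csv_buffer = io.StringIO(raw.to_csv(index=False))

CHUNK_ROWS = 2000  # tune so each chunk's summary fits the model's context window

summaries = []
for chunk in pd.read_csv(csv_buffer, chunksize=CHUNK_ROWS):
    # Condense each chunk into per-column statistics the model can reason about.
    stats = chunk.describe().loc[["mean", "std", "min", "max"]]
    summaries.append(stats.to_csv())

# Each element of `summaries` is now only a few hundred characters; you could
# feed them to the LLM one at a time, carrying the running summary forward.
print(len(summaries))  # 10,000 rows / 2,000 per chunk -> 5 summaries
```

The same pattern works for the scRNA-Seq case, except there you'd likely summarize column subsets (genes) as well as row chunks, since 20,000 columns of statistics is itself too large for one prompt.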