Small-Scale Question Sunday for June 11, 2023

Do you have a dumb question that you're kind of embarrassed to ask in the main thread? Is there something you're just not sure about?

This is your opportunity to ask questions. No question too simple or too silly.

Culture war topics are accepted, and proposals for a better intro post are appreciated.


Apologies for the naive question, but I'm largely ignorant of the nuts and bolts of AI/ML.

Many data formats in biology are just giant arrays, with each row representing a biological cell and each column a gene (RNA-Seq) or measured parameter (flow cytometry). Sometimes rows are genetic variants and columns are characteristics of each variant (minor allele frequency, predicted impact on protein function, etc.).

Is there a way to feed this kind of data to LLMs? It seems trivial for chatGPT to parse 'This is an experiment looking at activation of CD8+ T cells, generate me a series of scatterplots and gates showcasing the data' but less trivial to parse the giant 500,000x15 (flow) or 10,000x20,000 (scRNA-Seq) arrays. Or is there a way for LLMs to interact with existing software?
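For concreteness, the flow case is literally just a big numeric table, something like this toy version (marker names invented):

```python
import numpy as np
import pandas as pd

# Toy stand-in for a 500,000 x 15 flow matrix: one row per cell,
# one column per measured parameter. Marker names are invented.
rng = np.random.default_rng(0)
markers = ["FSC-A", "SSC-A", "CD3", "CD4", "CD8"] + [f"marker_{i}" for i in range(10)]
events = pd.DataFrame(rng.lognormal(3, 1, size=(500_000, 15)), columns=markers)

print(events.shape)  # (500000, 15)
print(events.head())
```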

What’s the advantage over normal programming?

Imagine you have 50 samples in your experiment and each sample has 10 gates, so you're skimming over 500 scatter plots and then entering however many readouts you have into histogram plots to represent the data you got.

This really feels like a pair of 'for' loops instead of a flexible task. You could even go up a level and write a tool that lets you pick different axes.
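Something like this, roughly, assuming each sample is already loaded as a DataFrame of events (sample count, plot definitions, and marker names here are toy stand-ins):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Toy stand-ins: 3 samples instead of 50, 2 plots instead of 10 gates.
samples = {
    f"sample_{i}": pd.DataFrame(
        rng.lognormal(3, 1, size=(1_000, 4)),
        columns=["FSC-A", "SSC-A", "CD4", "CD8"],
    )
    for i in range(3)
}
plots = {"lymphocytes": ("FSC-A", "SSC-A"), "cd4_vs_cd8": ("CD4", "CD8")}

# The "pair of for loops": one scatter plot per sample per gate.
for sample_name, events in samples.items():
    for plot_name, (x_col, y_col) in plots.items():
        fig, ax = plt.subplots()
        ax.scatter(events[x_col], events[y_col], s=1)
        ax.set(xlabel=x_col, ylabel=y_col, title=f"{sample_name}: {plot_name}")
        fig.savefig(f"{sample_name}_{plot_name}.png")
        plt.close(fig)
```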

Why language models specifically? From a cursory Google search I found a couple of papers which may make more sense to you than they do to me:

https://www.sciencedirect.com/science/article/pii/S1672022922001668

https://www.frontiersin.org/articles/10.3389/fimmu.2021.787574/full

To overcome the challenges faced by manual gating, many computational tools have been developed to automate every step of cytometry data analysis, including quality control (5), batch normalization (6, 7), data visualization (8–10), cell population identification (11–16), and sample classification (17–20). These tools utilize a wide range of computational methods, ranging from rule-based algorithms to machine learning models.

Do you want LLMs so you can "talk to" your lab results? Otherwise it's easier to analyse masses of data without the LLM middleman.

Yeah, exactly. There's a lot of grunt work involved in flow cytometry analysis, which I was thinking of more than the scRNA-Seq. Machine learning for most basic flow cytometry is slightly overkill because what you're doing with each gate is conceptually pretty simple. I tried to elaborate/clarify in this comment.

You should send the grunt work to the CCP, where even denizens can do it for fractions of a cent.

You've been repeatedly warned to stop doing low effort drive-bys like this that contribute nothing.

Banned for five days this time.

Though to be fair, what little I saw of t-SNE (for example, in analysis of single-cell transcriptomes), and what I heard from objective people familiar with the research, didn't necessarily inspire confidence that the emerging patterns were indicative of anything real.

Interesting. I've spent a lot of time staring at t-SNE plots (or, more recently, UMAPs, which have largely taken over) and they map pretty well onto our underlying understanding of the biology. It got a bit hairy when we asked it to split the data into too many clusters and it was difficult to know whether we were looking at some novel, minor cell type or a hallucination.

I think I asked that question poorly and also lack the vocabulary to describe what I'm envisioning. Current software for analyzing this kind of data (flow) exists and the typical workflow is just making a series of scatterplots with 'gates,' or subsets of cells that express a given marker. Here's a basic example.

Verbally, it's all very simple: gate on singlets, then on lymphocytes via forward/side scatter, exclude dead cells, gate on CD3+, and then split into CD4 and CD8 T cells. It's the kind of instruction that should be very easy for chatGPT to parse, even with just a single sentence outlining the experiment. But how do you feed it the data? Is there a way for chatGPT to interact with existing analysis software to draw gates and generate scatterplots? I assume you wouldn't want to feed the raw array of cells into your prompt, although I don't know.
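For what it's worth, expressed over a plain table of events that chain is just a stack of boolean masks. A minimal sketch, with invented thresholds (real gates are drawn per-experiment, usually as polygons rather than simple cutoffs):

```python
import numpy as np
import pandas as pd

# Toy events table; the thresholds below are invented for illustration.
rng = np.random.default_rng(0)
cols = ["FSC-A", "FSC-H", "SSC-A", "viability", "CD3", "CD4", "CD8"]
events = pd.DataFrame(rng.lognormal(3, 1, size=(100_000, len(cols))), columns=cols)

singlets = events["FSC-H"] / events["FSC-A"] > 0.8            # singlet gate
lymphs = singlets & events["FSC-A"].between(10, 50) \
                  & events["SSC-A"].between(5, 30)            # lymphocytes
live = lymphs & (events["viability"] < 15)                    # exclude dead cells
t_cells = live & (events["CD3"] > 20)                         # CD3+ gate
cd4 = t_cells & (events["CD4"] > 20) & (events["CD8"] < 20)
cd8 = t_cells & (events["CD8"] > 20) & (events["CD4"] < 20)

print(f"CD4: {cd4.sum()}, CD8: {cd8.sum()} of {len(events)} events")
```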

Maybe I'll back up and zoom out a bit. Most people use FlowJo to analyze flow cytometry data. It's a multibillion-dollar industry, yet they haven't updated the software in something like a decade (and that last update made it worse than the version I was using before), and you routinely draw the same gates over and over again. Imagine you have 50 samples in your experiment and each sample has 10 gates, so you're skimming over 500 scatter plots and then entering however many readouts you have into histogram plots to represent the data you got. It's repetitive and the software is clunky. LLMs definitely seem 'smart' enough to understand everything that's going on, but I don't have the first idea how you'd communicate that kind of data to them...

If your goal is to get chatGPT to produce plots and summaries of your data, one route would be to describe the task as you have here, and ask it to write e.g. Python code that does what you want.

You could then run the code it produces, passing in the location of your data, and hopefully receiving the desired plots and summaries. This would probably involve some work on your part, though; chatGPT's code isn't perfect, so you'd need to understand the process well enough to guide it.
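To make that concrete, the kind of script you'd hope to get back, and could sanity-check line by line, might be as simple as this sketch. It assumes your data exports to CSV with named marker columns, which is an assumption on my part:

```python
import sys
import pandas as pd
import matplotlib.pyplot as plt

# Usage: python plot_flow.py path/to/export.csv
# Assumes a CSV with one row per event and named marker columns
# ("CD4"/"CD8" here are placeholders for whatever your panel uses).
path = sys.argv[1]
events = pd.read_csv(path)

fig, ax = plt.subplots()
ax.scatter(events["CD4"], events["CD8"], s=1)
ax.set(xlabel="CD4", ylabel="CD8", title=path)
fig.savefig("cd4_vs_cd8.png")

print(events.describe())
```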

If you have access to chatGPT plugins, you might want to check out Noteable, which claims to streamline this process, and let chatGPT do more of the work. I haven't used it myself though, so I can't say how well it works.

Sorry, I think my description of what I was thinking of was exceptionally poor. I tried to elaborate in this comment.

Right, it's a more complicated workflow than that. I think my previous comment still stands, though.

You don't want to feed the data directly to an LLM, they're terrible at the sort of direct computation you're describing. What you want is to explain your task to an LLM in a high-level way, have it turn that into lower-level instructions that can be read by some other program, and then maybe get the LLM to interpret that program's output and continue.
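A minimal sketch of that loop, using the OpenAI chat API as it exists today; the model name and prompts are placeholders, and run_gates stands in for the "other program" you'd write yourself:

```python
import json
import openai  # pip install openai; the 0.27-era ChatCompletion API

# openai.api_key = "..."  # set your key here or via OPENAI_API_KEY

def run_gates(instructions: dict) -> dict:
    """Stand-in for your actual analysis program: takes machine-readable
    gate definitions and returns event counts per gate. Hypothetical."""
    return {gate: 0 for gate in instructions.get("gates", [])}

# 1. High-level task in, lower-level machine-readable instructions out.
prompt = (
    'Turn this into JSON of the form {"gates": [...]}, with no other text: '
    "gate on singlets, then lymphocytes, exclude dead cells, "
    "gate CD3+, then split into CD4 and CD8."
)
resp = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
# Real code would need to cope with the model wrapping the JSON in prose.
instructions = json.loads(resp["choices"][0]["message"]["content"])

# 2. Run the instructions through the non-LLM program.
counts = run_gates(instructions)

# 3. Optionally hand the program's output back to the model to interpret.
summary = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": f"Summarize these gate counts: {counts}"}],
)
print(summary["choices"][0]["message"]["content"])
```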

What I'm describing is a chatGPT plugin, and the only one I know of that does this kind of thing is Noteable, though I've no idea how useful it would be. The other alternative is just to write a program that does what you want directly - it sounds possible from what you've written so far - and get chatGPT to help you write it.

In principle, you could try developing your own plugin, but I suspect that would be a lot more work than just writing a program to solve the original problem.

If you get something like this going, let me know. I'm exploring a local-LLM use case: cybersecurity packet analysis. Loading bulk data separately from the prompt engineering and so on is all complicated by a small 2k context length. Newer open models have landmark-attention tech for 10-30k+ token context lengths, but they're less sophisticated 7B/13B-parameter models compared to the 30B ones I've been using.

I have no useful suggestion, but that's a neat idea! Great example of the kind of thing that AI could straightforwardly do and save a huge amount of man-hours of tedious, boring labor. The AI probably still won't know why anyone would care about a particular gate, but it could make it quite easy to visualize things.

I'm no expert but have some familiarity. LLMs have a limited context window (GPT-4's is 8,000 tokens), so one can't hold all of that data at once. Probably the easiest way to get it to chew through that much is to ask it for code to do the things you want (directing it to write some pygraph or R code or something). It could plausibly do it inline if you asked it to summarize chunks of data, then fed the previous summary in with the next chunk. The code would be a much more auditable, and probably more accurate, tool though.
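The chunked-summary approach would look roughly like this; summarize here is a stand-in for whatever API or local-model call you'd actually use:

```python
def summarize(text: str) -> str:
    # Stand-in for an actual LLM call (API or local model); hypothetical.
    return text[:200]

def rolling_summary(rows, chunk_size=500):
    """Fold a running summary over chunks of a table too big for one prompt."""
    summary = ""
    for start in range(0, len(rows), chunk_size):
        chunk = "\n".join(map(str, rows[start:start + chunk_size]))
        summary = summarize(f"Previous summary:\n{summary}\n\nNew data:\n{chunk}")
    return summary

print(rolling_summary(list(range(2000))))
```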