
Small-Scale Question Sunday for August 20, 2023

Do you have a dumb question that you're kind of embarrassed to ask in the main thread? Is there something you're just not sure about?

This is your opportunity to ask questions. No question too simple or too silly.

Culture war topics are accepted, and proposals for a better intro post are appreciated.


Data science is genuinely so fun with ChatGPT-4, Copilot and a decent modern GPU. Interesting paper, but no public GitHub/code? I pasted in ~2000 words about their pipeline, mostly copied from the paper, and GPT-4 reproduced the (relatively complex) pipeline perfectly (in that I was getting almost identical results to the paper). At one point I was having issues, and since image support isn't yet accessible (for me, I guess) in the OpenAI API, I described a diagram in a paragraph of text and it understood it on the first attempt.
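
A minimal sketch of that paste-the-paper workflow, assuming the openai Python package as it existed around mid-2023 (the pre-1.0 ChatCompletion interface); the prompt wording and the PAPER_METHODS placeholder are mine, not the poster's actual setup:

```python
# Sketch: paste a paper's methods section into GPT-4 and ask for a runnable
# reproduction of the pipeline. Prompts and PAPER_METHODS are illustrative.
import openai

openai.api_key = "sk-..."  # your API key

PAPER_METHODS = """~2000 words copied from the paper's pipeline description..."""

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a careful data science assistant."},
        {"role": "user", "content": (
            "Reproduce the following analysis pipeline as runnable Python "
            "(pandas + scikit-learn), keeping every preprocessing step:\n\n"
            + PAPER_METHODS
        )},
    ],
    temperature=0,
)

print(response["choices"][0]["message"]["content"])
```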

Copilot makes importing and processing data a joke. There are probably more advanced ways to do it, but I literally just write a #comment describing what I want, press Enter, then Tab, and it mostly figures it out. Tell it to make some interesting visualizations and it writes them; ask a query about the data and it can answer it. I've also been using Copilot to generate clean CSVs from shoddy or otherwise messed-up data that would be a nightmare to clean manually.
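
For anyone who hasn't tried it, this is roughly what that comment-then-Tab loop produces. A hypothetical sketch with made-up file and column names, not actual Copilot output:

```python
# Illustration of the comment-driven workflow: each comment is the prompt,
# the code under it is the kind of completion Copilot typically suggests.
import pandas as pd
import matplotlib.pyplot as plt

# load the raw export, parse dates, and drop rows with a missing outcome
df = pd.read_csv("raw_export.csv", parse_dates=["date"])
df = df.dropna(subset=["outcome"])

# plot the monthly mean outcome over time
monthly = df.set_index("date")["outcome"].resample("M").mean()
monthly.plot(title="Monthly mean outcome")
plt.tight_layout()
plt.show()

# write a cleaned CSV for later use
df.to_csv("cleaned.csv", index=False)
```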

One of the best uses is finding old papers from pre-2015, pulling the original dataset if it's public, briefly explaining to ChatGPT what the structure of the data and experiment is (I've tried this with Copilot and it works sometimes, but actual ChatGPT-4 is more consistent; there are also people who have tried to automate this with the GPT API, but when I tried that code the results were inferior for some papers), and then just asking it to rewrite the approach as a modern pipeline. Admittedly I guess this means a late-2021 pipeline given the GPT training cutoff, but it's enough to yield huge improvements in predictive performance.
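
To give a concrete (and entirely hypothetical) flavour of what "rewrite the approach as a modern pipeline" can mean: something like swapping an older paper's hand-rolled regression for a scikit-learn pipeline with cross-validation. The dataset name, columns, and the choice of gradient boosting here are all placeholders, not anything from an actual paper:

```python
# Hypothetical "modern pipeline" rewrite for a tabular dataset from an old paper.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv("old_paper_dataset.csv")
X, y = df.drop(columns=["target"]), df["target"]

numeric_cols = X.select_dtypes("number").columns.tolist()
categorical_cols = [c for c in X.columns if c not in numeric_cols]

pipeline = Pipeline([
    # one-hot encode categoricals, pass numeric columns through unchanged
    ("prep", ColumnTransformer([
        ("num", "passthrough", numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ], sparse_threshold=0)),
    ("model", HistGradientBoostingRegressor(random_state=0)),
])

# 5-fold cross-validated R^2 instead of a single train/test split
scores = cross_val_score(pipeline, X, y, cv=5, scoring="r2")
print(f"5-fold CV R^2: {scores.mean():.3f} +/- {scores.std():.3f}")
```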

I think this has underscored how much of the value going forward will be in raw data. Foundation models with automated tuning will be used for everything, and LLMs will be able to tune, clean, prepare and modify code to make them work. "AI" is going to be cheap beyond compute costs, and those will come down hugely anyway if everyone's using the same small number of pretrained models (part of the reason I think Nvidia's bull run is going to end in tears) and software engineers are increasingly automated.

Instead, the money might well be in people who can navigate bureaucracies to acquire data, or who can collect or generate it themselves. I guess this explains Elon's obsession with trying to prevent AI companies scraping Twitter, although anything online is going to be a free-for-all, except perhaps the very largest datasets that people might pay for. Niche datasets that you can't just find on the internet or pull from Kaggle are going to be valuable, especially because they might only be saleable once or a few times. The "AI layer", critical though it is, will be almost impossible to make margin on if you're not big tech; all the margin will be in the data itself.

(Also, how funny that I should have learned how to program about three months before it ceased to be useful.)

This is fascinating, and makes me want to play with data via these tools. I just wish I had an idea of how to get data to answer certain questions, or a better understanding of the infrastructure. Got any data science primers laying around?

Got any data science primers laying around?

Pick up a statistics textbook.