
Culture War Roundup for the week of March 20, 2023

This weekly roundup thread is intended for all culture war posts. 'Culture war' is vaguely defined, but it basically means controversial issues that fall along set tribal lines. Arguments over culture war issues generate a lot of heat and little light, and few deeply entrenched people ever change their minds. This thread is for voicing opinions and analyzing the state of the discussion while trying to optimize for light over heat.

Optimistically, we think that engaging with people you disagree with is worth your time, and so is being nice! Pessimistically, there are many dynamics that can lead discussions on Culture War topics to become unproductive. There's a human tendency to divide along tribal lines, praising your ingroup and vilifying your outgroup - and if you think you find it easy to criticize your ingroup, then it may be that your outgroup is not who you think it is. Extremists with opposing positions can feed off each other, highlighting each other's worst points to justify their own angry rhetoric, which becomes in turn a new example of bad behavior for the other side to highlight.

We would like to avoid these negative dynamics. Accordingly, we ask that you do not use this thread for waging the Culture War. Examples of waging the Culture War:

  • Shaming.

  • Attempting to 'build consensus' or enforce ideological conformity.

  • Making sweeping generalizations to vilify a group you dislike.

  • Recruiting for a cause.

  • Posting links that could be summarized as 'Boo outgroup!' Basically, if your content is 'Can you believe what Those People did this week?' then you should either refrain from posting, or do some very patient work to contextualize and/or steel-man the relevant viewpoint.

In general, you should argue to understand, not to win. This thread is not territory to be claimed by one group or another; indeed, the aim is to have many different viewpoints represented here. Thus, we also ask that you follow some guidelines:

  • Speak plainly. Avoid sarcasm and mockery. When disagreeing with someone, state your objections explicitly.

  • Be as precise and charitable as you can. Don't paraphrase unflatteringly.

  • Don't imply that someone said something they did not say, even if you think it follows from what they said.

  • Write like everyone is reading and you want them to be included in the discussion.

On an ad hoc basis, the mods will try to compile a list of the best posts/comments from the previous week, posted in Quality Contribution threads and archived at /r/TheThread. You may nominate a comment for this list by clicking on 'report' at the bottom of the post and typing 'Actually a quality contribution' as the report reason.


New research paper attempts to quantify which professions have the most to lose from the introduction of GPTs into the larger world. From the abstract:

Our findings indicate that approximately 80% of the U.S. workforce could have at least 10% of their work tasks affected by the introduction of GPTs while around 19% of workers may see at least 50% of their tasks impacted.

The results vary by model, but mathematics and math-related industries like accounting have the highest risk. The researchers overall found that "information processing industries (4-digits NAICS) exhibit high exposure, while manufacturing, agriculture, and mining demonstrate low exposure" (pg 15) and "programming and writing skills...are more susceptible to being influenced by language models."

I find myself wondering if "learn to code" from however long back will shortly become "learn to farm" or some such.

These kinds of papers usually do a complicated mathematical or statistical dance to get an "estimate" of the thing of interest, while assuming away the real complexity involved. The economy is monstrously complex, and the kinds of tasks a language model could automate depend heavily on the details of an individual task or group of tasks, and of the language model. Whatever technique they used probably won't be particularly informative, and the 'meat' of the estimate will come from something questionable, like bad mathematical assumptions. Even granting all of the paper's conclusions, GPT-4 is so much better than GPT-2 - both on language tasks and with the new image modality - that models in three years will probably have significantly improved capabilities, making claims like "models will affect writing and art more than other occupations" questionable. [written before I read the paper]

Reading the paper, they ... survey people familiar with language models, and, for detailed descriptions of occupations and the tasks required to do those occupations, taken from a dataset, ask them how much of that work GPT would be able to automate. I believe both the authors (mostly OpenAI employees) and OpenAI's existing data labelers were used for this - their wording is "To ensure the quality of these annotations, the authors personally labeled a large sample of tasks and DWAs and enlisted experienced human annotators who have extensively reviewed GPT outputs as part of OpenAI’s alignment work".

A fundamental limitation of our approach lies in the subjectivity of the labeling. In our study, we employ annotators who are familiar with the GPT models’ capabilities. However, this group is not occupationally diverse, potentially leading to biased judgments regarding GPTs’ reliability and effectiveness in performing tasks within unfamiliar occupations. We acknowledge that obtaining high-quality labels for each task in an occupation requires workers engaged in those occupations or, at a minimum, possessing in-depth knowledge of the diverse tasks within those occupations.

They also, of course, ask GPT-4 the same questions, and get similar results to the human answers.
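
To make the setup concrete, here's a toy sketch of what comparing the human and GPT-4 labels might look like - the tasks, label categories, and field layout are invented for illustration, not the paper's actual rubric or data:

```python
# Toy sketch of a human-vs-GPT-4 labeling comparison. Everything here
# (tasks, label names, schema) is made up for illustration.

# Each record: (occupation, task description, human label, GPT-4 label).
annotations = [
    ("Accountant",       "Reconcile monthly ledgers",       "exposed",     "exposed"),
    ("Accountant",       "Meet clients to review filings",  "not_exposed", "not_exposed"),
    ("Technical writer", "Draft API documentation",         "exposed",     "exposed"),
    ("Plumber",          "Replace a leaking pipe fitting",  "not_exposed", "not_exposed"),
    ("Plumber",          "Write up a quote for a customer", "exposed",     "not_exposed"),
]

# Count how often the model's label matches the human's.
agreements = sum(1 for _, _, human, gpt in annotations if human == gpt)
print(f"human/GPT-4 agreement: {agreements}/{len(annotations)} "
      f"({agreements / len(annotations):.0%})")
```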

The humans (and, roughly, GPT) estimate that "for the median occupation, approximately 15% of tasks are directly exposed to GPTs". Exposure is defined as LLM use decreasing the time required to complete a task by at least 50%*. If exposure is extended to include hypothetical software built on top of LLMs, the percentage of tasks whose required time is halved increases to 50%. They correlate various 'skills' data from the dataset with exposure, and find "science" and "critical thinking" skills are strongly negatively associated with exposure, whereas "programming" and "writing" skills are strongly positively associated.
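
A rough sketch of how that aggregation could work - made-up numbers and my own simplification, not the paper's actual pipeline: each task gets a yes/no judgment on whether an LLM (alone, or via hypothetical LLM-powered software) would cut its completion time by at least half, per-occupation exposure is the share of such tasks, and the headline figure is the median across occupations.

```python
# Simplified, invented version of the exposure aggregation described above.
from statistics import median

# occupation -> list of (llm_alone_halves_time, llm_plus_software_halves_time)
tasks_by_occupation = {
    "Accountant":       [(True, True), (False, True), (False, False), (True, True)],
    "Technical writer": [(True, True), (True, True), (False, True)],
    "Plumber":          [(False, False), (False, True), (False, False), (False, False)],
}

def exposure_share(tasks, with_software):
    # Fraction of an occupation's tasks whose time would be at least halved.
    flags = [(soft if with_software else alone) for alone, soft in tasks]
    return sum(flags) / len(flags)

direct   = [exposure_share(t, with_software=False) for t in tasks_by_occupation.values()]
extended = [exposure_share(t, with_software=True)  for t in tasks_by_occupation.values()]

print(f"median direct exposure:   {median(direct):.0%}")
print(f"median extended exposure: {median(extended):.0%}")
```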

I think the entire approach is confused. The data source, and the basis for all conclusions, is - surveying AI experts on the effects of language models on various jobs. These people probably don't know much about accounting, creative writing, or plumbing. Yet we take their vague intuitions, squeeze them through the authors' 'rubric', and then do analysis on the resulting dataset.

This, broadly, makes sense for quantitative data - collecting ten thousand datapoints of 'plant growth : fertilizer amount' and then running a statistical test has advantages over staring at the plants and guessing. But if you asked a few hundred farmers what they think the plant growth for some amount of fertilizer would be, plotted the results, and found a correlation - at best you're getting a noisy estimate of asking the farmers "how effective is fertilizer?", and at worst you're obfuscating the farmers' lack of understanding with p-values.

Why not, instead, have them debate, research, think, and write about their ideas - more in the form of a blog post? Or do case studies, a deep dive on AI's applications in specific industries, and then use those to generalize? That seems much weaker than a data analysis with graphs and p-values - but at least it exposes the uncertainty, and explores it!

My 'steelman' would be: making estimates for each occupation in the dataset 'grounds' human speculation, and weighting those estimates by occupation frequency data leads to a much better estimate than any one human can give, which feels vaguely like rationalist forecasting methods. But even granting that, it's still mostly compressing very questionable estimates into 'data', hiding the likely more interesting, and potentially flawed, reasons annotators might give for their estimates.
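
To put the farmer version in toy-simulation form (a setup I invented, not anything from the paper): if every farmer's guess comes from the same rough rule of thumb plus noise, regressing the guesses on fertilizer amount recovers that rule of thumb with a convincing-looking fit, whether or not it matches what fertilizer actually does.

```python
# Toy simulation of surveying farmers instead of measuring plants.
import random

random.seed(0)
SHARED_BELIEF_SLOPE = 2.0   # growth the farmers *believe* each unit of fertilizer adds
TRUE_SLOPE = 0.5            # what fertilizer actually does (never enters the survey)

fertilizer = [random.uniform(0, 10) for _ in range(300)]
guesses = [SHARED_BELIEF_SLOPE * f + random.gauss(0, 1.0) for f in fertilizer]

# Ordinary least-squares slope of the guesses on fertilizer amount.
n = len(fertilizer)
mean_f = sum(fertilizer) / n
mean_g = sum(guesses) / n
slope = (sum((f - mean_f) * (g - mean_g) for f, g in zip(fertilizer, guesses))
         / sum((f - mean_f) ** 2 for f in fertilizer))

print(f"slope recovered from the survey: {slope:.2f}")   # ~2.0, the shared belief
print(f"true effect of fertilizer:       {TRUE_SLOPE}")  # never measured
```

The regression "confirms" the farmers' shared belief with plenty of statistical significance, because that belief is all the data ever contained.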

*The thresholding effect: if your data only records "does time on this task decrease by at least 50%? Yes or no?", distinctions like '55% vs 95%' are lost, and this can lead to confusing interpretations of aggregates of the thresholded data.
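
A quick made-up illustration of that loss of information: two occupations with very different real time savings per task look identical once each task is reduced to a binary "time cut by at least 50%?" flag.

```python
# Invented numbers: per-task fraction of time an LLM would save.
savings_a = [0.55, 0.55, 0.55, 0.10]   # modest savings on most tasks
savings_b = [0.95, 0.95, 0.95, 0.10]   # huge savings on most tasks

def exposure(savings, threshold=0.5):
    # Share of tasks whose time savings clear the threshold.
    return sum(s >= threshold for s in savings) / len(savings)

print(f"occupation A: exposure={exposure(savings_a):.0%}, "
      f"mean time saved={sum(savings_a)/len(savings_a):.0%}")
print(f"occupation B: exposure={exposure(savings_b):.0%}, "
      f"mean time saved={sum(savings_b)/len(savings_b):.0%}")
# Both report 75% exposure, but the mean time saved is 44% vs 74%.
```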