Small-Scale Question Sunday for August 6, 2023

Do you have a dumb question that you're kind of embarrassed to ask in the main thread? Is there something you're just not sure about?

This is your opportunity to ask questions. No question too simple or too silly.

Culture war topics are accepted, and proposals for a better intro post are appreciated.


Does anyone have a good guide on OPSEC for running a pseudonymous online profile? And what do you think about Anatoly Karlin’s contention that AI will make the whole exercise futile?

We're all going to get doxxed in the next 5 years, sure. It's going to be completely trivial, as Karlin says, to fingerprint writing style and have AI seamlessly connect all of your online writing. The only people who won't be so easily discovered are those who have zero lengthy writing that is scrapeable (i.e. no significant writing under their real name on LinkedIn or Facebook, no documentation under their name, no research papers or dissertations, no articles of any kind under their byline, no PDF of internal feedback or commentary mistakenly uploaded to the corporate website, no high school essay that won third place in a public competition, etc.). That's before we consider AI fingerprinting methods that automate a lot of current doxxing techniques, e.g. trawling countless forum pages for obscure details, username mentions, matching leaked email/password lists, location data and so on.

My guess is that if you're the relatively high verbal IQ kind of person who writes longform political content online, you probably have at least one of the above, or some other writing under your real name that has been or will be scraped at some point.

The only long-form corpus of text under my real name would be my research papers. I guess I'm glad they all got rewritten heavily by my professor.

On the other hand, I am skeptical purely because of how statistics work. Like seriously, I can't find a single paper on this shit that isn't amateur hour.

It would not be so trivial to produce a low-error model that yields high-confidence matches unless you are so online, with both your pseudonymous account and your real account, that by sheer sample size the error rate approaches 0 and the match confidence approaches 1.

Sometimes you really do run into the limits of information theory.

Yeah, there's a lot of text on the internet. With a pretty cursory Bayesian analysis, even with 99.9% accuracy you're looking at a thousand false positives if you are combing through a million posts. Without some other signal to narrow it down, it seems reasonable that it won't be possible from writing patterns alone.
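The back-of-the-envelope math above can be checked directly. This is a toy sketch with illustrative numbers (no real classifier has these exact rates): assume a stylometry model with 99.9% accuracy, i.e. a 0.1% false positive rate, run over a million candidate posts containing exactly one true match.

```python
# Toy base-rate calculation; all numbers are illustrative assumptions.
candidates = 1_000_000
true_matches = 1
fp_rate = 0.001   # 1 - 0.999 "accuracy"
tp_rate = 0.999   # assume the model almost always flags the real author

false_positives = (candidates - true_matches) * fp_rate
true_positives = true_matches * tp_rate

# P(actually the same author | model flagged a match), via Bayes
posterior = true_positives / (true_positives + false_positives)

print(round(false_positives))  # → 1000 false positives
print(round(posterior, 5))     # → 0.001, a flagged match is almost certainly wrong
```

Even a model that is "99.9% accurate" produces about a thousand false alarms per true hit at this scale, which is the base rate fallacy in action.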

Good point, and that's on top of the difficulties of modeling such a thing in the first place. Unless said model has obscenely (I'm talking retardedly fucking insane moronic batshit crazy) high accuracy, there is always going to be plausible deniability behind false positives.

You will probably Light Yagami yourself with information you gave away about your personal life long before they can fingerprint your text.

Sure, but the point is that these methods overlap: you can use a powerful LLM to parse high-likelihood text samples for shared details (or even things like shared interests, obscure facts, specific jargon), narrowing down your list of a thousand matches. Plus the passwords/emails thing is really important. Most people reuse them (at least sometimes), and there are tons of leaked lists online; with those you can chain together pseudonymous identities automatically. Right now this is still extremely labor-intensive, so it only happens in high-profile doxxings where suspicions already exist.
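The chaining step described above is mechanically trivial once the lists exist. Here is a toy illustration with entirely made-up breach dumps (the usernames, emails, and names are hypothetical): linking a pseudonymous account to a real identity is just a join on the reused email.

```python
# Hypothetical forum leak: username -> email
breach_a = {
    "throwaway_gamer": "alice@example.com",
    "night_owl_42": "bob@example.com",
}

# Hypothetical shopping-site leak: email -> real name
breach_b = {
    "alice@example.com": "Alice Smith",
    "carol@example.com": "Carol Jones",
}

# Chain the pseudonymous username to a real name via the reused email.
linked = {user: breach_b[email]
          for user, email in breach_a.items() if email in breach_b}

print(linked)  # → {'throwaway_gamer': 'Alice Smith'}
```

Real dumps have billions of rows rather than two, but the core operation is this simple, which is why credential reuse is the weak link regardless of how good stylometry gets.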

And I think writing styles are more unique than you think. Specific repeated spelling mistakes, specific repeated phrases, uncommon analogies or metaphors, weird punctuation quirks. And the size of the dataset for a regular user here (many hundreds of thousands of words, in quite a few cases) is likely enough for a model tuned on it to be really good at identifying the unique writing patterns of such a user.
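The idea that spelling mistakes and punctuation quirks are measurable can be sketched with a toy stylometry comparison. This is a minimal illustration, not a real attribution system (those use methods like Burrows' Delta or fine-tuned language models): it builds character trigram frequency profiles and compares them with cosine similarity, on invented sample texts.

```python
from collections import Counter
from math import sqrt

def trigram_profile(text):
    """Frequency profile of character trigrams, a crude style fingerprint."""
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine(a, b):
    """Cosine similarity between two trigram frequency profiles."""
    dot = sum(a[g] * b[g] for g in set(a) & set(b))
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Invented samples: the first two share quirks ("definately", "its weird tho").
known = trigram_profile("Definately going to the concert, definately. Its weird tho.")
same = trigram_profile("Definately late again, its weird tho, definately my fault.")
other = trigram_profile("The quarterly report shows revenue growth across all regions.")

# Repeated idiosyncrasies push the same-author similarity higher.
print(cosine(known, same) > cosine(known, other))
```

With a corpus of hundreds of thousands of words rather than two sentences, these per-author quirk statistics get much sharper, which is the intuition behind the fingerprinting claim.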

Okay, but where is the literature? Just show me that it's theoretically possible. I would do the math that supports my side of the argument, but you know... burden of proof.

The reasons discussed in the two comments above apply even with your new scenario. I don't think you understood the core of the arguments.

Also, what you are describing can be done in the present, absent a "powerful LLM". And no, it can't be done automatically anytime soon, because HTTP requests are not going to have a "fast takeoff".