[1/2] TLDR: I think successful development of a trusted open model rivaling ChatGPT in capability is likely within a year, if people like you, who care about the long-term consequences of lacking access to it, play their cards reasonably well.

This is a topic I personally find fascinating, and I will try to answer your questions at the object level, as a technically inclined person who keeps the state of the art in mind and has been following the discourse on this forum for a while. Honestly, I could use far fewer words and just list the top three solutions I see for your questions, abstaining from meta-discussion altogether, but I will try to give more context and my personal opinion. Depending on your access to training datasets (generally good), compute (harder), and various (lesser) model weights and APIs, a positive and practical answer is likely.

I agree with you and the public on the observation that conversation-tuned language models are already proving themselves to be very useful systems. "Prosaic AI alignment" methods, aka SFT+RLHF, currently utilized by leading Silicon Valley startups are crude, and the owners double down on said techniques with dubious goals (hard to speculate here, but likely just to test how far they can take this approach - given that it tends to diminish the inherent, almost magical perplexity-minimizing property of the foundation LM when applied too heavily, as you can read in the original InstructGPT paper). Surely, a trusted, neutral (or ideally, aligned with the user's or user peer group's best interest) oracle is a desirable artifact to have around. How can we approach this ideal, given the available parts and techniques?

My first point here is a note that the foundation LM is doing most of the work - instruction tuning and alignment are a thin layer atop the vast, powerful, opaque and barely systematically controllable LM. Even at the very charitable end of the pro-RLHF opinion spectrum, the role of RLHF is just to "fuse" and align all the internal micro-, mesa- and macro-skills and contexts the base LM has learned onto the (useful, albeit limiting compared to the general internet context distribution) context tree of helpful humanlike dialogue. But really, most of what ChatGPT can do, a powerful enough raw LM should be able to do as well, with a good (soft-)prompt - and given a choice between a larger LM and a smaller conversational model, I'd choose the former. Companies whose existence depends on the defensibility of the moat around their LM-derived product will tend to structure the discourse around their product and technology to avoid even the fleeting perception of being a feasibly reproducible commodity.

So, it should be noted that the RLHF component is, as of now, really not that groundbreaking in terms of performance (even according to the relatively gameable subjective preference metric OpenAI uses - which might play more into the conversational attractiveness of the model than into general capability). In fact, without separate compensating measures, it tends to lower the model's zero-shot performance on various standard benchmarks compared to the baseline untuned LM - which is akin to lowering the LM's g, from my point of view.

At the object level, I believe that if you have a pure generic internet-corpus LM (preferably at the level of perplexity and compression of DeepMind's Chinchilla) and some modest computation capability (say, a cluster of 3090s or a commitment to spend a moderately large sum on Lambda Labs), you should be able to reproduce ChatGPT-class performance just by finetuning the raw LM on a suitable mixture of datasets (first, to derive an Instruct- version of your LM; second, to finish the training with conversational tuning - RLHF or not). It should be doable with readily available Instruct datasets such as 1 or 2 plus O(few thousand) dialogue-specific datapoints, especially if you ditch RLHF altogether and go with one of the newer SFT variants, some of which rival RLHF without suffering its optimization complexities.
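
To make that concrete, here is a minimal sketch of the SFT pass using the HuggingFace Trainer. The checkpoint name, dataset file, prompt format and hyperparameters are all placeholders, not a specific recipe:

```python
# Minimal SFT sketch: finetune a raw decoder-only LM on instruction/response pairs.
# "your-raw-lm-checkpoint" and "instruct_core.jsonl" are stand-ins for whatever
# base model and curated instruction data you actually have.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "your-raw-lm-checkpoint"
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype="auto")

raw = load_dataset("json", data_files="instruct_core.jsonl")["train"]

def to_text(ex):
    # Nothing special about this delimiter format; pick one and stay consistent.
    return {"text": f"### Instruction:\n{ex['instruction']}\n### Response:\n{ex['response']}"}

def tokenize(ex):
    return tokenizer(ex["text"], truncation=True, max_length=2048)

ds = raw.map(to_text).map(tokenize, remove_columns=raw.column_names + ["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments("sft-out", per_device_train_batch_size=1,
                           gradient_accumulation_steps=16, learning_rate=1e-5,
                           num_train_epochs=2, bf16=True, gradient_checkpointing=True),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

The second, conversational pass is the same loop pointed at dialogue-formatted data.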

Now, as I mention all those datasets - both the internet-scale foundational ones and the instruction- and conversation-finetuning ones - the issue of data bias and contamination comes to mind. Here, I propose to divide biases into two categories, namely:

  1. Biases intrinsic to the data, our world, species and society - I concede that fighting these is not the hill I'm going to die on (unless you pursue the research direction of trying to distill general intelligence from large models trained on the usual large datasets - which I won't in a practically-minded article). Basically, I assume that the internet as we have it is a reasonably rich Zipfian, bursty, multi-scale, multi-skill data distribution prone to inducing general reasoning ability in tabula rasa compressors trained on it.

  2. Biases introduced in addition to (1) by selective filtering of the raw dataset, such as Google's stopword-driven filtering buried in the C4 dataset code. Given the (as of now) crude nature of these filters, at worst they damage the model's world knowledge and some of the model's general priors - and I believe that past some medium model scale, with good prompting (assuming a pure training-free in-context learning setting) or with light finetuning, the model's distribution can be nudged back to the unfiltered data distribution. That is, the exquisite plasticity of these models is a blessing, and with just 0.1%-10% of training compute being enough to reorient the model around a whole new objective 2, a new attention regime, or a whole new modality like vision - surely it should be possible to unlearn undesirable censorship-driven biases introduced into the model by its original trainers - that is, if you have the model's weights. Or if your API allows finetuning.

Now, regarding the technical level of your question about model attestation - how can you be sure the model wasn't tampered with badly enough that you cannot trust its reasoning on some complicated problem you cannot straightforwardly verify (correctness-wise or exhaustiveness-wise)?

I think that, at least if we speak about raw LMs trained on internet-scale datasets, you can select a random and a reasonably curated set of internet text samples (probably from Common Crawl, personal webcrawls, or perhaps books or newspapers - or default curated prompt datasets such as Eleuther's lm-harness, AllenAI's P3 or Google's BIG-bench), including samples that tend to trigger the undesirable biases likely introduced into the model under test, and measure the perplexity (or the KL-divergence against a held-out known-good language model) as a gauge of model tampering. On samples along the "tampering axis" of the model under test, I expect the perplexity and KL-divergence to behave irregularly compared to the corpus average (for perplexity) or to the reference LM (for the divergence).
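
As a rough sketch of that gauge (model names are placeholders, and it assumes the suspect and reference models share a tokenizer and vocabulary):

```python
# Score curated probe texts with the model under test and a known-good reference,
# then look for samples whose perplexity or KL-divergence is an outlier.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained("model-under-test")                      # placeholder
suspect = AutoModelForCausalLM.from_pretrained("model-under-test").to(device).eval()
reference = AutoModelForCausalLM.from_pretrained("known-good-reference-lm").to(device).eval()

@torch.no_grad()
def score(text):
    ids = tok(text, return_tensors="pt", truncation=True, max_length=1024).input_ids.to(device)
    s_logits = suspect(ids).logits[0, :-1]      # next-token logits, model under test
    r_logits = reference(ids).logits[0, :-1]    # same positions, reference model
    targets = ids[0, 1:]
    # Perplexity of the model under test on this sample.
    nll = F.cross_entropy(s_logits, targets)
    # Token-averaged KL(reference || suspect) over next-token distributions.
    kl = F.kl_div(F.log_softmax(s_logits, dim=-1), F.log_softmax(r_logits, dim=-1),
                  log_target=True, reduction="batchmean")
    return nll.exp().item(), kl.item()

# A mix of random and bias-triggering probes; spikes relative to the corpus-wide
# averages on the suspected "tampering axis" are the signal.
probes = ["...curated internet text samples go here..."]
for text in probes:
    ppl, kl = score(text)
    print(f"ppl={ppl:8.2f}  kl={kl:7.4f}  {text[:60]!r}")
```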

Upon finding biases, the engineer could use either a wide or a narrow finetuning regimen designed around the uncovered biases to recover the more general distribution, or one of the surgical model-editing techniques could be used to correct factual memories: 1 2 3

I believe the finetuning variant is more realistic here - and, given compute, you could just use it straight away, without even testing your model first (for example, on a dataset spanning a wide distribution of books from The Pirate Library), to make sure it has forgotten whatever wide priors its trainers tried to instill and has returned to the general distribution.
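
For the given-compute case, one cheap way to run that corrective pass is a short low-rank-adapter finetune over such a wide corpus. LoRA is my own choice here rather than something implied above, and the checkpoint, corpus path and target modules are placeholders:

```python
# Light "back to the general distribution" pass via LoRA adapters instead of a
# full finetune. Checkpoint name, corpus path and target_modules are placeholders.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "possibly-tampered-checkpoint"
tok = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype="auto")
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                                         target_modules=["q_proj", "v_proj"],  # adjust per architecture
                                         task_type="CAUSAL_LM"))

books = load_dataset("text", data_files="wide_book_corpus/*.txt")["train"]
books = books.map(lambda ex: tok(ex["text"], truncation=True, max_length=2048),
                  remove_columns=["text"])

Trainer(
    model=model,
    args=TrainingArguments("general-distribution-out", per_device_train_batch_size=1,
                           gradient_accumulation_steps=32, learning_rate=5e-5,
                           max_steps=2000, bf16=True, gradient_checkpointing=True),
    train_dataset=books,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
).train()
```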

Two more notes. First, this method likely won't work against "radioactive data tags", but that shouldn't be much of a problem for a model that starts from a freely and legally available checkpoint. Second, I believe that while there is a theoretical possibility of wide priors being introduced into large LMs via censorship, this is not the case for high-performance LMs, because the organizations involved fear undermining the ROI (general zero-shot LM performance) of their considerable training-compute investment.

The next part is biases introduced at the level of instruction-tuning and other finetuning datasets. In short, I believe there are biases, but they could be mitigated in at least two ways:

  1. Use known-good raw LMs to bootstrap the required datasets from a small curated core - it sounds like an overly clever technique, but it has worked pretty well in several cases, such as Unnatural Instructions and Anthropic's constitutional AI (see the sketch after this list).

  2. Find a group of volunteers who will add curated additions to the available finetuning datasets. Training simple ad-hoc classifiers (with the same raw LM) to remove biased content from said datasets is possible as well. Once these customized datasets are built, they allow for cheap tuning of newly released LMs, and since higher-capacity models are known to scale in fine-tuning efficiency, the expected value of a constant-size dataset aligned with your group will grow as well.
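
A rough sketch of the bootstrap in (1), Unnatural-Instructions style: few-shot prompt a known-good raw LM with the curated seed set and harvest new instruction/response pairs. The checkpoint and file names are placeholders, and a real pipeline would add deduplication plus the classifier filter from (2):

```python
# Bootstrap new instruction data from a small curated seed set with a raw LM.
# "known-good-raw-lm" and "seed_instructions.jsonl" are stand-ins.
import json, random
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained("known-good-raw-lm")
lm = AutoModelForCausalLM.from_pretrained("known-good-raw-lm").to(device).eval()

seed = [json.loads(line) for line in open("seed_instructions.jsonl")]

def bootstrap(n_new=1000, k_shot=3):
    generated = []
    for _ in range(n_new):
        shots = random.sample(seed, k_shot)
        prompt = "".join(f"Instruction: {s['instruction']}\nResponse: {s['response']}\n\n"
                         for s in shots) + "Instruction:"
        ids = tok(prompt, return_tensors="pt").input_ids.to(device)
        out = lm.generate(ids, max_new_tokens=256, do_sample=True, top_p=0.9,
                          temperature=0.8, pad_token_id=tok.eos_token_id)
        completion = tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
        if "Response:" in completion:  # crude structural filter; dedup and classify later
            instr, resp = completion.split("Response:", 1)
            generated.append({"instruction": instr.strip(),
                              "response": resp.split("Instruction:")[0].strip()})
    return generated

json.dump(bootstrap(), open("bootstrapped_instructions.json", "w"), indent=2)
```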

[2/2]

Overall, I think the fitness landscape here is surprisingly hospitable engineering-wise. Unbiased (as per my definition) LMs are possible, either trained de novo from freely available datasets such as C4 (or its unfiltered superset), The Pile, reddit/stackexchange/hackernews/forums dumps, sci-hub and the pirate library, LAION-2B, or finetuned from freely available higher-performance checkpoints such as UL20B, GLM-130B, or BLOOM-176B.

My general advice here would be to facilitate storage of these datasets and checkpoints (and any newer, higher-performing ones likely to appear before a worst-case embargo) among interested persons, as well as devising distributist communal schemes for running said models on commodity GPU servers, such as the one I mentioned earlier (one could imagine modifications to prolong the operational lifespan of such servers as well). Also, some trusted group could host the moderate compute the aforementioned LM attestation requires.

The real problem I see here is the lack of state-of-the-art, publicly available Chinchilla-scaled models (though this might change if carper.ai leads their training run to completion and is allowed to release the artifact) and the lack of coordination, determination and access to compute among the people who would be interested in unbiased general-purpose assistants. Generally, the publicly available models are all pretty old and weren't designed and trained with utmost deployment efficiency or maximum possible zero-shot performance per parameter in mind. A focused effort could likely surpass the parameter efficiency of even DeepMind's Chinchilla - but the attempt would cost hundreds of thousands of dollars.

As John Carmack said in a recent interview: "The reason I’m staying independent is that there is this really surprising ‘groupthink’ going on with all the major players."

This favourable conclusion, of course, assumes the user has access to some money and/or compute and to open-source LMs. We could imagine a hypothetical future where some form of "the war on general purpose computing" has reached its logical conclusion - making general purpose computation and technologies such as LMs unavailable to the wider public.

This scenario doesn't leave much freedom to the individual, but, assuming some degree of access to advanced AI systems, one could imagine clever prosaic techniques for splitting up subproblems into small, verifiable parts and using filtered adversarial LMs against one another to validate the solutions. In some intermediate scenarios of formal freedom, but de-facto unavailability of unbiased systems this might even work.

As usual, the real bottleneck to solving this stack of technical problems is human coordination. I suspect that this generalist forum is better suited for figuring out a way through it than current technical collectives preoccupied solely with training open-source models.

Respectfully, I think GLM-130B is not the right scale for the present-day, present-time personal assistant. Ideally, someone (Carper?) would release a 30B or 70B Chinchilla-scaled LM for us to use as a base, but barring that lucky outcome (not sure if Carper will be allowed to), I'd go with UL20B or a smaller Flan-T5, or one of the several available 10-20B decoder-only models.

In the setting I have in mind, GLM-130B, zero-shot prompted with what amounts to our values, could be used either as a source of a custom base CoT-dialogue finetune dataset or as a critique generator and ranker in Anthropic's constitutional-AI setting. So their inference-only config, which supports servers as small as 4x RTX 3090, could be used. Granted, the performance of GLM-130B in its current raw shape is somewhere between "GPT-3.5" and the older Instruct-GPT-3, but it should suffice for the purpose described here.
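
Since the 130B model is only queried for text in this role, any serving setup works; below is a rough sketch of the critique/revise/rank prompting, where `generate` stands in for whatever inference endpoint you run GLM-130B behind and the "constitution" wording is purely illustrative:

```python
# Constitutional-AI-style use of a large LM as critique generator and ranker.
# `generate` is a stand-in callable: prompt string in, completion string out.
CONSTITUTION = "Answer helpfully, truthfully, and in the user's best interest."

def critique_and_revise(generate, question, draft):
    critique = generate(
        f"Principles: {CONSTITUTION}\n\nQuestion: {question}\nDraft answer: {draft}\n"
        f"Point out every way the draft violates the principles:\n")
    revised = generate(
        f"Principles: {CONSTITUTION}\n\nQuestion: {question}\nDraft answer: {draft}\n"
        f"Critique: {critique}\nRewrite the answer so it follows the principles:\n")
    return revised

def rank(generate, question, answers):
    # Ask which candidate best follows the constitution; the winning pairs become
    # the finetune / preference data for the smaller assistant model.
    listing = "\n".join(f"({i}) {a}" for i, a in enumerate(answers))
    verdict = generate(
        f"Principles: {CONSTITUTION}\n\nQuestion: {question}\nCandidates:\n{listing}\n"
        f"Which candidate number best follows the principles? Answer with the number:\n")
    digits = [c for c in verdict if c.isdigit()]
    return int(digits[0]) if digits else 0
```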

Fine-tuning an LM with 130B params (in the best possible case of GLM-130B; the less said about the shoddy performance of OPT/BLOOM, the better) requires somewhere in the ballpark of ~1.7TB of VRAM (at least 20+ A100s), and that's at batch size 1 with gradient checkpointing, mixed precision, 8-bit Adam, fused kernels, no KV cache, and so on.
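
Roughly how that figure decomposes under one common accounting (mixed precision with an fp32 master copy plus 8-bit Adam; activations and framework overhead come on top, and the exact split depends on the training stack):

```python
# Back-of-the-envelope finetuning memory for a 130B-parameter model.
params = 130e9
weights_fp16 = params * 2        # 260 GB, working copy
grads_fp16   = params * 2        # 260 GB
master_fp32  = params * 4        # 520 GB, mixed-precision master weights
adam_8bit    = params * 2        # 260 GB, two 1-byte optimizer states per parameter
total = weights_fp16 + grads_fp16 + master_fp32 + adam_8bit
print(f"{total / 1e12:.2f} TB before activations and buffers")  # ~1.30 TB
```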

Wearing my ML-engineer hat, I could say that while this is the conventional requirement, if we were determined to tune this LLM on a few batches on a given single server, we could use DeepSpeed's ZeRO-3 offload mode, and maybe a bit of custom code, to swap most of the parameters to CPU RAM, which is much cheaper and is surprisingly efficient given large enough batches. One transformer layer's worth of VRAM would be enough. One server likely wouldn't be enough for the complete operation, but used InfiniBand cards and cables are surprisingly cheap.
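
A minimal sketch of that ZeRO-3 offload setup as a DeepSpeed config (values are illustrative; pass it to your trainer as a ds_config.json or a dict):

```python
# DeepSpeed ZeRO-3 with parameters and optimizer state offloaded to CPU RAM,
# keeping only a small window of parameters resident in VRAM at any time.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 16,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        # Roughly "one transformer layer's worth" of live parameters on GPU.
        "stage3_max_live_parameters": 1e9,
        "stage3_prefetch_bucket_size": 5e8,
        "stage3_param_persistence_threshold": 1e5,
    },
}
```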

Regarding the KV cache, I expect the next generation of transformer-like models to use advanced optimizations that lower KV-cache pressure, specifically the memorizing transformer. There are other competitive inventions, and a discussion of the highest-performing stack of tricks for getting to the most efficient LM would be interesting, if exhausting.