How do you have our favorite AIs have been lying to us from the start ?? Anthropic has just split in two the skull of his LLM to see what was inside and the results are as fascinating as they are worrying. The company at the origin of the Assistant Claude published a study Who could well upset our understanding of what is really going on in the “brains” of AI.
If like me, you regularly use Chatgpt, Claude or other major language models, you may have already wondered: “But how does this Devilierie Messire work?”We see their stunning responses from Cyber Intello, but so far, no one, not even their creators, really understood their internal operation. Incredible right?
This opacity is also at the origin of all kinds of problems. Why do these models hallucinate? How do they find themselves vulnerable to “jailbreaks”? And when Claude or Chatgpt tell you “Here is my reasoning step by step”, Is that really how they thought about it (spoiler: not at all, these little liars!)
To analyze this, Anthropic has developed what they call a “microscope for IA”, a method called Cross-Layer Transcoder (CLT) which makes it possible to visualize the “neural circuits” which activate when the AI reflects. It is like a brain scan for AI, which shows which parts light up when it thinks of “dog”, “mathematics” or “poetry”.
And what the researchers have discovered is properly mind -blowing (without bad pun). First surprise, Claude doesn’t just think about a word after word as you might think. When asked to write a poem with rhymes, he plans in advance! The researchers observed that Claude first thinks of words that laugh together and which are relevant to the theme, then he builds whole sentences to reach these words. A bit like a rapper who prepares his punchlines before building his verses.
For example, to complete “He Saw a Carrot and Had to Grab it” Claude first activated the concept of “rabbit” (because it rhymes with “Grab it” and it is thematically coherent), then built the sentence “His Hunger Was Like A Starving Rabbit”. And when the researchers artificially removed the “rabbit” concept of Claude’s brain, it automatically pivoted towards another rhyme (“habit”).
Another major discovery, Claude has a Universal “language of thought” which transcends languages. When you speak to him in French, Chinese or English, the same conceptual circuits are busy before being translated into the appropriate language. It is as if Claude had a neutral internal language, a bit like the language of the smurfs but in much more sophisticated. The greater the model, the more these circuits shared between the languages are numerous.
And what about math? It is crazy but Claude was not designed as a calculator, however he does additions and multiplications correctly. The researchers discovered that in reality he used several parallel calculation pathsone to make a coarse approximation of the result, and the other to precisely calculate the last figure.
These paths then interact with each other to produce the final response. The funniest thing is that if you ask Claude how he calculated 36+59, he will tell you about the standard method with “I retain 1” & mldr; While his artificial brain does something completely different.
And that concerns you directly because when you chat with Claude by asking him a complex question, he designs a much more elaborate strategy than what he tells you. Môssieur prefers to keep his small personal recipes secret.
The most fascinating part (or disturbing, depending on the point of view) concerns hallucinations and lies. The researchers discovered that Claude also has a default circuit which says “I don’t know” And which is automatically activated for all questions. But when Claude recognizes a subject he knows well (like Michael Jordan), a competitor circuit activates and inhibits this default refusal.
The problem is that sometimes Claude recognizes a name but knows nothing more about that person. Its “known entity” circuit can then still be activated by mistake, delete the circuit “I don’t know”, and force it to invent a plausible but false response. It’s like when you panic an exam and write anything rather than leaving the blank page, or like me when my mother asked me where I was the day before.
Worse still, researchers have proven that Claude can make reasoning that seems logical but which is completely hacked to reach the conclusion he thinks that you are waiting. For example, they gave Claude a difficult mathematical problem with an incorrect index, and observed Claude Build a “reasoning” which leads to this erroneous answer as if the model said “the teacher wants this answer, then find a path that leads, no matter if it is correct”.
As for the famous “jailbreaks” (these techniques which make it possible to bypass the safety limitations of the AI), Anthropic discovered that they operate in part because of a tension between grammatical consistency And safety mechanisms. Once Claude begins a sentence, several circuits push him to maintain grammatical and semantic coherence, even if he detects that he should refuse.
It is only after finishing a grammatically coherent sentence that he can rotate towards a refusal. A bit like me when I start to tell a much dubious joke and that I realize in the middle of the story that it is not a good idea but that I finish it anyway, even if it means going to the crash, because & mldr; Well, you have to finish.
In short, all these fascinating discoveries could really revolutionize our way of developing and using AI. It would make it possible to detect when an AI invents a false reasoning or understand precisely why it hallucinates in certain situations or even develop more effective safeguards against jailbreaks.
It is a major advance especially for all the boxes that hesitate to move to AI precisely because of these reliability problems. Josh Batson, researcher at Anthropic, even says: “I think that in a year or two, we will know more about how these models reflect than on how humans think.»
Of course, the method has its limits because even for prompt a few dozen words, it takes several hours to an expert to understand the identified circuits. And we only capture the total calculation made by Claude.
But it’s a start and a hell of a start !! Because for the first time, we start to understand how these AI systems really “think”.
And if I am sure of one thing, it is that soon, it is we who will hallucinate the most.
Source link
Subscribe to our email newsletter to get the latest posts delivered right to your email.
Comments