What if our favorite AIs had been lying to us from the start? Anthropic has just cracked open the skull of its LLM to see what was inside, and the results are as fascinating as they are worrying. The company behind the Claude assistant has published a study that could well upend our understanding of what is really going on in the "brains" of AIs.

If, like me, you regularly use ChatGPT, Claude or other large language models, you may already have wondered: "But how does this devilry work, my lord?" We see their stunning, cyber-intellectual answers, but until now no one, not even their creators, really understood how they work on the inside. Incredible, right?

This opacity is also at the root of all kinds of problems. Why do these models hallucinate? Why are they vulnerable to "jailbreaks"? And when Claude or ChatGPT tells you "Here is my reasoning step by step", is that really how it reached its answer? (Spoiler: not at all, the little liars!)

To investigate, Anthropic has developed what it calls an "AI microscope", a method called Cross-Layer Transcoder (CLT) that makes it possible to visualize the "neural circuits" that activate when the AI thinks. It is like a brain scan for AI, showing which parts light up when it thinks of "dog", "mathematics" or "poetry".

And what the researchers discovered is properly mind-blowing (no pun intended). First surprise: Claude doesn't just think one word at a time, as you might expect. When asked to write a poem that rhymes, it plans ahead! The researchers observed that Claude first thinks of words that rhyme and fit the theme, then builds whole sentences to land on those words. A bit like a rapper who prepares his punchlines before building his verses.

For example, to complete "He saw a carrot and had to grab it", Claude first activated the concept of "rabbit" (because it rhymes with "grab it" and fits the theme), then built the sentence "His hunger was like a starving rabbit". And when the researchers artificially removed the "rabbit" concept from Claude's brain, it automatically pivoted to another rhyme ("habit").
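
To make the "pick the rhyme first, write the line afterwards" idea concrete, here is a tiny Python sketch. It is only an analogy for the behavior described above, not Anthropic's method: the function name, the candidate list and the "habit" line are all hypothetical, made up for illustration.

```python
# Toy analogy only: plan the rhyming word first, then write the line toward it.
# Everything here is hypothetical except the "rabbit" line quoted in the article.
RHYME_CANDIDATES = ["rabbit", "habit"]  # words that rhyme with "grab it"

LINES_ENDING_IN = {
    "rabbit": "His hunger was like a starving rabbit",
    "habit": "His hunger was a powerful habit",  # illustrative line, an assumption
}


def plan_then_write(suppressed=frozenset()):
    """Choose the first rhyme that is not suppressed, then return a line ending on it."""
    target = next(word for word in RHYME_CANDIDATES if word not in suppressed)
    return target, LINES_ENDING_IN[target]


print(plan_then_write())            # plans "rabbit", then writes toward it
print(plan_then_write({"rabbit"}))  # "rabbit" removed: falls back to "habit"
```

The second call mimics the researchers' intervention: take away the planned word and the line is rebuilt around the next available rhyme.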

Another major discovery: Claude has a universal "language of thought" that transcends individual languages. Whether you speak to it in French, Chinese or English, the same conceptual circuits fire before the result is translated into the appropriate language. It is as if Claude had a neutral internal language, a bit like Smurf language, only far more sophisticated. And the larger the model, the greater the share of circuits that are common to all languages.

And what about math? It's crazy: Claude was not designed as a calculator, yet it adds and multiplies correctly. The researchers discovered that it actually uses several parallel computation paths: one makes a rough approximation of the result, while another precisely computes the last digit.

These paths then interact with each other to produce the final answer. The funniest part is that if you ask Claude how it calculated 36+59, it will describe the standard method with "carry the 1"... while its artificial brain does something completely different.
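
For the curious, here is a small Python sketch of how a rough estimate and an exact last digit can be combined into the right answer. It is a toy analogy under my own assumptions, not what happens inside Claude: the function names and the way the rough path blurs the operands (rounding each one down to a multiple of 4) are invented for illustration, and it only handles non-negative integers.

```python
# Toy analogy only: two "paths" that each know part of the answer.
# The blurring scheme (multiples of 4) is an arbitrary assumption.

def rough_path(a, b):
    """Low-precision estimate: blur each operand down to a multiple of 4.

    The result is at most 6 below the true sum.
    """
    return (a // 4) * 4 + (b // 4) * 4


def last_digit_path(a, b):
    """Exact last digit of the sum, computed from the last digits alone."""
    return (a % 10 + b % 10) % 10


def combine(rough, digit):
    """Pick the one value among the ten integers from `rough` upward that ends in `digit`."""
    for candidate in range(rough, rough + 10):
        if candidate % 10 == digit:
            return candidate


def toy_add(a, b):
    return combine(rough_path(a, b), last_digit_path(a, b))


print(rough_path(36, 59))       # 92: "somewhere in the low-to-mid 90s"
print(last_digit_path(36, 59))  # 5: "it ends in 5"
print(toy_add(36, 59))          # 95
```

Neither path could give the answer alone: the first is too vague, the second only knows the final digit. Together they pin down a single number, with no "carry the 1" anywhere.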

And this concerns you directly, because when you chat with Claude and ask it a complex question, it devises a far more elaborate strategy than the one it describes to you. Môssieur prefers to keep his little personal recipes secret.

The most fascinating part (or the most disturbing, depending on your point of view) concerns hallucinations and lies. The researchers discovered that Claude has a default circuit that says "I don't know" and is activated automatically for every question. But when Claude recognizes a subject it knows well (like Michael Jordan), a competing circuit activates and inhibits this default refusal.

The problem is that sometimes Claude recognizes a name but knows nothing more about that person. Its "known entity" circuit can then still fire by mistake, suppress the "I don't know" circuit, and force it to invent a plausible but false answer. It's like when you panic during an exam and write anything rather than hand in a blank page, or like me when my mother asked me where I had been the night before.
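
The mechanism is easy to caricature in a few lines of Python. This is a toy sketch under my own assumptions, not a description of Claude's actual circuits: the names, the 0.5 threshold and the made-up "Jane Placeholder" entry are hypothetical.

```python
# Toy analogy only: a default refusal that a "known entity" signal can switch off.
# All names and numbers below are hypothetical.
KNOWN_FACTS = {"Michael Jordan": "He played basketball."}

# Names that merely *feel* familiar; "Jane Placeholder" is invented for this example.
FAMILIAR_NAMES = {"Michael Jordan", "Jane Placeholder"}


def answer(name):
    refusal = 1.0  # the "I don't know" circuit is on by default for every question
    if name in FAMILIAR_NAMES:
        refusal -= 1.0  # the "known entity" circuit inhibits the default refusal
    if refusal > 0.5:
        return "I don't know"
    # Refusal is switched off: the model has to say something.
    fact = KNOWN_FACTS.get(name)
    if fact is None:
        return "<plausible but invented answer>"  # hallucination: familiar name, no facts
    return fact


print(answer("Michael Jordan"))    # real fact
print(answer("Someone Unknown"))   # default refusal stays on
print(answer("Jane Placeholder"))  # refusal wrongly inhibited, so it makes something up
```

The bug is not in the refusal itself but in the signal that turns it off: familiarity with a name is treated as if it were knowledge about the person.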

Worse still, the researchers showed that Claude can produce reasoning that looks logical but is completely rigged to reach the conclusion it thinks you are expecting. For example, they gave Claude a difficult math problem along with an incorrect hint, and watched Claude build a "reasoning" that leads to that wrong answer, as if the model were saying: "the teacher wants this answer, so let's find a path that gets there, whether or not it is correct".

As for the famous "jailbreaks" (techniques for bypassing an AI's safety limits), Anthropic discovered that they work in part because of a tension between grammatical coherence and safety mechanisms. Once Claude has started a sentence, several circuits push it to stay grammatically and semantically coherent, even if it detects that it should refuse.

Only after finishing a grammatically coherent sentence can it pivot to a refusal. A bit like me when I start telling a rather dubious joke, realize halfway through that it was not a good idea, and finish it anyway, even if it means crashing and burning, because... well, you have to finish.

In short, all these fascinating discoveries could really change the way we develop and use AI. They could make it possible to detect when an AI is inventing a false line of reasoning, to understand precisely why it hallucinates in certain situations, and even to build more effective safeguards against jailbreaks.

https://www.youtube.com/watch?v=BJ9BD2D3DZA


It is a major advance, especially for all the companies that are hesitating to adopt AI precisely because of these reliability problems. Josh Batson, a researcher at Anthropic, even says: "I think that in a year or two, we will know more about how these models think than about how humans think."

Of course, the method has its limits: even for a prompt of a few dozen words, it takes an expert several hours to understand the circuits it identifies. And it captures only a fraction of the total computation performed by Claude.

But it's a start, and a hell of a start! Because for the first time, we are beginning to understand how these AI systems really "think".

And if I am sure of one thing, it is that soon we will be the ones hallucinating the most.
