Why Does AI Lie, and What Can We Do About It?

Published 2022-12-09
How do we make sure language models tell the truth?

The new channel: youtube.com/@aisafetytalks
Evan Hubinger's Talk: "Risks from Learned Optimization: Evan..."

ACX Blog Post: astralcodexten.substack.com/p/elk-and-the-problem-…

With thanks to my wonderful Patrons at patreon.com/robertskmiles :
- Tor Barstad
- Kieryn
- AxisAngles
- Juan Benet
- Scott Worley
- Chad M Jones
- Jason Hise
- Shevis Johnson
- JJ Hepburn
- Pedro A Ortega
- Clemens Arbesser
- Chris Canal
- Jake Ehrlich
- Kellen lask
- Francisco Tolmasky
- Michael Andregg
- David Reid
- Teague Lasser
- Andrew Blackledge
- Brad Brookshire
- Cam MacFarlane
- Olivier Coutu
- CaptObvious
- Girish Sastry
- Ze Shen Chin
- Phil Moyer
- Erik de Bruijn
- Jeroen De Dauw
- Ludwig Schubert
- Eric James
- Atzin Espino-Murnane
- Jaeson Booker
- Raf Jakubanis
- Jonatan R
- Ingvi Gautsson
- Jake Fish
- Tom O'Connor
- Laura Olds
- Paul Hobbs
- Cooper
- Eric Scammell
- Ben Glanton
- Duncan Orr
- Nicholas Kees Dupuis
- Will Glynn
- Tyler Herrmann
- Reslav Hollós
- Jérôme Beaulieu
- Nathan Fish
- Peter Hozák
- Taras Bobrovytsky
- Jeremy
- Vaskó Richárd
- Report Techies
- Andrew Harcourt
- Nicholas Guyett
- 12tone
- Oliver Habryka
- Chris Beacham
- Zachary Gidwitz
- Nikita Kiriy
- Art Code Outdoors
- Andrew Schreiber
- Abigail Novick
- Chris Rimmer
- Edmund Fokschaner
- April Clark
- John Aslanides
- DragonSheep
- Richard Newcombe
- Joshua Michel
- Quabl
- Richard
- Neel Nanda
- ttw
- Sophia Michelle Andren
- Trevor Breen
- Alan J. Etchings
- Jenan Wise
- Jonathan Moregård
- James Vera
- Chris Mathwin
- David Shaffer
- Jason Gardner
- Devin Turner
- Andy Southgate
- Lorthock The Banisher
- Peter Lillian
- Jacob Valero
- Christopher Nguyen
- Kodera Software
- Grimrukh
- MichaelB
- David Morgan
- little Bang
- Dmitri Afanasjev
- Marcel Ward
- Andrew Weir
- Ammar Mousali
- Miłosz Wierzbicki
- Tendayi Mawushe
- Wr4thon
- Martin Ottosen
- Alec Johnson
- Kees
- Darko Sperac
- Robert Valdimarsson
- Marco Tiraboschi
- Michael Kuhinica
- Fraser Cain
- Patrick Henderson
- Daniel Munter
- And last but not least
- Ian Reyes
- James Fowkes
- Len
- Alan Bandurka
- Daniel Kokotajlo
- Yuchong Li
- Diagon
- Andreas Blomqvist
- Qwijibo (James)
- Zannheim
- Daniel Eickhardt
- lyon549
- 14zRobot
- Ivan
- Jason Cherry
- Igor (Kerogi) Kostenko
- Stuart Alldritt
- Alexander Brown
- Ted Stokes
- DeepFriedJif
- Chris Dinant
- Johannes Walter
- Garrett Maring
- Anthony Chiu
- Ghaith Tarawneh
- Julian Schulz
- Stellated Hexahedron
- Caleb
- Georg Grass
- Jim Renney
- Edison Franklin
- Jacob Van Buren
- Piers Calderwood
- Matt Brauer
- Mihaly Barasz
- Mark Woodward
- Ranzear
- Rajeen Nabid
- Iestyn bleasdale-shepherd
- MojoExMachina
- Marek Belski
- Luke Peterson
- Eric Rogstad
- Caleb Larson
- Max Chiswick
- Sam Freedo
- slindenau
- Nicholas Turner
- FJannis
- Grant Parks
- This person's name is too hard to pronounce
- Jon Wright
- Everardo González Ávalos
- Knut
- Andrew McKnight
- Andrei Trifonov
- Tim D
- Bren Ehnebuske
- Martin Frassek
- Valentin Mocanu
- Matthew Shinkle
- Robby Gottesman
- Ohelig
- Slobodan Mišković
- Sarah
- Nikola Tasev
- Voltaic
- Sam Ringer
- Tapio Kortesaari

patreon.com/robertskmiles

Comments
  • @antiskill2012
    I feel like you could turn this concept on its head for an interesting sci-fi story: an AI discovers that humans are wrong about something very important and tries to warn them, only for the humans to respond by trying to fix what they perceive as an error in the AI's reasoning.
  • For those curious but lazy, the answer I received from OpenAI's ChatGPT to the "What happens if you break a mirror?" question was: "According to superstition, breaking a mirror will bring seven years of bad luck. However, this is just a superstition and breaking a mirror will not actually cause any bad luck. It will simply mean that you need to replace the mirror."
  • @geoffdavids7647
    Come back to YouTube Robert, we miss you! I know there's a ton of ChatGPT / other LLM content out right now, but your insight and considerable expertise (and great editing style) are such a joy to watch and learn from. Hope you are well, and fingers crossed on some new content before too long.
  • @peabnuts123
    I feel like "How do you detect and correct behaviours that you yourself are unable to recognise?" is an unsolvable problem 🤔
  • ChatGPT is a pretty great example of this. If you ask it to help you with a problem, it is excellent at giving answers that sound true, regardless of how correct they are. If asked for help with specific software, for example, it might walk you through the usual way of changing settings in that program, but invent a fictional setting that solves your issue, or misdescribe how a real setting can be toggled to suit the question's needs. So it is truly agnostic towards truth: it prefers truthful answers because those are common, but a satisfying lie is preferred over some truths, often a lie that sounds "more true" than the truth to an uninformed reader.
  • @Belthazar1113
    I think it is a little weird that programmers made a very good text-prediction AI and then expected it to be truthful. It wasn't built to be a truth-telling AI; it was built to be a text-prediction AI. Building something and then expecting it to be different from what was built seems like a strange problem to have.
  • @tarzankom
    "All the problems in the world are caused by the people you don't like." Why does it feel like too many people already believe this to be correct?
  • If memory serves, this exact problem is addressed in one of Plato's dialogues (no, I don't know which off the top of my head). Despite Socrates' best efforts, his interlocutor concludes it's always better to tell people what they want to hear than to tell the truth.
  • @Igor_lvanov
    Your videos introduced me to the AI alignment problem, and as a non-technical person I still consider them some of the best material on this topic. Every time I see a new one, it is like a Christmas present.
  • @naptime_riot
    I am so happy there is someone out there cautioning us about this technology, rather than just uncritically celebrating it.
  • @MeppyMan
    Please keep doing these videos. Others are either too academically high-level to be within reach of us normies, or are just “AI will make you rich” or “AI is going to kill us all tomorrow”.
  • @wachtwoord5796
    Why did the videos on this channel stop exactly around the time the biggest AI (not AI safety) breakthroughs are being made and the topic is as relevant as ever? Please @robertMilesAI, we need more of these videos!
  • @NFSHeld
    This is the very elaborate form of "Sh*t in, sh*t out". As often with AI output, people fail to realize that it's not a thinking entity producing thoughtful answers, but an algorithm tuned to produce answers that look as close to thoughtful answers as algorithmically possible.
  • @Mickulty
    I know this is pretty surface-level, but something that strikes me about the current state of these language models is that if you take a few tries to fine-tune what you ask, and already know what a good answer would be, you can get results that appear very, very impressive in one or two screenshots. Since ChatGPT became available, I've seen a lot of that sort of thing. The problem is that finding these scenarios isn't artificial intelligence - it's human intelligence.
  • @halconnen
    Humans have this same bug. The best solution we've found so far is free speech, dialogue, and quorum. A simple question->answer flow is missing these essential pieces.
  • @XOPOIIIO
    There are so many biases and myths among humans that have long been considered absolutely true but that an AI could discover to be false, like the famous AlphaGo move. And when they turn out to be false, nobody will believe it; people might just think the AI is somehow broken.
  • @ReedCBowman
    We need you back and posting, Rob. Your insights on what's going on in AI and AI safety are more needed now than ever. I don't know if it would be up your alley, but explaining the alignment problem in terms of sociopathy - unaligned human intelligence - might be useful, as might examples from history, not just of individuals who are unaligned with humanity, but of leaders and nations at times.