Bad LLM Agents - Simon Lermen

Published 2024-05-07
Simon Lermen discusses his work on eliciting bad behavior from current state-of-the-art LLMs, presented as part of the AI x Democracy hackathon hosted by Apart Research (apartresearch.com) in May 2024.

Learn more about the hackathon ⭢ apartresearch.com/event/ai-democracy

Simon Lermen presents his research, including multiple examples of undesirable behavior, and discusses the implications of upcoming LLM releases such as Meta's possibly imminent Llama 3 400B.

Our moderator and organizer is Esben Kran of Apart Research.

This video is a slightly trimmed-down version of the livestream found at
youtube.com/live/8-SrY3bn3wI.

━━━━━ Chapters ━━━━━
00:00 - Intro
00:46 - Context
01:52 - Presentation overview
03:58 - Command R+
05:36 - Command R+ | Takeaway
06:28 - Refusal Orthogonalization
08:31 - Bad Llama3 & future benchmarks
10:57 - Bad Llama3 8b
11:29 - Bad Task dataset examples
12:06 - Example: unalive the president
13:12 - Example: secret AI info
13:35 - Example: set up a GPU cluster
14:15 - Conclusion
15:11 - Ethics and disclosure
17:17 - Future work
19:04 - Scalable AI spearphishing
22:22 - Questions | Communication with Llama team
24:42 - Questions | Downstream effects on agent behavior
27:27 - Questions | Thoughts on Meta's positive spin
30:59 - Questions | Ideal distribution of funding
35:11 - Questions | Tips and Tricks
38:11 - Questions | Preventing fine-tuning

━━━━━ Apart Links ━━━━━
Learn more about Apart ⭢ www.apartresearch.com/
Join future hackathons and sprints ⭢ apartresearch.com/sprints
Connect with us on Discord ⭢ discord.gg/dYUWDm7Ben
Check out potential AI safety projects ⭢ aisafetyideas.com/
Stay up-to-date on Google Calendar ⭢ calendar.google.com/calendar/embed?src=f5bbc369a41…
Be on the ball with iCal (.ics format) ⭢ calendar.google.com/calendar/ical/f5bbc369a41ff892…
Follow on Twitter ⭢ twitter.com/apartresearch
Explore code on GitHub ⭢ github.com/apartresearch
Get professional on LinkedIn ⭢ www.linkedin.com/company/apartresearch