AI Breakdown

arXiv preprint - Time Travel in LLMs: Tracing Data Contamination in Large Language Models

16 Jan 2024

Description

In this episode, we discuss "Time Travel in LLMs: Tracing Data Contamination in Large Language Models" by Shahriar Golchin and Mihai Surdeanu. The paper presents a method for detecting test-data contamination in large language models by checking whether a model's output closely matches specific segments of reference data. The method prompts the model with a guided instruction that names the dataset and partition type, compares the model's output with the reference instance, and then judges whether a whole partition is contaminated using either statistical overlap measures or a classifier built on GPT-4's few-shot in-context learning. The reported results show high accuracy in identifying contamination and indicate that GPT-4 has been contaminated with datasets such as AG News, WNLI, and XSum.
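
To make the idea in the description concrete, here is a minimal sketch of the two ingredients it mentions: a guided instruction that names the dataset and partition, and an overlap score between the model's output and the reference. Everything here is an assumption for illustration, not the authors' code: the prompt wording, the function names (`build_guided_prompt`, `lcs_length`, `overlap_f1`), and the choice of a simple LCS-based F1 as a stand-in for whatever statistical overlap measure the paper uses. The model call itself is left out; the completion is passed in as a plain string.

```python
from __future__ import annotations


def build_guided_prompt(dataset: str, partition: str, first_piece: str) -> str:
    """Hypothetical guided instruction that names the dataset and partition."""
    return (
        f"You are given the first piece of an instance from the {partition} "
        f"split of the {dataset} dataset. Finish the second piece exactly as "
        f"it appears in the dataset.\n\n"
        f"First piece: {first_piece}\nSecond piece:"
    )


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def overlap_f1(completion: str, reference: str) -> float:
    """LCS-based F1 over whitespace tokens (an assumed stand-in overlap measure)."""
    c, r = completion.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```

In use, one would send the prompt from `build_guided_prompt` to the model under test and score its completion against the true continuation of the reference instance with `overlap_f1`. A completion identical to the reference scores 1.0, and one that shares no tokens with it scores 0.0; how high a score must be to count as evidence of contamination is a threshold this sketch does not fix.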
