CSSI-MLFL joint seminar: Dallas Card (UMichigan), "Archival perspective on pretraining data for large language models"

Location
CS 150/151 and Zoom
Date
Thursday, April 18, 2024, 12pm-1pm
Hello CSSI folks -- this week we're pleased to announce a CSS-relevant talk on the culture, power, and language ideology behind large language models, happening this Thursday with the CICS MLFL series. (CSSI will not hold its Friday seminar this week.)

Dallas Card, an assistant professor in the School of Information at the University of Michigan, will be speaking virtually; we will gather in person for the Machine Learning and Friends Lunch, where lunch will be served. A full bio appears below.

If you would like to have a research discussion with Professor Card, please add your contact information here. Feel free to email cics.mlfl@gmail.com with any questions or comments.


who: Dallas Card (https://dallascard.github.io/)

when: Thursday, 4/18/2024, 12pm-1pm

where: CS 150/151, Zoom

food: Pizza and drinks


Title: An archival perspective on pretraining data for large language models


Abstract: 

Large language models (LLMs) have transformed the field of NLP and are already being used for a wide range of applications, including measurement of text in computational social science. In general, however, more attention has been paid to the models themselves than to the data on which they are trained. In this talk, I will discuss recent work investigating widely used pretraining corpora for LLMs, and provide an overview of commonly emphasized issues among dataset creators. I will also discuss how filtering for textual quality can lead to unintentionally privileging certain voices (representing a kind of implicit language ideology), and what lessons we can draw from the field of archival studies. Finally, I will conclude with some work in progress assessing the Corpus of Founding Era American English, and its use in contemporary legal debates. Optional pre-read: https://dallascard.github.io/resources/patterns-2024.pdf


Bio:

Dallas Card is an Assistant Professor in the School of Information at the University of Michigan, where his research focuses on making machine learning more reliable and responsible, and on using machine learning and natural language processing to learn about society from text. His work received a best short paper nomination at ACL 2019, a distinguished paper award at FAccT 2022, and has been covered by The Washington Post, Vox, Wired, and other outlets. Prior to starting at Michigan, Dallas was a postdoctoral researcher with the Stanford Natural Language Processing Group and the Stanford Data Science Institute. He holds a Ph.D. in Machine Learning from Carnegie Mellon University.