Language Models and Language Data
Cornell Course
Course Overview
In this course, you will analyze how large language models are constructed from diverse text sources and examine the entire model life cycle, from pretraining data collection to generating meaningful outputs. You'll explore how choices about data type, genre, and tokenization affect a model's performance, and you'll learn to compare real-world corpora such as Wikipedia, Reddit, and GitHub.
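To give a flavor of what comparing corpora looks like in practice, here is a minimal Python sketch (not course material; the three tiny sample strings are invented stand-ins for corpora like Wikipedia, Reddit, and GitHub) that computes a few simple characteristics one might use to contrast them:

```python
def text_stats(name, text):
    """Report a few simple characteristics used to compare corpora:
    token count, vocabulary size, type-token ratio, mean token length."""
    tokens = text.split()
    types = set(tokens)
    print(f"{name:10s}  tokens={len(tokens):3d}  vocab={len(types):3d}  "
          f"type/token={len(types) / len(tokens):.2f}  "
          f"mean_len={sum(map(len, tokens)) / len(tokens):.1f}")

# Tiny stand-ins for real corpora; the course works with far larger samples.
samples = {
    "wiki-like": "The Treaty of Paris was signed in 1783 , ending the war .",
    "chat-like": "lol yeah that makes sense tbh gonna try it later",
    "code-like": "def add ( a , b ) : return a + b",
}
for name, text in samples.items():
    text_stats(name, text)
```

Even on toy samples like these, statistics of this kind begin to separate encyclopedic prose, informal chat, and source code.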
Through hands-on projects, you will design tokenizers, quantify text characteristics, and apply methods like byte-pair encoding to see how different preprocessing strategies shape model capabilities. You'll also investigate how models interpret context by studying keyword-in-context (KWIC) views and embedding-based analysis.
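To make the byte-pair encoding idea concrete, here is a self-contained Python sketch; the toy corpus and merge count are invented purely for illustration. BPE repeatedly merges the most frequent adjacent pair of symbols, which is how subword vocabularies are built up from characters:

```python
from collections import Counter

def bpe_merges(corpus, num_merges):
    """Learn byte-pair-encoding merges from a toy corpus.

    Each word starts as a sequence of characters; on every step the
    most frequent adjacent symbol pair is merged into one symbol.
    """
    # Represent each word as a tuple of symbols, weighted by frequency.
    vocab = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the weighted vocabulary.
        pairs = Counter()
        for word, freq in vocab.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with a merged symbol.
        merged = best[0] + best[1]
        new_vocab = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges, vocab

# Toy corpus: repetition makes the frequent pairs easy to see.
corpus = ["low", "low", "lower", "newest", "newest", "widest"]
merges, vocab = bpe_merges(corpus, num_merges=5)
print(merges)  # e.g. [('l', 'o'), ...]: frequent fragments merge first
print(vocab)
```

Swapping in a different corpus changes which merges are learned, which is exactly the data-dependence the course examines.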
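And here is an equally small keyword-in-context sketch (the sample text, tokenizer regex, and window size are likewise illustrative choices): it prints every occurrence of a keyword with a few tokens of surrounding context, the classic KWIC view:

```python
import re

def kwic(text, keyword, window=4):
    """Print keyword-in-context lines: each occurrence of `keyword`
    with up to `window` tokens of context on either side."""
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    for i, tok in enumerate(tokens):
        if tok == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            print(f"{left:>30} [{tok}] {right}")

# Illustrative text; in practice this runs over real corpora.
text = ("The model predicts the next word. A language model assigns "
        "probabilities to word sequences, and a bigger model often "
        "assigns better probabilities.")
kwic(text, "model")
```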
By the end of this course, you will have a clear understanding of how data selection and processing decisions influence the way LLMs behave, preparing you to evaluate or improve existing models.
You are required to have completed the following courses or have equivalent experience before taking this course:
- LLM Tools, Platforms, and Prompts
- Language Models and Next-Word Prediction
- Fine-Tuning LLMs
Key Course Takeaways
- Summarize the life cycle of a language model, detailing each phase from data collection through inference
- Assess the impact of data collection and curation choices on a model's predictive capabilities and domain coverage
- Analyze pretraining documents to see how an LLM extends prompts within specific, real-world contexts
- Classify and quantify text collections by genre, language, and code to gauge their effect on model behavior
- Explain how a pretraining dataset's composition influences tokenizer coverage and performance across different text domains

Course Author
David Mimno is an Associate Professor and Chair of the Department of Information Science in the Ann S. Bowers College of Computing and Information Science at Cornell University. He holds a Ph.D. from UMass Amherst and was previously the head programmer at the Perseus Project at Tufts as well as a researcher at Princeton University. Professor Mimno’s work has been supported by the Sloan Foundation, the NEH, and the NSF.
Who Should Enroll
- Engineers
- Developers
- Analysts
- Data scientists
- AI engineers
- Entrepreneurs
- Data journalists
- Product managers
- Researchers
- Policymakers
- Legal professionals