Sumanta Basu is an Assistant Professor in the Department of Statistics and Data Science at Cornell University. Broadly, his research interests are structure learning and the prediction of large systems from data, with a particular emphasis on developing learning algorithms for time series data. Professor Basu also collaborates with biological and social scientists on a wide range of problems, including genomics, large-scale metabolomics, and systemic risk monitoring in financial markets. His research is supported by multiple awards from the National Science Foundation and the National Institutes of Health. At Cornell, Professor Basu teaches “Introductory Statistics” for graduate students outside the Statistics Department and “Computational Statistics” for Statistics Ph.D. students. He also serves as a faculty consultant at the Cornell Statistical Consulting Unit, which assists the broader Cornell community with various aspects of analyzing empirical research. Professor Basu received his Ph.D. from the University of Michigan and was a postdoctoral scholar at the University of California, Berkeley, and Lawrence Berkeley National Laboratory. Before he received his Ph.D., Professor Basu was a business analyst, working with large retail companies on the design and data analysis of their promotional campaigns.
Overview and Courses
In today's AI-driven landscape, processing and analyzing textual data is increasingly critical for understanding customer sentiment, market trends, and other key business insights. Natural language processing (NLP) techniques have become essential tools for transforming raw text into meaningful data that underpins many of today’s AI applications.
This certificate program is designed to equip you with foundational skills in NLP, with a focus on text preprocessing, summarization, visualization, and sentiment analysis. In the first course, you will clean and manipulate text data using regular expressions, preprocess complex textual information, and address common challenges in messy datasets. You will also have the opportunity to explore advanced text preprocessing techniques such as stemming and tokenization, which are essential for preparing text for further analysis. In the second course, you will develop the ability to summarize and visualize text distributions across documents, leveraging tools like word clouds and document-term matrices to uncover patterns and trends. Finally, the third course introduces you to sentiment analysis, where you will quantify and interpret emotions in text and compare sentiment across documents and over time.
By the end of this program, you will have the practical knowledge needed to preprocess and analyze textual data, giving you a valuable edge in data science, AI engineering, or any field that requires a deep understanding of textual information.
To succeed in this program, you should have a foundation in R programming. If you do not have this experience, start with the Data Science Essentials certificate program.
You must complete the courses in this certificate program in the order in which they appear.
Course list
With the rapid growth of text data across industries, knowing how to clean and process it is key to extracting valuable insights. This course gives you hands-on experience with text preprocessing, the foundation of any natural language processing (NLP) workflow.
You will start the course by using regular expressions to identify and edit patterns in text before tackling tasks like converting text to lowercase, replacing characters, and removing unwanted elements. As you progress, you will handle more advanced tasks such as tokenizing text into words or n-grams and filtering out irrelevant stop words. Finally, you will clean messy text by standardizing variations and using techniques like stemming.
By the end of the course, you will be equipped to prepare large text datasets for deeper analysis, paving the way for sentiment analysis and other advanced NLP tasks.
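The preprocessing steps described above (pattern-based cleaning, lowercasing, tokenizing into words and n-grams, stop-word removal, and stemming) can be sketched in a few lines. This is a minimal illustration in Python using only the standard library; the course itself works in R, and the stop-word list, the `crude_stem` helper, and the sample review are all hypothetical stand-ins rather than course materials.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "is", "was", "it", "to", "of"}

def clean(text):
    """Lowercase, replace non-letter characters with spaces, squeeze whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def tokenize(text):
    """Split cleaned text into word tokens and drop stop words."""
    return [w for w in clean(text).split() if w not in STOP_WORDS]

def ngrams(tokens, n=2):
    """Return consecutive n-token sequences joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def crude_stem(word):
    """Toy suffix stripper; a real stemmer (such as Porter's) has many more rules."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

review = "The delivery was FAST, and the packaging looked great!!"
tokens = tokenize(review)                 # cleaned word tokens without stop words
bigrams = ngrams(tokens, 2)               # adjacent token pairs
stems = [crude_stem(t) for t in tokens]   # e.g. "looked" becomes "look"
```

Note that the bigrams here are formed after stop-word removal, so pairs such as "delivery fast" skip over the dropped words. Whether to remove stop words before or after building n-grams is one of the preprocessing decisions an analyst has to make.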
- Jun 24, 2026
- Sep 16, 2026
- Dec 9, 2026
- Mar 3, 2027
- May 26, 2027
Summarizing and visualizing text data is a key skill for professionals looking to uncover meaningful insights from large volumes of information. In this course, you will master the tools and techniques to condense and display text data, making complex patterns easier to interpret.
Starting with the tidytext package in R, you will tokenize unstructured text data and convert it into structured data for analysis. You will then summarize word distributions within individual documents and bring them to life with visualizations like word clouds. As you progress, you will explore advanced techniques for summarizing and comparing text across multiple documents, using tools such as document-feature matrices.
By the end of the course, you will have the skills to compare word usage across texts and track how language patterns evolve over time, helping you reveal deeper trends in your data.
You are required to have completed the following course or have equivalent experience before taking this course:
- Mastering NLP Fundamentals
- Jul 8, 2026
- Sep 30, 2026
- Dec 23, 2026
- Mar 17, 2027
- Jun 9, 2027
In today's data-driven world, being able to quantify and analyze sentiment in text is a powerful skill for understanding customer feedback, social media trends, and more. This course gives you the expertise to transform text into meaningful sentiment scores using widely used sentiment lexicons such as AFINN, Bing, and NRC.
You will begin by working with these sentiment analysis tools to categorize and quantify emotional tones in documents. From there, you will calculate and visualize sentiment scores using tools like line plots, bar charts, and word clouds. Finally, you will compare sentiment across multiple documents and track changes over time.
By the end of the course, you will be ready to interpret and act on sentiment trends in real-world applications, offering valuable insights for business strategies, customer relations, and market analysis.
You are required to have completed the following courses or have equivalent experience before taking this course:
- Mastering NLP Fundamentals
- Exploring Summarization and Visualization
- Jul 22, 2026
- Oct 14, 2026
- Jan 6, 2027
- Mar 31, 2027
- Jun 23, 2027
eCornell Online Workshops are live, interactive 3-hour learning experiences led by Cornell faculty experts. These premium short-format sessions focus on AI topics and are designed for busy professionals who want to gain immediately applicable skills and strategic perspectives. Workshops include faculty presentations, breakout discussions, and guided hands-on practice.
The AI Workshops All-Access Pass provides you with unlimited participation for 6 months from your date of purchase. Whether you choose to attend one workshop per month, or several per week, the All-Access Pass will allow you to customize your AI journey and stay on top of the latest AI trends.
Workshops cover a range of cutting-edge AI topics applicable across industries, hosted by Cornell faculty at the forefront of their fields. Whether you are just getting started with AI, seeking to build your AI skillset, or exploring advanced applications of AI, Workshops will provide you with an action-oriented learning experience for immediate application in your career. Sample Workshops include:
- Work Smarter with AI Agents: Individual and Team Effectiveness
- Leading AI Transformation: Bigger Than You Imagine, Harder Than You Expect
- Using AI at Work: Practical Choices and Better Results
- Search & Discoverability in the Era of AI
- Don't Just Prompt AI - Govern it
- AI-Powered Product Manager
- Leverage AI and Human Connection to Lead through Uncertainty
Faculty Authors
Sreyoshi Das designs and offers courses on the applications of statistics and data science in the industry, with specific emphasis in the areas of economics and finance. Her courses aim to integrate academic training with hands-on work experience.
Before joining Cornell in 2022, Professor Das worked in economic consulting, where she developed a variety of quantitative and qualitative analyses to support testifying experts, client attorneys, government agencies, and corporations. In 2017, Professor Das received her Ph.D. in Economics from the University of Michigan, where she conducted research on banking and systemic risk, financial markets in emerging economies, and behavioral macroeconomics.
Key Course Takeaways
- Clean and preprocess the textual data contained within a set of documents in preparation for sentiment analysis
- Summarize and visualize the distribution of words within a single document (univariate) and across multiple documents (multivariate)
- Compare word distributions across documents and over time
- Use three different sentiment analysis lexicons (AFINN, Bing, and NRC) to quantify and interpret sentiments associated with words, sentences, and paragraphs
- Compare sentiments across documents and over time


What You'll Earn
- Text Analysis Certificate from Cornell’s Ann S. Bowers College of Computing and Information Science
- 48 Professional Development Hours (4.8 CEUs)
Who Should Enroll
- Data scientists
- Computer scientists
- Analysts
- User behavior and UX teams
- Researchers
- Social scientists
Frequently Asked Questions
Text shows up everywhere in modern work, from customer reviews and support tickets to survey comments and internal feedback, but it is hard to use at scale without a disciplined workflow. Cornell’s Text Analysis Certificate helps you turn unstructured language into structured, analyzable data so you can surface patterns, quantify sentiment, and communicate findings clearly.
In this certificate program, authored by faculty from the Cornell Bowers College of Computing and Information Science, you will build core NLP capabilities that translate directly to real analysis work: cleaning and standardizing messy text, tokenizing and creating n-grams, summarizing language within and across documents, visualizing results with tools like word clouds and bigram networks, and converting text into numeric sentiment measures using widely used lexicons.
You will practice through applied, graded projects with facilitator feedback, using realistic datasets and R-based workflows that mirror how text analysis is done in data science and analytics teams.
If you want practical NLP foundations, job-ready text summarization and sentiment skills, and a structured learning experience with expert guidance, you should choose Cornell’s Text Analysis Certificate.
Many online text analytics courses stop at watching videos or running canned notebooks. Cornell’s Text Analysis Certificate is built to help you develop a repeatable workflow you can use on real, messy language data, with applied assignments that require you to make practical choices about preprocessing, feature creation, visualization, and sentiment measurement.
The learning experience is also more human and more interactive than a typical self-directed course library. You learn alongside a small cohort of professionals and receive expert facilitation throughout the program, including structured discussions and personalized feedback on your project work, so you can troubleshoot issues and strengthen your analysis rather than guessing alone.
Just as important, the Text Analysis Certificate curriculum is faculty-authored and emphasizes the fundamentals that make text analysis reliable in practice, such as regular-expression pattern matching, corpus and token design, stop-word and n-gram decisions, TF-IDF (Term Frequency-Inverse Document Frequency) for distinguishing language across documents, and careful interpretation of lexicon-based sentiment when context and negation matter.
Enrolling in Cornell’s Text Analysis Certificate also provides you with a 6-month All-Access Pass to eCornell's live online AI Workshops, interactive sessions led by world-class Cornell faculty that combine Ivy League insight with practical applications for busy professionals. Each 3-hour Workshop features structured instruction, guided practice, and real tools to build competitive AI capabilities, plus the opportunity to connect with a global cohort of growth-oriented peers. While AI Workshops are not required, they enhance certificate programs through:
- Integrating AI perspectives across most curricula
- Responding to emerging AI developments and trends
- Offering direct engagement with Cornell faculty at the forefront of AI research
Cornell’s Text Analysis Certificate is a strong fit if you analyze or need to make decisions from text and you want an NLP foundation you can apply immediately in your role.
The Text Analysis Certificate is designed for:
- Data scientists and analysts who want a structured workflow for turning documents into features, summaries, and sentiment measures
- Computer science and technical professionals who want practical, R-based text processing and exploration skills
- UX, user behavior, and insights teams working with open-ended feedback, reviews, and qualitative responses
- Researchers and social scientists who need repeatable methods to summarize language patterns across sources and time
To be ready to move quickly, you should be comfortable working in R, since the hands-on exercises and projects use R for cleaning, tokenization, visualization, and sentiment analysis.
Across Cornell’s Text Analysis Certificate, your projects are designed to mirror real text analytics work: You take raw language data, make defensible preprocessing decisions, and produce summaries and sentiment outputs you can explain.
Examples of project work you will complete include:
- Finding and editing patterns in raw text using regular expressions, then documenting your approach
- Preprocessing a text corpus to reduce vocabulary through tokenization choices, stop-word removal, and n-grams
- Addressing messy text issues by standardizing variants, detecting likely typos with approximate matching, and applying stemming or lemmatization
- Cleaning and tokenizing a reviews dataset and submitting an R Markdown analysis that includes stemming and lemmatization
- Summarizing and visualizing a long-form text by creating frequency tables, word clouds, and bigram network visualizations
- Comparing language across multiple documents using matrix-based representations and TF-IDF to identify distinctive terms
- Scoring sentiment with multiple lexicons, visualizing results, and comparing sentiment across documents, groups, or time
By the end of Cornell’s Text Analysis Certificate program, you will have multiple completed analyses and code-based submissions that demonstrate an end-to-end approach to text preprocessing, summarization, visualization, and sentiment measurement.
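To make the TF-IDF idea in the project list concrete, here is a minimal sketch in Python (the program itself uses R). It assumes one common formulation: term frequency is a term's share of the words in a document, and inverse document frequency is the natural log of the number of documents divided by the number of documents containing the term. The function name `tf_idf` and the three sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    tf  = term count / total terms in the document
    idf = ln(number of documents / number of documents containing the term)
    """
    counts = [Counter(doc.lower().split()) for doc in docs]
    n_docs = len(docs)
    # Each document contributes at most 1 per term, giving document frequency.
    doc_freq = Counter(term for c in counts for term in c)
    weights = []
    for c in counts:
        total = sum(c.values())
        weights.append({
            term: (n / total) * math.log(n_docs / doc_freq[term])
            for term, n in c.items()
        })
    return weights

# Hypothetical mini-corpus of three short reviews.
docs = [
    "battery life is great",
    "battery drains fast",
    "screen is great",
]
scores = tf_idf(docs)
most_distinctive = [max(s, key=s.get) for s in (scores[0], scores[2])]
```

Words that appear in several documents ("battery", "is", "great") receive lower weights than words unique to one document ("life", "screen"), which is why TF-IDF is useful for surfacing distinctive terms.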
Cornell’s Text Analysis Certificate equips you to turn unstructured text into defensible insights you can use to inform decisions, measure sentiment, and communicate patterns to stakeholders.
After completing the Text Analysis Certificate, you will be prepared to:
- Clean and preprocess the textual data contained within a set of documents in preparation for sentiment analysis
- Summarize and visualize the distribution of words within a single document (univariate) and across multiple documents (multivariate)
- Compare word distributions across documents and over time
- Use three different sentiment analysis lexicons (AFINN, Bing, and NRC) to quantify and interpret sentiments associated with words, sentences, and paragraphs
- Compare sentiments across documents and over time
Learners can expect to come away with practical, job-relevant capabilities such as cleaning and preparing messy text data for analysis, building repeatable pipelines for tokenization, normalization, and feature creation, applying common NLP techniques like sentiment analysis, representing documents numerically with approaches such as TF-IDF, and translating outputs into business-ready summaries and recommendations. Overall, the program is designed to help you move beyond manual reading and ad hoc keyword searches and toward methods you can apply to large volumes of language in sources like customer feedback, support tickets, survey comments, and social media.
What truly sets eCornell apart is how our programs unlock genuine career transformation. Learners earn promotions to senior positions, enjoy meaningful salary growth, build valuable professional networks, and navigate successful career transitions.
Cornell’s Text Analysis Certificate, which consists of 3 short courses, is designed to be completed in 2 months. Each course runs for 2 weeks, with a typical weekly time commitment of 6 to 8 hours.
Most of your work is asynchronous, so you can complete readings, videos, coding exercises, and project work on your own schedule within each course’s weekly rhythm. At the same time, you are not learning in isolation. Facilitator-led discussions and scheduled interactive elements provide structure, accountability, and support while you build your text preprocessing, visualization, and sentiment analysis skills in R.
Cornell’s Text Analysis Certificate is built around practical, job-relevant capabilities that learners typically look for when they want to turn unstructured text into clear insights and action.
Learners can expect to come away with skills such as:
- Cleaning and preparing messy text data for analysis
- Building pipelines for tokenization, normalization, and feature creation
- Applying common NLP techniques like sentiment analysis and topic discovery
- Using vectorization approaches (for example, TF-IDF) to represent documents numerically
- Interpreting model outputs to explain patterns in customer, employee, or market language
- Translating findings into business-ready summaries, dashboards, and recommendations
- Working through realistic examples using modern analytics tooling and workflows
Overall, the Text Analysis Certificate is designed to help learners move beyond manual reading and ad hoc keyword searches, and instead build repeatable methods for extracting meaning from large volumes of text in domains like customer feedback, support tickets, survey comments, and social media.
A working foundation in R will help you get the most from Cornell’s Text Analysis Certificate, since the learning activities involve writing and debugging code for text cleaning, tokenization, visualization, and sentiment scoring. If you are not yet comfortable in R, eCornell recommends starting with a preparatory program that builds core data science essentials before tackling this certificate.
Within the Text Analysis Certificate program, you will use established R tooling for text work, including workflows for regular expressions and string manipulation, corpus and token structures, tidy text processing, visualization, and lexicon-based sentiment analysis. You will also have access to an in-exercise “Coding Coach” that can explain common error messages while you work through coding activities.
Sentiment analysis is a central outcome in Cornell’s Text Analysis Certificate, and you will learn how to translate language into numeric or categorical measures you can compare across documents and over time.
You will work with three widely used sentiment lexicons and learn when each is most useful:
- Bing, which classifies words as positive or negative
- AFINN, which assigns numeric intensity scores (negative to positive)
- NRC, which maps words to multiple emotion categories in addition to polarity
Beyond running lookups, you will practice aggregating word-level sentiment into document-level scores, visualizing distributions, and interpreting results carefully, including how context and negation can affect what lexicon-based sentiment captures.
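The difference between a polarity lexicon and an intensity lexicon, and the effect of negation on a word-by-word lookup, can be illustrated with a small sketch. This is Python rather than the R used in the program, and the two mini-lexicons below are made up for illustration: they mimic the shape of Bing-style labels and AFINN-style scores but are not the real entries. The one-word negation rule is likewise a simplifying assumption.

```python
# Hypothetical mini-lexicons. The words and scores are invented for
# illustration and are NOT the real Bing or AFINN entries.
POLARITY = {"great": "positive", "love": "positive",
            "slow": "negative", "broken": "negative"}
INTENSITY = {"great": 3, "love": 3, "slow": -2, "broken": -3}
NEGATORS = {"not", "never", "no"}

def polarity_counts(tokens):
    """Count positive and negative words (polarity-style lookup)."""
    counts = {"positive": 0, "negative": 0}
    for t in tokens:
        label = POLARITY.get(t)
        if label:
            counts[label] += 1
    return counts

def intensity_score(tokens, handle_negation=False):
    """Sum numeric word scores into a single document-level score.

    With handle_negation=True, a word directly preceded by a negator
    has its sign flipped (a deliberately simple rule).
    """
    score = 0
    for i, t in enumerate(tokens):
        value = INTENSITY.get(t, 0)
        if handle_negation and i > 0 and tokens[i - 1] in NEGATORS:
            value = -value
        score += value
    return score

tokens = "the app is not great and checkout is slow".split()
counts = polarity_counts(tokens)                          # one positive, one negative
naive = intensity_score(tokens)                           # ignores "not"
adjusted = intensity_score(tokens, handle_negation=True)  # flips "great"
```

The naive score treats "not great" as praise, while the negation-aware score does not. That gap is the kind of interpretive caution lexicon-based sentiment requires.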
Real text rarely arrives clean. Cornell’s Text Analysis Certificate trains you to handle the issues that show up in customer feedback, reviews, surveys, and open-ended responses so your downstream analysis is more accurate and easier to explain.
You will learn how to:
- Find and repair patterns in text using regular expressions
- Standardize casing, whitespace, punctuation, and other inconsistencies
- Tokenize text into words and n-grams and make informed stop-word decisions
- Detect and manage common misspellings and near-matches using edit-distance approaches
- Reduce vocabulary noise with stemming or lemmatization, and map informal language to meaningful categories with custom dictionaries
To ensure the best learning experience, Cornell’s Text Analysis Certificate program provides carefully selected datasets and materials that build a strong foundation. By practicing these techniques in code-based exercises and projects, you build a preprocessing workflow you can reuse when new text sources and edge cases appear in your own industry data.
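As an illustration of the edit-distance approach to near-matches mentioned above, the following Python sketch computes Levenshtein distance and uses it to map a misspelled token to the closest known term. The program works in R; the vocabulary, the `closest_match` helper, and the distance threshold of 2 are hypothetical choices made for this example.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def closest_match(word, vocabulary, max_distance=2):
    """Return the nearest vocabulary word within max_distance, else None."""
    best = min(vocabulary, key=lambda v: edit_distance(word, v))
    return best if edit_distance(word, best) <= max_distance else None

VOCAB = ["delivery", "refund", "shipping"]  # hypothetical known terms
fixed = closest_match("shiping", VOCAB)     # one missing letter
unknown = closest_match("zzzzzz", VOCAB)    # nothing close enough
```

The threshold matters: set it too high and unrelated words get merged, too low and genuine typos slip through, so it is worth checking a sample of matches by hand.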


Text Analysis
| Cost |
|---|
| $3,750 |


