Sumanta Basu is an Assistant Professor in the Department of Statistics and Data Science at Cornell University. Broadly, his research interests are structure learning and the prediction of large systems from data, with a particular emphasis on developing learning algorithms for time series data. Professor Basu also collaborates with biological and social scientists on a wide range of problems, including genomics, large-scale metabolomics, and systemic risk monitoring in financial markets. His research is supported by multiple awards from the National Science Foundation and the National Institutes of Health. At Cornell, Professor Basu teaches “Introductory Statistics” for graduate students outside the Statistics Department and “Computational Statistics” for Statistics Ph.D. students. He also serves as a faculty consultant at the Cornell Statistical Consulting Unit, which assists the broader Cornell community with various aspects of analyzing empirical research. Professor Basu received his Ph.D. from the University of Michigan and was a postdoctoral scholar at the University of California, Berkeley, and Lawrence Berkeley National Laboratory. Before he received his Ph.D., Professor Basu was a business analyst, working with large retail companies on the design and data analysis of their promotional campaigns.
Data Science for Machine Learning
Cornell Certificate Program
Overview and Courses
In today's data-driven world, advanced data modeling techniques are essential for enabling informed decision making and strategic planning.
This certificate program is designed to help you understand predictive modeling, with a focus on making accurate predictions using various types of data. Throughout this program, you will explore models such as polynomial regression, splines, and generalized additive models. These models are used to analyze complex relationships within datasets that may include both numerical and categorical variables. You will also gain practical skills in building models using R, which will allow you to examine how different types of information can be combined to make predictions. You will have the opportunity to practice modeling interactions between different types of data, such as categories and numbers, and use decision trees to understand complex relationships that linear models are unable to capture. By the end of the program, you will be able to create and evaluate predictive models, equipping you with valuable skills for decision making in a variety of industries.
To be successful in this certificate program, you should have a foundation in R programming and be able to use those skills to create and summarize datasets, build visualizations, interpret data, run simulations, fit linear regression models, and clean data. Experience with R is critical to success because the program does not explicitly teach how to use R. High school or college-level math and algebra are also recommended. If you do not have this experience, start with the Data Science Essentials certificate program.
The courses in this certificate program must be completed in the order in which they appear.
Course list
- Jun 10, 2026
- Sep 2, 2026
- Nov 25, 2026
- Feb 17, 2027
- May 12, 2027
In this course, you will explore strategies for incorporating categorical predictors in a regression model, including using dummy variables to represent different categories. You will inspect binary and nonbinary categorical variables and discover how to interpret the estimated coefficients of dummy variables.
As you progress through the course, you will practice modeling and interpreting interactions between categorical and quantitative predictors in a linear model. Finally, you will focus on defining and implementing decision trees, which are advantageous for capturing complex interactions between predictors that linear models may be unable to capture. By the end of the course, you will be equipped to transform categorical variables into numerical variables, fit regression models with categorical predictors, interpret dummy variable coefficients, and use decision trees for modeling complex relationships between predictors.
You are required to have completed the following courses or have equivalent experience before taking this course:
- Nonlinear Regression Models
- Jun 24, 2026
- Sep 16, 2026
- Dec 9, 2026
- Mar 3, 2027
- May 26, 2027
The goal of this course is to introduce you to the fundamental concepts and techniques used in predictive modeling. Throughout this course, you will evaluate the balance between model flexibility and interpretability, examine how to select the best parameters using cross-validation, and practice building models that generalize well to new data. You will also explore techniques for splitting datasets, selecting tuning parameters, and fitting models using loss functions. By the end of the course, you will have a solid understanding of model flexibility, interpretability, and the bias-variance trade-off, equipping you to effectively build and evaluate predictive models.
You are required to have completed the following courses or have equivalent experience before taking this course:
- Nonlinear Regression Models
- Modeling Interactions Between Predictors
- Jul 8, 2026
- Sep 30, 2026
- Dec 23, 2026
- Mar 17, 2027
- Jun 9, 2027
When working with real-world datasets, more than a single model may be required to capture the complexity of the data. Ensemble methods prove to be extremely useful with complex datasets by allowing us to combine simpler models to fully grasp the patterns in the data, thereby improving the predictive power of the models.
In this course, you'll discover how to use two ensemble methods: random forests and boosted decision trees. You'll work with both methods on datasets in R to build robust predictive models. First, you'll improve decision tree performance using random forest models and practice interpreting those models. You'll then apply boosting, which reduces errors by aggregating the predictions of many decision trees.
You are required to have completed the following courses or have equivalent experience before taking this course:
- Nonlinear Regression Models
- Modeling Interactions Between Predictors
- Foundations of Predictive Modeling
- Jul 22, 2026
- Oct 14, 2026
- Jan 6, 2027
- Mar 31, 2027
- Jun 23, 2027
eCornell Online Workshops are live, interactive 3-hour learning experiences led by Cornell faculty experts. These premium short-format sessions focus on AI topics and are designed for busy professionals who want to gain immediately applicable skills and strategic perspectives. Workshops include faculty presentations, breakout discussions, and guided hands-on practice.
The AI Workshops All-Access Pass provides you with unlimited participation for 6 months from your date of purchase. Whether you choose to attend one workshop per month, or several per week, the All-Access Pass will allow you to customize your AI journey and stay on top of the latest AI trends.
Workshops cover a range of cutting-edge AI topics applicable across industries, hosted by Cornell faculty at the forefront of their fields. Whether you are just getting started with AI, seeking to build your AI skillset, or exploring advanced applications of AI, Workshops will provide you with an action-oriented learning experience for immediate application in your career. Sample Workshops include:
- Work Smarter with AI Agents: Individual and Team Effectiveness
- Leading AI Transformation: Bigger Than You Imagine, Harder Than You Expect
- Using AI at Work: Practical Choices and Better Results
- Search & Discoverability in the Era of AI
- Don't Just Prompt AI - Govern it
- AI-Powered Product Manager
- Leverage AI and Human Connection to Lead through Uncertainty
Faculty Author
Key Course Takeaways
- Select an optimal model based on modeling goals and characteristics of a dataset
- Identify when a nonlinear model is necessary based on data characteristics and how to implement it
- Identify or detect when an interaction between predictors would improve a model
- Improve predictive accuracy by combining different models into an ensemble
What You'll Earn
- Data Science for Machine Learning Certificate from Cornell’s Ann S. Bowers College of Computing and Information Science
- 64 Professional Development Hours (6.4 CEUs)
Who Should Enroll
- Current and aspiring data scientists and analysts
- Business decision makers
- Marketing analysts
- Consultants
- Executives
- Anyone seeking to gain deeper exposure to data science
Frequently Asked Questions
Better predictions require more than “plug-and-play” modeling. In real organizations, you have to recognize nonlinear patterns, account for interactions between variables, and choose model complexity that holds up on new data. Cornell’s Data Science for Machine Learning Certificate helps you build that judgment and the practical R-based workflow to support it.
In this certificate program, authored by faculty from Cornell’s Bowers College of Computing and Information Science, you will learn when linear models fall short, how to model curvature with approaches like polynomial regression, splines, and generalized additive models, and how to capture conditional relationships with interaction terms and tree-based methods. You’ll also practice selecting and tuning models using validation and cross-validation, so your modeling choices are grounded in evidence rather than guesswork.
Because the work is applied throughout, you will repeatedly move from exploratory plots and model setup to interpretation, prediction, and evaluation. If you want stronger predictive modeling judgment, hands-on R practice with modern modeling methods, and a structured learning experience designed for busy professionals, you should choose Cornell’s Data Science for Machine Learning Certificate.
Many online programs teach modeling as a set of disconnected techniques. Cornell’s Data Science for Machine Learning Certificate is designed to help you think like a modeler: diagnose when common assumptions fail, choose the right level of flexibility, and justify decisions using validation.
You learn in a small, supported cohort experience where applied work is central. Across the Data Science for Machine Learning Certificate, you practice building models in R, interpreting what the model is telling you, and testing whether it generalizes. The curriculum also emphasizes realistic modeling challenges that show up in practice, including nonlinear relationships, interactions between predictors, and the trade-off between interpretability and predictive power.
Finally, rather than stopping at a single “best” model, you learn how and why ensembles improve stability and accuracy, and you practice interpreting black-box methods using feature importance measures and partial dependence plots. That blend of rigorous concepts, repeated application, and expert-facilitated learning is what differentiates Cornell’s Data Science for Machine Learning Certificate.
Enrolling in the Data Science for Machine Learning Certificate also provides you with a 6-month All-Access Pass to eCornell's live online AI Workshops, interactive sessions led by world-class Cornell faculty that combine Ivy League insight with practical applications for busy professionals. Each 3-hour Workshop features structured instruction, guided practice, and real tools to build competitive AI capabilities, plus the opportunity to connect with a global cohort of growth-oriented peers. While AI Workshops are not required, they enhance certificate programs through:
- Integrating AI perspectives across most curricula
- Responding to emerging AI developments and trends
- Offering direct engagement with Cornell faculty at the forefront of AI research
Cornell’s Data Science for Machine Learning Certificate is a strong fit if you already work with data and want to become more effective at building and evaluating predictive models in R. The program is designed for current and aspiring data scientists and analysts as well as professionals in decision-making roles who want to better understand and apply modern modeling approaches.
The experience assumes you can already code in R well enough to create and summarize datasets, build visualizations, interpret data, run simulations, fit linear regression models, and clean data. High school or college-level math and algebra are also recommended. Learners who want a more introductory ramp into R and core data science workflows are typically better served by starting with a fundamentals-oriented program before taking on this modeling-focused certificate.
Project work in Cornell’s Data Science for Machine Learning Certificate is designed to mirror how modeling happens on the job: you explore data, build competing models, interpret outputs, and defend your choices using prediction performance.
Examples of the kinds of projects you will complete include:
- Modeling nonlinear trends in real public-health time series data by fitting polynomial regression models, generating predictions, and evaluating where the fit performs poorly
- Improving a nonlinear model by identifying turning points, selecting knot placement, fitting splines, and comparing spline performance to a global polynomial approach
- Building a generalized additive model that combines multiple predictors (including nonlinear terms) to produce forecasts and interpret marginal effects using model output and diagnostic plots
- Creating a predictive model that incorporates categorical variables through dummy coding, then extending it with interaction terms to reflect how one predictor changes the effect of another
- Training a decision tree on a real dataset to capture complex interactions that are difficult to express in linear form, then translating the fitted tree into interpretable decision rules
- Developing ensemble models such as random forests and boosted trees, then interpreting them with feature importance measures and partial dependence plots
Throughout Cornell’s Data Science for Machine Learning Certificate, you submit your work in R Markdown analyses, which helps you practice communicating not only what you built, but why your modeling decisions make sense.
Cornell’s Data Science for Machine Learning Certificate helps you strengthen the modeling judgment and R-based workflow you need to build predictive models that are more accurate, explainable, and defensible in real business and research settings.
After completing the Data Science for Machine Learning Certificate, you will be prepared to:
- Select an optimal model based on modeling goals and characteristics of a dataset
- Identify when a nonlinear model is necessary based on data characteristics and how to implement it
- Identify or detect when an interaction between predictors would improve a model
- Improve predictive accuracy by combining different models into an ensemble
Students commonly report that the program delivers substantial skill growth in a manageable format, with a strong balance of core concepts and hands-on work that fits into a busy schedule. Feedback frequently highlights clearer modeling thinking, practical tools that translate to day-to-day analytics work, and increased confidence due to the program’s well-organized learning flow and steady support. Learners also point to the value of refreshing statistical fundamentals while strengthening technical skills, and many say they would feel comfortable recommending the experience to colleagues.
What truly sets eCornell apart is how our programs unlock genuine career transformation. Learners earn promotions to senior positions, enjoy meaningful salary growth, build valuable professional networks, and navigate successful career transitions.
Cornell’s Data Science for Machine Learning Certificate, which consists of 4 short courses, is designed to be completed in 2 months. Each course runs for 2 weeks, with a typical weekly time commitment of 6 to 8 hours.
Most coursework is asynchronous, so you can complete readings, videos, coding exercises, and project work around your workday. At the same time, the experience stays structured through regular deadlines, graded assignments, and facilitated interaction. Opportunities for live sessions create space to ask questions, discuss modeling decisions, and learn from how other professionals approach similar analytical problems.
Students in Cornell’s Data Science for Machine Learning Certificate often say the program helps them build practical modeling skills quickly, with a strong balance of core concepts and hands-on work that fits into a busy schedule. Learners appreciate that the experience is designed to help them apply statistical thinking through coding and real analysis tasks without feeling overwhelming.
Common themes you will hear include:
- A strong mix of modeling theory, applied statistics, and coding practice in R
- Clear learning flow that makes it easier to build confidence with data science workflows
- Practical tools and exercises that translate directly to day-to-day analytics work
- A fast, efficient way to refresh stats fundamentals while strengthening technical skills
- Well-organized course structure that keeps progress predictable from week to week
- Flexible pacing that works well for full-time professionals
- Short, digestible lessons that make complex topics easier to absorb
- Helpful course resources and facilitator guidance that support steady learning
- A straightforward online experience that is easy to navigate and complete on time
Many students also mention feeling comfortable recommending Cornell’s Data Science for Machine Learning Certificate program to colleagues because it delivers substantial skills growth in a manageable amount of time.
Prior experience with R is important for success in Cornell’s Data Science for Machine Learning Certificate because the program focuses on building and evaluating models, not on teaching R from the ground up. You should feel comfortable using R to create and summarize datasets, produce visualizations, clean data, run simulations, and fit and interpret linear regression models.
If you are newer to R or want a more guided start with core workflows, you will likely get more value by building that foundation first, then returning to Cornell’s Data Science for Machine Learning Certificate when you are ready to move faster through modeling implementation and interpretation.
A major focus of Cornell’s Data Science for Machine Learning Certificate is learning how to select models that generalize well to new data, not just models that fit historical data closely. You will study why overfitting happens, how the bias-variance trade-off affects model performance, and how tuning parameters control a model’s flexibility.
You will then practice practical selection techniques such as training-test splits and k-fold cross-validation to choose tuning parameters and compare candidate models using out-of-sample error. This helps you build a repeatable process for defending model choices in professional analytics and data science work.
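As a rough illustration of that selection process, the sketch below uses k-fold cross-validation to choose a tuning parameter by out-of-sample error. It makes several assumptions that are not part of the program: it is written in plain Python rather than R, the data are simulated, and the model is a simple nearest-neighbor average (chosen only because its single tuning parameter, the number of neighbors, directly controls flexibility). All function and variable names are hypothetical.

```python
import random

def knn_predict(train, x, k):
    """Average the y values of the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def kfold_mse(data, k_neighbors, n_folds=5):
    """Out-of-sample mean squared error, averaged over n_folds folds."""
    fold_errors = []
    for i in range(n_folds):
        # Hold out every n_folds-th point (offset i); train on the rest.
        test = data[i::n_folds]
        train = [p for j, p in enumerate(data) if j % n_folds != i]
        sq = [(y - knn_predict(train, x, k_neighbors)) ** 2 for x, y in test]
        fold_errors.append(sum(sq) / len(sq))
    return sum(fold_errors) / n_folds

random.seed(0)
# Simulated data (an assumption for illustration): curved signal plus noise.
xs = [random.uniform(0, 10) for _ in range(200)]
data = [(x, (x - 5) ** 2 + random.gauss(0, 3)) for x in xs]

# Small k = very flexible (high variance); large k = rigid (high bias).
candidates = [1, 5, 15, 50, 150]
cv_error = {k: kfold_mse(data, k) for k in candidates}
best_k = min(cv_error, key=cv_error.get)

for k in candidates:
    print(f"k={k:3d}  CV MSE={cv_error[k]:.2f}")
print("selected k:", best_k)
```

The most flexible and the most rigid settings should both show higher cross-validated error than the selected value, which is the bias-variance trade-off expressed in numbers.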
Real datasets rarely behave in clean, straight lines, and predictors often influence outcomes differently depending on context. Cornell’s Data Science for Machine Learning Certificate prepares you to model both effects directly.
You will learn how to recognize when nonlinear modeling is warranted and implement approaches such as polynomial regression, splines, and generalized additive models to capture curved relationships. You’ll also practice encoding categorical variables appropriately and adding interaction terms when the effect of one predictor depends on another. When interactions become too complex for linear models to express clearly, you’ll use tree-based methods and then move to ensembles that improve predictive stability and accuracy.
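To make dummy coding and interaction terms concrete, here is a minimal sketch under stated assumptions: it uses plain Python rather than R, the data and the names `spend`-style predictor, `region`, `fit_line`, and `design_row` are all hypothetical, and the first category level is treated as the baseline. In R, a comparable model would typically be written with a formula such as `y ~ x * region`. The sketch relies on the fact that a linear model with a dummy variable and its interaction with a numeric predictor fits a separate line per category, so the dummy coefficient is a difference in intercepts and the interaction coefficient is a difference in slopes.

```python
from statistics import mean

def fit_line(points):
    """Ordinary least squares for y = a + b*x on (x, y) pairs."""
    mx = mean(x for x, _ in points)
    my = mean(y for _, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    slope = sxy / sxx
    return my - slope * mx, slope

def design_row(x, level, levels):
    """Row [1, x, dummies..., x*dummies...]; levels[0] is the baseline."""
    dummies = [1.0 if level == lv else 0.0 for lv in levels[1:]]
    return [1.0, x] + dummies + [x * d for d in dummies]

# Hypothetical data: (numeric predictor, category, response).
rows = [
    (1, "north", 3.1), (2, "north", 5.0), (3, "north", 6.9), (4, "north", 9.2),
    (1, "south", 2.0), (2, "south", 2.4), (3, "south", 3.1), (4, "south", 3.5),
]
levels = ["north", "south"]

# Fit one line per category, then express the fits as dummy-coded coefficients.
lines = {lv: fit_line([(x, y) for x, g, y in rows if g == lv]) for lv in levels}
(a_n, b_n), (a_s, b_s) = lines["north"], lines["south"]
coef = [a_n, b_n, a_s - a_n, b_s - b_n]  # intercept, slope, dummy, interaction

def predict(x, level):
    """Prediction from the dummy-coded model with an interaction term."""
    return sum(c * v for c, v in zip(coef, design_row(x, level, levels)))

print("design row for x=2, south:", design_row(2.0, "south", levels))
print("interaction coefficient (slope difference):", round(coef[3], 3))
print("prediction for x=3, south:", round(predict(3.0, "south"), 3))
```

A negative interaction coefficient here means the response rises more slowly with the numeric predictor in the non-baseline category than in the baseline one.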
| Program | Cost |
|---|---|
| Data Science for Machine Learning | $3,750 |