CRISP-DM with detailed sample projects in Python and R

In this post, I discuss CRISP-DM, the most widely used analytics process model, with sample projects in Python and R.

The sample data science projects in Python and R that accompany this post will be completed over the next few weeks. I encourage you to pay more attention to what is done at each step than to the code itself, so you can replicate it in any programming language you prefer.

CRISP-DM stands for CRoss-Industry Standard Process for Data Mining. It is the most widely-used analytics model and an open standard process model that describes common approaches used by data mining experts.

CRISP-DM has six major phases:

Business Understanding
Data Understanding
Data Preparation
Modeling
Evaluation
Deployment

1. Business Understanding
Any good project starts with a good understanding of the customer’s needs. Business Understanding seeks to establish the following:
I. Determine business goals and objectives: What the customer really wants to accomplish
II. Assess the current situation: Determine resource availability, requirements, risks, etc.
III. Financial implications: Conduct a cost-benefit analysis
IV. Create a project plan: Highlight the stages of the project, duration, resources, inputs, outputs, and dependencies.

This phase is essential and should be thought through carefully at the very beginning of the project.

2. Data Understanding

Next is the Data Understanding phase. Building on the foundation of Business Understanding, it shifts the focus to identifying, collecting, and analyzing the data sets that can help you accomplish the project goals. This phase has the following tasks:
I. Data gathering: Identify, acquire, or gain access to the necessary data source(s)
II. Examine data: Load data into your analysis tool to examine the data, its format, and fields/columns
III. Explore data: Analyze the data statistically, visualize it, and identify relationships among its fields
IV. Verify data quality: To what extent is your data clean or dirty? Identify other quality issues.
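To make the examine, explore, and verify tasks above concrete, here is a minimal sketch using pandas. The toy DataFrame and its column names (age, income, segment) are made up for illustration; in a real project you would load your own source, for example with pd.read_csv.

```python
import pandas as pd

# Hypothetical toy data standing in for a real data source.
df = pd.DataFrame({
    "age": [25, 32, 47, None, 51],
    "income": [30000, 42000, 58000, 61000, None],
    "segment": ["a", "b", "b", "a", "a"],
})

# Examine data: size, column names, and column types.
print(df.shape)
print(df.dtypes)

# Explore data: summary statistics and the relationship between numeric fields.
summary = df.describe()
correlation = df[["age", "income"]].corr()
print(summary)
print(correlation)

# Verify data quality: count missing values per column and duplicated rows.
missing_per_column = df.isna().sum()
duplicate_rows = int(df.duplicated().sum())
print(missing_per_column)
print("duplicate rows:", duplicate_rows)
```

The missing-value and duplicate counts are only a starting point; other quality issues (wrong units, inconsistent categories, outliers) usually need checks specific to your data.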

Find sample project here

3. Data Preparation

In this phase (also referred to as "data munging"), the data is prepared for modeling. It has the following tasks, plus any others the data demands.

I. Fetch or Select data: Determine which data sets will be used
II. Clean data: Here, you handle missing values by either imputing or removing them
III. Engineer features: This involves creating new features, as recommended by subject matter experts or as otherwise needed
IV. Format data: Convert values in the data to a format the machine learning model accepts (e.g. strings/categorical to numeric).
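The four preparation tasks above can be sketched in a few lines of pandas. The data, the column names, and the engineered feature (income_per_year_of_age) are hypothetical and chosen only to show the mechanics; median imputation is one option among several.

```python
import pandas as pd

# Hypothetical toy data with missing values in every column.
df = pd.DataFrame({
    "age": [25.0, 32.0, None, 51.0],
    "income": [30000.0, 42000.0, 58000.0, None],
    "segment": ["a", "b", "b", None],
})

# Select data: keep only the columns that will be used for modeling.
selected = df[["age", "income", "segment"]].copy()

# Clean data: impute numeric gaps with the column median,
# and remove rows whose category is missing.
selected["age"] = selected["age"].fillna(selected["age"].median())
selected["income"] = selected["income"].fillna(selected["income"].median())
selected = selected.dropna(subset=["segment"])

# Engineer features: an illustrative ratio feature (an assumption, not domain advice).
selected["income_per_year_of_age"] = selected["income"] / selected["age"]

# Format data: one-hot encode the categorical column into numeric columns.
prepared = pd.get_dummies(selected, columns=["segment"], dtype=int)
print(prepared)
```

Whether to impute or remove depends on how much data is missing and why; the choice here is arbitrary.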

Sample projects can be found: here and here

4. Modeling
Here you will build, assess and iterate over several models based on different modeling techniques.
I. Select modeling techniques: Determine which algorithms to try (e.g. trees, regression, neural net) and document
II. Generate test design: Depending on your modeling approach, you might need to split the data into training, test, and validation sets
III. Build model: This might just be executing a few lines of code like "model = LinearRegression().fit(X_train, y_train)". Set parameters, describe the resulting models and report on the interpretation of the models
IV. Assess model: Generally, multiple models are competing against each other, and the data scientist needs to interpret the model results based on domain knowledge, the pre-defined success criteria, and the test design.
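As a rough illustration of the test design, build, and assess tasks above, the sketch below trains two candidate models with scikit-learn and compares them on held-out data. The synthetic data, the two candidate algorithms, and mean absolute error as the metric are all assumptions made for the example; it also uses a simple train/test split rather than a separate validation set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption): a noisy linear relationship with two features.
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.5, size=200)

# Generate test design: hold out 20% of the rows for assessment.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Select modeling techniques: the candidate algorithms to try.
candidates = {
    "linear_regression": LinearRegression(),
    "decision_tree": DecisionTreeRegressor(max_depth=4, random_state=42),
}

# Build and assess: fit each candidate, then score it on the held-out data.
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = mean_absolute_error(y_test, model.predict(X_test))

# Lower mean absolute error is better.
best_name = min(scores, key=scores.get)
print(scores)
print("best candidate:", best_name)
```

The metric alone does not settle the choice; as noted above, the results still need to be interpreted against domain knowledge and the pre-defined success criteria.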

Although the CRISP-DM guide suggests to "iterate model building and assessment until you strongly believe that you have found the best model(s)", in practice teams should continue iterating until they find a "good enough" model, proceed through the CRISP-DM lifecycle, then further improve the model in future iterations.

Sample projects one and two here.

5. Evaluation
The Assess model task in the previous phase dealt with factors such as the accuracy and generality of the model. Evaluation looks more broadly at whether the models meet the business objectives. This phase has three tasks:
I. Evaluate results: Do the models meet the business success criteria? Which one(s) should be approved for the business?
II. Review process: Review the work accomplished. Was anything overlooked? Were all steps properly executed? Summarize findings and correct anything if needed
III. Determine next steps: Depending on the results of the assessment and the process review, you now decide how to proceed. Do you finish this project and move on to deployment, initiate further iterations, or set up new projects?
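One simple way to picture the first and third tasks is as a check of each model's error against a threshold agreed with the business. The helper names (approve_models, next_step), the error values, and the threshold below are hypothetical; real success criteria are rarely a single number.

```python
def approve_models(results, max_error):
    """Return the models whose error meets the business threshold, best first."""
    approved = [name for name, error in results.items() if error <= max_error]
    return sorted(approved, key=results.get)


def next_step(approved):
    """Decide how to proceed based on whether any model was approved."""
    return "proceed to deployment" if approved else "iterate further"


# Hypothetical held-out errors from the Modeling phase (not real results).
results = {"model_a": 0.4, "model_b": 2.0, "model_c": 0.1}

approved = approve_models(results, max_error=0.5)
print(approved)
print(next_step(approved))
```

The review-process task has no code equivalent; it is a manual check that nothing was overlooked along the way.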

Simple sample project here

6. Deployment
A model is not useful unless customers can access its results. This final phase has four tasks:

I. Plan deployment: Take your evaluation results and determine a strategy for their deployment. It makes sense to consider the ways and means of deployment during the business understanding phase, because deployment is crucial to the success of the project.
II. Plan monitoring and maintenance: Develop a thorough monitoring and maintenance plan to avoid issues during the operational phase.
III. Produce final report: The project team documents a summary of the project, which might include all of the previous deliverables, and may give a final presentation of the data mining results at a meeting.
IV. Review project: Conduct a project retrospective about what went well and what went wrong, and how to improve in the future.
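Deployment strategies vary widely, but a very small sketch of the first two tasks is to save the trained model so another program can load it, and to define a check that flags when its error drifts too high. The pickle-based approach, the temporary file path, the needs_attention helper, and the threshold are assumptions for illustration, not a recommended production setup.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LinearRegression

# Train a tiny model on made-up data so there is something to deploy.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])
model = LinearRegression().fit(X, y)

# Plan deployment: persist the model to disk (a temporary directory here).
model_path = Path(tempfile.mkdtemp()) / "model.pkl"
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# The serving side loads the saved model and makes predictions.
with open(model_path, "rb") as f:
    deployed = pickle.load(f)
print(deployed.predict(np.array([[5.0]])))


# Plan monitoring: flag the model when its mean absolute error on
# newly labelled data exceeds an agreed threshold (hypothetical helper).
def needs_attention(y_true, y_pred, max_mae):
    mae = float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))
    return mae > max_mae
```

Only load pickle files from sources you trust, since unpickling can execute arbitrary code.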

Want to know how to get your data science journey started? Check this post.

Thank you for reading. Feel free to like and leave a comment.

Comments

Bashua Mubarak
May 06
This is a good read and a detailed explanation
