Summary: Analysis, Financial Insights, and Delivery to the Client
Learn about different methods of delivering the model to the client and how to monitor its effectiveness throughout deployment.
Thoughts: Delivering a predictive model to the client
We have now completed the modeling activities and also created a financial analysis to indicate to the client how they can use the model. While we have completed the essential intellectual contributions that are the data scientist’s responsibility, it is necessary to agree with the client on the form in which all these contributions will be delivered.
A key contribution is the predictive capability embodied in the trained model. Assuming the client can work with the trained model object we created with XGBoost, this model could be saved to disk as we’ve done and sent to the client. Then, the client would be able to use it within their workflow. This pathway to model delivery may require the data scientist to work with engineers in the client’s organization, to deploy the model within the client’s infrastructure.
Alternatively, it may be necessary to express the model as a mathematical equation (for example, using logistic regression coefficients) or a set of if-then statements (as in decision trees or random forests) that the client could use to implement the predictive capability in SQL. While expressing random forests in SQL code is cumbersome because there may be many trees with many levels, there are software packages that will create this representation for you from a trained scikit-learn model (for example, SKompiler).
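To make the idea of expressing a model as an equation in SQL concrete, here is a minimal sketch that turns logistic regression coefficients into a SQL scoring query. The function names, the table and column names (`accounts`, `account_id`), and the intercept and coefficient values are all hypothetical and chosen only for illustration; they are not taken from the case study model. The sketch also assumes the client's SQL dialect provides an `EXP` function.

```python
import math


def logistic_regression_to_sql(intercept, coefficients,
                               id_column="account_id", table="accounts"):
    """Build a SQL query that applies the logistic function to a linear
    combination of columns. Table and column names are hypothetical."""
    terms = [f"({coef!r} * {name})" for name, coef in coefficients.items()]
    linear = " + ".join([repr(intercept)] + terms)
    return (
        f"SELECT {id_column}, "
        f"1.0 / (1.0 + EXP(-({linear}))) AS predicted_probability "
        f"FROM {table}"
    )


def predict_probability(intercept, coefficients, row):
    """Compute the same probability in Python, to cross-check the SQL output."""
    z = intercept + sum(coef * row[name] for name, coef in coefficients.items())
    return 1.0 / (1.0 + math.exp(-z))


# ASSUMPTION: illustrative values only, not fitted coefficients.
example_intercept = -1.5
example_coefficients = {"PAY_1": 0.6, "LIMIT_BAL": -0.25}

sql = logistic_regression_to_sql(example_intercept, example_coefficients)
print(sql)
```

Comparing the output of `predict_probability` with the result of the generated query on a few rows is one way to confirm that the SQL version reproduces the model before handing it over.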
Cloud platforms for model development and deployment: In this course, we used scikit-learn and the XGBoost package to build predictive models locally on our computers. Recently, cloud platforms such as Amazon Web Services (AWS) have made machine learning capabilities available through offerings such as Amazon SageMaker. SageMaker includes a version of XGBoost, which you can use to train models with syntax similar to what we’ve used here. Subtle differences may exist in the implementation of model training between the methods shown in this course and the SageMaker distribution of XGBoost, so you are encouraged to check your work at every step to make sure your results are as intended. For example, fitting an XGBoost model using early stopping may require additional steps in SageMaker to ensure the trained model uses the best iteration for predictions, as opposed to the last iteration reached when training stopped.
Cloud platforms such as AWS are attractive because they may greatly simplify the process of integrating a trained machine learning model into a client’s technical stack, which in many cases may already be built on a cloud platform.
Before using the model to make predictions, the client would need to ensure that the data was prepared in the same way it was for the model building we have done. For example, the removal of samples with values of 0 for all the features and the cleaning of the EDUCATION and MARRIAGE features would have to be done in the same way we demonstrated earlier in this section. Alternatively, there are other possible ways to deliver model predictions, such as an arrangement where the client delivers features to the data scientist and receives the predictions back.
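One way to keep data preparation consistent between model building and deployment is to package the cleaning steps as a single function that both sides run. The following is a minimal sketch in plain Python. The function name `prepare_accounts` and the sample rows are hypothetical, and the specific recoding rules (undocumented EDUCATION codes mapped to 4, MARRIAGE code 0 mapped to 3) are assumptions made for illustration; the actual rules should be copied from the cleaning steps used when the model was built.

```python
def prepare_accounts(rows, feature_names):
    """Apply the cleaning steps from model building to new account data.

    rows: list of dicts, one per account.
    feature_names: keys of the columns used as model features.
    """
    # ASSUMPTION: these recoding rules are illustrative, not confirmed here.
    education_recode = {0: 4, 5: 4, 6: 4}
    marriage_recode = {0: 3}

    prepared = []
    for row in rows:
        # Drop samples where every feature is 0.
        if all(row[name] == 0 for name in feature_names):
            continue
        cleaned = dict(row)  # copy so the caller's data is not modified
        cleaned["EDUCATION"] = education_recode.get(row["EDUCATION"], row["EDUCATION"])
        cleaned["MARRIAGE"] = marriage_recode.get(row["MARRIAGE"], row["MARRIAGE"])
        prepared.append(cleaned)
    return prepared


# Hypothetical example data.
raw_rows = [
    {"ID": "a1", "EDUCATION": 2, "MARRIAGE": 1, "LIMIT_BAL": 20000},
    {"ID": "a2", "EDUCATION": 0, "MARRIAGE": 0, "LIMIT_BAL": 0},
    {"ID": "a3", "EDUCATION": 5, "MARRIAGE": 0, "LIMIT_BAL": 50000},
]
features = ["EDUCATION", "MARRIAGE", "LIMIT_BAL"]
clean_rows = prepare_accounts(raw_rows, features)
print(clean_rows)
```

Delivering a function like this alongside the model reduces the risk that the client's pipeline silently diverges from the preparation the model was trained on.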
Another important consideration for the discussion of deliverables is: what format should the predictions be delivered in? A typical delivery format for predictions from a binary classification model, such as the one we’ve created for the case study, is to rank accounts by their predicted probability of default. The predicted probability should be supplied along with the account ID and whatever other columns the client would like. This way, when the call center is working its way through the list of account holders to whom counseling will be offered, they can contact those at highest risk of default first and proceed to lower-priority account holders as time and resources allow. The client should be informed of which threshold to use for predicted probabilities to achieve the highest net savings. This threshold would represent the stopping point on the list of account holders to contact if it is ranked on the predicted probability of default.
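The ranked delivery format described above can be sketched in a few lines. The function name `build_contact_list`, the account IDs, the probabilities, and the threshold of 0.25 are all hypothetical; in practice the threshold would come from the financial analysis. The sketch also assumes that accounts exactly at the threshold are included in the contact list.

```python
def build_contact_list(accounts, threshold):
    """Rank accounts by predicted probability of default (highest first)
    and return the full ranking plus the portion at or above the threshold."""
    ranked = sorted(
        accounts, key=lambda a: a["predicted_probability"], reverse=True
    )
    # ASSUMPTION: accounts exactly at the threshold are contacted.
    to_contact = [a for a in ranked if a["predicted_probability"] >= threshold]
    return ranked, to_contact


# Hypothetical predictions for four accounts.
predictions = [
    {"account_id": "a1", "predicted_probability": 0.10},
    {"account_id": "a2", "predicted_probability": 0.62},
    {"account_id": "a3", "predicted_probability": 0.31},
    {"account_id": "a4", "predicted_probability": 0.25},
]

ranked, to_contact = build_contact_list(predictions, threshold=0.25)
for account in to_contact:
    print(account["account_id"], account["predicted_probability"])
```

Returning the full ranking as well as the thresholded list lets the call center continue past the stopping point if extra capacity becomes available.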