In the previous tutorial you learned how to use sdk python to interact with a Prevision.io regression use case. For a classification training type it is basically the same, with a few differences. In this tutorial we will address a churn business problem as a classification use-case and we will see how to build an auto-ml classification pipeline with Prevision platform using python SDK.
Please note that you have free trial access on the public cloud insatnce, so if you want to test the multiple range of Machine Learning services (feature engineering, creating, training and deploying machine learning models...) all you have to do is to log in. Then you can try the python sdk to interact with your use cases
In this tutorial you will learn how to use Previon python sdk to build automated machine learning classification pipeline with prevision platform. We will pass through the following steps:
First of all we have to connect to Prevision instance
import previsionio as pio
import pandas as pd
URL = 'https://XXXX.prevision.io'
TOKEN = '''YOUR_MASTER_TOKEN'''
# initialize client workspace
pio.client.init_client(URL, TOKEN)
Change the values of the TOKEN with the generated key and the URL endpoint with the name of your instance in order to continue running this notebook.
import previsionio as pio
import pandas as pd
URL = 'https://XXXX.prevision.io'
TOKEN = '''YOUR_MASTER_TOKEN'''
# initialize client workspace
pio.client.init_client(URL, TOKEN)
For this use case we dispose of a dataset containing a ‘Churn' feature which is a binary feature indicating whether or not a client had chruned. from this dataset we will construct 3 subsets:
If your datasets are already stocked, and you want to retrieve, transform and re-upload them, you have firstly to retrieve them as follows:
# get train, valid test dataset stocked on our workspace
# from its name
train = pio.Dataset.from_name('Telco-Customer-Churn_train')
# or its id
train = pio.Dataset.from_id('5ebaad70a7271000e7b28ea0')
valid = pio.Dataset.from_name('Telco-Customer-Churn_valid')
test = pio.Dataset.from_name('Telco-Customer-Churn_test')
train
and test
are Prevision Dataset objects. In order to load in-memory the data content into pandas DataFrames, use to_pandas()
method as follows:
train = train.to_pandas()
valid = valid.to_pandas()
test = test.to_pandas()
To load the datasets, you can reference a file path directly, or use a pre-read pandas dataframe, to easily create a new Prevision.io dataset on the platform:
# load some data from a CSV file
data_path = './Telco-Customer-Churn_train.csv'
dataset = pio.Dataset.new(name='Telco-Customer-Churn_train', file_name=data_path)
# or use a pandas DataFrame
dataframe = pd.read_csv(data_path)
pio_train = pio.Dataset.new(name='Telco-Customer-Churn_train', dataframe=dataframe)
# same for valid and test datsets
dataframe = pd.read_csv('./Telco-Customer-Churn_valid.csv')
pio_valid = pio.Dataset.new(name='Telco-Customer-Churn_valid', dataframe=dataframe)
dataframe = pd.read_csv('./Telco-Customer-Churn_test.csv')
pio_test = pio.Dataset.new(name='Telco-Customer-Churn_test', dataframe=dataframe)
To launch a new use case, you need to define some configuration parameters, and then simply use the SDK's BaseUsecase
derived methods to have the platform automatically take care of starting everything for you.
Set the columns configurations required to define at least the target column via a previsionio.ColumnConfig
object. In the case of a tabular usecase (Regression / Classification / Multi classification), we instantiate a previsionio.ColumnConfig
object with the following attributes:
target_column
: define the name of the target column, this field is mandatoryid_column
: if the dataset contains an ID column, it does not carry any predictive signal, and we should specify while configuring the dataset for the current usecase via this attribute. Later, each prediction is provided with its corresponding idfold_column
: Optionally we can set the fold_column corresponding to the stratification feature. By default the stratification strategy is based on the target column. Please note that the choice of this option can significantly impact the quality of predictions (About the importance of the stratification, Kohavi concludes that: stratification is generally a better scheme, both in terms of bias and variance, when compared to regular cross-validation. ). This option is not available for a Time Series usecase: (Prevision.io use temporal stratification).weight_column
(optional) Typically, this column contains a linear feature indicating the importance of a given row. The higher the weight, the more important the row is. If not fed, all rows are considered equally important (which is the case in most use cases). (not available for a Time Series usecase)drop_list
(optional): list of columns names to drop#configure columns of datasets
col_config = pio.ColumnConfig(target_column='Churn',
id_column='customerID')
If you want, you can also specify some training parameters, such as which models are used, which transformations are applied, and how the models are optimized. It can be done using previsionio.TrainingConfig
.
This object offers you a range of parameters to set:
name
: the name to assign to your use casedataset
: the corresponding pio.Dataset
holdout dataset to evaluate the generalization errorholdout_dataset
(optional):the corresponding pio.Dataset
training datasetmetric
: the metric that will be used to optimize models fine tuning: we will use pio.metrics.Classification.AUC
for our actual use case (consult this page to get more information about the other metrics that are supported by the platform)column_config
: this parameter takes as value a previsionio.ColumnConfig
object (checkout the first section in this page) it determines some speciale roles (target, id, fold..) of the training dataset features.training_config
: it takes as value a previsionio.TrainingConfig
object (checkout the previous section)For our exemple we will chose the following use case config:
uc_config = pio.TrainingConfig(models=[pio.Model.XGBoost, pio.Model.RandomForest],
features=pio.Feature.Full.drop(pio.Feature.PCA,
pio.Feature.KMeans,
pio.Feature.PolynomialFeatures),
profile=pio.Profile.Quick)
To create the usecase and start your training session, you need to call the fit()
method of one of the SDK's usecase classes including:
previsionio.Regression
: for a regression use case ; derived from previsionio.Supervised
classprevisionio.Classification
: for a classification use case ; derived from previsionio.Supervised
classprevisionio.MultiClassification
: for a multiclassification use case ; derived from previsionio.Supervised
classprevisionio.TimeSeries
: for time series problem ...The class you pick depends on the type of problem your usecase uses: regression, (binary) classification, multiclassification or timeseries; and on whether it uses a simple tabular dataset or images.
For our example we will use previsionio.Classification.fit()
method: it takes as parameters:
previsionio.Model.LinReg, previsionio.Model.LightGBM, previsionio.Model.RandomForest, previsionio.Model.XGBoost, previsionio.Model.ExtraTrees or previsionio.Model.NeuralNet
pio.SimpleModel.DecisionTree
and pio.SimpleModel.LinReg
which are decision tree model and linear regression model.pio.Profile.Quick
object (recommended for first iterations as baseline use cases): this profile enables fast running, however it has lower performancespio.Profile.Normal
object is something in-between (quicq and advanced running profile) to help you investigate an interesting resultpio.Profile.Advanced
class : it runs slower but has better performances (it is usually for optimization steps at the end of your project)For out use case we will opt for the following parameters:
#launch Regression auto ml use case
uc = pio.Classification.fit('churn_from_sdk',
dataset=pio_train,
metric=pio.metrics.Classification.AUC,
holdout_dataset=pio_valid,
column_config=col_config,
training_config=uc_config)
If you shut down your python session and you want to retrieve later your use case, use from_name
or from_id
methods with the wanted use case version:
version = 1
# get usecase from name
uc = pio.Classification.from_name('churn_from_sdk')
#or from id
uc_id = uc.id
uc = pio.Classification.from_if(uc_id)
uc.id
: the usecase iduc.version
: ther version of the use caseuc.name
: the name of use caseuc.type_problem
: contains the corresponding training type problem : it can be ‘regression', ‘classification' or ‘multiclassification'uc.train_dataset
contains the previsionio.Dataset
training datasetuc.metric
: usecase metricuc.data_type
: whether the training dataset type is tabular
, timeseries
or images
uc.drop_list
: list of dropped columnsuc.fe_selected_list
: list of the selected feature engineering for the use caseuc.models_list
the list of the selected models for the use case; the full list is optained with pio.Model.Full
uc.simple_models_list
: list of the selected simple models for the use case : the full list is ['DT', 'LR']
(linear regression and decision tree)uc.training_config
: the corresponding previsionio.TrainingConfig
objectuc.column_config
: the previsionio.ColumnConfig
corresponding to the usecaseuc.model_class
: the class of the models generated by the usecase: it can be previsionio.model.ClassificationModel
, previsionio.model.ClassificationModel
or previsionio.model.MultiClassificationModel
uc.running
: boolean: whether the use case is still running or notuc.score
: the best model scoreuc.shared_users
: list of shared users : each element is a dictionary containing diffrent information of the users that share the use case : ‘usernae', ‘name', ‘email' sharing-date..uc.schema
: a pandas.DataFrame containing features specefications : name, type, options(list of categories for the categorical features), min, max for numerical features.uc.correlation_matrix
: returns a (n x n) pandas.DataFrame
containing pearson coefficients between all the features (with n number of features)Different utilities are provided to interact with the models generated within the use case:
uc.models
: returns the list of the models objects created for the use case. Depending on the training type problem, each element is from previsionio.model.ClassificationModel
, previsionio.model.ClassificationModel
or previsionio.model.MultiClassificationModel
classesuc.best_model
returns previsionio model corresponding to the model with best performances for the use caseuc.best_single
: returns the model with the best predictive performance (the minimal loss)over signle models (excluding Blend and Mean based models)uc.fastest_model
: Returns the model that predicts with the lowest response timeget_model_from_name()
or get_model_from_id()
methods# from the model name
m = uc.get_model_from_name('XGB-4')
#or from its id
model_id = "5f75961f938e4e9f7205c746"
m = uc.get_model_from_id(model_id)
It exists two ways to retrieve the wanted model id: Either from the sdk: you can run the following command
for m in uc.models:
print(m.name, m.id)
Or from the url that is used on the front-front end when you select you model:
1-Select the wanted model from the Models tab
2- Retrieve the id from the url
Prevision Models can be retrieved only from trained use cases (as seen in the previous section). A Model object m has the following attributes:
m.id
: the corresponding identifierm.name
: the name of the modelm.algorithm
: the corresponding family model (‘LR', ‘XGB', ‘LGB', ‘RF', ‘ET' or ‘NN')m.duration
: training timem.durationUnitPrevision
: single prediction timem.score
: score of the modelm.uc_id
: the id of corresponding use casem.uc_version
: the version of the corresponding use casem.startDate
: date-time of the model creationm.endDate
: date-time of the model creationm.hyperparameters
: the dictionary corresponding to the hyper-parameters used to train the modelm.feature_importance
: Returns a dataframe of feature importances for the given model (sorted by descending scores)m.cross_validation
: Get Cross Validation predictions as a pandas.Dataframe formatm.optimal_threshold
: Get the value of threshold probability that optimizes the f1_scoreThe chosen model can be used to make prediction with different ways:
The classification differs from the Regression use case in In order to get the prediction scores you can use predict_proba()
to predict from a pandas dataframe or predict_from_dataset_name()
to predict from a dataset already stocked on your workspace. These two functions allows to get the probabilities od the predictions and not the binary decisions
df = pd.read_csv('/path/to/datafile')
# we will use the best model
m = uc.best_model
# predict_proba to predict from df
df_proba = m.predict_proba(df)
# if your dataset is already stocked
test_data = 'Telco-Customer-Churn_test'
df_poba = m.predict_from_dataset_name(test_data)
If you want to get the decisions (0 or 1) rather than the scores, you can use the predict()
function. The decisions are made by comparing the scores to a threshold.
By default it is set to 0.5 : if score>0.5 => the decision would be 1 otherwise 0
df = pd.read_csv('/path/to/datafile')
# we will use the best model
m = uc.best_model
# predict_proba to predict from df
df_decisions = m.predict(df)
If you want to change your desicion threshold (for exemple to the optimal threshold) you preceede as follows:
# change the value of internal _predict_threshold attribute
# (by default it is set to 0.5)
m._predict_threshold = m.optimal_threshold
# use predict()
df_decisions = m.predict(df)