In the previous tutorial you learned how to use the Python SDK to interact with a Prevision.io regression use case. A classification use case works in much the same way, with a few differences. In this tutorial we will treat a churn business problem as a classification use case and see how to build an AutoML classification pipeline on the Prevision.io platform using the Python SDK.

Please note that free trial access is available on the public cloud instance, so if you want to test the range of machine learning services (feature engineering, creating, training and deploying machine learning models...), all you have to do is log in. You can then use the Python SDK to interact with your use cases.

What You'll Learn

In this tutorial you will learn how to use the Prevision.io Python SDK to build an automated machine learning classification pipeline on the Prevision.io platform. We will go through the following steps:

Connect to the instance:

First of all, we have to connect to the Prevision.io instance:

import previsionio as pio
import pandas as pd

URL = 'https://XXXX.prevision.io'
TOKEN = '''YOUR_MASTER_TOKEN'''

# initialize client workspace
pio.client.init_client(URL, TOKEN)

Replace the value of TOKEN with your generated key and the value of URL with the endpoint of your instance in order to continue running this notebook.
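
If you would rather not paste the master token into the notebook, one option is to read both values from environment variables. The sketch below is only an illustration: the variable names PREVISION_URL and PREVISION_MASTER_TOKEN and the helper load_credentials() are hypothetical names chosen here, not something defined by the SDK.

```python
import os


def load_credentials(env=os.environ):
    """Return (url, token) read from a mapping of environment variables.

    PREVISION_URL and PREVISION_MASTER_TOKEN are hypothetical names
    chosen for this sketch; the SDK itself does not define them.
    """
    url = env.get("PREVISION_URL", "").strip()
    token = env.get("PREVISION_MASTER_TOKEN", "").strip()
    if not url.startswith("https://"):
        raise ValueError("PREVISION_URL must be set to an https:// URL")
    if not token:
        raise ValueError("PREVISION_MASTER_TOKEN must be set")
    # drop a trailing slash so the URL has a single canonical form
    return url.rstrip("/"), token


# Demonstration with an explicit mapping instead of the real environment
URL, TOKEN = load_credentials({
    "PREVISION_URL": "https://XXXX.prevision.io/",
    "PREVISION_MASTER_TOKEN": "YOUR_MASTER_TOKEN",
})

# The values can then be passed to the same call as before:
# pio.client.init_client(URL, TOKEN)
```

Calling load_credentials() with no argument reads the real environment of the Python process.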

For this use case we have a dataset containing a 'Churn' feature, a binary feature indicating whether or not a client has churned. From this dataset we will construct 3 subsets:

  1. The training dataset: Telco-Customer-Churn_train. This dataset is used to train our models and to perform the internal validation, or cross-validation. Depending on the training profile (quick, normal or advanced), this dataset is split into n subsets called folds. The model is evaluated n times: at each iteration one fold is held out for validation, and the model is trained on the other folds and then evaluated on the held-out one.
  2. The holdout dataset: Telco-Customer-Churn_valid. It is used to evaluate the generalisation error, that is, the performance of the model on new data (never seen during training).
  3. The test dataset: Telco-Customer-Churn_test. We will use this dataset to show how to use the SDK to predict on new data.
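
To make the fold mechanism described for the training dataset more concrete, here is a minimal pure-Python sketch of how row indices can be split into n folds. It is an illustration only: the helper k_fold_indices() is a hypothetical name, it does no shuffling or stratification, and the platform's actual splitting strategy is not described in this tutorial.

```python
def k_fold_indices(n_rows, n_folds):
    """Yield (train_idx, valid_idx) pairs, one pair per fold."""
    indices = list(range(n_rows))
    # spread the rows as evenly as possible:
    # the first (n_rows % n_folds) folds get one extra row
    base, extra = divmod(n_rows, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        valid_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, valid_idx
        start += size


folds = list(k_fold_indices(n_rows=10, n_folds=3))
for i, (train_idx, valid_idx) in enumerate(folds):
    print(f"fold {i}: train on {len(train_idx)} rows, validate on rows {valid_idx}")
```

Every row ends up in exactly one validation fold, so each row is scored once by a model that never saw it during training.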

Case 1: The datasets are already stored on the platform:

If your datasets are already stored and you want to retrieve, transform and re-upload them, you first have to retrieve them as follows:

# get the train, valid and test datasets stored in our workspace
# from its name
train = pio.Dataset.from_name('Telco-Customer-Churn_train')
# or its id
train = pio.Dataset.from_id('5ebaad70a7271000e7b28ea0')

valid = pio.Dataset.from_name('Telco-Customer-Churn_valid')
test = pio.Dataset.from_name('Telco-Customer-Churn_test')

train, valid and test are Prevision Dataset objects. To load their content in memory as pandas DataFrames, use the to_pandas() method as follows:

train = train.to_pandas()
valid = valid.to_pandas()
test = test.to_pandas()

Case 2: Load new datasets into your workspace:

To load the datasets, you can reference a file path directly, or use a pre-read pandas dataframe, to easily create a new Prevision.io dataset on the platform:

# load some data from a CSV file
data_path = './Telco-Customer-Churn_train.csv'
dataset = pio.Dataset.new(name='Telco-Customer-Churn_train', file_name=data_path)

# or use a pandas DataFrame
dataframe = pd.read_csv(data_path)
pio_train = pio.Dataset.new(name='Telco-Customer-Churn_train', dataframe=dataframe)

# same for the valid and test datasets
dataframe = pd.read_csv('./Telco-Customer-Churn_valid.csv')
pio_valid = pio.Dataset.new(name='Telco-Customer-Churn_valid', dataframe=dataframe)
dataframe = pd.read_csv('./Telco-Customer-Churn_test.csv')
pio_test = pio.Dataset.new(name='Telco-Customer-Churn_test', dataframe=dataframe)

To launch a new use case, you need to define some configuration parameters, and then simply use the SDK's BaseUsecase derived methods to have the platform automatically take care of starting everything for you.

Column Configuration: (also covered in the other tutorials)

The column configuration must define at least the target column, via a previsionio.ColumnConfig object. For a tabular use case (regression, classification or multi-classification), we instantiate a previsionio.ColumnConfig object with the following attributes:

#configure columns of datasets
col_config = pio.ColumnConfig(target_column='Churn', 
                              id_column='customerID')
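
Before uploading a file, it can be worth checking that the columns named in the configuration really exist. The sketch below is an assumption-laden illustration: check_columns() is a hypothetical helper written with the standard csv module, and the three rows are made-up stand-ins for the real CSV file.

```python
import csv
import io

# Three made-up rows standing in for Telco-Customer-Churn_train.csv
sample_csv = io.StringIO(
    "customerID,tenure,Churn\n"
    "0001-A,12,No\n"
    "0002-B,3,Yes\n"
    "0003-C,40,No\n"
)


def check_columns(file_obj, target_column, id_column):
    """Return the set of target values after basic sanity checks."""
    reader = csv.DictReader(file_obj)
    # both configured columns must appear in the header
    missing = {target_column, id_column} - set(reader.fieldnames or [])
    if missing:
        raise KeyError(f"missing columns: {sorted(missing)}")
    rows = list(reader)
    # an id column is only useful if its values are unique
    ids = [row[id_column] for row in rows]
    if len(ids) != len(set(ids)):
        raise ValueError(f"{id_column} contains duplicate values")
    return {row[target_column] for row in rows}


classes = check_columns(sample_csv, target_column="Churn", id_column="customerID")
print(classes)
```

For a binary classification use case, the returned set should contain exactly two distinct values.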

Use case Configuration: (also covered in the other tutorials)

If you want, you can also specify some training parameters, such as which models are used, which transformations are applied, and how the models are optimized. It can be done using previsionio.TrainingConfig.

This object offers you a range of parameters to set:

For our example we will choose the following use case configuration:

uc_config = pio.TrainingConfig(models=[pio.Model.XGBoost, pio.Model.RandomForest],
                               features=pio.Feature.Full.drop(pio.Feature.PCA, 
                                                              pio.Feature.KMeans,
                                                              pio.Feature.PolynomialFeatures),
                               profile=pio.Profile.Quick)

To create the usecase and start your training session, you need to call the fit() method of one of the SDK's usecase classes including:

The class you pick depends on the type of problem your usecase uses: regression, (binary) classification, multiclassification or timeseries; and on whether it uses a simple tabular dataset or images.

For our example we will use the previsionio.Classification.fit() method. It takes the following parameters:

For our use case we will opt for the following parameters:

# launch the classification AutoML use case
uc = pio.Classification.fit('churn_from_sdk',
                            dataset=pio_train,
                            metric=pio.metrics.Classification.AUC,
                            holdout_dataset=pio_valid,
                            column_config=col_config,
                            training_config=uc_config)
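
The fit() call above asks the platform to optimize AUC. As a reminder of what this metric measures, here is a small standalone sketch that computes it as the fraction of (positive, negative) pairs in which the positive example gets the higher score. The function roc_auc() and the toy labels and scores are illustrative assumptions, not the platform's implementation.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(positives) * len(negatives))


# toy example: 1 = churned, 0 = did not churn
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, scores)
print(auc)  # 0.75: 3 of the 4 positive/negative pairs are ranked correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why the metric does not depend on any decision threshold.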

If you shut down your Python session and want to retrieve your use case later, use the from_name or from_id methods with the wanted use case version:

version = 1
# get usecase from name
uc = pio.Classification.from_name('churn_from_sdk')

# or from its id
uc_id = uc.id
uc = pio.Classification.from_id(uc_id)

Attributes and properties:

Get the use case models:

Different utilities are provided to interact with the models generated within the use case:

  1. Get the list of the models: uc.models returns the list of the model objects created for the use case. Depending on the training type, each element is an instance of the previsionio.model.RegressionModel, previsionio.model.ClassificationModel or previsionio.model.MultiClassificationModel class
  2. Get special models:
  3. Get a specific model: You can get a specific model (not necessarily a special one such as the best or the fastest), using the get_model_from_name() or get_model_from_id() methods
# from the model name
m = uc.get_model_from_name('XGB-4')

#or from its id
model_id = "5f75961f938e4e9f7205c746"
m = uc.get_model_from_id(model_id)

There are two ways to retrieve the wanted model id. Either from the SDK, by running the following command:

for m in uc.models:
    print(m.name, m.id)

Or from the URL shown in the front end when you select your model:

1- Select the wanted model from the Models tab

2- Retrieve the id from the URL

Prevision Models can be retrieved only from trained use cases (as seen in the previous section). A Model object m has the following attributes:

Make predictions:

The chosen model can be used to make predictions in different ways:

  1. Predict a pandas.DataFrame composed of new data to predict
  2. Predict a dataset already stored in your workspace
  3. Get Cross Validation predictions

Get the prediction scores:

Classification differs from the regression use case here: to get the prediction scores you can use predict_proba() to predict from a pandas DataFrame, or predict_from_dataset_name() to predict from a dataset already stored in your workspace. These two functions return the probabilities of the predictions rather than the binary decisions.

df = pd.read_csv('/path/to/datafile')

# we will use the best model
m = uc.best_model
# predict_proba to predict from df
df_proba = m.predict_proba(df)

# if your dataset is already stored in your workspace
test_data = 'Telco-Customer-Churn_test'
df_proba = m.predict_from_dataset_name(test_data)

Get the binary decisions

If you want to get the decisions (0 or 1) rather than the scores, you can use the predict() function. The decisions are made by comparing the scores to a threshold.

By default the threshold is set to 0.5: if score > 0.5 the decision is 1, otherwise it is 0.
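
The thresholding rule itself is simple enough to sketch in a few lines of plain Python. The helper to_decisions() and the score values are made up for illustration; this is not how the SDK is implemented internally, only the rule stated above.

```python
def to_decisions(scores, threshold=0.5):
    """Map each score to 1 if it is strictly above the threshold, else 0."""
    return [1 if score > threshold else 0 for score in scores]


scores = [0.12, 0.50, 0.51, 0.97]

# default threshold: a score of exactly 0.5 is not above it, so it maps to 0
decisions = to_decisions(scores)
print(decisions)  # [0, 0, 1, 1]

# a stricter threshold flags fewer clients as churners
strict = to_decisions(scores, threshold=0.9)
print(strict)  # [0, 0, 0, 1]
```

Raising the threshold trades recall for precision: fewer clients are flagged, but those that are flagged have higher scores.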

df = pd.read_csv('/path/to/datafile')
# we will use the best model
m = uc.best_model
# predict() returns the binary decisions for df
df_decisions = m.predict(df)

If you want to change your decision threshold (for example to the optimal threshold), proceed as follows:

# change the value of internal _predict_threshold attribute 
# (by default it is set to 0.5)
m._predict_threshold = m.optimal_threshold
# use predict()
df_decisions = m.predict(df)
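
This tutorial does not say which criterion defines the optimal threshold. To give an idea of what such a search can look like, the sketch below picks, from a list of candidates, the threshold that maximizes the F1 score on labelled data. The choice of F1, the helper names and the toy data are all assumptions made for illustration.

```python
def f1_at_threshold(y_true, scores, threshold):
    """F1 score of the decisions obtained with the given threshold."""
    preds = [1 if s > threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positive: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_f1_threshold(y_true, scores, candidates):
    """Return the candidate threshold with the highest F1 score."""
    return max(candidates, key=lambda t: f1_at_threshold(y_true, scores, t))


# toy labelled data: 1 = churned, 0 = did not churn
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.10, 0.30, 0.35, 0.60, 0.70, 0.90]
candidates = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

best = best_f1_threshold(y_true, scores, candidates)
print(best)  # 0.3
```

On this toy data a threshold of 0.3 catches all three churners with a single false positive, which gives a better F1 than the default of 0.5.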