Version
Capture Data lineage

Using PureML, you can build complicated pipelines with much ease. In our previous tutorial, all the steps in the pipeline have one input and one output. However, in quite a few cases, a step could have multiple inputs. We will use parent parameter to construct these pipelines.

Let us consider the following example.

‚Äč
Example

from sklearn.preprocessing import OrdinalEncoder
import pandas as pd
from pureml.decorators import dataset, model, transformer, load_data

@load_data()
def load_churn_data():
    '''
    Loads churn data
    '''
    df = pd.read_csv('../bigml_59c28831336c6604c800002a.csv')

    return df

@transformer()
def encode_ordinal(df):
    '''
    Encode ordinal data
    '''
    df[df.columns] = OrdinalEncoder().fit_transform(df)

    return df

@transformer()
def encode_binary(df):
    '''
    Encode binary data
    '''
    df['voice mail plan'] = df['voice mail plan'].map({'yes':1, 'no':0})
    df['international plan'] = df['international plan'].map({'yes':1, 'no':0})
    df['churn'] = df['churn'].map({True:1, False:0})

    return df

@dataset('telecom churn:dev', parent=['encode ordinal', 'encode binary'])
def build_dataset():
    '''
    None
    '''
    df = load_churn_data()

    column_names = df.columns

    df[['state',  'phone number']] = encode_ordinal(df[['state',  'phone number']])
    df[['voice mail plan', 'international plan', 'churn']] = encode_binary(df[['voice mail plan', 'international plan', 'churn']])

    return df

df = build_dataset()

The above example generates the following pipeline structure:

Advanced Lineage