Version Dataset

Have you created a dataset yet? If not, create one first. If so, here is what to do next.

Register a Dataset version

After the dataset has been initialized, you can register it using the dataset module.

import pureml

data = ...  # placeholder: the data to register
lineage = ...  # placeholder: the lineage for that data

dataset = pureml.dataset.register(data, 'telecom churn:dev', lineage)

lineage is required to register a dataset. You can use the @dataset decorator to auto-generate data lineage.

from sklearn.preprocessing import OrdinalEncoder
import pandas as pd
from pureml.decorators import dataset, transformer, load_data

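# load_churn_data, encode_ordinal, and encode_binary are assumed to be
# defined elsewhere; they are not shown here.

# Assumption: the decorator takes the dataset label as its argument.
# Check the PureML decorator reference for the exact signature.
@dataset('telecom churn:dev')
def build_dataset():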
    df = load_churn_data()
    df = encode_ordinal(df)
    df = encode_binary(df)
    return df

df = build_dataset()
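
To make the idea of auto-generated lineage more concrete, here is a toy sketch of how a decorator can record which steps ran and in what order. It is not PureML's implementation: track, LINEAGE, and the step functions are hypothetical names used only for illustration, and the sample rows are made up.

```python
import functools

# Hypothetical in-memory lineage log; PureML's real storage is not shown here.
LINEAGE = []

def track(step_name):
    """Toy decorator: append step_name to LINEAGE once the function returns."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            LINEAGE.append(step_name)
            return result
        return wrapper
    return decorator

@track("load")
def load_rows():
    # Made-up sample rows standing in for real churn data.
    return [{"plan": "basic", "churn": "yes"}, {"plan": "pro", "churn": "no"}]

@track("encode_binary")
def encode_binary(rows):
    # Map yes/no to 1/0 without mutating the input rows.
    return [{**r, "churn": 1 if r["churn"] == "yes" else 0} for r in rows]

@track("build_dataset")
def build_dataset():
    return encode_binary(load_rows())

rows = build_dataset()
print(LINEAGE)
```

Because each step is logged when it finishes, the inner steps appear in the log before the outer build_dataset step.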

Register a validated Dataset

Once the dataset is validated, you can register the validated dataset as follows.

def build_dataset():
    x_test = ...  # test features
    y_test = ...  # test labels
    return {"x_test": x_test, "y_test": y_test}
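
Before returning the dictionary, it can help to confirm that the features and labels line up. The check below is a minimal sketch in plain Python: check_test_split is a hypothetical helper, not part of PureML, and it assumes only that both values support len().

```python
def check_test_split(split):
    """Raise ValueError if x_test/y_test are missing or differ in length."""
    missing = {"x_test", "y_test"} - set(split)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if len(split["x_test"]) != len(split["y_test"]):
        raise ValueError("x_test and y_test must have the same number of rows")
    return split

# Made-up values: two feature rows and two matching labels.
split = check_test_split({"x_test": [[0.1, 1], [0.7, 0]], "y_test": [1, 0]})
```

The helper returns its input unchanged, so it can wrap the return value of build_dataset directly.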

Fetching a Dataset version

Once you have registered your dataset with PureML, you can load it using the dataset module.

Let’s look at how to load the dataset:

import pureml

dataset = pureml.dataset.fetch('telecom churn:dev')

By default, fetch returns the latest version of the dataset. To fetch a particular version, append the version to the label, as follows.

dataset = pureml.dataset.fetch('telecom churn:dev:v2')

Here, we fetched version v2 of the telecom churn dataset.
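
The labels in these examples appear to follow a name:branch[:version] pattern. Assuming that reading is correct, a small helper can split a label into its parts. parse_label and Label are hypothetical names for illustration, not PureML functions.

```python
from typing import NamedTuple, Optional

class Label(NamedTuple):
    name: str
    branch: str
    version: Optional[str]  # None means no version was given

def parse_label(label: str) -> Label:
    """Split 'name:branch' or 'name:branch:version' into its parts."""
    parts = label.split(":")
    if len(parts) == 2:
        return Label(parts[0], parts[1], None)
    if len(parts) == 3:
        return Label(*parts)
    raise ValueError(f"expected name:branch[:version], got {label!r}")

print(parse_label("telecom churn:dev"))
print(parse_label("telecom churn:dev:v2"))
```

A label without a version yields version=None, which matches the default behaviour of fetching the latest version.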

Submit, approve, and reject a dataset version in review

PureML's review feature provides data lineage and visualizations that make it easy to identify and correct issues in a dataset version. The owner can then approve or reject the version before it is added to production.
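
The review flow described above can be pictured as a small state machine. The sketch below is purely illustrative: the state names and the review function are assumptions made for this example, not PureML's API.

```python
from enum import Enum

class ReviewState(Enum):
    SUBMITTED = "submitted"
    APPROVED = "approved"
    REJECTED = "rejected"

def review(state: ReviewState, approve: bool) -> ReviewState:
    """Apply the owner's decision to a submitted version."""
    # Only a submitted version can be reviewed; a decided one is final here.
    if state is not ReviewState.SUBMITTED:
        raise ValueError(f"cannot review a version that is already {state.value}")
    return ReviewState.APPROVED if approve else ReviewState.REJECTED

state = review(ReviewState.SUBMITTED, approve=True)
print(state.value)
```

In this sketch a decision is final, so reviewing an approved or rejected version raises an error rather than silently changing its state.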