Version Dataset
Created a dataset? No? Create now. Yes? Here’s what you should do next.
Register a Dataset version
After the Dataset has been initialized, you can register it using dataset
module.
import pureml
data = ###
lineage = ##@
dataset = pureml.dataset.register(data, 'telecom churn:dev', lineage)
lineage
is required to register a dataset. Yours can utilize @dataset
decorator to auto-generate data lineage.
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd
from pureml.decorators import dataset, transformer, load_data
@dataset('telecom_churn:dev')
def build_dataset():
df = load_churn_data()
df = encode_ordinal(df)
df = encode_binary(df)
return df
df = build_dataset()
Register a validated Dataset
Once the dataset is validated, here is how you can register the validated dataset.
@dataset(<dataset_label>)
def build_dataset
...
x_test = #test features
y_test = #test labels
return {"x_test":x_test, "y_test":y_test}
Fetching a Dataset version
Once you register your dataset to PureML, you can load it using dataset
module.
Let’s look at how to load the dataset:
import pureml
dataset = pureml.dataset.fetch('telecom churn:dev')
By default, fetch
fetches latest
version of the dataset. A particular version of a dataset can be fetched by providing version
parameter as the following.
dataset = pureml.dataset.fetch('telecom churn:dev:v2')
Here, we have fetched the version v2
of the dataset telecom churn
.
Submit, approve and reject dataset version in review
By providing a comprehensive set of data lineage and visualizations, PureML makes it easy to identify and correct any issues with its review feature. Owner can approve or reject to add in production.