SuperAlign Datasets are a core component for organizing user data. A Dataset is a container that starts out empty and then stores the elements of a dataset: its lineage, dataset-related graphs, and dataset files.

There are two types of datasets in SuperAlign: Private Datasets, whose content only the owning user can access and view, and Public Datasets, which are accessible to all SuperAlign users.

To register dataset files and add their content to a Dataset, you first need to initialize an empty Dataset, which can be done via the SuperAlign Python package.

Creating a Dataset

With the SuperAlign dataset module, you can perform a variety of actions related to creating and managing datasets. Here’s an overview of the available methods:

To create a new dataset, import the pureml module and use the dataset.init method:

import pureml

pureml.dataset.init(label='FirstDataset', readme='ReadME.md')

The name of the dataset to be created, passed as label, is a required parameter. You can also provide an optional readme file path.

The label parameter contains the dataset name in the following format:

_\<name>:\<name>:\<version>_

Initializing a dataset does not require a version, so only <name> is used as the label.

The label must not contain any spaces. Special characters other than "-" and "_" are not allowed.
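The label rules above can be checked locally before calling dataset.init. The following is a minimal sketch, not part of the pureml package: is_valid_label is a hypothetical helper, and it assumes that the allowed characters are ASCII letters, digits, "-" and "_".

```python
import re

# Assumption: a label may contain only ASCII letters, digits, "-" and "_".
# This pattern is illustrative and is not taken from the pureml package.
_LABEL_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_label(label: str) -> bool:
    """Return True if the label has no spaces or disallowed special characters."""
    return _LABEL_PATTERN.fullmatch(label) is not None


print(is_valid_label("FirstDataset"))   # True
print(is_valid_label("First Dataset"))  # False: contains a space
print(is_valid_label("First/Dataset"))  # False: "/" is not allowed
```

Running a check like this first can surface a malformed label before any request is made.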

Created a dataset? No? Create now. Yes? Here’s what you should do next.

Register a Dataset version

After the Dataset has been initialized, you can register a version of it using the dataset module.

import pureml

data = ...     # placeholder: the data to register
lineage = ...  # placeholder: the lineage for this data

dataset = pureml.dataset.register(data, 'telecom_churn', lineage)

Lineage is required to register a dataset. You can use the @dataset decorator to auto-generate data lineage.

from sklearn.preprocessing import OrdinalEncoder
import pandas as pd
from pureml.decorators import dataset, transformer, load_data

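# load_churn_data, encode_ordinal, and encode_binary are not shown here; they are
# assumed to be user-defined steps (for example, functions decorated with
# @load_data and @transformer).
@dataset('telecom_churn')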
def build_dataset():
    df = load_churn_data()
    df = encode_ordinal(df)
    df = encode_binary(df)
    return df

df = build_dataset()

Register a validated Dataset

Once the dataset has been validated, you can register the validated dataset as follows:

@dataset('<dataset_label>')
def build_dataset():
    ...
    x_test = ...  # test features
    y_test = ...  # test labels
    return {"x_test": x_test, "y_test": y_test}
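As an illustration of the returned structure, the sketch below builds a dictionary with the x_test and y_test keys from a few held-out rows. It uses only plain Python and leaves out the @dataset decorator so that it runs on its own. The helper split_features_and_labels and the column names tenure, monthly_charges, and churn are hypothetical.

```python
def split_features_and_labels(rows: list, target: str) -> dict:
    """Split row dictionaries into test features and test labels."""
    # Features: every column except the target column.
    x_test = [{k: v for k, v in row.items() if k != target} for row in rows]
    # Labels: the value of the target column for each row.
    y_test = [row[target] for row in rows]
    return {"x_test": x_test, "y_test": y_test}


# Hypothetical held-out rows; the column names are illustrative only.
held_out_rows = [
    {"tenure": 12, "monthly_charges": 70.5, "churn": 1},
    {"tenure": 48, "monthly_charges": 20.0, "churn": 0},
]

result = split_features_and_labels(held_out_rows, target="churn")
print(result["y_test"])  # [1, 0]
```

In an actual pipeline, the body of the decorated build_dataset function would return a dictionary of this shape.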

Fetching a Dataset version

Once you have registered your dataset with SuperAlign, you can load it using the dataset module.

Let’s look at how to load the dataset:

import pureml

dataset = pureml.dataset.fetch('telecom_churn')

By default, fetch returns the latest version of the dataset. To fetch a particular version, append the version to the label after a colon:

dataset = pureml.dataset.fetch('telecom_churn:v2')

Here, we have fetched version v2 of the dataset telecom_churn.
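To make the label convention used with fetch concrete, here is a small sketch that splits a label string into its name and optional version. split_label is a hypothetical helper, not a pureml function, and it assumes the label contains at most one colon.

```python
from typing import Optional, Tuple


def split_label(label: str) -> Tuple[str, Optional[str]]:
    """Split '<name>:<version>' into (name, version); version is None if absent."""
    # Assumption: at most one ":" appears in the label.
    name, separator, version = label.partition(":")
    return name, (version if separator else None)


print(split_label("telecom_churn"))     # ('telecom_churn', None)
print(split_label("telecom_churn:v2"))  # ('telecom_churn', 'v2')
```

A label without a version corresponds to the default case, where the latest version is fetched.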

Listing Datasets

To list all available datasets, use the dataset.list method:

import pureml

pureml.dataset.list()

These methods make it easy to create and manage datasets in SuperAlign. By using them, you can streamline your dataset management workflows and improve collaboration among team members.