Creating a Dataset
With the SuperAlign dataset module, you can perform a variety of actions related to creating and managing datasets. Here’s an overview of the available methods:
Creating a Dataset
To create a new dataset, import the pureml module and use the `dataset.init` method:
The `label` parameter contains the dataset name in the following format:
_\<name>:\<name>:\<version>_
Initializing a dataset does not require a version, so `<name>` is used as the label. The label must not contain spaces, and special characters other than "-" and "_" are not allowed.
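The label rules above can be checked before calling the library. The following is a minimal sketch, not part of pureml: the helper name `is_valid_label` is hypothetical, and treating ASCII letters and digits as the only other permitted characters is an assumption.

```python
import re

# Hypothetical helper, not a pureml function.
# Assumption: besides "-" and "_", only ASCII letters and digits are allowed.
_LABEL_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_label(label: str) -> bool:
    """Return True if the label has no spaces and no special characters other than '-' and '_'."""
    return _LABEL_PATTERN.fullmatch(label) is not None


print(is_valid_label("customer-data_01"))  # True
print(is_valid_label("customer data"))     # False: contains a space
print(is_valid_label("customer@data"))     # False: "@" is not allowed
```

Running a check like this first gives a clear local error instead of a rejected request.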
Created a dataset? No? Create now. Yes? Here’s what you should do next.
Register a Dataset version
After the dataset has been initialized, you can register it using the `dataset` module.
`lineage` is required to register a dataset. You can use the `@dataset` decorator to auto-generate data lineage.
Register a validated Dataset
Once the dataset is validated, here is how you can register the validated dataset. The `x_test` and `y_test` keys are mandatory. Any other key-value pair is allowed in the registered dataset.
For example, to register a dataset along with its training features and labels, add extra key-value pairs for them.
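As an illustration of the key requirement, the sketch below builds a dataset dictionary with the two mandatory keys plus extra training entries and checks it locally. The helper `missing_required_keys`, the key names `x_train` and `y_train`, and the sample values are all assumptions made for this example; only `x_test` and `y_test` come from the text above.

```python
# Mandatory keys named in the documentation.
REQUIRED_KEYS = ("x_test", "y_test")


def missing_required_keys(dataset: dict) -> list:
    """Return the mandatory keys that are absent from the dataset dictionary."""
    return [key for key in REQUIRED_KEYS if key not in dataset]


# Illustrative values only; "x_train" and "y_train" are assumed names
# for the optional training features and labels.
dataset = {
    "x_test": [[0.1, 0.2], [0.3, 0.4]],
    "y_test": [0, 1],
    "x_train": [[0.5, 0.6], [0.7, 0.8], [0.9, 1.0]],
    "y_train": [1, 0, 1],
}

print(missing_required_keys(dataset))            # []
print(missing_required_keys({"x_test": [[1]]}))  # ['y_test']
```

An empty result means the dictionary satisfies the stated requirement and can be passed on for registration.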
Fetching a Dataset version
Once you register your dataset to SuperAlign, you can load it using the `dataset` module.
Let’s look at how to load the dataset:
`fetch` fetches the latest version of the dataset. A particular version can be fetched by providing the `version` parameter, for example `v2` of the dataset telecom churn.
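The latest-versus-specific behaviour described above can be modelled with a small toy function. This is a sketch of the selection logic only, not how pureml or SuperAlign implements it: the function `resolve_version` is hypothetical, and it assumes versions are named `v` followed by a number, as in `v2`.

```python
def resolve_version(available, version=None):
    """Pick the requested version if given, otherwise the latest one.

    Assumption: versions are strings such as "v1", "v2", "v10".
    """
    if not available:
        raise ValueError("no versions registered")
    if version is not None:
        if version not in available:
            raise KeyError(f"version {version!r} not found")
        return version
    # Compare the numeric part so that "v10" counts as newer than "v2".
    return max(available, key=lambda v: int(v.lstrip("v")))


versions = ["v1", "v2", "v10"]
print(resolve_version(versions))        # v10
print(resolve_version(versions, "v2"))  # v2
```

Comparing the numeric part rather than the raw string avoids treating `v2` as newer than `v10`.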
Listing Datasets
To list all available datasets, use the `dataset.list` method: