Create dataset
Add your first dataset.
SuperAlign Datasets organize a user's data. A Dataset starts as an empty container and holds the dataset files together with their lineage and dataset-related graphs.
There are two types of datasets in SuperAlign: Private Datasets, whose content only the owner can access and view, and Public Datasets, which are accessible to all SuperAlign users.
Before you can register dataset files and add their related content to a Dataset, you need to initialize an empty Dataset, which can be done via the SuperAlign Python package.
Creating a Dataset
With the SuperAlign dataset module, you can perform a variety of actions related to creating and managing datasets. Here’s an overview of the available methods:
To create a new dataset, import the pureml module and use the dataset.init method:
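A minimal sketch of this call, assuming the pureml package is installed and you are already authenticated. The wrapper name create_dataset is hypothetical, and the keyword names label and readme are assumptions based on the parameters this page describes, so check them against your installed version.

```python
def create_dataset(label, readme=None):
    """Initialize an empty Dataset (hypothetical helper around dataset.init)."""
    # Imported lazily so this file can be loaded without the package installed.
    import pureml

    # Assumed keyword name for the dataset label.
    kwargs = {"label": label}
    if readme is not None:
        # Assumed keyword name for the optional readme file path.
        kwargs["readme"] = readme
    return pureml.dataset.init(**kwargs)
```

For example, create_dataset("telecom-churn", readme="README.md") would initialize a Dataset under the illustrative label telecom-churn.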
The name of the dataset to be created is a required parameter. You can also provide an optional readme file path.
The label parameter contains the dataset name in the following format:
<name>:<name>:<version>
When initializing a dataset, the version is not required, so <name> alone is used as the label.
The label must not contain spaces, and special characters other than "-" and "_" are not allowed.
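The character rules above can be checked locally before calling the package. The sketch below is an illustration and not part of the SuperAlign package: the helper name is_valid_name is hypothetical, and it assumes that the rules apply to the <name> part used at initialization and that only ASCII letters and digits are accepted.

```python
import re

# ASCII letters, digits, "-" and "_" only; this also rules out spaces.
_NAME_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_name(name: str) -> bool:
    """Return True if `name` contains only the characters allowed in a label."""
    return _NAME_PATTERN.fullmatch(name) is not None
```

A name such as telecom-churn passes this check, while a name containing a space or a character such as "!" does not.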
If you have not created a dataset yet, create one now. If you have, here is what to do next.
Register a Dataset version
After the Dataset has been initialized, you can register it using the dataset module.
lineage is required to register a dataset. You can use the @dataset decorator to auto-generate data lineage.
Register a validated Dataset
Once the dataset is validated, here is how you can register the validated dataset.
Fetching a Dataset version
Once you have registered your dataset with SuperAlign, you can load it using the dataset module.
Let’s look at how to load the dataset:
By default, fetch returns the latest version of the dataset. A particular version of a dataset can be fetched by providing the version parameter, as follows.
Here, we have fetched version v2 of the dataset telecom churn.
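The fetch behavior described above can be sketched as follows, assuming the pureml package is installed and authenticated. The wrapper name fetch_dataset is hypothetical, and the keyword names label and version are assumptions taken from the wording on this page. Because the label format shown earlier also includes a version component, the version might instead need to be supplied as part of the label, so verify this against your installed version.

```python
def fetch_dataset(label, version=None):
    """Fetch a registered dataset (hypothetical helper around dataset.fetch)."""
    # Imported lazily so this file can be loaded without the package installed.
    import pureml

    if version is None:
        # No version given: rely on the default, which returns the latest version.
        return pureml.dataset.fetch(label=label)
    # Assumed keyword name: request one specific version, for example "v2".
    return pureml.dataset.fetch(label=label, version=version)
```

Calling fetch_dataset(label) would return the latest version, and fetch_dataset(label, version="v2") would request version v2.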
Listing Datasets
To list all available datasets, use the dataset.list method:
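A minimal sketch of the listing call, assuming the pureml package is installed and authenticated. The wrapper name list_datasets is hypothetical, and calling dataset.list with no arguments is an assumption. The shape of the returned value is not specified on this page, so the sketch passes it through unchanged.

```python
def list_datasets():
    """Return whatever dataset.list reports (hypothetical helper)."""
    # Imported lazily so this file can be loaded without the package installed.
    import pureml

    # Assumed to take no arguments and to return the available datasets.
    return pureml.dataset.list()
```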
These methods make it easy to create and manage datasets in SuperAlign. By using them, you can streamline your dataset management workflows and improve collaboration among team members.