SuperAlign Datasets organize a user's data. A Dataset is a container that stores the elements of a dataset: its lineage, dataset-related graphs, and dataset files. There are two types of datasets in SuperAlign: Private Datasets, whose content only the owning user can access and view, and Public Datasets, which are accessible to all SuperAlign users. To register dataset files and add their content to a Dataset, the user first needs to initialize an empty Dataset, which can be done via the SuperAlign Python package.
Creating a Dataset
With the SuperAlign dataset module, you can create and manage datasets. The sections on this page cover the available methods. To create a new dataset, import the `pureml` module and call the `dataset.init` method.
The `label` parameter holds the dataset name in the following format:
`<name>:<version>`
A version is not required when initializing a dataset, so the label is just `<name>`.
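As a minimal sketch of the label convention and the initialization call, the snippet below splits a label into its name and optional version and passes only the name to `pureml.dataset.init`. The helper names `split_label` and `create_dataset` are hypothetical, the handling of the `:` separator is an assumption based on the `<name>:<version>` format, and passing `label` as a keyword argument is likewise assumed. `create_dataset` needs the `pureml` package to be installed.

```python
from typing import Optional, Tuple


def split_label(label: str) -> Tuple[str, Optional[str]]:
    """Split '<name>:<version>' into (name, version).

    The version is None when the label holds only a name.
    """
    name, sep, version = label.partition(":")
    name, version = name.strip(), version.strip()
    if not name:
        raise ValueError("label must contain a dataset name")
    if sep and not version:
        raise ValueError("label has ':' but no version after it")
    return name, (version if sep else None)


def create_dataset(label: str):
    """Initialize an empty dataset; any version in the label is ignored."""
    import pureml  # imported lazily so split_label works without the package

    name, _ = split_label(label)
    # Assumption: `label` is accepted as a keyword argument by dataset.init.
    return pureml.dataset.init(label=name)
```

Calling `create_dataset("telecom churn")` would then initialize a dataset named `telecom churn`, provided the call signature matches the assumption above.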
Register a Dataset version
After the Dataset has been initialized, you can register a version of it using the `dataset` module.
`lineage` is required to register a dataset. You can use the `@dataset` decorator to auto-generate data lineage.
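To illustrate the idea behind decorator-based lineage, here is a toy stand-in that records the name of each step as it runs. It is not the `@dataset` decorator from the package and makes no claim about how SuperAlign stores lineage. `track_step`, `LINEAGE`, `load_rows`, and `drop_empty` are hypothetical names used only for this sketch, and the sample rows are made up.

```python
import functools

LINEAGE = []  # ordered record of the steps that produced the data


def track_step(name):
    """Toy decorator: append `name` to LINEAGE each time the step completes."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            LINEAGE.append(name)
            return result
        return wrapper
    return decorator


@track_step("load")
def load_rows():
    # Stand-in for reading raw data from a file or database.
    return [{"id": 1, "churn": 0}, {"id": 2, "churn": None}]


@track_step("clean")
def drop_empty(rows):
    # Keep only the rows that have a churn label.
    return [row for row in rows if row["churn"] is not None]


rows = drop_empty(load_rows())
```

Because each step is recorded where it is defined, the recorded lineage stays in sync with the code that builds the dataset, which is the convenience a lineage decorator is meant to provide.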
Register a validated Dataset
Once the dataset is validated, you can register the validated dataset. The `x_test` and `y_test` keys are mandatory; any other key-value pair is allowed in a registered dataset.
For example, to register a dataset along with its training features and labels, add extra key-value pairs for them.
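The key requirement can be checked before registering. The sketch below builds the dictionary and verifies that `x_test` and `y_test` are present. `check_validated_dataset` is a hypothetical helper, and the `x_train` and `y_train` key names and the sample values are illustrative assumptions, not names the platform requires.

```python
REQUIRED_KEYS = ("x_test", "y_test")


def check_validated_dataset(data: dict) -> dict:
    """Return `data` unchanged if it has the mandatory keys, else raise."""
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise KeyError(f"missing mandatory keys: {', '.join(missing)}")
    return data


# Mandatory test split only.
minimal = check_validated_dataset({
    "x_test": [[0.1, 0.7], [0.4, 0.2]],
    "y_test": [1, 0],
})

# Same dataset with extra key-value pairs for the training split.
with_training = check_validated_dataset({
    "x_test": [[0.1, 0.7], [0.4, 0.2]],
    "y_test": [1, 0],
    "x_train": [[0.3, 0.9], [0.8, 0.5], [0.6, 0.1]],
    "y_train": [0, 1, 1],
})
```

A dictionary that omits either mandatory key fails the check with a `KeyError` before any registration call is attempted.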
Fetching a Dataset version
Once you have registered your dataset with SuperAlign, you can load it using the `dataset` module.
By default, `fetch` returns the latest version of the dataset. To fetch a particular version, provide the `version` parameter, for example to load v2 of the dataset `telecom churn`.
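Fetching the latest version versus a specific one can be sketched as follows. The text here does not show whether the version is passed inside the label or as a separate argument, so this sketch assumes it travels inside the label as `<name>:<version>` and that `fetch` accepts `label` as a keyword argument. `versioned_label` and `load_dataset` are hypothetical helper names, and `load_dataset` needs the `pureml` package to be installed.

```python
from typing import Optional


def versioned_label(name: str, version: Optional[str] = None) -> str:
    """Return '<name>' for the latest version, or '<name>:<version>'."""
    return name if version is None else f"{name}:{version}"


def load_dataset(name: str, version: Optional[str] = None):
    """Fetch the latest version, or a specific one when `version` is given."""
    import pureml  # imported lazily; needs the package installed

    # Assumption: the version is carried in the label rather than passed
    # to fetch as a separate argument.
    return pureml.dataset.fetch(label=versioned_label(name, version))
```

Under these assumptions, `load_dataset("telecom churn")` would request the latest version and `load_dataset("telecom churn", "v2")` would request v2.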
Listing Datasets
To list all available datasets, use the `dataset.list` method.