Dataset
PureML Datasets organize a user's data. A Dataset is a container that starts out empty and stores the dataset files together with their lineage and dataset-related graphs.
PureML has two types of datasets: Private Datasets, whose contents only the owning user can access and view, and Public Datasets, which are accessible to all PureML users.
To register dataset files and add their content to a Dataset, the user first initializes an empty Dataset with the PureML Python package.
Creating a Dataset
The PureML dataset module provides methods for creating and managing datasets and their branches. Here’s an overview of the available methods:
Creating a Dataset
To create a new dataset, import the pureml module and use the dataset.init method:
import pureml
pureml.dataset.init(label='FirstDataset:dev', readme='ReadME.md')
The dataset name and the branch to be created are required, and both are passed through the label parameter. The readme file path is optional.
The label parameter combines the dataset name, branch, and version in the following format:
<name>:<branch>:<version>
Initializing a dataset does not require a version, so the label takes the form <name>:<branch>.
The label must not contain any spaces. Special characters other than "-" and "_" are not allowed.
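The label rules can be checked locally before calling pureml.dataset.init. The sketch below is only an illustration and is not part of the PureML package: the helper name is_valid_init_label is hypothetical, and it assumes that a dataset name or branch may contain only ASCII letters, digits, "-" and "_", which is one reading of the rules stated here.

```python
import re

# Hypothetical helper, not part of the PureML package.
# Assumption: a name or branch may contain only ASCII letters, digits, "-" and "_".
_PART = r"[A-Za-z0-9_-]+"
_INIT_LABEL = re.compile(rf"{_PART}:{_PART}")


def is_valid_init_label(label: str) -> bool:
    """Return True if label has the form <name>:<branch> with allowed characters only."""
    return _INIT_LABEL.fullmatch(label) is not None


print(is_valid_init_label("FirstDataset:dev"))   # accepted: name and branch
print(is_valid_init_label("First Dataset:dev"))  # rejected: contains a space
print(is_valid_init_label("FirstDataset"))       # rejected: branch is missing
```

Running a check like this first gives a clear local error instead of relying on whatever the package reports for a malformed label.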
Listing Datasets
To list all available datasets, use the dataset.list method:
import pureml
pureml.dataset.list()
Creating a Branch
To create a new branch for a dataset, use the dataset.init_branch method:
import pureml
pureml.dataset.init_branch(label='FirstDataset:dev_2')
The branch name and the name of the dataset in which the branch will be created are required parameters.
Listing Branches
To list all available branches for a dataset, use the dataset.branch_list method:
import pureml
pureml.dataset.branch_list(label='FirstDataset')
The label parameter combines the dataset name, branch, and version in the following format:
<name>:<branch>:<version>
Listing the branches of a dataset requires neither a branch nor a version, so the label is just <name>.
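The methods on this page take labels in three forms: <name>, <name>:<branch>, and <name>:<branch>:<version>. The following sketch shows one way to split a label into those parts. It is not part of the PureML package: LabelParts and split_label are hypothetical names, and the version string "v1" is a made-up placeholder, since the version format is not described here.

```python
from typing import NamedTuple, Optional


class LabelParts(NamedTuple):
    """Hypothetical container for the parts of a label (not a PureML type)."""
    name: str
    branch: Optional[str] = None
    version: Optional[str] = None


def split_label(label: str) -> LabelParts:
    """Split <name>[:<branch>[:<version>]] into its parts; missing parts are None."""
    parts = label.split(":")
    if not 1 <= len(parts) <= 3 or any(p == "" for p in parts):
        raise ValueError(f"expected <name>[:<branch>[:<version>]], got {label!r}")
    return LabelParts(*parts)


print(split_label("FirstDataset"))         # name only, as used by dataset.branch_list
print(split_label("FirstDataset:dev"))     # name and branch, as used by dataset.init
print(split_label("FirstDataset:dev:v1"))  # "v1" is a placeholder version
```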
These methods make it easy to create and manage datasets and branches in PureML. By using them, you can streamline your dataset management workflows and improve collaboration among team members.