Getting started with Swedish Leaf Dataset classification


The PureML SDK & CLI can be installed directly using pip.

pip install pureml

For additional project requirements, we will need to install the following packages: pandas, numpy, xgboost, and scikit-learn.

You can use the following command to install them.

pip install pandas==1.4.3 numpy==1.23.1 xgboost==1.7.2 scikit-learn==1.2.0


Alternatively, you can create a requirements.txt file with the following contents

pandas==1.4.3
numpy==1.23.1
xgboost==1.7.2
scikit-learn==1.2.0

and run the following command

pip install -r requirements.txt

Download and load your dataset

Download your dataset from here.

Start by creating a function to load the dataset into a DataFrame. We will use the @load_data() decorator from PureML SDK.

import cv2
import os
from pureml.decorators import load_data, transformer, dataset, model
import pandas as pd

@load_data()
def load_images(df):
    path_dir = './leaf_datasets_data/swedish-leaf/images' # change this to the location of your dataset

    filepath = []
    for root, dirs, files in os.walk(path_dir):
        for file in files:
            filepath.append(os.path.join(root, file))
    df['filepath'] = pd.Series(filepath)

    df['image'] = df.filepath.apply(lambda x : cv2.imread(x))
    # image folders are named like 'leaf1', 'leaf2', ..., so stripping the
    # 'leaf' characters from the parent directory name yields the class label
    df['label'] = df.filepath.apply(lambda x : int(x.split('/')[-2].strip('leaf')))

    df = df.dropna(axis=0, subset=['image'])

    return df

Preprocess the data

We can add a few more functions to preprocess the data. We will use the @transformer() decorator from PureML SDK.

Add the following additional imports to the top of your file.

import numpy as np
from skimage.feature import graycomatrix, graycoprops
from sklearn.model_selection import train_test_split

Decorate the preprocessing functions below with the @transformer() decorator. We specify the parent of each function using the parent argument, which ensures the functions are executed in the specified order. The parents shown below mirror the order in which create_dataset calls the functions.

@transformer(parent='load_images')
def resize_images(df, size):

    df['image'] = df.image.apply(lambda x: cv2.resize(x, (size, size)))

    return df

@transformer(parent='resize_images')
def generate_foreground_mask(df):

    def get_mask(row):
        img = row.image
        # the saturation channel separates the leaf from the pale background
        sat = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)[:,:,1]
        mask = np.zeros((sat.shape[0], sat.shape[1]), np.uint8)

        ret, thresh = cv2.threshold(sat, 20, 255, cv2.THRESH_BINARY)
        contours, _ = cv2.findContours(thresh, cv2.RETR_TREE, cv2.CHAIN_APPROX_NONE)
        if len(contours) > 0:
            # fill the largest contour into the mask
            contours_max = max(contours, key=cv2.contourArea)
            cv2.drawContours(mask, [contours_max], -1, 255, -1)

        return mask, contours

    df[['mask','contours']] = df.apply(get_mask, axis=1, result_type ='expand')

    return df

@transformer(parent='generate_foreground_mask')
def convert_to_grayscale(df):
    df['gray'] = df.image.apply(lambda x: cv2.cvtColor(x, cv2.COLOR_BGR2GRAY))

    return df

@transformer(parent='convert_to_grayscale')
def generate_color_features(df):

    def get_features(row):
        gray = row.gray

        mean, std = cv2.meanStdDev(gray)
        mean, std = np.array(mean), np.array(std)
        min_v = np.array(gray.min())
        max_v = np.array(gray.max())
        # 32-bin intensity histogram over the full 0-255 range
        hist = cv2.calcHist([gray], [0], None, [32], [0, 256])

        feat = np.concatenate((mean.reshape(-1), std.reshape(-1), min_v.reshape(-1), max_v.reshape(-1), hist.reshape(-1)))

        return feat

    df['color'] = df.apply(get_features, axis=1)

    return df

@transformer(parent='convert_to_grayscale')
def generate_texture_features(df):

    def get_features(gray):

        # quantize the 256 gray levels down to 32 before building the co-occurrence matrix
        gray = np.array((gray/8), np.uint8)
        glcm = graycomatrix(gray, distances=[2], angles=[0], levels=32, normed=True, symmetric=True)
        # cv2.meanStdDev expects a 2-D array, so take the single (distance, angle) slice
        mean, std = cv2.meanStdDev(glcm[:, :, 0, 0])
        mean, std = np.array(mean), np.array(std)
        contrast = graycoprops(glcm, 'contrast')
        correlation = graycoprops(glcm, 'correlation')
        homogeneity = graycoprops(glcm, 'homogeneity')
        diss = graycoprops(glcm, 'dissimilarity')
        eng = graycoprops(glcm, 'energy')
        ASM = graycoprops(glcm, 'ASM')

        feat = np.concatenate((mean.reshape(-1), std.reshape(-1), contrast.reshape(-1), correlation.reshape(-1),
                               homogeneity.reshape(-1), diss.reshape(-1), eng.ravel(), ASM.ravel()))
        return feat

    df['texture'] = df.gray.apply(get_features)

    return df

@transformer(parent='convert_to_grayscale')
def generate_shape_features(df):

    def get_features(row):
        contours = row.contours
        gray = row.gray

        area = np.array(cv2.contourArea(contours[0]))
        perimeter = np.array(cv2.arcLength(contours[0], True))

        (x,y),radius = cv2.minEnclosingCircle(contours[0])
        radius = np.array(radius)

        moments = cv2.HuMoments(cv2.moments(gray)).flatten()

        hull = np.array(cv2.convexHull(contours[0]))
        hull_area = np.array(cv2.contourArea(hull))

        feat = np.concatenate((area.reshape(-1), perimeter.reshape(-1), radius.reshape(-1),
                               moments.reshape(-1), hull_area.reshape(-1)))
        return feat

    df['shape'] = df.apply(get_features, axis=1)

    return df

@transformer(parent=['generate_shape_features', 'generate_color_features', 'generate_texture_features'])
def generate_features(df):

    def get_feature(row):

        feature = np.concatenate([row['color'], row['shape'], row['texture']])

        return feature

    df['feature'] = df.apply(get_feature, axis=1)

    return df

@transformer(parent='generate_features')
def split_data(df):
    x_train, x_test,  y_train, y_test = train_test_split(df['feature'].values, df['label'].values, test_size=0.2, random_state=42)

    x_train = np.stack(x_train, axis=0)
    x_test = np.stack(x_test, axis=0)
    y_train = np.stack(y_train, axis=0)
    y_test = np.stack(y_test, axis=0)

    return x_train, x_test,  y_train, y_test

A transformer can have multiple parents. In this case, the transformer will be executed after all the parents have been executed. The output of the parents will be passed as input to the transformer.
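For example, a transformer that combines the outputs of two sibling transformers could be declared like this (a minimal sketch with hypothetical column names, following the same decorator pattern as generate_features above):

from pureml.decorators import transformer

@transformer()
def add_width(df):
    df['width'] = df.image.apply(lambda x: x.shape[1])
    return df

@transformer()
def add_height(df):
    df['height'] = df.image.apply(lambda x: x.shape[0])
    return df

# executed only after both parents have run; their output is passed along
@transformer(parent=['add_width', 'add_height'])
def add_aspect_ratio(df):
    df['aspect_ratio'] = df['width'] / df['height']
    return df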

Creating a dataset

We can now create a dataset from the pipeline. The dataset will be created by executing the pipeline and saving the output of the last transformer in the pipeline. The dataset can be created by using the @dataset decorator. The decorator takes the following arguments:

  • label: The name of the dataset
  • parent: The name of the transformer whose output will be saved as the dataset
  • upload: If True, the dataset will be uploaded to the cloud. If False, the dataset will be saved locally.
@dataset(label='swedish_leaf_handcrafted_features:dev_2', parent='split_data', upload=True)
def create_dataset():
    columns_all = ['filepath','image', 'label', 'mask', 'gray', 'contours', 'color', 'shape', 'texture', 'feature']
    columns_needed = ['feature','label']

    df = pd.DataFrame(columns=columns_all)

    df = load_images(df)

    df = resize_images(df, size=128)

    df = generate_foreground_mask(df)

    df = convert_to_grayscale(df)

    df = generate_color_features(df)

    df = generate_texture_features(df)

    df = generate_shape_features(df)

    df = generate_features(df)

    df = df[columns_needed]

    x_train, x_test,  y_train, y_test = split_data(df)

    return {"x_train":x_train, "x_test":x_test,  "y_train":y_train, "y_test":y_test}

data = create_dataset()
x_train, x_test = data["x_train"], data["x_test"]
y_train, y_test = data["y_train"], data["y_test"]

Creating a model to classify the dataset

With the PureML model module, you can perform a variety of actions for creating and managing models and branches. PureML helps you train and track all of your machine learning project information, including ML models and datasets, with semantic versioning and complete artifact logging.

We can keep the model in a separate Python file containing the model definition and the training code (the snippets below assume x_train, x_test, y_train, and y_test from the dataset step are in scope). Let's start by adding the required imports.

from pytorch_tabnet.tab_model import TabNetClassifier
import torch.optim as optim
from torch.optim.lr_scheduler import ReduceLROnPlateau
from pureml.decorators import model
import pureml

The model training function can be created by using the @model decorator. The decorator takes the model name and branch as the argument in the format model_name:branch_name.

@model(label='swedish_leaf_classifier:dev')   # placeholder label in model_name:branch_name format
def train_model():

    tabnet_params = dict(
        gamma = 1.5,
        lambda_sparse = 0,

        optimizer_fn = optim.Adam,
        optimizer_params = dict(lr = 2.5e-2, weight_decay = 1e-6),

        mask_type = "entmax",
        scheduler_params = dict(
            mode = "min", patience = 5, min_lr = 1e-5, factor = 0.5),
        scheduler_fn = ReduceLROnPlateau,
        seed = 42,
        verbose = 10,
    )
    epochs = 60
    batch_size = 32

    tabnet = TabNetClassifier(**tabnet_params)

    tabnet.fit(x_train, y_train,
            eval_set = [(x_test, y_test)],
            max_epochs = epochs,
            batch_size = batch_size,
            patience = 12,
            eval_metric=['accuracy', 'balanced_accuracy', 'logloss'],
            #device_name = DEVICE
            )

    pureml.log(
        metrics = tabnet.history.epoch_metrics,
        params = {
            'gamma': tabnet_params['gamma'],
            'lr': tabnet_params['optimizer_params']['lr'],
            'mask_type': tabnet_params['mask_type'],
            'scheduler_fn': 'ReduceLROnPlateau',
            'seed': tabnet_params['seed'],
            'epochs': epochs,
        })

    return tabnet

tabnet = train_model()

The pureml.log function is used here to log the metrics and parameters of the model.

Once our training is complete, our model will be ready to rock and roll 🎸✨. But that can wait; for now, let's just do some predictions.

Add prediction to your model

For registered models, the prediction function, along with its requirements and resources, can be logged and used for further processes like evaluation and packaging.

The PureML predict module has an add method. Here we are using the following arguments:

  • label: The name of the model (model_name:branch_name:version)
  • paths: The path to the predict.py file and requirements.txt file.

Our predict.py file has the script to load the model and make predictions. The requirements.txt file has the dependencies required to run the predict.py file.

You can learn more about the prediction process here.

A minimal sketch of what predict.py might contain is shown below (the exact interface PureML expects may differ; see the prediction docs linked above):
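import numpy as np

def predict(model, data):
    # 'data' is assumed to be a 2-D array of handcrafted feature vectors,
    # matching the x_train layout produced by split_data above
    data = np.asarray(data)
    return model.predict(data).tolist()

With predict.py and requirements.txt in place, register them against the model: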
import pureml

                   paths={"predict": "./predict.py", "requirements":"./requirements.txt"})

Create your first Evaluation

PureML has an eval method that runs a task_type on a label_model using a label_dataset. A call for this tutorial could look like the following sketch (the labels are placeholders, and classification is the assumed task type):

import pureml
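pureml.eval(task_type='classification',      # assumed task type for this tutorial
            label_model='<model_label>',     # model_name:branch_name:version
            label_dataset='<dataset_label>') # dataset_name:branch_name:version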

Deploy with FastAPI

Your evaluated model can now be deployed using FastAPI simply by using the pureml.fastapi.run method. A minimal sketch (the model label is a placeholder, and the argument name is assumed to mirror pureml.docker.create below):

import pureml
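pureml.fastapi.run(label='<model_label>')   # placeholder label; argument assumed to match docker.create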


Deploy using Docker

Alternatively, you can use Docker for deployment.

To deploy on Docker, you need to create an .env file containing your API token and org id, along these lines (the variable names are illustrative; check the PureML docs for the exact names):
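API_TOKEN=<your_api_token>
ORG_ID=<your_org_id>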


You can get your API token and org id from the settings page. If you do not have an API token, you can generate one there.

You can then create a docker image using the pureml.docker.create method.

import pureml

pureml.docker.create(label='<model_label>',     # label of the model to be deployed
                     env_path='.env',           # path to the .env file
                     port=<port>,               # port on which the server will run
                     sys_commands=[             # any system commands to run before starting the server
                         "apt install gcc -y",
                     ])