pyanno4rt.learning_model.dataset

Dataset module.


This module provides methods and classes to import and restructure different types of learning model datasets (tabular, image-based, …).

Overview

Classes

TabularDataGenerator

Tabular dataset generation class.

Classes

class pyanno4rt.learning_model.dataset.TabularDataGenerator(model_label, feature_filter, label_name, label_bounds, time_variable_name, label_viewpoint, tune_splits, oof_splits)

Tabular dataset generation class.

This class provides methods to load, decompose, modulate and binarize a tabular base dataset.

Parameters:
  • model_label (str) – Label for the machine learning model.

  • feature_filter (dict) – Dictionary with a list of feature names and a value from {‘retain’, ‘remove’} as an indicator for retaining/removing the features prior to model fitting.

  • label_name (str) – Name of the label variable.

  • label_bounds (list) – Bounds used to binarize the label values into a positive class (value lies inside the bounds) and a negative class (value lies outside the bounds).

  • time_variable_name (str) – Name of the time-after-radiotherapy variable (unit should be days).

  • label_viewpoint ({'early', 'late', 'long-term', 'longitudinal', 'profile'}) – Time of observation for the presence of tumor control and/or normal tissue complication events.

  • tune_splits (int) – Number of splits for the stratified cross-validation within each model hyperparameter optimization step.

  • oof_splits (int) – Number of splits for the stratified cross-validation within the out-of-folds model evaluation step.

model_label

See ‘Parameters’.

Type:

str

feature_filter

See ‘Parameters’.

Type:

dict

label_name

See ‘Parameters’.

Type:

str

label_bounds

See ‘Parameters’.

Type:

list

time_variable_name

See ‘Parameters’.

Type:

str

label_viewpoint

See ‘Parameters’.

Type:

{‘early’, ‘late’, ‘long-term’, ‘longitudinal’, ‘profile’}

tune_splits

See ‘Parameters’.

Type:

int

oof_splits

See ‘Parameters’.

Type:

int

Overview

Methods

generate(data_path)

Generate the data information.

decompose(dataset, feature_filter, label_name, time_variable_name)

Decompose the base tabular dataset.

modulate(data_information, label_viewpoint)

Modulate the data information.

binarize(data_information, label_bounds)

Binarize the data information.

add_fold_numbers(data_information, tune_splits, oof_splits)

Add the stratified cross-validation fold numbers.

Members

generate(data_path)

Generate the data information.

Parameters:

data_path (str) – Path to the dataset used for fitting the machine learning model.

Returns:

Dictionary with the decomposed, modulated and binarized data information.

Return type:

dict

decompose(dataset, feature_filter, label_name, time_variable_name)

Decompose the base tabular dataset.

Parameters:
  • dataset (DataFrame) – Dataframe with the feature and label names/values.

  • feature_filter (dict) – Dictionary with a list of feature names and a value from {‘retain’, ‘remove’} as an indicator for retaining/removing the features prior to model fitting.

  • label_name (str) – Name of the label variable.

  • time_variable_name (str) – Name of the time-after-radiotherapy variable (unit should be days).

Returns:

Dictionary with the decomposed data information.

Return type:

dict
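The retain/remove behavior of feature_filter can be illustrated with a minimal, library-independent sketch. The helper apply_feature_filter and the dictionary keys 'features' and 'filter_mode' are assumptions made for this illustration; the text above only states that the dictionary holds a list of feature names and a value from {'retain', 'remove'}.

```python
def apply_feature_filter(columns, feature_filter):
    """Return the column names that survive the filter (hypothetical helper)."""
    # Assumed layout: {'features': [...], 'filter_mode': 'retain' | 'remove'}
    listed = set(feature_filter['features'])
    mode = feature_filter['filter_mode']

    if mode == 'retain':
        # Keep only the listed features
        return [name for name in columns if name in listed]
    if mode == 'remove':
        # Keep everything except the listed features
        return [name for name in columns if name not in listed]

    raise ValueError(f"Unknown filter mode: {mode!r}")


columns = ['age', 'mean_dose', 'volume']
kept = apply_feature_filter(
    columns, {'features': ['age'], 'filter_mode': 'remove'})
print(kept)  # ['mean_dose', 'volume']
```

The original column order is preserved in both modes, which keeps the feature matrix layout predictable after filtering.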

modulate(data_information, label_viewpoint)

Modulate the data information.

Parameters:
  • data_information (dict) – Dictionary with the decomposed data information.

  • label_viewpoint ({'early', 'late', 'long-term', 'longitudinal', 'profile'}) – Time of observation for the presence of tumor control and/or normal tissue complication events.

Returns:

Dictionary with the modulated data information.

Return type:

dict

binarize(data_information, label_bounds)

Binarize the data information.

Parameters:
  • data_information (dict) – Dictionary with the decomposed data information.

  • label_bounds (list) – Bounds used to binarize the label values into a positive class (value lies inside the bounds) and a negative class (value lies outside the bounds).

Returns:

Dictionary with the binarized data information.

Return type:

dict
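The inside/outside rule for label_bounds can be sketched as follows. The function binarize_labels is a hypothetical stand-in, not the library's method, and treating both bounds as inclusive is an assumption, since the description above does not say how values exactly on a bound are handled.

```python
def binarize_labels(values, label_bounds):
    """Map each label value to 1 (inside the bounds) or 0 (outside)."""
    lower, upper = label_bounds
    # Assumption: both bounds are inclusive
    return [1 if lower <= value <= upper else 0 for value in values]


grades = [0, 1, 2, 3, 4]
binary = binarize_labels(grades, [2, 4])
print(binary)  # [0, 0, 1, 1, 1]
```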

add_fold_numbers(data_information, tune_splits, oof_splits)

Add the stratified cross-validation fold numbers.

Parameters:
  • data_information (dict) – Dictionary with the preprocessed data information.

  • tune_splits (int) – Number of splits for the stratified cross-validation within each model hyperparameter optimization step.

  • oof_splits (int) – Number of splits for the stratified cross-validation within the out-of-folds model evaluation step.

Returns:

Dictionary with the stratified cross-validation fold numbers.

Return type:

dict
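To show what stratified fold numbers look like, here is a simplified pure-Python sketch. It is a deterministic round-robin stand-in for a stratified splitter, not the library's implementation; the function assign_stratified_folds and the result keys 'tune' and 'oof' are hypothetical names chosen for this example.

```python
from collections import defaultdict


def assign_stratified_folds(labels, n_splits):
    """Assign each sample a fold number in [0, n_splits), balanced per class."""
    fold_numbers = []
    seen_per_class = defaultdict(int)

    for label in labels:
        # Cycle through the folds separately for each class, so every
        # fold receives a similar share of positives and negatives
        fold_numbers.append(seen_per_class[label] % n_splits)
        seen_per_class[label] += 1

    return fold_numbers


labels = [1, 0, 1, 0, 1, 0, 0, 1]
fold_numbers = {
    'tune': assign_stratified_folds(labels, 2),  # tune_splits = 2
    'oof': assign_stratified_folds(labels, 4),   # oof_splits = 4
}
print(fold_numbers['tune'])  # [0, 0, 1, 1, 0, 0, 1, 1]
print(fold_numbers['oof'])   # [0, 0, 1, 1, 2, 2, 3, 3]
```

With four positives and four negatives, each of the four out-of-folds splits receives exactly one sample per class. A production splitter would typically also shuffle the samples before assignment.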