pyanno4rt.learning_model.dataset

Dataset module.

This module provides classes and methods to import and restructure different types of learning model datasets (tabular, image-based, …).

Overview

Classes
`TabularDataGenerator`	Tabular dataset generation class.

Classes

class pyanno4rt.learning_model.dataset.TabularDataGenerator(model_label, feature_filter, label_name, label_bounds, time_variable_name, label_viewpoint, tune_splits, oof_splits)

Tabular dataset generation class.

This class provides methods to load, decompose, modulate and binarize a tabular base dataset.

Parameters:

model_label (str) – Label for the machine learning model.
feature_filter (dict) – Dictionary with a list of feature names and a value from {'retain', 'remove'} that indicates whether the listed features are retained or removed prior to model fitting.
label_name (str) – Name of the label variable.
label_bounds (list) – Bounds used to binarize the label values: a value inside the bounds is assigned to the positive class, a value outside the bounds to the negative class.
time_variable_name (str) – Name of the time-after-radiotherapy variable (expected unit: days).
label_viewpoint ({'early', 'late', 'long-term', 'longitudinal', 'profile'}) – Time of observation for the presence of tumor control and/or normal tissue complication events.
tune_splits (int) – Number of splits for the stratified cross-validation within each model hyperparameter optimization step.
oof_splits (int) – Number of splits for the stratified cross-validation within the out-of-folds model evaluation step.
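As a rough illustration of how these arguments might look in practice, the sketch below collects them in a plain dictionary. The key names inside `feature_filter` (`'features'`, `'filter_mode'`), the column names, and all values are assumptions made for this example; they are not taken from the reference above.

```python
# Hypothetical argument values; every name and number below is an assumption.
generator_arguments = {
    'model_label': 'ntcp_model',
    'feature_filter': {
        'features': ['patient_id'],   # assumed key name for the feature list
        'filter_mode': 'remove',      # assumed key name; 'retain' or 'remove'
    },
    'label_name': 'toxicity_grade',
    'label_bounds': [1, 4],
    'time_variable_name': 'days_after_rt',
    'label_viewpoint': 'long-term',
    'tune_splits': 5,
    'oof_splits': 5,
}

# Sanity checks that mirror the documented types and value sets
allowed_viewpoints = {'early', 'late', 'long-term', 'longitudinal', 'profile'}
assert generator_arguments['label_viewpoint'] in allowed_viewpoints
assert generator_arguments['feature_filter']['filter_mode'] in {'retain', 'remove'}
assert isinstance(generator_arguments['tune_splits'], int)
assert isinstance(generator_arguments['oof_splits'], int)
```

With the package installed, such a dictionary could presumably be unpacked into the constructor as `TabularDataGenerator(**generator_arguments)`, since the keys match the documented parameter names.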

model_label

See ‘Parameters’.

Type: str

feature_filter

See ‘Parameters’.

Type: dict

label_name

See ‘Parameters’.

Type: str

label_bounds

See ‘Parameters’.

Type: list

time_variable_name

See ‘Parameters’.

Type: str

label_viewpoint

See ‘Parameters’.

Type: {'early', 'late', 'long-term', 'longitudinal', 'profile'}

tune_splits

See ‘Parameters’.

Type: int

oof_splits

See ‘Parameters’.

Type: int

Overview

Methods
`generate`(data_path)	Generate the data information.
`decompose`(dataset, feature_filter, label_name, time_variable_name)	Decompose the base tabular dataset.
`modulate`(data_information, label_viewpoint)	Modulate the data information.
`binarize`(data_information, label_bounds)	Binarize the data information.
`add_fold_numbers`(data_information, tune_splits, oof_splits)	Add the stratified cross-validation fold numbers.

Members

generate(data_path)

Generate the data information.

Parameters: data_path (str) – Path to the dataset used for fitting the machine learning model.
Returns: Dictionary with the decomposed, modulated and binarized data information.
Return type: dict

decompose(dataset, feature_filter, label_name, time_variable_name)

Decompose the base tabular dataset.

Parameters:

dataset (DataFrame) – Dataframe with the feature and label names/values.
feature_filter (dict) – Dictionary with a list of feature names and a value from {'retain', 'remove'} that indicates whether the listed features are retained or removed prior to model fitting.
label_name (str) – Name of the label variable.
time_variable_name (str) – Name of the time-after-radiotherapy variable (expected unit: days).

Returns:

Dictionary with the decomposed data information.

Return type:

dict
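To make the decomposition step more concrete, here is a minimal sketch of splitting a table into feature, label and time components. It is not the library's implementation: a plain dictionary of columns stands in for the DataFrame, and the function name `decompose_columns`, the filter keys `'features'` and `'filter_mode'`, and the keys of the returned dictionary are all assumptions for illustration.

```python
def decompose_columns(columns, feature_filter, label_name, time_variable_name):
    """Split a column-oriented table into features, labels and times."""
    # Assumed key names of the filter dictionary
    names = feature_filter['features']
    mode = feature_filter['filter_mode']

    # Label and time columns are never treated as features
    candidates = [name for name in columns
                  if name not in (label_name, time_variable_name)]

    if mode == 'retain':
        feature_names = [name for name in candidates if name in names]
    elif mode == 'remove':
        feature_names = [name for name in candidates if name not in names]
    else:
        raise ValueError(f"Unknown filter mode: {mode!r}")

    # Transpose the selected columns into one row of feature values per sample
    feature_values = [list(row)
                      for row in zip(*(columns[name] for name in feature_names))]

    return {
        'feature_names': feature_names,
        'feature_values': feature_values,
        'label_values': list(columns[label_name]),
        'time_values': list(columns[time_variable_name]),
    }


# Hypothetical example table with three samples
table = {
    'age': [61, 54, 70],
    'mean_dose': [22.5, 30.1, 18.0],
    'patient_id': [1, 2, 3],
    'grade': [0, 2, 1],
    'days': [30, 200, 400],
}
info = decompose_columns(
    table, {'features': ['patient_id'], 'filter_mode': 'remove'},
    'grade', 'days')
print(info['feature_names'])  # ['age', 'mean_dose']
```

In 'remove' mode the listed features are dropped and everything else is kept; in 'retain' mode only the listed features survive.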

modulate(data_information, label_viewpoint)

Modulate the data information.

Parameters:

data_information (dict) – Dictionary with the decomposed data information.
label_viewpoint ({'early', 'late', 'long-term', 'longitudinal', 'profile'}) – Time of observation for the presence of tumor control and/or normal tissue complication events.

Returns:

Dictionary with the modulated data information.

Return type:

dict

binarize(data_information, label_bounds)

Binarize the data information.

Parameters:

data_information (dict) – Dictionary with the modulated data information.
label_bounds (list) – Bounds used to binarize the label values: a value inside the bounds is assigned to the positive class, a value outside the bounds to the negative class.

Returns:

Dictionary with the binarized data information.

Return type:

dict
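The binarization rule can be sketched in a few lines. This is an illustration rather than the library's code: the helper name `binarize_labels` is hypothetical, and treating both bounds as inclusive is an assumption, since the reference does not state how values exactly on a bound are handled.

```python
def binarize_labels(label_values, label_bounds):
    """Map each label to 1 (inside the bounds) or 0 (outside the bounds)."""
    lower, upper = label_bounds
    # Assumption: both bounds are inclusive
    return [1 if lower <= value <= upper else 0 for value in label_values]


binary_labels = binarize_labels([0, 1, 2, 3, 5], [1, 3])
print(binary_labels)  # [0, 1, 1, 1, 0]
```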

add_fold_numbers(data_information, tune_splits, oof_splits)

Add the stratified cross-validation fold numbers.

Parameters:

data_information (dict) – Dictionary with the preprocessed data information.
tune_splits (int) – Number of splits for the stratified cross-validation within each model hyperparameter optimization step.
oof_splits (int) – Number of splits for the stratified cross-validation within the out-of-folds model evaluation step.

Returns:

Dictionary with the stratified cross-validation fold numbers.

Return type:

dict
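One simple way to picture stratified fold numbers is a round-robin assignment within each class, so that every fold receives roughly the same class proportions. The sketch below is deterministic and unshuffled. It is not taken from the library, which may well shuffle the samples or rely on an existing cross-validation utility. The function name `stratified_fold_numbers` and the dictionary keys `'tune_folds'` and `'oof_folds'` are hypothetical.

```python
from collections import defaultdict


def stratified_fold_numbers(labels, n_splits):
    """Assign a fold number in [0, n_splits) to each sample, class by class."""
    # Group the sample indices by their class label
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    # Deal the samples of each class to the folds in round-robin order
    fold_numbers = [None] * len(labels)
    for indices in indices_by_class.values():
        for position, index in enumerate(indices):
            fold_numbers[index] = position % n_splits
    return fold_numbers


binary_labels = [0, 0, 0, 0, 1, 1, 1, 1]
fold_information = {
    'tune_folds': stratified_fold_numbers(binary_labels, n_splits=2),
    'oof_folds': stratified_fold_numbers(binary_labels, n_splits=4),
}
print(fold_information['tune_folds'])  # [0, 1, 0, 1, 0, 1, 0, 1]
```

With two splits, each fold here ends up with two negative and two positive samples, which is the property stratification is meant to preserve.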