socube.data package
Submodules
socube.data.loading module
-
class
socube.data.loading.
DatasetBase
(labels: pandas.core.frame.DataFrame, shuffle: bool = False, seed: int = None, k: int = 5, task_type: str = 'classify')¶ Bases:
torch.utils.data.dataset.Dataset
Abstract base class for datasets. All SoCube extended datasets must inherit and implement its abstract interface.
- Parameters
labels (pd.DataFrame) – Dataframe containing labels for each sample.
shuffle (bool, default False) – Whether to shuffle each class’s samples before splitting into batches. Note that the samples within each split will not be shuffled.
seed (int, default None) – Random seed for shuffling.
k (int, default 5) – Number of folds for k-fold cross-validation.
task_type (str, default "classify") – Type of task. Must be one of “classify”, “regress”.
-
property
kFold
¶ Get generator for k-fold cross-validation dataset
- Returns
kFold – A generator over k-fold cross-validation splits. Each iteration yields a tuple of two Subset objects: one for training and one for validation.
- Return type
generator
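The splitting behaviour described for kFold can be illustrated with a minimal, standard-library sketch. The function name k_fold_indices is hypothetical and not part of SoCube; it works on plain index lists rather than Subset objects, and it does not stratify by class as the shuffle parameter above suggests the real implementation may do.

```python
import random
from typing import Iterator, List, Optional, Tuple


def k_fold_indices(
    n_samples: int, k: int = 5, shuffle: bool = False, seed: Optional[int] = None
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, valid_indices) for each of the k folds.

    Hypothetical helper for illustration only; not SoCube's implementation.
    """
    indices = list(range(n_samples))
    if shuffle:
        # A dedicated Random instance keeps the shuffle reproducible per seed.
        random.Random(seed).shuffle(indices)
    # Fold i takes every k-th index starting at i, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        valid = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, valid


splits = list(k_fold_indices(10, k=5))
```

Each sample index appears in exactly one validation fold and in the training part of every other fold.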
-
abstract
sampler
(subset: torch.utils.data.dataset.Subset) → torch.utils.data.sampler.Sampler¶ Abstract method for sampling a subset of this dataset.
- Parameters
subset (Subset) – A subset of this dataset.
- Returns
sampler – A sampler for the subset.
- Return type
Sampler
-
class
socube.data.loading.
ConvDatasetBase
(data_dir: str, labels: pandas.core.frame.DataFrame, transform: torch.nn.modules.module.Module = None, shuffle: bool = False, seed: int = None, k: int = 5, task_type: str = 'classify', use_index: bool = True)¶ Bases:
socube.data.loading.DatasetBase
Basic dataset designed for CNNs.
- Parameters
data_dir (str) – Path to the directory containing dataset.
labels (pd.DataFrame) – Dataframe containing labels for each sample.
transform (torch.nn.Module, default None) – Transform to apply to each sample.
shuffle (bool, default False) – Whether to shuffle each class’s samples before splitting into batches.
seed (int, default None) – Random seed for shuffling.
k (int, default 5) – Number of folds for k-fold cross-validation.
task_type (str, default "classify") – Type of task. Must be one of “classify”, “regress”.
use_index (bool, default True) – Whether to use the numeric index as the sample file name, such as “0.npy”. If False, the sample name in the labels is used as the file name instead, such as “sample_name.npy”.
socube.data.preprocess module
-
socube.data.preprocess.
summary
(data: pandas.core.frame.DataFrame, axis: int = 1) → pandas.core.frame.DataFrame¶ Data summary for each column or row.
- Parameters
data (dataframe) – a dataframe with rows and columns
axis (int, default 1) – 0 to summarize each column, 1 to summarize each row
- Returns
- Return type
a dataframe with summary for each column or row
Examples
>>> import numpy as np
>>> import pandas as pd
>>> import socube.data.preprocess as pre
>>> data = pd.DataFrame(np.random.rand(10, 10))
>>> pre.summary(data)
-
socube.data.preprocess.
filterData
(data: pandas.core.frame.DataFrame, filtered_gene_prop: float = 0.05, filtered_cell_prop: float = 0.05, mini_expr: float = 0.05, mini_library_size: int = 1000) → pandas.core.frame.DataFrame¶ Remove the given proportions of genes and cells with the lowest variation, remove genes whose average expression is less than mini_expr, and remove cells whose library size is less than mini_library_size.
- Parameters
data (dataframe) – a dataframe in which each row is a gene and each column is a cell
filtered_gene_prop (float, default 0.05) – Proportion of genes with the lowest variation to remove
filtered_cell_prop (float, default 0.05) – Proportion of cells with the lowest variation to remove
mini_expr (float, default 0.05) – Remove genes whose average expression is less than mini_expr
mini_library_size (int, default 1000) – Remove cells whose library size is less than mini_library_size
- Returns
- Return type
a dataframe with filtered genes and cells
-
socube.data.preprocess.
minmax
(data: pandas.core.frame.DataFrame, range: Tuple[int] = (0, 1), flag: int = 0, dtype: str = 'float32') → pandas.core.frame.DataFrame¶ Perform maximum-minimum normalization
- Parameters
data (dataframe) – a dataframe in which each row is a sample and each column is a feature
range (tuple, default (0, 1)) – The minimum and maximum values of the normalized data; normalized to 0~1 by default
flag (int, default 0) – 0 to normalize each column, greater than 0 to normalize each row, less than 0 to normalize globally over the whole dataframe.
dtype (str, default "float32") – The data type of the normalized data
- Returns
- Return type
a dataframe with normalized data
Examples
>>> import numpy as np
>>> import pandas as pd
>>> import socube.data.preprocess as pre
>>> data = pd.DataFrame(np.random.rand(10, 10))
>>> pre.minmax(data)
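As a rough illustration of column-wise min-max normalization (the flag = 0 case), here is a standard-library sketch. The helper minmax_columns is hypothetical and operates on nested lists instead of a dataframe; mapping constant columns to the lower bound is an assumption of this sketch, not documented SoCube behaviour.

```python
from typing import List, Sequence, Tuple


def minmax_columns(
    rows: Sequence[Sequence[float]], value_range: Tuple[float, float] = (0.0, 1.0)
) -> List[List[float]]:
    """Rescale every column of a row-major table into value_range.

    Hypothetical helper for illustration only; not SoCube's implementation.
    """
    low, high = value_range
    scaled_columns = []
    for col in zip(*rows):  # iterate over columns
        col_min, col_max = min(col), max(col)
        span = col_max - col_min
        if span == 0:
            # Assumption: a constant column maps to the lower bound.
            scaled = [low] * len(col)
        else:
            scaled = [low + (v - col_min) / span * (high - low) for v in col]
        scaled_columns.append(scaled)
    # Transpose back to row-major order.
    return [list(row) for row in zip(*scaled_columns)]


scaled = minmax_columns([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
```

Normalizing by row or globally follows the same formula, with the minimum and maximum taken over a row or over the whole table.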
-
socube.data.preprocess.
std
(data: pandas.core.frame.DataFrame, horizontal: bool = False, dtype: str = 'float32', global_minmax: bool = False) → pandas.core.frame.DataFrame¶ Standardization of data
- Parameters
data (dataframe) – a dataframe in which each row is a sample and each column is a feature
horizontal (bool, default False) – If True, perform standardization horizontally (by row)
dtype (str, default "float32") – The data type of the standardized data
global_minmax (bool, default False) – If True, perform global standardization, otherwise standardization by row or column
- Returns
- Return type
a dataframe with standardized data
Examples
>>> import numpy as np
>>> import pandas as pd
>>> import socube.data.preprocess as pre
>>> data = pd.DataFrame(np.random.rand(10, 10))
>>> pre.std(data)
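Standardization is commonly understood as the z-score transform, (value - mean) / standard deviation. The sketch below applies it per column using only the standard library. The helper standardize_columns is hypothetical, and the use of the population standard deviation and the handling of constant columns are assumptions; SoCube's std may differ on both points.

```python
import statistics
from typing import List, Sequence


def standardize_columns(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Apply a z-score transform to every column of a row-major table.

    Hypothetical helper for illustration only; not SoCube's implementation.
    """
    scaled_columns = []
    for col in zip(*rows):  # iterate over columns
        mean = statistics.mean(col)
        spread = statistics.pstdev(col)  # assumption: population standard deviation
        if spread == 0:
            # Assumption: a constant column becomes all zeros.
            scaled = [0.0] * len(col)
        else:
            scaled = [(v - mean) / spread for v in col]
        scaled_columns.append(scaled)
    # Transpose back to row-major order.
    return [list(row) for row in zip(*scaled_columns)]


z = standardize_columns([[1.0, 5.0], [2.0, 5.0], [3.0, 5.0]])
```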
-
socube.data.preprocess.
cosineDistanceMatrix
(x1: torch.Tensor, x2: torch.Tensor = None, device_name: str = 'cpu') → torch.Tensor¶ Calculate the cosine distance matrix between the two sets of samples.
- Parameters
x1 (torch.Tensor) – a tensor of samples, with shape (n1, d)
x2 (torch.Tensor, default None) – a tensor of samples, with shape (n2, d), if None, x2 = x1
device_name (str, default "cpu") – the device used for accelerating the calculation such as “cpu”, “cuda:0”, “cuda:1”, etc.
- Returns
- Return type
a tensor of cosine distance matrix, with shape (n1, n2)
Examples
>>> import torch
>>> import socube.data.preprocess as pre
>>> x1 = torch.rand(10, 10)
>>> x2 = torch.rand(10, 10)
>>> pre.cosineDistanceMatrix(x1, x2)
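For reference, cosine distance is usually defined as 1 minus the cosine similarity of two vectors. The following standard-library sketch computes the (n1, n2) matrix under that definition. The helper cosine_distance_matrix is hypothetical; it works on nested lists rather than tensors and assumes no input vector is all zeros.

```python
import math
from typing import List, Optional, Sequence


def cosine_distance_matrix(
    x1: Sequence[Sequence[float]], x2: Optional[Sequence[Sequence[float]]] = None
) -> List[List[float]]:
    """Return the matrix of 1 - cosine similarity between rows of x1 and x2.

    Hypothetical helper for illustration only; not SoCube's implementation.
    Assumes every vector has a non-zero norm.
    """
    if x2 is None:
        x2 = x1  # distances within a single set of samples

    def norm(vector: Sequence[float]) -> float:
        return math.sqrt(sum(c * c for c in vector))

    matrix = []
    for a in x1:
        row = []
        for b in x2:
            dot = sum(p * q for p, q in zip(a, b))
            row.append(1.0 - dot / (norm(a) * norm(b)))
        matrix.append(row)
    return matrix


dist = cosine_distance_matrix([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [1.0, 1.0]])
```

Identical directions give a distance of 0 and orthogonal directions give a distance of 1.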
-
socube.data.preprocess.
scatterToGrid
(scatters2d: torch.Tensor, transform: Tuple[int] = (1, -1), device_name: str = 'cpu') → torch.Tensor¶ Map scattered coordinates to grid coordinates using the Jonker-Volgenant (J-V) linear assignment algorithm
- Parameters
scatters2d (torch.Tensor) – a tensor of scattered coordinates, with shape (n, 2)
transform (tuple, default (1, -1)) – the transformation of the coordinates, such as (1, -1). Greater than 0 means that the corresponding coordinates do not change direction, less than 0 means that the corresponding coordinates are reversed. For scatter and grid plots, the y-axis directions are often opposite, and for visual consistency, the y-axis needs to be transformed
device_name (str, default "cpu") – the device used for accelerating the calculation such as “cpu”, “cuda:0”, “cuda:1”, etc.
- Returns
- Return type
a tensor of grid coordinates, with shape (n, 2)
Examples
>>> import torch
>>> import socube.data.preprocess as pre
>>> scatters2d = torch.rand(10, 2)
>>> pre.scatterToGrid(scatters2d)
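The idea of assigning each scattered point to a distinct grid cell at minimum total cost can be shown on a tiny input. The sketch below is a brute-force stand-in that tries every assignment; it is not the J-V algorithm and is only practical for a handful of points. The helper name, the square grid layout, the squared-distance cost, and the assumption that inputs are already scaled to [0, 1] are all choices of this sketch, and the transform parameter is not modelled.

```python
import itertools
import math
from typing import List, Sequence, Tuple


def scatter_to_grid_bruteforce(
    points: Sequence[Tuple[float, float]]
) -> List[Tuple[int, int]]:
    """Assign each point (coordinates in [0, 1]) to a distinct grid cell.

    Hypothetical helper for illustration only. It solves the assignment
    problem by exhaustive search, which a J-V solver does far more efficiently.
    """
    n = len(points)
    side = math.ceil(math.sqrt(n))  # smallest square grid holding n points
    cells = [(x, y) for y in range(side) for x in range(side)]
    scale = max(side - 1, 1)  # maps cell indices onto [0, 1]

    def cost(point: Tuple[float, float], cell: Tuple[int, int]) -> float:
        return (point[0] - cell[0] / scale) ** 2 + (point[1] - cell[1] / scale) ** 2

    # Try every way of giving n distinct cells to the n points.
    best = min(
        itertools.permutations(cells, n),
        key=lambda assignment: sum(cost(p, c) for p, c in zip(points, assignment)),
    )
    return list(best)


grid = scatter_to_grid_bruteforce([(0.1, 0.1), (0.9, 0.2), (0.2, 0.8), (0.8, 0.9)])
```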
-
socube.data.preprocess.
umap2D
(data: pandas.core.frame.DataFrame, metric: str = 'correlation', neighbors: int = 5, seed: int = None) → pandas.core.frame.DataFrame¶ Reducing high-dimensional data to 2D using UMAP
- Parameters
data (pd.DataFrame) – a dataframe of high-dimensional data, with shape (n, d). n is the number of samples, d is the dimension of the data
metric (str, default 'correlation') – the metric used for calculating the distance between samples, such as ‘correlation’, ‘euclidean’, ‘manhattan’, etc.
neighbors (int, default 5) – the number of neighbors used for UMAP.
seed (int, default None) – the random seed used for UMAP.
- Returns
- Return type
a dataframe of two-dimensional data, with shape (n, 2)
Examples
>>> import numpy as np
>>> import pandas as pd
>>> import socube.data.preprocess as pre
>>> data = pd.DataFrame(np.random.rand(10, 10))
>>> pre.umap2D(data)
-
socube.data.preprocess.
tsne2D
(data: pandas.core.frame.DataFrame, metric: str = 'correlation', seed: int = None) → pandas.core.frame.DataFrame¶ Reducing high-dimensional data to 2D using t-SNE
- Parameters
data (pd.DataFrame) – a dataframe of high-dimensional data, with shape (n, d). n is the number of samples, d is the dimension of the data
metric (str, default 'correlation') – the metric used for calculating the distance between samples, such as ‘correlation’, ‘euclidean’, ‘manhattan’, etc.
seed (int, default None) – the random seed used for t-SNE.
- Returns
- Return type
a dataframe of two-dimensional data, with shape (n, 2)
Examples
>>> import numpy as np
>>> import pandas as pd
>>> import socube.data.preprocess as pre
>>> data = pd.DataFrame(np.random.rand(10, 10))
>>> pre.tsne2D(data)
-
socube.data.preprocess.
vec2Grid
(vector: numpy.ndarray, shuffle: bool = False, seed: int = None) → pandas.core.frame.DataFrame¶ Converts a one-dimensional vector to a two-dimensional grid
- Parameters
vector (np.ndarray) – a one-dimensional vector, with shape (n,). n is the number of samples
shuffle (bool, default False) – whether to shuffle the vector
seed (int, default None) – the random seed used for shuffling the vector
- Returns
a dataframe of two-dimensional data, with shape (n, 2),
each row represents a grid point with horizontal (x) and vertical (y) coordinates
Examples
>>> import numpy as np
>>> import socube.data.preprocess as pre
>>> vector = np.random.rand(10)
>>> pre.vec2Grid(vector)
-
socube.data.preprocess.
onehot
(label: numpy.ndarray, class_nums: int = None) → numpy.ndarray¶ Convert a 1D multi-class label vector (each element is a sample’s label) to a one-hot matrix. Labels must be integers.
- Parameters
label (np.ndarray) – a one-dimensional integer vector, with shape (n,). n is the number of samples
class_nums (int, default None) – the number of classes. If None, the number of classes is automatically determined.
- Returns
- Return type
a ndarray of onehot matrix with shape (n, class_nums)
Examples
>>> onehot(np.array([1,2,4]))
array([[0, 1, 0, 0, 0],
       [0, 0, 1, 0, 0],
       [0, 0, 0, 0, 1]], dtype=int32)
>>> onehot(np.array([1,2,4]), 6)
array([[0, 1, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 1, 0]], dtype=int32)
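The encoding shown in these examples can be reproduced with a short standard-library sketch. The helper one_hot is hypothetical and returns nested lists rather than an ndarray; inferring the class count as the largest label plus one is an assumption chosen to match the first example above.

```python
from typing import List, Optional, Sequence


def one_hot(labels: Sequence[int], class_nums: Optional[int] = None) -> List[List[int]]:
    """Encode integer labels as rows containing a single 1.

    Hypothetical helper for illustration only; not SoCube's implementation.
    """
    if class_nums is None:
        # Assumption: classes are numbered 0..max(labels).
        class_nums = max(labels) + 1
    matrix = []
    for label in labels:
        row = [0] * class_nums
        row[label] = 1  # mark the column matching this sample's label
        matrix.append(row)
    return matrix


encoded = one_hot([1, 2, 4])
```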
-
socube.data.preprocess.
items
(data: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame¶ Convert a dataframe to a long-format dataframe with three columns: row, col, and val
- Parameters
data (pd.DataFrame) –
- Returns
- Return type
a dataframe with three columns: row, col, and val
Examples
>>> import numpy as np
>>> import pandas as pd
>>> import socube.data.preprocess as pre
>>> data = pd.DataFrame(np.random.rand(10, 10))
>>> pre.items(data)
socube.data.visualize module
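The wide-to-long conversion into (row, col, val) records can be pictured with a small standard-library sketch. The helper to_items is hypothetical and uses nested lists with explicit row and column names; the row-major output order is an assumption, since the source does not state the order SoCube produces.

```python
from typing import List, Sequence, Tuple


def to_items(
    matrix: Sequence[Sequence[float]],
    row_names: Sequence[str],
    col_names: Sequence[str],
) -> List[Tuple[str, str, float]]:
    """Flatten a table into (row, col, val) records, one per cell.

    Hypothetical helper for illustration only; not SoCube's implementation.
    """
    return [
        (row_name, col_name, matrix[i][j])
        for i, row_name in enumerate(row_names)
        for j, col_name in enumerate(col_names)
    ]


records = to_items([[1.0, 2.0], [3.0, 4.0]], ["r0", "r1"], ["c0", "c1"])
```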
-
socube.data.visualize.
getHeatColor
(intensity: float, topColor: str, bottomColor: str = '#ffffff') → str¶ Calculation of heat map color values based on intensity values
- Parameters
intensity (float value) – color intensity values between 0 and 1
topColor (hexadecimal color string) – Color value when intensity value is 1
bottomColor (hexadecimal color string) – Color value when intensity value is 0, default is white
- Returns
- Return type
The hexadecimal color string corresponding to the intensity
Examples
>>> getHeatColor(0.5, "#ff0000")
'#ff7f7f'
-
socube.data.visualize.
convertHexToRGB
(hex_color: str) → Tuple[int]¶ Convert hexadecimal color strings to RGB tri-color integer tuples
- Parameters
hex_color (hexadecimal color string, such as '#ff0000') –
- Returns
- Return type
RGB tri-color integer tuples
Examples
>>> convertHexToRGB('#ff0000')
(255, 0, 0)
-
socube.data.visualize.
convertRGBToHex
(color: Tuple[int]) → str¶ Convert RGB tricolor integer tuple to hexadecimal color string
- Parameters
color (RGB tricolor integer tuple) – such as (255, 0, 0)
- Returns
- Return type
hexadecimal color string, such as ‘#ff0000’
Examples
>>> convertRGBToHex((255, 0, 0))
'#ff0000'
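The three color helpers above fit together naturally: a heat color can be obtained by converting both endpoint colors to RGB, interpolating each channel linearly, and converting back. The sketch below uses hypothetical snake_case names and is not SoCube's code; truncating each interpolated channel toward zero is an assumption chosen because it reproduces the '#ff7f7f' result shown for getHeatColor.

```python
from typing import Tuple


def hex_to_rgb(hex_color: str) -> Tuple[int, int, int]:
    """Parse '#rrggbb' into an (r, g, b) tuple of integers."""
    digits = hex_color.lstrip("#")
    return (int(digits[0:2], 16), int(digits[2:4], 16), int(digits[4:6], 16))


def rgb_to_hex(color: Tuple[int, int, int]) -> str:
    """Format an (r, g, b) tuple as a lowercase '#rrggbb' string."""
    return "#{:02x}{:02x}{:02x}".format(*color)


def heat_color(intensity: float, top_color: str, bottom_color: str = "#ffffff") -> str:
    """Linearly interpolate between bottom_color (0) and top_color (1).

    Hypothetical helper for illustration only; not SoCube's implementation.
    """
    top = hex_to_rgb(top_color)
    bottom = hex_to_rgb(bottom_color)
    # Assumption: channels are truncated, so 127.5 becomes 127 (0x7f).
    mixed = tuple(int(b + (t - b) * intensity) for t, b in zip(top, bottom))
    return rgb_to_hex(mixed)


mid_red = heat_color(0.5, "#ff0000")
```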
-
socube.data.visualize.
plotScatter
(data2d: pandas.core.frame.DataFrame, colormap: Dict[str, str], title: str, subtitle: str, filename: str = None, scatter_symbol: str = 'circle', width: int = 1000, height: int = 850, radius: int = 3, x_min: Optional[int] = None, y_min: Optional[int] = None, x_max: Optional[int] = None, y_max: Optional[int] = None, x_title: Optional[str] = None, y_title: Optional[str] = None)¶ Draw the scatter image of socube
- Parameters
data2d (pandas.DataFrame) – The data to be plotted, with columns x, y, label and subtype. If subtype is a float, it is regarded as an intensity value, and the color is calculated from that intensity.
colormap (Dict[str, str]) – The color map for the subtype, with the subtype name as key and a hexadecimal color string as value. If the subtype is a float, the colormap’s keys should be ‘0’ and ‘1’: ‘0’ for the low-intensity color and ‘1’ for the high-intensity color.
title (str) – The title of the plot
subtitle (str) – The subtitle of the plot
filename (str) – The filename of the plot. If None, the plot is not saved. The output format is HTML; the extension is added automatically, so do not include it.
width (int) – The width of the plot, unit is pixel
height (int) – The height of the plot, unit is pixel
radius (int) – The radius of the scatter point, unit is pixel
- Returns
- Return type
The plot object
Examples
>>> import pandas as pd
>>> data2d = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6], 'label': ['a', 'b', 'c'], 'subtype': [0.5, 0.7, 0.9]})
>>> colormap = {'0': '#ff0000', '1': '#00ff00'}
>>> plotScatter(data2d, colormap, 'title', 'subtitle', 'test')
-
socube.data.visualize.
plotGrid
(data2d: pandas.core.frame.DataFrame, colormap: Dict[str, str], shape: Tuple[int], title: str, subtitle: str, filename: str, width: int = 1000, height: int = 850) → highcharts.highcharts.highcharts.Highchart¶ Draw socube’s Grid image
- Parameters
data2d (pandas.DataFrame) – The data to be plotted, with columns x, y, label and subtype. If subtype is a float, it is regarded as an intensity value, and the color is calculated from that intensity.
colormap (Dict[str, str]) – The color map for the subtype, with the subtype name as key and a hexadecimal color string as value. If the subtype is a float, the colormap’s keys should be ‘0’ and ‘1’: ‘0’ for the low-intensity color and ‘1’ for the high-intensity color.
shape (Tuple[int]) – The shape of the grid, (row, col)
title (str) – The title of the plot
subtitle (str) – The subtitle of the plot
filename (str) – The filename of the plot. If None, the plot is not saved. The output format is HTML; the extension is added automatically, so do not include it.
width (int) – The width of the plot, unit is pixel
height (int) – The height of the plot, unit is pixel
- Returns
- Return type
The plot object
Examples
>>> import pandas as pd
>>> data2d = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6], 'label': ['a', 'b', 'c'], 'subtype': [0.5, 0.7, 0.9]})
>>> colormap = {'0': '#ff0000', '1': '#00ff00'}
>>> plotGrid(data2d, colormap, (6, 3), 'title', 'subtitle', 'test')
-
socube.data.visualize.
plotAUC
(data: Dict[str, Tuple[Sequence]], title: str, xlabel: str, ylabel: str, file: Optional[str] = None, slash: int = 0)¶ Plot an AUC curve
- Parameters
data (Dict[str, Tuple[Sequence]]) – The data to be plotted, with key as the name of the curve and value as the (x, y) data.
title (str) – The title of the plot
xlabel (str) – The xlabel of the plot
ylabel (str) – The ylabel of the plot
file (str) – The filename of the plot, if None, the plot will not be saved
slash (int) – If positive, a forward slash is plotted; if negative, a backslash is plotted.
Examples
>>> data = {'AUC': ([0.1, 0.2, 0.3], [0.1, 0.2, 0.3]), 'AUC2': ([0.1, 0.2, 0.3], [0.1, 0.2, 0.3])}
>>> plotAUC(data, 'AUC', 'xlabel', 'ylabel', 'auc.png')
Module contents
-
socube.data.
summary
(data: pandas.core.frame.DataFrame, axis: int = 1) → pandas.core.frame.DataFrame¶ Data summary for each column or row.
- Parameters
data (dataframe) – a dataframe with rows and columns
axis (int, default 1) – 0 to summarize each column, 1 to summarize each row
- Returns
- Return type
a dataframe with summary for each column or row
Examples
>>> import numpy as np
>>> import pandas as pd
>>> import socube.data.preprocess as pre
>>> data = pd.DataFrame(np.random.rand(10, 10))
>>> pre.summary(data)
-
socube.data.
filterData
(data: pandas.core.frame.DataFrame, filtered_gene_prop: float = 0.05, filtered_cell_prop: float = 0.05, mini_expr: float = 0.05, mini_library_size: int = 1000) → pandas.core.frame.DataFrame¶ Remove the given proportions of genes and cells with the lowest variation, remove genes whose average expression is less than mini_expr, and remove cells whose library size is less than mini_library_size.
- Parameters
data (dataframe) – a dataframe in which each row is a gene and each column is a cell
filtered_gene_prop (float, default 0.05) – Proportion of genes with the lowest variation to remove
filtered_cell_prop (float, default 0.05) – Proportion of cells with the lowest variation to remove
mini_expr (float, default 0.05) – Remove genes whose average expression is less than mini_expr
mini_library_size (int, default 1000) – Remove cells whose library size is less than mini_library_size
- Returns
- Return type
a dataframe with filtered genes and cells
-
socube.data.
minmax
(data: pandas.core.frame.DataFrame, range: Tuple[int] = (0, 1), flag: int = 0, dtype: str = 'float32') → pandas.core.frame.DataFrame¶ Perform maximum-minimum normalization
- Parameters
data (dataframe) – a dataframe in which each row is a sample and each column is a feature
range (tuple, default (0, 1)) – The minimum and maximum values of the normalized data; normalized to 0~1 by default
flag (int, default 0) – 0 to normalize each column, greater than 0 to normalize each row, less than 0 to normalize globally over the whole dataframe.
dtype (str, default "float32") – The data type of the normalized data
- Returns
- Return type
a dataframe with normalized data
Examples
>>> import numpy as np
>>> import pandas as pd
>>> import socube.data.preprocess as pre
>>> data = pd.DataFrame(np.random.rand(10, 10))
>>> pre.minmax(data)
-
socube.data.
std
(data: pandas.core.frame.DataFrame, horizontal: bool = False, dtype: str = 'float32', global_minmax: bool = False) → pandas.core.frame.DataFrame¶ Standardization of data
- Parameters
data (dataframe) – a dataframe in which each row is a sample and each column is a feature
horizontal (bool, default False) – If True, perform standardization horizontally (by row)
dtype (str, default "float32") – The data type of the standardized data
global_minmax (bool, default False) – If True, perform global standardization, otherwise standardization by row or column
- Returns
- Return type
a dataframe with standardized data
Examples
>>> import numpy as np
>>> import pandas as pd
>>> import socube.data.preprocess as pre
>>> data = pd.DataFrame(np.random.rand(10, 10))
>>> pre.std(data)
-
socube.data.
cosineDistanceMatrix
(x1: torch.Tensor, x2: torch.Tensor = None, device_name: str = 'cpu') → torch.Tensor¶ Calculate the cosine distance matrix between the two sets of samples.
- Parameters
x1 (torch.Tensor) – a tensor of samples, with shape (n1, d)
x2 (torch.Tensor, default None) – a tensor of samples, with shape (n2, d), if None, x2 = x1
device_name (str, default "cpu") – the device used for accelerating the calculation such as “cpu”, “cuda:0”, “cuda:1”, etc.
- Returns
- Return type
a tensor of cosine distance matrix, with shape (n1, n2)
Examples
>>> import torch
>>> import socube.data.preprocess as pre
>>> x1 = torch.rand(10, 10)
>>> x2 = torch.rand(10, 10)
>>> pre.cosineDistanceMatrix(x1, x2)
-
socube.data.
scatterToGrid
(scatters2d: torch.Tensor, transform: Tuple[int] = (1, -1), device_name: str = 'cpu') → torch.Tensor¶ Map scattered coordinates to grid coordinates using the Jonker-Volgenant (J-V) linear assignment algorithm
- Parameters
scatters2d (torch.Tensor) – a tensor of scattered coordinates, with shape (n, 2)
transform (tuple, default (1, -1)) – the transformation of the coordinates, such as (1, -1). Greater than 0 means that the corresponding coordinates do not change direction, less than 0 means that the corresponding coordinates are reversed. For scatter and grid plots, the y-axis directions are often opposite, and for visual consistency, the y-axis needs to be transformed
device_name (str, default "cpu") – the device used for accelerating the calculation such as “cpu”, “cuda:0”, “cuda:1”, etc.
- Returns
- Return type
a tensor of grid coordinates, with shape (n, 2)
Examples
>>> import torch
>>> import socube.data.preprocess as pre
>>> scatters2d = torch.rand(10, 2)
>>> pre.scatterToGrid(scatters2d)
-
socube.data.
umap2D
(data: pandas.core.frame.DataFrame, metric: str = 'correlation', neighbors: int = 5, seed: int = None) → pandas.core.frame.DataFrame¶ Reducing high-dimensional data to 2D using UMAP
- Parameters
data (pd.DataFrame) – a dataframe of high-dimensional data, with shape (n, d). n is the number of samples, d is the dimension of the data
metric (str, default 'correlation') – the metric used for calculating the distance between samples, such as ‘correlation’, ‘euclidean’, ‘manhattan’, etc.
neighbors (int, default 5) – the number of neighbors used for UMAP.
seed (int, default None) – the random seed used for UMAP.
- Returns
- Return type
a dataframe of two-dimensional data, with shape (n, 2)
Examples
>>> import numpy as np
>>> import pandas as pd
>>> import socube.data.preprocess as pre
>>> data = pd.DataFrame(np.random.rand(10, 10))
>>> pre.umap2D(data)
-
socube.data.
tsne2D
(data: pandas.core.frame.DataFrame, metric: str = 'correlation', seed: int = None) → pandas.core.frame.DataFrame¶ Reducing high-dimensional data to 2D using t-SNE
- Parameters
data (pd.DataFrame) – a dataframe of high-dimensional data, with shape (n, d). n is the number of samples, d is the dimension of the data
metric (str, default 'correlation') – the metric used for calculating the distance between samples, such as ‘correlation’, ‘euclidean’, ‘manhattan’, etc.
seed (int, default None) – the random seed used for t-SNE.
- Returns
- Return type
a dataframe of two-dimensional data, with shape (n, 2)
Examples
>>> import numpy as np
>>> import pandas as pd
>>> import socube.data.preprocess as pre
>>> data = pd.DataFrame(np.random.rand(10, 10))
>>> pre.tsne2D(data)
-
socube.data.
vec2Grid
(vector: numpy.ndarray, shuffle: bool = False, seed: int = None) → pandas.core.frame.DataFrame¶ Converts a one-dimensional vector to a two-dimensional grid
- Parameters
vector (np.ndarray) – a one-dimensional vector, with shape (n,). n is the number of samples
shuffle (bool, default False) – whether to shuffle the vector
seed (int, default None) – the random seed used for shuffling the vector
- Returns
a dataframe of two-dimensional data, with shape (n, 2),
each row represents a grid point with horizontal (x) and vertical (y) coordinates
Examples
>>> import numpy as np
>>> import socube.data.preprocess as pre
>>> vector = np.random.rand(10)
>>> pre.vec2Grid(vector)
-
socube.data.
onehot
(label: numpy.ndarray, class_nums: int = None) → numpy.ndarray¶ Convert a 1D multi-class label vector (each element is a sample’s label) to a one-hot matrix. Labels must be integers.
- Parameters
label (np.ndarray) – a one-dimensional integer vector, with shape (n,). n is the number of samples
class_nums (int, default None) – the number of classes. If None, the number of classes is automatically determined.
- Returns
- Return type
a ndarray of onehot matrix with shape (n, class_nums)
Examples
>>> onehot(np.array([1,2,4]))
array([[0, 1, 0, 0, 0],
       [0, 0, 1, 0, 0],
       [0, 0, 0, 0, 1]], dtype=int32)
>>> onehot(np.array([1,2,4]), 6)
array([[0, 1, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 1, 0]], dtype=int32)
-
socube.data.
items
(data: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame¶ Convert a dataframe to a long-format dataframe with three columns: row, col, and val
- Parameters
data (pd.DataFrame) –
- Returns
- Return type
a dataframe with three columns: row, col, and val
Examples
>>> import numpy as np
>>> import pandas as pd
>>> import socube.data.preprocess as pre
>>> data = pd.DataFrame(np.random.rand(10, 10))
>>> pre.items(data)
-
class
socube.data.
DatasetBase
(labels: pandas.core.frame.DataFrame, shuffle: bool = False, seed: int = None, k: int = 5, task_type: str = 'classify')¶ Bases:
torch.utils.data.dataset.Dataset
Abstract base class for datasets. All SoCube extended datasets must inherit and implement its abstract interface.
- Parameters
labels (pd.DataFrame) – Dataframe containing labels for each sample.
shuffle (bool, default False) – Whether to shuffle each class’s samples before splitting into batches. Note that the samples within each split will not be shuffled.
seed (int, default None) – Random seed for shuffling.
k (int, default 5) – Number of folds for k-fold cross-validation.
task_type (str, default "classify") – Type of task. Must be one of “classify”, “regress”.
-
property
kFold
¶ Get generator for k-fold cross-validation dataset
- Returns
kFold – A generator over k-fold cross-validation splits. Each iteration yields a tuple of two Subset objects: one for training and one for validation.
- Return type
generator
-
abstract
sampler
(subset: torch.utils.data.dataset.Subset) → torch.utils.data.sampler.Sampler¶ Abstract method for sampling a subset of this dataset.
- Parameters
subset (Subset) – A subset of this dataset.
- Returns
sampler – A sampler for the subset.
- Return type
Sampler
-
class
socube.data.
ConvDatasetBase
(data_dir: str, labels: pandas.core.frame.DataFrame, transform: torch.nn.modules.module.Module = None, shuffle: bool = False, seed: int = None, k: int = 5, task_type: str = 'classify', use_index: bool = True)¶ Bases:
socube.data.loading.DatasetBase
Basic dataset designed for CNNs.
- Parameters
data_dir (str) – Path to the directory containing dataset.
labels (pd.DataFrame) – Dataframe containing labels for each sample.
transform (torch.nn.Module, default None) – Transform to apply to each sample.
shuffle (bool, default False) – Whether to shuffle each class’s samples before splitting into batches.
seed (int, default None) – Random seed for shuffling.
k (int, default 5) – Number of folds for k-fold cross-validation.
task_type (str, default "classify") – Type of task. Must be one of “classify”, “regress”.
use_index (bool, default True) – Whether to use the numeric index as the sample file name, such as “0.npy”. If False, the sample name in the labels is used as the file name instead, such as “sample_name.npy”.
-
socube.data.
getHeatColor
(intensity: float, topColor: str, bottomColor: str = '#ffffff') → str¶ Calculation of heat map color values based on intensity values
- Parameters
intensity (float value) – color intensity values between 0 and 1
topColor (hexadecimal color string) – Color value when intensity value is 1
bottomColor (hexadecimal color string) – Color value when intensity value is 0, default is white
- Returns
- Return type
The hexadecimal color string corresponding to the intensity
Examples
>>> getHeatColor(0.5, "#ff0000")
'#ff7f7f'
-
socube.data.
convertHexToRGB
(hex_color: str) → Tuple[int]¶ Convert hexadecimal color strings to RGB tri-color integer tuples
- Parameters
hex_color (hexadecimal color string, such as '#ff0000') –
- Returns
- Return type
RGB tri-color integer tuples
Examples
>>> convertHexToRGB('#ff0000')
(255, 0, 0)
-
socube.data.
convertRGBToHex
(color: Tuple[int]) → str¶ Convert RGB tricolor integer tuple to hexadecimal color string
- Parameters
color (RGB tricolor integer tuple) – such as (255, 0, 0)
- Returns
- Return type
hexadecimal color string, such as ‘#ff0000’
Examples
>>> convertRGBToHex((255, 0, 0))
'#ff0000'
-
socube.data.
plotScatter
(data2d: pandas.core.frame.DataFrame, colormap: Dict[str, str], title: str, subtitle: str, filename: str = None, scatter_symbol: str = 'circle', width: int = 1000, height: int = 850, radius: int = 3, x_min: Optional[int] = None, y_min: Optional[int] = None, x_max: Optional[int] = None, y_max: Optional[int] = None, x_title: Optional[str] = None, y_title: Optional[str] = None)¶ Draw the scatter image of socube
- Parameters
data2d (pandas.DataFrame) – The data to be plotted, with columns x, y, label and subtype. If subtype is a float, it is regarded as an intensity value, and the color is calculated from that intensity.
colormap (Dict[str, str]) – The color map for the subtype, with the subtype name as key and a hexadecimal color string as value. If the subtype is a float, the colormap’s keys should be ‘0’ and ‘1’: ‘0’ for the low-intensity color and ‘1’ for the high-intensity color.
title (str) – The title of the plot
subtitle (str) – The subtitle of the plot
filename (str) – The filename of the plot. If None, the plot is not saved. The output format is HTML; the extension is added automatically, so do not include it.
width (int) – The width of the plot, unit is pixel
height (int) – The height of the plot, unit is pixel
radius (int) – The radius of the scatter point, unit is pixel
- Returns
- Return type
The plot object
Examples
>>> import pandas as pd
>>> data2d = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6], 'label': ['a', 'b', 'c'], 'subtype': [0.5, 0.7, 0.9]})
>>> colormap = {'0': '#ff0000', '1': '#00ff00'}
>>> plotScatter(data2d, colormap, 'title', 'subtitle', 'test')
-
socube.data.
plotGrid
(data2d: pandas.core.frame.DataFrame, colormap: Dict[str, str], shape: Tuple[int], title: str, subtitle: str, filename: str, width: int = 1000, height: int = 850) → highcharts.highcharts.highcharts.Highchart¶ Draw socube’s Grid image
- Parameters
data2d (pandas.DataFrame) – The data to be plotted, with columns x, y, label and subtype. If subtype is a float, it is regarded as an intensity value, and the color is calculated from that intensity.
colormap (Dict[str, str]) – The color map for the subtype, with the subtype name as key and a hexadecimal color string as value. If the subtype is a float, the colormap’s keys should be ‘0’ and ‘1’: ‘0’ for the low-intensity color and ‘1’ for the high-intensity color.
shape (Tuple[int]) – The shape of the grid, (row, col)
title (str) – The title of the plot
subtitle (str) – The subtitle of the plot
filename (str) – The filename of the plot. If None, the plot is not saved. The output format is HTML; the extension is added automatically, so do not include it.
width (int) – The width of the plot, unit is pixel
height (int) – The height of the plot, unit is pixel
- Returns
- Return type
The plot object
Examples
>>> import pandas as pd
>>> data2d = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6], 'label': ['a', 'b', 'c'], 'subtype': [0.5, 0.7, 0.9]})
>>> colormap = {'0': '#ff0000', '1': '#00ff00'}
>>> plotGrid(data2d, colormap, (6, 3), 'title', 'subtitle', 'test')
-
socube.data.
plotAUC
(data: Dict[str, Tuple[Sequence]], title: str, xlabel: str, ylabel: str, file: Optional[str] = None, slash: int = 0)¶ Plot an AUC curve
- Parameters
data (Dict[str, Tuple[Sequence]]) – The data to be plotted, with key as the name of the curve and value as the (x, y) data.
title (str) – The title of the plot
xlabel (str) – The xlabel of the plot
ylabel (str) – The ylabel of the plot
file (str) – The filename of the plot, if None, the plot will not be saved
slash (int) – If positive, a forward slash is plotted; if negative, a backslash is plotted.
Examples
>>> data = {'AUC': ([0.1, 0.2, 0.3], [0.1, 0.2, 0.3]), 'AUC2': ([0.1, 0.2, 0.3], [0.1, 0.2, 0.3])}
>>> plotAUC(data, 'AUC', 'xlabel', 'ylabel', 'auc.png')