socube.data package
Submodules
socube.data.loading module
- 
class socube.data.loading.DatasetBase(labels: pandas.core.frame.DataFrame, shuffle: bool = False, seed: int = None, k: int = 5, task_type: str = 'classify')¶
- Bases: torch.utils.data.dataset.Dataset - Abstract base class for datasets. All SoCube dataset classes must inherit from it and implement its abstract interface. - Parameters
- labels (pd.DataFrame) – Dataframe containing labels for each sample. 
- shuffle (bool, default False) – Whether to shuffle each class’s samples before splitting into batches. Note that the samples within each split will not be shuffled. 
- seed (int, default None) – Random seed for shuffling. 
- k (int, default 5) – Number of folds for k-fold cross-validation. 
- task_type (str, default "classify") – Type of task. Must be one of “classify”, “regress”. 
 
 - 
property kFold¶
- Get a generator over the k-fold cross-validation splits. - Returns
- kFold – A generator over the k-fold cross-validation splits. Each iteration yields a tuple of two Subset objects, one for training and one for validation. 
- Return type
- generator 
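The splitting behind a property like kFold can be sketched in plain Python. This is only an illustration under stated assumptions: the helper name k_fold_indices is hypothetical, it yields index lists rather than Subset objects, and the round-robin stratification is an assumed strategy, not necessarily the one SoCube uses.

```python
import random
from collections import defaultdict


def k_fold_indices(labels, k=5, shuffle=False, seed=None):
    """Yield (train_idx, valid_idx) pairs for stratified k-fold splitting."""
    # Group sample indices by class so every fold keeps the class balance.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        if shuffle:
            rng.shuffle(indices)
        # Deal each class's samples round-robin across the k folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)

    for i in range(k):
        valid = sorted(folds[i])
        train = sorted(j for n, fold in enumerate(folds) if n != i for j in fold)
        yield train, valid


labels = [0, 0, 0, 0, 1, 1, 1, 1]
splits = list(k_fold_indices(labels, k=2, shuffle=True, seed=0))
```

Each fold serves once as the validation set while the remaining folds form the training set, so every sample is validated exactly once.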
 
 - 
abstract sampler(subset: torch.utils.data.dataset.Subset) → torch.utils.data.sampler.Sampler¶
- Abstract method for sampling a subset of this dataset. - Parameters
- subset (Subset) – A subset of this dataset. 
- Returns
- sampler – A sampler for the subset. 
- Return type
- Sampler 
 
 
- 
class socube.data.loading.ConvDatasetBase(data_dir: str, labels: pandas.core.frame.DataFrame, transform: torch.nn.modules.module.Module = None, shuffle: bool = False, seed: int = None, k: int = 5, task_type: str = 'classify', use_index: bool = True)¶
- Bases: socube.data.loading.DatasetBase - Basic dataset designed for CNN models. - Parameters
- data_dir (str) – Path to the directory containing dataset. 
- labels (pd.DataFrame) – Dataframe containing labels for each sample. 
- transform (torch.nn.Module, default None) – Transform to apply to each sample. 
- shuffle (bool, default False) – Whether to shuffle each class’s samples before splitting into batches. 
- seed (int, default None) – Random seed for shuffling. 
- k (int, default 5) – Number of folds for k-fold cross-validation. 
- task_type (str, default "classify") – Type of task. Must be one of “classify”, “regress”. 
- use_index (bool, default True) – Whether to use the numeric index as the sample file name, such as “0.npy”. If False, the sample name from the labels is used as the file name instead, such as “sample_name.npy”. 
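The effect of use_index on file lookup can be sketched as follows. The helper sample_path and the directory and sample names are hypothetical; only the “0.npy” versus “sample_name.npy” naming comes from the parameter description.

```python
import os


def sample_path(data_dir, index, sample_name, use_index=True):
    """Return the .npy path for one sample."""
    # Pick the numeric position or the sample's name as the file stem.
    stem = str(index) if use_index else str(sample_name)
    return os.path.join(data_dir, stem + ".npy")


by_index = sample_path("dataset", 0, "cell_A", use_index=True)
by_name = sample_path("dataset", 0, "cell_A", use_index=False)
```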
 
 
socube.data.preprocess module
- 
socube.data.preprocess.summary(data: pandas.core.frame.DataFrame, axis: int = 1) → pandas.core.frame.DataFrame¶
- Summarize the data for each column or row. - Parameters
- data (dataframe) – the dataframe to summarize 
- axis (int, default 1) – 0 to summarize each column, 1 to summarize each row 
 
- Returns
- Return type
- a dataframe with summary for each column or row 
 - Examples - >>> import numpy as np >>> import pandas as pd >>> import socube.data.preprocess as pre >>> data = pd.DataFrame(np.random.rand(10, 10)) >>> pre.summary(data) 
- 
socube.data.preprocess.filterData(data: pandas.core.frame.DataFrame, filtered_gene_prop: float = 0.05, filtered_cell_prop: float = 0.05, mini_expr: float = 0.05, mini_library_size: int = 1000) → pandas.core.frame.DataFrame¶
- Remove the given proportions of genes and cells with the lowest variation, and also remove genes whose average expression is less than mini_expr and cells whose library size is less than mini_library_size. - Parameters
- data (dataframe) – a dataframe whose rows are genes and whose columns are cells 
- filtered_gene_prop (float, default 0.05) – Proportion of genes with the lowest variation to remove 
- filtered_cell_prop (float, default 0.05) – Proportion of cells with the lowest variation to remove 
- mini_expr (float, default 0.05) – Remove genes whose average expression is less than mini_expr 
- mini_library_size (int, default 1000) – Remove cells whose library size is less than mini_library_size 
 
- Returns
- Return type
- a dataframe with filtered genes and cells 
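The filtering rules above can be sketched on a plain genes-by-cells list of lists. This is an illustration only: the helper filter_matrix is hypothetical, variance is assumed as the measure of variation, library size is assumed to be the column sum, and all criteria are evaluated in one pass on the unfiltered matrix, which may differ from the real implementation.

```python
from statistics import mean, pvariance


def filter_matrix(matrix, gene_prop=0.05, cell_prop=0.05,
                  mini_expr=0.05, mini_library_size=1000):
    """Return (kept_gene_rows, kept_cell_cols) for a genes x cells matrix."""
    n_genes, n_cells = len(matrix), len(matrix[0])
    columns = [[row[c] for row in matrix] for c in range(n_cells)]

    def lowest_variance(vectors, prop):
        # Indices of the `prop` fraction of vectors with the smallest variance.
        order = sorted(range(len(vectors)), key=lambda i: pvariance(vectors[i]))
        return set(order[:int(len(vectors) * prop)])

    drop_genes = lowest_variance(matrix, gene_prop)
    drop_cells = lowest_variance(columns, cell_prop)

    # Keep genes that are variable enough and expressed on average.
    genes = [g for g in range(n_genes)
             if g not in drop_genes and mean(matrix[g]) >= mini_expr]
    # Keep cells that are variable enough and have a large enough library.
    cells = [c for c in range(n_cells)
             if c not in drop_cells and sum(columns[c]) >= mini_library_size]
    return genes, cells


counts = [
    [500, 600, 1],   # gene 0
    [700, 500, 2],   # gene 1
    [0, 0, 0],       # gene 2: never expressed
]
genes, cells = filter_matrix(counts, gene_prop=0.0, cell_prop=0.0)
```

With both proportions set to 0, only the expression and library-size thresholds apply: the unexpressed gene and the nearly empty third cell are dropped.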
 
- 
socube.data.preprocess.minmax(data: pandas.core.frame.DataFrame, range: Tuple[int] = (0, 1), flag: int = 0, dtype: str = 'float32') → pandas.core.frame.DataFrame¶
- Perform min-max normalization - Parameters
- data (dataframe) – a dataframe whose rows are samples and whose columns are features 
- range (tuple, default (0, 1)) – The minimum and maximum values of the normalized data; data is normalized to 0~1 by default 
- flag (int, default 0) – 0 to normalize each column, greater than 0 to normalize each row, less than 0 to normalize over the whole dataframe. 
- dtype (str, default "float32") – The data type of the normalized data 
 
- Returns
- Return type
- a dataframe with normalized data 
 - Examples - >>> import numpy as np >>> import pandas as pd >>> import socube.data.preprocess as pre >>> data = pd.DataFrame(np.random.rand(10, 10)) >>> pre.minmax(data) 
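The three modes selected by flag can be sketched in plain Python on a list of rows. This sketch is standalone and does not call SoCube; the parameter name value_range and the handling of constant vectors (mapped to the lower bound) are assumptions.

```python
def minmax(rows, value_range=(0, 1), flag=0):
    """Min-max normalize a list of rows by column (0), row (>0) or globally (<0)."""
    low, high = value_range

    def scale(values):
        vmin, vmax = min(values), max(values)
        span = vmax - vmin
        # A constant vector has no spread; map it to the lower bound.
        if span == 0:
            return [float(low) for _ in values]
        return [low + (v - vmin) / span * (high - low) for v in values]

    if flag > 0:                      # each row on its own
        return [scale(row) for row in rows]
    if flag < 0:                      # one min/max for the whole table
        flat = scale([v for row in rows for v in row])
        width = len(rows[0])
        return [flat[i:i + width] for i in range(0, len(flat), width)]
    columns = [scale(list(col)) for col in zip(*rows)]   # each column on its own
    return [list(row) for row in zip(*columns)]


table = [[1, 10], [2, 20], [3, 40]]
by_column = minmax(table)
by_row = minmax(table, flag=1)
overall = minmax(table, flag=-1)
```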
- 
socube.data.preprocess.std(data: pandas.core.frame.DataFrame, horizontal: bool = False, dtype: str = 'float32', global_minmax: bool = False) → pandas.core.frame.DataFrame¶
- Standardize the data - Parameters
- data (dataframe) – a dataframe whose rows are samples and whose columns are features 
- horizontal (bool, default False) – If True, perform standardization horizontally 
- dtype (str, default "float32") – The data type of the standardized data 
- global_minmax (bool, default False) – If True, standardize globally over the whole dataframe; otherwise standardize by row or by column 
 
- Returns
- Return type
- a dataframe with standardized data 
 - Examples - >>> import numpy as np >>> import pandas as pd >>> import socube.data.preprocess as pre >>> data = pd.DataFrame(np.random.rand(10, 10)) >>> pre.std(data) 
- 
socube.data.preprocess.cosineDistanceMatrix(x1: torch.Tensor, x2: torch.Tensor = None, device_name: str = 'cpu') → torch.Tensor¶
- Calculate the cosine distance matrix between the two sets of samples. - Parameters
- x1 (torch.Tensor) – a tensor of samples, with shape (n1, d) 
- x2 (torch.Tensor, default None) – a tensor of samples, with shape (n2, d), if None, x2 = x1 
- device_name (str, default "cpu") – the device used for accelerating the calculation such as “cpu”, “cuda:0”, “cuda:1”, etc. 
 
- Returns
- Return type
- a tensor of cosine distance matrix, with shape (n1, n2) 
 - Examples - >>> import torch >>> import socube.data.preprocess as pre >>> x1 = torch.rand(10, 10) >>> x2 = torch.rand(10, 10) >>> pre.cosineDistanceMatrix(x1, x2) 
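As a rough illustration of the shape contract, a cosine distance matrix can be computed in plain Python, assuming the usual definition of cosine distance as 1 minus cosine similarity. The function below is a standalone sketch, does not handle zero vectors, and ignores device placement.

```python
import math


def cosine_distance_matrix(x1, x2=None):
    """Return the (n1, n2) matrix of 1 - cosine similarity between row vectors."""
    if x2 is None:
        x2 = x1  # compare the set with itself

    def norm(v):
        return math.sqrt(sum(a * a for a in v))

    return [
        [1.0 - sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v)) for v in x2]
        for u in x1
    ]


dist = cosine_distance_matrix([[1.0, 0.0], [0.0, 2.0]], [[3.0, 0.0], [1.0, 1.0]])
```

Parallel vectors give a distance of 0 regardless of their length, and orthogonal vectors give 1.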
- 
socube.data.preprocess.scatterToGrid(scatters2d: torch.Tensor, transform: Tuple[int] = (1, -1), device_name: str = 'cpu') → torch.Tensor¶
- Map scattered coordinates to grid coordinates using the Jonker-Volgenant (J-V) linear assignment algorithm - Parameters
- scatters2d (torch.Tensor) – a tensor of scattered coordinates, with shape (n, 2) 
- transform (tuple, default (1, -1)) – the transformation of the coordinates, such as (1, -1). A value greater than 0 keeps the direction of the corresponding axis; a value less than 0 reverses it. Scatter plots and grid plots often have opposite y-axis directions, so the y-axis is reversed by default for visual consistency 
- device_name (str, default "cpu") – the device used for accelerating the calculation such as “cpu”, “cuda:0”, “cuda:1”, etc. 
 
- Returns
- Return type
- a tensor of grid coordinates, with shape (n, 2) 
 - Examples - >>> import torch >>> import socube.data.preprocess as pre >>> scatters2d = torch.rand(10, 2) >>> pre.scatterToGrid(scatters2d) 
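The assignment problem behind this mapping can be illustrated on a tiny input. The sketch below deliberately swaps the J-V algorithm for an exhaustive search, which finds the same kind of minimum-cost one-to-one assignment but is feasible only for a handful of points. The function name, the side parameter, the squared-distance cost and the rescaling step are all assumptions made for illustration.

```python
from itertools import permutations, product


def scatter_to_grid(points, side, transform=(1, -1)):
    """Assign each 2D point to a distinct cell of a side x side grid."""
    # Flip axes whose transform entry is negative.
    flipped = [(x * (1 if transform[0] > 0 else -1),
                y * (1 if transform[1] > 0 else -1)) for x, y in points]

    def rescale(values):
        # Stretch one axis onto [0, side - 1] so points and cells are comparable.
        vmin, vmax = min(values), max(values)
        span = (vmax - vmin) or 1.0
        return [(v - vmin) / span * (side - 1) for v in values]

    xs = rescale([p[0] for p in flipped])
    ys = rescale([p[1] for p in flipped])
    cells = list(product(range(side), repeat=2))

    def cost(assignment):
        # Total squared distance between each point and its assigned cell.
        return sum((xs[i] - cx) ** 2 + (ys[i] - cy) ** 2
                   for i, (cx, cy) in enumerate(assignment))

    # Exhaustive search over cell choices; a stand-in for a real LAP solver.
    return list(min(permutations(cells, len(points)), key=cost))


points = [(0.1, 0.9), (0.9, 0.8), (0.2, 0.1), (0.8, 0.2)]
grid = scatter_to_grid(points, side=2)
```

Because the y-axis is reversed by the default transform, points near the top of the scatter plot land in the rows with the smallest grid y index.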
- 
socube.data.preprocess.umap2D(data: pandas.core.frame.DataFrame, metric: str = 'correlation', neighbors: int = 5, seed: int = None) → pandas.core.frame.DataFrame¶
- Reduce high-dimensional data to 2D using UMAP - Parameters
- data (pd.DataFrame) – a dataframe of high-dimensional data, with shape (n, d), where n is the number of samples and d is the dimension of the data 
- metric (str, default 'correlation') – the metric used to calculate the distance between samples, such as ‘correlation’, ‘euclidean’, ‘manhattan’, etc. 
- neighbors (int, default 5) – the number of neighbors used for UMAP. 
- seed (int, default None) – the random seed used for UMAP. 
 
- Returns
- Return type
- a dataframe of two-dimensional data, with shape (n, 2) 
 - Examples - >>> import numpy as np >>> import pandas as pd >>> import socube.data.preprocess as pre >>> data = pd.DataFrame(np.random.rand(10, 10)) >>> pre.umap2D(data) 
- 
socube.data.preprocess.tsne2D(data: pandas.core.frame.DataFrame, metric: str = 'correlation', seed: int = None) → pandas.core.frame.DataFrame¶
- Reduce high-dimensional data to 2D using t-SNE - Parameters
- data (pd.DataFrame) – a dataframe of high-dimensional data, with shape (n, d), where n is the number of samples and d is the dimension of the data 
- metric (str, default 'correlation') – the metric used to calculate the distance between samples, such as ‘correlation’, ‘euclidean’, ‘manhattan’, etc. 
- seed (int, default None) – the random seed used for t-SNE. 
 
- Returns
- Return type
- a dataframe of two-dimensional data, with shape (n, 2) 
 - Examples - >>> import numpy as np >>> import pandas as pd >>> import socube.data.preprocess as pre >>> data = pd.DataFrame(np.random.rand(10, 10)) >>> pre.tsne2D(data) 
- 
socube.data.preprocess.vec2Grid(vector: numpy.ndarray, shuffle: bool = False, seed: int = None) → pandas.core.frame.DataFrame¶
- Converts a one-dimensional vector to a two-dimensional grid - Parameters
- vector (np.ndarray) – a one-dimensional vector, with shape (n,). n is the number of samples 
- shuffle (bool, default False) – whether to shuffle the vector 
- seed (int, default None) – the random seed used for shuffling the vector 
 
- Returns
- a dataframe of two-dimensional data, with shape (n, 2), 
- each row represents a grid point with horizontal (x) and vertical (y) coordinates 
 
 - Examples - >>> import numpy as np >>> import socube.data.preprocess as pre >>> vector = np.random.rand(10) >>> pre.vec2Grid(vector) 
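One plausible reading of this conversion is that each of the n elements receives its own (x, y) cell in a near-square grid. The sketch below assumes exactly that, with row-major filling and a side length of ceil(sqrt(n)); the real layout rule is not stated in this reference, so treat the helper vec_to_grid as hypothetical.

```python
import math
import random


def vec_to_grid(n, shuffle=False, seed=None):
    """Return one (x, y) grid cell per element of a length-n vector."""
    side = math.ceil(math.sqrt(n))          # smallest square that holds n cells
    cells = [(i % side, i // side) for i in range(n)]
    if shuffle:
        random.Random(seed).shuffle(cells)  # randomize which element gets which cell
    return cells


coords = vec_to_grid(10)
```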
- 
socube.data.preprocess.onehot(label: numpy.ndarray, class_nums: int = None) → numpy.ndarray¶
- Convert a 1D label vector (each element is one sample’s label) to a one-hot matrix. Labels must be integers - Parameters
- label (np.ndarray) – a one-dimensional integer vector, with shape (n,). n is the number of samples 
- class_nums (int, default None) – the number of classes. If None, the number of classes is automatically determined. 
 
- Returns
- Return type
- an ndarray holding the one-hot matrix, with shape (n, class_nums) 
 - Examples - >>> onehot(np.array([1,2,4])) array([[0, 1, 0, 0, 0], [0, 0, 1, 0, 0], [0, 0, 0, 0, 1]], dtype=int32) - >>> onehot(np.array([1,2,4]), 6) array([[0, 1, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 1, 0]], dtype=int32) 
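The encoding shown in the examples above can be reproduced with a short pure-Python sketch. It returns nested lists instead of an ndarray and assumes that, when class_nums is None, the number of classes is the largest label plus one, which is consistent with the first example.

```python
def onehot(labels, class_nums=None):
    """Return one row per label with a single 1 at the label's index."""
    if class_nums is None:
        class_nums = max(labels) + 1   # labels are assumed to start at 0
    return [[1 if col == label else 0 for col in range(class_nums)]
            for label in labels]


inferred = onehot([1, 2, 4])
explicit = onehot([1, 2, 4], 6)
```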
- 
socube.data.preprocess.items(data: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame¶
- Convert a dataframe to a long-format dataframe with three columns: row, col and val - Parameters
- data (pd.DataFrame) – the dataframe to convert 
- Returns
- Return type
- a dataframe with row, col, val three columns 
 - Examples - >>> import numpy as np >>> import pandas as pd >>> import socube.data.preprocess as pre >>> data = pd.DataFrame(np.random.rand(10, 10)) >>> pre.items(data) 
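The wide-to-long conversion described above can be sketched without pandas. The helper takes the row and column names explicitly, and the row-major order of the output records is an assumption.

```python
def items(table, row_names, col_names):
    """Flatten a 2D table into (row, col, val) records."""
    return [(r, c, table[i][j])
            for i, r in enumerate(row_names)
            for j, c in enumerate(col_names)]


records = items([[1, 2], [3, 4]], ["g1", "g2"], ["c1", "c2"])
```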
socube.data.visualize module
- 
socube.data.visualize.getHeatColor(intensity: float, topColor: str, bottomColor: str = '#ffffff') → str¶
- Calculate the heat map color for a given intensity value - Parameters
- intensity (float value) – color intensity, between 0 and 1 
- topColor (hexadecimal color string) – Color used when the intensity is 1 
- bottomColor (hexadecimal color string) – Color used when the intensity is 0; the default is white 
 
- Returns
- Return type
- The hexadecimal color string corresponding to the intensity 
 - Examples - >>> getHeatColor(0.5, "#ff0000") '#ff7f7f' 
- 
socube.data.visualize.convertHexToRGB(hex_color: str) → Tuple[int]¶
- Convert a hexadecimal color string to a tuple of RGB integers - Parameters
- hex_color (hexadecimal color string, such as '#ff0000') – 
- Returns
- Return type
- a tuple of RGB integers 
 - Examples - >>> convertHexToRGB('#ff0000') (255, 0, 0) 
- 
socube.data.visualize.convertRGBToHex(color: Tuple[int]) → str¶
- Convert a tuple of RGB integers to a hexadecimal color string - Parameters
- color (tuple of RGB integers) – such as (255, 0, 0) 
- Returns
- Return type
- hexadecimal color string, such as ‘#ff0000’ 
 - Examples - >>> convertRGBToHex((255, 0, 0)) '#ff0000' 
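The three color helpers above fit together as in the following standalone sketch. The snake_case names are hypothetical, and the blend is assumed to be a per-channel linear interpolation with truncation to an integer, which reproduces the '#ff7f7f' result shown in the getHeatColor example.

```python
def hex_to_rgb(hex_color):
    """'#ff0000' -> (255, 0, 0)"""
    digits = hex_color.lstrip("#")
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))


def rgb_to_hex(color):
    """(255, 0, 0) -> '#ff0000'"""
    return "#" + "".join(f"{channel:02x}" for channel in color)


def heat_color(intensity, top_color, bottom_color="#ffffff"):
    """Blend linearly from bottom_color (intensity 0) to top_color (intensity 1)."""
    top, bottom = hex_to_rgb(top_color), hex_to_rgb(bottom_color)
    # int() truncates, so 127.5 becomes 127 (0x7f).
    blended = tuple(int(b + (t - b) * intensity) for t, b in zip(top, bottom))
    return rgb_to_hex(blended)


half_red = heat_color(0.5, "#ff0000")
```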
- 
socube.data.visualize.plotScatter(data2d: pandas.core.frame.DataFrame, colormap: Dict[str, str], title: str, subtitle: str, filename: str = None, scatter_symbol: str = 'circle', width: int = 1000, height: int = 850, radius: int = 3, x_min: Optional[int] = None, y_min: Optional[int] = None, x_max: Optional[int] = None, y_max: Optional[int] = None, x_title: Optional[str] = None, y_title: Optional[str] = None)¶
- Draw a SoCube scatter plot - Parameters
- data2d (pandas.DataFrame) – The data to be plotted, with the columns x, y, label and subtype. If subtype is a float, it is treated as an intensity value and the color is calculated from it. 
- colormap (Dict[str, str]) – The color map for the subtype, with the subtype name as key and a hexadecimal color string as value. If subtype is a float, the keys should be ‘0’ and ‘1’: ‘0’ for the low-intensity color and ‘1’ for the high-intensity color. 
- title (str) – The title of the plot 
- subtitle (str) – The subtitle of the plot 
- filename (str) – The filename of the plot. If None, the plot is not saved. The output format is HTML and the extension is added automatically, so do not include it. 
- width (int) – The width of the plot, in pixels 
- height (int) – The height of the plot, in pixels 
- radius (int) – The radius of each scatter point, in pixels 
 
- Returns
- Return type
- The plot object 
 - Examples - >>> import pandas as pd >>> data2d = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6], 'label': ['a', 'b', 'c'], 'subtype': [0.5, 0.7, 0.9]}) >>> colormap = {'0': '#ff0000', '1': '#00ff00'} >>> plotScatter(data2d, colormap, 'title', 'subtitle', 'test') 
- 
socube.data.visualize.plotGrid(data2d: pandas.core.frame.DataFrame, colormap: Dict[str, str], shape: Tuple[int], title: str, subtitle: str, filename: str, width: int = 1000, height: int = 850) → highcharts.highcharts.highcharts.Highchart¶
- Draw a SoCube grid plot - Parameters
- data2d (pandas.DataFrame) – The data to be plotted, with the columns x, y, label and subtype. If subtype is a float, it is treated as an intensity value and the color is calculated from it. 
- colormap (Dict[str, str]) – The color map for the subtype, with the subtype name as key and a hexadecimal color string as value. If subtype is a float, the keys should be ‘0’ and ‘1’: ‘0’ for the low-intensity color and ‘1’ for the high-intensity color. 
- shape (Tuple[int]) – The shape of the grid, (row, col) 
- title (str) – The title of the plot 
- subtitle (str) – The subtitle of the plot 
- filename (str) – The filename of the plot. If None, the plot is not saved. The output format is HTML and the extension is added automatically, so do not include it. 
- width (int) – The width of the plot, in pixels 
- height (int) – The height of the plot, in pixels 
 
- Returns
- Return type
- The plot object 
 - Examples - >>> import pandas as pd >>> data2d = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6], 'label': ['a', 'b', 'c'], 'subtype': [0.5, 0.7, 0.9]}) >>> colormap = {'0': '#ff0000', '1': '#00ff00'} >>> plotGrid(data2d, colormap, (6, 3), 'title', 'subtitle', 'test') 
- 
socube.data.visualize.plotAUC(data: Dict[str, Tuple[Sequence]], title: str, xlabel: str, ylabel: str, file: Optional[str] = None, slash: int = 0)¶
- Plot an AUC curve - Parameters
- data (Dict[str, Tuple[Sequence]]) – The data to be plotted, with key as the name of the curve and value as the (x, y) data. 
- title (str) – The title of the plot 
- xlabel (str) – The xlabel of the plot 
- ylabel (str) – The ylabel of the plot 
- file (str) – The filename of the plot, if None, the plot will not be saved 
- slash (int) – If slash is positive, a forward slash is plotted; if slash is negative, a backslash is plotted. 
 
 - Examples - >>> data = {'AUC': ([0.1, 0.2, 0.3], [0.1, 0.2, 0.3]), 'AUC2': ([0.1, 0.2, 0.3], [0.1, 0.2, 0.3])} >>> plotAUC(data, 'AUC', 'xlabel', 'ylabel', 'auc.png') 
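The data argument maps each curve name to an (x, y) pair of sequences. As a companion sketch, the area under each such curve can be estimated with the trapezoidal rule; the helper trapezoid_auc and the sample curves are hypothetical and are not part of plotAUC itself.

```python
def trapezoid_auc(x, y):
    """Area under a piecewise-linear curve given matching x and y sequences."""
    points = sorted(zip(x, y))
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Curve name -> (x values, y values), the layout described for `data`.
data = {
    "model": ([0.0, 0.5, 1.0], [0.0, 1.0, 1.0]),
    "chance": ([0.0, 1.0], [0.0, 1.0]),
}
areas = {name: trapezoid_auc(x, y) for name, (x, y) in data.items()}
```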
Module contents
- 
socube.data.summary(data: pandas.core.frame.DataFrame, axis: int = 1) → pandas.core.frame.DataFrame¶
- Summarize the data for each column or row. - Parameters
- data (dataframe) – the dataframe to summarize 
- axis (int, default 1) – 0 to summarize each column, 1 to summarize each row 
 
- Returns
- Return type
- a dataframe with summary for each column or row 
 - Examples - >>> import numpy as np >>> import pandas as pd >>> import socube.data.preprocess as pre >>> data = pd.DataFrame(np.random.rand(10, 10)) >>> pre.summary(data) 
- 
socube.data.filterData(data: pandas.core.frame.DataFrame, filtered_gene_prop: float = 0.05, filtered_cell_prop: float = 0.05, mini_expr: float = 0.05, mini_library_size: int = 1000) → pandas.core.frame.DataFrame¶
- Remove the given proportions of genes and cells with the lowest variation, and also remove genes whose average expression is less than mini_expr and cells whose library size is less than mini_library_size. - Parameters
- data (dataframe) – a dataframe whose rows are genes and whose columns are cells 
- filtered_gene_prop (float, default 0.05) – Proportion of genes with the lowest variation to remove 
- filtered_cell_prop (float, default 0.05) – Proportion of cells with the lowest variation to remove 
- mini_expr (float, default 0.05) – Remove genes whose average expression is less than mini_expr 
- mini_library_size (int, default 1000) – Remove cells whose library size is less than mini_library_size 
 
- Returns
- Return type
- a dataframe with filtered genes and cells 
 
- 
socube.data.minmax(data: pandas.core.frame.DataFrame, range: Tuple[int] = (0, 1), flag: int = 0, dtype: str = 'float32') → pandas.core.frame.DataFrame¶
- Perform min-max normalization - Parameters
- data (dataframe) – a dataframe whose rows are samples and whose columns are features 
- range (tuple, default (0, 1)) – The minimum and maximum values of the normalized data; data is normalized to 0~1 by default 
- flag (int, default 0) – 0 to normalize each column, greater than 0 to normalize each row, less than 0 to normalize over the whole dataframe. 
- dtype (str, default "float32") – The data type of the normalized data 
 
- Returns
- Return type
- a dataframe with normalized data 
 - Examples - >>> import numpy as np >>> import pandas as pd >>> import socube.data.preprocess as pre >>> data = pd.DataFrame(np.random.rand(10, 10)) >>> pre.minmax(data) 
- 
socube.data.std(data: pandas.core.frame.DataFrame, horizontal: bool = False, dtype: str = 'float32', global_minmax: bool = False) → pandas.core.frame.DataFrame¶
- Standardize the data - Parameters
- data (dataframe) – a dataframe whose rows are samples and whose columns are features 
- horizontal (bool, default False) – If True, perform standardization horizontally 
- dtype (str, default "float32") – The data type of the standardized data 
- global_minmax (bool, default False) – If True, standardize globally over the whole dataframe; otherwise standardize by row or by column 
 
- Returns
- Return type
- a dataframe with standardized data 
 - Examples - >>> import numpy as np >>> import pandas as pd >>> import socube.data.preprocess as pre >>> data = pd.DataFrame(np.random.rand(10, 10)) >>> pre.std(data) 
- 
socube.data.cosineDistanceMatrix(x1: torch.Tensor, x2: torch.Tensor = None, device_name: str = 'cpu') → torch.Tensor¶
- Calculate the cosine distance matrix between the two sets of samples. - Parameters
- x1 (torch.Tensor) – a tensor of samples, with shape (n1, d) 
- x2 (torch.Tensor, default None) – a tensor of samples, with shape (n2, d), if None, x2 = x1 
- device_name (str, default "cpu") – the device used for accelerating the calculation such as “cpu”, “cuda:0”, “cuda:1”, etc. 
 
- Returns
- Return type
- a tensor of cosine distance matrix, with shape (n1, n2) 
 - Examples - >>> import torch >>> import socube.data.preprocess as pre >>> x1 = torch.rand(10, 10) >>> x2 = torch.rand(10, 10) >>> pre.cosineDistanceMatrix(x1, x2) 
- 
socube.data.scatterToGrid(scatters2d: torch.Tensor, transform: Tuple[int] = (1, -1), device_name: str = 'cpu') → torch.Tensor¶
- Map scattered coordinates to grid coordinates using the Jonker-Volgenant (J-V) linear assignment algorithm - Parameters
- scatters2d (torch.Tensor) – a tensor of scattered coordinates, with shape (n, 2) 
- transform (tuple, default (1, -1)) – the transformation of the coordinates, such as (1, -1). A value greater than 0 keeps the direction of the corresponding axis; a value less than 0 reverses it. Scatter plots and grid plots often have opposite y-axis directions, so the y-axis is reversed by default for visual consistency 
- device_name (str, default "cpu") – the device used for accelerating the calculation such as “cpu”, “cuda:0”, “cuda:1”, etc. 
 
- Returns
- Return type
- a tensor of grid coordinates, with shape (n, 2) 
 - Examples - >>> import torch >>> import socube.data.preprocess as pre >>> scatters2d = torch.rand(10, 2) >>> pre.scatterToGrid(scatters2d) 
- 
socube.data.umap2D(data: pandas.core.frame.DataFrame, metric: str = 'correlation', neighbors: int = 5, seed: int = None) → pandas.core.frame.DataFrame¶
- Reduce high-dimensional data to 2D using UMAP - Parameters
- data (pd.DataFrame) – a dataframe of high-dimensional data, with shape (n, d), where n is the number of samples and d is the dimension of the data 
- metric (str, default 'correlation') – the metric used to calculate the distance between samples, such as ‘correlation’, ‘euclidean’, ‘manhattan’, etc. 
- neighbors (int, default 5) – the number of neighbors used for UMAP. 
- seed (int, default None) – the random seed used for UMAP. 
 
- Returns
- Return type
- a dataframe of two-dimensional data, with shape (n, 2) 
 - Examples - >>> import numpy as np >>> import pandas as pd >>> import socube.data.preprocess as pre >>> data = pd.DataFrame(np.random.rand(10, 10)) >>> pre.umap2D(data) 
- 
socube.data.tsne2D(data: pandas.core.frame.DataFrame, metric: str = 'correlation', seed: int = None) → pandas.core.frame.DataFrame¶
- Reduce high-dimensional data to 2D using t-SNE - Parameters
- data (pd.DataFrame) – a dataframe of high-dimensional data, with shape (n, d), where n is the number of samples and d is the dimension of the data 
- metric (str, default 'correlation') – the metric used to calculate the distance between samples, such as ‘correlation’, ‘euclidean’, ‘manhattan’, etc. 
- seed (int, default None) – the random seed used for t-SNE. 
 
- Returns
- Return type
- a dataframe of two-dimensional data, with shape (n, 2) 
 - Examples - >>> import numpy as np >>> import pandas as pd >>> import socube.data.preprocess as pre >>> data = pd.DataFrame(np.random.rand(10, 10)) >>> pre.tsne2D(data) 
- 
socube.data.vec2Grid(vector: numpy.ndarray, shuffle: bool = False, seed: int = None) → pandas.core.frame.DataFrame¶
- Converts a one-dimensional vector to a two-dimensional grid - Parameters
- vector (np.ndarray) – a one-dimensional vector, with shape (n,). n is the number of samples 
- shuffle (bool, default False) – whether to shuffle the vector 
- seed (int, default None) – the random seed used for shuffling the vector 
 
- Returns
- a dataframe of two-dimensional data, with shape (n, 2), 
- each row represents a grid point with horizontal (x) and vertical (y) coordinates 
 
 - Examples - >>> import numpy as np >>> import socube.data.preprocess as pre >>> vector = np.random.rand(10) >>> pre.vec2Grid(vector) 
- 
socube.data.onehot(label: numpy.ndarray, class_nums: int = None) → numpy.ndarray¶
- Convert a 1D label vector (each element is one sample’s label) to a one-hot matrix. Labels must be integers - Parameters
- label (np.ndarray) – a one-dimensional integer vector, with shape (n,). n is the number of samples 
- class_nums (int, default None) – the number of classes. If None, the number of classes is automatically determined. 
 
- Returns
- Return type
- an ndarray holding the one-hot matrix, with shape (n, class_nums) 
 - Examples - >>> onehot(np.array([1,2,4])) array([[0, 1, 0, 0, 0], [0, 0, 1, 0, 0], [0, 0, 0, 0, 1]], dtype=int32) - >>> onehot(np.array([1,2,4]), 6) array([[0, 1, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 1, 0]], dtype=int32) 
- 
socube.data.items(data: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame¶
- Convert a dataframe to a long-format dataframe with three columns: row, col and val - Parameters
- data (pd.DataFrame) – the dataframe to convert 
- Returns
- Return type
- a dataframe with row, col, val three columns 
 - Examples - >>> import numpy as np >>> import pandas as pd >>> import socube.data.preprocess as pre >>> data = pd.DataFrame(np.random.rand(10, 10)) >>> pre.items(data) 
- 
class socube.data.DatasetBase(labels: pandas.core.frame.DataFrame, shuffle: bool = False, seed: int = None, k: int = 5, task_type: str = 'classify')¶
- Bases: torch.utils.data.dataset.Dataset - Abstract base class for datasets. All SoCube dataset classes must inherit from it and implement its abstract interface. - Parameters
- labels (pd.DataFrame) – Dataframe containing labels for each sample. 
- shuffle (bool, default False) – Whether to shuffle each class’s samples before splitting into batches. Note that the samples within each split will not be shuffled. 
- seed (int, default None) – Random seed for shuffling. 
- k (int, default 5) – Number of folds for k-fold cross-validation. 
- task_type (str, default "classify") – Type of task. Must be one of “classify”, “regress”. 
 
 - 
property kFold¶
- Get a generator over the k-fold cross-validation splits. - Returns
- kFold – A generator over the k-fold cross-validation splits. Each iteration yields a tuple of two Subset objects, one for training and one for validation. 
- Return type
- generator 
 
 - 
abstract sampler(subset: torch.utils.data.dataset.Subset) → torch.utils.data.sampler.Sampler¶
- Abstract method for sampling a subset of this dataset. - Parameters
- subset (Subset) – A subset of this dataset. 
- Returns
- sampler – A sampler for the subset. 
- Return type
- Sampler 
 
 
- 
class socube.data.ConvDatasetBase(data_dir: str, labels: pandas.core.frame.DataFrame, transform: torch.nn.modules.module.Module = None, shuffle: bool = False, seed: int = None, k: int = 5, task_type: str = 'classify', use_index: bool = True)¶
- Bases: socube.data.loading.DatasetBase - Basic dataset designed for CNN models. - Parameters
- data_dir (str) – Path to the directory containing dataset. 
- labels (pd.DataFrame) – Dataframe containing labels for each sample. 
- transform (torch.nn.Module, default None) – Transform to apply to each sample. 
- shuffle (bool, default False) – Whether to shuffle each class’s samples before splitting into batches. 
- seed (int, default None) – Random seed for shuffling. 
- k (int, default 5) – Number of folds for k-fold cross-validation. 
- task_type (str, default "classify") – Type of task. Must be one of “classify”, “regress”. 
- use_index (bool, default True) – Whether to use the numeric index as the sample file name, such as “0.npy”. If False, the sample name from the labels is used as the file name instead, such as “sample_name.npy”. 
 
 
- 
socube.data.getHeatColor(intensity: float, topColor: str, bottomColor: str = '#ffffff') → str¶
- Calculate the heat map color for a given intensity value - Parameters
- intensity (float value) – color intensity, between 0 and 1 
- topColor (hexadecimal color string) – Color used when the intensity is 1 
- bottomColor (hexadecimal color string) – Color used when the intensity is 0; the default is white 
 
- Returns
- Return type
- The hexadecimal color string corresponding to the intensity 
 - Examples - >>> getHeatColor(0.5, "#ff0000") '#ff7f7f' 
- 
socube.data.convertHexToRGB(hex_color: str) → Tuple[int]¶
- Convert a hexadecimal color string to a tuple of RGB integers - Parameters
- hex_color (hexadecimal color string, such as '#ff0000') – 
- Returns
- Return type
- a tuple of RGB integers 
 - Examples - >>> convertHexToRGB('#ff0000') (255, 0, 0) 
- 
socube.data.convertRGBToHex(color: Tuple[int]) → str¶
- Convert a tuple of RGB integers to a hexadecimal color string - Parameters
- color (tuple of RGB integers) – such as (255, 0, 0) 
- Returns
- Return type
- hexadecimal color string, such as ‘#ff0000’ 
 - Examples - >>> convertRGBToHex((255, 0, 0)) '#ff0000' 
- 
socube.data.plotScatter(data2d: pandas.core.frame.DataFrame, colormap: Dict[str, str], title: str, subtitle: str, filename: str = None, scatter_symbol: str = 'circle', width: int = 1000, height: int = 850, radius: int = 3, x_min: Optional[int] = None, y_min: Optional[int] = None, x_max: Optional[int] = None, y_max: Optional[int] = None, x_title: Optional[str] = None, y_title: Optional[str] = None)¶
- Draw a SoCube scatter plot - Parameters
- data2d (pandas.DataFrame) – The data to be plotted, with the columns x, y, label and subtype. If subtype is a float, it is treated as an intensity value and the color is calculated from it. 
- colormap (Dict[str, str]) – The color map for the subtype, with the subtype name as key and a hexadecimal color string as value. If subtype is a float, the keys should be ‘0’ and ‘1’: ‘0’ for the low-intensity color and ‘1’ for the high-intensity color. 
- title (str) – The title of the plot 
- subtitle (str) – The subtitle of the plot 
- filename (str) – The filename of the plot. If None, the plot is not saved. The output format is HTML and the extension is added automatically, so do not include it. 
- width (int) – The width of the plot, in pixels 
- height (int) – The height of the plot, in pixels 
- radius (int) – The radius of each scatter point, in pixels 
 
- Returns
- Return type
- The plot object 
 - Examples - >>> import pandas as pd >>> data2d = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6], 'label': ['a', 'b', 'c'], 'subtype': [0.5, 0.7, 0.9]}) >>> colormap = {'0': '#ff0000', '1': '#00ff00'} >>> plotScatter(data2d, colormap, 'title', 'subtitle', 'test') 
- 
socube.data.plotGrid(data2d: pandas.core.frame.DataFrame, colormap: Dict[str, str], shape: Tuple[int], title: str, subtitle: str, filename: str, width: int = 1000, height: int = 850) → highcharts.highcharts.highcharts.Highchart¶
- Draw a SoCube grid plot - Parameters
- data2d (pandas.DataFrame) – The data to be plotted, with the columns x, y, label and subtype. If subtype is a float, it is treated as an intensity value and the color is calculated from it. 
- colormap (Dict[str, str]) – The color map for the subtype, with the subtype name as key and a hexadecimal color string as value. If subtype is a float, the keys should be ‘0’ and ‘1’: ‘0’ for the low-intensity color and ‘1’ for the high-intensity color. 
- shape (Tuple[int]) – The shape of the grid, (row, col) 
- title (str) – The title of the plot 
- subtitle (str) – The subtitle of the plot 
- filename (str) – The filename of the plot. If None, the plot is not saved. The output format is HTML and the extension is added automatically, so do not include it. 
- width (int) – The width of the plot, in pixels 
- height (int) – The height of the plot, in pixels 
 
- Returns
- Return type
- The plot object 
 - Examples - >>> import pandas as pd >>> data2d = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6], 'label': ['a', 'b', 'c'], 'subtype': [0.5, 0.7, 0.9]}) >>> colormap = {'0': '#ff0000', '1': '#00ff00'} >>> plotGrid(data2d, colormap, (6, 3), 'title', 'subtitle', 'test') 
- 
socube.data.plotAUC(data: Dict[str, Tuple[Sequence]], title: str, xlabel: str, ylabel: str, file: Optional[str] = None, slash: int = 0)¶
- Plot an AUC curve - Parameters
- data (Dict[str, Tuple[Sequence]]) – The data to be plotted, with key as the name of the curve and value as the (x, y) data. 
- title (str) – The title of the plot 
- xlabel (str) – The xlabel of the plot 
- ylabel (str) – The ylabel of the plot 
- file (str) – The filename of the plot, if None, the plot will not be saved 
- slash (int) – If slash is positive, a forward slash is plotted; if slash is negative, a backslash is plotted. 
 
 - Examples - >>> data = {'AUC': ([0.1, 0.2, 0.3], [0.1, 0.2, 0.3]), 'AUC2': ([0.1, 0.2, 0.3], [0.1, 0.2, 0.3])} >>> plotAUC(data, 'AUC', 'xlabel', 'ylabel', 'auc.png')