socube.task.doublet package
Submodules
socube.task.doublet.data module
class socube.task.doublet.data.ConvClassifyDataset(data_dir: str, transform: torch.nn.modules.module.Module = None, labels: str = 'label.csv', shuffle: bool = False, seed: int = None, k: int = 5, use_index: bool = True)
Bases: socube.data.loading.ConvDatasetBase
Dataset class for SoCube.
- Parameters
data_dir (string) – the dataset’s directory
transform (torch module) – sample transform, such as Resize
labels (string) – the label file csv name
shuffle (boolean value) – if True, data will be shuffled during k-fold cross-validation
seed (integer scalar value) – random seed for k-fold cross-validation or sampling
k (integer scalar value) – k value of k-fold cross-validation
use_index (boolean value) – if True, sample files are read by index; otherwise they are read by sample name
sampler(subset: torch.utils.data.dataset.Subset) → torch.utils.data.sampler.WeightedRandomSampler
Generate a weighted random sampler for a subset of this dataset.
- Parameters
subset (Subset) – the subset of this dataset
- Returns
A weighted random sampler for the subset.
property typeCounts
Number of samples of each type in the dataset.
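A minimal usage sketch (not part of the original docstrings); the data directory and the fold indices below are placeholder assumptions:
>>> from torch.utils.data import DataLoader, Subset
>>> from socube.task.doublet.data import ConvClassifyDataset
>>> dataset = ConvClassifyDataset("path/to/dataset", labels="label.csv", shuffle=True, seed=42, k=5)  # hypothetical directory
>>> counts = dataset.typeCounts  # number of samples of each type
>>> subset = Subset(dataset, list(range(100)))  # e.g. the indices of one fold
>>> loader = DataLoader(subset, batch_size=32, sampler=dataset.sampler(subset))  # class-balanced batches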
socube.task.doublet.data.generateDoublet(samples: pandas.core.frame.DataFrame, ratio: float = 1.0, adj: float = 1.0, seed: Optional[int] = None, size: Optional[int] = None, mode: Optional[str] = 'balance') → Tuple[pandas.core.frame.DataFrame]
Generate a training set from samples. In silico doublets are simulated as positive samples.
- Parameters
samples (pd.DataFrame) – the samples dataframe, with rows as cells (droplets, samples) and columns as genes.
ratio (float, default 1.0) – The ratio of the number of doublets to singlets.
adj (float, default 1.0) – The adjustment factor for the doublet expression level. A doublet is generally assumed to have twice the gene expression level of a singlet, but this does not always hold; the expression level of the generated doublets is scaled by this factor.
seed (int, default None) – The random seed for the generation of the doublet.
size (int, default None) – The size of the generated training set. If None, the size of the training set will be the same as the size of the samples.
mode (str, default "balance") – The mode of the generated training set. If “heterotypic”, heterotypic doublets will be the majority; if “homotypic”, homotypic doublets will be the majority; if “balance”, the numbers of heterotypic and homotypic doublets will be balanced.
- Returns
A tuple of two pd.DataFrame: the first contains the positive (doublet) samples, the second the negative (singlet) samples.
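A minimal sketch of calling generateDoublet (not from the original docs); counts.csv is a placeholder for your own droplets-by-genes expression matrix:
>>> import pandas as pd
>>> from socube.task.doublet.data import generateDoublet
>>> counts = pd.read_csv("counts.csv", index_col=0)  # hypothetical droplets x genes matrix
>>> doublets, singlets = generateDoublet(counts, ratio=1.0, adj=1.0, seed=42, mode="balance")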
socube.task.doublet.data.checkShape(path: str, shape: tuple = (10, None, None)) → None
Check dataset shape.
- Parameters
path (string) – the dataset’s directory
shape (tuple, default (10, None, None)) – the expected shape of the dataset, None means any shape
- Raises
AssertionError – if the shape of the dataset does not match the expected shape
socube.task.doublet.data.checkData(data: pandas.core.frame.DataFrame)
Verify data validity.
- Parameters
data (pd.DataFrame) – Data to be checked, a dataframe of scRNA-seq data
- Raises
ValueError – if data contains NaN or inf
IndexError – if data contains duplicate column or row names, or if a droplet name begins with “doublet”
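A minimal sketch of the two validation helpers (not from the original docs); the file and directory names are placeholders:
>>> import pandas as pd
>>> from socube.task.doublet.data import checkData, checkShape
>>> counts = pd.read_csv("counts.csv", index_col=0)  # hypothetical scRNA-seq matrix
>>> checkData(counts)  # raises ValueError/IndexError if the data is invalid
>>> checkShape("path/to/dataset", shape=(10, None, None))  # raises AssertionError on a shape mismatch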
socube.task.doublet.data.createTrainData(samples: pandas.core.frame.DataFrame, output_path: str, ratio: float = 1.0, adj: float = 1.0, seed: Optional[int] = None, mode: Optional[str] = 'balance') → Tuple[pandas.core.generic.NDFrame]
Based on the original data, doublets are generated as the positive data and a subset of the original data is used as the negative data to obtain the training dataset.
- Parameters
samples (pd.DataFrame) – Original data, a dataframe of scRNA-seq data. Shape is (n_droplets, n_genes)
output_path (str) – Path to save the generated training data.
ratio (float, default 1.0) – The ratio of the number of doublets to singlets.
adj (float, default 1.0) – The adjustment factor for the doublet expression level. A doublet is generally assumed to have twice the gene expression level of a singlet, but this does not always hold; the expression level of the generated doublets is scaled by this factor.
seed (int, default None) – The random seed for the generation of the doublet.
mode (str, default "balance") – The mode of the generated training set. If “heterotypic”, heterotypic doublets will be the majority; if “homotypic”, homotypic doublets will be the majority; if “balance”, the numbers of heterotypic and homotypic doublets will be balanced. See generateDoublet for more details.
- Returns
A tuple of NDFrames: the first element is the training data, the second is the training labels.
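A minimal sketch of building a training set (not from the original docs); the input matrix and output directory are placeholders:
>>> import pandas as pd
>>> from socube.task.doublet.data import createTrainData
>>> counts = pd.read_csv("counts.csv", index_col=0)  # hypothetical (n_droplets, n_genes) matrix
>>> train_data, train_label = createTrainData(counts, output_path="path/to/output", ratio=1.0, adj=1.0, seed=42, mode="balance")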
socube.task.doublet.model module
class socube.task.doublet.model.SoCubeNet(in_channels: int, out_channels: int, freeze: bool = False, binary: bool = True, **kwargs)
Bases: socube.net._base.NetBase
Neural network model constructed for the doublet detection task. Previously named Conv2DClassifyNet.
- Parameters
in_channels (integer scalar value) – input data’s channel count
out_channels (integer scalar value) – output data’s channel count
freeze (boolean value) – Whether to freeze the feature extraction layer, default is False
binary (boolean value) – Whether the output probability is binary or multicategorical, default is True for binary
Examples
>>> SoCubeNet(10, 2)
criterion(yPredict: torch.Tensor, yTrue: torch.Tensor) → torch.Tensor
Calculate the loss of the model.
- Parameters
yPredict (torch.Tensor) – The predicted data.
yTrue (torch.Tensor) – The true data.
- Returns
The loss of the model.
forward(x1: torch.Tensor) → torch.Tensor
Forward pass of the network.
- Parameters
x1 (torch.Tensor) – The input data.
- Returns
The output data.
training = None
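A minimal forward-pass sketch (not from the original docs); the 32×32 spatial size of the dummy input is an assumption for illustration and should match your preprocessed samples:
>>> import torch
>>> from socube.task.doublet.model import SoCubeNet
>>> model = SoCubeNet(in_channels=10, out_channels=2)
>>> x = torch.randn(4, 10, 32, 32)  # dummy batch: 4 samples, 10 channels
>>> with torch.no_grad():
...     scores = model(x)  # predicted scores, one row per sample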
socube.task.doublet.train module
socube.task.doublet.train.fit(home_dir: str, data_dir: str, lr: float = 0.001, gamma: float = 0.99, epochs: int = 100, train_batch: int = 32, valid_batch: int = 500, transform: torch.nn.modules.module.Module = None, in_channels: int = 10, num_workers: int = 0, shuffle: bool = False, seed: int = None, label_file: str = 'label.csv', threshold: float = 0.5, k: int = 5, once: bool = False, use_index: bool = True, gpu_ids: List[str] = None, step: int = 5, model_id: str = None, pretrain_model_path: str = None, max_acc_limit: float = 1.0, multi_process: bool = False, **kwargs) → str
Train a SoCube model.
- Parameters
home_dir (str) – the home directory of the specific job
data_dir (str) – the dataset’s directory
lr (float) – learning rate, default: 0.001
gamma (float) – learning rate decay, default: 0.99
epochs (int) – training epochs, default: 100
train_batch (int) – training batch size, default: 32
valid_batch (int) – validation batch size, default: 500
transform (nn.Module) – sample transform, such as Resize
in_channels (int) – the number of input channels, default: 10
num_workers (int) – the number of workers for data loading, default: 0
shuffle (bool) – if True, data will be shuffled during k-fold cross-validation
seed (int) – random seed for k-fold cross-validation or sampling
label_file (str) – the label file csv name, default: “label.csv”
threshold (float) – the threshold for classification, default: 0.5
k (int) – k value of k-fold cross-validation
once (bool) – if True, only the first fold of the k-fold cross-validation is run
use_index (bool) – If True, it will read sample file by index. Otherwise, it will read sample file by sample name.
gpu_ids (List[str]) – the list of GPU ids to use; if None, the CPU is used
step (int) – the epoch step of learning rate decay, default: 5
model_id (str) – the model id; if None, it will be generated automatically
pretrain_model_path (str) – the pretrained model path; if not None, the pretrained model will be loaded
max_acc_limit (float) – the maximum accuracy limit; if the accuracy exceeds this limit, training stops to prevent overfitting
multi_process (bool) – if True, multiple processes are used to train the model
**kwargs (dict) – additional parameters to be saved in the log file
- Returns
The job id string.
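A minimal training sketch (not from the original docs); the output and dataset directories and the GPU id list are placeholder assumptions:
>>> from socube.task.doublet.train import fit
>>> job_id = fit(
...     home_dir="outputs",          # hypothetical job output directory
...     data_dir="path/to/dataset",  # hypothetical dataset directory
...     lr=1e-3,
...     epochs=100,
...     k=5,
...     seed=42,
...     gpu_ids=["0"],               # hypothetical GPU id list
... )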
socube.task.doublet.train.validate(data_loader: torch.utils.data.dataloader.DataLoader, model: socube.net._base.NetBase, device: torch.device, with_progress: bool = False) → tuple
Basic validation of model performance.
- Parameters
data_loader (DataLoader) – the torch dataloader object used for validation
model (NetBase) – the model to be validated
device (torch.device) – the CPU/GPU device
- Returns
A 4-tuple of (average loss, average accuracy, true labels, predicted scores).
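A minimal validation sketch (not from the original docs), assuming a dataset directory prepared as above and an untrained SoCubeNet purely for illustration:
>>> import torch
>>> from torch.utils.data import DataLoader
>>> from socube.task.doublet.data import ConvClassifyDataset
>>> from socube.task.doublet.model import SoCubeNet
>>> from socube.task.doublet.train import validate
>>> device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
>>> dataset = ConvClassifyDataset("path/to/dataset")  # hypothetical directory
>>> model = SoCubeNet(in_channels=10, out_channels=2).to(device)
>>> loss, acc, y_true, y_score = validate(DataLoader(dataset, batch_size=500), model, device)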
socube.task.doublet.train.infer(data_dir: str, home_dir: str, model_id: str, label_file: str, in_channels: int = 10, k: int = 5, threshold: float = 0.5, batch_size: int = 400, gpu_ids: List[str] = None, with_eval: bool = False, seed: Optional[int] = None, multi_process: bool = False, once: bool = False)
Model inference.
- Parameters
data_dir (str) – the directory of data
home_dir (str) – the home directory of output
model_id (str) – the id of model
label_file (str) – the label file used for inference
in_channels (int) – the number of input channels
k (int) – k value for k-fold cross validation
threshold (float) – the threshold for binary classification
batch_size (int) – the batch size for inference
gpu_ids (List[str]) – the list of gpu ids
with_eval (bool) – whether to evaluate the model performance
seed (int) – the random seed
multi_process (bool) – whether to use multiple processes for inference
once (bool) – whether to use the ensemble for inference
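A minimal inference sketch (not from the original docs); the directories, model id, and GPU ids are placeholders for your own trained run:
>>> from socube.task.doublet.train import infer
>>> infer(
...     data_dir="path/to/dataset",      # hypothetical dataset directory
...     home_dir="outputs",              # hypothetical job output directory
...     model_id="my-trained-model-id",  # hypothetical id of a previously trained model
...     label_file="label.csv",
...     k=5,
...     threshold=0.5,
...     gpu_ids=["0"],                   # hypothetical GPU id list
...     with_eval=False,
... )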
Module contents
class socube.task.doublet.SoCubeNet(in_channels: int, out_channels: int, freeze: bool = False, binary: bool = True, **kwargs)
Bases: socube.net._base.NetBase
Neural network model constructed for the doublet detection task. Previously named Conv2DClassifyNet.
- Parameters
in_channels (integer scalar value) – input data’s channel count
out_channels (integer scalar value) – output data’s channel count
freeze (boolean value) – Whether to freeze the feature extraction layer, default is False
binary (boolean value) – Whether the output probability is binary or multicategorical, default is True for binary
Examples
>>> SoCubeNet(10, 2)
criterion(yPredict: torch.Tensor, yTrue: torch.Tensor) → torch.Tensor
Calculate the loss of the model.
- Parameters
yPredict (torch.Tensor) – The predicted data.
yTrue (torch.Tensor) – The true data.
- Returns
The loss of the model.
forward(x1: torch.Tensor) → torch.Tensor
Forward pass of the network.
- Parameters
x1 (torch.Tensor) – The input data.
- Returns
The output data.
training = None
class socube.task.doublet.ConvClassifyDataset(data_dir: str, transform: torch.nn.modules.module.Module = None, labels: str = 'label.csv', shuffle: bool = False, seed: int = None, k: int = 5, use_index: bool = True)
Bases: socube.data.loading.ConvDatasetBase
Dataset class for SoCube.
- Parameters
data_dir (string) – the dataset’s directory
transform (torch module) – sample transform, such as Resize
labels (string) – the label file csv name
shuffle (boolean value) – if True, data will be shuffled during k-fold cross-validation
seed (integer scalar value) – random seed for k-fold cross-validation or sampling
k (integer scalar value) – k value of k-fold cross-validation
use_index (boolean value) – if True, sample files are read by index; otherwise they are read by sample name
sampler(subset: torch.utils.data.dataset.Subset) → torch.utils.data.sampler.WeightedRandomSampler
Generate a weighted random sampler for a subset of this dataset.
- Parameters
subset (Subset) – the subset of this dataset
- Returns
A weighted random sampler for the subset.
property typeCounts
Number of samples of each type in the dataset.
socube.task.doublet.generateDoublet(samples: pandas.core.frame.DataFrame, ratio: float = 1.0, adj: float = 1.0, seed: Optional[int] = None, size: Optional[int] = None, mode: Optional[str] = 'balance') → Tuple[pandas.core.frame.DataFrame]
Generate a training set from samples. In silico doublets are simulated as positive samples.
- Parameters
samples (pd.DataFrame) – the samples dataframe, with rows as cells (droplets, samples) and columns as genes.
ratio (float, default 1.0) – The ratio of the number of doublets to singlets.
adj (float, default 1.0) – The adjustment factor for the doublet expression level. A doublet is generally assumed to have twice the gene expression level of a singlet, but this does not always hold; the expression level of the generated doublets is scaled by this factor.
seed (int, default None) – The random seed for the generation of the doublet.
size (int, default None) – The size of the generated training set. If None, the size of the training set will be the same as the size of the samples.
mode (str, default "balance") – The mode of the generated training set. If “heterotypic”, heterotypic doublets will be the majority; if “homotypic”, homotypic doublets will be the majority; if “balance”, the numbers of heterotypic and homotypic doublets will be balanced.
- Returns
A tuple of two pd.DataFrame: the first contains the positive (doublet) samples, the second the negative (singlet) samples.
socube.task.doublet.fit(home_dir: str, data_dir: str, lr: float = 0.001, gamma: float = 0.99, epochs: int = 100, train_batch: int = 32, valid_batch: int = 500, transform: torch.nn.modules.module.Module = None, in_channels: int = 10, num_workers: int = 0, shuffle: bool = False, seed: int = None, label_file: str = 'label.csv', threshold: float = 0.5, k: int = 5, once: bool = False, use_index: bool = True, gpu_ids: List[str] = None, step: int = 5, model_id: str = None, pretrain_model_path: str = None, max_acc_limit: float = 1.0, multi_process: bool = False, **kwargs) → str
Train a SoCube model.
- Parameters
home_dir (str) – the home directory of the specific job
data_dir (str) – the dataset’s directory
lr (float) – learning rate, default: 0.001
gamma (float) – learning rate decay, default: 0.99
epochs (int) – training epochs, default: 100
train_batch (int) – training batch size, default: 32
valid_batch (int) – validation batch size, default: 500
transform (nn.Module) – sample transform, such as Resize
in_channels (int) – the number of input channels, default: 10
num_workers (int) – the number of workers for data loading, default: 0
shuffle (bool) – if True, data will be shuffled during k-fold cross-validation
seed (int) – random seed for k-fold cross-validation or sampling
label_file (str) – the label file csv name, default: “label.csv”
threshold (float) – the threshold for classification, default: 0.5
k (int) – k value of k-fold cross-validation
once (bool) – if True, only the first fold of the k-fold cross-validation is run
use_index (bool) – If True, it will read sample file by index. Otherwise, it will read sample file by sample name.
gpu_ids (List[str]) – the list of GPU ids to use; if None, the CPU is used
step (int) – the epoch step of learning rate decay, default: 5
model_id (str) – the model id; if None, it will be generated automatically
pretrain_model_path (str) – the pretrained model path; if not None, the pretrained model will be loaded
max_acc_limit (float) – the maximum accuracy limit; if the accuracy exceeds this limit, training stops to prevent overfitting
multi_process (bool) – if True, multiple processes are used to train the model
**kwargs (dict) – additional parameters to be saved in the log file
- Returns
The job id string.
socube.task.doublet.validate(data_loader: torch.utils.data.dataloader.DataLoader, model: socube.net._base.NetBase, device: torch.device, with_progress: bool = False) → tuple
Basic validation of model performance.
- Parameters
data_loader (DataLoader) – the torch dataloader object used for validation
model (NetBase) – the model to be validated
device (torch.device) – the CPU/GPU device
- Returns
A 4-tuple of (average loss, average accuracy, true labels, predicted scores).
socube.task.doublet.infer(data_dir: str, home_dir: str, model_id: str, label_file: str, in_channels: int = 10, k: int = 5, threshold: float = 0.5, batch_size: int = 400, gpu_ids: List[str] = None, with_eval: bool = False, seed: Optional[int] = None, multi_process: bool = False, once: bool = False)
Model inference.
- Parameters
data_dir (str) – the directory of data
home_dir (str) – the home directory of output
model_id (str) – the id of model
label_file (str) – the label file used for inference
in_channels (int) – the number of input channels
k (int) – k value for k-fold cross validation
threshold (float) – the threshold for binary classification
batch_size (int) – the batch size for inference
gpu_ids (List[str]) – the list of gpu ids
with_eval (bool) – whether to evaluate the model performance
seed (int) – the random seed
multi_process (bool) – whether to use multiple processes for inference
once (bool) – whether to use the ensemble for inference
socube.task.doublet.checkShape(path: str, shape: tuple = (10, None, None)) → None
Check dataset shape.
- Parameters
path (string) – the dataset’s directory
shape (tuple, default (10, None, None)) – the expected shape of the dataset, None means any shape
- Raises
AssertionError – if the shape of the dataset does not match the expected shape
socube.task.doublet.checkData(data: pandas.core.frame.DataFrame)
Verify data validity.
- Parameters
data (pd.DataFrame) – Data to be checked, a dataframe of scRNA-seq data
- Raises
ValueError – if data contains NaN or inf
IndexError – if data contains duplicate column or row names, or if a droplet name begins with “doublet”
socube.task.doublet.createTrainData(samples: pandas.core.frame.DataFrame, output_path: str, ratio: float = 1.0, adj: float = 1.0, seed: Optional[int] = None, mode: Optional[str] = 'balance') → Tuple[pandas.core.generic.NDFrame]
Based on the original data, doublets are generated as the positive data and a subset of the original data is used as the negative data to obtain the training dataset.
- Parameters
samples (pd.DataFrame) – Original data, a dataframe of scRNA-seq data. Shape is (n_droplets, n_genes)
output_path (str) – Path to save the generated training data.
ratio (float, default 1.0) – The ratio of the number of doublets to singlets.
adj (float, default 1.0) – The adjustment factor for the doublet expression level. A doublet is generally assumed to have twice the gene expression level of a singlet, but this does not always hold; the expression level of the generated doublets is scaled by this factor.
seed (int, default None) – The random seed for the generation of the doublet.
mode (str, default "balance") – The mode of the generated training set. If “heterotypic”, heterotypic doublets will be the majority; if “homotypic”, homotypic doublets will be the majority; if “balance”, the numbers of heterotypic and homotypic doublets will be balanced. See generateDoublet for more details.
- Returns
A tuple of NDFrames: the first element is the training data, the second is the training labels.