Module transact.TRANSACT
TRANSACT: Tumor Response Assessment by Non-linear Subspace Alignment of Cell-lines and Tumors.
@author: Soufiane Mourragui (soufiane.mourragui@gmail.com)
Method supporting the design of drug response models that translate from pre-clinical models to tumors. The complete
methodological details can be found in our pre-print
(https://www.biorxiv.org/content/10.1101/2020.06.29.177139v3).
Example
::
    import numpy as np
    from transact.TRANSACT import TRANSACT

    # Generate data
    n_source = 100
    n_target = 200
    n_features = 500
    X_source = np.random.normal(size=(n_source, n_features))
    y_source = X_source.dot(np.random.normal(size=n_features))
    X_target = np.random.normal(size=(n_target, n_features))

    # Create a TRANSACT instance
    clf = TRANSACT(
        kernel='rbf',
        kernel_params={'gamma': 1/np.sqrt(n_features)},
        n_components={'source': 20, 'target': 40},
        n_jobs=1,
        verbose=1
    )

    # Compute consensus features
    clf.fit(
        X_source,
        X_target,
        n_pv=10,
        step=100,
        with_interpolation=True
    )
::
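Once fitted, the consensus features support projection and prediction. A minimal sketch continuing the example above (y_source stands in for drug response labels; the calls mirror the methods documented below):

::
    # Project the target data on the consensus features.
    X_target_projected = clf.transform(X_target)
    # Train an Elastic Net drug response predictor on the source domain ...
    clf.fit_predictor(X_source, y_source)
    # ... and predict the response of the target samples.
    y_target_predicted = clf.predict(X_target)
::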
Notes
TRANSACT requires Python 3.6 or higher and the following packages: scikit-learn, numpy, scipy, joblib.
Please report any issue on GitHub, or contact me (s.mourragui@nki.nl).
References
[1] Mourragui et al., 2021. Predicting clinical drug response from model systems by non-linear subspace-based transfer learning. bioRxiv.
Source code
""" <h3> <b>TRANSACT</b></h3>: <b>T</b>umor <b>R</b>esponse <b>A</b>ssessment by <b>N</b>on-linear <b>S</b>ubspace
<b>A</b>lignment of <b>C</b>ell-lines and <b>T</b>umors.
@author: Soufiane Mourragui <soufiane.mourragui@gmail.com>
Method supporting the design of drug response models that translate from pre-clinical models to tumors. The complete
methodological details can be found in our <a href="https://www.biorxiv.org/content/10.1101/2020.06.29.177139v3">
pre-print</a>.
<br/><br/>
Example
-------
::
    import numpy as np
    from transact.TRANSACT import TRANSACT

    # Generate data
    n_source = 100
    n_target = 200
    n_features = 500
    X_source = np.random.normal(size=(n_source, n_features))
    y_source = X_source.dot(np.random.normal(size=n_features))
    X_target = np.random.normal(size=(n_target, n_features))

    # Create a TRANSACT instance
    clf = TRANSACT(
        kernel='rbf',
        kernel_params={'gamma': 1/np.sqrt(n_features)},
        n_components={'source': 20, 'target': 40},
        n_jobs=1,
        verbose=1
    )

    # Compute consensus features
    clf.fit(
        X_source,
        X_target,
        n_pv=10,
        step=100,
        with_interpolation=True
    )
::
Notes
-------
TRANSACT requires Python 3.6 or higher and the following packages: scikit-learn, numpy, scipy, joblib.
<br/>
Please report any issue on GitHub, or contact me (s.mourragui@nki.nl).
References
-------
[1] Mourragui et al 2021, Predicting clinical drug response from model systems by non-linear subspace-based transfer
learning, Biorxiv.
"""
import numpy as np
import scipy.stats
from joblib import Parallel, delayed
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.linear_model import Ridge, ElasticNet, Lasso, LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.base import clone
from transact.pv_computation import PVComputation
from transact.interpolation import Interpolation
from transact.kernel_computer import KernelComputer
class TRANSACT:
    """
    TRANSACT is a package designed to adapt predictors of drug response from pre-clinical models to the clinic.
    <br/><br/>
    This class contains all the tasks and sub-routines required for training the domain adaptation framework, i.e.:
    <ul>
        <li> Kernel PCA decomposition on source and target independently.
        <li> Kernel principal components comparison.
        <li> Computation of Principal Vectors (PVs).
        <li> Interpolation between source and target PVs and extraction of Consensus Features (CFs).
        <li> Out-of-sample extension: project new dataset onto the consensus features.
    </ul>
    """

    def __init__(self,
                 kernel='linear',
                 kernel_params=None,
                 n_components=None,
                 n_pv=None,
                 method='two-stage',
                 step=100,
                 n_jobs=1,
                 verbose=False):
        """
        Parameters
        ----------
        kernel : str, default to 'linear'
            Name of the kernel to be used in the algorithm. Has to be compliant with
            <a href="https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.kernel_metrics.html#sklearn.metrics.pairwise.kernel_metrics">
            scikit-learn kernel</a>, e.g., "rbf", "polynomial", "laplacian", "linear", ...
        kernel_params : dict, default to None
            Parameters of the kernel (degree for polynomial kernel, gamma for RBF).
            Naming has to be compliant with scikit-learn, e.g., {"gamma": 0.0005}.
        n_components : int or dict, default to None
            Number of components for kernel PCA.
            <br/> If int, then indicates the same number of components for source and target.
            <br/> If dict, then must be of the form {'source': int, 'target': int}.
        n_pv : int, default to None
            Number of principal vectors.
        method : str, default to 'two-stage'
            Method used for computing the principal vectors. Only 'two-stage' has been implemented.
        step : int, default to 100
            Number of interpolation steps.
        n_jobs : int, default to 1
            Number of concurrent threads to use for tasks that can be parallelized.
        verbose : bool or int, default to False
            Degree of verbosity in joblib routines.
        """
        self.kernel = kernel
        self.kernel_params_ = kernel_params or {}
        self.kernel_values_ = KernelComputer(self.kernel, self.kernel_params_)
        self.source_data_ = None
        self.target_data_ = None
        self.is_fitted = False
        self.n_components = n_components
        self.n_pv = n_pv
        self.method = method
        self.step = step
        self.predictive_clf = None
        self.n_jobs = n_jobs
        self.verbose = verbose

    def fit(self,
            source_data,
            target_data,
            n_components=None,
            n_pv=None,
            method='two-stage',
            step=100,
            with_interpolation=True,
            left_center=True):
        """
        Compute the Consensus Features (CFs) onto which predictive models can be trained.
        <br/> Specifically:
        <ul>
            <li> Compute the kernel matrices.
            <li> Compute the cosine similarity matrix.
            <li> Compute principal vectors.
            <li> Interpolate between the PVs.
            <li> Find optimal interpolation time.
        </ul>

        Parameters
        ----------
        source_data : np.ndarray, dtype=float
            Source data, matrix with samples in the rows, i.e. shape (n_source_samples, n_features).
            <br/> pandas.DataFrame are supported.
        target_data : np.ndarray, dtype=float
            Target data, matrix with samples in the rows, i.e. shape (n_target_samples, n_features).
            <br/> pandas.DataFrame are supported.
            <br/><b>WARNING</b>: features need to be ordered in the same way as in source_data.
        n_components : int, default to None
            Number of components. If not set here or in __init__, then the maximum number of principal
            components possible for source and target is used.
        n_pv : int, default to None
            Number of Principal Vectors. If not set here or in __init__, then the maximum number of PVs
            will be computed.
        method : str, default to 'two-stage'
            Method used for computing the principal vectors. Only 'two-stage' has been implemented.
        step : int, default to 100
            Number of interpolation steps.
        with_interpolation : bool, default to True
            Whether interpolation shall also be fitted. Useful for computing only the PVs
            prior to null distribution fitting (and the choice of the number of PVs).
        left_center : bool, default to True
            Whether the output should be mean-centered, i.e. whether source and target
            consensus feature values (or PVs if no interpolation) must have an independent mean-centering.

        Returns
        -------
        self : TRANSACT
            Fitted instance.
        """
        # Save parameters
        self.source_data_ = source_data
        self.target_data_ = target_data
        self.method = method or self.method
        self.n_components = n_components or self.n_components
        self.n_pv = n_pv or self.n_pv
        self.step = step or self.step
        self.left_center = left_center

        # Compute kernel values
        self.kernel_values_.fit(source_data, target_data, center=False)

        # Compute principal vectors
        self.principal_vectors_ = PVComputation(self.kernel, self.kernel_params_)
        self.principal_vectors_.fit(self.source_data_,
                                    self.target_data_,
                                    method=self.method,
                                    n_components=self.n_components,
                                    n_pv=self.n_pv)

        # Stop here if interpolation should not be computed.
        if not with_interpolation:
            return self

        # Set up interpolation scheme
        self.interpolation_ = Interpolation(self.kernel, self.kernel_params_)
        self.interpolation_.fit(self.principal_vectors_, self.kernel_values_)

        # Compute optimal interpolation time
        self._compute_optimal_time(step=self.step, left_center=self.left_center)

        self.is_fitted = True
        return self

    def null_distribution_pv_similarity(self, method='gene_shuffling', n_iter=100):
        """
        Generate a null distribution for the PV similarity function:
        <ul>
            <li> Gene shuffling: genes get shuffled in source to destroy any structure existing
            at the gene level while preserving the sample structure. PVs get recomputed and the
            similarity is saved.
        </ul>

        Parameters
        ----------
        method : str, default to 'gene_shuffling'
            Method used for generating the null distribution.
            Only method developed: 'gene_shuffling'.
        n_iter : int, default to 100
            Number of iterations.

        Returns
        -------
        np.ndarray, dtype=float, shape (n_iter, n_pv)
            Array containing the distribution of similarity after shuffling. Each row
            contains the values of one shuffling across PVs.
        """
        if method.lower() == 'gene_shuffling':
            null_method = self._gene_shuffling
        else:
            raise NotImplementedError('%s is not a proper method for generating null distribution' % (method))

        null_distribution = Parallel(n_jobs=self.n_jobs, verbose=self.verbose)\
            (delayed(null_method)() for _ in range(n_iter))
        return np.array(null_distribution)

    def _gene_shuffling(self):
        # Shuffle the features (genes) of the source data, recompute the PVs and
        # return the resulting PV similarities (cosines of the canonical angles).
        perm = np.random.permutation(self.source_data_.shape[1])
        pv = PVComputation(self.kernel, self.kernel_params_)
        pv.fit(self.source_data_[:, perm],
               self.target_data_,
               method=self.method,
               n_components=self.n_components,
               n_pv=self.n_pv)
        return np.cos(pv.canonical_angles)

    def fit_predictor(self, X, y, alpha_values=None, l1_ratio=0.5):
        """
        Project X on consensus features and train a predictor of drug response.

        Parameters
        ----------
        X : np.ndarray of shape (n_samples, n_features), dtype=float
            Dataset to project. Features should be ordered in the same way as in source_data
            and target_data.
        y : np.ndarray of shape (n_samples, 1), dtype=float
            Output to predict.
        alpha_values : np.ndarray, default to None
            Grid of alpha values for the Elastic Net. If None, np.logspace(-10, 5, 34) is used.
        l1_ratio : float, default to 0.5
            Unused; the l1_ratio is selected by cross-validated grid search.

        Returns
        -------
        self : TRANSACT
            Instance with a fitted predictor (self.predictive_clf).
        """
        self.alpha_values = alpha_values if alpha_values is not None else np.logspace(-10, 5, 34)
        self.l1_ratio_values = [0., .1, .2, .4, .5, .6, .8, .9, 1.]
        param_grid = {
            'regression__alpha': self.alpha_values,
            'regression__l1_ratio': self.l1_ratio_values
        }

        # Grid-search setup: cross-validated Elastic Net trained on the projected data.
        self.predictive_clf = GridSearchCV(Pipeline([('regression', ElasticNet())]),
                                           cv=10,
                                           n_jobs=self.n_jobs,
                                           param_grid=param_grid,
                                           verbose=self.verbose,
                                           scoring='neg_mean_squared_error')
        self.predictive_clf.fit(self.transform(X, center=False), y)
        return self

    def compute_pred_performance(self, X, y, cv=10):
        """
        Compute the predictive performance of the predictive model by cross-validation
        on X and y.

        Parameters
        ----------
        X : np.ndarray of shape (n_samples, n_features), dtype=float
            Dataset to project. Features should be ordered in the same way as in source_data
            and target_data.
        y : np.ndarray of shape (n_samples, 1), dtype=float
            Output to predict.
        cv : int, default to 10
            Number of cross-validation folds.

        Returns
        -------
        (float, float)
            Pearson correlation between the cross-validated predictions and y,
            with the corresponding p-value.
        """
        kf = KFold(n_splits=cv, shuffle=True)
        X_projected = self.transform(X)

        # Fit the predictor first if it has not been trained yet.
        if self.predictive_clf is None:
            print('WARNING: instance not fitted, fitting predictor first.')
            self.fit_predictor(X, y)

        clf = clone(self.predictive_clf)
        y_predicted = np.zeros(X.shape[0])
        for train_index, test_index in kf.split(X_projected):
            clf.fit(X_projected[train_index], y[train_index])
            y_predicted[test_index] = clf.predict(X_projected[test_index])

        return scipy.stats.pearsonr(y_predicted, y)

    def predict(self, X):
        """
        Predict the drug response of a set of samples, i.e.:
        <ul>
            <li> Project data on consensus features.
            <li> Use the Elastic Net model to predict based on the consensus features.
        </ul>

        Parameters
        ----------
        X : np.ndarray, dtype=float
            Dataset to project, of shape (n_samples, n_features). Features should be ordered in the same
            way as in source_data and target_data.

        Returns
        -------
        np.ndarray of shape (n_samples, 1), dtype=float
            Predicted drug response values.
        """
        return self.predictive_clf.predict(self.transform(X, center=False))

    def transform(self, X, center=False):
        """
        Project a dataset X onto the consensus features.

        Parameters
        ----------
        X : np.ndarray, dtype=float
            Dataset to project, of shape (n_samples, n_features). Features should be ordered in the same
            way as in source_data and target_data.
        center : bool, default to False
            Whether the projected data should be mean-centered.

        Returns
        -------
        np.ndarray of shape (n_samples, n_pv), dtype=float
            Dataset projected on consensus features.
        """
        return self.interpolation_.transform(X, self.optimal_time, center=center)

    def _compute_optimal_time(self, step=100, left_center=True):
        # Find, for each PV, the interpolation time where the source and target projections
        # agree best, based on the Kolmogorov-Smirnov (KS) statistic.

        # Compute the interpolated values at each time s/step, s = 0, ..., step.
        interpolated_values = Parallel(n_jobs=self.n_jobs, verbose=self.verbose)\
            (delayed(self.interpolation_.project_data)(s/step, center=left_center)
             for s in range(step+1))
        interpolated_values = np.array(interpolated_values).transpose(2, 0, 1)
        source_interpolated_values = interpolated_values[:, :, :self.source_data_.shape[0]]
        target_interpolated_values = interpolated_values[:, :, self.source_data_.shape[0]:]

        self.optimal_time = []
        self.ks_statistics = []
        self.ks_p_values = []

        # For each PV, find the time when interpolation has the largest overlap,
        # i.e. the smallest KS statistic between source and target projected values.
        for source_pv, target_pv in zip(source_interpolated_values, target_interpolated_values):
            self.ks_statistics.append([])
            for s, t in zip(source_pv, target_pv):
                self.ks_statistics[-1].append(scipy.stats.ks_2samp(s, t))
            self.ks_statistics[-1] = list(zip(*self.ks_statistics[-1]))
            self.ks_p_values.append(self.ks_statistics[-1][-1])
            self.ks_statistics[-1] = self.ks_statistics[-1][0]
            self.optimal_time.append(np.argmin(self.ks_statistics[-1])/step)

        # Save the different statistics
        self.optimal_time = np.array(self.optimal_time)    # Optimal tau for each PV.
        self.ks_statistics = np.array(self.ks_statistics)  # KS statistics for each PV and time.
        self.ks_p_values = np.array(self.ks_p_values)      # Corresponding p-values.
Classes
class TRANSACT (kernel='linear', kernel_params=None, n_components=None, n_pv=None, method='two-stage', step=100, n_jobs=1, verbose=False)
TRANSACT is a package designed to adapt predictors of drug response from pre-clinical models to the clinic.
This class contains all the tasks and sub-routines required for training the domain adaptation framework, i.e.:
- Kernel PCA decomposition on source and target independently.
- Kernel principal components comparison.
- Computation of Principal Vectors (PVs).
- Interpolation between source and target PVs and extraction of Consensus Features (CFs).
- Out-of-sample extension: project new dataset onto the consensus features.
Parameters
kernel : str, default to 'linear'
    Name of the kernel to be used in the algorithm. Has to be compliant with scikit-learn kernels, e.g., "rbf", "polynomial", "laplacian", "linear", ...
kernel_params : dict, default to None
    Parameters of the kernel (degree for polynomial kernel, gamma for RBF). Naming has to be compliant with scikit-learn, e.g., {"gamma": 0.0005}.
n_components : int or dict, default to None
    Number of components for kernel PCA.
    If int, then indicates the same number of components for source and target.
    If dict, then must be of the form {'source': int, 'target': int}.
n_pv : int, default to None
    Number of principal vectors.
method : str, default to 'two-stage'
    Method used for computing the principal vectors. Only 'two-stage' has been implemented.
step : int, default to 100
    Number of interpolation steps.
n_jobs : int, default to 1
    Number of concurrent threads to use for tasks that can be parallelized.
verbose : bool or int, default to False
    Degree of verbosity in joblib routines.
Methods
def compute_pred_performance(self, X, y, cv=10)
Compute the predictive performance of the predictive model by cross-validation on X and y.
Parameters
X : np.ndarray of shape (n_samples, n_features), dtype=float
    Dataset to project. Features should be ordered in the same way as in source_data and target_data.
y : np.ndarray of shape (n_samples, 1), dtype=float
    Output to predict.
cv : int, default to 10
    Number of cross-validation folds.
Returns
(float, float)
    Pearson correlation between the cross-validated predictions and y, with the corresponding p-value.
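A minimal usage sketch; it assumes a fitted instance clf and labelled source data X_source, y_source as in the module-level example:

::
    # Cross-validated Pearson correlation between predictions and true response.
    r, p_value = clf.compute_pred_performance(X_source, y_source, cv=10)
    print('Pearson r = %.3f (p = %.2e)' % (r, p_value))
::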
def fit(self, source_data, target_data, n_components=None, n_pv=None, method='two-stage', step=100, with_interpolation=True, left_center=True)
Compute the Consensus Features (CFs) onto which predictive models can be trained. Specifically:
- Compute the kernel matrices.
- Compute the cosine similarity matrix.
- Compute principal vectors.
- Interpolate between the PVs.
- Find optimal interpolation time.
Parameters
source_data : np.ndarray, dtype=float
    Source data, matrix with samples in the rows, i.e. shape (n_source_samples, n_features). pandas.DataFrame are supported.
target_data : np.ndarray, dtype=float
    Target data, matrix with samples in the rows, i.e. shape (n_target_samples, n_features). pandas.DataFrame are supported.
    WARNING: features need to be ordered in the same way as in source_data.
n_components : int, default to None
    Number of components. If not set here or in __init__, then the maximum number of principal components possible for source and target is used.
n_pv : int, default to None
    Number of Principal Vectors. If not set here or in __init__, then the maximum number of PVs will be computed.
method : str, default to 'two-stage'
    Method used for computing the principal vectors. Only 'two-stage' has been implemented.
step : int, default to 100
    Number of interpolation steps.
with_interpolation : bool, default to True
    Whether interpolation shall also be fitted. Useful for computing only the PVs prior to null distribution fitting (and the choice of the number of PVs).
left_center : bool, default to True
    Whether the output should be mean-centered, i.e. whether source and target consensus feature values (or PVs if no interpolation) must have an independent mean-centering.
Returns
self : TRANSACT
    Fitted instance.
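Beyond the module-level example, the fitted instance exposes the interpolation diagnostics computed by the KS-based search; a minimal sketch (attribute shapes follow the source above):

::
    clf.fit(X_source, X_target, n_pv=10, step=100)
    print(clf.optimal_time)         # optimal interpolation time tau per PV, shape (n_pv,)
    print(clf.ks_statistics.shape)  # KS statistics per PV and time, shape (n_pv, step+1)
    print(clf.ks_p_values.shape)    # corresponding p-values, shape (n_pv, step+1)
::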
def fit_predictor(self, X, y, alpha_values=None, l1_ratio=0.5)
Project X on consensus features and train a predictor of drug response.
Parameters
X : np.ndarray of shape (n_samples, n_features), dtype=float
    Dataset to project. Features should be ordered in the same way as in source_data and target_data.
y : np.ndarray of shape (n_samples, 1), dtype=float
    Output to predict.
alpha_values : np.ndarray, default to None
    Grid of alpha values for the Elastic Net. If None, np.logspace(-10, 5, 34) is used.
l1_ratio : float, default to 0.5
    Unused; the l1_ratio is selected by cross-validated grid search.
Returns
self : TRANSACT
    Instance with a fitted predictor (self.predictive_clf).
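A minimal sketch, assuming clf has been fitted and y_source holds the source response values; the custom alpha grid is illustrative:

::
    clf.fit_predictor(X_source, y_source, alpha_values=np.logspace(-6, 0, 13))
    # The underlying GridSearchCV object is stored on the instance.
    print(clf.predictive_clf.best_params_)
::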
def null_distribution_pv_similarity(self, method='gene_shuffling', n_iter=100)
Generate a null distribution for the PV similarity function:
- Gene shuffling: genes get shuffled in source to destroy any structure existing at the gene level while preserving the sample structure. PVs get recomputed and the similarity is saved.
Parameters
method : str, default to 'gene_shuffling'
    Method used for generating the null distribution. Only method developed: 'gene_shuffling'.
n_iter : int, default to 100
    Number of iterations.
Returns
np.ndarray, dtype=float, shape (n_iter, n_pv)
    Array containing the distribution of similarity after shuffling. Each row contains the values of one shuffling across PVs.
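A minimal sketch of how this null distribution can be compared to the observed PV similarities; it assumes clf has been fitted (with_interpolation=False suffices), so that clf.principal_vectors_.canonical_angles is available, as in the source above:

::
    null = clf.null_distribution_pv_similarity(method='gene_shuffling', n_iter=100)
    observed = np.cos(clf.principal_vectors_.canonical_angles)  # shape (n_pv,)
    # Empirical p-value per PV: fraction of shuffled similarities at or above the observed one.
    pv_p_values = (null >= observed).mean(axis=0)
::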
def predict(self, X)
Predict the drug response of a set of samples, i.e.:
- Project data on consensus features.
- Use the Elastic Net model to predict based on the consensus features.
Parameters
X : np.ndarray, dtype=float
    Dataset to project, of shape (n_samples, n_features). Features should be ordered in the same way as in source_data and target_data.
Returns
np.ndarray of shape (n_samples, 1), dtype=float
    Predicted drug response values.
def transform(self, X, center=False)
Project a dataset X onto the consensus features.
Parameters
X : np.ndarray, dtype=float
    Dataset to project, of shape (n_samples, n_features). Features should be ordered in the same way as in source_data and target_data.
center : bool, default to False
    Whether the projected data should be mean-centered.
Returns
np.ndarray of shape (n_samples, n_pv), dtype=float
    Dataset projected on consensus features.
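A minimal sketch, continuing the module-level example (n_target = 200, n_pv = 10):

::
    X_target_projected = clf.transform(X_target)
    print(X_target_projected.shape)  # (200, 10): one column per consensus feature
::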