Module transact.TRANSACT

TRANSACT

Tumor Response Assessment by Non-linear Subspace Alignment of Cell-lines and Tumors.

@author: Soufiane Mourragui soufiane.mourragui@gmail.com

Method supporting the design of drug response models that translate from pre-clinical models to tumors. The complete methodological details can be found in our pre-print [1].

Example

::
import numpy as np
from transact.TRANSACT import TRANSACT

# Generate data
n_source = 100
n_target = 200
n_features = 500

X_source = np.random.normal(size=(n_source, n_features))
y_source = X_source.dot(np.random.normal(size=(n_features)))
X_target = np.random.normal(size=(n_target, n_features))


# Create a TRANSACT instance
clf = TRANSACT(
    kernel='rbf',
    kernel_params={'gamma':1/np.sqrt(n_features)},
    n_components={'source': 20, 'target':40},
    n_jobs=1,
    verbose=1
)

# Compute consensus features
clf.fit(
    X_source,
    X_target,
    n_pv=10,
    step=100,
    with_interpolation=True
)
::
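
Once the consensus features have been computed, a predictor can be trained on the source domain and applied to the target domain. A minimal continuation of the example above, reusing the synthetic response y_source:

::
# Train an Elastic Net predictor on the source data projected
# onto the consensus features.
clf.fit_predictor(X_source, y_source)

# Predict the response of the target samples.
y_target_predicted = clf.predict(X_target)
::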

Notes

TRANSACT requires Python 3.6 or higher and the following packages: scikit-learn, numpy, scipy, joblib.
Please report any issues on GitHub, or contact me (s.mourragui@nki.nl).

References

[1] Mourragui et al. (2021), Predicting clinical drug response from model systems by non-linear subspace-based transfer learning, bioRxiv. https://www.biorxiv.org/content/10.1101/2020.06.29.177139v3

Source code
""" <h3> <b>TRANSACT</b></h3>: <b>T</b>umor <b>R</b>esponse <b>A</b>ssessment by <b>N</b>on-linear <b>S</b>ubspace
<b>A</b>lignment of <b>C</b>ell-lines and <b>T</b>umors.

@author: Soufiane Mourragui <soufiane.mourragui@gmail.com>

Method supporting the design of drug response models that translate from pre-clinical models to tumors. The complete
methodological details can be found in our <a href="https://www.biorxiv.org/content/10.1101/2020.06.29.177139v3">
pre-print</a>.
<br/><br/>


Example
-------
    ::
    import numpy as np
    from transact.TRANSACT import TRANSACT

    # Generate data
    n_source = 100
    n_target = 200
    n_features = 500

    X_source = np.random.normal(size=(n_source, n_features))
    y_source = X_source.dot(np.random.normal(size=(n_features)))
    X_target = np.random.normal(size=(n_target, n_features))


    # Create a TRANSACT instance
    clf = TRANSACT(
        kernel='rbf',
        kernel_params={'gamma':1/np.sqrt(n_features)},
        n_components={'source': 20, 'target':40},
        n_jobs=1,
        verbose=1
    )

    # Compute consensus features
    clf.fit(
        X_source,
        X_target,
        n_pv=10,
        step=100,
        with_interpolation=True
    )
    ::
    
Notes
-------

TRANSACT requires Python 3.6 or higher and the following packages: scikit-learn, numpy, scipy, joblib.
<br/>
Please report any issues on GitHub, or contact me (s.mourragui@nki.nl).


References
-------
[1] Mourragui et al. (2021), Predicting clinical drug response from model systems by non-linear subspace-based transfer
learning, bioRxiv.


"""


import numpy as np
import scipy
from joblib import Parallel, delayed

from sklearn.model_selection import GridSearchCV, KFold
from sklearn.linear_model import Ridge, ElasticNet, Lasso, LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.base import clone

from transact.pv_computation import PVComputation
from transact.interpolation import Interpolation
from transact.kernel_computer import KernelComputer


class TRANSACT:
    """
    TRANSACT is a package designed to adapt predictors of drug response from pre-clinical models to the clinic.
    <br/><br/>
    This class contains all the tasks and sub-routines required for training the domain adaptation framework, i.e.:
    <ul>
        <li> Kernel PCA decomposition on source and target independently.
         <li> Kernel principal components comparison.
         <li> Computation of Principal Vectors (PVs).
         <li> Interpolation between source and target PVs and extraction of Consensus Features (CFs).
         <li> Out-of-sample extension: project new dataset onto the consensus features.
    </ul>
    """

    def __init__(self,
                kernel='linear',
                kernel_params=None,
                n_components=None,
                n_pv=None,
                method='two-stage',
                step=100,
                n_jobs=1,
                verbose=False):
        """
        Parameters
        ----------
        kernel : str, default to 'linear'
            Name of the kernel to be used in the algorithm. Has to be compliant with
            <a href="https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.kernel_metrics.html#sklearn.metrics.pairwise.kernel_metrics">
            scikit-learn kernel</a>, e.g., "rbf", "polynomial", "laplacian", "linear", ...

        kernel_params : dict, default to None
            Parameters of the kernel (degree for polynomial kernel, gamma for RBF).
            Naming has to be compliant with scikit-learn, e.g., {"gamma": 0.0005}.

        n_components : int or dict, default to None
            Number of components for kernel PCA.
            <br/> If int, then indicates the same number of components for source and target.
            <br/> If dict, then must be of the form {'source':int, 'target':int}.

        n_pv : int, default to None
            Number of principal vectors.

        method : str, default to 'two-stage'
            Method used for computing the principal vectors. Only 'two-stage' has been implemented.

        step: int, default to 100
            Number of interpolation steps.

        n_jobs: int, default to 1
            Number of concurrent threads to use for tasks that can be parallelized.

        verbose: bool or int, default to False
            Degree of verbosity in joblib routines.
        """

        self.kernel = kernel
        self.kernel_params_ = kernel_params or {}
        self.kernel_values_ = KernelComputer(self.kernel, self.kernel_params_)

        self.source_data_ = None
        self.target_data_ = None

        self.is_fitted = False

        self.n_components = n_components
        self.n_pv = n_pv
        self.method = method
        self.step = step

        self.predictive_clf = None

        self.n_jobs = n_jobs
        self.verbose = verbose


    def fit(self,
            source_data,
            target_data,
            n_components=None,
            n_pv=None,
            method='two-stage',
            step=100,
            with_interpolation=True,
            left_center=True):

        """
        Compute the Consensus Features (CFs) onto which predictive models can be trained.
        <br/> Specifically:
        <ul>
            <li> Compute the kernel matrices.
            <li> Compute the cosine similarity matrix.
            <li> Compute principal vectors.
            <li> Interpolate between the PVs.
            <li> Find optimal interpolation time.
        </ul>

        Parameters
        ----------
        source_data : np.ndarray, dtype=float
            Source data, matrix with samples in the rows, i.e. shape (n_source_samples, n_features).
            <br/> pandas.DataFrame is supported.

        target_data : np.ndarray, dtype=float
            Target data, matrix with samples in the rows, i.e. shape (n_target_samples, n_features).
            <br/> pandas.DataFrame is supported.
            <br/><b>WARNING</b>: features need to be ordered in the same way as in source_data.

        n_components: int, default to None
            Number of components. If not set here or in __init__, the maximum possible number of principal
            components is used for source and target.

        n_pv: int, default to None
            Number of Principal Vectors. If not set here or in __init__, the maximum number of PVs will be computed.

        method : str, default to 'two-stage'
            Method used for computing the principal vectors. Only 'two-stage' has been implemented.

        step: int, default to 100
            Number of interpolation steps.

        with_interpolation: bool, default to True
            Bool indicating whether interpolation should also be fitted. Useful for computing PVs only,
            e.g. prior to null-distribution fitting (and choosing the number of PVs).

        left_center: bool, default to True
            Bool indicating whether the output should be mean-centered, i.e. whether source and target
            consensus feature values (or PVs if no interpolation) must be mean-centered independently.

        Returns
        -------
        self : TRANSACT
            Fitted instance.
        """

        # Save parameters
        self.source_data_ = source_data
        self.target_data_ = target_data
        self.method = method or self.method
        self.n_components = n_components or self.n_components
        self.n_pv = n_pv or self.n_pv
        self.step = step or self.step
        self.left_center = left_center

        # Compute kernel values
        self.kernel_values_.fit(source_data, target_data, center=False)

        # Compute principal vectors
        self.principal_vectors_ = PVComputation(self.kernel, self.kernel_params_)
        self.principal_vectors_.fit(self.source_data_,
                                    self.target_data_,
                                    method=self.method,
                                    n_components=self.n_components,
                                    n_pv=self.n_pv)

        # Stop here if interpolation should not be computed.
        if not with_interpolation:
            return self

        # Set up interpolation scheme
        self.interpolation_ = Interpolation(self.kernel, self.kernel_params_)
        self.interpolation_.fit(self.principal_vectors_, self.kernel_values_)

        # Compute optimal interpolation time
        self._compute_optimal_time(step=self.step, left_center=self.left_center)

        self.is_fitted = True

        return self


    def null_distribution_pv_similarity(self, method='gene_shuffling', n_iter=100):
        """
        Generate a null distribution for the PV similarity function:
        <ul>
            <li> Gene shuffling: genes are shuffled in the source data to destroy any structure
            existing at the gene level while preserving the sample structure. PVs are then
            recomputed and their similarities saved.
        </ul>

        Parameters
        ----------
        method : str, default to 'gene_shuffling'
            Method used for generating the null distribution.
            Only implemented method: 'gene_shuffling'.

        n_iter: int, default to 100
            Number of iterations.

        Returns
        -------
        np.ndarray, dtype=float, shape (n_iter, n_pv)
            Array containing the distribution of similarity after shuffling. Each row
            contains the values of one shuffling across PVs.
        """

        if method.lower() == 'gene_shuffling':
            null_method = self._gene_shuffling
        else:
            raise NotImplementedError('%s is not a valid method for generating the null distribution' % (method))

        null_distribution = Parallel(n_jobs=self.n_jobs, verbose=self.verbose)\
                                    (delayed(null_method)() for _ in range(n_iter))

        return np.array(null_distribution)


    def _gene_shuffling(self):
        perm = np.random.permutation(self.source_data_.shape[1])
        pv = PVComputation(self.kernel, self.kernel_params_)
        pv.fit(self.source_data_[:,perm],
            self.target_data_,
            method=self.method,
            n_components=self.n_components,
            n_pv=self.n_pv)

        return np.cos(pv.canonical_angles)


    def fit_predictor(self, X, y, alpha_values=None, l1_ratio=0.5):
        """
        Project X on consensus features and train a predictor of drug response.

        Parameters
        ----------
        X : np.ndarray of shape (n_samples, n_features), dtype=float
            Dataset to project. Features should be ordered in the same way as in source_data
            and target_data.

        y : np.ndarray of shape (n_samples, 1), dtype=float
            Output to predict.

        alpha_values : np.ndarray, default to None
            Grid of regularization strengths for the Elastic Net grid search.
            If None, np.logspace(-10, 5, 34) is used.

        l1_ratio : float, default to 0.5
            Currently unused: the grid search explores a fixed grid of l1_ratio values.

        Returns
        -------
        self : TRANSACT
            Instance with the trained predictor stored in self.predictive_clf.
        """
        self.alpha_values = alpha_values if alpha_values is not None else np.logspace(-10,5,34)
        self.l1_ratio_values = [0., .1, .2, .4, .5, .6, .8, .9, 1.]
        param_grid = {
            'regression__alpha': self.alpha_values,
            'regression__l1_ratio': self.l1_ratio_values
        }

        # Grid search setup
        self.predictive_clf = GridSearchCV(Pipeline([
                                ('regression', ElasticNet())
                                ]),\
                                cv=10,
                                n_jobs=self.n_jobs,
                                param_grid=param_grid,
                                verbose=self.verbose,
                                scoring='neg_mean_squared_error')
        self.predictive_clf.fit(self.transform(X, center=False), y)

        return self


    def compute_pred_performance(self, X, y, cv=10):
        """
        Compute the predictive performance of the fitted predictor by cross-validation
        on X and y. If no predictor has been fitted yet, one is first fitted with default parameters.

        Parameters
        ----------
        X : np.ndarray of shape (n_samples, n_features), dtype=float
            Dataset to project. Features should be ordered in the same way as in source_data
            and target_data.

        y : np.ndarray of shape (n_samples, 1), dtype=float
            Output to predict.

        cv : int, default to 10
            Number of cross-validation folds.

        Returns
        -------
        (float, float)
            Pearson correlation between the cross-validated predictions and y, and the
            corresponding p-value (output of scipy.stats.pearsonr).
        """

        kf = KFold(n_splits=cv, shuffle=True)
        X_projected = self.transform(X)

        if self.predictive_clf is None:
            print('WARNING: predictor not fitted; fitting it with default parameters.')
            self.fit_predictor(X,y)
        clf = clone(self.predictive_clf)

        y_predicted = np.zeros(X.shape[0])
        for train_index, test_index in kf.split(X_projected):
            clf.fit(X_projected[train_index], y[train_index])
            y_predicted[test_index] = clf.predict(X_projected[test_index])

        return scipy.stats.pearsonr(y_predicted, y)


    def predict(self, X):
        """
        Predict the drug response of a set of samples, i.e.:
        <ul>
            <li> Project data on consensus features.
            <li> Use the Elastic Net model to predict based on the consensus features.
        </ul>

        Parameters
        ----------
        X : np.ndarray, dtype=float
            Dataset to project, of shape (n_samples, n_features). Features should be ordered in the same way as
            in source_data and target_data.

        Returns
        -------
        np.ndarray of shape (n_samples, 1), dtype=float
            Predicted drug response values.
        """
        return self.predictive_clf.predict(self.transform(X, center=False))


    def transform(self, X, center=False):
        """
        Project a dataset X onto the consensus features.

        Parameters
        ----------
        X : np.ndarray, dtype=float
            Dataset to project, of shape (n_samples, n_features). Features should be ordered in the same way as
            in source_data and target_data.

        Returns
        -------
        np.ndarray of shape (n_samples, n_pv), dtype=float
            Dataset projected on consensus features.
        """
        return self.interpolation_.transform(X, self.optimal_time, center=center)


    def _compute_optimal_time(self, step=100, left_center=True):
        # Based on Kolmogorov-Smirnov statistics, find the optimal interpolation time for each PV.

        # Compute the interpolated values
        interpolated_values = Parallel(n_jobs=self.n_jobs, verbose=self.verbose)\
                            (delayed(self.interpolation_.project_data)(s/step, center=left_center)
                                for s in range(step+1))
        interpolated_values = np.array(interpolated_values).transpose(2,0,1)
        source_interpolated_values = interpolated_values[:,:,:self.source_data_.shape[0]]
        target_interpolated_values = interpolated_values[:,:,self.source_data_.shape[0]:]

        self.optimal_time = []
        self.ks_statistics = []
        self.ks_p_values = []

        # For each PV, find the time when interpolation has the largest overlap.
        for source_pv, target_pv in zip(source_interpolated_values, target_interpolated_values):
            self.ks_statistics.append([])
            for s, t in zip(source_pv, target_pv):
                self.ks_statistics[-1].append(scipy.stats.ks_2samp(s,t))
            self.ks_statistics[-1] = list(zip(*self.ks_statistics[-1]))
            self.ks_p_values.append(self.ks_statistics[-1][-1])
            self.ks_statistics[-1] = self.ks_statistics[-1][0]
            self.optimal_time.append(np.argmin(self.ks_statistics[-1])/step)

        # Save the different statistics
        self.optimal_time = np.array(self.optimal_time) # Optimal tau for each PV.
        self.ks_statistics = np.array(self.ks_statistics) # Computed KS statistics between each PV.
        self.ks_p_values = np.array(self.ks_p_values) # Corresponding p_values.

Classes

class TRANSACT (kernel='linear', kernel_params=None, n_components=None, n_pv=None, method='two-stage', step=100, n_jobs=1, verbose=False)

TRANSACT is a package designed to adapt predictors of drug response from pre-clinical models to the clinic.

This class contains all the tasks and sub-routines required for training the domain adaptation framework, i.e.:

  • Kernel PCA decomposition on source and target independently.
  • Kernel principal components comparison.
  • Computation of Principal Vectors (PVs).
  • Interpolation between source and target PVs and extraction of Consensus Features (CFs).
  • Out-of-sample extension: project new dataset onto the consensus features.

Parameters

kernel : str, default to 'linear'
    Name of the kernel to be used in the algorithm. Has to be compliant with scikit-learn kernels, e.g., "rbf", "polynomial", "laplacian", "linear", …
kernel_params : dict, default to None
    Parameters of the kernel (degree for polynomial kernel, gamma for RBF). Naming has to be compliant with scikit-learn, e.g., {"gamma": 0.0005}.
n_components : int or dict, default to None
    Number of components for kernel PCA.
    If int, indicates the same number of components for source and target.
    If dict, must be of the form {'source': int, 'target': int}.
n_pv : int, default to None
    Number of principal vectors.
method : str, default to 'two-stage'
    Method used for computing the principal vectors. Only 'two-stage' has been implemented.
step : int, default to 100
    Number of interpolation steps.
n_jobs : int, default to 1
    Number of concurrent threads to use for tasks that can be parallelized.
verbose : bool or int, default to False
    Degree of verbosity in joblib routines.
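
After fit, the intermediate results of the steps listed above are exposed as attributes of the fitted instance. A small sketch, using the attribute names from the source code above and the clf object from the module-level example:

::
# Similarity (cosine of the canonical angles) between source and target PVs.
pv_similarity = np.cos(clf.principal_vectors_.canonical_angles)

# Per-PV optimal interpolation time (tau), KS statistics and p-values.
tau = clf.optimal_time
ks_stats = clf.ks_statistics
ks_p = clf.ks_p_values
::
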
Methods

def compute_pred_performance(self, X, y, cv=10)

Compute the predictive performance of the fitted predictor by cross-validation on X and y. If no predictor has been fitted yet, one is first fitted with default parameters.

Parameters

X : np.ndarray of shape (n_samples, n_features), dtype=float
    Dataset to project. Features should be ordered in the same way as in source_data and target_data.
y : np.ndarray of shape (n_samples, 1), dtype=float
    Output to predict.
cv : int, default to 10
    Number of cross-validation folds.

Returns

(float, float)
    Pearson correlation between the cross-validated predictions and y, and the corresponding p-value (output of scipy.stats.pearsonr).
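
A short usage sketch, reusing clf and the synthetic data from the module-level example:

::
# 10-fold cross-validated performance on the projected data.
r, p_value = clf.compute_pred_performance(X_source, y_source, cv=10)
print('Pearson r = %.3f (p = %.2e)' % (r, p_value))
::
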
def fit(self, source_data, target_data, n_components=None, n_pv=None, method='two-stage', step=100, with_interpolation=True, left_center=True)

Compute the Consensus Features (CFs) onto which predictive models can be trained.
Specifically:

  • Compute the kernel matrices.
  • Compute the cosine similarity matrix.
  • Compute principal vectors.
  • Interpolate between the PVs.
  • Find optimal interpolation time.

Parameters

source_data : np.ndarray, dtype=float
    Source data, matrix with samples in the rows, i.e. shape (n_source_samples, n_features). pandas.DataFrame is supported.
target_data : np.ndarray, dtype=float
    Target data, matrix with samples in the rows, i.e. shape (n_target_samples, n_features). pandas.DataFrame is supported.
    WARNING: features need to be ordered in the same way as in source_data.
n_components : int, default to None
    Number of components. If not set here or in __init__, the maximum possible number of principal components is used for source and target.
n_pv : int, default to None
    Number of Principal Vectors. If not set here or in __init__, the maximum number of PVs will be computed.
method : str, default to 'two-stage'
    Method used for computing the principal vectors. Only 'two-stage' has been implemented.
step : int, default to 100
    Number of interpolation steps.
with_interpolation : bool, default to True
    Bool indicating whether interpolation should also be fitted. Useful for computing PVs only, e.g. prior to null-distribution fitting (and choosing the number of PVs).
left_center : bool, default to True
    Bool indicating whether the output should be mean-centered, i.e. whether source and target consensus feature values (or PVs if no interpolation) must be mean-centered independently.

Returns

self : TRANSACT
Fitted instance.
def fit_predictor(self, X, y, alpha_values=None, l1_ratio=0.5)

Project X on consensus features and train a predictor of drug response.

Parameters

X : np.ndarray of shape (n_samples, n_features), dtype=float
    Dataset to project. Features should be ordered in the same way as in source_data and target_data.
y : np.ndarray of shape (n_samples, 1), dtype=float
    Output to predict.
alpha_values : np.ndarray, default to None
    Grid of regularization strengths for the Elastic Net grid search. If None, np.logspace(-10, 5, 34) is used.
l1_ratio : float, default to 0.5
    Currently unused: the grid search explores a fixed grid of l1_ratio values.

Returns

self : TRANSACT
    Instance with the trained predictor stored in self.predictive_clf.
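
A usage sketch continuing the module-level example; the alpha grid below is an arbitrary illustration:

::
# Train the response predictor with a custom grid of regularization strengths.
clf.fit_predictor(X_source, y_source, alpha_values=np.logspace(-5, 5, 11))

# Apply the trained model to the target samples.
y_target_predicted = clf.predict(X_target)
::
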
def null_distribution_pv_similarity(self, method='gene_shuffling', n_iter=100)

Generate a null distribution for the PV similarity function:

  • Gene shuffling: genes are shuffled in the source data to destroy any structure existing at the gene level while preserving the sample structure. PVs are then recomputed and their similarities saved.

Parameters

method : str, default to 'gene_shuffling'
    Method used for generating the null distribution. Only implemented method: 'gene_shuffling'.
n_iter : int, default to 100
    Number of iterations.

Returns

np.ndarray, dtype=float, shape (n_iter, n_pv)
Array containing the distribution of similarity after shuffling. Each row contains the values of one shuffling across PVs.
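
A sketch of how this null distribution can support the choice of the number of PVs; the 95th-percentile threshold is an illustrative choice, not prescribed by the package:

::
# Compute PVs without interpolation, then compare the observed PV
# similarities to the null distribution obtained by gene shuffling.
clf.fit(X_source, X_target, n_pv=10, with_interpolation=False)
observed = np.cos(clf.principal_vectors_.canonical_angles)
null = clf.null_distribution_pv_similarity(method='gene_shuffling', n_iter=100)

# Flag PVs whose similarity exceeds the 95th percentile of the null.
threshold = np.percentile(null, 95, axis=0)
print(observed > threshold)

# After choosing the number of PVs, refit with interpolation to obtain the CFs.
::
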
def predict(self, X)

Predict the drug response of a set of samples, i.e.:

  • Project data on consensus features.
  • Use the Elastic Net model to predict based on the consensus features.

Parameters

X : np.ndarray, dtype=float
Dataset to project, of shape (n_samples, n_features). Features should be ordered in the same way as in source_data and target_data.

Returns

np.ndarray of shape (n_samples, 1), dtype=float
Predicted drug response values.
def transform(self, X, center=False)

Project a dataset X onto the consensus features.

Parameters

X : np.ndarray, dtype=float
Dataset to project, of shape (n_samples, n_features). Features should be ordered in the same way as in source_data and target_data.

Returns

np.ndarray of shape (n_samples, n_pv), dtype=float
Dataset projected on consensus features.
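
A short sketch of out-of-sample projection, reusing the fitted clf from the module-level example:

::
# Project the target samples onto the consensus features.
X_target_cf = clf.transform(X_target)
print(X_target_cf.shape)  # (200, 10): n_target samples, n_pv consensus features
::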