=========== Description: =========== This is the main program to perform Functional Principal Component Analysis (FPCA) via PACE (principal analysis via conditional expectation). The principal component scores can be estimated through conditional expectation or via classical integration. For the latter, one can choose a shrinkage method for estimated scores. See Reference [2]. For shrinkage, see Reference [1]. ====== Usage: ====== function [no_opt,sigma,lambda,phi,eigen,xi_est,xi_var,mu,bw_mu,xcov,bw_xcov, xcovfit, AIC,BIC,FVE,y_pred, y_predOrig, out1,out21,... y,t, regular, rho_opt, mucopy, phicopy, eigencopy, out1copy, out21copy, xcovcopy, xcovfitcopy, xcorr]... = PCA(y,t,bwmu,bwmu_gcv, bwxcov,bwxcov_gcv,ntest1,ngrid1,selection_k, FVE_threshold, maxk,control,regular,error, ngrid,method,shrink,... newdata,kernel, numBins, yname, screePlot, designPlot, corrPlot, rho, verbose,method_mu) ================ Input Arguments: ================ Input y: 1*n cell array, y{i} is the vector of measurements for the ith subject, i=1,...,n. Input t: 1*n cell array, t{i} is the vector of time points for the ith subject for which corresponding measurements y{i} are available, i=1,...,n. Input bwmu: scalar bwmu>=0, bandwidth for mean curve mu, 0: use cross-validation or generalized cross-validation to choose bandwidth automatically. [Default] bwmu(>0): user-specified bandwidth. Input bwmu_gcv: For choice bwmu = 0, two types of cross-validation can be performed: One-curve-leave-out cross-validation (CV) or generalized cross-validation (GCV) 0: CV (may be time-consuming) 1: GCV (faster) [Default] Input bwxcov: 1*2 vector, bandwidths for covariance surface used for smoothing of cov(X(t),X(s)) bwxcov(i): ith coordinate of bandwidth vector, i=1,2. bwxcov(1)==0 & bwxcov(2)==0: use cross-validation (CV) or generalized cross-validation (GCV) for automatic selection. [Default] bwxcov(1)>0 & bwxcov(2)>0: user-specified bandwidths. Input bwxcov_gcv: If setting bwxcov = [0 0], automatic bandwidth selection by CV or GCV choices 0: CV method (may be time-consuming, can be accelerated by choosing small values for ntest1 and ngrid1). 1: GCV method (faster) [Default] Input ntest1: integer(<=n), number of curves used for CV when choosing bandwidths for smoothing the covariance surface. The subjects in the test set are randomly selected from n subjects. Small ntest1 will accelerate CV at less accuracy. [Default is 30.] Input ngrid1: integer, number of support points for the covariance surface in the CV procedure (selecting bandwidths of covariance). Note that input ntest1 and ngrid1 provide options to save computing time when using CV or GCV. [Default is 30.] Input selection_k: the method of choosing the number of principal components K. 'AIC1': use AIC criterion with pseudo-likelihood of measurements (marginal likelihood). 'AIC2': use AIC criterion with likelihood of measurements conditional on estimated random coeffecients. 'BIC1': use BIC criterion with pseudo-likelihood of measurements (marginal likelihood). [Default] 'BIC2': use BIC criterion with likelihood measurements conditional on estimated random coeffecients. 'FVE' (fraction of variance explained) : use scree plot approach to select number of principal components), see "FVE_threshold" below. positive integer K: user-specified number of principal components Note: BIC1 and FVE produce the most parsimonious models. Input FVE_threshold: a positive number that is between 0 and 1 [Default is 0.85.] It is used with the option selection_k = 'FVE' to select the number of principal components that explain at least "FVE_threshold" of total variation (the fraction of variance explained). Input maxk: integer, the maximum number of principal components to consider if using automatic methods to choose K, i.e., 'AIC1', 'AIC2', 'BIC1' or 'BIC2' defined by selection_k. [Default is 20.] Note: when selection_k = 'FVE' or 'AIC_R', maxk is ignored. Input control: 'auto', Select K by minimizing AIC or BIC, or find the first K such that the FVE_threshold is exceeded. [Default] 'look', a scree plot (FVE% Vs No. of PC) will be generated based on K <= 15. User will be prompted to enter user-specified K after viewing scree plot. This can be combined with any setting of selection_k. Input regular: 0, sparse (or irregular) functional data. 1, regular data with missing values 2, completely balanced (regular) data. [], automatically chosen based on the design in t. [Default] Input error: 0, no additional measurement error assumed. 1, additional measurement error is assumed. [Default] Input ngrid: integer, number of support points in each direction of covariance surface when performing principal component analysis ( ngrid > K). [Default is 51.] Input method: used for computing random effects \xi_{ik} 'CE': conditional expectation method [Default] 'IN': classical integration method Note: 'CE' can be applied for sparse data or regular data, but 'IN' only in the case of regular data. Input shrink: indicator of whether applying shrinkage to estimates of random coefficients (for regular data only) 0: no shrinkage when method = 'CE' or error = 0 [Default] 1: shrinkage when method = 'IN' and error =1, otherwise, this will be re-set to 0. Input newdata: a row vector of user-defined output time grids for all curves. This corresponds to "out1" in the output argument If newdata = [], then "out1" corresponds to the set of distinct time points from the pooled data. "newdata" is supposed to be a vector in ascending order on the domain of the functions. [Default is []] Input kernel: a character string to define the kernel to be used in the 1-D or 2-D smoothing kernel = 'epan' ==> Epanechnikov kernel [Default for dense designs with n_i >= 20] 'rect' ==> Rectangular kernel 'gauss' ==> Gaussian kernel [Default for sparse designs, regular designs with missings, dense designs for n_i < 20] Note: The Gaussian kernel is overall best for sparse designs but is slower than the other kernels and if computational speed is of concern then one may wish to use the Epanechnikov kernel also for the case of sparse designs. Input numBins: 0: no binning a positive interger (>= 10): prebin the data with user-defined number of bins. When numBins < 10, no binning will be performed. []: prebin the data with the following rule [Default] i) When the input data is regular = 1 or 2 m = max of n_i, where n_i is the number of repeated measurements for ith subject. ii) regular = 0 m = median of n_i When m <= 20 subjects, no binning. When n <= 5000 subjects and m <= 400, no binning. When n <= 5000 subjects and m > 400, numBins = 400. When n > 5000 subjects, compute m* = max(20, (5000-n)*19/2250+400) if m > m*, numBins = m* if m <= m*, no binning This option accelerates analysis, especially for data with many time measurements. Input yname: a character string which denotes the name of the current function to be estimated. [Default is []] It is used in the title part of the scree plot output. If it is set to [], then yname is set to be the same name as the first input argument from PCA() or FPCA(). For example, if this analysis is for function y, then yname = 'y'. Input screePlot: indicator of whether to create the scree plot 1 a scree plot will be created 0 no scree plot will be created [Default] Input designPlot: indicator of whether to create the design plot 1 a design plot will be created 0 no design plot will be created [Default] Interpretation of design plot: All subareas of the domain x domain support square of the covariance surface need to be covered more or less homogeneously with pairs of design points. Input corrPlot: indicator of whether to create the correlation surface plot 1 a correlation surface plot will be created 0 no correlation surface plot will be created [Default] Input rho: truncation threshold for the iterative residual that is used in the estimation of FPC scores. (see FPCscore.pdf under Help/ for more details) -1: compute unadjusted FPC scores (as in previous PACE versions) >0: user-defined choice of rho 0: do not set truncation threshold, but use iterative residuals for sigmanew (see in output description below) 'cv-random': a character string which specifies to use randomized leave-one-measurement-out CV approach to find the optimal value of rho. Note that this choice contains a random element and therefore the analysis is not exactly replicable when running the program twice. 'cv': use non-randomized leave-one-measurement-out CV approach to find the optimal value of rho. [default] Input verbose: a character string 'on': display diagnostic messages [Default] 'off': suppress diagnostic messages Input xmu: 1*N vector, user-defined mean functions valued at distinct input time points, in ascending order from all subjects, corresponding to out1. Input method_mu: method to estimate mu 'RARE': using response-adaptive method to update estimated 'PACE': using the original method [default] ======== Details: ======== 1) bwmu = 0 and bwmu_gcv = 1 ==> use GCV method to choose the bandwidth for mean function For Gaussian kernel, the optimal bandwidth from GCV is multiplied by 1.1. bwmu = 0 and bwmu_gcv = 0 ==> use CV method to choose the bandwidth for mean function bwmu > 0 then bwmu_gcv will be ignored, subjective bandwidth choice 2) bwxcov = [0 0] and bwxcov_gcv = 1 ==> use GCV method to choose the bandwidth for cov function For Gaussian kernel, the optimal bandwidth from GCV is multiplied by 1.1. bwxcov = [0 0] and bwxcov_gcv = 0 ==> use CV method to choose the bandwidth for cov function bwxcov(1) > 0 and bwxcov(2) > 0 then bwxcov_gcv will be ignored, subjective bandwidth choice If eigenvalues estimation is of primary interest, you may want to undersmooth the covariance surface for better estimates of eigenvalues. 3) If error=1, AIC1, BIC1, AIC2, BIC2, FVE all can be used for choosing K. If error=0, only AIC1, BIC1 and FVE can be used by definition of these criteria. 4) If some arguments are not needed or you want to use default values, just use [] for these arguments. Alternatively, use setOptions() to define nessary input arguments and call PCA through FPCA. 5) In general, any output in the form of 1-dimensional vectors will be a row vector and any output related to 1-dimensional cell arrays will be a row cell array. ================= Output Arguments: ================= Output no_opt: integer, automatically or subjectively selected value of K, the number of selected components. Output sigma: scalar, estimate of measurement error variance if error=1, while it is [] if error=0. Output lambda: 1*K vector, estimated eigenvalues (variances of functional principal components scores). Output phi: N*K matrix, estimated principal component functions valued at distinct input time points with ascending order of all subjects, corresponding to out1 Output eigen: ngrid*K matrix, estimated principal component functions, valued at out21, ngrid of the pooled distinct time points with ascending order of all subjects, phi is an interpolated version of eigen at out1 Output xi_est: n*K matrix, predictions for random coeffecients (PC scores) for n subjects. Output xi_var: K*K matrix, Var(PC score)-Var(estimated PC score). The omega matrix in equation (7) of the paper, which is used to construct the point-wise C.I. for X_i(t) Output mu: 1*N vector, estimated mean functions valued at distinct input time points (newdata = []), in ascending order from all subjects, corresponding to out1; when newdata is defined, corresponds to the time points from newdata, same as out1. Output bw_mu: scalar(>0), automatically or subjectively selected bandwidth for smoothing mean curve. Output xcov: ngrid*ngrid matrix, smoothed covariance surface (diagnal removed), corresponding to out21 Output bw_xcov: 1*2 vector(>0), automatically or subjectively selected bandwidths for smoothing covariance surface. Output xcovfit: ngrid * ngrid matrix, fitted covariance surface, based on truncated estimate of eigenvalues ("lambda") and principal component functions ("eigen"), corresponding to out21 Output xcorr: ngrid * ngrid matrix, fitted correlation surface, based on truncated estimate of eigenvalues ("lambda") and principal component functions ("eigen"), corresponding to out21 Output AIC: 1*K vector, AIC values obtained when choosing K from K=1 to K=maxk, where AIC(K) is the minimum. If AIC method is not applied, it is [] Output BIC: 1*K vector, BIC values obtained when choosing K from K=1 to K=maxk, where BIC(K) is the minimum. If BIC method is not applied, it is [] Output FVE: 1*ngrid vector of fraction of variance explained Output y_pred: cell array, y_pred{i} is the vector of predictions for the ith subject evaluated at time points from the output grid vector "out1". Output y_predOrig: cell array, y_predOrig{i} is the vector of predictions for the ith subject at the same time points as the input. Output out1: 1*N vector, distinct input time points with ascending order from all subjects if newdata = []; otherwise, it is the same as newdata. Output out21: 1*ngrid vector, a grid of time points for which the smoothed covariance surface assumes values, i.e., ngrids from out1. Output y: if no binning is performed, same as the input y if binning is performed, 1 * n cell array, y{i} is a vector of measurements after binning for subject i, i = 1,...,n Output t: if no binning is performed, same as the input t if binning, 1 * n cell array, t{i} is a vector of time points after binning for subject i, i = 1,...,n Each value of t{i} corresponds to the midpoints of each bin. Output regular: if no binning is performed or regular = 2, this is the same as the input if binning is performed and regular = 0, it will be reset to regular = 1. In other words, after binning, the sparse and irregular case is analyzed as regular with missings by sampled data. Output rho_opt: if rho is set as 'cv' or 'cv-random', then rho_opt is the optimal rho obtained from the CV method, otherwise, it is the same as the input rho When rho is 'cv', rho_opt is non-random for a give data set When rho is 'cv-random', rho_opt can be different due to the randomly leave-out measurements for the same data set. Output sigmanew: if rho is set as >0, 0 or 'cv', then sigmanew is the iterative residual sum of squares (see FPCscore.pdf for more details). It can be used as an estimate of the variance of the measurement errors. if rho is set as -1, then sigmanew is set to the same as output sigma. See also FPCA, setOptions, showOptionNames, example