The PCA Guidebook: Practical, Intuitive, and Thorough
Quick start
- Paste your table above (rows = observations, columns = numeric variables). Headers are auto-detected.
- Pick Basis: use Correlation when variables have different units or spreads; use Covariance when scales are comparable and you want large-variance features to dominate.
- Choose how to treat missing values: Drop rows (listwise deletion) or Impute
means (simple baseline).
- Click Run PCA. Read the Summary, then scan the Scree (elbow), Loadings (variable weights), Scores (observation positions), and the Correlation Circle (quality of representation on PC1–PC2).
- Export Scores and Loadings as CSV for downstream analysis.
PCA in plain language
PCA is a smart rotation of your data cloud. Imagine plotting every observation in p-dimensional space.
PCA finds perpendicular axes (principal components) that capture as much spread (variance) as possible,
one after another. PC1 points where the data is widest; PC2 is the
next widest direction orthogonal to PC1, and so on. Because these axes are uncorrelated, they remove
redundancy and make structure easier to see.
Pre-processing: what to do before PCA
- Centering (subtract column means) is essential. This tool centers automatically for
both covariance and correlation PCA.
- Scaling: Correlation PCA divides by the sample standard deviation (n−1 denominator), putting all variables on the same footing. Use this when units differ (cm vs. kg vs. $) or when some variables are naturally high-variance.
- Missing values: We support listwise deletion or mean imputation. For serious work
consider EM/PPCA/kNN imputation; the choice can change components (see the sketch after this list).
- Outliers: A few extreme points can rotate PCs. Inspect scores and consider robust
methods (see Robustness).
- Collinearity: Highly correlated variables are fine (PCA thrives on correlation), but perfect duplicates yield zero-variance directions.
- Categorical variables: PCA expects numeric, continuous inputs. If needed, one-hot encode categories, then consider correlation PCA and interpret carefully.
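A minimal NumPy sketch of these pre-processing choices (not this tool's implementation; the function and parameter names are illustrative):

```python
import numpy as np

def preprocess(X, basis="correlation", missing="impute"):
    """Center (and optionally standardize) a numeric n x p array.

    A minimal sketch of the steps above; the parameter names are illustrative,
    not this tool's API. Returns the prepared matrix plus the column means and
    stds needed to project new data later.
    """
    X = np.asarray(X, dtype=float)
    if missing == "drop":                            # listwise deletion
        X = X[~np.isnan(X).any(axis=1)]
    else:                                            # mean imputation (simple baseline)
        X = np.where(np.isnan(X), np.nanmean(X, axis=0), X)
    means = X.mean(axis=0)
    Xc = X - means                                   # centering is always applied
    stds = None
    if basis == "correlation":
        stds = X.std(axis=0, ddof=1)                 # sample std with n-1 denominator
        Xc = Xc / stds                               # equal footing for all variables
    return Xc, means, stds
```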
Mathematics of PCA (clear but compact)
Let X be the n×p data matrix after centering (and optional standardizing). The sample covariance (or correlation) matrix is:
C = XᵀX / (n − 1)
PCA solves the eigenproblem and orthonormality conditions:
C v_j = λ_j v_j, with v_iᵀ v_j = 1 if i = j and 0 otherwise, and λ_1 ≥ λ_2 ≥ … ≥ λ_p ≥ 0
The scores and a low-rank reconstruction are:
T = X V and X̂_k = T_k V_kᵀ
(Add back the column means you subtracted to return to the original space.)
Explained variance ratio for PC j is:
EVR_j = λ_j / (λ_1 + λ_2 + … + λ_p)
SVD view (why it's numerically stable)
Singular Value Decomposition factors the centered/standardized matrix as:
X = U Σ Vᵀ
The eigenvalues of C relate to the singular values via:
λ_j = σ_j² / (n − 1)
Scores are equivalently:
T = U Σ
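As an illustration of the formulas above, the following NumPy sketch runs both routes on random data and checks that they agree (up to sign flips of the eigenvector columns); it is a sketch, not this tool's code:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
Xc = X - X.mean(axis=0)                   # centered n x p matrix

# Eigen route: C = X^T X / (n - 1), C v_j = lambda_j v_j
n = Xc.shape[0]
C = Xc.T @ Xc / (n - 1)
eigvals, V = np.linalg.eigh(C)            # eigh returns ascending order
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], V[:, order]

# SVD route: X = U S V^T, lambda_j = s_j^2 / (n - 1), T = U S = X V
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
assert np.allclose(eigvals, s**2 / (n - 1))   # same eigenvalues (columns of V may flip sign)

T = Xc @ V                                # scores
evr = eigvals / eigvals.sum()             # explained variance ratio per PC
k = 2
X_hat = T[:, :k] @ V[:, :k].T             # rank-k reconstruction (still centered)
```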
Scores, Loadings, Communalities & Contributions
Scores place each observation in PC space (T = X · V); loadings are the variable weights that define each PC; communality is the share of a variable's variance captured by the retained PCs; a variable's contribution to a PC is its squared loading expressed as a share of that PC.
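A self-contained NumPy sketch of these quantities, assuming correlation-basis PCA so that variable-PC correlations take the simple form below (the example data are random and purely illustrative):

```python
import numpy as np

# Illustrative correlation-basis PCA on random data; in practice use your own fit.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
eigvals, V = np.linalg.eigh(Xs.T @ Xs / (len(Xs) - 1))
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], V[:, order]

k = 2                                                  # retained components
var_pc_corr = V * np.sqrt(eigvals)                     # corr(variable i, PC j) for standardized data
communality = (var_pc_corr[:, :k] ** 2).sum(axis=1)    # variance of each variable captured by k PCs
contribution = 100 * V**2                              # % contribution of each variable to each PC (columns sum to 100)
```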
Choosing the number of components (k)
- Scree elbow: Look for the bend in the scree plot where additional PCs add little
variance.
- Cumulative EVR: Keep PCs until 80–95% of variance is explained (context-dependent: scientific data often uses ≥90%); see the sketch after this list.
- Kaiser criterion (only for correlation PCA): keep PCs with
λ > 1
(they explain more variance than a single standardized variable). Use with caution.
- Broken-stick: Compare eigenvalues to a random "broken-stick" distribution; keep
those exceeding the expectation.
- Parallel analysis: Compare to eigenvalues from random data with the same shape;
keep PCs exceeding random. (Not implemented here; run offline for rigor.)
- Cross-validation: For predictive tasks (PCR), pick k that maximizes validation
performance.
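A rough NumPy sketch of the eigenvalue-only rules above (cumulative EVR, Kaiser, broken-stick); `suggest_k` and its 90% target are illustrative choices, not part of this tool:

```python
import numpy as np

def suggest_k(eigvals, target=0.90):
    """Rough heuristics for choosing k from the eigenvalues alone (illustrative)."""
    eigvals = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    p = len(eigvals)
    evr = eigvals / eigvals.sum()
    k_cumulative = int(np.searchsorted(np.cumsum(evr), target) + 1)  # cumulative-EVR rule
    k_kaiser = int((eigvals > 1).sum())                              # Kaiser: correlation PCA only
    broken_stick = np.array([(1.0 / np.arange(j, p + 1)).sum() / p for j in range(1, p + 1)])
    k_broken = int((evr > broken_stick).sum())                       # broken-stick rule
    return {"cumulative": k_cumulative, "kaiser": k_kaiser, "broken_stick": k_broken}

print(suggest_k([2.8, 1.1, 0.7, 0.3, 0.1]))
```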
Interpreting components like a pro
- Read loadings first. Identify which variables drive PC1, PC2, … High positive vs. negative weights can indicate meaningful trade-offs (e.g., price ↑ while efficiency ↓).
- Use the scores plot to detect groups/outliers. Clusters in PC1–PC2 space often
correspond to meaningful segments; extreme scores flag outliers or novel cases.
- Check the correlation circle. Variables close together are positively correlated; opposite sides indicate negative correlation; near-orthogonal ≈ weakly related.
- Relate back to the domain. Components are combinations of variables, so name them by what they measure (e.g., "overall size", "sweetness vs. acidity", "market risk").
- Remember non-uniqueness. If λ's are tied or nearly equal, the corresponding PCs can
rotate within their subspace. Focus on the subspace, not exact axes.
High-dimensional case (p ≫ n)
When variables outnumber observations, at most n−1 eigenvalues are non-zero. PCA still works and is often essential. Computation is faster via the SVD of X or by eigendecomposing the small n×n matrix XXᵀ and mapping its eigenvectors back to the variable space. Interpretation is the same.
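A small NumPy sketch of the p ≫ n case, checking that the economy SVD and the n×n Gram-matrix route agree on the non-zero eigenvalues (dimensions chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 2000                           # far more variables than observations
X = rng.normal(size=(n, p))
Xc = X - X.mean(axis=0)

# Economy SVD works directly on the n x p matrix: only min(n, p) components exist.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
eigvals = s**2 / (n - 1)                  # at most n-1 are non-zero
T = U * s                                 # scores

# Equivalent dual route: eigendecompose the small n x n Gram matrix X X^T.
# (Its eigenvectors map back to loadings via v_j = X^T u_j / s_j.)
G = Xc @ Xc.T
gvals, gvecs = np.linalg.eigh(G)
assert np.allclose(np.sort(gvals)[::-1][: n - 1] / (n - 1), eigvals[: n - 1])
```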
Outliers & robustness
- Diagnostics: Inspect score distances (e.g., Mahalanobis) to find leverage points
that can twist PCs (see the sketch after this list).
- Mitigations: Winsorize/extreme-clip, transform (log), or use robust PCA variants (e.g., M-estimators, S-estimators, median-based methods). This tool performs classical PCA; pair with robust prep if needed.
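One possible implementation of the score-distance diagnostic: because PC scores are uncorrelated with variances λ_j, a Mahalanobis-style distance in the retained subspace reduces to a sum of squared standardized scores. `score_distances` is an illustrative helper, not part of this tool:

```python
import numpy as np

def score_distances(T, eigvals, k):
    """Mahalanobis-style distance of each observation in the first k PCs.

    PCs are uncorrelated with variances eigvals, so the distance is just the
    square root of a sum of squared standardized scores. Illustrative helper.
    """
    d2 = (T[:, :k] ** 2 / eigvals[:k]).sum(axis=1)
    return np.sqrt(d2)

# Large distances flag leverage points; a common approximate cutoff is the
# square root of a chi-square quantile with k degrees of freedom, e.g. via
# scipy.stats.chi2.ppf(0.975, k) if SciPy is available.
```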
Advanced PCA topics
- Whitening: Map to uncorrelated, unit-variance features via Z = T_k Λ_k^(−1/2), i.e., divide each retained score column by √λ_j. Useful for some ML pipelines; beware of amplifying noise for tiny eigenvalues (a sketch follows this list).
- PCR (Principal Components Regression): Regress a target on PC scores to mitigate
multicollinearity.
- Sparse PCA: Encourages loadings with many zeros for interpretability.
- Kernel PCA: Applies PCA in a nonlinear feature space via kernels (RBF, polynomial)
for curved manifolds.
- t-SNE/UMAP vs. PCA: Nonlinear methods are great for visualization of clusters but
are not linear, global, or easily invertible; start with PCA.
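A minimal sketch of the whitening step from the list above, assuming scores T and eigenvalues from an earlier fit; the eps guard is an illustrative safeguard against near-zero eigenvalues:

```python
import numpy as np

def whiten(T, eigvals, k, eps=1e-12):
    """PCA whitening: divide each retained score column by sqrt(lambda_j).

    Implements Z = T_k * Lambda_k^(-1/2) from the list above; eps is an
    illustrative guard against near-zero eigenvalues, which whitening would
    otherwise blow up into noise.
    """
    return T[:, :k] / np.sqrt(eigvals[:k] + eps)
```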
Domain-specific tips
- Finance: Yield curves typically yield PCs interpretable as level, slope, curvature.
Use correlation PCA if mixing units; otherwise covariance can highlight dominant risk factors.
- Biology/Genomics: Center and standardize; PC1/PC2 often capture batch effects or
population structure. Always check for confounders.
- Manufacturing/QC: PCA detects process drift and latent failure modes; monitor
scores over time.
- Imaging/Signals: PCA ≈ the Karhunen–Loève transform, great for denoising/compression;
mind spatial structure when interpreting loadings.
Common pitfalls
- Mixing standardized and unstandardized variables when using covariance PCA.
- Over-interpreting signs (they may flip between runs or tools).
- Assuming PCs imply causality; they summarize variance, not mechanisms.
- Keeping too many PCs (overfitting) or too few (information loss). Use scree + EVR + domain sense.
- Projecting new data without applying the same centering/standardization as training.
FAQ (extended)
How do I apply these loadings to new data? Store your training means (and stds for correlation PCA). For a new row x, compute x′ = (x − mean)/std as appropriate, then scores = x′ · V_k. This tool reports means and stds so you can replicate preprocessing.
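A sketch of that projection, assuming you stored the training means, stds, and the first k loading columns V_k (the helper name and arguments are illustrative):

```python
import numpy as np

def project_new(x_new, means, stds, V_k, standardized=True):
    """Project new rows onto the training PCs using the stored preprocessing.

    Assumes `means`, `stds`, and the loading matrix `V_k` (p x k) were saved
    from the training run; the helper name and signature are illustrative.
    """
    x = np.atleast_2d(np.asarray(x_new, dtype=float)) - means
    if standardized:                       # correlation PCA: also divide by training stds
        x = x / stds
    return x @ V_k                         # scores of the new observations
```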
Why don't my results match another package exactly? PCA is unique up to sign flips; small differences arise from numerical methods, missing-value handling, and whether covariance or
correlation was used.
Can I rotate PCs (e.g., varimax)? Rotation is a factor analysis concept. PCA
already yields orthogonal components that maximize variance; rotated solutions optimize different
criteria.
Does scaling change scores? Yes: correlation PCA gives each variable equal variance, shifting both loadings and scores; covariance PCA lets high-variance variables dominate.
Glossary
- Scores (T): coordinates of observations in PC space (T = X · V).
- Loadings (V): weights defining PCs in terms of original variables (eigenvectors of
C).
- Eigenvalue (λ): variance captured by a PC.
- Explained Variance Ratio (EVR): λ_j / (λ_1 + … + λ_p), the fraction of total variance captured by PC j.
- Communality: how much of a variable's variance is captured by the retained PCs.
- Whitening: rescaling scores to unit variance and zero correlation.