The PCA Guidebook: Practical, Intuitive, and Thorough
Quick start
-
Paste your table above (rows = observations, columns = numeric
variables). Headers are auto‑detected.
-
Pick Basis: use Correlation when variables
have different units or spreads; use
Covariance when scales are comparable and you want
large‑variance features to dominate.
-
Choose how to treat missing values:
Drop rows (listwise deletion) or
Impute means (simple baseline).
-
Click Run PCA. Read the Summary, then scan
the Scree (elbow), Loadings (variable weights),
Scores (observation positions), and the
Correlation Circle (quality of representation on PC1–PC2).
-
Export Scores and Loadings as CSV
for downstream analysis.
PCA in plain language
PCA is a smart rotation of your data cloud. Imagine plotting every
observation in p‑dimensional space. PCA finds perpendicular axes
(principal components) that capture as much spread (variance) as
possible, one after another. PC1 points where the
data is widest; PC2 is the next widest direction
orthogonal to PC1, and so on. Because these axes are uncorrelated,
they remove redundancy and make structure easier to see.
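To make the rotation concrete, here is a minimal NumPy sketch (not part of this tool) that builds a tilted two‑dimensional cloud and recovers its widest direction; the array names and the 0.6 slope are made up for illustration.
    import numpy as np

    rng = np.random.default_rng(0)
    # Simulate a tilted 2-D cloud: most of the spread lies along one diagonal.
    x = rng.normal(size=500)
    y = 0.6 * x + 0.2 * rng.normal(size=500)
    X = np.column_stack([x, y])

    Xc = X - X.mean(axis=0)               # center the cloud
    C = np.cov(Xc, rowvar=False)          # 2x2 covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)  # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]     # sort descending: PC1 first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    print("PC1 direction:", eigvecs[:, 0])              # roughly +/-(1, 0.6) normalized; the sign is arbitrary
    print("explained variance:", eigvals / eigvals.sum())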
Pre‑processing: what to do before PCA
-
Centering (subtract column means) is essential.
This tool centers automatically for both covariance and correlation
PCA.
-
Scaling: Correlation PCA divides each centered variable by its sample
standard deviation (n−1 denominator), putting all variables on the same
footing. Use this when units differ (cm vs. kg vs. $) or when some
variables are naturally high‑variance. (A preprocessing sketch follows
this list.)
-
Missing values: We support listwise deletion or
mean imputation. For serious work consider EM, probabilistic PCA (PPCA),
or k‑NN imputation; the choice can change the components.
-
Outliers: A few extreme points can rotate PCs.
Inspect scores and consider robust methods (see
Robustness).
-
Collinearity: Highly correlated variables are
fine—PCA thrives on correlation—but perfect duplicates yield
zero‑variance directions.
-
Categorical variables: PCA expects numeric,
continuous inputs. If needed, one‑hot encode categories, then
consider correlation PCA and interpret carefully.
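As a rough illustration of the preprocessing described above, the sketch below mean‑imputes, centers, and optionally standardizes a numeric array before PCA. It assumes NumPy, a made‑up helper name (preprocess), and NaN as the missing‑value marker; it is a simple baseline, not this tool's exact implementation.
    import numpy as np

    def preprocess(X, standardize=True):
        """Mean-impute, center, and (optionally) standardize an (n, p) array.
        NaN marks missing values (an assumption for this sketch)."""
        X = np.asarray(X, dtype=float).copy()
        col_means = np.nanmean(X, axis=0)
        rows, cols = np.where(np.isnan(X))
        X[rows, cols] = col_means[cols]          # simple mean imputation
        X -= X.mean(axis=0)                      # centering is always applied
        if standardize:                          # correlation PCA
            X /= X.std(axis=0, ddof=1)           # sample std (n-1); drop constant columns first
        return X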
Mathematics of PCA (clear but compact)
Let X be the n×p data matrix after centering (and optional
standardizing). The sample covariance (or correlation) matrix is
C = XᵀX / (n−1).
PCA solves the eigenproblem with orthonormality conditions:
C v_j = λ_j v_j, with VᵀV = I and λ_1 ≥ λ_2 ≥ … ≥ λ_p ≥ 0.
The scores and a low‑rank reconstruction are T = X·V and
X̂_k = T_k·V_kᵀ.
(Add back the column means you subtracted to return to the original
space.)
Explained variance ratio for PC j is EVR_j = λ_j / (λ_1 + … + λ_p).
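These formulas translate directly into code. Below is a compact NumPy sketch (illustrative only; the function name pca_eig is made up) that eigendecomposes C and returns scores, loadings, eigenvalues, and a rank‑k reconstruction in centered/standardized space.
    import numpy as np

    def pca_eig(X, k, standardize=False):
        """Classical PCA via the covariance (or correlation) matrix C = X'X/(n-1)."""
        X = np.asarray(X, dtype=float)
        Xc = X - X.mean(axis=0)
        if standardize:
            Xc = Xc / Xc.std(axis=0, ddof=1)     # correlation PCA
        C = (Xc.T @ Xc) / (len(X) - 1)
        eigvals, V = np.linalg.eigh(C)           # ascending order
        order = np.argsort(eigvals)[::-1]
        eigvals, V = eigvals[order][:k], V[:, order][:, :k]
        T = Xc @ V                               # scores
        Xc_hat = T @ V.T                         # rank-k reconstruction (still centered/scaled)
        return T, V, eigvals, Xc_hat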
SVD view (why it’s numerically stable)
Singular Value Decomposition factors the centered/standardized matrix
as X = U·Σ·Vᵀ, with orthonormal U and V and the singular values
σ_1 ≥ σ_2 ≥ … on the diagonal of Σ.
The eigenvalues of C relate to the singular values via λ_j = σ_j² / (n−1).
Scores are equivalently T = U·Σ, so PCA can be computed without ever
forming C explicitly.
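The same result via SVD, which is how most numerical libraries compute PCA. A minimal sketch, assuming NumPy and an already centered (and, for correlation PCA, standardized) matrix:
    import numpy as np

    def pca_svd(Xc):
        """PCA of a centered/standardized matrix via the SVD X = U S V^T."""
        U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
        eigvals = s**2 / (len(Xc) - 1)   # lambda_j = sigma_j^2 / (n - 1)
        T = U * s                        # scores: T = U.Sigma = Xc @ Vt.T
        return T, Vt.T, eigvals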
Scores, Loadings, Communalities & Contributions
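Scores place each observation in PC space (T = X·V); loadings are the eigenvector weights that define each PC; a variable's communality is the share of its variance captured by the retained PCs; a contribution is the share of a component attributable to one variable (or observation). A small NumPy sketch of these quantities (illustrative conventions; the v·√λ "correlation loadings" form assumes correlation PCA):
    import numpy as np

    def communality_and_contributions(V, eigvals):
        """V: (p, k) eigenvector loadings; eigvals: (k,) retained eigenvalues."""
        corr_loadings = V * np.sqrt(eigvals)           # variable-PC correlations (correlation PCA)
        communality = (corr_loadings**2).sum(axis=1)   # variance of each variable captured by the k PCs
        contributions = 100 * V**2                     # % contribution of each variable to each PC
        return corr_loadings, communality, contributions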
Choosing the number of components (k)
-
Scree elbow: Look for the bend in the scree plot
where additional PCs add little variance.
-
Cumulative EVR: Keep PCs until 80–95% of variance
is explained (context‑dependent: scientific data often uses ≥90%).
-
Kaiser criterion (only for correlation PCA): keep
PCs with
λ > 1 (they explain more variance than a
single standardized variable). Use with caution.
-
Broken‑stick: Compare each eigenvalue's share of variance to the expected
“broken‑stick” proportion; keep PCs exceeding the expectation (compared
with the other rules in the sketch after this list).
-
Parallel analysis: Compare to eigenvalues from
random data with the same shape; keep PCs exceeding random. (Not
implemented here; run offline for rigor.)
-
Cross‑validation: For predictive tasks (PCR), pick
k that maximizes validation performance.
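The simpler rules above can be compared side by side. A sketch assuming you have the full eigenvalue spectrum (the function name and the 90% target are illustrative; the Kaiser count is only meaningful for correlation PCA):
    import numpy as np

    def choose_k(eigvals, target=0.90):
        """Compare cumulative-EVR, Kaiser, and broken-stick suggestions for k."""
        eigvals = np.sort(np.asarray(eigvals, dtype=float))[::-1]
        p = len(eigvals)
        evr = eigvals / eigvals.sum()
        k_cum = int(np.searchsorted(np.cumsum(evr), target) + 1)   # smallest k reaching the target
        k_kaiser = int((eigvals > 1).sum())                        # correlation PCA only
        # Broken-stick expectation for component j: (1/p) * sum_{i=j..p} 1/i
        stick = np.array([np.sum(1.0 / np.arange(j, p + 1)) / p for j in range(1, p + 1)])
        exceeds = evr > stick
        k_stick = p if exceeds.all() else int(np.argmin(exceeds))  # length of the leading run
        return {"cumulative_evr": k_cum, "kaiser": k_kaiser, "broken_stick": k_stick}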
Interpreting components like a pro
-
Read loadings first. Identify which variables drive
PC1, PC2, … High positive vs. negative weights can indicate
meaningful trade‑offs (e.g., price ↑ while efficiency ↓).
-
Use the scores plot to detect groups/outliers.
Clusters in PC1–PC2 space often correspond to meaningful segments;
extreme scores flag outliers or novel cases.
-
Check the correlation circle. Variables close
together are positively correlated; opposite sides indicate negative
correlation; near‑orthogonal ≈ weakly related.
-
Relate back to the domain. Components are
combinations of variables—name them by what they measure
(e.g., “overall size”, “sweetness vs. acidity”, “market risk”).
-
Remember non‑uniqueness. If λ’s are tied or nearly
equal, the corresponding PCs can rotate within their subspace. Focus
on the subspace, not exact axes.
High‑dimensional case (p ≫ n)
When variables outnumber observations, at most
n−1 eigenvalues are non‑zero. PCA still works and is
often essential. Computation is faster via the SVD of
X or by eigendecomposing the n×n matrix XXᵀ and mapping its
eigenvectors back to feature space (sketched below). Interpretation is
the same.
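For concreteness, a sketch of that Gram‑matrix route (assuming NumPy; pca_gram is a made‑up name, and k must not exceed n−1):
    import numpy as np

    def pca_gram(Xc, k):
        """PCA for p >> n via the n x n Gram matrix of a centered X."""
        n = len(Xc)
        G = Xc @ Xc.T                          # n x n instead of p x p
        w, U = np.linalg.eigh(G)
        order = np.argsort(w)[::-1][:k]        # keep the top k (at most n-1 are non-zero)
        w, U = w[order], U[:, order]
        V = (Xc.T @ U) / np.sqrt(w)            # unit-norm loadings in feature space
        T = Xc @ V                             # scores (equivalently U * sqrt(w))
        return T, V, w / (n - 1)               # eigenvalues of the covariance matrix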
Outliers & robustness
-
Diagnostics: Inspect score distances (e.g.,
Mahalanobis) to find leverage points that can twist PCs (see the sketch
after this list).
-
Mitigations: Winsorize/extreme‑clip, transform
(log), or use robust PCA variants (e.g., M‑estimators, S‑estimators,
median‑based methods). This tool performs classical PCA; pair with
robust prep if needed.
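One way to compute the score distances mentioned above: because retained scores are uncorrelated with variances λ_j, the Mahalanobis distance in PC space reduces to a weighted sum of squares. The sketch assumes NumPy and SciPy are available and uses an illustrative 97.5% chi‑square cutoff.
    import numpy as np
    from scipy.stats import chi2

    def score_distances(T, eigvals, quantile=0.975):
        """Flag potential outliers from their squared distance in retained PC space."""
        d2 = ((T**2) / eigvals).sum(axis=1)        # squared Mahalanobis distance (scores are uncorrelated)
        cutoff = chi2.ppf(quantile, df=T.shape[1]) # rough cutoff, assuming approximate normality
        return d2, d2 > cutoff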
Advanced PCA topics
-
Whitening: Map to uncorrelated, unit‑variance
features via T_white = T·Λ^(−1/2), i.e., divide each score column by
√λ_j. Useful for some ML pipelines; beware of amplifying noise for tiny
eigenvalues. (A one‑line sketch follows this list.)
-
PCR (Principal Components Regression): Regress a
target on PC scores to mitigate multicollinearity.
-
Sparse PCA: Encourages loadings with many zeros for
interpretability.
-
Kernel PCA: Applies PCA in a nonlinear feature
space via kernels (RBF, polynomial) for curved manifolds.
-
t‑SNE/UMAP vs. PCA: Nonlinear methods are great for
visualization of clusters but are not linear, global, or easily
invertible; start with PCA.
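The whitening map from the first bullet is essentially one line of code; a minimal sketch (eps is an illustrative guard against tiny eigenvalues):
    import numpy as np

    def whiten(T, eigvals, eps=1e-12):
        """Rescale scores to unit variance; the columns are already uncorrelated."""
        Tw = T / np.sqrt(eigvals + eps)
        # np.cov(Tw, rowvar=False) should be close to the identity matrix
        return Tw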
Domain‑specific tips
-
Finance: Yield‑curve data typically produce PCs
interpretable as level, slope, and curvature. Use correlation PCA if
mixing units; otherwise covariance PCA can highlight dominant risk
factors.
-
Biology/Genomics: Center and standardize; PC1/PC2
often capture batch effects or population structure. Always check
for confounders.
-
Manufacturing/QC: PCA detects process drift and
latent failure modes; monitor scores over time.
-
Imaging/Signals: PCA ≈ Karhunen–Loève
transform—great for denoising/compression; mind spatial structure
when interpreting loadings.
Common pitfalls
-
Mixing standardized and unstandardized variables when using
covariance PCA.
-
Over‑interpreting signs (they may flip between runs or tools).
-
Assuming PCs imply causality—they summarize variance, not
mechanisms.
-
Keeping too many PCs (overfitting) or too few (information loss).
Use scree + EVR + domain sense.
-
Projecting new data without applying the
same centering/standardization as training.
FAQ (extended)
How do I apply these loadings to new data? Store your
training means (and stds for correlation PCA). For a new row
x, compute x′ = (x − mean)/std as
appropriate, then scores = x′ · V_k. This tool reports
means and stds so you can replicate preprocessing.
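A sketch of that recipe (the helper name project_new and its argument names are illustrative; pass train_std only for correlation PCA):
    import numpy as np

    def project_new(x_new, train_mean, V_k, train_std=None):
        """Project new rows into the training PC space using stored statistics."""
        x_new = np.atleast_2d(np.asarray(x_new, dtype=float))
        x_prime = x_new - train_mean             # same centering as training
        if train_std is not None:                # correlation PCA only
            x_prime = x_prime / train_std
        return x_prime @ V_k                     # scores = x' . V_k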
Why don’t my results match another package exactly?
PCA is unique up to sign flips; small differences arise from numerical
methods, missing‑value handling, and whether covariance or correlation
was used.
Can I rotate PCs (e.g., varimax)? Rotation is a
factor analysis concept. PCA already yields orthogonal
components that maximize variance; rotated solutions optimize
different criteria.
Does scaling change scores? Yes—correlation PCA gives
each variable equal variance, shifting both loadings and scores;
covariance PCA lets high‑variance variables dominate.
Glossary
-
Scores (T): coordinates of observations in PC space (T = X · V).
-
Loadings (V): weights defining PCs in terms of
original variables (eigenvectors of C).
- Eigenvalue (λ): variance captured by a PC.
-
Explained Variance Ratio (EVR): λ_j divided by the sum of all
eigenvalues; the share of total variance captured by PC j.
-
Communality: how much of a variable’s variance is
captured by the retained PCs.
-
Whitening: rescaling scores to unit variance and
zero correlation.