Types of Data
How data is classified determines which statistical methods are appropriate to use.
Two top-level categories: Categorical and Numeric. Categorical splits into Ordinal and Nominal. Numeric splits into Continuous and Discrete.
Categorical Data
Data divided into groups or categories. Values are labels or names, not measurements. Arithmetic (+, ×) on categories is meaningless.
Species (cat, dog), colors (red, blue), garment sizes (S, M, L), ice cream flavors.
Ordinal Categorical Data
Categorical data with a meaningful natural ordering. You can say one value comes before or after another, but differences between levels aren't necessarily equal and arithmetic still doesn't apply.
{S, M, L, XL}, {January, February, …, December}, {Strongly Disagree → Strongly Agree}, {1st, 2nd, 3rd place}.
Nominal Categorical Data
Categorical data with no meaningful ordering. Labels are purely names — you cannot say one is greater than another in any inherent sense.
{Cat, Dog, Fish}, {Vanilla, Chocolate, Strawberry}, {Truck, Car}, {Red, Blue, Green}.
Common trap: Just because you CAN alphabetize categories doesn't make them ordinal. "Chocolate" before "Vanilla" alphabetically is not a meaningful ordering. Ordinal requires an ordering inherent in the concept (e.g., size, rank, time).
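The alphabetization trap is easy to demonstrate in code. A minimal sketch in plain Python (the size and flavor labels are taken from the examples above):

```python
# Any labels CAN be sorted alphabetically; that does not make them ordinal
flavors = ["Vanilla", "Chocolate", "Strawberry"]
print(sorted(flavors))  # ['Chocolate', 'Strawberry', 'Vanilla'], not meaningful

# Ordinal data carries an inherent order, which must be supplied explicitly
size_rank = {"S": 0, "M": 1, "L": 2, "XL": 3}
sizes = ["L", "S", "XL", "M"]
print(sorted(sizes, key=size_rank.get))  # ['S', 'M', 'L', 'XL'], meaningful
```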
Numeric Data
Continuous Numeric Data
Can take any real value within a range — infinitely many possible values between any two points. Measured, not counted. Described by a Probability Density Function (PDF).
Height (1.7312 m), temperature (−3.5°C), weight, time elapsed, electrical voltage.
Discrete Numeric Data
Can only take countable, distinct values — typically non-negative integers. You cannot have a fractional value. Described by a Probability Mass Function (PMF).
Number of customers (0, 1, 2, …), die roll result (1–6), emails per day, defective items in a batch.
Quick Reference Table
| Type | Ordered? | Arithmetic? | Examples |
| --- | --- | --- | --- |
| Ordinal Categorical | Yes — meaningful | No | S/M/L, months, survey ratings |
| Nominal Categorical | No | No | Colors, flavors, pet species |
| Discrete Numeric | Yes | Yes | Counts, die rolls, number of heads |
| Continuous Numeric | Yes | Yes | Height, temperature, time |
Random Experiments
The foundation of probability — understanding how uncertainty arises and is formalized.
Random Experiment
A process whose outcome is uncertain before it is performed, with several possible results. Must be repeatable (in principle) under the same conditions.
Rolling a die, drawing a card, flipping a coin, measuring tomorrow's temperature.
Trial
A single execution of the random experiment. Each trial produces exactly one outcome.
One flip of a coin = one trial. Each card draw from a deck = one trial.
Outcome
The result of a single trial. It is one element of the sample space.
Rolling a 3 on a die is an outcome. "Heads" on a coin flip is an outcome.
Sample Space (S or Ω)
The complete set of all possible outcomes of a random experiment. Every conceivable result must be listed.
Coin flip: S = {Heads, Tails}. Six-sided die: S = {1, 2, 3, 4, 5, 6}. Card draw: S = all 52 cards.
NOT a random experiment: Computing the area of a 1 m × 1 m square (result = 1 m² always — deterministic). For a random experiment, the outcome must be genuinely uncertain before the trial.
A computed mean is NOT an outcome: A class's average satisfaction score is a statistic aggregated from many measurements — not the result of a single trial.
Random Variables
A mathematical bridge that maps experimental outcomes to numbers on the real line.
Random Variable (X)
A function that maps outcomes to real numbers. For numeric data the mapping is direct (die shows 3 → X=3). For categorical data we assign numeric codes. We use capital letters (X, Y, Z) to denote random variables.
Discrete vs Continuous
Discrete Random Variable
Takes a countable set of distinct values (usually integers). There are gaps between possible values. Described by a PMF.
Number of heads in 10 flips (0–10), die roll (1–6), daily customer count (0, 1, 2, …).
Continuous Random Variable
Can take any real value in an interval — infinitely many possibilities. The probability of any exact value is 0; probabilities are computed over intervals. Described by a PDF.
Height, time between arrivals, temperature, weight, electrical resistance.
Quick test: Can the variable take fractional values meaningfully? Yes → continuous. Is it always a whole number by nature? → discrete. "3.7 customers" is impossible → discrete. "3.7 minutes" is perfectly valid → continuous.
Probability Functions
PMF — Probability Mass Function
For discrete variables. \(P[X=i]\) = probability X equals exactly i. Properties: P[X=i] ≥ 0 for all i, and \(\sum_{i \in S} P[X=i] = 1\).
PDF — Probability Density Function
For continuous variables. f(x) is probability density, not probability. Probability is the area under the curve: \(P[a \leq X \leq b] = \int_a^b f(x)\,dx\). Total area must equal 1.
CDF — Cumulative Distribution Function
\(F(x) = P[X \leq x]\). The probability that the variable is at most x. Always non-decreasing from 0 to 1. The CDF of the standard normal is written Φ(x).
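These three functions can be explored with scipy.stats (assumed available); the fair die and the standard normal below match examples used elsewhere in these notes:

```python
from scipy import stats

# PMF of a fair six-sided die (discrete uniform on 1..6)
die = stats.randint(1, 7)          # upper bound is exclusive
print(die.pmf(3))                  # 1/6 ~ 0.1667
print(sum(die.pmf(k) for k in range(1, 7)))  # PMF sums to 1 over S

# PDF and CDF of the standard normal N(0, 1)
z = stats.norm(0, 1)
print(z.pdf(0))                    # density at 0 ~ 0.3989 (not a probability)
print(z.cdf(0))                    # Phi(0) = 0.5
```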
Events
Any subset of the sample space we care about assigning a probability to.
Event (E)
Any subset of the sample space. An event "occurs" when the outcome of a trial falls within that subset.
Die roll S = {1,2,3,4,5,6}. Event "roll even" = {2,4,6}. Event "roll a 3" = {3}. Event "roll ≥ 4" = {4,5,6}.
Special events:
• Empty set ∅ — the impossible event, P[∅] = 0.
• The full sample space S — the certain event, P[S] = 1.
• Events can be infinite: "more than 100 customers" = {101, 102, 103, …} is a valid event.
Event ≠ Outcome. An outcome is a single element of S. An event is a subset (could contain one element, many, or none). "Rolling a 3" as an outcome is just 3; as an event it is {3}.
Axioms of Probability
Three Basic Axioms (Kolmogorov)
1. P[E] ≥ 0 for every event E.
2. P[S] = 1 (something always happens).
3. For mutually exclusive events E₁, E₂ (no overlap): P[E₁ ∪ E₂] = P[E₁] + P[E₂].
Distributions
How probability is spread across the possible values of a random variable.
Distribution
An assignment of probabilities to all possible outcomes. Requirements: (1) each probability ≥ 0, (2) all probabilities sum (or integrate) to 1.
Uniform Distribution
Discrete Uniform — all outcomes equally likely
For n outcomes: P[X = i] = 1/n for every outcome i.
Fair six-sided die: P[X=1] = … = P[X=6] = 1/6.
Continuous Uniform[a, b]
Constant density over [a, b], zero outside. PDF: \(f(x) = \tfrac{1}{b-a}\) for \(a \leq x \leq b\).
\[\mu = \frac{a+b}{2} \qquad \sigma^2 = \frac{(b-a)^2}{12}\]
Uniform[20,40]: mean = (20+40)/2 = 30, variance = (20)²/12 = 400/12 ≈ 33.33.
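The Uniform[20, 40] numbers can be sanity-checked with a quick Monte Carlo draw (NumPy assumed available; the seed is arbitrary):

```python
import numpy as np

a, b = 20, 40
mu = (a + b) / 2              # 30.0
var = (b - a) ** 2 / 12       # ~ 33.33

rng = np.random.default_rng(0)
x = rng.uniform(a, b, size=100_000)
# Sample mean and variance should land close to the theoretical values
print(mu, x.mean())
print(var, x.var())
```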
Gaussian (Normal) Distribution N(μ, σ)
Gaussian / Normal Distribution
The classic "bell curve." Two parameters: mean μ (center/location) and standard deviation σ (spread). Symmetric around μ. Total area under curve = 1.
\[f(x \mid \mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\]
Key Properties of the Gaussian
• Increasing σ → curve becomes wider and flatter (same area = 1).
• Increasing μ → curve shifts right, shape unchanged.
• Empirical rule: ~68% of data within 1σ, ~95% within 2σ, ~99.7% within 3σ.
• Notation: in these notes N(μ, σ) means the second parameter is the std dev σ, so variance = σ². (Many texts instead write N(μ, σ²), passing the variance as the second parameter.)
As sample size grows (100 → 1,000 → 10,000), a relative-frequency histogram converges to the true PDF — this is the Law of Large Numbers in action.
Exponential Distribution (λ)
Exponential Distribution
Models time between events in a Poisson process (arrivals, radioactive decay). Rate parameter λ. PDF: \(f(x) = \lambda e^{-\lambda x}\) for x ≥ 0.
\[\mu = \frac{1}{\lambda} \qquad \sigma^2 = \frac{1}{\lambda^2}\]
f(x) = 0.1e^{−0.1x} → λ = 0.1, mean = 1/0.1 = 10.
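One trap worth noting: NumPy parameterizes the exponential by the scale 1/λ, not the rate λ. A quick check of the λ = 0.1 example (NumPy assumed; seed arbitrary):

```python
import numpy as np

lam = 0.1
rng = np.random.default_rng(1)
x = rng.exponential(scale=1 / lam, size=100_000)   # scale = 1/lambda = 10

print(1 / lam, x.mean())        # theoretical mean 10 vs sample mean
print(1 / lam**2, x.var())      # theoretical variance 100 vs sample variance
```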
Triangular Distribution Tri(a, b, c)
Triangular Distribution
Defined on [a, b] with peak (mode) at c. PDF rises linearly from a to c, falls linearly from c to b.
\[\mu = \frac{a+b+c}{3} \qquad \sigma^2 = \frac{a^2+b^2+c^2-ab-ac-bc}{18}\]
Absolute vs Relative Frequency
Absolute Frequency
Raw count of samples in a bin.
30 of 200 samples fall in bin [10,15] → absolute frequency = 30.
Relative Frequency
Count ÷ total samples = proportion. As sample size → ∞, a bin's relative frequency approaches the probability of landing in that bin; dividing by the bin width approximates the probability density.
30 / 200 = 0.15 = 15% relative frequency for that bin.
Descriptive Statistics
Measures that summarize the center and spread of a dataset.
Range
\[\text{Range} = \max(S) - \min(S)\]
S=[−1,3,−4,2,6]: Range = 6 − (−4) = 10.
Mean (Arithmetic Average)
Mean x̄
Sum of all values divided by n. Sensitive to outliers — a single extreme value can pull it far from where most data sits.
\[\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i\]
S=[−1,3,−4,2,6]: Sum = 6, Mean = 6/5 = 1.2.
Median
Median
Middle value after sorting. Robust to outliers — extreme values in the tails do not affect it.
• n odd: median = element at index ⌊n/2⌋ (0-indexed).
• n even: median = average of elements at indices n/2−1 and n/2.
Sorted [−4,−1,2,3,6] (n=5 odd) → Median = 2. Sorted [−1,0,7,7,9,10] (n=6 even) → Median = (7+7)/2 = 7.
Percentile — Linear Interpolation Method
p-th Percentile (NumPy "linear" default)
Steps:
1. Sort the data into S′.
2. Fractional index (0-indexed): \(i = \tfrac{p}{100} \times (n-1)\)
3. If i is an integer → P_p = S′[i].
4. If i is fractional: \(P_p = S'[\lfloor i\rfloor] + (i - \lfloor i\rfloor)\cdot(S'[\lceil i\rceil] - S'[\lfloor i\rfloor])\)
S′=[−4,−1,2,3,6], n=5. 20th %ile: i = 0.20×4 = 0.8. Between S′[0]=−4 and S′[1]=−1: −4 + 0.8×(−1−(−4)) = −4 + 2.4 = −1.6.
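NumPy's np.percentile uses this linear interpolation method by default, so the worked example can be checked directly:

```python
import numpy as np

s = np.array([-1, 3, -4, 2, 6])
print(np.percentile(s, 20))   # -1.6, matching the hand calculation
print(np.percentile(s, 50))   # 2.0: the 50th percentile is the median
```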
Trimmed Mean
Trimmed Mean
Remove a fixed proportion from each tail of the sorted data, then compute the mean of what remains. Robust to extreme outliers in the tails. scipy.stats.trim_mean(data, p) removes proportion p from each end.
S = [−1,0,3,6,7,7,8,9,10,1000], 10% trimmed: remove 1 from each end → [0,3,6,7,7,8,9,10], mean = 50/8 = 6.25.
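Checking the same dataset with SciPy (scipy and numpy assumed available):

```python
import numpy as np
from scipy import stats

s = [-1, 0, 3, 6, 7, 7, 8, 9, 10, 1000]
print(np.mean(s))               # 104.9: dragged up by the 1000
print(stats.trim_mean(s, 0.1))  # 6.25: one value trimmed from each end
print(np.median(s))             # 7.0: untouched by the outlier
```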
Robustness ranking (most to least): Median ≈ Trimmed Mean > Mean. Extreme percentiles (90th, 95th) are unreliable near outliers.
Measures of Dispersion
How spread out the data is — and how robust each measure is to outliers.
Why not use the sum of deviations? It always equals zero:
\(\frac{1}{n}\sum_{i=1}^n (X_i - \bar{X}) = \bar{X} - \bar{X} = 0\)
Positive and negative deviations cancel perfectly, so any useful dispersion measure must first make every deviation non-negative, either by taking absolute values or by squaring.
Mean Absolute Deviation
\[\text{Mean Abs Dev} = \frac{1}{n}\sum_{i=1}^n |X_i - \bar{X}|\]
Average of absolute deviations from the mean. Uses absolute values so nothing cancels. Less sensitive to large errors than variance, but not as robust as Median Absolute Deviation.
Median Absolute Deviation (MAD)
\[\text{MAD} = \text{median}(|x_i - \text{median}(x)|)\]
The median of the absolute deviations from the median. Very robust — uses the median twice (once to center, once to summarize). Outliers have minimal impact.
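A minimal implementation of the formula (NumPy assumed; the dataset with the 1000 outlier reappears later in these notes):

```python
import numpy as np

def mad(x):
    """Median of absolute deviations from the median."""
    x = np.asarray(x)
    return np.median(np.abs(x - np.median(x)))

s = [-1, 0, 3, 6, 7, 7, 8, 9, 10, 1000]
print(mad(s))             # 2.5: barely notices the 1000
print(np.std(s, ddof=1))  # ~ 314: blown up by a single outlier
```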
Variance & Standard Deviation
Population Variance σ² (use when you have ALL the data)
Divide by n (the full population size).
\[\sigma^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \mu)^2\]
Sample Variance s² — Bessel's Correction (use for a sample)
Divide by (n−1). Corrects the underestimation bias introduced by using x̄ (estimated from the same data) instead of the true μ.
\[s^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2\]
Units of variance: If data is in °C, variance is in °C². This is why variance is hard to interpret — the unit is squared. Standard deviation (√variance) restores the original unit.
Interquartile Range (IQR)
\[\text{IQR} = Q_3 - Q_1 = P_{75} - P_{25}\]
Range of the middle 50% of data. Most robust of all dispersion measures — extreme values in the tails have zero effect because they aren't used in the computation.
Robustness Comparison
| Measure | Robustness to Outliers | Why |
| --- | --- | --- |
| IQR | Most robust | Uses only middle 50%; ignores tails |
| Median Abs Dev (MAD) | Very robust | Uses median twice |
| Mean Abs Dev | Moderate | Uses the mean, which outliers pull |
| Sample Std Dev (s) | Sensitive | Squared deviations amplify outliers |
| Variance (s²) | Most sensitive | Squaring makes one outlier dominant |
Box-and-Whisker Plot (Tukey Convention)
Box Plot
• Box spans Q1 to Q3 (the IQR).
• Line inside the box = median (Q2).
• Whiskers extend to the last data point within 1.5×IQR of Q1 (lower) and Q3 (upper).
• Points beyond whiskers are plotted as individual dots — these are flagged as outliers.
Outliers & Skew
How extreme values distort statistics and reveal the shape of a distribution.
Skewness
Right Skew (Positive Skew)
Long tail on the right. Large outliers pull the mean right of the median. Rule of thumb: mean > median → right skew.
S=[3,8,6,9,−1,10,1000,7,7,0]: mean=104.9, median=7. Mean ≫ median → strongly right skewed.
Left Skew (Negative Skew)
Long tail on the left. Small outliers pull the mean below the median. Rule of thumb: mean < median → left skew.
Symmetric (No Skew)
Mean ≈ median. The Gaussian distribution is perfectly symmetric.
Effect of One Outlier on Each Statistic
| Statistic | Outlier Effect | Robust? |
| --- | --- | --- |
| Mean | Strongly pulled toward outlier | No |
| Median | Little to no effect | Yes |
| Trimmed Mean (10%) | Removed if outlier is in the tail | Yes |
| Variance / Std Dev | Heavily inflated (squared deviations) | No |
| IQR | No effect (outlier is in the tail) | Yes |
| Extreme %iles (90th+) | Heavily distorted by interpolation | No |
| Median percentile (50th) | Same as median — robust | Yes |
Extreme percentiles near outliers are unreliable. If the 90th percentile calculation interpolates between a normal data point and an outlier, the result is wildly inflated. Be cautious with percentiles above ~85th when outliers may exist.
Worked example — S = [3,8,6,9,−1,10,1000,7,7,0], sorted: [−1,0,3,6,7,7,8,9,10,1000]
Mean = 1049/10 = 104.9 | Median = (7+7)/2 = 7
80th %ile: i = 0.8×9 = 7.2 → 9 + 0.2×(10−9) = 9.2
90th %ile: i = 0.9×9 = 8.1 → 10 + 0.1×(1000−10) = 109 ← distorted by outlier!
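The worked example verifies directly in NumPy (whose default percentile method is the linear interpolation described earlier):

```python
import numpy as np

s = np.array([3, 8, 6, 9, -1, 10, 1000, 7, 7, 0])
print(np.mean(s), np.median(s))   # 104.9 vs 7.0, so mean >> median: right skew
print(np.percentile(s, 80))       # 9.2: interpolates between ordinary points
print(np.percentile(s, 90))       # 109.0: interpolates into the outlier
```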
Population vs Sample Statistics
The critical difference in formulas — and why Bessel's correction exists.
Population
The complete set of all subjects of interest. You know the true parameters (μ, σ²). Use N (the full population size) in the denominator.
Sample
A subset drawn to make inferences about the whole. Sample statistics estimate population parameters. Use (n−1) in variance (Bessel's correction).
Formulas Side by Side
| Statistic | Population (÷ N) | Sample (÷ n−1) |
| --- | --- | --- |
| Mean | \(\mu = \frac{1}{N}\sum X_i\) | \(\bar{x} = \frac{1}{n}\sum X_i\) |
| Variance | \(\sigma^2 = \frac{\sum(X_i-\mu)^2}{N}\) | \(s^2 = \frac{\sum(X_i-\bar{x})^2}{n-1}\) |
| Std Dev | \(\sigma = \sqrt{\sigma^2}\) | \(s = \sqrt{s^2}\) |
Bessel's Correction — Why n−1?
When computing sample variance we use x̄ (estimated from the same data) instead of the true μ. The sample mean always sits at the center of the sample, making deviations from x̄ systematically smaller than deviations from the true μ. Dividing by (n−1) instead of n corrects this downward bias, making s² an unbiased estimator of σ².
Worked — S = {4, 7, 9, 12, 15, −9}, mean = 38/6 ≈ 6.33
Sum of squared deviations ≈ 355.33
Population variance (÷6) ≈ 59.22 | Sample variance (÷5) ≈ 71.07
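NumPy's ddof ("delta degrees of freedom") parameter switches between the two conventions; ddof=1 applies Bessel's correction:

```python
import numpy as np

s = np.array([4, 7, 9, 12, 15, -9])
print(np.var(s))          # / n   -> population variance ~ 59.22
print(np.var(s, ddof=1))  # / n-1 -> sample variance     ~ 71.07
print(np.std(s, ddof=1))  # sample standard deviation    ~ 8.43
```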
Z-scores, Φ (Phi) & Erf
Computing probabilities under any Gaussian by standardizing to N(0,1).
Z-Score
Z-Score (Standardized Score)
Converts a raw value x from X~N(μ,σ) to a standard normal score: how many standard deviations x is above or below the mean.
\[Z = \frac{x - \mu}{\sigma}\]
Test: μ=500, σ=100, x=620 → Z = (620−500)/100 = 1.2. John is 1.2 standard deviations above the mean.
Φ (Phi) — CDF of N(0,1)
Φ(z) — Phi Function
\(\Phi(z) = P[Z \leq z]\) where Z~N(0,1). It's the cumulative area under the standard normal bell curve from −∞ to z. To use for any Gaussian: compute Z first, then look up Φ(Z).
Φ(0) = 0.500, Φ(1) ≈ 0.841, Φ(1.2) ≈ 0.885, Φ(2) ≈ 0.977.
Error Function (erf)
\[\text{erf}(x) = \frac{2}{\sqrt{\pi}}\int_0^x e^{-t^2}\,dt\]
\[\Phi(z) = \frac{1}{2}\!\left[1 + \text{erf}\!\left(\frac{z}{\sqrt{2}}\right)\right]\]
Computing Probabilities
P(X < x) — One-tailed left
Compute Z = (x−μ)/σ, then P = Φ(Z).
P(score < 620): Z=1.2, Φ(1.2) ≈ 88.5%.
P(X > x) — One-tailed right
P(X > x) = 1 − Φ(Z).
P(a < X < b) — Interval probability
P(a < X < b) = Φ(Z_b) − Φ(Z_a), where Z_a=(a−μ)/σ and Z_b=(b−μ)/σ.
P(|X−μ| > k·σ) — Two-tailed outside ±kσ
P = 1 − erf(k/√2).
Steel rods: μ=100, σ=2. P(X<96 or X>104): k=2. P = 1−erf(√2) ≈ 1−0.9545 = 4.55%.
Key Φ Values to Memorize
| Z | Φ(Z) ≈ | Meaning |
| --- | --- | --- |
| 0 | 0.500 | 50% below mean |
| 1.0 | 0.841 | ±1σ covers 68.3% of data |
| 1.2 | 0.885 | John's exam example |
| 1.645 | 0.950 | 90th percentile of N(0,1) |
| 1.96 | 0.975 | 95% CI boundary (two-sided) |
| 2.0 | 0.977 | ±2σ covers 95.4% of data |
| 3.0 | 0.9987 | ±3σ covers 99.7% of data |
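All of these values can be reproduced from the erf identity above, using only the standard library:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF via the error function identity."""
    return 0.5 * (1 + erf(z / sqrt(2)))

print(round(phi(0), 3))     # 0.5
print(round(phi(1.2), 3))   # 0.885: John's exam example
print(round(phi(1.96), 3))  # 0.975: the 95% CI boundary

# Two-tailed steel-rod example: P(|X - mu| > 2*sigma)
print(round(1 - erf(2 / sqrt(2)), 4))   # 0.0455
```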
Covariance & Pearson Correlation
Measuring how two variables move together — direction and strength.
Covariance
\[\text{Cov}(X,Y) = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})\]
Interpreting the Sign of Cov(X,Y)
• Cov > 0: When X is above its mean, Y tends to be above its mean (move together).
• Cov < 0: When X is high, Y tends to be low (move opposite).
• Cov ≈ 0: No consistent linear relationship.
Covariance is scale-dependent — its magnitude depends on the units of X and Y. A covariance of 425 between height(cm) and weight(kg) is not comparable to 425 between two other variables. That is why we normalize to get Pearson r.
Pearson Correlation Coefficient (r)
\[r = \frac{\text{Cov}(X,Y)}{\sigma_X \cdot \sigma_Y}\]
Normalizes covariance by the product of standard deviations. Always in [−1, +1]. Scale-independent (unit-free).
| r value | Interpretation |
| --- | --- |
| +1 | Perfect positive linear relationship |
| 0.7 to 1 | Strong positive correlation |
| 0 to 0.3 | Weak positive correlation |
| ≈ 0 | No linear relationship (may have nonlinear!) |
| −1 to 0 | Negative correlation |
| −1 | Perfect negative linear relationship |
r = 0 does NOT mean no relationship! Y = X² gives r ≈ 0 (symmetric parabola — positive and negative halves cancel). Pearson r only detects linear relationships. Always plot the data with a scatter plot.
Y = constant → r is undefined. σ_Y = 0 means division by zero in the formula. A constant has no linear relationship with anything.
Worked — X=[1,3,5,7], Y=[8,6,5,3], μ_X=4, μ_Y=5.5
Deviations X: −3,−1,+1,+3 | Deviations Y: +2.5,+0.5,−0.5,−2.5
Products: −7.5, −0.5, −0.5, −7.5 → sum = −16
Cov = −16/3 ≈ −5.33 | σ_X=√(20/3), σ_Y=√(13/3)
r = −16/√260 ≈ −0.992 (strong negative).
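NumPy confirms the hand calculation (np.cov and np.corrcoef return 2×2 matrices; the off-diagonal entry is the pairwise value). The parabola example shows why r = 0 must not be read as "no relationship":

```python
import numpy as np

x = np.array([1, 3, 5, 7])
y = np.array([8, 6, 5, 3])
print(np.cov(x, y)[0, 1])       # ~ -5.333 (sample covariance, / n-1)
print(np.corrcoef(x, y)[0, 1])  # ~ -0.992

# Y = X^2 on a symmetric range: perfectly dependent, yet r = 0
x2 = np.array([-2, -1, 0, 1, 2])
print(np.corrcoef(x2, x2**2)[0, 1])   # 0.0
```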
Bias & Error Types
Understanding systematic and random sources of error in data collection.
Random Error
Unpredictable, unsystematic variation in measurements. Averages out over many trials. Not a flaw — every measurement has some random component.
A scale gives slightly different readings each time you weigh the same 100 g object due to vibrations.
Systematic Error (Bias)
Consistent, repeatable error in the same direction. Does not average out. Indicates a flaw in the study design, instrument, or data collection process.
A miscalibrated scale that always reads 5% too high.
Types of Bias
Measurement Bias
Systematic error in how data is measured — the instrument or procedure introduces consistent errors in one direction.
A laser rangefinder miscalibrated by ×1.2 → all reported distances are 20% too large.
Observer Bias
The researcher's expectations or prior beliefs influence how they collect, record, or interpret data. Also called experimenter bias or confirmation bias.
A police officer who expects a high-crime neighborhood interprets ambiguous behavior as suspicious.
Selection Bias
The sample is not representative of the target population because some groups are systematically over- or under-represented in the selection process.
A pollster calls only landlines during daytime → misses mobile-only users and working people → biased sample.
Non-Response Bias
Participants who do not respond differ systematically from those who do, making the final sample non-representative even if selection was random.
University survey with only 30% response rate — students with strong opinions (positive or negative) are more likely to respond.
Self-Reporting Bias
Participants inaccurately report their own behaviors or attitudes due to memory errors, misunderstanding, or discomfort with truthful answers.
Participants in a weight-loss study self-report monthly weight loss — imperfect memory leads to inaccuracies.
Social Desirability Bias
Participants respond in ways they believe are socially acceptable or will reflect well on them, rather than truthfully.
People over-report exercise frequency and charitable donations; under-report alcohol use and unhealthy eating.
Multiple biases can co-occur. Self-reporting bias and social desirability bias often appear together. Selection bias and non-response bias can affect the same study simultaneously.
Quick Reference
| Bias Type | Who is biased? | Key Signal |
| --- | --- | --- |
| Measurement | The instrument | Consistent systematic offset in all readings |
| Observer | The researcher | Prior expectations alter observations |
| Selection | The sampling process | Some population groups excluded |
| Non-response | The respondents | Low response rate; non-responders differ |
| Self-reporting | The participant | Inaccurate recall or reporting |
| Social desirability | The participant | Reporting what sounds good, not true |
Sampling Distributions & CLT
Why sample means behave predictably — even when individual data doesn't.
Sampling Distribution
The distribution of a statistic (e.g., sample mean x̄) across many repeated samples from the same population. You draw n samples, compute x̄; repeat m times → the collection of m sample means follows the sampling distribution of the mean.
Central Limit Theorem (CLT)
Central Limit Theorem
For i.i.d. samples from any distribution with finite mean μ and variance σ², the distribution of the sample mean x̄ approaches Normal as n → ∞, regardless of the underlying distribution's shape.
\[\bar{X} \xrightarrow{d} N\!\!\left(\mu,\; \frac{\sigma^2}{n}\right) \quad \text{as } n \to \infty\]
Key CLT facts:
• Works for ANY underlying distribution (log-normal, exponential, uniform, …) as long as it has finite variance.
• As n (samples per mean) increases → distribution of x̄ gets narrower and more bell-shaped.
• The mean of the sampling distribution = population mean μ (unbiased).
• More trials m (more sample means computed) → smoother histogram. More samples per mean n → narrower bell.
CLT applies to sample MEANS, not individual samples. Individual samples always follow the underlying distribution. Only the distribution of x̄ becomes approximately Normal.
Standard Error of the Mean
\[\text{SE} = \frac{\sigma}{\sqrt{n}}\]
The standard deviation of the sampling distribution of x̄. Measures how much sample means vary across repeated experiments. As n grows, SE shrinks → your estimate of μ becomes more precise.
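A short simulation shows both facts at once: even a heavily skewed exponential produces bell-shaped sample means, and their spread matches σ/√n (NumPy assumed; seed arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
n, m = 50, 10_000   # n samples per mean, m repeated experiments

# Exponential(lambda=1): strongly right-skewed, mu = 1, sigma = 1
means = rng.exponential(scale=1.0, size=(m, n)).mean(axis=1)

print(means.mean())        # ~ 1.0: centered on the population mean mu
print(means.std(ddof=1))   # ~ 0.141: matches SE = sigma/sqrt(n) = 1/sqrt(50)
```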
Standard Deviation vs Standard Error
| | Standard Deviation (σ) | Standard Error (SE) |
| --- | --- | --- |
| Measures | Spread of individual data points | Spread of sample means across experiments |
| Formula | \(\sqrt{\frac{\sum(x_i-\bar{x})^2}{n-1}}\) | \(\sigma/\sqrt{n}\) |
| As n → ∞ | Stays the same (individual variation is real) | → 0 (means converge to μ) |
| Use when | Describing how variable individuals are | Describing precision of the mean estimate |
Spark & PySpark
Distributed computing for datasets too large for a single machine.
Core Concepts
SparkSession
The entry point to all Spark functionality. Wraps a SparkContext (which coordinates with the cluster). In PySpark: spark = SparkSession.builder.getOrCreate(). Use it to read data, execute SQL, and create DataFrames.
RDD — Resilient Distributed Dataset
Spark's fundamental data structure:
• Immutable: cannot be changed after creation; transformations produce new RDDs.
• Distributed: partitioned across cluster nodes for parallel processing.
• Fault-tolerant: lost partitions can be recomputed from lineage (transformation history).
DataFrames are built on top of RDDs.
DataFrame
A distributed dataset organized into named, typed columns (like a table). The preferred API in modern PySpark. Enables the Catalyst optimizer to work most effectively.
Lazy Evaluation
Lazy Evaluation
Transformations are not executed when called — they add nodes to a logical query plan. Execution is deferred until an Action is called. This allows Spark to optimize the entire pipeline before running any computation.
Transformations (Lazy — do not trigger execution)
| Method | Purpose |
| --- | --- |
| df.select("col1","col2") | Choose specific columns |
| df.filter("condition") / df.where(...) | Filter rows by condition |
| df.groupBy("col").agg(...) | Group rows and aggregate |
| df.withColumn("new", expr) | Add or replace a column |
| df.orderBy("col") | Sort rows |
| df.join(other, on, how) | Join two DataFrames |
Actions (Trigger full execution)
| Method | Purpose |
| --- | --- |
| df.show(n) | Print first n rows to screen |
| df.count() | Count total rows |
| df.collect() | Return all rows to driver as Python list |
| df.first() | Return first row |
| df.write.*() | Write results to storage |
Catalyst Optimizer
Catalyst Optimizer
Spark's query optimizer that processes the logical plan before execution:
• Predicate pushdown: moves filter conditions close to the data source — reads less data.
• Projection pruning: drops unused columns early.
• Join reordering: reorders joins for efficiency.
Generates multiple physical plans, picks the cheapest (cost estimation), and compiles to JVM bytecode.
NumPy vs Pandas vs Spark
| Library | Best For | Limitation |
| --- | --- | --- |
| NumPy | Fast math on arrays; SIMD-vectorized; C backend | Single machine only |
| Pandas | Tabular data manipulation on one machine; rich API | Single machine; limited by RAM |
| Spark | Datasets too big for one machine; cluster-scale | Overhead: scheduling, network, serialization — overkill for small data |
Python is slow for raw computation (heap allocation per object, GIL, no SIMD). But NumPy, Spark, and PyTorch do the heavy math in compiled C/Scala/CUDA — Python is just the glue/API layer.
Common PySpark Patterns
# Imports and session shared by all snippets below
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.getOrCreate()
# Read a CSV file
df = spark.read.csv("/path/file.csv", header=True, inferSchema=True)
# Select specific columns
df.select("PassengerId", "Survived", "Pclass").show(5)
# Filter rows (two equivalent ways)
df.filter((col("Survived") == 1) & (col("Pclass") == 1)).show()
df.filter("Survived == 1").filter("Pclass == 1").show()
# Add a new column
df.withColumn("FamilySize", col("SibSp") + col("Parch")).show()
# Group and aggregate
df.groupBy("Pclass").count().orderBy("count", ascending=False).show()