Types of Data
How data is classified determines which statistical methods are appropriate to use.
Two top-level categories: Categorical and Numeric. Categorical splits into Ordinal and Nominal. Numeric splits into Continuous and Discrete.
Categorical Data
Data divided into groups or categories. Values are labels or names, not measurements. Arithmetic (+, ×) on categories is meaningless.
Species (cat, dog), colors (red, blue), garment sizes (S, M, L), ice cream flavors.
Ordinal Categorical Data
Categorical data with a meaningful natural ordering. You can say one value comes before or after another, but differences between levels aren't necessarily equal and arithmetic still doesn't apply.
{S, M, L, XL}, {January, February, …, December}, {Strongly Disagree → Strongly Agree}, {1st, 2nd, 3rd place}.
Nominal Categorical Data
Categorical data with no meaningful ordering. Labels are purely names — you cannot say one is greater than another in any inherent sense.
{Cat, Dog, Fish}, {Vanilla, Chocolate, Strawberry}, {Truck, Car}, {Red, Blue, Green}.
Common trap: Just because you CAN alphabetize categories doesn't make them ordinal. "Chocolate" before "Vanilla" alphabetically is not a meaningful ordering. Ordinal requires an ordering inherent in the concept (e.g., size, rank, time).
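The alphabetization trap is easy to demonstrate in code. A minimal sketch in plain Python (the size and flavor labels are taken from the examples above):

```python
# Any labels CAN be sorted alphabetically; that does not make them ordinal
flavors = ["Vanilla", "Chocolate", "Strawberry"]
print(sorted(flavors))  # ['Chocolate', 'Strawberry', 'Vanilla'], not meaningful

# Ordinal data carries an inherent order, which must be supplied explicitly
size_rank = {"S": 0, "M": 1, "L": 2, "XL": 3}
sizes = ["L", "S", "XL", "M"]
print(sorted(sizes, key=size_rank.get))  # ['S', 'M', 'L', 'XL'], meaningful
```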
Numeric Data
Continuous Numeric Data
Can take any real value within a range — infinitely many possible values between any two points. Measured, not counted. Described by a Probability Density Function (PDF).
Height (1.7312 m), temperature (−3.5°C), weight, time elapsed, electrical voltage.
Discrete Numeric Data
Can only take countable, distinct values — typically non-negative integers. You cannot have a fractional value. Described by a Probability Mass Function (PMF).
Number of customers (0, 1, 2, …), die roll result (1–6), emails per day, defective items in a batch.
Quick Reference Table
| Type | Ordered? | Arithmetic? | Examples |
| --- | --- | --- | --- |
| Ordinal Categorical | Yes — meaningful | No | S/M/L, months, survey ratings |
| Nominal Categorical | No | No | Colors, flavors, pet species |
| Discrete Numeric | Yes | Yes | Counts, die rolls, number of heads |
| Continuous Numeric | Yes | Yes | Height, temperature, time |
Random Experiments
The foundation of probability — understanding how uncertainty arises and is formalized.
Random Experiment
A process whose outcome is uncertain before it is performed, with several possible results. Must be repeatable (in principle) under the same conditions.
Rolling a die, drawing a card, flipping a coin, measuring tomorrow's temperature.
Trial
A single execution of the random experiment. Each trial produces exactly one outcome.
One flip of a coin = one trial. Each card draw from a deck = one trial.
Outcome
The result of a single trial. It is one element of the sample space.
Rolling a 3 on a die is an outcome. "Heads" on a coin flip is an outcome.
Sample Space (S or Ω)
The complete set of all possible outcomes of a random experiment. Every conceivable result must be listed.
Coin flip: S = {Heads, Tails}. Six-sided die: S = {1, 2, 3, 4, 5, 6}. Card draw: S = all 52 cards.
NOT a random experiment: Computing the area of a 1 m × 1 m square (result = 1 m² always — deterministic). For a random experiment, the outcome must be genuinely uncertain before the trial.
A computed mean is NOT an outcome: A class's average satisfaction score is a statistic aggregated from many measurements — not the result of a single trial.
Random Variables
A mathematical bridge that maps experimental outcomes to numbers on the real line.
Random Variable (X)
A function that maps outcomes to real numbers. For numeric data the mapping is direct (die shows 3 → X=3). For categorical data we assign numeric codes. We use capital letters (X, Y, Z) to denote random variables.
Discrete vs Continuous
Discrete Random Variable
Takes a countable set of distinct values (usually integers). There are gaps between possible values. Described by a PMF.
Number of heads in 10 flips (0–10), die roll (1–6), daily customer count (0, 1, 2, …).
Continuous Random Variable
Can take any real value in an interval — infinitely many possibilities. The probability of any exact value is 0; probabilities are computed over intervals. Described by a PDF.
Height, time between arrivals, temperature, weight, electrical resistance.
Quick test: Can the variable take fractional values meaningfully? Yes → continuous. Is it always a whole number by nature? → discrete. "3.7 customers" is impossible → discrete. "3.7 minutes" is perfectly valid → continuous.
Probability Functions
PMF — Probability Mass Function
For discrete variables. \(P[X=i]\) = probability X equals exactly i. Properties: P[X=i] ≥ 0 for all i, and \(\sum_{i \in S} P[X=i] = 1\).
PDF — Probability Density Function
For continuous variables. f(x) is probability density, not probability. Probability is the area under the curve: \(P[a \leq X \leq b] = \int_a^b f(x)\,dx\). Total area must equal 1.
CDF — Cumulative Distribution Function
\(F(x) = P[X \leq x]\). The probability that the variable is at most x. Always non-decreasing from 0 to 1. The CDF of the standard normal is written Φ(x).
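These three functions can be explored with scipy.stats (assumed available); the fair die and the standard normal below match examples used elsewhere in these notes:

```python
from scipy import stats

# PMF of a fair six-sided die (discrete uniform on 1..6)
die = stats.randint(1, 7)          # upper bound is exclusive
print(die.pmf(3))                  # 1/6 ~ 0.1667
print(sum(die.pmf(k) for k in range(1, 7)))  # PMF sums to 1 over S

# PDF and CDF of the standard normal N(0, 1)
z = stats.norm(0, 1)
print(z.pdf(0))                    # density at 0 ~ 0.3989 (not a probability)
print(z.cdf(0))                    # Phi(0) = 0.5
```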
Events
Any subset of the sample space we care about assigning a probability to.
Event (E)
Any subset of the sample space. An event "occurs" when the outcome of a trial falls within that subset.
Die roll S = {1,2,3,4,5,6}. Event "roll even" = {2,4,6}. Event "roll a 3" = {3}. Event "roll ≥ 4" = {4,5,6}.
Special events:
• Empty set ∅ — the impossible event, P[∅] = 0.
• The full sample space S — the certain event, P[S] = 1.
• Events can be infinite: "more than 100 customers" = {101, 102, 103, …} is a valid event.
Event ≠ Outcome. An outcome is a single element of S. An event is a subset (could contain one element, many, or none). "Rolling a 3" as an outcome is just 3; as an event it is {3}.
Axioms of Probability
Three Basic Axioms (Kolmogorov)
1. P[E] ≥ 0 for every event E.
2. P[S] = 1 (something always happens).
3. For mutually exclusive events E₁, E₂ (no overlap): P[E₁ ∪ E₂] = P[E₁] + P[E₂].
Distributions
How probability is spread across the possible values of a random variable.
Distribution
An assignment of probabilities to all possible outcomes. Requirements: (1) each probability ≥ 0, (2) all probabilities sum (or integrate) to 1.
Uniform Distribution
Discrete Uniform — all outcomes equally likely
For n outcomes: P[X = i] = 1/n for every outcome i.
Fair six-sided die: P[X=1] = … = P[X=6] = 1/6.
Continuous Uniform[a, b]
Constant density over [a, b], zero outside. PDF: \(f(x) = \tfrac{1}{b-a}\) for \(a \leq x \leq b\).
\[\mu = \frac{a+b}{2} \qquad \sigma^2 = \frac{(b-a)^2}{12}\]
Uniform[20,40]: mean = (20+40)/2 = 30, variance = (20)²/12 = 400/12 ≈ 33.33.
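The Uniform[20, 40] numbers can be sanity-checked with a quick Monte Carlo draw (NumPy assumed available; the seed is arbitrary):

```python
import numpy as np

a, b = 20, 40
mu = (a + b) / 2              # 30.0
var = (b - a) ** 2 / 12       # ~ 33.33

rng = np.random.default_rng(0)
x = rng.uniform(a, b, size=100_000)
# Sample mean and variance should land close to the theoretical values
print(mu, x.mean())
print(var, x.var())
```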
Gaussian (Normal) Distribution N(μ, σ)
Gaussian / Normal Distribution
The classic "bell curve." Two parameters: mean μ (center/location) and standard deviation σ (spread). Symmetric around μ. Total area under curve = 1.
\[f(x \mid \mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\]
Key Properties of the Gaussian
• Increasing σ → curve becomes wider and flatter (same area = 1).
• Increasing μ → curve shifts right, shape unchanged.
• Empirical rule: ~68% of data within 1σ, ~95% within 2σ, ~99.7% within 3σ.
• Notation: in these notes N(μ, σ) means the second parameter is the std dev σ, so variance = σ². (Many texts instead write N(μ, σ²), passing the variance as the second parameter.)
As sample size grows (100 → 1,000 → 10,000), a relative-frequency histogram converges to the true PDF — this is the Law of Large Numbers in action.
Exponential Distribution (λ)
Exponential Distribution
Models time between events in a Poisson process (arrivals, radioactive decay). Rate parameter λ. PDF: \(f(x) = \lambda e^{-\lambda x}\) for x ≥ 0.
\[\mu = \frac{1}{\lambda} \qquad \sigma^2 = \frac{1}{\lambda^2}\]
f(x) = 0.1e^{−0.1x} → λ = 0.1, mean = 1/0.1 = 10.
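One trap worth noting: NumPy parameterizes the exponential by the scale 1/λ, not the rate λ. A quick check of the λ = 0.1 example (NumPy assumed; seed arbitrary):

```python
import numpy as np

lam = 0.1
rng = np.random.default_rng(1)
x = rng.exponential(scale=1 / lam, size=100_000)   # scale = 1/lambda = 10

print(1 / lam, x.mean())        # theoretical mean 10 vs sample mean
print(1 / lam**2, x.var())      # theoretical variance 100 vs sample variance
```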
Triangular Distribution Tri(a, b, c)
Triangular Distribution
Defined on [a, b] with peak (mode) at c. PDF rises linearly from a to c, falls linearly from c to b.
\[\mu = \frac{a+b+c}{3} \qquad \sigma^2 = \frac{a^2+b^2+c^2-ab-ac-bc}{18}\]
Absolute vs Relative Frequency
Absolute Frequency
Raw count of samples in a bin.
30 of 200 samples fall in bin [10,15] → absolute frequency = 30.
Relative Frequency
Count ÷ total samples = proportion. As sample size → ∞, a bin's relative frequency approaches the probability of landing in that bin; dividing by the bin width approximates the probability density.
30 / 200 = 0.15 = 15% relative frequency for that bin.
Descriptive Statistics
Measures that summarize the center and spread of a dataset.
Range
\[\text{Range} = \max(S) - \min(S)\]
S=[−1,3,−4,2,6]: Range = 6 − (−4) = 10.
Mean (Arithmetic Average)
Mean x̄
Sum of all values divided by n. Sensitive to outliers — a single extreme value can pull it far from where most data sits.
\[\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i\]
S=[−1,3,−4,2,6]: Sum = 6, Mean = 6/5 = 1.2.
Median
Median
Middle value after sorting. Robust to outliers — extreme values in the tails do not affect it.
• n odd: median = element at index ⌊n/2⌋ (0-indexed).
• n even: median = average of elements at indices n/2−1 and n/2.
Sorted [−4,−1,2,3,6] (n=5 odd) → Median = 2. Sorted [−1,0,7,7,9,10] (n=6 even) → Median = (7+7)/2 = 7.
Percentile — Linear Interpolation Method
p-th Percentile (NumPy "linear" default)
Steps:
1. Sort the data into S′.
2. Fractional index (0-indexed): \(i = \tfrac{p}{100} \times (n-1)\)
3. If i is an integer → P_p = S′[i].
4. If i is fractional: \(P_p = S'[\lfloor i\rfloor] + (i - \lfloor i\rfloor)\cdot(S'[\lceil i\rceil] - S'[\lfloor i\rfloor])\)
S′=[−4,−1,2,3,6], n=5. 20th %ile: i = 0.20×4 = 0.8. Between S′[0]=−4 and S′[1]=−1: −4 + 0.8×(−1−(−4)) = −4 + 2.4 = −1.6.
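NumPy's np.percentile uses this linear interpolation method by default, so the worked example can be checked directly:

```python
import numpy as np

s = np.array([-1, 3, -4, 2, 6])
print(np.percentile(s, 20))   # -1.6, matching the hand calculation
print(np.percentile(s, 50))   # 2.0: the 50th percentile is the median
```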
Trimmed Mean
Trimmed Mean
Remove a fixed proportion from each tail of the sorted data, then compute the mean of what remains. Robust to extreme outliers in the tails. scipy.stats.trim_mean(data, p) removes proportion p from each end.
S = [−1,0,3,6,7,7,8,9,10,1000], 10% trimmed: remove 1 from each end → [0,3,6,7,7,8,9,10], mean = 50/8 = 6.25.
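Checking the same dataset with SciPy (scipy and numpy assumed available):

```python
import numpy as np
from scipy import stats

s = [-1, 0, 3, 6, 7, 7, 8, 9, 10, 1000]
print(np.mean(s))               # 104.9: dragged up by the 1000
print(stats.trim_mean(s, 0.1))  # 6.25: one value trimmed from each end
print(np.median(s))             # 7.0: untouched by the outlier
```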
Robustness ranking (most to least): Median ≈ Trimmed Mean > Mean. Extreme percentiles (90th, 95th) are unreliable near outliers.
Measures of Dispersion
How spread out the data is — and how robust each measure is to outliers.
Why not use the sum of deviations? It always equals zero:
\(\frac{1}{n}\sum_{i=1}^n (X_i - \bar{X}) = \bar{X} - \bar{X} = 0\)
Positive and negative deviations cancel perfectly, so any useful dispersion measure must first make every deviation non-negative, either by taking absolute values or by squaring.
Mean Absolute Deviation
\[\text{Mean Abs Dev} = \frac{1}{n}\sum_{i=1}^n |X_i - \bar{X}|\]
Average of absolute deviations from the mean. Uses absolute values so nothing cancels. Less sensitive to large errors than variance, but not as robust as Median Absolute Deviation.
Median Absolute Deviation (MAD)
\[\text{MAD} = \text{median}(|x_i - \text{median}(x)|)\]
The median of the absolute deviations from the median. Very robust — uses the median twice (once to center, once to summarize). Outliers have minimal impact.
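A minimal implementation of the formula (NumPy assumed; the dataset with the 1000 outlier reappears later in these notes):

```python
import numpy as np

def mad(x):
    """Median of absolute deviations from the median."""
    x = np.asarray(x)
    return np.median(np.abs(x - np.median(x)))

s = [-1, 0, 3, 6, 7, 7, 8, 9, 10, 1000]
print(mad(s))             # 2.5: barely notices the 1000
print(np.std(s, ddof=1))  # ~ 314: blown up by a single outlier
```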
Variance & Standard Deviation
Population Variance σ² (use when you have ALL the data)
Divide by n (the full population size).
\[\sigma^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \mu)^2\]
Sample Variance s² — Bessel's Correction (use for a sample)
Divide by (n−1). Corrects the underestimation bias introduced by using x̄ (estimated from the same data) instead of the true μ.
\[s^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2\]
Units of variance: If data is in °C, variance is in °C². This is why variance is hard to interpret — the unit is squared. Standard deviation (√variance) restores the original unit.
Interquartile Range (IQR)
\[\text{IQR} = Q_3 - Q_1 = P_{75} - P_{25}\]
Range of the middle 50% of data. Most robust of all dispersion measures — extreme values in the tails have zero effect because they aren't used in the computation.
Robustness Comparison
| Measure | Robustness to Outliers | Why |
| --- | --- | --- |
| IQR | Most robust | Uses only middle 50%; ignores tails |
| Median Abs Dev (MAD) | Very robust | Uses median twice |
| Mean Abs Dev | Moderate | Uses the mean, which outliers pull |
| Sample Std Dev (s) | Sensitive | Squared deviations amplify outliers |
| Variance (s²) | Most sensitive | Squaring makes one outlier dominant |
Box-and-Whisker Plot (Tukey Convention)
Box Plot
• Box spans Q1 to Q3 (the IQR).
• Line inside the box = median (Q2).
• Whiskers extend to the last data point within 1.5×IQR of Q1 (lower) and Q3 (upper).
• Points beyond whiskers are plotted as individual dots — these are flagged as outliers.
Outliers & Skew
How extreme values distort statistics and reveal the shape of a distribution.
Skewness
Right Skew (Positive Skew)
Long tail on the right. Large outliers pull the mean right of the median. Rule of thumb: mean > median → right skew.
S=[3,8,6,9,−1,10,1000,7,7,0]: mean=104.9, median=7. Mean ≫ median → strongly right skewed.
Left Skew (Negative Skew)
Long tail on the left. Small outliers pull the mean below the median. Rule of thumb: mean < median → left skew.
Symmetric (No Skew)
Mean ≈ median. The Gaussian distribution is perfectly symmetric.
Effect of One Outlier on Each Statistic
| Statistic | Outlier Effect | Robust? |
| --- | --- | --- |
| Mean | Strongly pulled toward outlier | No |
| Median | Little to no effect | Yes |
| Trimmed Mean (10%) | Removed if outlier is in the tail | Yes |
| Variance / Std Dev | Heavily inflated (squared deviations) | No |
| IQR | No effect (outlier is in the tail) | Yes |
| Extreme %iles (90th+) | Heavily distorted by interpolation | No |
| Median percentile (50th) | Same as median — robust | Yes |
Extreme percentiles near outliers are unreliable. If the 90th percentile calculation interpolates between a normal data point and an outlier, the result is wildly inflated. Be cautious with percentiles above ~85th when outliers may exist.
Worked example — S = [3,8,6,9,−1,10,1000,7,7,0], sorted: [−1,0,3,6,7,7,8,9,10,1000]
Mean = 1049/10 = 104.9 | Median = (7+7)/2 = 7
80th %ile: i = 0.8×9 = 7.2 → 9 + 0.2×(10−9) = 9.2
90th %ile: i = 0.9×9 = 8.1 → 10 + 0.1×(1000−10) = 109 ← distorted by outlier!
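The worked example verifies directly in NumPy (whose default percentile method is the linear interpolation described earlier):

```python
import numpy as np

s = np.array([3, 8, 6, 9, -1, 10, 1000, 7, 7, 0])
print(np.mean(s), np.median(s))   # 104.9 vs 7.0, so mean >> median: right skew
print(np.percentile(s, 80))       # 9.2: interpolates between ordinary points
print(np.percentile(s, 90))       # 109.0: interpolates into the outlier
```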
Population vs Sample Statistics
The critical difference in formulas — and why Bessel's correction exists.
Population
The complete set of all subjects of interest. You know the true parameters (μ, σ²). Use N (the full population size) in the denominator.
Sample
A subset drawn to make inferences about the whole. Sample statistics estimate population parameters. Use (n−1) in variance (Bessel's correction).
Formulas Side by Side
| Statistic | Population (÷ N) | Sample (÷ n−1) |
| --- | --- | --- |
| Mean | \(\mu = \frac{1}{N}\sum X_i\) | \(\bar{x} = \frac{1}{n}\sum X_i\) |
| Variance | \(\sigma^2 = \frac{\sum(X_i-\mu)^2}{N}\) | \(s^2 = \frac{\sum(X_i-\bar{x})^2}{n-1}\) |
| Std Dev | \(\sigma = \sqrt{\sigma^2}\) | \(s = \sqrt{s^2}\) |
Bessel's Correction — Why n−1?
When computing sample variance we use x̄ (estimated from the same data) instead of the true μ. The sample mean always sits at the center of the sample, making deviations from x̄ systematically smaller than deviations from the true μ. Dividing by (n−1) instead of n corrects this downward bias, making s² an unbiased estimator of σ².
Worked — S = {4, 7, 9, 12, 15, −9}, mean = 38/6 ≈ 6.33
Sum of squared deviations ≈ 355.33
Population variance (÷6) ≈ 59.22 | Sample variance (÷5) ≈ 71.07
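NumPy's ddof ("delta degrees of freedom") parameter switches between the two conventions; ddof=1 applies Bessel's correction:

```python
import numpy as np

s = np.array([4, 7, 9, 12, 15, -9])
print(np.var(s))          # / n   -> population variance ~ 59.22
print(np.var(s, ddof=1))  # / n-1 -> sample variance     ~ 71.07
print(np.std(s, ddof=1))  # sample standard deviation    ~ 8.43
```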
Z-scores, Φ (Phi) & Erf
Computing probabilities under any Gaussian by standardizing to N(0,1).
Z-Score
Z-Score (Standardized Score)
Converts a raw value x from X~N(μ,σ) to a standard normal score: how many standard deviations x is above or below the mean.
\[Z = \frac{x - \mu}{\sigma}\]
Test: μ=500, σ=100, x=620 → Z = (620−500)/100 = 1.2. John is 1.2 standard deviations above the mean.
Φ (Phi) — CDF of N(0,1)
Φ(z) — Phi Function
\(\Phi(z) = P[Z \leq z]\) where Z~N(0,1). It's the cumulative area under the standard normal bell curve from −∞ to z. To use for any Gaussian: compute Z first, then look up Φ(Z).
Φ(0) = 0.500, Φ(1) ≈ 0.841, Φ(1.2) ≈ 0.885, Φ(2) ≈ 0.977.
Error Function (erf)
\[\text{erf}(x) = \frac{2}{\sqrt{\pi}}\int_0^x e^{-t^2}\,dt\]
\[\Phi(z) = \frac{1}{2}\!\left[1 + \text{erf}\!\left(\frac{z}{\sqrt{2}}\right)\right]\]
Computing Probabilities
P(X < x) — One-tailed left
Compute Z = (x−μ)/σ, then P = Φ(Z).
P(score < 620): Z=1.2, Φ(1.2) ≈ 88.5%.
P(X > x) — One-tailed right
P(X > x) = 1 − Φ(Z).
P(a < X < b) — Interval probability
P(a < X < b) = Φ(Z_b) − Φ(Z_a), where Z_a=(a−μ)/σ and Z_b=(b−μ)/σ.
P(|X−μ| > k·σ) — Two-tailed outside ±kσ
P = 1 − erf(k/√2).
Steel rods: μ=100, σ=2. P(X<96 or X>104): k=2. P = 1−erf(√2) ≈ 1−0.9545 = 4.55%.
Key Φ Values to Memorize
| Z | Φ(Z) ≈ | Meaning |
| --- | --- | --- |
| 0 | 0.500 | 50% below mean |
| 1.0 | 0.841 | ±1σ covers 68.3% of data |
| 1.2 | 0.885 | John's exam example |
| 1.645 | 0.950 | 90th percentile of N(0,1) |
| 1.96 | 0.975 | 95% CI boundary (two-sided) |
| 2.0 | 0.977 | ±2σ covers 95.4% of data |
| 3.0 | 0.9987 | ±3σ covers 99.7% of data |
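All of these values can be reproduced from the erf identity above, using only the standard library:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF via the error function identity."""
    return 0.5 * (1 + erf(z / sqrt(2)))

print(round(phi(0), 3))     # 0.5
print(round(phi(1.2), 3))   # 0.885: John's exam example
print(round(phi(1.96), 3))  # 0.975: the 95% CI boundary

# Two-tailed steel-rod example: P(|X - mu| > 2*sigma)
print(round(1 - erf(2 / sqrt(2)), 4))   # 0.0455
```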
Covariance & Pearson Correlation
Measuring how two variables move together — direction and strength.
Covariance
\[\text{Cov}(X,Y) = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})\]
Interpreting the Sign of Cov(X,Y)
• Cov > 0: When X is above its mean, Y tends to be above its mean (move together).
• Cov < 0: When X is high, Y tends to be low (move opposite).
• Cov ≈ 0: No consistent linear relationship.
Covariance is scale-dependent — its magnitude depends on the units of X and Y. A covariance of 425 between height(cm) and weight(kg) is not comparable to 425 between two other variables. That is why we normalize to get Pearson r.
Pearson Correlation Coefficient (r)
\[r = \frac{\text{Cov}(X,Y)}{\sigma_X \cdot \sigma_Y}\]
Normalizes covariance by the product of standard deviations. Always in [−1, +1]. Scale-independent (unit-free).
| r value | Interpretation |
| --- | --- |
| +1 | Perfect positive linear relationship |
| 0.7 to 1 | Strong positive correlation |
| 0 to 0.3 | Weak positive correlation |
| ≈ 0 | No linear relationship (may have nonlinear!) |
| −1 to 0 | Negative correlation |
| −1 | Perfect negative linear relationship |
r = 0 does NOT mean no relationship! Y = X² gives r ≈ 0 (symmetric parabola — positive and negative halves cancel). Pearson r only detects linear relationships. Always plot the data with a scatter plot.
Y = constant → r is undefined. σ_Y = 0 means division by zero in the formula. A constant has no linear relationship with anything.
Worked — X=[1,3,5,7], Y=[8,6,5,3], μ_X=4, μ_Y=5.5
Deviations X: −3,−1,+1,+3 | Deviations Y: +2.5,+0.5,−0.5,−2.5
Products: −7.5, −0.5, −0.5, −7.5 → sum = −16
Cov = −16/3 ≈ −5.33 | σ_X=√(20/3), σ_Y=√(13/3)
r = −16/√260 ≈ −0.992 (strong negative).
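NumPy confirms the hand calculation (np.cov and np.corrcoef return 2×2 matrices; the off-diagonal entry is the pairwise value). The parabola example shows why r = 0 must not be read as "no relationship":

```python
import numpy as np

x = np.array([1, 3, 5, 7])
y = np.array([8, 6, 5, 3])
print(np.cov(x, y)[0, 1])       # ~ -5.333 (sample covariance, / n-1)
print(np.corrcoef(x, y)[0, 1])  # ~ -0.992

# Y = X^2 on a symmetric range: perfectly dependent, yet r = 0
x2 = np.array([-2, -1, 0, 1, 2])
print(np.corrcoef(x2, x2**2)[0, 1])   # 0.0
```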
Bias & Error Types
Understanding systematic and random sources of error in data collection.
Random Error
Unpredictable, unsystematic variation in measurements. Averages out over many trials. Not a flaw — every measurement has some random component.
A scale gives slightly different readings each time you weigh the same 100 g object due to vibrations.
Systematic Error (Bias)
Consistent, repeatable error in the same direction. Does not average out. Indicates a flaw in the study design, instrument, or data collection process.
A miscalibrated scale that always reads 5% too high.
Types of Bias
Measurement Bias
Systematic error in how data is measured — the instrument or procedure introduces consistent errors in one direction.
A laser rangefinder miscalibrated by ×1.2 → all reported distances are 20% too large.
Observer Bias
The researcher's expectations or prior beliefs influence how they collect, record, or interpret data. Also called experimenter bias or confirmation bias.
A police officer who expects a high-crime neighborhood interprets ambiguous behavior as suspicious.
Selection Bias
The sample is not representative of the target population because some groups are systematically over- or under-represented in the selection process.
A pollster calls only landlines during daytime → misses mobile-only users and working people → biased sample.
Non-Response Bias
Participants who do not respond differ systematically from those who do, making the final sample non-representative even if selection was random.
University survey with only 30% response rate — students with strong opinions (positive or negative) are more likely to respond.
Self-Reporting Bias
Participants inaccurately report their own behaviors or attitudes due to memory errors, misunderstanding, or discomfort with truthful answers.
Participants in a weight-loss study self-report monthly weight loss — imperfect memory leads to inaccuracies.
Social Desirability Bias
Participants respond in ways they believe are socially acceptable or will reflect well on them, rather than truthfully.
People over-report exercise frequency and charitable donations; under-report alcohol use and unhealthy eating.
Multiple biases can co-occur. Self-reporting bias and social desirability bias often appear together. Selection bias and non-response bias can affect the same study simultaneously.
Quick Reference
| Bias Type | Who is biased? | Key Signal |
| --- | --- | --- |
| Measurement | The instrument | Consistent systematic offset in all readings |
| Observer | The researcher | Prior expectations alter observations |
| Selection | The sampling process | Some population groups excluded |
| Non-response | The respondents | Low response rate; non-responders differ |
| Self-reporting | The participant | Inaccurate recall or reporting |
| Social desirability | The participant | Reporting what sounds good, not true |
Sampling Distributions & CLT
Why sample means behave predictably — even when individual data doesn't.
Sampling Distribution
The distribution of a statistic (e.g., sample mean x̄) across many repeated samples from the same population. You draw n samples, compute x̄; repeat m times → the collection of m sample means follows the sampling distribution of the mean.
Central Limit Theorem (CLT)
Central Limit Theorem
For i.i.d. samples from any distribution with finite mean μ and variance σ², the distribution of the sample mean x̄ approaches Normal as n → ∞, regardless of the underlying distribution's shape.
\[\bar{X} \xrightarrow{d} N\!\!\left(\mu,\; \frac{\sigma^2}{n}\right) \quad \text{as } n \to \infty\]
Key CLT facts:
• Works for ANY underlying distribution (log-normal, exponential, uniform, …) as long as it has finite variance.
• As n (samples per mean) increases → distribution of x̄ gets narrower and more bell-shaped.
• The mean of the sampling distribution = population mean μ (unbiased).
• More trials m (more sample means computed) → smoother histogram. More samples per mean n → narrower bell.
CLT applies to sample MEANS, not individual samples. Individual samples always follow the underlying distribution. Only the distribution of x̄ becomes approximately Normal.
Standard Error of the Mean
\[\text{SE} = \frac{\sigma}{\sqrt{n}}\]
The standard deviation of the sampling distribution of x̄. Measures how much sample means vary across repeated experiments. As n grows, SE shrinks → your estimate of μ becomes more precise.
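A short simulation shows both facts at once: even a heavily skewed exponential produces bell-shaped sample means, and their spread matches σ/√n (NumPy assumed; seed arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
n, m = 50, 10_000   # n samples per mean, m repeated experiments

# Exponential(lambda=1): strongly right-skewed, mu = 1, sigma = 1
means = rng.exponential(scale=1.0, size=(m, n)).mean(axis=1)

print(means.mean())        # ~ 1.0: centered on the population mean mu
print(means.std(ddof=1))   # ~ 0.141: matches SE = sigma/sqrt(n) = 1/sqrt(50)
```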
Standard Deviation vs Standard Error
| | Standard Deviation (σ) | Standard Error (SE) |
| --- | --- | --- |
| Measures | Spread of individual data points | Spread of sample means across experiments |
| Formula | \(\sqrt{\frac{\sum(x_i-\bar{x})^2}{n-1}}\) | \(\sigma/\sqrt{n}\) |
| As n → ∞ | Stays the same (individual variation is real) | → 0 (means converge to μ) |
| Use when | Describing how variable individuals are | Describing precision of the mean estimate |
Spark & PySpark
Distributed computing for datasets too large for a single machine.
Core Concepts
SparkSession
The entry point to all Spark functionality. Wraps a SparkContext (which coordinates with the cluster). In PySpark: spark = SparkSession.builder.getOrCreate(). Use it to read data, execute SQL, and create DataFrames.
RDD — Resilient Distributed Dataset
Spark's fundamental data structure:
• Immutable: cannot be changed after creation; transformations produce new RDDs.
• Distributed: partitioned across cluster nodes for parallel processing.
• Fault-tolerant: lost partitions can be recomputed from lineage (transformation history).
DataFrames are built on top of RDDs.
DataFrame
A distributed dataset organized into named, typed columns (like a table). The preferred API in modern PySpark. Enables the Catalyst optimizer to work most effectively.
Lazy Evaluation
Lazy Evaluation
Transformations are not executed when called — they add nodes to a logical query plan. Execution is deferred until an Action is called. This allows Spark to optimize the entire pipeline before running any computation.
Transformations (Lazy — do not trigger execution)
| Method | Purpose |
| --- | --- |
| df.select("col1","col2") | Choose specific columns |
| df.filter("condition") / df.where(...) | Filter rows by condition |
| df.groupBy("col").agg(...) | Group rows and aggregate |
| df.withColumn("new", expr) | Add or replace a column |
| df.orderBy("col") | Sort rows |
| df.join(other, on, how) | Join two DataFrames |
Actions (Trigger full execution)
| Method | Purpose |
| --- | --- |
| df.show(n) | Print first n rows to screen |
| df.count() | Count total rows |
| df.collect() | Return all rows to driver as Python list |
| df.first() | Return first row |
| df.write.*() | Write results to storage |
Catalyst Optimizer
Catalyst Optimizer
Spark's query optimizer that processes the logical plan before execution:
• Predicate pushdown: moves filter conditions close to the data source — reads less data.
• Projection pruning: drops unused columns early.
• Join reordering: reorders joins for efficiency.
Generates multiple physical plans, picks the cheapest (cost estimation), and compiles to JVM bytecode.
NumPy vs Pandas vs Spark
| Library | Best For | Limitation |
| --- | --- | --- |
| NumPy | Fast math on arrays; SIMD-vectorized; C backend | Single machine only |
| Pandas | Tabular data manipulation on one machine; rich API | Single machine; limited by RAM |
| Spark | Datasets too big for one machine; cluster-scale | Overhead: scheduling, network, serialization — overkill for small data |
Python is slow for raw computation (heap allocation per object, GIL, no SIMD). But NumPy, Spark, and PyTorch do the heavy math in compiled C/Scala/CUDA — Python is just the glue/API layer.
Common PySpark Patterns
# Imports and session shared by all snippets below
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.getOrCreate()
# Read a CSV file
df = spark.read.csv("/path/file.csv", header=True, inferSchema=True)
# Select specific columns
df.select("PassengerId", "Survived", "Pclass").show(5)
# Filter rows (two equivalent ways)
df.filter((col("Survived") == 1) & (col("Pclass") == 1)).show()
df.filter("Survived == 1").filter("Pclass == 1").show()
# Add a new column
df.withColumn("FamilySize", col("SibSp") + col("Parch")).show()
# Group and aggregate
df.groupBy("Pclass").count().orderBy("count", ascending=False).show()