# Statistical Analysis of Educational Data: Analyzing Confidence Intervals, Hypothesis Testing, and Central Limit Theorem

Discover the nuances of educational data analysis as we explore mean calculations, conduct hypothesis testing, and establish confidence intervals using real-world datasets. Uncover the significance of statistical insights in understanding educational variables, and gain valuable perspectives on the impact of sample size and the Central Limit Theorem.

## Problem 1: Analyzing Educational Data

### Problem Description:

In this statistical analysis assignment, we delve into a dataset (ps4data.xlsx) focused on educational variables. Our objective is to perform statistical analyses, including calculating means, conducting t-tests, and establishing confidence intervals.

Part a: Descriptive Statistics and Confidence Interval

```r
# Sample mean
sample.mean <- mean(ps4data$educ)
print(sample.mean)
## [1] 7.044534

# Standard error
sample.n <- length(ps4data$educ)
sample.sd <- sd(ps4data$educ)
sample.se <- sample.sd / sqrt(sample.n)
print(sample.se)
## [1] 0.1061065

# t score corresponding to the confidence interval
alpha <- 0.05
degrees.freedom <- sample.n - 1
t.score <- qt(p = alpha / 2, df = degrees.freedom, lower.tail = FALSE)
print(t.score)
## [1] 1.963175

# Margin of error
margin.error <- t.score * sample.se

# Confidence interval
lower.bound <- sample.mean - margin.error
upper.bound <- sample.mean + margin.error
print(c(lower.bound, upper.bound))
## [1] 6.836229 7.252840
```

Outcome: The sample mean education level is 7.044534, and the 95% confidence interval for the population mean is (6.836229, 7.252840).

Part b: One Sample t-test

```r
t.test(ps4data$educ, mu = 5, alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  ps4data$educ
## t = 19.269, df = 740, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 5
## 95 percent confidence interval:
##  6.836229 7.252840
## sample estimates:
## mean of x 
##  7.044534
```

Outcome: With t = 19.269 and p < 2.2e-16, we reject the null hypothesis: the true mean education level is not equal to 5.

Part c: One Sample T-test with Different Hypothesis

```r
t.test(ps4data$educ, mu = 7.2, alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  ps4data$educ
## t = -1.4652, df = 740, p-value = 0.1433
## alternative hypothesis: true mean is not equal to 7.2
## 95 percent confidence interval:
##  6.836229 7.252840
## sample estimates:
## mean of x 
##  7.044534
```

Outcome: Testing against a hypothesized mean of 7.2 yields p = 0.1433, so we fail to reject the null hypothesis; the data are consistent with a mean of 7.2.

Part d: Two-Sample t-test

```r
Y_t <- subset(ps4data, ps4data$abd == 1)
Y_c <- subset(ps4data, ps4data$abd == 0)

# Two-sided t-test with unequal variances
t.test(Y_t$educ, Y_c$educ, alternative = "two.sided", var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  Y_t$educ and Y_c$educ
## t = -2.6798, df = 551.58, p-value = 0.007587
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.0318702 -0.1589784
## sample estimates:
## mean of x mean of y 
##  6.820346  7.415771
```

Outcome: The Welch two-sample t-test yields p = 0.007587, so we reject the null hypothesis of equal means: average education differs significantly between the abd == 1 and abd == 0 groups.

Part e: Advantages of One-Tailed Test

Outcome: At the same significance level, a one-tailed test places the entire rejection region in one tail, giving it greater statistical power to detect an effect in the hypothesized direction; the trade-off is that it cannot detect an effect in the opposite direction.
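To make the power advantage concrete, here is a short Python sketch (standard library only, since the assignment itself is in R) using a normal approximation, which is reasonable at df = 740. The 2.5-standard-error effect size is an arbitrary illustration, not a value from the dataset:

```python
from statistics import NormalDist

z = NormalDist()
alpha = 0.05

# Critical values under a normal approximation (fine at df = 740)
crit_two = z.inv_cdf(1 - alpha / 2)  # ~1.96: alpha split across both tails
crit_one = z.inv_cdf(1 - alpha)      # ~1.645: all of alpha in one tail

# Power to detect a true effect 2.5 standard errors in the
# hypothesized direction (an arbitrary illustrative effect size)
effect = 2.5
power_two = (1 - z.cdf(crit_two - effect)) + z.cdf(-crit_two - effect)
power_one = 1 - z.cdf(crit_one - effect)
print(f"one-tailed power {power_one:.3f} > two-tailed power {power_two:.3f}")
```

Because the one-tailed test rejects beyond roughly 1.645 rather than 1.96, a true effect in the hypothesized direction is detected more often at the same alpha.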

Part f: One-Tailed Two-Sample t-test

```r
Y_t <- subset(ps4data, ps4data$abd == 1)
Y_c <- subset(ps4data, ps4data$abd == 0)

# One-sided t-test: is the mean of Y_t less than the mean of Y_c?
t.test(Y_t$educ, Y_c$educ, alternative = "less", var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  Y_t$educ and Y_c$educ
## t = -2.6798, df = 551.58, p-value = 0.003794
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##       -Inf -0.2293362
## sample estimates:
## mean of x mean of y 
##  6.820346  7.415771
```

Outcome: Testing whether the mean of Y_t is less than the mean of Y_c yields a one-sided p-value of 0.003794 (half the two-sided value), so we reject the null hypothesis at the 5% level.

Part g: Two-Sample t-test with Different Variable

```r
Y_t <- subset(ps4data, ps4data$abd == 1)
Y_c <- subset(ps4data, ps4data$abd == 0)

# Two-sided t-test on father's education
t.test(Y_t$fthr_ed, Y_c$fthr_ed, alternative = "two.sided", var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  Y_t$fthr_ed and Y_c$fthr_ed
## t = -1.1125, df = 572.99, p-value = 0.2664
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.8408032  0.2327410
## sample estimates:
## mean of x mean of y 
##  5.764069  6.068100
```

Outcome: For father's education (fthr_ed), the test yields p = 0.2664, so we find no significant difference in means between the groups at this sample size.

Part h: Minimizing Type I Error


Outcome: The Type I error rate is set directly by the significance level, so to minimize it we lower alpha (e.g., from 0.05 to 0.01). Changing the sample size does not affect the Type I error rate; it affects power, and hence the Type II error rate.
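A quick simulation makes this concrete. The Python sketch below (standard library only; the sample sizes and replication count are arbitrary choices, not from the assignment) draws data with the null hypothesis true and shows the false-rejection rate tracking alpha rather than n:

```python
import random
from statistics import NormalDist

random.seed(1)

# Two-sided critical values for each significance level
CRIT = {a: NormalDist().inv_cdf(1 - a / 2) for a in (0.05, 0.01)}

def type1_rate(n, alpha, reps=4000):
    """Share of two-sided z-tests that falsely reject H0: mu = 0
    when sampling from N(0, 1), i.e. while H0 is actually true."""
    hits = 0
    for _ in range(reps):
        xbar = sum(random.gauss(0, 1) for _ in range(n)) / n
        z = xbar * n ** 0.5  # known sigma = 1, so a z-test applies
        if abs(z) > CRIT[alpha]:
            hits += 1
    return hits / reps

rate_small_05 = type1_rate(10, 0.05)
rate_large_05 = type1_rate(100, 0.05)
rate_small_01 = type1_rate(10, 0.01)
rate_large_01 = type1_rate(100, 0.01)
print(rate_small_05, rate_large_05)  # both hover near 0.05
print(rate_small_01, rate_large_01)  # both hover near 0.01
```

Growing n from 10 to 100 leaves the false-rejection rate near alpha; only shrinking alpha itself pushes it down.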

Part i: One Sample t-test for Wages Improvement


Outcome: The one-sample t-test assesses wage improvement, comparing those with vocational training to those without.
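Since the R code for this part is not shown, here is a hypothetical Python sketch of how such a one-sample test could be set up: the wage-change numbers are entirely invented for illustration, and the critical value is hardcoded from a t table rather than computed from the dataset.

```python
from statistics import mean, stdev

# Hypothetical post-minus-pre wage changes for 8 workers with
# vocational training (numbers invented purely for illustration)
wage_change = [1.2, 0.8, -0.3, 2.1, 1.5, 0.4, 0.9, 1.8]

n = len(wage_change)
xbar = mean(wage_change)
se = stdev(wage_change) / n ** 0.5
t_stat = xbar / se  # H0: mean wage change is 0

T_CRIT = 2.365  # two-sided 5% critical value for df = 7, from a t table
print(f"t = {t_stat:.3f}, reject H0: {abs(t_stat) > T_CRIT}")
```

The logic mirrors Part b: a sample mean divided by its standard error, compared against the t distribution with n - 1 degrees of freedom.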

## Problem 2: Simulation and Central Limit Theorem

### Problem Description:

In this scenario, the challenge lies in understanding the impact of sample size on hypothesis testing and the subsequent insights derived from the Central Limit Theorem. We delve into the intricacies of rejection rates, providing a hands-on perspective on the importance of appropriate sample sizes in statistical analyses.

Part a: Small Sample Size Issue


Outcome: Simulating many small samples from an exponential distribution produces a rejection rate well above the nominal 5% level: with so few, heavily skewed observations, the normality assumption behind the t-test breaks down.

Part b: Larger Sample Size and Central Limit Theorem


Outcome: With a larger sample size (100), the rejection rate falls back toward the nominal 5% level: by the Central Limit Theorem, the sampling distribution of the mean is approximately normal even though the underlying data are exponential.
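The two simulations can be sketched together in Python (standard library only; the small sample size n = 5 and the hardcoded t critical values are assumptions, since the original R code is not shown). The null hypothesis is true in every draw, so an exact test would reject 5% of the time:

```python
import random
from statistics import mean, stdev

random.seed(42)

def rejection_rate(n, t_crit, reps=3000):
    """Two-sided one-sample t-test of H0: mu = 1 on Exponential(1) data.

    H0 is true (the exponential mean is 1), so any excess over the
    nominal 5% level reflects the failure of the normality assumption.
    """
    rejections = 0
    for _ in range(reps):
        x = [random.expovariate(1.0) for _ in range(n)]
        t = (mean(x) - 1.0) / (stdev(x) / n ** 0.5)
        if abs(t) > t_crit:
            rejections += 1
    return rejections / reps

# Two-sided 5% critical values hardcoded from a t table (assumption)
rate_small = rejection_rate(n=5, t_crit=2.776)    # df = 4
rate_large = rejection_rate(n=100, t_crit=1.984)  # df = 99
print(rate_small, rate_large)
```

The small-sample rate should come out well above 0.05, while the n = 100 rate sits near the nominal level, mirroring the outcomes above.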