6  Model Performance Difference

6.1 One-Sample Proportion Test in ML Research Context

6.1.1 Understanding the One-Sample Proportion Test in ML Research Context

In machine learning (ML) research, especially in healthcare, the one-sample proportion test is not directly testing the performance of the entire ML model or comparing different models. Instead, it focuses on comparing a single performance metric (e.g., accuracy, sensitivity, specificity, or AUC) against a known or expected proportion value. This test is particularly useful for determining whether the performance of an ML model is significantly different from a predefined threshold or benchmark.

6.1.2 What Does the One-Sample Proportion Test Actually Test?

The one-sample proportion test is used to answer a specific question:

Is the proportion (e.g., AUC, accuracy, sensitivity) of the new ML model significantly different from a predefined value?

6.1.2.1 In the context of the one-sample proportion test:

  • Null Hypothesis (\(H_0\)): The observed proportion of the new ML model is equal to a pre-specified proportion (e.g., AUC = 0.90).

  • Alternative Hypothesis (\(H_a\)): The observed proportion is significantly different from the pre-specified proportion (e.g., AUC ≠ 0.90).
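To make these hypotheses concrete, here is a minimal R sketch using base R’s binom.test(); the test-set counts are hypothetical and purely for illustration, and the performance metric is treated as a simple proportion of cases:

# Hypothetical counts: 132 of 150 test-set cases counted as successes for the metric of interest
observed_successes <- 132
n_cases <- 150

# Exact one-sample test of the observed proportion against the benchmark p0 = 0.90
binom.test(x = observed_successes, n = n_cases, p = 0.90,
           alternative = "two.sided")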

6.1.3 What Is the One-Sample Proportion Test Used For in ML Research?

In the context of ML research in healthcare, where methodologies, datasets, and algorithms can vary significantly between studies, the one-sample proportion test can be used for:

  1. Benchmarking Against Known Standards:
    • When you have a new dataset or methodology, you can use the one-sample proportion test to see if your model’s performance on this new dataset is significantly different from a known standard or a value reported in the literature.
    • For example, if a previous study showed that the AUC of a model for diagnosing a specific disease was 0.90, you might want to test whether your new model’s AUC on a new dataset is significantly different from 0.90.
  2. Assessing Model Performance Stability:
    • Even if the datasets and methodologies change, researchers often want to ensure that their new model maintains a certain level of performance.
    • For instance, if you expect your model to achieve an accuracy of 85%, you can use a one-sample proportion test to check if the observed accuracy on the test set is significantly different from 0.85.
  3. Sample Size Determination for Specific Performance Metrics:
    • The one-sample proportion test helps determine the minimum sample size needed to detect a meaningful difference in a proportion-based performance metric (e.g., sensitivity, specificity).
    • This is particularly useful when setting up experiments to evaluate performance metrics in subgroups of the data or under specific conditions, where you want to ensure that your sample size is adequate for detecting a specific level of change; a minimal sketch of such a calculation follows this list.
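As a rough illustration of the sample-size use, the classical normal-approximation formula for a one-sample proportion test can be computed directly in base R. The numbers below (benchmark 0.90, target 0.95, two-sided \(\alpha = 0.05\), power 0.80) mirror the example in Section 6.2; note that this approximation will not agree exactly with the arcsine-based pwr calculation shown there:

# Normal-approximation sample size for a one-sample proportion test
p0    <- 0.90   # benchmark proportion under H0
pa    <- 0.95   # smallest proportion you want to be able to detect
alpha <- 0.05   # two-sided significance level
power <- 0.80   # desired power

z_alpha <- qnorm(1 - alpha / 2)
z_beta  <- qnorm(power)

n <- (z_alpha * sqrt(p0 * (1 - p0)) + z_beta * sqrt(pa * (1 - pa)))^2 / (pa - p0)^2
ceiling(n)   # round up to the next whole case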

6.1.4 Limitations and Considerations

  1. Model-Specific vs. Metric-Specific:
    • The one-sample proportion test focuses on a single proportion (e.g., AUC or accuracy) and not on the overall comparison between models or different ML algorithms.
    • It doesn’t account for the variance introduced by different algorithms, preprocessing steps, or hyperparameters.
  2. Not Suitable for Model Comparisons:
    • If your goal is to compare two or more ML models directly, a one-sample proportion test is not appropriate. Instead, you might use tests such as McNemar’s test (for comparing classifiers on the same cases) or paired t-tests (for continuous metrics); a brief sketch of McNemar’s test follows this list.
  3. Assumes Fixed Threshold or Benchmark:
    • The one-sample proportion test assumes that you have a well-defined threshold or benchmark value to compare against (e.g., AUC = 0.90). This can be difficult in rapidly evolving fields like ML, where benchmarks may not always be well established.
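As a side note to limitation 2, when two classifiers are evaluated on the same test set, McNemar’s test is a common choice. A minimal base-R sketch (the prediction vectors are toy placeholders, not real data):

# Toy paired results: whether each of two models classified each of ten cases correctly
model_a_correct <- c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE)
model_b_correct <- c(TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)

# 2 x 2 table of agreement/disagreement between the two models
paired_table <- table(model_a_correct, model_b_correct)

# McNemar's test uses only the discordant cells (cases where the models disagree)
mcnemar.test(paired_table)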

6.1.5 Example Use Case

Suppose you are developing a new machine learning model to detect diabetic retinopathy based on retinal images. A previous study showed that a state-of-the-art model achieved an AUC of 0.92 on a different but comparable dataset. You have a new dataset and want to ensure that your model’s AUC is not significantly lower than this benchmark.

In this case, you could use a one-sample proportion test to test whether your model’s AUC on the new dataset is significantly different from 0.92. This helps you:

  1. Set a sample size that is large enough to detect a meaningful deviation (e.g., an AUC difference of 0.05); a sketch of this calculation follows the list.
  2. Test whether your model’s performance is in line with previous studies, even if the underlying data, features, or ML techniques differ.
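For this use case, the same pwr.p.test() call used later in this chapter can be pointed at the 0.92 benchmark. Here is a sketch, assuming the smallest deviation worth detecting is an AUC of 0.87 (0.05 below the benchmark), with 80% power at a two-sided 5% significance level:

library(pwr)

# Sample size to detect an AUC of 0.87 when the benchmark is 0.92
benchmark_auc  <- 0.92
detectable_auc <- 0.87

pwr.p.test(h = ES.h(benchmark_auc, detectable_auc),  # Cohen's h for the two proportions
           sig.level = 0.05,
           power = 0.80,
           alternative = "two.sided")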

6.1.6 Conclusion

The one-sample proportion test in ML research is best used for evaluating a specific performance metric (e.g., AUC, accuracy) against a known standard or threshold value. It is valuable for determining sample size requirements and for ensuring that the observed performance is consistent with expected performance levels, despite potential differences in methodologies, datasets, and algorithms. However, it is not suitable for direct model-to-model comparisons or for complex evaluations involving multiple metrics or distributions.

6.2 One-Sample Proportion Test (Example)

6.2.1 Scenario:

Suppose you are evaluating the performance of a new machine learning classifier that predicts whether patients have a particular disease. Previous research using similar classifiers has shown an AUC (Area Under the Curve) of 0.90. You want to see if your new classifier performs significantly better or worse than this standard. To do so, you will use a one-sample proportion test to compare your classifier’s AUC against the hypothesized value of 0.90.

6.2.2 Setting Up the One-Sample Proportion Test

In this context:

  • Null Hypothesis (\(H_0\)): The AUC of the new classifier is equal to 0.90.
    \[ H_0: p = 0.90 \]

  • Alternative Hypothesis (\(H_a\)): The AUC of the new classifier is not equal to 0.90 (it could be higher or lower).
    \[ H_a: p \neq 0.90 \]

  • Significance Level (\(\alpha\)): 0.05 (a 5% chance of a Type I error, i.e., rejecting \(H_0\) when it is true).

  • Power (\(1 - \beta\)): 0.80 (an 80% chance of correctly rejecting \(H_0\) when it is false).

  • Effect Size (\(\Delta\)): You want to detect a difference of at least 0.05, i.e., an AUC of 0.95 or higher, or 0.85 or lower.

  • Observed Proportion (\(p\)): This is the proportion you will calculate based on the model’s performance on a test set (e.g., by using a ROC curve to compute the AUC).

6.2.3 Conducting the One-Sample Proportion Test

After determining the sample size, you can conduct the experiment and calculate the observed AUC of your classifier on the test set.

For example:

  • Test Set Size: 150 cases
  • Observed AUC: 0.92

You can then use the one-sample proportion test to check if this observed AUC of 0.92 is significantly different from 0.90.
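For example, treating the observed AUC of 0.92 as if it were a simple proportion estimated from the 150 test cases (as this example does, purely for illustration), the test itself can be run in base R:

# 0.92 * 150 = 138 "successes" out of 150 cases
prop.test(x = 138, n = 150, p = 0.90, alternative = "two.sided")

# Exact binomial version of the same comparison
binom.test(x = 138, n = 150, p = 0.90, alternative = "two.sided")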

6.2.4 Sample-Size Calculation in R (pwr)

Here is how you could carry out the sample-size calculation behind this test in R, using the pwr.p.test() function:

library(pwr)
  • A “good” \(p_0\), i.e., a benchmark far from 0.5 (here \(p_0 = 0.90\), \(p_a = 0.95\)), needs a comparatively small sample:
# Define parameters
p0 <- 0.90          # Null hypothesis proportion
pa <- 0.95          # Alternative hypothesis proportion
effect_size <- abs(pa - p0)  # Raw difference in proportions (for reference; Cohen's h is used below)
power <- 0.80       # Desired power
alpha <- 0.05       # Significance level

# Calculate the required sample size using pwr.p.test()
sample_size_result <- pwr.p.test(h = ES.h(pa, p0),  # Cohen's h (this order yields a positive h)
                                 sig.level = alpha, 
                                 power = power, 
                                 alternative = "two.sided")

# Print the result
print(sample_size_result)

     proportion power calculation for binomial distribution (arcsine transformation) 

              h = 0.1924743
              n = 211.8659
      sig.level = 0.05
          power = 0.8
    alternative = two.sided
  • A “bad” \(p_0\), i.e., a benchmark near 0.5 (here \(p_0 = 0.50\), \(p_a = 0.55\)): the same absolute difference of 0.05 corresponds to a smaller standardized effect near 0.5, so a much larger sample is required:
# Define parameters
p0 <- 0.50          # Null hypothesis proportion
pa <- 0.55          # Alternative hypothesis proportion
effect_size <- abs(pa - p0)  # Raw difference in proportions (for reference; Cohen's h is used below)
power <- 0.80       # Desired power
alpha <- 0.05       # Significance level

# Calculate the required sample size using pwr.p.test()
sample_size_result <- pwr.p.test(h = ES.h(pa, p0),  # Cohen's h (this order yields a positive h)
                                 sig.level = alpha, 
                                 power = power, 
                                 alternative = "two.sided")

# Print the result
print(sample_size_result)

     proportion power calculation for binomial distribution (arcsine transformation) 

              h = 0.1001674
              n = 782.2645
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

Explanation of the parameters in pwr.p.test():

  • h = ES.h(pa, p0): The effect size for proportion tests, calculated with Cohen’s h (an arcsine transformation of the two proportions).
  • sig.level: The significance level (alpha).
  • power: The desired power of the test.
  • alternative: Specifies whether the test is “two.sided”, “greater”, or “less”.
  • n: The required total sample size (a one-sample test involves a single group); round it up to the next whole number, giving roughly 212 cases in the first scenario and 783 in the second. See the one-liner below for extracting it programmatically.
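Since pwr.p.test() returns its inputs and the computed n as a named list, the whole-number sample size can also be pulled out programmatically:

# Round the calculated n up to the next whole case
ceiling(sample_size_result$n)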

6.2.5 Use Case Summary

Parameter                            Value
Hypothesized AUC (\(p_0\))           0.90
Observed AUC                         0.92
Significance Level (\(\alpha\))      0.05
Power (\(1 - \beta\))                0.80
Test Set Size                        150
Null Hypothesis (\(H_0\))            AUC = 0.90
Alternative Hypothesis (\(H_a\))     AUC ≠ 0.90

If the one-sample proportion test gives a statistically significant result, you can conclude that the AUC of the new classifier differs from 0.90, and the direction of the difference tells you whether the new model performs better or worse than the previous standard.