Benchmarking Framework

Benchmarking utilities for feature selection methods.

class SToG.benchmark.ComprehensiveBenchmark(device='cpu')[source]

Bases: object

Comprehensive benchmark for all feature selection methods.

__init__(device='cpu')[source]

Initialize benchmark.

Parameters:

device – Device to run on (‘cpu’ or ‘cuda’)

run_single_experiment(dataset_info, method_name, lambda_reg, random_state=42)[source]

Run a single experiment.

Parameters:
  • dataset_info – Dictionary with dataset information

  • method_name – Name of the method to test

  • lambda_reg – Regularization strength

  • random_state – Random seed

Returns:

Dictionary with results
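
A minimal usage sketch; the method-name string 'STG' and the exact keys of the returned dictionary are assumptions made for illustration:

from SToG import ComprehensiveBenchmark, DatasetLoader

# Build a dataset info dictionary with the package's own loader
dataset_info = DatasetLoader().load_breast_cancer()

benchmark = ComprehensiveBenchmark(device='cpu')

# One experiment: one method, one regularization strength, one seed
result = benchmark.run_single_experiment(
    dataset_info,
    method_name='STG',   # assumed method-name string
    lambda_reg=0.05,
    random_state=42,
)
print(result)  # dictionary with results for this single run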

evaluate_method(dataset_info, method_name, lambda_values=None, n_runs=5)[source]

Evaluate a method with multiple lambda values and runs.

Parameters:
  • dataset_info – Dictionary with dataset information

  • method_name – Name of the method to test

  • lambda_values – List of lambda values to try

  • n_runs – Number of runs per lambda value

Returns:

Dictionary with best results
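
A sketch of a manual lambda grid search with evaluate_method; the grid values are illustrative and the method-name string 'STG' is again an assumption:

from SToG import ComprehensiveBenchmark, DatasetLoader

benchmark = ComprehensiveBenchmark(device='cpu')
dataset_info = DatasetLoader().load_breast_cancer()

# Try several sparsity strengths, with 5 seeded runs per value
best = benchmark.evaluate_method(
    dataset_info,
    method_name='STG',
    lambda_values=[0.001, 0.01, 0.05, 0.1, 0.2],
    n_runs=5,
)
print(best)  # dictionary with the best results over the grid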

run_benchmark(datasets=None)[source]

Run complete benchmark.

Parameters:

datasets – List of dataset info dictionaries (uses default if None)

print_summary()[source]

Print summary table of benchmark results.
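
Taken together, a minimal end-to-end run using only the calls documented above:

from SToG import ComprehensiveBenchmark

benchmark = ComprehensiveBenchmark(device='cpu')
benchmark.run_benchmark()   # None -> default datasets
benchmark.print_summary()   # tabular summary of the recorded results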

SToG.benchmark.compare_with_l1_sklearn(datasets)[source]

Compare with sklearn L1 logistic regression baseline.

Parameters:

datasets – List of dataset info dictionaries

Returns:

Dictionary with sklearn results

Overview

The SToG.benchmark.ComprehensiveBenchmark class provides a framework for systematically comparing feature selection methods across multiple datasets and hyperparameter settings.

Features

  • Multi-method comparison - STG, STE, Gumbel, CorrelatedSTG, L1

  • Multiple datasets - Real and synthetic benchmark datasets

  • Lambda search - Automatic grid search for optimal sparsity parameter

  • Results aggregation - Summary statistics and comparison tables

  • Result persistence - Option to save results for later analysis

Benchmarking Pipeline

The benchmark runs the following pipeline for each method/dataset combination:

For each dataset:
    For each feature selection method:
        For each lambda in [0.001, 0.01, 0.05, 0.1, 0.2, ...]:
            1. Create a fresh model and selector
            2. Initialize the trainer with the current λ
            3. Train for up to 300 epochs with early stopping
            4. Evaluate on the test set
            5. Record accuracy, selected count, and sparsity
        6. Select the best λ by the balanced score:
               score = accuracy - 0.5 * |selected - target|
        7. Report the best result
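
A plain-Python sketch of the balanced-score selection in step 6; the dictionary keys and the use of accuracy in percent are assumptions made for illustration:

# Illustrative per-lambda results for one method on one dataset
results = [
    {'lambda': 0.01, 'accuracy': 95.0, 'selected': 18},
    {'lambda': 0.05, 'accuracy': 96.0, 'selected': 9},
    {'lambda': 0.20, 'accuracy': 92.0, 'selected': 3},
]
target = 10  # desired number of selected features

def balanced_score(r):
    # accuracy minus a penalty for missing the target feature count
    return r['accuracy'] - 0.5 * abs(r['selected'] - target)

best = max(results, key=balanced_score)
print(best['lambda'])  # 0.05 for these illustrative numbers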

Running Benchmarks

Basic Usage:

from SToG import ComprehensiveBenchmark

benchmark = ComprehensiveBenchmark(device='cpu')
benchmark.run_benchmark()  # Uses default datasets

Custom Datasets:

from SToG import DatasetLoader, ComprehensiveBenchmark

loader = DatasetLoader()
datasets = [
    loader.load_breast_cancer(),
    loader.create_synthetic_high_dim(),
]

benchmark = ComprehensiveBenchmark()
benchmark.run_benchmark(datasets)

GPU Acceleration:

benchmark = ComprehensiveBenchmark(device='cuda')
benchmark.run_benchmark()

Output Format

The benchmark prints results in a tabular format, for example:

==================== Breast Cancer ====================

Method        | Accuracy | Selected | Sparsity | Lambda
--------------|----------|----------|----------|-------
STG           | 95.67%   | 8 / 30   | 73.3%    | 0.050
STE           | 95.08%   | 10 / 30  | 66.7%    | 0.050
Gumbel        | 96.04%   | 9 / 30   | 70.0%    | 0.050
CorrelatedSTG | 96.04%   | 9 / 30   | 70.0%    | 0.050
L1            | 94.29%   | 12 / 30  | 60.0%    | 0.050

Lambda Interpretation

  • λ too small: too many features are selected

  • λ optimal: the target sparsity is reached with high accuracy

  • λ too large: too few features are selected and accuracy drops
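
To see these regimes directly, sweep a coarse λ grid with run_single_experiment and inspect the selected counts; the method-name string 'STG' is an assumption:

from SToG import ComprehensiveBenchmark, DatasetLoader

benchmark = ComprehensiveBenchmark(device='cpu')
dataset_info = DatasetLoader().load_breast_cancer()

# Small lambdas keep almost every feature; large lambdas prune aggressively
for lam in [0.001, 0.01, 0.05, 0.1, 0.5]:
    result = benchmark.run_single_experiment(dataset_info, 'STG', lambda_reg=lam)
    print(lam, result)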

Comparison with Scikit-learn L1

SToG.benchmark.compare_with_l1_sklearn() (documented above) compares the SToG methods against scikit-learn's L1-regularized logistic regression baseline:

from SToG import compare_with_l1_sklearn, DatasetLoader

loader = DatasetLoader()
datasets = [loader.load_breast_cancer()]

results = compare_with_l1_sklearn(datasets)  # dictionary with sklearn results

Example: Running Full Benchmark

import torch
from SToG import ComprehensiveBenchmark, DatasetLoader

# Load datasets
loader = DatasetLoader()
datasets = [
    loader.load_breast_cancer(),
    loader.load_wine(),
    loader.create_synthetic_high_dim(),
    loader.create_synthetic_correlated(),
]

# Run benchmark
benchmark = ComprehensiveBenchmark(device='cuda' if torch.cuda.is_available() else 'cpu')
benchmark.run_benchmark(datasets)

# Also compare with sklearn
from SToG import compare_with_l1_sklearn
compare_with_l1_sklearn(datasets)

Interpreting Results

Key metrics to analyze:

Accuracy:

How well the model generalizes to the held-out test set; higher is better.

Selected Count:

Number of features chosen by the selector.

  • Too low: may lose important information

  • Too high: defeats the purpose of feature selection

  • Optimal: depends on the problem, typically 10-30% of the original features

Sparsity:

Percentage of features discarded (1 - selected/total). Higher sparsity means more aggressive selection.
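
For reference, sparsity as defined here is just the discarded fraction:

# 9 of 30 features kept, as in the Gumbel row of the example table above
n_selected, n_total = 9, 30
sparsity = 1 - n_selected / n_total
print(f"{sparsity:.1%}")  # 70.0%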

Method Ranking:
  • STG/CorrelatedSTG: best overall balance of accuracy and sparsity

  • STE: fastest convergence

  • Gumbel: good for probabilistic interpretation

  • L1: simple baseline