datasets
Dataset Loading Utilities
- class SToG.datasets.DatasetLoader
Bases: object
Load and prepare datasets for benchmarking.
- static load_breast_cancer()
Load breast cancer dataset.
- Returns:
Dictionary with dataset information
- static create_synthetic_high_dim()
Create synthetic high-dimensional dataset (MADELON-like).
- Returns:
Dictionary with dataset information
- static create_synthetic_correlated()
Create synthetic dataset with correlated features.
- Returns:
Dictionary with dataset information
DatasetLoader
Overview
The SToG.datasets.DatasetLoader class provides static methods that load benchmark datasets
suitable for feature selection experiments.
Available Datasets
All loaders return a dictionary with:
{
    'name': str,          # Dataset name
    'X': ndarray,         # Feature matrix [n_samples, n_features]
    'y': ndarray,         # Target labels [n_samples]
    'n_important': int,   # Number of truly informative features
    'description': str    # Dataset description
}
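The dictionary layout above can be checked before a dataset is passed to a benchmark. The following is a minimal sketch, not part of SToG: the helper name `check_dataset_dict` and the toy dictionary are hypothetical, and the checks cover only the keys and shapes listed above.

```python
import numpy as np

# Keys every loader is documented to return.
REQUIRED_KEYS = {"name", "X", "y", "n_important", "description"}

def check_dataset_dict(data):
    """Hypothetical helper: validate the documented dictionary layout."""
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise KeyError(f"missing keys: {sorted(missing)}")
    X = np.asarray(data["X"])
    y = np.asarray(data["y"])
    if X.ndim != 2:
        raise ValueError("X must have shape [n_samples, n_features]")
    if y.shape != (X.shape[0],):
        raise ValueError("y must have shape [n_samples]")
    if not 0 <= data["n_important"] <= X.shape[1]:
        raise ValueError("n_important must lie between 0 and n_features")
    return X.shape

# Toy dictionary standing in for a loader's return value (illustrative only).
toy = {
    "name": "toy",
    "X": np.zeros((4, 3)),
    "y": np.array([0, 1, 0, 1]),
    "n_important": 2,
    "description": "Toy dataset for the layout check",
}
print(check_dataset_dict(toy))
```

The same call should work on the output of any loader, for example `check_dataset_dict(DatasetLoader.load_wine())`, assuming the loader follows the layout described here.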
Breast Cancer Dataset
- static DatasetLoader.load_breast_cancer()
Load breast cancer dataset.
- Returns:
Dictionary with dataset information
- Properties:
569 samples
30 features
Binary classification (malignant vs. benign)
~10 informative features
Real-world dataset (UCI repository)
Use case: General feature selection testing
from SToG import DatasetLoader
data = DatasetLoader.load_breast_cancer()
print(f"Dataset: {data['name']}")
print(f"Shape: {data['X'].shape}")
print(f"Informative features: {data['n_important']}")
Wine Dataset
- static DatasetLoader.load_wine()
Load wine dataset.
- Returns:
Dictionary with dataset information
- Properties:
178 samples
13 features
3-class classification
~7 informative features
Real-world dataset (UCI repository)
Use case: Multi-class feature selection
import numpy as np
from SToG import DatasetLoader
data = DatasetLoader.load_wine()
print(f"Classes: {len(np.unique(data['y']))}")
Synthetic High-Dimensional Dataset
- static DatasetLoader.create_synthetic_high_dim()
Create synthetic high-dimensional dataset (MADELON-like).
- Returns:
Dictionary with dataset information
- Properties:
600 samples
100 features
Binary classification
5 informative features
10 redundant features
85 noise features
Class imbalance: ~3%
Use case: High-dimensional feature selection testing
Generated via: scikit-learn’s make_classification
data = DatasetLoader.create_synthetic_high_dim()
print(f"Sparsity level: {data['n_important']}/{data['X'].shape[1]}")
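A comparable dataset can be generated directly with scikit-learn's make_classification. This is a sketch under stated assumptions rather than the SToG implementation: only the sample and feature counts come from the properties listed above, while `shuffle=False` and `random_state=0` are illustrative choices and the loader's actual arguments may differ.

```python
from sklearn.datasets import make_classification

# Counts taken from the documented properties; the remaining
# 100 - 5 - 10 = 85 columns are filled with noise features.
X, y = make_classification(
    n_samples=600,
    n_features=100,
    n_informative=5,
    n_redundant=10,
    n_repeated=0,
    n_classes=2,
    shuffle=False,    # assumption: keeps informative columns first
    random_state=0,   # assumption: arbitrary seed for reproducibility
)

n_noise = X.shape[1] - 5 - 10
print(X.shape, n_noise)
```

With `shuffle=False`, the informative features occupy the first 5 columns and the redundant ones the next 10, which makes it easy to check whether a feature selection method recovers them.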
Full Benchmarking Example
from SToG import DatasetLoader, ComprehensiveBenchmark
loader = DatasetLoader()
datasets = [
    loader.load_breast_cancer(),
    loader.load_wine(),
    loader.create_synthetic_high_dim(),
    loader.create_synthetic_correlated(),
]
benchmark = ComprehensiveBenchmark()
benchmark.run_benchmark(datasets)
Data Preprocessing
Recommended preprocessing pipeline:
import torch
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from SToG import DatasetLoader

data = DatasetLoader.load_breast_cancer()
X = data['X']
y = data['y']

# 1. Standardize features (zero mean, unit variance per column)
scaler = StandardScaler()
X = scaler.fit_transform(X)

# 2. Split data: hold out 20% for testing, then 20% of the rest for validation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42
)

# 3. Convert to tensors
X_train = torch.FloatTensor(X_train)
y_train = torch.LongTensor(y_train)
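The pipeline above fits the scaler on the full feature matrix before splitting, so statistics from the validation and test rows influence the training features. A common variant splits first and fits the scaler on the training portion only. The sketch below illustrates that variant; the function name `split_then_scale` is hypothetical, and random data with the Breast Cancer dataset's shape stands in for a real loader call.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def split_then_scale(X, y, seed=42):
    """Hypothetical helper: split first, then scale with train-only statistics."""
    X_tmp, X_test, y_tmp, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    X_train, X_val, y_train, y_val = train_test_split(
        X_tmp, y_tmp, test_size=0.2, random_state=seed
    )
    # Fit on the training rows only; reuse the same transform elsewhere.
    scaler = StandardScaler().fit(X_train)
    return (
        (scaler.transform(X_train), y_train),
        (scaler.transform(X_val), y_val),
        (scaler.transform(X_test), y_test),
    )

# Random stand-in data with the same shape as the Breast Cancer dataset.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(569, 30))
y_demo = rng.integers(0, 2, size=569)

(train_X, train_y), (val_X, val_y), (test_X, test_y) = split_then_scale(X_demo, y_demo)
print(len(train_X), len(val_X), len(test_X))
```

Applying `test_size=0.2` twice yields roughly a 64% / 16% / 20% train / validation / test split. The returned arrays can then be converted to tensors as in step 3.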