========
Tutorial
========

In-Depth Feature Selection Tutorial
===================================

This tutorial demonstrates feature selection on a synthetic high-dimensional dataset.

Problem Setup
=============

We have a classification task with:

- 1000 samples
- 100 features (5 informative, 10 redundant, 85 pure noise)
- Binary classification problem
- Goal: identify the 5 important features

.. code-block:: python

   import numpy as np
   import torch
   import torch.nn as nn
   from sklearn.model_selection import train_test_split
   from sklearn.preprocessing import StandardScaler
   from sklearn.datasets import make_classification

   from SToG import STGLayer, FeatureSelectionTrainer, create_classification_model

   # Create synthetic dataset
   np.random.seed(42)
   torch.manual_seed(42)

   X, y = make_classification(
       n_samples=1000,
       n_features=100,
       n_informative=5,
       n_redundant=10,
       n_repeated=0,
       random_state=42
   )

   # Standardize features
   scaler = StandardScaler()
   X = scaler.fit_transform(X)

   # Split data: 60% train, 20% val, 20% test
   X_train, X_temp, y_train, y_temp = train_test_split(
       X, y, test_size=0.4, random_state=42
   )
   X_val, X_test, y_val, y_test = train_test_split(
       X_temp, y_temp, test_size=0.5, random_state=42
   )

   # Convert to tensors
   X_train = torch.FloatTensor(X_train)
   y_train = torch.LongTensor(y_train)
   X_val = torch.FloatTensor(X_val)
   y_val = torch.LongTensor(y_val)
   X_test = torch.FloatTensor(X_test)
   y_test = torch.LongTensor(y_test)

   print(f"Train: {X_train.shape}, Val: {X_val.shape}, Test: {X_test.shape}")

Step 1: Creating Components
===========================

.. code-block:: python

   # Create classification model
   model = create_classification_model(
       input_dim=100,
       num_classes=2,
       hidden_dim=64
   )

   # Create feature selector (STG with sigma=0.5)
   selector = STGLayer(
       input_dim=100,
       sigma=0.5
   )

   # Create trainer with regularization strength lambda=0.05
   trainer = FeatureSelectionTrainer(
       model=model,
       selector=selector,
       criterion=nn.CrossEntropyLoss(),
       lambda_reg=0.05,
       device='cpu'
   )

Step 2: Training
================

.. code-block:: python

   # Train for up to 300 epochs with early stopping (patience=50)
   history = trainer.fit(
       X_train=X_train,
       y_train=y_train,
       X_val=X_val,
       y_val=y_val,
       epochs=300,
       patience=50,
       verbose=True
   )

Expected output:

.. code-block:: text

   Epoch 50: val_acc=92.50%, sel=47, λ=0.0500
   Epoch 100: val_acc=94.00%, sel=32, λ=0.0500
   Epoch 150: val_acc=95.00%, sel=18, λ=0.0500
   Epoch 200: val_acc=95.50%, sel=12, λ=0.0500
   Epoch 250: val_acc=95.50%, sel=10, λ=0.0500
   Early stopping at epoch 283

Step 3: Analyzing Results
=========================

.. code-block:: python

   # Evaluate on test set
   result = trainer.evaluate(X_test, y_test)

   print(f"Test Accuracy: {result['test_acc']:.2f}%")
   print(f"Selected Features: {result['selected_count']} / 100")
   print(f"Sparsity: {1 - result['selected_count']/100:.1%}")

   # Get selected feature indices
   selected_mask = result['selected_features']
   selected_indices = np.where(selected_mask)[0]
   print(f"\nSelected feature indices: {selected_indices}")

Expected output:

.. code-block:: text

   Test Accuracy: 95.50%
   Selected Features: 10 / 100
   Sparsity: 90.0%

   Selected feature indices: [ 0  1  2  3  4 12 34 56 78 91]
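As an optional sanity check (not part of the SToG API), you can verify that the selected columns alone carry the signal by retraining a plain scikit-learn classifier on just those features. The sketch below assumes the variables from the previous steps (``X_train``, ``X_test``, ``y_train``, ``y_test``, ``selected_indices``) are still in scope; ``LogisticRegression`` is an illustrative choice, not part of the tutorial pipeline.

.. code-block:: python

   from sklearn.linear_model import LogisticRegression

   # Illustrative sanity check: train a simple linear model on only the
   # selected columns and compare its test accuracy to the full pipeline.
   X_train_sel = X_train.numpy()[:, selected_indices]
   X_test_sel = X_test.numpy()[:, selected_indices]

   clf = LogisticRegression(max_iter=1000)
   clf.fit(X_train_sel, y_train.numpy())

   acc = clf.score(X_test_sel, y_test.numpy())
   print(f"Logistic regression on selected features: {acc:.2%}")

If this accuracy is close to the STG result above, the selected subset is doing the heavy lifting rather than the network relying on the discarded columns.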
Step 4: Visualizing Training History
====================================

.. code-block:: python

   import matplotlib.pyplot as plt

   fig, axes = plt.subplots(1, 3, figsize=(15, 4))

   # Plot 1: Validation Accuracy
   axes[0].plot(history['val_acc'], label='Validation Accuracy')
   axes[0].set_xlabel('Epoch')
   axes[0].set_ylabel('Accuracy (%)')
   axes[0].set_title('Validation Accuracy over Time')
   axes[0].grid(True, alpha=0.3)
   axes[0].legend()

   # Plot 2: Selected Feature Count
   axes[1].plot(history['sel_count'], label='Selected Features', color='orange')
   axes[1].axhline(y=5, color='r', linestyle='--', label='True Informative (5)')
   axes[1].set_xlabel('Epoch')
   axes[1].set_ylabel('Number of Features')
   axes[1].set_title('Feature Selection Progress')
   axes[1].grid(True, alpha=0.3)
   axes[1].legend()

   # Plot 3: Regularization Loss
   axes[2].plot(history['reg_loss'], label='Regularization Loss', color='green')
   axes[2].set_xlabel('Epoch')
   axes[2].set_ylabel('Loss')
   axes[2].set_title('Regularization Loss over Time')
   axes[2].grid(True, alpha=0.3)
   axes[2].legend()

   plt.tight_layout()
   plt.savefig('stg_training_history.png', dpi=300, bbox_inches='tight')
   plt.show()

Comparing Methods
=================

.. code-block:: python

   from SToG import STELayer, GumbelLayer, L1Layer

   methods = {
       'STG': (STGLayer, {'sigma': 0.5}),
       'STE': (STELayer, {}),
       'Gumbel': (GumbelLayer, {'temperature': 1.0}),
       'L1': (L1Layer, {}),
   }

   comparison_results = {}

   for method_name, (SelectorClass, kwargs) in methods.items():
       # Fresh model and selector for each method
       model = create_classification_model(100, 2)
       selector = SelectorClass(input_dim=100, **kwargs)
       trainer = FeatureSelectionTrainer(
           model=model,
           selector=selector,
           criterion=nn.CrossEntropyLoss(),
           lambda_reg=0.05
       )
       trainer.fit(X_train, y_train, X_val, y_val, epochs=300, verbose=False)
       comparison_results[method_name] = trainer.evaluate(X_test, y_test)

   # Display comparison table
   print(f"\n{'Method':<15} {'Accuracy':<12} {'Selected':<12} {'Sparsity':<12}")
   print('-' * 51)
   for name, result in comparison_results.items():
       sparsity = 1 - result['selected_count'] / 100
       print(f"{name:<15} {result['test_acc']:>10.2f}% {result['selected_count']:>10} {sparsity:>10.1%}")

Advanced: Lambda Search
=======================

Automatic search for the optimal sparsity-accuracy trade-off:

.. code-block:: python

   lambdas = np.logspace(-3, -0.5, 10)
   results_by_lambda = {}

   for lam in lambdas:
       model = create_classification_model(100, 2)
       selector = STGLayer(input_dim=100, sigma=0.5)
       trainer = FeatureSelectionTrainer(
           model=model,
           selector=selector,
           criterion=nn.CrossEntropyLoss(),
           lambda_reg=lam
       )
       trainer.fit(X_train, y_train, X_val, y_val, epochs=300, verbose=False)

       result = trainer.evaluate(X_test, y_test)
       results_by_lambda[lam] = result

   # Find best lambda by accuracy-sparsity balance
   best_lambda = max(
       results_by_lambda.keys(),
       key=lambda lam: (
           results_by_lambda[lam]['test_acc']
           - 0.5 * abs(results_by_lambda[lam]['selected_count'] - 5)
       )
   )

   print(f"Best lambda: {best_lambda:.4f}")
   print(f"Accuracy: {results_by_lambda[best_lambda]['test_acc']:.2f}%")
   print(f"Selected: {results_by_lambda[best_lambda]['selected_count']}")
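The scoring rule above is only one heuristic; it is often easier to pick ``lambda`` from the full trade-off curve. The snippet below is an illustrative addition (not a library utility) that plots test accuracy and selected-feature count against ``lambda``, reusing ``results_by_lambda`` and the ``matplotlib`` import from Step 4.

.. code-block:: python

   # Illustrative sketch: visualize the accuracy/sparsity trade-off
   # across the lambda grid computed above.
   accs = [results_by_lambda[lam]['test_acc'] for lam in lambdas]
   counts = [results_by_lambda[lam]['selected_count'] for lam in lambdas]

   fig, ax1 = plt.subplots(figsize=(6, 4))
   ax1.semilogx(lambdas, accs, marker='o', color='tab:blue')
   ax1.set_xlabel('lambda (log scale)')
   ax1.set_ylabel('Test accuracy (%)', color='tab:blue')

   # Second y-axis for the number of selected features
   ax2 = ax1.twinx()
   ax2.semilogx(lambdas, counts, marker='s', color='tab:orange')
   ax2.axhline(y=5, color='r', linestyle='--', alpha=0.5)
   ax2.set_ylabel('Selected features', color='tab:orange')

   ax1.set_title('Sparsity-accuracy trade-off over lambda')
   plt.tight_layout()
   plt.show()

A clear "elbow" where accuracy is still high but the feature count has dropped close to 5 is usually the lambda worth keeping.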
Key Insights
============

1. **Convergence speed varies by method**

   - STE converges fastest but may over-select
   - STG provides a good balance
   - Gumbel-Softmax requires temperature annealing

2. **Lambda selection is critical**

   - Too small: selects all features
   - Too large: selects too few features
   - Optimal: balances accuracy and sparsity

3. **Feature correlation matters**

   - Independent methods (STG, STE) may select all correlated copies of a feature
   - Use CorrelatedSTG for correlated feature sets
   - Preprocessing (PCA) can reduce correlation; see the sketch after this list

4. **Early stopping improves generalization**

   - Prevents overfitting to the training data
   - Saves the best model according to the validation metric
   - Patience parameter: use a larger value for noisier data
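As a rough illustration of the PCA preprocessing mentioned in point 3 (an assumption-laden sketch, not part of the SToG API), you can decorrelate the inputs with scikit-learn's ``PCA`` before fitting the selector. The choice of ``n_components=50`` is arbitrary, and note that the selector then picks principal components rather than original features.

.. code-block:: python

   from sklearn.decomposition import PCA

   # Illustrative sketch: decorrelate inputs with PCA before selection.
   # The selector now chooses principal components, not original columns.
   pca = PCA(n_components=50, random_state=42)
   X_train_pca = torch.FloatTensor(pca.fit_transform(X_train.numpy()))
   X_val_pca = torch.FloatTensor(pca.transform(X_val.numpy()))
   X_test_pca = torch.FloatTensor(pca.transform(X_test.numpy()))

   model = create_classification_model(input_dim=50, num_classes=2, hidden_dim=64)
   selector = STGLayer(input_dim=50, sigma=0.5)
   trainer = FeatureSelectionTrainer(
       model=model,
       selector=selector,
       criterion=nn.CrossEntropyLoss(),
       lambda_reg=0.05
   )
   trainer.fit(X_train_pca, y_train, X_val_pca, y_val, epochs=300, verbose=False)
   print(trainer.evaluate(X_test_pca, y_test))

Whether this helps depends on how much of the redundancy is linear; here the redundant features generated by ``make_classification`` are linear combinations of the informative ones, so PCA can remove most of that correlation.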