--- title: "Automated Machine Learning with tidylearn" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Automated Machine Learning with tidylearn} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 5 ) ``` ```{r setup} library(tidylearn) library(dplyr) library(ggplot2) ``` ## Introduction Automated Machine Learning (AutoML) streamlines the model development process by automatically trying multiple approaches and selecting the best one. tidylearn's `tl_auto_ml()` function explores various modeling strategies including dimensionality reduction, clustering, and different supervised methods. **Note:** AutoML orchestrates the wrapped packages (glmnet, randomForest, xgboost, etc.) rather than implementing new algorithms. Each model in the leaderboard wraps an established package, and you can access the raw model objects via `model$fit`. ## Basic Usage ### Classification Task ```{r, eval=FALSE} # Run AutoML on iris dataset result <- tl_auto_ml(iris, Species ~ ., task = "classification", time_budget = 60) # View best model print(result$best_model) ``` ```{r, eval=FALSE} # View all models tried names(result$models) ``` ```{r, eval=FALSE} # View leaderboard result$leaderboard ``` ### Regression Task ```{r, eval=FALSE} # Run AutoML on regression problem result_reg <- tl_auto_ml(mtcars, mpg ~ ., task = "regression", time_budget = 60) # Best model print(result_reg$best_model) ``` ## How AutoML Works The `tl_auto_ml()` function follows a four-phase pipeline. Which phases actually run -- and how thoroughly -- depends on the `time_budget` and the toggle parameters `use_reduction` and `use_clustering`. | Phase | What it does | Models added (classification) | Models added (regression) | |-------|-------------|-------------------------------|---------------------------| | 1. 
Baselines | Trains standard models | tree, logistic, forest | tree, linear, forest | | 2. PCA variants | PCA preprocessing + baseline methods | pca\_tree, pca\_logistic, pca\_forest | pca\_tree, pca\_linear, pca\_forest | | 3. Cluster variants | Adds cluster assignments as features | clustered\_tree, clustered\_logistic, clustered\_forest | clustered\_tree, clustered\_linear, clustered\_forest | | 4. Advanced | Tries heavier methods | svm, xgboost | ridge, lasso | Each model is first fit on the full training data, then (if budget allows) evaluated with k-fold cross-validation. When time is tight, models fall back to training-set metrics instead of CV. ## Understanding the Time Budget The `time_budget` parameter (in seconds) is the most important knob for controlling the speed/thoroughness trade-off. It is checked **between** model fits, not during them. Once a model starts training it runs to completion, because many of the wrapped packages (randomForest, xgboost, e1071) execute C-level code that R cannot safely interrupt mid-execution. This means the actual wall-clock time may modestly exceed the budget by the duration of the last model that started before the budget expired. ### Budget tiers at a glance | Budget | Baselines | CV | PCA/Cluster variants | Advanced models | Typical models | Use case | |--------|-----------|-----|----------------------|-----------------|----------------|----------| | < 30 s | tree + logistic/linear only | No (training metrics) | No | No | 2 | Quick sanity check, interactive use | | 30--120 s | tree + logistic/linear + forest | When time remains | If enabled and > 10 % budget left | If > 40 % budget left | 3--7 | Development iteration, notebook exploration | | 120 s+ | All | Yes | Yes (if enabled) | Yes | 9--11 | Thorough comparison, final model selection | The "forest" baseline and all advanced models (SVM, XGBoost, ridge, lasso) involve C-level code that typically takes 3--15 seconds per fit depending on data size. 
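Conceptually, the budget check between fits works like the loop below. This is an illustrative base-R sketch of the control flow only, not tidylearn's actual implementation; `candidates` and `fit_candidate()` are hypothetical stand-ins for the internal model queue and fitting helper:

```{r, eval=FALSE}
start <- Sys.time()
models <- list()
for (spec in candidates) {
  elapsed <- as.numeric(difftime(Sys.time(), start, units = "secs"))
  if (elapsed >= time_budget) break  # checked between fits only
  # Once a fit starts, it runs to completion -- C-level code in the
  # wrapped packages cannot be safely interrupted.
  models[[spec$name]] <- fit_candidate(spec)
}
```

Because the check happens only at the top of the loop, the slow fits -- the forest baseline and the advanced models -- determine how far past the budget a run can drift.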
They are only attempted when `time_budget >= 30`. ### Why CV is the expensive step A single `tl_model()` call fits one model. Cross-validation (`tl_cv(folds = 5)`) fits **five** models on subsets, so it costs roughly 5x the time. The function checks the remaining budget after each model fit and skips CV when it would likely exceed the budget, falling back to training-set evaluation instead. Reducing `cv_folds` (e.g. from 5 to 2) is the most effective way to stay closer to the budget while still getting out-of-sample estimates. ### Practical examples ```{r, eval=FALSE} # Quick sanity check -- 2 fast models, no CV, done in ~1s quick <- tl_auto_ml(iris, Species ~ ., time_budget = 10, use_reduction = FALSE, use_clustering = FALSE) quick$leaderboard #> baseline_tree, baseline_logistic # Development iteration -- baselines + forest, some CV medium <- tl_auto_ml(iris, Species ~ ., time_budget = 60, cv_folds = 3) medium$leaderboard #> 5--7 models depending on data size # Thorough search -- all phases, full CV thorough <- tl_auto_ml(iris, Species ~ ., time_budget = 300, cv_folds = 5) thorough$leaderboard #> 9--11 models with cross-validated scores ``` ## Task Type Detection AutoML automatically detects the task type from the response variable: ```{r, eval=FALSE} # Factor/character response -> classification result_class <- tl_auto_ml(iris, Species ~ ., task = "auto") # Numeric response -> regression result_reg <- tl_auto_ml(mtcars, mpg ~ ., task = "auto") ``` ## Controlling the Search ### Feature Engineering Options ```{r, eval=FALSE} # Disable dimensionality reduction no_reduction <- tl_auto_ml(iris, Species ~ ., use_reduction = FALSE, time_budget = 60) # Disable cluster features no_clustering <- tl_auto_ml(iris, Species ~ ., use_clustering = FALSE, time_budget = 60) # Baseline models only baseline_only <- tl_auto_ml(iris, Species ~ ., use_reduction = FALSE, use_clustering = FALSE, time_budget = 30) ``` ### Cross-Validation Settings ```{r, eval=FALSE} # Adjust 
cross-validation folds
result_cv <- tl_auto_ml(iris, Species ~ .,
                        cv_folds = 10,
                        time_budget = 120)

# Fewer folds for faster evaluation
result_fast <- tl_auto_ml(iris, Species ~ .,
                          cv_folds = 3,
                          time_budget = 60)
```

## Understanding Results

### Accessing Models

```{r, eval=FALSE}
result <- tl_auto_ml(iris, Species ~ ., time_budget = 60)

# Best performing model
best_model <- result$best_model

# All models trained
all_models <- result$models

# Specific model
baseline_logistic <- result$models$baseline_logistic
pca_forest <- result$models$pca_forest
```

### Leaderboard

```{r, eval=FALSE}
# View performance comparison
leaderboard <- result$leaderboard

# Sort by score: higher is better for accuracy, so sort descending;
# for RMSE (regression), lower is better -- use arrange(score) instead
leaderboard <- leaderboard %>%
  arrange(desc(score))

print(leaderboard)
```

### Making Predictions

```{r, eval=FALSE}
# Use best model for predictions (new_data is a data frame with the
# same predictor columns as the training data)
predictions <- predict(result$best_model, new_data = new_data)

# Or use a specific model
predictions_pca <- predict(result$models$pca_forest, new_data = new_data)
```

## Practical Examples

### Example 1: Iris Classification

```{r, eval=FALSE}
# Split data for evaluation
split <- tl_split(iris, prop = 0.7, stratify = "Species", seed = 123)

# Run AutoML on training data
automl_iris <- tl_auto_ml(split$train, Species ~ .,
                          time_budget = 90,
                          cv_folds = 5)

# Evaluate on test set
test_preds <- predict(automl_iris$best_model, new_data = split$test)
test_accuracy <- mean(test_preds$.pred == split$test$Species)
cat("AutoML Test Accuracy:", round(test_accuracy * 100, 1), "%\n")
```

```{r, eval=FALSE}
# Compare models
for (model_name in names(automl_iris$models)) {
  model <- automl_iris$models[[model_name]]
  preds <- predict(model, new_data = split$test)
  acc <- mean(preds$.pred == split$test$Species)
  cat(model_name, ":", round(acc * 100, 1), "%\n")
}
```

### Example 2: MPG Prediction

```{r, eval=FALSE}
# Split mtcars data
split_mtcars <- tl_split(mtcars, prop = 0.7, seed = 42)

# Run AutoML
automl_mpg <-
tl_auto_ml(split_mtcars$train, mpg ~ ., task = "regression", time_budget = 90)

# Evaluate
test_preds_mpg <- predict(automl_mpg$best_model, new_data = split_mtcars$test)
rmse <- sqrt(mean((test_preds_mpg$.pred - split_mtcars$test$mpg)^2))
cat("AutoML Test RMSE:", round(rmse, 2), "\n")
```

### Example 3: Custom Preprocessing + AutoML

```{r, eval=FALSE}
# Preprocess data first
processed <- tl_prepare_data(
  split$train, Species ~ .,
  scale_method = "standardize",
  remove_correlated = TRUE
)

# Run AutoML on preprocessed data
automl_processed <- tl_auto_ml(processed$data, Species ~ ., time_budget = 60)

# Note: the test set needs the same preprocessing. Re-running
# tl_prepare_data re-estimates the standardization from the test data,
# which only approximates the training-set transform -- reuse the
# training-set parameters instead whenever your workflow allows it.
test_processed <- tl_prepare_data(
  split$test, Species ~ .,
  scale_method = "standardize"
)

test_preds_proc <- predict(
  automl_processed$best_model,
  new_data = test_processed$data
)
```

## Comparing AutoML with Manual Selection

```{r, eval=FALSE}
# Manual approach: choose one model
manual_model <- tl_model(split$train, Species ~ ., method = "forest")
manual_preds <- predict(manual_model, new_data = split$test)
manual_acc <- mean(manual_preds$.pred == split$test$Species)

# AutoML approach
automl_model <- tl_auto_ml(split$train, Species ~ ., time_budget = 60)
automl_preds <- predict(automl_model$best_model, new_data = split$test)
automl_acc <- mean(automl_preds$.pred == split$test$Species)

cat("Manual Selection:", round(manual_acc * 100, 1), "%\n")
cat("AutoML:", round(automl_acc * 100, 1), "%\n")
```

## Advanced AutoML Strategies

### Strategy 1: Iterative AutoML

```{r, eval=FALSE}
# First pass: quick exploration
quick_automl <- tl_auto_ml(split$train, Species ~ .,
                           time_budget = 30,
                           use_reduction = TRUE,
                           use_clustering = FALSE)

# Analyze what worked -- best model name is in the leaderboard
best_name <- quick_automl$leaderboard$model[1]
best_method <- quick_automl$best_model$spec$method
cat("Best model:", best_name, "(method:", best_method, ")\n")

# Second pass: if a PCA variant won, invest more in reduction
if (grepl("^pca_",
best_name)) { refined_automl <- tl_auto_ml(split$train, Species ~ ., time_budget = 60, use_reduction = TRUE, use_clustering = TRUE) } ``` ### Strategy 2: Ensemble of AutoML Models ```{r, eval=FALSE} # Get top 3 models top_models <- automl_iris$leaderboard %>% arrange(desc(score)) %>% head(3) # Make predictions with each ensemble_preds <- list() for (i in seq_len(nrow(top_models))) { model_name <- top_models$model[i] model <- automl_iris$models[[model_name]] ensemble_preds[[i]] <- predict(model, new_data = split$test)$.pred } # Majority vote for classification final_pred <- apply(do.call(cbind, ensemble_preds), 1, function(x) { names(which.max(table(x))) }) ensemble_acc <- mean(final_pred == split$test$Species) cat("Ensemble Accuracy:", round(ensemble_acc * 100, 1), "%\n") ``` ## Performance Metrics ### Classification Metrics ```{r, eval=FALSE} # AutoML automatically uses accuracy for classification result_class <- tl_auto_ml(iris, Species ~ ., metric = "accuracy", time_budget = 60) ``` ### Regression Metrics ```{r, eval=FALSE} # AutoML automatically uses RMSE for regression result_reg <- tl_auto_ml(mtcars, mpg ~ ., metric = "rmse", time_budget = 60) ``` ## Best Practices 1. **Start fast, then expand**: Use `time_budget = 10` to verify the pipeline runs, then increase to 60--120s for real evaluation, and 300s for final model selection. 2. **Reduce `cv_folds` before reducing `time_budget`**: Going from 5-fold to 2-fold CV cuts evaluation time by ~60% while still providing out-of-sample estimates. A 30s budget with `cv_folds = 2` is often more useful than a 30s budget with the default 5 folds (which will skip CV entirely). 3. **Preprocess when needed**: Handle missing values before AutoML. 4. **Split your data**: Always evaluate on held-out test data. 5. **Examine multiple models**: The "best" model may not always be robust. 6. **Consider ensemble approaches**: Combine top models for better performance. 7. 
**Understand training-set metrics**: When CV is skipped (short budgets), the leaderboard uses training-set metrics which are optimistically biased. These are useful for ranking but not for reporting final performance. ## When to Use AutoML **Good use cases:** - Quick prototyping and baseline establishment - When you're unsure which algorithm to use - Feature engineering exploration - Benchmark for manual approaches - Limited ML expertise **Consider manual selection when:** - You have domain knowledge about the best approach - Interpretability is critical - You need fine-grained control over hyperparameters - Computational resources are very limited ## Troubleshooting ### AutoML takes longer than `time_budget` The budget is checked **between** model fits. A single random forest or XGBoost fit can take 5--30 seconds depending on data size, and R cannot safely interrupt C-level code mid-execution. To stay closer to the budget: ```{r, eval=FALSE} # 1. Reduce CV folds (biggest impact) fast_result <- tl_auto_ml(data, formula, cv_folds = 2, time_budget = 30) # 2. Disable slow phases baseline_result <- tl_auto_ml(data, formula, use_reduction = FALSE, use_clustering = FALSE, time_budget = 30) # 3. Use a budget under 30s to skip forest/SVM/XGBoost entirely quick_result <- tl_auto_ml(data, formula, time_budget = 10) ``` ### Leaderboard scores are all NA This happens when the evaluation metric isn't found in the results. The most common cause is using training-set evaluation (short budget) where the metric names differ from CV output. 
Try increasing the budget so that CV runs, or specify the metric explicitly:

```{r, eval=FALSE}
result <- tl_auto_ml(data, formula,
                     metric = "accuracy",  # or "rmse" for regression
                     time_budget = 60)
```

### Not enough models tried

```{r, eval=FALSE}
# Increase time budget to unlock all phases
thorough_result <- tl_auto_ml(data, formula, time_budget = 300)

# Ensure feature engineering is enabled
full_result <- tl_auto_ml(data, formula,
                          use_reduction = TRUE,
                          use_clustering = TRUE,
                          time_budget = 300)
```

## Summary

tidylearn's AutoML provides:

- **Automated model selection** across multiple algorithms
- **Feature engineering** with PCA and clustering
- **Cross-validation** for robust performance estimates
- **Easy comparison** through the leaderboard
- **Flexible configuration** for different scenarios
- **Integration workflows** combining supervised and unsupervised learning

```{r, eval=FALSE}
# Complete AutoML workflow
workflow_split <- tl_split(iris, prop = 0.7, stratify = "Species", seed = 123)

automl_result <- tl_auto_ml(
  data = workflow_split$train,
  formula = Species ~ .,
  task = "auto",
  use_reduction = TRUE,
  use_clustering = TRUE,
  time_budget = 120,
  cv_folds = 5
)

# Evaluate best model
final_preds <- predict(automl_result$best_model, new_data = workflow_split$test)
final_accuracy <- mean(final_preds$.pred == workflow_split$test$Species)

cat("Final AutoML Accuracy:", round(final_accuracy * 100, 1), "%\n")
cat("Best approach:", automl_result$best_model$spec$method, "\n")
```

AutoML makes machine learning accessible and efficient, allowing you to quickly find good solutions while learning which approaches work best for your data.
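Finally, recall from the introduction that each model wraps an established package and exposes the raw object via `model$fit`. Assuming that slot, the winning fit can be handed to the wrapped package's own tooling or persisted with base-R serialization -- a sketch, since the exact class depends on which method won:

```{r, eval=FALSE}
raw_fit <- automl_result$best_model$fit
class(raw_fit)  # e.g. "randomForest" if a forest variant won

# Use the wrapped package's own tools, e.g. variable importance
if (inherits(raw_fit, "randomForest")) {
  print(randomForest::importance(raw_fit))
}

# Persist the whole tidylearn model for later sessions
saveRDS(automl_result$best_model, "automl_best_model.rds")
# later: best <- readRDS("automl_best_model.rds")
```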